Cutting data warehouse costs

Last updated:

|

Use incremental or append only sync

Full table syncs are the most expensive way to sync data into the data warehouse, but aren't always necessary.

Both incremental and append only syncs reduce the number of rows synced as both account for existing data.

Sync less frequently

Not all synced data needs to be up to date to be useful. Think about what sort of data freshness you need and change the sync frequency to match.

PostHog offers sync options from every 5 minutes to every 12 hours to monthly. For incremental and full table syncs, decreasing the sync frequency can reduce the number of rows synced.

Audit and disable unused tables

Sources and tables continue to sync even if you aren't actively using them. This can cause unnecessary costs. You should regularly audit your sources and tables to see if you can disable ones you aren't using.

Beyond just looking at the source synced, you can use PostHog's query_log table. This contains details on all queries run in your project. You can use this to see what tables are (or are not) being queried and how often.

For example, you can use the following query to identify potential unused or underused tables:

SQL
WITH cleaned AS (
SELECT
query,
-- Normalize SQL for easier parsing
replaceRegexpAll(replaceRegexpAll(replaceRegexpAll(lowerUTF8(query), '--[^\\n]*', ' '), '/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/', ' '), '\\s+', ' ') AS q
FROM query_log
),
tail AS (
SELECT
query,
-- Take everything after the *last* " from "
arrayElement(splitByString(' from ', q),
length(splitByString(' from ', q))) AS after_from
FROM cleaned
),
tokens AS (
SELECT
query,
-- Extract the first token after FROM
trim(BOTH ' ' FROM replaceAll(replaceAll(arrayElement(splitByChar(',', arrayElement(splitByChar('(', arrayElement(splitByChar(' ', after_from), 1)), 1)), 1), '`',''), '"','')) AS tok
FROM tail
)
SELECT
tok AS table_after_from,
count() AS hits
FROM tokens
WHERE tok != ''
GROUP BY table_after_from
ORDER BY hits ASC

This query fetches your most queried tables by parsing the FROM clause from every query. It's a bit complex, we know.

Query to identify most queried tables

It effectively lists how often each table is referenced. This means it also includes views, subqueries, and other queries that reference tables, so it's not a perfect one-to-one-map, but it might be able to help you identify unused or underused tables and disable syncs for them.

Questions? Ask Max AI.

It's easier than reading through 794 pages of documentation

Community questions

Was this page useful?

Next article

Tutorials and guides

Got a question which isn't answered below? Head to the community forum to let us know! Use cases How we found our activation metric (and how you can too) CDP vs data warehouse: Which should you use and why How we use the data warehouse at PostHog SQL The basics of SQL for analytics Using SQL for advanced time and date filters Working with external sources How to set up Stripe reports How to set up Hubspot reports How to set up Zendesk reports How to set up Google Ads reports How to query…

Read next article