At PostHog, we store arbitrary payloads users send us for further analysis as JSON. As such, it's critical we do a good job at storing and analyzing this data.
This document covers:
- Storing JSON in Strings and operations on them
- Why and how to compress this data
- Materialized columns
- Alternative solutions: JSON data type, arrays
At PostHog, we store JSON data as String columns. Relevant properties are then parsed out from these String columns at query time using ClickHouse's JSON extraction functions.
This has the following problems:
- These columns end up really large even after compression, meaning slow I/O
- It requires CPU to parse properties
- Data is not stored optimally. As an example, JSON keys are frequently repeated and numbers are stored as strings.
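The problems above can be illustrated with a minimal Python sketch. It is not PostHog or ClickHouse code: the payloads, the property names and the `extract_property` helper are hypothetical, and Python's `json` module merely stands in for query-time JSON parsing.

```python
import json

# Hypothetical event payloads, stored as one JSON string per row.
rows = [
    json.dumps({"$browser": "Chrome", "$screen_width": "1920", "plan": "free"}),
    json.dumps({"$browser": "Firefox", "$screen_width": "1280", "plan": "paid"}),
    json.dumps({"$browser": "Chrome", "$screen_width": "1440", "plan": "free"}),
]


def extract_property(raw: str, key: str):
    """Parse the whole JSON string just to read a single key."""
    return json.loads(raw).get(key)


# Every query pays the parsing cost for every row it touches.
browsers = [extract_property(raw, "$browser") for raw in rows]

# Key names are repeated in every single row.
key_bytes = sum(len(key) for raw in rows for key in json.loads(raw))
total_bytes = sum(len(raw) for raw in rows)

# Numbers arrive as strings and need a cast before any arithmetic.
widths = [int(extract_property(raw, "$screen_width")) for raw in rows]

print(f"{key_bytes} of {total_bytes} bytes are repeated key names")
```

Even in this tiny example, a noticeable share of the stored bytes is key names rather than values, and reading one property means parsing the entire payload.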
Luckily, JSON compresses really well, speeding up reading this data from disk.
By default, our JSON columns are compressed with the LZ4 algorithm. See the benchmarks for more information.
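To get a rough feel for why repetitive JSON compresses well, here is a small sketch. It uses `zlib` from the Python standard library as a stand-in, because LZ4 is not available there, so the ratio it prints says nothing about the ratios ClickHouse achieves. The event payloads are made up for illustration.

```python
import json
import zlib

# Hypothetical events: same keys in every row, only a few values change.
events = [
    json.dumps(
        {
            "event": "pageview",
            "$browser": "Chrome",
            "$current_url": f"https://example.com/page/{i}",
            "$screen_width": "1920",
        }
    )
    for i in range(1000)
]

raw = "\n".join(events).encode("utf-8")

# zlib is a stand-in codec here; ClickHouse applies its own codecs per column.
compressed = zlib.compress(raw)
ratio = len(raw) / len(compressed)

print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}x")
```

The repeated keys and near-identical values are exactly what a general-purpose compressor exploits, which is why less data has to be read from disk.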
ClickHouse supports materialized columns, which are columns whose values are computed from other columns.
We use them to create dedicated columns for frequently queried JSON keys. Each materialized column is stored the same way as a normal column, so reading it requires fewer resources than reading and parsing the full JSON string, which speeds up queries.
Read more in our blog and in this guide for PostHog-specific details.
After adding a materialized column, it is only populated for new data and on merges. When querying old data, this can introduce performance regressions, so forcing the column to be written to disk, even for historical data, is recommended.
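The behaviour described above can be modelled with a toy Python sketch. This is an illustration under stated assumptions, not how ClickHouse is implemented: `ToyTable`, `NOT_STORED`, `backfill` and the `mat_plan` column name are all hypothetical, and the sketch assumes that rows written before the column existed have their value computed at read time.

```python
import json

NOT_STORED = object()  # marks rows written before the column existed


class ToyTable:
    """Toy model: a raw JSON column plus optional materialized columns."""

    def __init__(self):
        self.properties = []  # raw JSON strings
        self.columns = {}  # column name -> (JSON key, stored values)
        self.parse_count = 0  # how often we had to parse JSON

    def _extract(self, raw, key):
        self.parse_count += 1
        return json.loads(raw).get(key)

    def add_materialized_column(self, name, key):
        # Existing rows get no stored value; only future inserts do.
        self.columns[name] = (key, [NOT_STORED] * len(self.properties))

    def insert(self, raw):
        self.properties.append(raw)
        for key, values in self.columns.values():
            values.append(self._extract(raw, key))  # extracted once, on write

    def read(self, name):
        key, values = self.columns[name]
        # Old rows fall back to parsing the JSON at read time.
        return [
            self._extract(raw, key) if value is NOT_STORED else value
            for raw, value in zip(self.properties, values)
        ]

    def backfill(self, name):
        # Force the column to be stored for historical rows as well.
        key, values = self.columns[name]
        for i, raw in enumerate(self.properties):
            if values[i] is NOT_STORED:
                values[i] = self._extract(raw, key)


table = ToyTable()
table.insert(json.dumps({"plan": "free"}))  # historical row
table.add_materialized_column("mat_plan", "plan")
table.insert(json.dumps({"plan": "paid"}))  # new row

table.parse_count = 0
before = table.read("mat_plan")
parses_before_backfill = table.parse_count

table.backfill("mat_plan")
table.parse_count = 0
after = table.read("mat_plan")
parses_after_backfill = table.parse_count
```

In this model, reads return the same values before and after the backfill, but only after the backfill do they avoid parsing JSON for historical rows.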
Materialized columns may cause issues during operations. For example, they can make copying data between tables painful. It is sometimes worth dropping them before large operations.
Uber published an article on their logging, popularizing the idea of storing JSON data as arrays: one for keys and one for values.
However, internal benchmarking showed that in our use case the improvement wasn't big enough to be worth the investment (yet).
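As a rough sketch of the keys-and-values idea, the Python below splits a flat JSON object into two parallel arrays and looks a key up by position. It does not attempt to reproduce Uber's actual schema: `to_arrays` and `lookup` are hypothetical helpers, and handling only flat objects and storing every value as a string are simplifying assumptions.

```python
import json


def to_arrays(raw: str):
    """Split a flat JSON object into parallel key and value arrays."""
    obj = json.loads(raw)
    keys = list(obj.keys())
    values = [str(value) for value in obj.values()]  # assumption: all strings
    return keys, values


def lookup(keys, values, key):
    """Find a value by its key's position; no JSON parsing needed."""
    try:
        return values[keys.index(key)]
    except ValueError:
        return None


keys, values = to_arrays(
    '{"$browser": "Chrome", "plan": "free", "$screen_width": 1920}'
)
browser = lookup(keys, values, "$browser")
missing = lookup(keys, values, "not_a_key")
```

The appeal of this layout is that the JSON is parsed once at write time, and reads become an array search rather than a parse of the full payload.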
Semi-structured data / JSON data type
In 2022, ClickHouse released support for semi-structured data.
However, after testing, we encountered several fundamental problems that make this feature unusable in our case until they are resolved: 1, 2, 3, 4, and 5.
Next in the ClickHouse manual: Query performance