How to run migrations on PostHog Cloud
This document outlines how to do large-scale data migrations on PostHog Cloud without using Async Migrations.
Background
At the start of 2022, we wanted to change the events table schema to better support our querying patterns.
Doing this migration on cloud took several months and involved several false starts.
Migration strategy
Read the guide to event ingestion before this.
Desired goals for the migration:
- It should be correct - no duplicated or missing data
- It should be timely - can be completed within a reasonable amount of time
- It should cause minimal ingestion delay
- It should keep materialized columns
The rough migration strategy looks like this:
1. Create a new staging table _without_ materialized columns on 1 node on each of the shards.
Note that the ZooKeeper path needs to be unique for this to work.
2. `INSERT` data from the old table to the new staging table (using settings to enable fast copying; see "Why copy this way?" below) on each of the shards
3. Attach a _new_ Kafka engine table + materialized view + distributed table to catch up with the main table
Note that the Kafka consumer group name needs to be different from the previous one to make sure everything gets consumed
4. Create the correct materialized columns on the staging table
The commands that worked during this migration will need to be adjusted for the next one.
5. Remove duplicates from the dataset and materialize columns
Run this on each of the shards.
6. Verify the copy results
Some sample queries were used to drill into issues (sketched after this list).
7. Replicate the new staging table onto each of the nodes on all shards.
Get the `create_table_query` for the new table from `system.tables` and run it on all the remaining nodes.
8. Stop ingestion, swap the staging and main table names
Take care that consumer group names are correct for the migration
9. Drop old table once all is OK.
The sections below sketch the kind of SQL and commands involved in each step.
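For step 1, a minimal sketch of the kind of `CREATE TABLE` run by hand on one node per shard. The table name, columns and ZooKeeper path are illustrative rather than the ones used in the actual migration; the important detail is that the ZooKeeper path must differ from the existing table's.

```sql
-- Step 1 (run on one node per shard): staging table without materialized columns.
-- Table name, columns and ZooKeeper path are illustrative; the real events schema is wider.
CREATE TABLE events_staging
(
    uuid UUID,
    event String,
    properties String,
    timestamp DateTime64(6, 'UTC'),
    team_id Int64,
    distinct_id String,
    created_at DateTime64(6, 'UTC'),
    _timestamp DateTime,
    _offset UInt64
)
ENGINE = ReplicatedReplacingMergeTree(
    -- This path must not collide with the existing table's ZooKeeper path
    '/clickhouse/tables/{shard}/posthog.events_staging',
    '{replica}',
    _timestamp
)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid));
```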
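For step 3, a sketch of the new ingestion wiring, assuming the existing events topic is consumed again with a fresh consumer group. The broker, topic, consumer group, cluster and table names here are assumptions.

```sql
-- Step 3: new Kafka engine table. Broker, topic and group names are assumptions;
-- the consumer group must differ from the one used by the existing table.
CREATE TABLE kafka_events_staging
(
    uuid UUID,
    event String,
    properties String,
    timestamp DateTime64(6, 'UTC'),
    team_id Int64,
    distinct_id String,
    created_at DateTime64(6, 'UTC')
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'clickhouse_events_json',
         kafka_group_name = 'group1_events_staging',
         kafka_format = 'JSONEachRow';

-- Distributed table writing into the staging table on every shard
-- ('posthog' cluster and 'default' database are assumptions).
CREATE TABLE writable_events_staging AS events_staging
ENGINE = Distributed('posthog', 'default', 'events_staging', sipHash64(distinct_id));

-- Materialized view pushing consumed rows into the distributed table.
-- _timestamp and _offset are virtual columns of the Kafka engine table.
CREATE MATERIALIZED VIEW events_staging_mv TO writable_events_staging AS
SELECT uuid, event, properties, timestamp, team_id, distinct_id, created_at,
       _timestamp, _offset
FROM kafka_events_staging;
```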
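For steps 4 and 5, a sketch of recreating one materialized column and then backfilling and deduplicating, assuming a ClickHouse version that supports `ALTER TABLE ... MATERIALIZE COLUMN`. Column and property names are illustrative.

```sql
-- Step 4: recreate a materialized property column on the staging table
-- (illustrative column and property names).
ALTER TABLE events_staging
    ADD COLUMN IF NOT EXISTS mat_current_url String
    MATERIALIZED trim(BOTH '"' FROM JSONExtractRaw(properties, '$current_url'));

-- Step 5 (run on each shard): backfill the column for already-copied rows and
-- collapse duplicates. Both operations are slow and heavy on a table this size.
ALTER TABLE events_staging MATERIALIZE COLUMN mat_current_url;
OPTIMIZE TABLE events_staging FINAL DEDUPLICATE;
```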
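For step 6, the kind of verification queries that can be run on each shard; the per-month and duplicate checks are illustrative ways of drilling into discrepancies.

```sql
-- Step 6: compare totals between the old and new tables on each shard.
SELECT count() FROM events;
SELECT count() FROM events_staging;

-- Drill down by month to locate where counts diverge.
SELECT toYYYYMM(timestamp) AS month, count() AS rows
FROM events_staging
GROUP BY month
ORDER BY month;

-- Spot-check a single partition for leftover duplicates.
SELECT uuid, count() AS copies
FROM events_staging
WHERE toYYYYMM(timestamp) = 202201
GROUP BY uuid
HAVING copies > 1
LIMIT 10;
```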
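For steps 7 to 9, a sketch of pulling the table definition out of `system.tables`, swapping names and dropping the old table. Database and table names remain illustrative.

```sql
-- Step 7: fetch the staging table's definition and run it on every remaining node;
-- replication then backfills the data ('default' database is an assumption).
SELECT create_table_query
FROM system.tables
WHERE database = 'default' AND name = 'events_staging';

-- Step 8 (with ingestion stopped): swap the staging and main tables on each node.
RENAME TABLE events TO events_old, events_staging TO events;

-- Step 9: once everything checks out, drop the old table.
-- May require raising max_table_size_to_drop on a table this large.
DROP TABLE events_old;
```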
How were the migrations run?
In a tmux session on each of the nodes. Metabase isn't the ideal tool for this due to a lack of progress bars.
Why copy this way?
Some benchmarking was done to find the most efficient way of copying data.
Copying in medium-sized chunks, not touching the network, and avoiding re-sorting won out at roughly 1M rows per second. Including materialized columns or replicating immediately would also have slowed the overall copy down.
The settings used during copy were:
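The sketch below shows the kind of `INSERT ... SELECT` and `SETTINGS` clause that matches these goals; the exact setting values are assumptions, the idea being large blocks, many insert threads, no insert-time optimization and no execution time limit.

```sql
-- Run on each shard, copying the local table directly so no data crosses the network.
-- Materialized columns are excluded here and recreated in step 4. Values are illustrative.
INSERT INTO events_staging
SELECT uuid, event, properties, timestamp, team_id, distinct_id, created_at, _timestamp, _offset
FROM events
SETTINGS
    max_block_size = 200000,
    max_insert_block_size = 200000,
    max_threads = 20,
    max_insert_threads = 20,
    max_execution_time = 0,                 -- no timeout for the long-running copy
    max_partitions_per_insert_block = 1000,
    optimize_on_insert = 0;                 -- skip merge/dedup work at insert time
```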
Why not clickhouse-copier?
We initially attempted the copy using clickhouse-copier, but ran into issues:
- Copy speed was low (~50000 rows per second)
- Errors during operations - copier copies tables in chunks, and these chunks exceeded 50GB (the `max_table_size_to_drop` setting), causing errors
- Hard to ensure correctness due to events being ingested from Kafka
- clickhouse-copier always requires setting `sharding_key`, which slowed down copying
- Issues with materialized columns (due to the old version of ClickHouse we were on)
Why not use async migrations?
We created a similar migration for self-hosted that copied data across.
However, at the time of writing, the schemas on cloud and self-hosted had diverged, so the migrations required different strategies. Async migrations also assumed that the amount of data being migrated was smaller than on cloud.
That said, learnings from here will help future async migrations.