Making sure PostHog operates fast at scale is key to our success.
This document outlines some tools we have to discover and fix issues.
Tools of the trade
1. Grafana "Clickhouse queries - by endpoint" dashboard
This dashboard gives a breakdown of how things are looking reliability and performance-wise.
Highly used and slow/unreliable endpoints often indicate issues with queries.
2. Instance status dashboard
Under https://app.posthog.com/instance/status/internal_metrics you will find various metrics and query logs.
Note that if you are a staff user you can also analyze queries by clicking on them (or copying your own queries).
This analysis will output:
- Query runtime
- Number of rows read / Bytes read
- Memory used
- Flamegraphs for CPU, time and memory
These can be useful for figuring out why certain queries are performing slow.
3. Metabase - finding slow queries
Need more granular access to queries than these dashboards provide? Take a look at this metabase query. Clickhouse
system tables (e.g.
system.query_log) provide a lot of useful information for identifying and diagnosing slow queries.
4. ClickHouse query benchmarks
If you have a fix, you need to make sure it works and doesn't regress.
Under the main repo, you can find a query performance benchmark suite at
ee/benchmarks. These are run nightly on master.
After fixing an issue:
- Make sure there's an appropriate benchmark for the fix
performancelabel to your PR to get information about the performance changes
Only changes > 20% are reported. Clickhouse queries are run on a private clickhouse node from our clickhouse chart and pre-populated with data.