On-call and escalation
At PostHog, every engineer is responsible for maintaining their team's services and systems. That includes:
- Tracking and visualizing key performance metrics
- Configuring automated alerting
- Documenting runbooks for on-call resolution
- Acting as first responder for team-owned services during working hours
In addition, every engineer, regardless of team, is part of the global follow-the-sun on-call rotation.
Escalation schedules
Team schedules
Every team has 2 schedules in incident.io:
- `On call: {team}` - This is the working-hours rotation. Each engineer should have their working hours in place here Mon-Fri, with a sensible working day.
  - For example, 8:00-17:00 is likely preferable for EU-based engineers, as there will be US engineers who can take 17:00 onwards.
  - Each member is responsible for keeping this up to date with PTO. You can create an override for your shift and simply assign it to "No one".
- `Support: {team}` - This is a weekly or bi-weekly rotation (teams can decide) that covers both who is assigned to the Support Hero rotation and who handles the out-of-hours escalation in the extreme case.
Global on-call schedule
💡 You can use @on-call-global in Slack to reach out to whoever is on call! This syncs automatically with the incident.io schedule. This group is also automatically added to all incidents.
PostHog Cloud doesn't shut down at night (whose night anyway?) nor on Sunday. As a 24/7 service, our goal is to be 100% operational 100% of the time. The global on-call is the last line of defense and is escalated to:
- if nobody at the `On call: {team}` level is available
- if the alert is critical but has no team assignment (for whatever reason)
This schedule has 3 weekday layers:
- Europe (06:00 to 14:00 UTC) - (8 hours)
- Americas East (14:00 to 22:00 UTC) - (8 hours)
- Americas West (22:00 to 06:00 UTC) - (8 hours)
And 2 weekend layers:
- Europe Weekend (06:00 to 18:00 UTC) - (12 hours)
- Americas Weekend (18:00 to 06:00 UTC) - (12 hours)
Why is the on-call rotation spread across all engineers?
If you're in a product team, it's tempting to think that service alerts don't apply to you, or that when you're on call you can just hand everything off to the infrastructure team. That's not the case, because it's important that every engineer has a basic understanding of how our software is deployed, where the weak points in our systems are, and what the failure modes look like. This understanding should be all that's needed to follow the runbooks, and if you follow the causes of alerts, ultimately you'll be less likely to ship code that takes PostHog down.
Besides knowledge, being on call requires availability – including weekends. If teams had their own separate rotations, there would be more people on call in total, and each would have to stand by 24/7 as our teams aren't big enough to follow the sun. This would be more stressful because of availability constraints, while being less productive because of the rare alerts being spread across multiple people.
Before going on call
Be prepared
Because the stability of production systems is critical, on-call involves weekends too (unlike Support Hero). More likely than not, nothing will happen over the weekend – but you never know, so the important thing is to keep your laptop at hand.
Before going on call, make sure you have the incident.io mobile app (Android / iOS) installed and configured. This way it'll be harder to miss an alert.
TRICKY: We use Slack auth for incident.io and Slack really doesn't like you using the mobile web version. Make sure to choose `Sign in with Slack` and then use your email to log in to Slack, not Google auth, as that seems to cause redirect issues for some people.
To get a calendar with all your on-call shifts from incident.io, go to the schedules section, select `Sync calendar` at the top right, and copy the link for the webcal feed. In Google Calendar, add a new calendar from URL and paste the link in there.
Make sure your availability is up-to-date
If you are unavailable for any of your schedules you need to act!
- For your `On call: {team}` schedule, simply click on your name in your layer, click `create an override`, and then remove yourself from the list so it shows `No one`.
- For your `Support: {team}` schedule or `On call: {global}` schedules, click `Request cover` at the top right. This will notify selected team members automatically to find someone to cover you (you should probably do a shout out in #ask-posthog-anything as well). You can trade whole weeks, but also just specific days. Remember not to alter the rotation's core order, as that's an easy way to accidentally shift the schedule for everyone.
Make sure you have all the access you might need
To be ready, make sure you have access to:
- PostHog Cloud admin interfaces (🇺🇸 US / 🇪🇺 EU) - post in #ask-posthog-anything to be added
- Grafana (🇺🇸 US / 🇪🇺 EU)
- ArgoCD - this is where 99% of cluster operations take place, such as restarting pods, scaling things up and down, etc.
- Metabase (🇺🇸 US / 🇪🇺 EU) - post in #ask-posthog-anything to be invited
More advanced access
If you are part of a team that looks after more critical infrastructure (such as infra, ingestion, workflows, or error tracking), you are expected to dive deeper than the usual on-call engineer.
In addition to the access above, you should ensure you have access to, and feel comfortable working with:
- EKS over `kubectl`/`k9s`, in case you need to run Kubernetes cluster operations (such as restarting a pod) – follow this guide to get access (see the sketch after this list for the kind of commands involved)
- Our tailnet, which gates our internal services (such as Grafana, Metabase, or runbooks) – follow this guide to join
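If you rarely touch Kubernetes, here is a minimal sketch of the kind of `kubectl`/`k9s` commands a cluster operation like "restart a pod" usually boils down to. The context, namespace, pod, and deployment names below are placeholders, not our real resources – take the actual names from the relevant runbook.

```bash
# Point kubectl at the right cluster first (context name is a placeholder).
kubectl config use-context my-eks-cluster

# Check pod health in a namespace (namespace is a placeholder).
kubectl get pods -n some-namespace

# Tail recent logs from a misbehaving pod (pod name is a placeholder).
kubectl logs some-pod-abc123 -n some-namespace --tail=100

# Restart a deployment by rolling its pods (deployment name is a placeholder).
kubectl rollout restart deployment/some-deployment -n some-namespace

# Or browse the same resources interactively with k9s.
k9s -n some-namespace
```

In practice ArgoCD covers most day-to-day cluster operations (see the access list above), so reach for `kubectl` directly only when a runbook calls for it.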
Responding to alerts when on-call

Critical alerts will trigger per-team escalation policies which go like this:
- If available, a member of the team associated with the alert is paged first
- If nobody is available or nobody responds within the configured time, then the `On call: global` schedule is paged
If at any point you get paged, always respond! Even if you are unavailable, respond to say so (either via the app or the personal Slack notification). That way the escalation can continue to the next available person.
By default, if you are being paged, especially as the global on-call, the alert is considered critical, meaning it almost definitely requires attention.
Every alert should have associated Grafana and Runbook links allowing you to quickly get more visual details of what is going on and how to respond.
When an alert fires, check whether there's a runbook for it. A runbook tells you what to look at and what fixes exist. In any case, your first priority will be to understand what's going on, and the right starting point will almost always be Grafana.
Sometimes alerts are purposefully over-sensitive and might already be fixing themselves by the time you see them. Use your best judgement here. If the linked graph has a spike that is clearly coming down, watch it closely and give the alert time to auto-resolve.
If the alert is starting to have any noticeable impact on users or you are not sure whether to raise an incident - go raise an incident. It's that simple.
If you're stumped and no resource is of help, get someone from the relevant team to shadow you while you sort the problem out. The idea is that they can help you understand the issue and where to look when debugging it. The idea is not for them to take over at this point, as otherwise you won't be able to learn from this incident.