Bayesian statistics
PostHog's Bayesian analysis engine provides a statistically rigorous yet user-friendly approach to experimentation. Instead of p-values and confidence intervals, it calculates the probability that one variant is better than another and directly answers the question: "Should I ship this change?"
For example, instead of saying "statistically significant at p < 0.05," you get intuitive statements like:
"There's a 96% chance that variant B increases conversion rate."
Input data: What goes into the analysis
The Bayesian engine processes three types of metrics:
Funnel metrics
Track whether users complete a specific action like "Did they complete checkout?" These are yes/no outcomes that we analyze as conversion rates. Learn more about funnel metrics.
Mean metrics
Measure numeric values for each user, such as revenue per user or session duration. We calculate the average value across all users in each variant. Learn more about mean metrics.
Ratio metrics
Combine two metrics by dividing one by another, like revenue per order or clicks per session. These help you understand efficiency and the relationship between different user behaviors. Learn more about ratio metrics.
What the experimentation pipeline does
The engine transforms raw experiment data through a clear 5-step pipeline:
- Aggregate user-level data into sufficient statistics
- Validate data quality and sample size requirements
- Calculate effect size and its uncertainty
- Update beliefs using Bayesian inference
- Generate intuitive probability-based results
Step 1: Data aggregation
Aggregation into sufficient statistics
The raw event data is aggregated into sufficient statistics for each variant:
These sufficient statistics contain all the information needed to calculate means, variances, and covariances without storing individual user data.
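As a rough illustration of why a few running totals are enough, the sketch below aggregates per-user values into a count, a sum, and a sum of squares, then recovers the mean and sample variance from those totals alone. The `SufficientStats` class and its field names are hypothetical, chosen for this example; they are not PostHog's actual schema.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Running totals for one variant (hypothetical names, for illustration only)."""

    n: int = 0              # number of users
    total: float = 0.0      # sum of per-user values
    total_sq: float = 0.0   # sum of squared per-user values

    def add(self, value: float) -> None:
        self.n += 1
        self.total += value
        self.total_sq += value * value

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        # Unbiased sample variance, computed without the individual values.
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)


stats = SufficientStats()
for revenue in [10.0, 12.0, 8.0, 10.0]:
    stats.add(revenue)
```

Once the totals are stored, the individual values can be discarded. A covariance between two metrics would need one more total (the sum of products), which this sketch omits.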
Outlier handling (winsorization)
For mean metrics, extreme outliers can skew results. PostHog optionally applies winsorization, which replaces values below a lower percentile (e.g., the 1st percentile) with that percentile's value and values above an upper percentile (e.g., the 99th percentile) with that percentile's value. Unlike trimming, no data points are removed. This makes results more robust to data entry errors or extreme edge cases. The percentile bounds are configured per metric when setting up the experiment.
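A minimal sketch of winsorization, assuming linearly interpolated percentiles (PostHog's exact percentile method is not stated here). The helper names `percentile` and `winsorize` are made up for this example.

```python
def percentile(sorted_values: list[float], q: float) -> float:
    """Linearly interpolated percentile of pre-sorted data, with q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def winsorize(values: list[float], lower_q: float = 1, upper_q: float = 99) -> list[float]:
    """Clamp values to the [lower_q, upper_q] percentile range; nothing is dropped."""
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]


revenues = [5.0, 7.0, 6.0, 8.0, 5000.0]  # one extreme outlier
capped = winsorize(revenues, lower_q=0, upper_q=75)
```

The unusually low upper bound of 75 is only there so the effect is visible on five data points: the outlier is pulled down to the 75th-percentile value while the sample size stays the same.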
Step 2: Data quality validation
Before analysis proceeds, the engine validates the data quality based on metric type:
All metrics:
- Minimum sample size: Each variant needs at least 50 exposures
Funnel metrics only:
- Minimum conversions: At least 5 conversions per variant
- Normal approximation validity: Both n × p > 5 and n × (1 - p) > 5 must be satisfied, where n is the number of users and p is the conversion rate. This ensures the normal approximation to the binomial distribution is accurate.
Mean and ratio metrics only:
- Non-zero baseline: The control variant must have a non-zero mean (needed for relative difference calculations like "20% increase")
If any validation fails, the analysis stops and returns appropriate error messages instead of potentially misleading results.
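The checks above could be expressed roughly as follows. This is a sketch under the thresholds listed in this section; the function names, constant names, and error strings are invented for illustration and do not reflect PostHog's internal API.

```python
MIN_EXPOSURES = 50       # minimum exposures per variant
MIN_CONVERSIONS = 5      # minimum conversions per variant (funnel metrics)
NORMAL_APPROX_MIN = 5    # n*p and n*(1-p) must both exceed this (funnel metrics)


def validate_funnel_variant(exposures: int, conversions: int) -> list[str]:
    """Return a list of validation errors; an empty list means all checks pass."""
    errors = []
    if exposures < MIN_EXPOSURES:
        errors.append(f"needs at least {MIN_EXPOSURES} exposures")
    if conversions < MIN_CONVERSIONS:
        errors.append(f"needs at least {MIN_CONVERSIONS} conversions")
    p = conversions / exposures if exposures else 0.0
    if not (exposures * p > NORMAL_APPROX_MIN and exposures * (1 - p) > NORMAL_APPROX_MIN):
        errors.append("normal approximation to the binomial is not valid")
    return errors


def validate_baseline(control_mean: float) -> list[str]:
    """Mean and ratio metrics need a non-zero control mean for relative differences."""
    return ["control mean is zero"] if control_mean == 0 else []
```

Returning a list of errors rather than raising on the first failure lets a caller report every problem at once.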
Step 3: Calculate effect size and variance
Effect size calculation
The effect size quantifies how much better (or worse) the treatment variant performs compared to control. PostHog uses relative differences by default, which express changes as percentages:
For example, if control has a 10% conversion rate and treatment has 12%, the relative difference is +20%. This is what PostHog displays in the UI (e.g., "Variant B: +20% conversion rate"). Relative differences are intuitive for business metrics because they show the proportional change, making it easy to understand the business impact.
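The relative difference in the example above is the change in the metric divided by the control value. A small sketch, with a hypothetical function name:

```python
def relative_difference(control_mean: float, treatment_mean: float) -> float:
    """(treatment - control) / control; requires a non-zero control mean."""
    if control_mean == 0:
        raise ValueError("control mean must be non-zero")
    return (treatment_mean - control_mean) / control_mean


lift = relative_difference(control_mean=0.10, treatment_mean=0.12)
print(f"{lift:+.0%}")  # prints "+20%"
```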
Variance calculation
Variance measures the spread or uncertainty in our data. Think of it as quantifying "how sure are we about this number?" A high variance means the data points are spread out and we're less certain about the true average. A low variance means the data is consistent and we can be more confident.
Variance is calculated from the actual experiment data. When users in your experiment convert at different rates or have wildly different revenue values, that creates variance. The formulas differ by metric type:
For funnel metrics:
Where p is the conversion rate and n is the sample size. The variance is highest when p=0.5 (50% conversion rate) and lowest when p is close to 0 or 1.
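The description above matches the standard variance of an estimated proportion, p × (1 - p) / n. The sketch below assumes that form, since the formula itself is not shown here; the function name is illustrative.

```python
def funnel_variance(conversions: int, exposures: int) -> float:
    """Variance of an estimated conversion rate: p * (1 - p) / n."""
    p = conversions / exposures
    return p * (1 - p) / exposures


# Same sample size, different conversion rates: the variance peaks at p = 0.5.
for conversions in (50, 500, 950):
    print(conversions / 1000, funnel_variance(conversions, 1000))
```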
For mean metrics:
This measures how spread out individual user values are from the average.
For ratio metrics:
This complex formula accounts for uncertainty in both the numerator and denominator, plus how they vary together.
The effect variance (comparing treatment vs control) combines the variances from both groups:
This combined variance is what determines the width of our credible intervals and our confidence in the results.
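How the two group variances combine depends on the effect scale. The sketch below shows two textbook options for independent groups: a plain sum for an absolute difference, and a first-order delta-method approximation for a relative difference. Both are assumptions for illustration; the exact formula PostHog uses is not shown in this section. The inputs are variances of each group's estimated mean (for a mean metric, the sample variance divided by n).

```python
def absolute_effect_variance(var_control_mean: float, var_treatment_mean: float) -> float:
    """Variance of (treatment - control) for independent groups."""
    return var_control_mean + var_treatment_mean


def relative_effect_variance(
    control_mean: float,
    var_control_mean: float,
    treatment_mean: float,
    var_treatment_mean: float,
) -> float:
    """Delta-method variance of (treatment - control) / control, independent groups."""
    return (
        var_treatment_mean / control_mean ** 2
        + treatment_mean ** 2 * var_control_mean / control_mean ** 4
    )


# 10% vs 12% conversion with 1,000 users per variant, using p * (1 - p) / n.
var_c = 0.10 * 0.90 / 1000
var_t = 0.12 * 0.88 / 1000
effect_var = relative_effect_variance(0.10, var_c, 0.12, var_t)
```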
Step 4: Bayesian posterior update
The Bayesian approach combines prior beliefs with observed data to create an updated (posterior) distribution of the effect size. This posterior distribution represents our best understanding of the true effect, accounting for all uncertainty.
The prior distribution
PostHog uses non-informative priors with a mean of 0 (no expected effect) and very large variance (highly uncertain). This essentially means "let the data speak for itself" - our results are driven entirely by the observed data, not by any preconceived notions about what the effect should be.
The posterior distribution
With a non-informative prior, the math simplifies:
The posterior distribution is approximately:
This normal distribution represents our updated belief about the true effect size, incorporating all the uncertainty from our finite sample.
Step 5: Generate results
From the posterior distribution, we calculate the key metrics shown in the PostHog UI.
Chance to win
The probability that the treatment variant is actually better than control:
This uses the cumulative distribution function (CDF) of the normal distribution to calculate the probability that the effect size is positive (improvement) or negative (degradation).
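With a normal posterior, the chance to win is the posterior mass on the favorable side of zero. A minimal sketch using Python's standard library; the function name and the `higher_is_better` flag are assumptions made for this example.

```python
from statistics import NormalDist


def chance_to_win(effect_mean: float, effect_variance: float, higher_is_better: bool = True) -> float:
    """Posterior probability that the effect is on the favorable side of zero."""
    posterior = NormalDist(mu=effect_mean, sigma=effect_variance ** 0.5)
    p_negative = posterior.cdf(0.0)  # P(effect < 0)
    return 1.0 - p_negative if higher_is_better else p_negative


probability = chance_to_win(effect_mean=0.20, effect_variance=0.02352)
```

For a metric where lower is better (for example, load time), passing `higher_is_better=False` returns the probability that the effect is negative instead.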
Credible interval
The 95% credible interval gives a range where the true effect size likely lies:
You can directly interpret this as: "There's a 95% probability the true effect lies in this range."
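For a normal posterior, a 95% credible interval can be read off the 2.5th and 97.5th percentiles, which is roughly the posterior mean plus or minus 1.96 standard deviations. The sketch below assumes this equal-tailed construction; the function name is illustrative.

```python
from statistics import NormalDist


def credible_interval(effect_mean: float, effect_variance: float, level: float = 0.95) -> tuple[float, float]:
    """Equal-tailed credible interval of a normal posterior."""
    posterior = NormalDist(mu=effect_mean, sigma=effect_variance ** 0.5)
    tail = (1.0 - level) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)


low, high = credible_interval(effect_mean=0.20, effect_variance=0.02352)
```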
Significance (decisiveness)
A result is marked as "significant" (or decisive) when we have strong evidence:
This means we're at least 95% confident that one variant is better than the other.
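One way to express this rule, assuming it is a two-sided check on the chance to win with a 0.95 threshold (inferred from the sentence above, not from PostHog's code):

```python
def is_significant(chance_to_win: float, threshold: float = 0.95) -> bool:
    """True when either variant is favored with at least `threshold` probability."""
    return max(chance_to_win, 1.0 - chance_to_win) >= threshold
```

Under this reading, a chance to win of 0.03 is also decisive: it means control is better with 97% probability.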
Mathematical formulas reference
Posterior update with Gaussian conjugate priors
For those interested in the mathematical details, here's how Bayesian updating works using Gaussian conjugate priors (a mathematical framework where prior and posterior distributions have the same form):
The key insight is that precision (inverse of variance) represents how "confident" we are - higher precision means lower uncertainty. The posterior precision is simply the sum of prior and data precisions.
With non-informative priors (τ₀ → 0), this simplifies to μₙ ≈ x̄ and σₙ² ≈ σ²/n.
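The update can be sketched in a few lines. This is the standard normal-normal conjugate update with known data variance; the function and parameter names are chosen for this example.

```python
def gaussian_posterior(
    prior_mean: float,
    prior_variance: float,
    sample_mean: float,
    sample_variance: float,
    n: int,
) -> tuple[float, float]:
    """Return (posterior_mean, posterior_variance) for a normal prior and normal data."""
    prior_precision = 1.0 / prior_variance    # tau_0
    data_precision = n / sample_variance      # n / sigma^2
    posterior_precision = prior_precision + data_precision
    posterior_mean = (
        prior_precision * prior_mean + data_precision * sample_mean
    ) / posterior_precision
    return posterior_mean, 1.0 / posterior_precision


# A very wide prior (tau_0 close to 0) leaves the data almost untouched.
mean, variance = gaussian_posterior(0.0, 1e12, sample_mean=0.2, sample_variance=4.0, n=100)
```

With the wide prior, `mean` is approximately the sample mean (0.2) and `variance` is approximately σ²/n (0.04), which is the simplification stated above.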
Delta method for ratio metrics
For a ratio R = M/D where M and D are random variables, the delta method provides a linear approximation for the variance:
This first-order Taylor approximation works well when the coefficient of variation of D is small (< 0.3), meaning the denominator doesn't vary too much relative to its mean.
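The first-order delta-method approximation for R = M/D is Var(R) ≈ Var(M)/D² + M²·Var(D)/D⁴ − 2·M·Cov(M, D)/D³, evaluated at the means of M and D. A sketch with illustrative names, plus a helper for the coefficient-of-variation check mentioned above:

```python
def ratio_variance(mean_m: float, mean_d: float, var_m: float, var_d: float, cov_md: float) -> float:
    """First-order delta-method variance of R = M / D."""
    return (
        var_m / mean_d ** 2
        + mean_m ** 2 * var_d / mean_d ** 4
        - 2 * mean_m * cov_md / mean_d ** 3
    )


def denominator_cv(mean_d: float, var_d: float) -> float:
    """Coefficient of variation of the denominator: std / |mean|."""
    return var_d ** 0.5 / abs(mean_d)


approx = ratio_variance(mean_m=10.0, mean_d=2.0, var_m=1.0, var_d=0.04, cov_md=0.1)
cv = denominator_cv(mean_d=2.0, var_d=0.04)
```

A positive covariance between numerator and denominator reduces the variance of the ratio, because the two tend to move together.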
What about multiple variants?
When testing multiple variants (A/B/C/D tests):
- Each variant is compared to control independently
- The chance to win is calculated for each variant vs. control
- No correction for multiple comparisons is applied
- This is philosophically consistent: each comparison stands on its own
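The independent comparisons could look like this. It is a sketch that assumes each variant already has a normal posterior for its effect relative to control; the function name and input shape are hypothetical.

```python
from statistics import NormalDist


def compare_to_control(effects: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map each variant to its chance to win against control.

    `effects` maps a variant name to (effect_mean, effect_variance) vs. control.
    Each comparison is computed on its own, with no multiple-comparison correction.
    """
    return {
        name: 1.0 - NormalDist(mean, variance ** 0.5).cdf(0.0)
        for name, (mean, variance) in effects.items()
    }


results = compare_to_control({"B": (0.20, 0.01), "C": (-0.05, 0.01), "D": (0.0, 0.01)})
```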