Frequentist statistics
PostHog's frequentist analysis engine provides a statistically rigorous and widely understood approach to experimentation. It uses t-tests and confidence intervals to determine whether differences between variants are statistically significant and directly answers the question:
"Can we be confident this difference isn't due to chance?"
Input data: What goes into the analysis
The frequentist engine processes three types of metrics:
Funnel metrics
Track whether users complete a specific action like "Did they complete checkout?" These are yes/no outcomes that we analyze as conversion rates. (Learn more about funnel metrics)
Mean metrics
Measure numeric values for each user, such as revenue per user or session duration. We calculate the average value across all users in each variant. (Learn more about mean metrics)
Ratio metrics
Combine two metrics by dividing one by another, like revenue per order or clicks per session. These help you understand efficiency and the relationship between different user behaviors. (Learn more about ratio metrics)
What the experimentation pipeline does
The engine transforms raw experiment data through a clear 5-step pipeline:
- Aggregate user-level data into sufficient statistics
- Validate data quality and sample size requirements
- Calculate effect size and its uncertainty
- Perform statistical hypothesis testing
- Generate confidence intervals and significance results
Step 1: Data aggregation
Aggregation into sufficient statistics
The raw event data is aggregated into sufficient statistics for each variant.
These sufficient statistics contain all the information needed to calculate means, variances, and standard errors without storing individual user data.
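To make this concrete, here is a minimal sketch of what per-variant sufficient statistics for a mean metric could look like. The field names and structure are illustrative, not PostHog's actual schema:

```python
from dataclasses import dataclass

# Illustrative only: not PostHog's actual data model.
@dataclass
class SufficientStats:
    n: int                # number of exposed users in the variant
    total: float          # sum of the metric value across users
    total_squares: float  # sum of squared values, needed for the variance

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        # Sample variance recovered from sufficient statistics:
        # (sum of squares - n * mean^2) / (n - 1)
        return (self.total_squares - self.n * self.mean ** 2) / (self.n - 1)

def aggregate(values: list[float]) -> SufficientStats:
    """Collapse raw per-user values into per-variant sufficient statistics."""
    return SufficientStats(
        n=len(values),
        total=sum(values),
        total_squares=sum(v * v for v in values),
    )
```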
Outlier handling (winsorization)
For mean metrics, extreme outliers can skew results. PostHog optionally applies winsorization by trimming values below a lower percentile (e.g., 1st percentile) and capping values above an upper percentile (e.g., 99th percentile). This makes results more robust to data entry errors or extreme edge cases. The percentile bounds are configured per metric when setting up the experiment.
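A minimal sketch of winsorization with NumPy, using illustrative 1st/99th percentile defaults (in PostHog the bounds come from the metric's configuration):

```python
import numpy as np

def winsorize(values: np.ndarray, lower_pct: float = 1.0, upper_pct: float = 99.0) -> np.ndarray:
    """Clamp values to the given percentile bounds, e.g. the 1st and 99th."""
    lower = np.percentile(values, lower_pct)
    upper = np.percentile(values, upper_pct)
    return np.clip(values, lower, upper)
```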
Step 2: Data quality validation
Before analysis proceeds, the engine validates the data quality based on metric type:
All metrics:
- Minimum sample size: Each variant needs at least 50 exposures
Funnel metrics only:
- Minimum conversions: At least 5 conversions per variant
- Normal approximation validity: For proportions, both np ≥ 5 and n(1-p) ≥ 5 are required for the t-test to be valid
Mean and ratio metrics only:
- Non-zero baseline: The control variant must have a non-zero mean (needed for relative difference calculations like "20% increase")
If any validation fails, the analysis stops and returns appropriate error messages instead of potentially misleading results.
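As a rough illustration of these checks (not PostHog's actual validation code), using the thresholds listed above:

```python
def validate_funnel(n: int, conversions: int) -> list[str]:
    """Data-quality checks for one funnel-metric variant."""
    errors = []
    if n < 50:
        errors.append("Not enough exposures (need at least 50)")
    if conversions < 5:
        errors.append("Not enough conversions (need at least 5)")
    p = conversions / n if n else 0.0
    if n * p < 5 or n * (1 - p) < 5:
        errors.append("Normal approximation not valid (need np >= 5 and n(1-p) >= 5)")
    return errors

def validate_mean_or_ratio(n: int, control_mean: float) -> list[str]:
    """Data-quality checks for mean and ratio metrics."""
    errors = []
    if n < 50:
        errors.append("Not enough exposures (need at least 50)")
    if control_mean == 0:
        errors.append("Control mean is zero, so a relative difference is undefined")
    return errors
```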
Step 3: Calculate effect size and variance
Effect size calculation
The effect size quantifies how much better (or worse) the treatment variant performs compared to control. PostHog uses relative differences by default, which express changes as percentages: relative difference = (treatment value - control value) / control value.
For example, if control has a 10% conversion rate and treatment has 12%, the relative difference is +20%. This is what PostHog displays in the UI (e.g., "Variant B: +20% conversion rate"). Relative differences are intuitive for business metrics because they show the proportional change, making it easy to understand the business impact.
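A small sketch of the relative-difference calculation, reproducing the worked example above (the function name is illustrative):

```python
def relative_difference(control_mean: float, treatment_mean: float) -> float:
    """Relative effect size: proportional change of treatment over control."""
    return (treatment_mean - control_mean) / control_mean

# Worked example from the text: 10% -> 12% conversion is a +20% relative difference.
assert abs(relative_difference(0.10, 0.12) - 0.20) < 1e-9
```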
Variance calculation
Variance measures the spread or uncertainty in our data. Think of it as quantifying "how sure are we about this number?" A high variance means the data points are spread out and we're less certain about the true average. A low variance means the data is consistent and we can be more confident.
Variance is calculated from the actual experiment data. When users in your experiment convert at different rates or have wildly different revenue values, that creates variance. The formulas differ by metric type:
For funnel metrics:
variance = p(1 - p) / n
Where p is the conversion rate and n is the sample size. The variance is highest when p = 0.5 (a 50% conversion rate) and lowest when p is close to 0 or 1.
For mean metrics:
s² = Σ(xᵢ - x̄)² / (n - 1)
This sample variance measures how spread out individual user values are from the average.
For ratio metrics:
The variance is approximated with the delta method, which accounts for uncertainty in both the numerator and denominator, plus how they vary together (their covariance).
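The sketch below shows these variances in Python. The ratio-metric version is the standard textbook delta-method form and may differ in detail from PostHog's exact implementation:

```python
def funnel_variance(p: float, n: int) -> float:
    """Variance of the estimated conversion rate: p(1 - p) / n."""
    return p * (1 - p) / n

def mean_metric_variance(values: list[float]) -> float:
    """Sample variance s^2: how spread out individual user values are around their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def ratio_variance(mean_num: float, mean_den: float,
                   var_num: float, var_den: float, cov: float) -> float:
    """Textbook delta-method variance for a ratio metric (numerator / denominator).

    var_num, var_den and cov are the variances and covariance of the two
    per-variant means; not necessarily PostHog's exact formula.
    """
    r = mean_num / mean_den
    return (var_num - 2 * r * cov + r ** 2 * var_den) / mean_den ** 2
```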
The pooled variance (comparing treatment vs control) combines the variances from both groups:
For absolute differences:
pooled variance = s_c²/n_c + s_t²/n_t
For relative differences (using delta method):
pooled variance ≈ (s_t²/n_t) / x̄_c² + (x̄_t² / x̄_c⁴) · (s_c²/n_c)
Where s_c² and s_t² are the control and treatment sample variances, n_c and n_t the sample sizes, and x̄_c and x̄_t the control and treatment means.
This pooled variance determines the width of our confidence intervals and the sensitivity of our statistical test.
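A minimal sketch of both pooled standard errors, assuming the standard delta-method result and independent variants (not necessarily PostHog's exact code):

```python
import math

def absolute_standard_error(var_control: float, n_control: int,
                            var_treatment: float, n_treatment: int) -> float:
    """SE of the absolute difference in means: sqrt(s_c^2/n_c + s_t^2/n_t)."""
    return math.sqrt(var_control / n_control + var_treatment / n_treatment)

def relative_standard_error(mean_control: float, var_control: float, n_control: int,
                            mean_treatment: float, var_treatment: float, n_treatment: int) -> float:
    """Delta-method SE of the relative difference (treatment - control) / control.

    Treatment and control are independent, so there is no covariance term.
    """
    var_rel = (var_treatment / n_treatment) / mean_control ** 2 \
        + (mean_treatment ** 2 / mean_control ** 4) * (var_control / n_control)
    return math.sqrt(var_rel)
```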
Step 4: Statistical hypothesis testing
The frequentist approach tests a specific hypothesis about the difference between variants using a two-sided t-test. This follows the classical statistical framework of null hypothesis significance testing.
The null hypothesis
The null hypothesis (H₀) states that there is no difference between the treatment and control:
- H₀: effect_size = 0 (no difference)
- H₁: effect_size ≠ 0 (there is a difference)
The t-test
PostHog uses Welch's t-test, which handles unequal variances between groups:
T-statistic calculation:
t = effect size / SE
Degrees of freedom (Welch-Satterthwaite approximation):
df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)² / (n₁ - 1) + (s₂²/n₂)² / (n₂ - 1) ]
Where s₁² and s₂² are the sample variances, n₁ and n₂ are the sample sizes, and SE = √(pooled variance) is the standard error of the effect size from step 3.
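A compact sketch of both calculations, assuming the effect size and its standard error from step 3 (illustrative, not PostHog's code):

```python
def welch_t_and_df(effect: float, se: float,
                   var_control: float, n_control: int,
                   var_treatment: float, n_treatment: int) -> tuple[float, float]:
    """t-statistic (effect / SE) and Welch-Satterthwaite degrees of freedom."""
    t_stat = effect / se
    a = var_control / n_control
    b = var_treatment / n_treatment
    df = (a + b) ** 2 / (a ** 2 / (n_control - 1) + b ** 2 / (n_treatment - 1))
    return t_stat, df
```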
P-value calculation
The p-value represents the probability of observing a difference this large or larger if there truly was no effect:
p = 2 × (1 - F(|t|))
Where F is the cumulative distribution function (CDF) of the t-distribution with the degrees of freedom above. Doubling the one-sided tail probability gives the two-sided p-value.
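With SciPy's t-distribution, the two-sided p-value can be computed like this (a sketch, not PostHog's code):

```python
from scipy import stats

def two_sided_p_value(t_stat: float, df: float) -> float:
    """Two-sided p-value: 2 * P(T >= |t|) under the t-distribution with df degrees of freedom."""
    return 2 * stats.t.sf(abs(t_stat), df)
```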
Step 5: Generate results
From the statistical test, we generate the key metrics shown in the PostHog UI.
Point estimate
The best estimate of the true effect size - this is simply the observed difference between treatment and control, expressed as a relative percentage by default.
Confidence interval
The 95% confidence interval gives a range of plausible values for the true effect:
CI = effect size ± t_critical × SE
Where t_critical is the critical value from the t-distribution at the 95% confidence level and SE is the standard error of the effect size.
You can interpret this as: "If we repeated this experiment many times, 95% of confidence intervals would contain the true effect size."
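A minimal sketch of the confidence-interval calculation with SciPy (illustrative, not PostHog's code):

```python
from scipy import stats

def confidence_interval(effect: float, se: float, df: float,
                        confidence: float = 0.95) -> tuple[float, float]:
    """CI = effect ± t_critical × SE at the requested confidence level."""
    t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df)
    return effect - t_critical * se, effect + t_critical * se
```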
Statistical significance
A result is marked as "statistically significant" when we have strong evidence against the null hypothesis:
p-value < 0.05
This means there's less than a 5% probability that we'd observe such a large difference if there truly was no effect. The 0.05 threshold (alpha level) is the standard significance level used in PostHog.
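In code, the decision rule is a simple threshold check (a sketch, using the default alpha of 0.05):

```python
ALPHA = 0.05  # default significance level

def is_significant(p_value: float, alpha: float = ALPHA) -> bool:
    """Flag a result as statistically significant when p < alpha."""
    return p_value < alpha
```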
Mathematical formulas reference
T-test with Welch-Satterthwaite degrees of freedom
For those interested in the mathematical details, here's how Welch's t-test handles unequal variances between the two groups (it does not assume the variances are equal):
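In standard textbook notation, Welch's statistic for the difference in means and its degrees of freedom are:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
{\dfrac{\left(s_1^2 / n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2 / n_2\right)^{2}}{n_2 - 1}}
```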
The key insight is that Welch's method accounts for different variances between groups by adjusting the degrees of freedom, making it more robust than the traditional equal-variance t-test.
Delta method for relative differences
For relative differences, we use the delta method to approximate the variance:
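A textbook statement of this approximation for the relative difference (treatment mean relative to control mean), including the covariance term, is shown below; PostHog's notation may differ:

```latex
\operatorname{Var}\!\left(\frac{\bar{x}_t - \bar{x}_c}{\bar{x}_c}\right)
\approx \frac{\operatorname{Var}(\bar{x}_t)}{\mu_c^{2}}
+ \frac{\mu_t^{2}}{\mu_c^{4}} \operatorname{Var}(\bar{x}_c)
- \frac{2\,\mu_t}{\mu_c^{3}} \operatorname{Cov}(\bar{x}_t, \bar{x}_c)
```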
Since treatment and control are independent, the covariance term equals zero, simplifying the calculation.
What about multiple variants?
When testing multiple variants (A/B/C/D tests):
- Each variant is compared to control independently
- The p-value and confidence interval are calculated for each variant vs. control
- No correction for multiple comparisons is applied by default
- Each comparison uses the same α = 0.05 significance level
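A minimal sketch of these independent comparisons using SciPy's summary-statistics t-test. This mirrors the behavior described above, but tests absolute differences in means rather than the relative effect sizes PostHog reports:

```python
from scipy import stats

def compare_variants_to_control(control: dict, variants: dict[str, dict],
                                alpha: float = 0.05) -> dict[str, dict]:
    """Run an independent Welch's t-test for each variant vs. control.

    Each dict holds summary statistics: {"mean": ..., "std": ..., "n": ...}.
    No multiple-comparison correction is applied, matching the default behavior.
    """
    results = {}
    for name, v in variants.items():
        t_stat, p_value = stats.ttest_ind_from_stats(
            v["mean"], v["std"], v["n"],
            control["mean"], control["std"], control["n"],
            equal_var=False,  # Welch's t-test, no equal-variance assumption
        )
        results[name] = {"t": t_stat, "p": p_value, "significant": p_value < alpha}
    return results
```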