Statistical Significance Calculator

Enter your A/B test results and get a p-value, confidence interval, and a clear pass/fail verdict in seconds.

What Is Statistical Significance?

Statistical significance answers one question: if the variant had no real effect, how likely would I be to see a difference from control at least this large just by chance? If that probability (the p-value) is below your threshold (typically 5%), the result is significant: the difference is unlikely to be noise.

Reading the result

A green “Significant” badge means your variant beat control by enough to rule out random chance at your chosen confidence level. Red “Not significant” means you can't distinguish the variant from the control — yet.

P-value

The probability you'd see a difference at least this large if the variant had no real effect. Lower is better. p < 0.05 is the most common threshold.

Confidence interval

The range of plausible values for the true lift. If the interval doesn't cross zero, the result is significant. A wide interval means more uncertainty — usually a sign you need more data.

How Statistical Significance Is Calculated

For binary conversion metrics (signed up / didn't, paid / didn't), the standard test is a two-proportion z-test. The z-score measures how many standard errors apart your two rates are. The p-value comes from the z-score via the normal distribution.

z = (pV − pC) / √(p̄(1−p̄) × (1/nC + 1/nV))

pC, pV — Observed rates

Your control conversion rate (conversions ÷ visitors) and variant conversion rate. The numerator pV − pC is the absolute lift in percentage points.

p̄ — Pooled rate

The combined rate across both groups, treating them as one sample. Used in the standard error because under the null hypothesis (no effect), both groups have the same true rate.

nC, nV — Sample sizes

Visitors in control and variant. Larger samples shrink the standard error, which makes smaller lifts detectable. This is why traffic is the most reliable lever for reaching significance.

p-value

Derived from z by integrating the normal distribution. Two-tailed: p = 2 × P(Z > |z|). One-tailed: p = P(Z > z). The calculator handles both.
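
For readers who want to see the arithmetic end to end, here is a minimal sketch of the calculation above in Python. It assumes SciPy is available for the normal distribution; the function name and example numbers are illustrative, not the calculator's internal code.

from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_c, n_c, conv_v, n_v, two_tailed=True):
    p_c = conv_c / n_c                        # control conversion rate
    p_v = conv_v / n_v                        # variant conversion rate
    p_bar = (conv_c + conv_v) / (n_c + n_v)   # pooled rate under the null hypothesis
    se = sqrt(p_bar * (1 - p_bar) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    p_value = 2 * norm.sf(abs(z)) if two_tailed else norm.sf(z)  # normal tail area(s)
    return z, p_value

# Example: 200 conversions / 10,000 visitors (control) vs 250 / 10,000 (variant)
z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")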

What Counts as Significant?

A p-value alone doesn't tell the whole story. Pair it with the confidence interval and your business context.

p < 0.01

Strong evidence

Less than a 1% chance of seeing a difference this large if there were no real effect. Confident enough for high-stakes shipping decisions.

p < 0.05

Standard threshold

The default for most A/B tests. Roughly 1 in 20 tests with no real effect will still come up significant at this threshold, so keep that in mind when you run many tests (see the sketch after these thresholds).

p ≥ 0.05

Inconclusive

Not significant at 95%. Doesn't mean the variant doesn't work — it means you can't rule out random chance with this data.
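
To make the 1-in-20 caveat concrete, here is a quick back-of-the-envelope sketch. It assumes each test is independent and that none of the variants has a real effect; the numbers are illustrative, not a model of your test program.

# Chance of at least one false positive across k independent A/B tests,
# assuming no variant has a real effect, at threshold alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests: {p_any:.0%} chance of at least one false positive")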

How to Reach Significance Faster

More traffic

The biggest lever. Doubling sample size shrinks standard error by ~30% (see the sketch after this list). Use the MDE calculator to plan how much traffic you need.

Lower confidence threshold

Going from 99% to 95% — or 95% to 90% — reduces the bar for significance. Trade off: more false positives.

One-tailed test

If you only care about improvement, switch to one-tailed. Reduces sample size requirements by ~20% but loses the ability to detect harm.

Bigger effect

Subtle changes need huge samples. If your test is genuinely meant to be a small tweak, accept that you may never reach significance — and decide on a different basis.
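
Here is a small sketch of why traffic is the biggest lever. It uses the standard error of a single conversion rate at an assumed 2% baseline purely for illustration; the pooled standard error in the z-test shrinks with sample size the same way.

from math import sqrt

# Standard error of a conversion rate at an assumed 2% baseline.
# SE scales as 1/sqrt(n), so each doubling of n multiplies SE by
# 1/sqrt(2) ≈ 0.71, roughly a 30% reduction.
p = 0.02
for n in (5_000, 10_000, 20_000, 40_000):
    se = sqrt(p * (1 - p) / n)
    print(f"n = {n:>6}: SE = {se:.5f}")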

Statistical Significance vs Sample Size

They're tightly connected — but they answer different questions. Use both together when planning and analyzing tests.

This calculator (post-test)

You ran the test. Now check whether the observed difference is statistically significant. This calculator answers: given my data, can I trust the result?

MDE Calculator (pre-test)

Before you launch, what's the smallest effect your traffic can detect? This is the planning side. If your MDE is bigger than the effect you expect, the test is doomed before it starts — find more traffic or skip the test.
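
As a rough illustration of the planning side, here is one common textbook approximation for the minimum detectable absolute lift. It assumes equal group sizes, 95% confidence (two-tailed), and 80% power, and is not necessarily the exact method the MDE Calculator uses.

from math import sqrt
from scipy.stats import norm

def minimum_detectable_effect(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # ≈ 1.96 for 95% confidence, two-tailed
    z_beta = norm.ppf(power)            # ≈ 0.84 for 80% power
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_beta) * se      # smallest absolute lift you can reliably detect

# Example: 2% baseline rate, 5,000 visitors per group
mde = minimum_detectable_effect(0.02, 5_000)
print(f"Minimum detectable absolute lift: {mde:.4f} ({mde / 0.02:.0%} relative)")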

Significance for Subscription Businesses

Subscription funnels span very different traffic volumes, so the realistic path to significance looks different at each stage. Use this calculator with the right expectations:

High-traffic top of funnel

Sign-up rate, landing page conversion. Plenty of volume — significance achievable in 1-3 weeks for 5%+ relative lifts.

Trial-to-paid, free-to-paid

Mid-funnel, lower volume. Plan for 4-8 weeks. If your variant is a small copy tweak, expect inconclusive results — that's normal.

Cancellation, payment recovery

Often too low-volume for traditional A/B testing. Consider before/after analysis or matched-group comparisons instead.

Pricing changes

High stakes. Use 99% confidence and run for at least one full billing cycle. Don't stop early even if you peek and see significance.

Frequently Asked Questions

What is statistical significance in A/B testing?

Statistical significance means the difference between your control and variant is larger than you'd expect from random chance alone. At a 95% confidence level, a result is significant when the p-value is below 0.05, meaning there's less than a 5% chance you'd see a difference this large if the variant had no real effect.

How do I calculate statistical significance for an A/B test?

Use a two-proportion z-test. Enter your control conversions, control visitors, variant conversions, and variant visitors into the calculator above. It computes the z-score, p-value, and confidence interval, and tells you whether the result is significant at your chosen confidence level.

What does a p-value mean?

The p-value is the probability that you'd see a difference at least as large as yours if the variant had no real effect. A p-value of 0.03 means that if the variant did nothing, a difference this large would show up only 3% of the time. The lower the p-value, the stronger the evidence that the difference is real.

What confidence level should I use for A/B testing?

95% is the standard. For high-stakes tests like pricing changes, use 99%. For early exploration where you want fast directional reads, 90% is acceptable. Higher confidence reduces false positives but requires more traffic.

Should I use a one-tailed or two-tailed test?

Use two-tailed by default. It detects changes in either direction, so you'll catch a variant that hurts as well as one that helps. One-tailed only detects improvement and needs less data, but you won't see negative effects — only use it when you're certain the variant can't do harm.

Why are my A/B test results not statistically significant?

Three common reasons: not enough traffic, the effect is smaller than expected, or your test ran for too short a time. Use the MDE Calculator to check what effect size you can actually detect with your traffic. Often the test isn't broken — your variant just isn't different enough to show up against statistical noise.

How long should I run an A/B test for significance?

Long enough to reach the sample size needed for your expected effect. Stopping early — even when p < 0.05 — inflates false positives because of repeated peeking. Pre-commit to a sample size or test duration before launch, and don't call results until you hit it.
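
If you want to see why peeking inflates false positives, here is a small simulation sketch. The parameters (a 2% rate with no real effect in either arm, a peek after every 1,000 visitors per arm, 20 peeks) are illustrative assumptions, and it requires NumPy and SciPy.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
rate, peeks, batch, runs = 0.02, 20, 1_000, 2_000   # illustrative assumptions
stopped_early = 0
for _ in range(runs):
    conv_c = conv_v = n = 0
    for _ in range(peeks):
        conv_c += rng.binomial(batch, rate)          # control arm, no real effect
        conv_v += rng.binomial(batch, rate)          # variant arm, same true rate
        n += batch
        p_bar = (conv_c + conv_v) / (2 * n)
        se = np.sqrt(p_bar * (1 - p_bar) * 2 / n)
        z = (conv_v / n - conv_c / n) / se
        if 2 * norm.sf(abs(z)) < 0.05:               # "significant" at this peek
            stopped_early += 1
            break
print(f"False positive rate with peeking: {stopped_early / runs:.0%}")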

What is the difference between statistical significance and practical significance?

Statistical significance means the difference is real. Practical significance means the difference is big enough to matter. With huge sample sizes, you can detect a 0.1% lift as statistically significant — but if it costs $50K to ship the change, that lift may not pay for itself. Always ask: significant AND material?

Need Help Designing Better Tests?

Get personalized guidance on experimentation strategy, test prioritization, and interpreting results for your subscription business.
