What does "statistical significance" mean for AI experiments?

Running statistical significance on an AI experiment the same way you'd run it on a button color test is one of the most common ways teams end up drawing confident conclusions from noise.

The core problem is structural: traditional A/B testing assumes a fixed treatment, but AI models are probabilistic by design. Every LLM call is a sample from a distribution, not a deterministic output. That adds a second layer of variance on top of normal user behavior — and the statistical frameworks most teams use were never built to handle two noise sources at once.

This article is for engineers, PMs, and data teams who are running or planning AI experiments and want to understand why standard testing instincts break down. Whether you're testing a new prompt configuration, a different model version, or an LLM-powered feature, the same core issues apply. Here's what you'll learn:

  • Why AI outputs are noisier than traditional feature tests — and what that means for sample sizes
  • How to choose metrics that actually reflect user outcomes, not just model quality
  • The validity traps that silently corrupt results: p-hacking, peeking, and multiple metrics
  • Why statistical significance alone isn't enough, and how to define practical significance before your experiment runs

The article moves in order from the statistical foundations to the practical decisions you make when designing and interpreting AI experiments. Each section builds on the last, so if you're new to significance testing in this context, reading straight through will give you the clearest picture.

What statistical significance actually means (and why AI experiments make it harder)

If you've run A/B tests before, you've encountered the phrase "statistically significant" — probably as the threshold between "we can ship this" and "we need more data." But the concept is frequently misapplied even in traditional testing environments, and AI experiments introduce a structural complication that makes misapplication far more costly.

Before getting into what makes AI experiments different, it's worth being precise about what statistical significance actually means.

The core definition: p-values, null hypotheses, and the 0.05 threshold

Statistical significance is a measure used in hypothesis testing to determine whether observed results are unlikely to have occurred due to chance alone. The mechanism is the p-value, a number that answers a specific question: if there were truly no difference between your control and treatment, how likely would it be to see a difference at least this large just by chance? A p-value of 0.05 means there's a 5% chance you'd see a result at least this extreme if nothing real were happening. It is not the probability that the result itself is a fluke.

When the p-value falls below a predetermined threshold — conventionally 0.05 — you reject the null hypothesis, which is simply the assumption that there is no real effect between your control and treatment groups. That threshold acts as the gatekeeper between "this might be noise" and "this is probably real."
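To make the definition concrete, here is a minimal sketch of a two-sided p-value for a difference in conversion rates, using a pooled two-proportion z-test. The function name and the traffic numbers are illustrative assumptions, not figures from any real experiment, and production platforms typically use more refined methods.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms share one true rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # P(|Z| >= |z|) for a standard normal equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 5.0% vs 5.6% conversion with 10,000 users per arm.
p = two_proportion_p_value(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"p-value: {p:.4f}, significant at 0.05: {p < 0.05}")
```

With these made-up counts, a 12% relative lift still lands slightly above the 0.05 threshold, which is a reminder that a visible difference and a significant difference are not the same thing.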

Two things are worth internalizing here. First, the 0.05 threshold is a convention, not a law of nature. It's a reasonable default that the field has coalesced around, but it carries no special mathematical authority.

Second, and this matters enormously for AI experiments, statistical significance only tells you whether an effect is likely to be real. It says nothing about whether that effect matters. As one practitioner put it on Hacker News: "Significance testing only tells you the probability that the measured difference is a 'good measurement'... Whether the measured difference is significant in the sense of 'meaningful' is a value judgement." A result can be statistically significant and practically irrelevant. Keeping those two questions separate is essential.

What traditional A/B tests assume about determinism

Standard A/B testing was designed around a specific noise structure. Once a user is assigned to a variation, their experience is fixed. A button is blue or it isn't. A headline reads one string or another. The treatment is deterministic — it doesn't change between exposures, across users, or over time.

This means the only meaningful source of variance in a traditional experiment is user behavior. People click or they don't. They convert or they don't. That behavioral variance is real and substantial, but it's bounded and well-understood.

The statistical framework — power analysis, minimum detectable effects, sample size calculations — was built to handle exactly this structure: one noise source, controlled through randomization and sufficient sample size.

How AI outputs break the determinism assumption

AI models, particularly large language models, are probabilistic by design. Temperature settings, sampling parameters, and model versioning mean that the same prompt from the same user in the same context can produce meaningfully different outputs across calls. The treatment is not fixed. It varies.

This introduces a second, independent source of variance that traditional A/B testing frameworks were never designed to accommodate. In a standard experiment, variance comes from users. In an AI experiment, variance comes from users and from the model's outputs. These two sources compound.

When you're measuring whether a new LLM configuration improves task completion rates, you're trying to detect a signal through two layers of noise simultaneously — the natural variation in how different users behave, and the natural variation in what the model produces for any given user.

This isn't a minor technical footnote. It's a structural difference that changes the statistical bar before you've written a single line of experiment configuration.

What compounded variance means for sample sizes

Variance and sample size are directly linked in hypothesis testing. When variance increases, you need more data to detect the same effect at the same confidence level. Another way to say the same thing: if your sample size is fixed, higher variance means only larger effects will show up as significant — smaller real improvements become invisible to the test.

That's what minimum detectable effect (MDE) means in practice: the smallest true difference between control and treatment that your experiment can reliably surface given your significance threshold and statistical power.
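The link between variance and sample size can be sketched with the standard two-sample formula for a difference in means, n = 2(z_alpha + z_power)^2 * variance / MDE^2 per arm. The variance figures below are hypothetical, and treating model-output variance as independent of and additive to user variance is a simplifying assumption made only for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(variance, mde, alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute difference of `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * variance / mde ** 2)

user_variance = 1.0    # hypothetical: variance from user behavior alone
output_variance = 0.5  # hypothetical: extra variance from stochastic model outputs
mde = 0.1              # smallest absolute effect we want to detect

n_traditional = sample_size_per_arm(user_variance, mde)
n_ai = sample_size_per_arm(user_variance + output_variance, mde)
print(f"one noise source:  {n_traditional} users per arm")
print(f"two noise sources: {n_ai} users per arm")
```

Because required sample size scales linearly with variance, a calculator fed only the user-behavior variance will understate the traffic needed by the same proportion that the model adds.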

Teams that size AI experiments using the same calculators they use for button color tests will systematically underpower their studies. The calculator assumes one noise source. The experiment has two. The result is experiments that run to completion, show no significant effect, and get interpreted as "the AI feature doesn't work" — when the real problem is that the study was never adequately powered to detect the effect in the first place.

CUPED and post-stratification exist precisely to address this: by controlling for pre-experiment variance, they effectively shrink the noise floor and let experiments reach significance faster without requiring proportionally larger sample sizes.

The practical implication is straightforward: AI experiments require either larger samples, longer runtimes, tighter metric definitions, or variance reduction techniques — and ideally some combination of all four. Teams that don't account for this upfront will spend cycles on underpowered experiments and draw conclusions from noise.

Why AI outputs are noisier than traditional feature tests

When you run an A/B test on a button color, every user in the treatment group sees the same button. The only variance in your results comes from differences in how users respond. That's one source of noise, and it's the source that classical statistical frameworks were built to handle.

AI experiments introduce a second source of noise that sits upstream of the user entirely — the model output itself — and when these two sources compound, the sample sizes and run times that feel adequate for traditional tests become systematically insufficient.

Output stochasticity: the noise layer that doesn't exist in button tests

In a deterministic feature test, "treatment" means something precise: every user in the treatment group received the same intervention. In an LLM-powered feature, treatment group members don't share a common experience — they share a distribution of experiences. The model produces different outputs for different users, and often different outputs for the same user on repeated queries.

This means you're not measuring the effect of a feature; you're measuring the average effect of a range of outputs, some of which may be helpful, some neutral, some actively counterproductive. That output-level variance sits on top of normal user-level behavioral variance, and the two sources add to each other rather than cancel out.

Delayed user outcomes and the learning curve problem

The METR study on AI-assisted developer productivity illustrates this failure mode with unusual clarity. In a controlled within-subject design — where the same 16 experienced open-source developers alternated between AI-assisted and non-AI-assisted tasks — developers perceived a 20–24% speedup from using Cursor. Controlled measurements showed actual slowdowns. Three-quarters of participants saw reduced performance; only one quarter improved.

Notably, the one developer with more than 50 hours of prior Cursor experience was among the top performers, suggesting that AI tool benefits, if they exist, accrue over time as users climb a steep learning curve. This creates a specific measurement problem: the costs of an AI feature (slower task completion during the adaptation phase) are immediate and show up cleanly in your experiment window, while the benefits may only materialize weeks later as users internalize new workflows.

As Simon Willison noted in discussing the study, "getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect." A short experiment window captures the noisiest, least representative phase of user behavior — which is exactly when most teams are measuring.

Metric attribution lag: when the outcome you care about happens later

Even setting aside the learning curve, there's a separate problem with causal chain length. A coding assistant might improve output quality in ways that only surface in code review rates or production bug counts weeks after the experiment closes. A customer support AI might reduce churn in ways that take a full billing cycle to register.

This isn't the same as the learning curve problem — it's about the distance between the AI's output and the business outcome you actually care about. The longer that causal chain, the more opportunity there is for confounding variables to enter the measurement, and the less your experiment window reflects the true steady-state effect.

How these three sources interact to inflate required sample sizes

Each of these noise sources — output stochasticity, user adaptation lag, and metric attribution lag — independently increases within-group variance. When they operate simultaneously, the effect on required sample sizes is compounding. Higher variance raises the MDE threshold, which means you need proportionally larger samples to achieve the same statistical power at the same confidence level.

A sample size calculation done using traditional A/B test variance assumptions will underpower an AI experiment, often by a significant margin.

The practical corrective is to treat variance reduction as a first-class concern in AI experiment design. Techniques like CUPED — which controls for pre-experiment covariates like prior tool experience, exactly the confound the METR study identified — and post-stratification can meaningfully reduce the sample sizes required to reach significance. That's not a complete solution to AI-specific noise, but it's a concrete lever that teams can pull while they work on the harder problems of metric selection and experiment duration.

Eval scores measure the model; these metrics measure whether users are better off

Accounting for AI-specific variance gets you to a valid experiment. Failing to choose metrics that actually reflect user outcomes means you're measuring the wrong thing with statistical rigor, which is its own kind of failure. Metric selection is the single highest-leverage decision you make in an AI experiment.

Get it wrong and you'll either miss a genuinely effective feature or declare a win on something that doesn't move the needle for users. This problem is more acute in AI experiments than in traditional A/B tests precisely because the outputs are probabilistic and the outcomes you care about are often delayed — which means the gap between what you can easily measure and what you should measure is unusually wide.

The hierarchy: downstream behavioral outcomes over model scores

The temptation in AI experiments is to reach for metrics that are close to the model: eval scores, BLEU scores, accuracy benchmarks, response quality ratings. These are measurable, fast to compute, and feel rigorous. The problem is they answer the wrong question. Evals tell you whether the model is better. A/B tests should tell you whether users are better off — and those two things frequently diverge.

The right hierarchy puts downstream behavioral metrics at the top: retention rate, task completion rate, session depth, revenue per user. These are the outcomes that reflect whether the AI feature is actually delivering value. Intermediate model-quality scores can serve as guardrails — useful for catching regressions — but they shouldn't be your primary success metric.

Landon Smith, Head of Post-Training at Character.AI, put it directly: the goal is to compare modeling techniques "from the perspective of our users," guiding research in the direction that best serves the product, not the benchmark.

The North Star Metric framework is a useful anchor here. A well-chosen NSM reflects the core value delivered to users, predicts long-term revenue, and is actionable by the team. Spotify's "time spent listening" is the canonical example — it captures engagement in a way that model-quality scores never could.

Where AI metric design goes wrong

Three failure modes show up repeatedly in AI experiments. The first is measuring model outputs instead of user outcomes — the "vibe check" problem. A model that scores higher on internal evals may not improve task completion at all, and treating eval improvement as a proxy for user value is where many teams go wrong.

The second failure mode is choosing metrics with very low base rates. A metric that converts at 0.5% requires an enormous sample to detect a meaningful effect. The Minimum Detectable Effect (MDE) framework makes this concrete: if the expected effect size is smaller than the MDE given your sample size and significance threshold, the test cannot surface a real difference even if one exists. Low-baseline proxies don't just require more data — they often make experiments practically infeasible.
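The base-rate problem can be illustrated with a rough MDE approximation for a conversion metric: MDE is about (z_alpha + z_power) * sqrt(2p(1 - p) / n), expressed here relative to the baseline p. The baselines and the 10,000-users-per-arm figure are assumed values chosen for comparison, and the normal approximation is itself a simplification.

```python
import math
from statistics import NormalDist

def relative_mde(baseline_rate, users_per_arm, alpha=0.05, power=0.8):
    """Approximate smallest detectable lift, as a fraction of the baseline rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two arms at the baseline rate.
    std_err = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_arm)
    return (z_alpha + z_power) * std_err / baseline_rate

low_base = relative_mde(baseline_rate=0.005, users_per_arm=10_000)
high_base = relative_mde(baseline_rate=0.20, users_per_arm=10_000)
print(f"0.5% baseline: detectable relative lift is about {low_base:.0%}")
print(f"20% baseline:  detectable relative lift is about {high_base:.0%}")
```

Under these assumptions the low-baseline metric can only surface lifts of roughly half the baseline or more, while the higher-baseline metric can detect single-digit relative changes with the same traffic.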

The third failure mode is metric proliferation. Adding model-quality metrics alongside behavioral metrics feels thorough, but it inflates your false positive rate in ways that compound quickly. With five independent metrics evaluated at a 0.05 significance threshold, the probability of at least one false positive reaches roughly 23%. With two metrics, it's still about 10%.

The practitioner community has flagged this as one of the most common misuses of significance testing in industry — and AI experiments, with their natural temptation to track both model and user metrics simultaneously, are especially vulnerable.

Variance reduction as a practical lever

Even with the right metrics, AI experiments tend to have high within-group variance — which means longer runtimes and larger required sample sizes. Variance reduction techniques are the practical response to this problem. They don't fix bad metric selection, but they make good metric selection more efficient.

CUPED works by using data you already have about users before the experiment starts — things like their prior usage frequency, historical purchase rate, or tenure — to filter out variance that has nothing to do with the treatment. If heavy users always convert at higher rates regardless of what you test, CUPED controls for that, leaving a cleaner signal. Post-stratification achieves a related reduction through stratified analysis after the fact.
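A minimal sketch of the CUPED adjustment follows, assuming a single pre-experiment covariate per user. Each user's metric is shifted by theta * (covariate - mean covariate), where theta = cov(covariate, metric) / var(covariate). The function name and the simulated data are hypothetical, and this is not a description of how any particular platform implements it.

```python
import random
from statistics import fmean, variance

def cuped_adjust(metric, covariate):
    """Return the metric with covariate-explained variance removed."""
    mean_x = fmean(covariate)
    mean_y = fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(covariate, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(covariate)
    # Centering the covariate leaves the overall metric mean unchanged.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]

# Simulated users: in-experiment activity depends heavily on prior activity.
random.seed(7)
pre_period = [random.gauss(10, 3) for _ in range(5_000)]
in_experiment = [0.8 * x + random.gauss(0, 1.5) for x in pre_period]

adjusted = cuped_adjust(in_experiment, pre_period)
print(f"raw variance:      {variance(in_experiment):.2f}")
print(f"adjusted variance: {variance(adjusted):.2f}")
```

In a real experiment theta would normally be estimated on data pooled across both arms, and the gain depends entirely on how strongly the pre-experiment covariate correlates with the metric.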

GrowthBook's experimentation platform includes both natively, and post-stratification has been shown to run experiments approximately 20% faster — a meaningful gain when you're waiting on downstream behavioral metrics that take days or weeks to accumulate.

Sequential testing is the third tool worth understanding here. When downstream behavioral metrics take time to mature, teams face pressure to peek at results early. Sequential testing methods allow continuous monitoring without inflating false positive rates — which is particularly relevant when your primary metric is something like 30-day retention rather than an immediate click.

The underlying principle across all three techniques is the same: reduce the noise in your measurement so the signal from a real effect can emerge faster. None of them substitute for choosing a metric that reflects actual user outcomes in the first place.

P-hacking, peeking, and multiple metrics: the validity traps that undermine AI experiment results

Even when you've chosen the right metrics and accounted for AI-specific noise, your experiment results can still be meaningless — not because of bad data, but because of how you ran the test. The validity threats in this section aren't exotic edge cases.

They're the default behavior of fast-moving product teams under pressure to ship, and they're especially dangerous in AI experiments where iteration cycles are fast and the temptation to keep tweaking until something works is structurally baked into the workflow.

Every metric you add compounds your false positive rate

Every time you add a metric to an experiment and evaluate it at a 0.05 significance threshold, you're not just measuring one more thing — you're compounding your family-wise error rate. GrowthBook's documentation puts this plainly: if you test the same hypothesis across 20 different metrics at the 5% significance level, the probability of finding at least one statistically significant result by chance alone climbs to around 64%. That's not a failure of execution. That's arithmetic.

The five-metric version of this problem is less dramatic but more common. Testing five independent metrics at α = 0.05 yields a false positive probability of roughly 22.6%. The math: each metric has a 95% chance of not producing a false positive (that's what 0.05 significance means). For five independent metrics, the chance that all five avoid a false positive is 0.95 × 0.95 × 0.95 × 0.95 × 0.95, or about 77%. Which means the chance that at least one produces a false positive is the remaining 23%.
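The same arithmetic generalizes to any number of metrics. This short sketch assumes the metrics are fully independent, which is the simplification the calculation above relies on.

```python
def family_wise_error_rate(num_metrics, alpha=0.05):
    """Probability of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics

for k in (1, 2, 5, 10, 20):
    print(f"{k:>2} metrics: {family_wise_error_rate(k):.1%}")
```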

And that calculation assumes the metrics are independent, which they almost never are. Page views correlate with funnel starts. Session length correlates with engagement scores. When metrics move together, the independence formula becomes an approximation rather than an exact figure.

The practical implication: teams that track ten guardrail metrics plus three primary metrics without applying any correction aren't running a rigorous experiment. They're running a lottery.

P-hacking in AI prompt and model iteration

AI experiments create a specific p-hacking loop that traditional A/B testing rarely encounters. When you're testing a button color, there's one button. When you're testing a prompt configuration, there are hundreds of plausible variants — different phrasings, different context windows, different temperature settings — and the natural workflow is to iterate until something looks good, then declare a winner.

GrowthBook's documentation names the underlying fallacy directly: the Texas Sharpshooter Fallacy, where a marksman shoots at a barn and then paints the target around the bullet holes. Testing many prompt variants and retroactively identifying the "significant" one is structurally identical. The target was drawn after the shooting.

What makes this insidious is that it doesn't require bad intent. GrowthBook's documentation notes that p-hacking occurs "either consciously or unconsciously" — analysts exploring different subgroups or time windows until something surfaces. In AI development, where prompt iteration is a legitimate engineering practice, the line between exploration and data dredging is easy to cross without noticing.

The fix is procedural: specify the exact prompt or model configuration being tested before running the experiment, not after reviewing preliminary results.

The peeking problem and sequential testing

Peeking — checking results before the pre-specified sample size is reached and stopping early when significance appears — is one of the most common ways experiments get invalidated. It inflates Type I error rates because the probability of crossing the significance threshold at least once during an experiment is higher than the probability of crossing it at the planned endpoint. Stop early enough times on enough experiments, and your 5% false positive rate quietly becomes something much larger.
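The inflation is easy to see in simulation. The sketch below runs A/A experiments (no true difference) on simulated normally distributed data with known variance, then compares one look at the planned endpoint against ten evenly spaced looks that stop at the first significant result. All parameters are illustrative assumptions.

```python
import math
import random

def false_positive_rate(num_looks, users_per_arm=500, simulations=1_000,
                        z_crit=1.96, seed=0):
    """Share of A/A experiments declared significant at any of the looks."""
    rng = random.Random(seed)
    look_points = {users_per_arm * (i + 1) // num_looks for i in range(num_looks)}
    hits = 0
    for _ in range(simulations):
        sum_a = sum_b = 0.0
        crossed = False
        for n in range(1, users_per_arm + 1):
            sum_a += rng.gauss(0, 1)  # control: no real effect
            sum_b += rng.gauss(0, 1)  # treatment: identical distribution
            if n in look_points and not crossed:
                # z-test for a difference in means with unit variance per user.
                z = (sum_b - sum_a) / n / math.sqrt(2 / n)
                crossed = abs(z) > z_crit
        hits += crossed
    return hits / simulations

single_look = false_positive_rate(num_looks=1)
ten_looks = false_positive_rate(num_looks=10)
print(f"one look at the end: {single_look:.1%} false positives")
print(f"ten interim looks:   {ten_looks:.1%} false positives")
```

Because both runs use the same seed and the same simulated users, every false positive found at the final look is also found by the peeking strategy, which then adds its own on top.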

AI experiments are particularly vulnerable here because delayed outcome metrics create pressure to look early. If your primary metric is 30-day retention, waiting for full data collection feels costly. The temptation to check at day 10 and ship if things look promising is real.

Sequential testing is the technical solution. It allows valid inference at any interim look by adjusting the significance threshold dynamically — so you can check results mid-experiment without inflating your error rate. Within GrowthBook, sequential testing is available as part of the core experiment workflow, which means teams don't need to implement the math themselves.

Correction methods and the A/A baseline

For teams running multiple metrics, two correction approaches are worth knowing. Bonferroni correction divides the significance threshold by the number of tests (α/n), making it the most conservative option — appropriate when any false positive is costly.

Benjamini-Hochberg takes a different approach: instead of asking "what's the chance of any false positive at all?" (which Bonferroni answers), it asks "among the results I'm calling significant, what fraction are probably false positives?" That's a more useful question when you're tracking many metrics — it's less likely to dismiss real effects as noise, at the cost of accepting that some small fraction of your "significant" results may not be real. GrowthBook's experimentation platform includes both correction methods natively.
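Both corrections are short enough to sketch directly. The p-values below are invented for illustration, and the functions are a plain reading of the two textbook procedures rather than any platform's implementation.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    whose p-value is at or below (rank / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= max_rank
    return rejected

# Hypothetical p-values for five metrics in one experiment.
p_values = [0.001, 0.012, 0.021, 0.045, 0.30]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On this example Bonferroni keeps only the strongest result, while Benjamini-Hochberg keeps the three smallest p-values, which shows the trade-off between strictness and sensitivity described above.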

Before running any high-stakes AI experiment, an A/A test — where both variants receive identical treatment — serves as a calibration check on your measurement infrastructure. A healthy A/A test will occasionally surface a single metric as significant due to the inherent 5% false positive rate; that's expected.

What signals a broken setup is three or four metrics simultaneously showing 99%+ or sub-1% probability to win. That pattern indicates a tracking problem, a randomization failure, or a data pipeline issue — and it's far better to discover this before your experiment than after.

The broader context here matters: industry estimates suggest roughly one in three experiments produces a genuine improvement — though the exact figure varies by domain and team maturity. When your baseline success rate is that low, inflating false positive rates through multiple testing, peeking, or prompt-iteration p-hacking doesn't just distort individual results — it systematically corrupts your team's ability to learn what actually works.

Statistical significance is not enough: why practical significance matters more in AI tests

There are two questions at the heart of every experiment, and they are not the same question. The first: Is this result real? The second: Does this result matter? Statistical significance answers the first. It tells you whether the effect you observed is likely to be genuine rather than a product of random variation. It says nothing — nothing at all — about whether that effect is large enough, valuable enough, or meaningful enough to act on.

"Statistical significance helps establish whether a result is reliable, while practical significance helps determine whether it is worth acting on." These are sequential questions, not interchangeable ones. Passing the first does not mean passing the second.

This distinction is widely misunderstood. A Hacker News commenter who works in industry described it as "one of the most common fallacies I observe in industry and a lot of science" — the tendency to read "statistically significant" as synonymous with "notable" or "meaningful." It isn't.

And for AI experiments specifically, where shipping a new model version or prompt configuration carries real costs in inference, latency, maintenance burden, and regression risk, conflating the two questions is an expensive mistake.

The scale problem: when statistical significance becomes automatic

The mathematical relationship between sample size and p-values creates a trap that's easy to miss. As your user base grows, even vanishingly small effects produce very low p-values. At the volumes typical of AI product deployments — millions of sessions — almost any difference between variants will clear the p < 0.05 threshold.

One Hacker News commenter illustrated this directly: "This intervention causes an uplift in [metric] with p<0.001. High statistical significance! The uplift: 0.000001%. Meaningful? Probably not."

This isn't a hypothetical edge case. It's the operating reality for teams running AI experiments at scale. The probabilistic output layer of AI systems means there are always some measurable differences between variants. The question is never whether a difference exists — at sufficient scale, you will always find one. The question is whether the difference is large enough to justify the cost and risk of shipping.

Consider the NN/g example: Design A has an 85.0% checkout completion rate; Design B has 85.2%. A test returns p < 0.05 — statistically significant. But a 0.2 percentage point difference is almost certainly not worth acting on. The result is real. It is not worth shipping for.
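To see how scale alone drives this, the sketch below holds the same 85.0% versus 85.2% completion rates fixed and varies only the number of users per arm, using a pooled two-proportion z-test. The sample sizes are assumptions chosen for illustration and are not taken from the NN/g example.

```python
import math

def two_proportion_p_value(rate_a, rate_b, users_per_arm):
    """Two-sided p-value for two observed rates with equal-sized arms."""
    pooled = (rate_a + rate_b) / 2
    std_err = math.sqrt(2 * pooled * (1 - pooled) / users_per_arm)
    z = (rate_b - rate_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

p_by_size = {
    n: two_proportion_p_value(0.850, 0.852, n)
    for n in (1_000, 100_000, 1_000_000)
}
for n, p in p_by_size.items():
    print(f"{n:>9,} users per arm: p = {p:.4f}")
```

The effect size never changes across the three rows. Only the p-value does, which is why the p-value on its own cannot tell you whether a difference is worth shipping.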

Define your MDE before the experiment runs, not after

The practical solution is to define your minimum detectable effect — the smallest improvement that would actually change a business decision — before the experiment launches. Not after you see the results. After you see the results, you're rationalizing.

GrowthBook's documentation defines MDE as the minimum difference between control and treatment that can be detected given a specific significance threshold, statistical power, and sample size. The critical discipline is twofold: size the experiment so that its MDE is no larger than the smallest effect you would act on, and then hold your results to that bar. If a 0.2% improvement in task completion wouldn't change whether you ship the feature, don't declare victory when you detect it.

Define the threshold that would change the decision, and treat anything below it as a null result regardless of p-value.

For AI features specifically, this bar needs to reflect the full cost of shipping. A new model version isn't free — it carries inference cost, latency implications, and the ongoing maintenance burden of supporting a more complex system. The practical significance threshold should be calibrated against those costs, not just the upside metric.

If a 1% improvement in retention doesn't cover the cost of running a heavier model, then 1% is not a meaningful win. Define that threshold upfront, write it into your experiment plan, and hold to it when results come in.
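One way to hold to a pre-registered threshold is to encode the decision rule before any data arrives. This is a minimal sketch: the class, the function, and the 2% threshold are hypothetical, and a fuller version would likely compare a confidence interval against the threshold rather than a point estimate.

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    observed_lift: float  # relative improvement in the primary metric
    p_value: float

def ship_decision(result, min_meaningful_lift, alpha=0.05):
    """Apply both questions in order: is it real, then is it big enough?"""
    if result.p_value >= alpha:
        return "no detectable effect"
    if result.observed_lift < min_meaningful_lift:
        return "real but too small to justify shipping"
    return "ship"

# Threshold written into the experiment plan before launch (assumed value).
MIN_MEANINGFUL_LIFT = 0.02

print(ship_decision(ExperimentResult(0.010, 0.001), MIN_MEANINGFUL_LIFT))
print(ship_decision(ExperimentResult(0.035, 0.020), MIN_MEANINGFUL_LIFT))
print(ship_decision(ExperimentResult(0.035, 0.300), MIN_MEANINGFUL_LIFT))
```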

Experiment platforms that build MDE into experiment design and track cumulative impact across experiments do so precisely because individual p-values don't tell the full story. A single experiment might show a statistically significant but tiny effect; the question of whether that effect compounds into meaningful business outcomes requires a different lens entirely.

The mindset shift is this: statistical significance is a filter for noise, not a mandate to ship. It tells you the signal is real. You still have to decide if the signal is worth acting on — and that decision should be made before you ever look at the data.

Where AI experiments break down: setup, execution, and interpretation

The through-line of this article is simple, even if the mechanics aren't: AI experiments fail not because teams are careless, but because they apply frameworks built for one noise source to problems with two. Output stochasticity, user adaptation lag, and metric attribution lag compound in ways that make traditional sample size intuitions unreliable, traditional metric choices misleading, and traditional result interpretation actively dangerous.

The fix isn't exotic — it's disciplined setup before the experiment runs and honest interpretation after it closes.

Lock in your metric and MDE before a single user is bucketed

The most consequential decisions in an AI experiment happen before a single user is bucketed. Lock in your primary metric — a downstream behavioral outcome, not an eval score — and define the minimum effect size that would actually change your shipping decision, factoring in the real costs of the new model or configuration.

If you can't articulate what "meaningful" looks like before you see data, you're not running an experiment; you're running a search for confirmation.

Mid-experiment discipline: peeking, metric proliferation, and variance control

Resist the pull to check results early, especially when your primary metric takes weeks to mature. If you need to monitor mid-experiment, use sequential testing — it's built for this. Keep your metric count tight, apply Bonferroni or Benjamini-Hochberg correction if you're tracking multiple outcomes, and treat guardrail metrics as guardrails, not additional success criteria.

GrowthBook ships sequential testing and both correction methods natively, so the infrastructure isn't the barrier — discipline is.

A significant result is not a mandate to ship

A p-value below 0.05 tells you the effect is probably real. It does not tell you the effect is worth acting on. Ask both questions in sequence: Is this result real? And is it large enough to justify the cost of shipping? If the detected effect sits below the MDE threshold you defined before the experiment, treat it as a null result regardless of what the p-value says. That threshold exists precisely so you don't rationalize small wins into shipping decisions.

The honest truth is that most AI experiments — like most experiments generally — won't produce a clear win. That's not failure. That's how you build a reliable signal over time instead of a catalog of confident mistakes. This article was written to give you a clear picture of where AI experimentation actually breaks down, and what to do about it. If even one section changes how you design your next experiment, it's done its job.

The highest-leverage move depends on where you are in the process

Where are you in the process?

About to launch an AI experiment: Write down your primary metric and your MDE threshold before you configure anything else. If you cannot articulate what "meaningful" looks like before seeing data, stop and define it now. Those two decisions will determine whether your results are interpretable.

Already ran an experiment and unsure whether to trust results: Check two things: (1) Did you define your success criteria before or after you saw the data? (2) Did you apply any correction for multiple metrics? If the answer to either is "after" or "no," your result may not be trustworthy regardless of the p-value.

Earlier in the process, not yet running experiments: Run an A/A test on your measurement infrastructure before anything else. A broken tracking setup invalidates every experiment downstream — and it's far cheaper to find that out now than after you've drawn conclusions from corrupted data.
