Why traditional A/B testing breaks down for AI products

Jun 8, 2026

Standard A/B testing breaks when you apply it to AI features — not because the tools are bad, but because the core assumptions no longer hold.

A traditional experiment works because every user in variant B gets the same experience. An LLM doesn't work that way. The same prompt can produce a different response on every call, "better" has no single definition when a model can be accurate but unhelpful, and a prompt change that took 30 seconds to write can spawn dozens of variants that your statistical framework was never designed to handle.

This article is for engineers, PMs, and data teams who are shipping AI features and running into the limits of their existing experimentation setup. If you've ever looked at an inconclusive experiment result and wondered whether the model was the problem or the test design was, this is written for you. Here's what you'll learn:

Why non-deterministic outputs break the stable-treatment assumption that A/B testing's statistical machinery depends on
Why defining a success metric for AI quality is a product design problem, not a measurement problem
Why the near-zero cost of creating AI variants makes the classic two-arm test model statistically dangerous
Why AI products need to instrument both model performance and user behavior — and why those two layers don't automatically correlate
When to use evals, when to use A/B tests, and why you need both in sequence

The article moves through each of these problems in order, from the most foundational (what a "variant" even means for an AI product) to the most practical (how to structure a deployment pipeline that uses evals and A/B tests together). Each section is self-contained, so if one of these problems is more urgent for your team right now, you can jump straight to it.

AI outputs are non-deterministic, which breaks the core assumption of A/B testing

There's an assumption so foundational to A/B testing that most practitioners never bother to state it explicitly: the treatment is stable. Every user assigned to variant B receives the same experience. The code path is identical, the UI change is identical, the copy is identical.

That consistency is what makes the comparison valid — you're measuring the effect of a specific, repeatable intervention, not a distribution of possible interventions.

AI-powered features break this assumption at the infrastructure level. And if you're applying standard experimentation infrastructure to an AI feature without recognizing this, you're not running a clean experiment. You're measuring a moving target and calling it a controlled test.

What A/B testing's statistical machinery actually requires

The entire statistical framework behind A/B testing — power analysis, minimum detectable effect calculations, Type I and Type II error rates — is built on the premise that the treatment is consistent across all users assigned to it.

When you calculate sample size requirements or set significance thresholds, you're implicitly assuming that "variant B" means the same thing for user 1 as it does for user 10,000.

This assumption is so deeply embedded in standard testing infrastructure that it rarely surfaces in documentation. It doesn't need to, because for traditional software — a UI change, a pricing experiment, a copy variant — it's always true. The code path either renders the new button or it doesn't. There's no variance in the treatment itself, only variance in how users respond to it.

How LLMs break the stable-treatment assumption

Generative AI components don't work this way. Microsoft's platform documentation states it plainly: "AI outputs can vary between runs, even with identical inputs." The specific mechanisms are worth naming — temperature settings and sampling methods, minor variations in natural language processing, confidence scores that fluctuate within normal ranges, and context-dependent reasoning that can take different paths on different invocations.

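Even the parallelization of floating-point operations during inference introduces non-determinism. Floating-point addition is not associative, so when parallel workers sum partial results in whatever order they finish, the rounding differs from run to run, and those small differences can change which token is sampled next.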
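The underlying arithmetic effect is easy to see in isolation. This is a minimal Python sketch, not taken from any inference stack: the same three numbers give different sums depending only on how the additions are grouped.

```python
# Floating-point addition is not associative: grouping changes the rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

In a parallel reduction the grouping depends on which partial sums finish first, so the result can differ between otherwise identical runs.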

There's a genuine technical debate about whether this non-determinism is intrinsic to LLMs or an engineering choice. OpenAI has offered a seed parameter for more reproducible outputs; theoretically, quantized weights could preserve determinism. But practitioners in the Hacker News discussion on defeating LLM non-determinism converged on a practical conclusion: in real production systems, non-determinism is almost always present, and even where theoretical determinism is achievable, it doesn't solve the context-sensitivity problem.

The same input routed to the same "variant" can produce meaningfully different outputs on every invocation — not because of targeting or personalization, but because of inherent output variance. The variant isn't a fixed treatment. It's a distribution.

Non-determinism makes A/B test results uninterpretable over time

Non-determinism doesn't just add noise to your experiment — it can make results uninterpretable. If you can't reproduce the specific output a user received, you can't verify what you actually tested.

A negative result becomes ambiguous: was the treatment genuinely worse, or did output variance happen to produce poor responses for the treatment group by chance during your measurement window?

This problem compounds over time. If the model's output distribution shifts — due to a model version update, a change in upstream context, or even infrastructure-level changes in how inference is parallelized — results from last month may not be comparable to results this month.

One commenter in the HN non-determinism thread put it sharply: "You can't use 'correct' unit tests or evaluation sets to prove anything about inputs you haven't tested." That observation was made in the context of software testing generally, but it applies with particular force to A/B testing: if your treatment isn't stable, your test results don't accumulate into reliable knowledge.

What "stable variant" even means for an AI product

If a variant is a distribution of possible outputs rather than a fixed experience, the concept of a "variant" needs to be reframed. What you're actually specifying when you define an AI variant is a parameterized system — a prompt configuration, a model version, a temperature setting, a context window policy. The user experience that results from that configuration varies.

Microsoft's own guidance for testing AI features recommends "tolerance-based validation" — asserting that an output meets criteria within acceptable thresholds rather than matching a specific expected value. That's a reasonable adaptation for unit testing, but it doesn't resolve the AI A/B testing challenge.

You can validate that your AI variant produces outputs within an acceptable quality range; you can't guarantee that the distribution of outputs your treatment group received was consistent enough to support causal inference.
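Here is a rough sketch of a variant as a parameterized system with a tolerance-based check over many sampled outputs. Everything in it is an illustrative assumption: `VariantConfig`, `fake_generate` (a stand-in for a real model call), `pass_rate`, and the 80-word budget are hypothetical names and values, not part of any library.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VariantConfig:
    """What a 'variant' actually pins down: configuration, not output."""
    prompt_id: str
    model_version: str
    temperature: float


def fake_generate(config: VariantConfig, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call; output length varies per call."""
    n_words = 20 + int(rng.random() * 100 * config.temperature)
    return " ".join(["word"] * n_words)


def pass_rate(config: VariantConfig, scorer: Callable[[str], bool],
              n_samples: int, seed: int = 0) -> float:
    """Share of sampled outputs that meet the quality criterion."""
    rng = random.Random(seed)
    passed = sum(scorer(fake_generate(config, rng)) for _ in range(n_samples))
    return passed / n_samples


def within_length_budget(text: str) -> bool:
    # Illustrative criterion: responses must stay within 80 words.
    return len(text.split()) <= 80


variant_b = VariantConfig("support-v2", "model-x", temperature=0.7)
rate = pass_rate(variant_b, within_length_budget, n_samples=200)
meets_tolerance = rate >= 0.75  # tolerance threshold is a product decision
print(f"pass rate: {rate:.2f}, meets tolerance: {meets_tolerance}")
```

The check says something about the distribution the configuration produces. It says nothing about which individual outputs a given user actually saw, which is the gap described above.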

Teams running experiments on AI features need to internalize this reframing early. The question isn't "did variant B outperform variant A?" — it's "did the parameterized system we called variant B produce a distribution of outputs that, in aggregate, drove better outcomes than the distribution produced by variant A?"

That's a harder question, and it requires different instrumentation, different statistical approaches, and a more honest accounting of what you're actually measuring.

Output quality is subjective, making AI A/B testing success metrics hard to define

Non-deterministic outputs make the treatment unstable. But even if you could stabilize the treatment, you'd face a second, separate problem: there is no clean metric to measure the outcome.

Traditional A/B testing was designed around a clean measurement contract — something either happened or it didn't. A user converted, clicked, or churned. The metric is binary, the signal is unambiguous, and the statistical machinery works cleanly on top of it.

AI output quality doesn't fit that contract. A response can be accurate but poorly toned. It can be concise but incomplete. It can be safe but genuinely unhelpful. There is no single event to instrument, no binary outcome to count — and that's not a tooling problem. It's a structural mismatch between what AI quality means and what A/B testing infrastructure was built to measure.

Why binary metrics break down for AI

When you're testing a checkout flow, "better" has a defensible operational definition: more users completed the purchase. When you're testing an AI assistant's response style, "better" requires you to specify better along which dimension, for which users, in which context.

As the AI evaluation platform Getmaxim puts it, AI agent quality spans "accuracy, relevance, safety, latency, cost, and user satisfaction simultaneously," and, critically, "optimizing one dimension may inadvertently degrade another." A model tuned for conciseness may sacrifice completeness. A model tuned for engagement may produce responses that feel satisfying in the moment but erode trust over time.

Adding more metrics doesn't resolve this. It just surfaces the tradeoffs more explicitly. You still have to decide which dimensions matter most for your specific use case, how to weight them relative to each other, and what threshold constitutes a meaningful improvement. Those are product design decisions, not measurement decisions — and most A/B testing frameworks have no mechanism for encoding them.
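To make the weighting point concrete, here is a small sketch in which the same two variants rank differently depending on the weights a team chooses. The dimension names, scores, and weights are invented for illustration.

```python
def composite(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension quality scores (each in [0, 1])."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Hypothetical per-dimension scores for two prompt variants.
concise = {"accuracy": 0.90, "completeness": 0.60, "conciseness": 0.95}
thorough = {"accuracy": 0.90, "completeness": 0.95, "conciseness": 0.55}

# Two products weighting the same dimensions differently.
support_bot = {"accuracy": 0.5, "completeness": 0.1, "conciseness": 0.4}
research_tool = {"accuracy": 0.5, "completeness": 0.4, "conciseness": 0.1}

for name, weights in [("support_bot", support_bot), ("research_tool", research_tool)]:
    winner = "concise" if composite(concise, weights) > composite(thorough, weights) else "thorough"
    print(name, "prefers", winner)
```

Neither ranking is a measurement result. Both follow entirely from the weights, which is why choosing them is a product decision.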

The gap between naming a quality dimension and measuring it

"Helpfulness" sounds like a metric until you try to operationalize it. For a coding assistant, helpful might mean syntactically correct, idiomatic, and well-commented. For a customer support bot, helpful might mean resolving the issue without escalation. These aren't just different thresholds on the same scale — they're different constructs entirely. Quality criteria are use-case-specific in a way that resists standardization.

Context dependency compounds this. AI response quality isn't just a function of the prompt — it depends on conversation history, user attributes, and domain-specific requirements. A response that's perfectly calibrated for an expert user may be confusing to a novice.

Averaged metrics across a heterogeneous user population can obscure exactly the quality variation that matters most. GrowthBook's experimentation documentation makes this point directly: "experimentation results work on averages, and this can hide a lot of systemic biases." For AI quality measurement, that warning is especially sharp.
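A toy example of the averaging problem, with made-up numbers: variant B helps experts and hurts novices, yet the overall rate is identical to variant A.

```python
# (successes, users) per segment; all numbers are invented for illustration.
results = {
    "A": {"expert": (350, 500), "novice": (350, 500)},
    "B": {"expert": (425, 500), "novice": (275, 500)},
}


def overall_rate(segments: dict) -> float:
    successes = sum(s for s, _ in segments.values())
    users = sum(n for _, n in segments.values())
    return successes / users


def segment_rates(segments: dict) -> dict:
    return {name: s / n for name, (s, n) in segments.items()}


for variant, segments in results.items():
    print(variant, overall_rate(segments), segment_rates(segments))
```

An analysis that reports only the overall rate would call this experiment flat, even though the two segments moved by 15 points in opposite directions.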

The "vibe check" default and why it fails

In the absence of a rigorous quality metric, teams default to what's easy to instrument. Thumbs up/down ratings. Session length. Retry rate. Response acceptance rate. These proxies are seductive because they're real signals — users are doing something measurable — but they don't reliably track what you actually care about.

A model that produces longer, more verbose responses may increase session time without improving quality. A model that generates confident-sounding but subtly wrong answers may suppress retry rates while degrading user outcomes.

GrowthBook's AI testing playbook frames the goal explicitly as going "beyond vibe checks to maximize customer outcomes" — which implies that informal quality judgment is the naive default teams fall into, not an edge case. The same documentation warns that optimizing for short-term engagement metrics can "exploit user trust" at the expense of long-term retention.

That risk is amplified when the metric doesn't actually measure what you care about. Garbage in, garbage out applies to experimental design just as much as it applies to data pipelines.

Human evaluators and automated scorers — and their limits

When binary metrics fail, teams typically turn to one of two alternatives: human evaluation or automated scoring. Human raters can assess nuanced quality dimensions that no click event captures, but they're expensive, slow, and inconsistent — especially on subjective dimensions like tone or appropriateness.

Inter-rater agreement tends to degrade precisely on the dimensions that matter most. Automated scorers, including LLM-as-judge approaches, can scale but introduce their own systematic biases and may not align with actual user preferences in your specific context.
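One common way to quantify rater consistency is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python implementation for two raters, and the ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n ** 2
    if expected == 1:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Two hypothetical raters judging the tone of the same ten responses.
rater_a = ["good"] * 6 + ["bad"] * 4
rater_b = ["good"] * 4 + ["bad"] * 4 + ["good"] * 2

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Here the raters agree on 6 of 10 items, but because chance alone would produce 52% agreement, kappa is low. A quality metric built on these labels would inherit that noise.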

Neither approach produces a clean, single signal equivalent to a conversion event. Landon Smith, Head of Post-Training at Character.AI, captures the underlying challenge: the goal is to "compare different modeling techniques from the perspective of our users" — framing quality measurement as inherently user-perspective-dependent rather than model-metric-dependent.

That framing is correct, but it also means there's no universal quality metric waiting to be discovered. Every AI product has to define what "better" means for its users, instrument that definition deliberately, and resist the pull toward convenient proxies that measure something adjacent to quality rather than quality itself.

AI eliminates the cost of creating variants, making the two-variant test model obsolete

Non-determinism and metric ambiguity affect the validity of individual experiments. Variant explosion is a different kind of problem — it affects the statistical architecture of your entire testing program.

Classical A/B testing wasn't designed around two variants because two is a statistically ideal number. It was designed around two variants because building a third one was expensive. A new landing page required design time, engineering cycles, QA, and deployment.

That real-world cost naturally constrained experiment scope, and the two-arm test became the default — not by mathematical necessity, but by economic convention. AI removes that constraint entirely, and the consequences for experiment design are more serious than most teams realize.

The economic assumption classical testing was built on

When early experimentation platforms defined A/B testing, a "variant" meant something concrete: a redesigned button, a new checkout flow, a different headline. Creating one took days at minimum.

That friction wasn't a bug — it was a forcing function that kept experiment scope manageable and gave statistical frameworks a stable target to measure. Fixed sample size calculations, significance thresholds, and traffic allocation all assume that the thing you're testing stays the same for the duration of the experiment. The economic cost of building variants enforced that stability by accident.

How AI breaks the cost constraint

For an AI product, a "variant" might be a different system prompt, a temperature setting, a model version, a fine-tuning configuration, or a combination of all four. A new prompt or sampling setting can be generated and deployed programmatically in seconds, with no design review, no engineering sprint, and no QA cycle. The marginal cost of a new variant approaches zero, which means the natural brake on experiment proliferation disappears entirely.

This isn't a hypothetical. GrowthBook's own AI testing guidance acknowledges the need to "tune AI responses across thousands of use cases" — a framing that implicitly concedes the variant space for AI products is orders of magnitude larger than anything classical two-arm testing was designed to handle.

Character.AI's head of post-training, Landon Smith, describes using experimentation infrastructure specifically to "compare different modeling techniques from the perspective of our users" — not UI variants, but model-level behavioral variants at production scale.

The statistical consequences of variant explosion

This is where the problem stops being a workflow inconvenience and becomes a statistical architecture failure. Three specific failure modes emerge when variant counts scale beyond two.

The first is multiple comparisons inflation. GrowthBook's own experimentation documentation is explicit: if you test the same hypothesis at 5% significance across 20 different variants, the probability of finding at least one false positive by chance alone is approximately 64%.
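The 64% figure follows from a one-line calculation, assuming the 20 comparisons are independent and each is run at a 5% significance level:

```python
def family_wise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


print(round(family_wise_error_rate(0.05, 20), 3))  # 0.642
```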

Standard statistical correction methods exist for this problem. They work by raising the bar for statistical significance when you're running multiple comparisons at once, so that the error rate stays controlled across the full set of tests. The Bonferroni correction, which controls the chance of any false positive, and the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR), are the most common approaches. But they share a critical assumption: you must know the total number of comparisons in advance. When variants are generated dynamically, that number is undefined, and the corrections break down.
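As a sketch, both corrections fit in a few lines of plain Python. The p-values are invented. Note that both functions depend on `m`, the total number of comparisons, which is exactly the quantity that becomes ill-defined when variants are created on the fly.

```python
def bonferroni(p_values: list, alpha: float = 0.05) -> list:
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: list, alpha: float = 0.05) -> list:
    """Reject the k smallest p-values, where k is the largest rank
    whose p-value is at or below (rank / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five prompt variants compared against control.
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]
print(bonferroni(p_values))          # [True, False, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, True, False]
```

Bonferroni is the stricter of the two. Benjamini-Hochberg accepts more discoveries in exchange for tolerating a controlled share of false ones.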

Underpowered tests are the second failure mode. Sample size calculations for two-arm tests don't carry over to N-arm tests. Every arm needs at least the sample a two-arm test would need, and more once a multiple-comparisons correction tightens the significance threshold, so total traffic grows faster than the number of arms. That traffic may simply be unavailable, especially for lower-volume AI features where the variant space is large but the user base is not.
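A back-of-the-envelope sketch of how traffic requirements grow is below. It uses the standard normal-approximation formula for comparing two proportions and assumes a Bonferroni adjustment across the treatment-versus-control comparisons. The conversion rates are illustrative, and a real platform may use a different formula.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(p_control: float, p_treatment: float, n_arms: int,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided test of proportions,
    with alpha split across the (n_arms - 1) comparisons against control."""
    adjusted_alpha = alpha / (n_arms - 1)
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


for arms in (2, 5, 20):
    n = per_arm_sample_size(0.10, 0.12, n_arms=arms)
    print(f"{arms} arms: {n} users per arm, {n * arms} total")
```

The per-arm requirement rises with the number of arms because of the tighter threshold, and the total requirement multiplies that by the arm count.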

Non-stationarity is the third failure mode — and the most structurally damaging. When a model updates its own prompt or behavioral configuration mid-experiment — whether through automated optimization or routine model updates — the treatment is no longer stable.

This is closely related to the peeking problem, a failure mode that experimentation practitioners specifically flag: repeatedly checking results and stopping or restarting the experiment when they look favorable inflates false positive rates. A variant that updates every 30 minutes isn't a stable treatment; it's a moving target that makes it impossible to isolate what actually caused any difference you observe.
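The peeking effect can be reproduced with a small simulation. The sketch below runs A/A experiments, where both arms share the same true conversion rate, and compares how often a naive z-test crosses the 5% threshold at any of ten interim looks versus at the final look only. All parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def z_test_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided pooled z-test for a difference in conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def simulate_aa(n_experiments: int = 500, looks: int = 10,
                users_per_look: int = 100, rate: float = 0.2,
                alpha: float = 0.05, seed: int = 1) -> tuple:
    """Return (false positive rate with peeking, rate at final look only)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        p_value = 1.0
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p_value = z_test_p_value(conv_a, n, conv_b, n)
            significant_at_any_look = significant_at_any_look or p_value < alpha
        peeking_hits += significant_at_any_look
        final_hits += p_value < alpha
    return peeking_hits / n_experiments, final_hits / n_experiments


peeking_rate, final_only_rate = simulate_aa()
print(f"any look: {peeking_rate:.2f}, final look only: {final_only_rate:.2f}")
```

There is no real effect in any of these experiments, so every "significant" result is a false positive. Stopping at the first favorable look produces far more of them than waiting for the planned sample.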

Why continuous evaluation becomes a statistical requirement

The answer isn't to slow AI iteration down to match classical experiment cadence. That trades one problem for another — you preserve statistical validity at the cost of the development velocity that makes AI products worth building. The actual solution is to adopt evaluation frameworks designed for continuous, multi-variant comparison from the start.

Sequential testing methods are specifically designed to handle ongoing data collection without fixed stopping rules, directly addressing the peeking problem that continuous variant updates create. Bayesian approaches with appropriate priors allow for probabilistic comparison across multiple variants without the hard significance thresholds that break under multiple comparisons pressure.
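As one illustration of the Bayesian approach, the sketch below estimates the probability that each of several variants has the highest conversion rate by sampling from Beta posteriors. The uniform Beta(1, 1) prior, the variant names, and the counts are assumptions for the example, not a description of how any particular platform implements it.

```python
import random


def probability_best(results: dict, draws: int = 5000, seed: int = 0) -> dict:
    """Monte Carlo estimate of P(variant has the highest true rate).

    results maps variant name -> (conversions, users).
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in results}
    for _ in range(draws):
        sampled = {
            name: rng.betavariate(1 + conversions, 1 + users - conversions)
            for name, (conversions, users) in results.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in wins.items()}


# Hypothetical results for three prompt variants.
results = {
    "prompt_a": (120, 1000),
    "prompt_b": (150, 1000),
    "prompt_c": (125, 1000),
}
for name, prob in probability_best(results).items():
    print(f"{name}: {prob:.2f}")
```

The output is a probability for each variant rather than a pass/fail verdict, which extends naturally to more than two arms. It still depends on the prior and on sensible decision thresholds, so it is not a free pass on experiment design.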

Both approaches are available in modern experimentation platforms, alongside gradual rollout patterns that use live engagement data to monitor AI performance continuously rather than waiting for a fixed experiment window to close.

The shift here is architectural. Two-arm testing with fixed sample sizes was a rational design for a world where variants were expensive. In a world where they're free, the entire experiment design layer needs to be rebuilt around that reality.

AI products require a two-layer metric stack that traditional testing infrastructure wasn't built for

Variant explosion is a problem of scale. The metric stack problem is a problem of instrumentation: even a well-designed, properly scoped experiment will produce misleading results if it's measuring the wrong layer.

Most A/B testing infrastructure was designed around a single instrumentation target: the user. Did they click? Did they convert? Did they return? That architecture made sense when the treatment being tested was a button color or a checkout flow — things that exist entirely in the user-facing layer.

AI features break that assumption because the treatment exists at two distinct layers simultaneously, and changes at one layer don't reliably propagate to the other in predictable ways.

The two layers and why they don't automatically correlate

The model performance layer consists of metrics that describe what the model is doing: latency, token usage, cost per query, and output quality scores. The user behavior layer consists of metrics that describe what users do in response: engagement, retention, conversion, session depth.

These layers are distinct in a meaningful sense — a model can improve on latency while degrading on output quality, or score better on internal quality benchmarks while having no measurable effect on retention. The absence of automatic correlation between the layers is the core problem.

You cannot assume that a model improvement will surface as a user behavior improvement, or that a user behavior regression means the model got worse. Both layers must be instrumented independently before you can reason about the relationship between them.

Why model-level metrics matter on their own — and why they're not enough

Latency and token cost are the easiest model-level metrics to instrument. They're numeric, unambiguous, and don't require user interaction to measure: you get them from the inference call itself. This is why teams often start here.

But optimizing on these metrics alone creates a real risk: a model that is faster and cheaper may produce worse outputs, and a model that produces better outputs may have no effect on the business metrics that actually matter.

GrowthBook's framing of the model selection problem captures this directly — evaluating a new model requires comparing both its cost and its impact on users, and those two dimensions require different instrumentation to measure. One lives in your inference logs; the other lives in your user event stream.

Connecting model performance to user behavior — the hard part

The instrumentation gap isn't just about collecting both types of data — it's about connecting them. To understand whether a model change caused a shift in user behavior, you need a data pipeline that can attribute a specific model state (variant A vs. variant B) to downstream user actions, often across a session or multi-day window.

GrowthBook's AI testing framing describes this as the need to "analyze AI performance with data that links directly to user outcomes" — which is a precise description of the join problem.

The goal of simultaneously increasing engagement while lowering token usage, which GrowthBook's playbook names explicitly, is a concrete example: you cannot optimize both signals at once without a unified data model that tracks both per user per experiment arm.
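The join itself can be sketched with an in-memory SQLite database. The table names, columns, and rows below are hypothetical, and a real warehouse schema will differ. The point is only that model telemetry and user events are both keyed to the same assignment table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignments (user_id TEXT PRIMARY KEY, variant TEXT);
    CREATE TABLE inference_logs (user_id TEXT, tokens INTEGER, latency_ms INTEGER);
    CREATE TABLE user_events (user_id TEXT, event TEXT);
""")
conn.executemany("INSERT INTO assignments VALUES (?, ?)",
                 [("u1", "A"), ("u2", "A"), ("u3", "B"), ("u4", "B")])
conn.executemany("INSERT INTO inference_logs VALUES (?, ?, ?)",
                 [("u1", 400, 900), ("u1", 600, 1100), ("u2", 500, 1000),
                  ("u3", 300, 700), ("u4", 200, 600), ("u4", 400, 800)])
conn.executemany("INSERT INTO user_events VALUES (?, ?)",
                 [("u1", "task_completed"), ("u3", "task_completed"),
                  ("u4", "task_completed")])

query = """
WITH model AS (              -- model layer, aggregated per user
    SELECT user_id, SUM(tokens) AS tokens, AVG(latency_ms) AS latency_ms
    FROM inference_logs GROUP BY user_id
),
behavior AS (                -- user behavior layer, one row per completer
    SELECT user_id, 1 AS completed
    FROM user_events WHERE event = 'task_completed' GROUP BY user_id
)
SELECT a.variant,
       COUNT(*)                      AS users,
       AVG(m.tokens)                 AS avg_tokens_per_user,
       AVG(m.latency_ms)             AS avg_latency_ms,
       AVG(COALESCE(b.completed, 0)) AS completion_rate
FROM assignments a
LEFT JOIN model m ON m.user_id = a.user_id
LEFT JOIN behavior b ON b.user_id = a.user_id
GROUP BY a.variant
ORDER BY a.variant
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Aggregating each layer per user before joining avoids double-counting users who made several inference calls. Each output row then carries cost, latency, and behavior for one experiment arm.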

The Character.AI use case illustrates what this looks like in practice. Landon Smith, Head of Post-Training at Character.AI, describes using GrowthBook's experimentation infrastructure to evaluate post-training decisions against real user outcomes — measuring model quality not in isolation, but in terms of what it does to user behavior.

The instrumentation gap in existing tools

Most experimentation platforms were built to ingest user events and join them to experiment assignment tables. They were not built to ingest model-level telemetry — token counts, inference latency, quality scores — and join that data to the same assignment table.

The practical result is two disconnected dashboards: one from whatever LLM observability tooling a team is running, showing latency and cost trends, and one from their experimentation platform, showing conversion rates.

With no systematic connection between them, teams fall back on what GrowthBook's playbook calls "vibe checks" — informal assessments of model quality that don't connect to user outcomes in any structured way. This isn't a workflow problem. It's a metric architecture problem. And it's one that traditional A/B testing infrastructure, built entirely around the user behavior layer, was never designed to solve.

Evals vs. A/B tests for AI products: when to use each — and when you need both

Some in the industry have declared A/B testing obsolete for AI products. Braintrust's September 2025 post put it bluntly: "A/B testing is no longer sufficient for AI product optimization. The future is evals." The recent acquisitions of Statsig by OpenAI and Eppo by Datadog were cited as market signals pointing the same direction.

It's a provocative framing — and it's also wrong, or at least incomplete.

Read past the headline, and even Braintrust's own analysis acknowledges that evals and A/B tests serve different purposes and belong at different stages of the deployment pipeline. The practical question isn't which method to use. It's understanding what each method can actually tell you, and where each one belongs.

What evals are and what they can tell you

Evals are offline, pre-deployment quality assessments of model outputs. Think of them as the AI equivalent of unit tests — but instead of asserting that a function returns the correct value, you're asserting that a model response meets defined quality criteria.

Braintrust uses the term "scorers" for the evaluation functions that operationalize these criteria: does the response respect the user's formatting preferences? Does it answer the question accurately? Is it likely to be accepted or rejected?

The eval hypothesis is concrete and testable. An example from Braintrust: "Users are rejecting responses because of formatting issues, so the format part of our prompt needs to be fixed." The success criterion is measurable in a controlled setting — "AI responses will better respect users' formatting preferences" — without requiring any users to be exposed to the change.

As Mengying Li of Braintrust put it: "Evals should always come before exposing users to AI features — they're your first line of defense against poor quality."

That framing captures both the value and the limitation of evals. They tell you whether a model output meets defined quality criteria in a controlled environment. They cannot tell you whether that quality improvement changes real user behavior at scale.

A response that scores well on every eval dimension might still fail to move retention. A response that scores poorly on formatting might still drive engagement because users find the content useful despite the presentation. Evals validate model quality; they don't validate business impact.

What A/B tests add that evals can't

A/B tests measure what actually happens when real users encounter a change in production. Retention, conversion, task completion, NPS — these are the signals that confirm whether a model quality improvement translates into outcomes that matter to the business. GrowthBook's framing is direct: "evals are just the tip of the iceberg. A/B testing is where real value gets created."

The Props AI Gateway founder made a related point in a Hacker News thread on LLM testing: traditional business metrics like NPS, CSAT, and ticket close rate can and should be applied to AI features in A/B tests. The measurement framework doesn't have to change just because the feature is AI-powered. What changes is the complexity of connecting model-level behavior to those downstream signals.

What A/B tests cannot do is assess model output quality directly. They measure user behavior in response to outputs, not the outputs themselves. A model could produce lower-quality responses that still drive engagement — or higher-quality responses that users don't notice because the quality improvement was in a dimension they don't consciously evaluate.

This is why you need both methods: evals catch quality regressions before users see them; A/B tests confirm that quality improvements actually move the metrics that matter.

The deployment pipeline where both methods fit

The sequencing is straightforward once you accept that both methods are necessary. Evals gate deployment. If a prompt change or model update fails evals, it doesn't reach users.

A/B tests validate business impact. Once a change passes evals, it enters a controlled rollout where user behavior metrics determine whether the quality improvement translates to outcomes. Gradual rollouts, targeted rollouts, and kill switches for underperforming variants all belong to this post-eval phase.
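The gating logic can be sketched as a small function. The scorer names, thresholds, and stage labels are illustrative assumptions, and in practice the scores would come from an eval framework run over a curated dataset.

```python
def next_stage(eval_scores: dict, thresholds: dict) -> tuple:
    """Decide whether a change may proceed from offline evals to users.

    Returns (stage, failed_scorers). A missing score counts as a failure.
    """
    failures = [
        name for name, minimum in thresholds.items()
        if name not in eval_scores or eval_scores[name] < minimum
    ]
    if failures:
        return "blocked", failures
    return "controlled_rollout", []


# Hypothetical thresholds and eval results for a prompt change.
thresholds = {"accuracy": 0.90, "formatting": 0.80}

print(next_stage({"accuracy": 0.92, "formatting": 0.70}, thresholds))
print(next_stage({"accuracy": 0.92, "formatting": 0.85}, thresholds))
```

Only changes that reach the `controlled_rollout` stage are exposed to users, where behavior metrics rather than eval scores decide whether they ship more widely.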

This pipeline isn't entirely new. As one HN commenter noted: "Offline testing, sampled auditing, (limited) online testing, wider online testing, metrics and monitoring throughout. It's always been this way."

What's new is that AI makes the offline testing phase — the evals — substantially more complex and more critical. The number of quality dimensions to evaluate has expanded, the criteria are harder to operationalize, and the outputs are non-deterministic enough that a single eval run may not be sufficient. The pipeline is familiar; the difficulty of running the first stage well is not.

The organizational reality of running both

Running both evals and A/B tests requires different infrastructure and different team ownership, and most organizations aren't set up for both. Evals are typically owned by ML and data teams; they require curated datasets, scorer definitions, and eval frameworks. A/B tests are typically owned by product and growth teams; they require experiment infrastructure, metric definitions, and statistical rigor.

The acquisitions of Statsig by OpenAI and Eppo by Datadog signal that the market is moving toward more integrated platforms, but for most teams today, these are separate toolchains that don't communicate well.

The instrumentation gap is real: most A/B testing platforms were built to track user behavior metrics, not model-level signals like output quality scores or response acceptance rates. Connecting those two layers — model performance and user behavior — is still largely a custom engineering problem.

Character.AI's experience with GrowthBook's experimentation infrastructure points toward what's possible: model decisions made in the post-training process validated against real user outcomes through controlled experiments. Getting there requires both methods, both teams, and infrastructure that can instrument both layers simultaneously.

Four AI-specific failure modes that compound — and where to break the chain

The four problems this article covers — non-deterministic outputs, subjective quality metrics, variant explosion, and the two-layer instrumentation gap — aren't independent. They compound.

A team that hasn't defined what "better" means for their AI feature will instrument the wrong metrics. A team that instruments the wrong metrics will draw false conclusions from their A/B tests. A team running 20 prompt variants through a two-arm statistical framework will accumulate false positives they'll mistake for signal. Each failure mode feeds the next.

Audit your current testing stack against the four AI-specific failure modes

The most useful thing you can do right now is ask whether your current experiment setup was designed for any of these problems — not whether it can be stretched to handle them.

If your variant assignment assumes a stable treatment, if your success metric is a proxy like session length or retry rate, and if your statistical framework doesn't account for multiple comparisons, you're not running clean experiments. You're generating noise and calling it evidence.

Build a two-layer metric architecture before you run your next AI experiment

The instrumentation gap is the most fixable problem on this list, and it's the one most teams skip because it requires coordination between ML and product teams who don't normally share a data model. Before your next AI experiment, make sure you can join model-level signals — latency, token cost, output quality scores — to the same user event stream your A/B tests read from.

GrowthBook's warehouse-native experimentation approach makes this join tractable if your data already lives in a warehouse — rather than requiring you to pipe data through a third-party platform, it runs experiment analysis directly against your existing data warehouse, keeping model telemetry and user behavior data in the same environment where the join is already possible. Without that connection, you're optimizing two disconnected dashboards and hoping they tell the same story.

Adopt a hybrid eval + A/B testing pipeline as your default for AI feature releases

Evals and A/B tests are not competing methods. Evals gate deployment by catching quality regressions before users see them. A/B tests validate whether quality improvements actually move business outcomes. Neither method alone closes the loop.
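The division of labor can be sketched as a simple gate. This is an illustration under stated assumptions, not a prescribed implementation: the EvalResult type, the scores, and the 0.02 regression tolerance are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    score: float  # 0.0 to 1.0, from whatever scorer the team trusts


def mean_score(results):
    return sum(r.score for r in results) / len(results)


def passes_eval_gate(candidate, baseline, max_regression=0.02):
    """Pass if the candidate's mean eval score has not regressed beyond the tolerance.

    The 0.02 tolerance is an arbitrary placeholder, not a recommendation.
    """
    return mean_score(candidate) >= mean_score(baseline) - max_regression


def release_stage(candidate, baseline):
    """Evals decide whether users see the change; the A/B test judges its impact."""
    if not passes_eval_gate(candidate, baseline):
        return "blocked: fix the regression before exposing users"
    return "proceed: start a controlled rollout and measure user behavior"


baseline = [EvalResult("c1", 0.80), EvalResult("c2", 0.90), EvalResult("c3", 0.70)]
improved = [EvalResult("c1", 0.85), EvalResult("c2", 0.90), EvalResult("c3", 0.75)]
regressed = [EvalResult("c1", 0.60), EvalResult("c2", 0.90), EvalResult("c3", 0.70)]

print(release_stage(improved, baseline))
print(release_stage(regressed, baseline))
```

Note that passing the gate only earns a variant the right to be tested on users. It says nothing yet about whether the improvement moves a business outcome.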

The sequencing — evals first, then controlled rollout with user behavior metrics — is the same pipeline that mature ML teams have used for years. What's new is that the eval stage is harder, the variant space is larger, and the statistical methods need to match that reality.

The tension worth holding onto: moving fast on AI iteration and maintaining statistical validity pull in opposite directions. Sequential testing and Bayesian methods exist precisely to ease that tension. Configured correctly, they let you monitor experiments continuously without inflating false positive rates, which is why they require deliberate setup, not just a platform switch.
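To make "monitor continuously" concrete, the sketch below implements Wald's sequential probability ratio test (SPRT) for a conversion-style 0/1 metric. It is a classic textbook method, shown only to illustrate the idea of a decision rule that is checked after every observation. It is not a description of how GrowthBook or any other platform implements sequential testing, and the rates and error levels are illustrative assumptions.

```python
import math


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate == p0 versus H1: rate == p1 on 0/1 outcomes.

    Returns (decision, observations_used). The stopping rule is evaluated
    after every observation, so checking early is built into the design.
    """
    upper = math.log((1 - beta) / alpha)  # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing this accepts H0
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    n = 0
    for n, converted in enumerate(outcomes, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n


# Hypothetical outcome stream: 1 = user accepted the AI response, 0 = did not.
stream = [1, 0, 0]
print(sprt_bernoulli(stream, p0=0.10, p1=0.20))
```

The deliberate setup shows up in the arguments: you have to commit to the baseline rate, the effect you care about, and both error rates before looking at data.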

If this article helped clarify why your AI experiments have felt inconclusive, that's exactly what it was written to do.

What to do next: The most useful next action depends on which failure mode is costing you the most right now.

If your experiments keep returning inconclusive results, the problem is probably metric definition: write down what "better" means for your specific AI feature before touching your testing infrastructure.

If you're running multiple prompt variants simultaneously and your statistical framework doesn't account for multiple comparisons, that's a validity problem that makes every result you have suspect.

If you have no connection between model-level telemetry and user behavior data, that's your first infrastructure investment: nothing downstream can be trusted without it.

These aren't sequential steps. They're three different entry points into the same underlying problem, and the right one is whichever is most acute for your team today.

