How do you A/B test an LLM when results aren't deterministic?

A/B testing an LLM is not the same as A/B testing a button color.
The core assumption that makes standard A/B testing work, that your treatment is a stable, fixed thing, breaks the moment you introduce a model that can produce different outputs each time it runs, even with the same prompt and the same settings. Research testing five LLMs under nominally deterministic conditions found accuracy variations of up to 15% across runs. That's not noise you can ignore. That's a problem you have to design around.
This guide is for engineers, PMs, and data teams who are shipping AI features and need to know whether their changes are actually working. If you've tried applying standard experiment logic to LLM outputs and gotten results that felt unreliable — or if you're about to run your first LLM experiment and want to avoid the common traps — this is written for you. Here's what you'll learn:
- Why LLM non-determinism breaks traditional A/B testing assumptions, and what that means for your experiment design
- How to define what you're actually comparing — prompts, models, hyperparameters, and configurations — so your results are causally interpretable
- Which metrics can detect real signal through output variability, and which ones produce noise dressed up as data
- When to use offline prompt evaluation versus live A/B tests, and how to sequence them correctly
- How to structure the statistics — power analysis, randomization unit, test duration, and canary rollouts — so your results hold up
The article moves in that order: from why the problem exists, to how to isolate what you're testing, to how to measure it, to how to run the experiment correctly. None of these steps are optional. Each one addresses a specific failure mode that produces confident-looking results that are actually wrong.
Why LLM non-determinism makes traditional A/B testing assumptions break down
Standard A/B testing is built on a quiet assumption that almost never gets stated explicitly: the treatment is stable. When you assign a user to variant B — the green button, the shorter headline, the reordered checkout flow — variant B is the same thing every time that user encounters it.
The output doesn't drift between assignment and measurement. This stability is so fundamental to how experiments are designed and analyzed that most testing frameworks don't bother to document it. It's just assumed.
LLMs violate this assumption at the architectural level. And the violation isn't subtle.
LLM outputs are samples, not artifacts
A UI element is a fixed artifact — it looks the same every time. An LLM response is more like a roll of a weighted die: the model has learned which types of responses are most likely for a given prompt, but it doesn't produce the same response every time. Which specific response you get is decided fresh at the moment the model runs.
This breaks the core requirement for treatment stability. When you assign a user to "variant B — the new prompt," you're not assigning them to a fixed experience. You're assigning them to a distribution of possible experiences, and which specific experience they get is determined at the moment of inference.
The treatment isn't a thing; it's a process. That distinction matters enormously for how you design and interpret experiments.
Temperature is not the fix it appears to be
The instinct to set temperature to zero is understandable. Greedy sampling — always selecting the highest-probability token — is theoretically deterministic. If you remove the randomness, the argument goes, you get a stable output you can test against.
The problem is that this theoretical determinism doesn't survive contact with production infrastructure. Research from Penn State and Comcast AI Technologies (Atil et al., arXiv:2408.04667) tested five LLMs configured to be deterministic across eight common tasks over ten runs. The results are striking: accuracy variations of up to 15% were observed across runs, with a gap between best and worst possible performance of up to 70%. None of the five models consistently delivered repeatable accuracy across all tasks.
The most commonly cited reason is a quirk of how modern GPUs do math. Floating-point arithmetic is non-associative: (a + b) + c does not necessarily equal a + (b + c), because every intermediate result is rounded to finite precision. When attention scores and logits are computed in parallel across GPU threads, the order of operations can vary between runs, and those tiny rounding differences compound into meaningfully different outputs even when the model settings are identical.
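To make the non-associativity point concrete, here is a minimal sketch in plain Python. It only demonstrates the rounding behavior of IEEE 754 doubles on a CPU; it is not a reproduction of what happens inside a GPU kernel, and the specific numbers are chosen purely for illustration.

```python
# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)

# The effect is easier to see when magnitudes differ a lot.
a, b, c = 1e16, -1e16, 1.0
cancel_first = (a + b) + c   # the large values cancel before the small one is added
absorb_first = a + (b + c)   # the small value is absorbed by rounding, then cancelled
print(cancel_first, absorb_first)
```

The same three numbers produce two different sums depending only on the order of operations, which is the property a parallel reduction exposes when its execution order changes.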
Add to this the co-mingling of data in input buffers across concurrent requests — a property that researchers note may be "essential to the efficient use of compute resources" — and you have non-determinism that is architectural, not incidental. Even self-hosted inference with open-source libraries like vLLM or SGLang doesn't eliminate it.
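The Thinking Machines Lab analysis puts it plainly: the concurrency-plus-floating-point explanation is "not entirely wrong" but incomplete. Its argument is that batching is the bigger factor: the numerical result for one request can depend on how many other requests the server happens to process alongside it, and that changes with load. Either way, the practical conclusion is the same: no major hosted LLM currently guarantees deterministic output, even at temperature zero with a fixed seed.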
Why comparing raw outputs between variants doesn't work
Even if you accept the variance, you might think you can still compare outputs statistically — run enough samples, measure quality scores, see which variant wins. The problem is that there's no stable ground truth for either variant. Variant A doesn't have "an output" — it has a distribution of outputs. Variant B has a different distribution.
Comparing a sample from one distribution to a sample from another, without accounting for within-variant variance, produces results that are easy to misread.
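One way to make "accounting for within-variant variance" tangible is to run each variant several times and look at the run-to-run spread next to the gap between variants. The sketch below uses a simulated evaluation run (`simulated_run_accuracy` is a hypothetical stand-in, and the accuracies, run counts, and seeds are assumed numbers), so treat it as an illustration of the bookkeeping rather than a measurement of any real model.

```python
import random
import statistics

def simulated_run_accuracy(true_accuracy: float, n_items: int, rng: random.Random) -> float:
    """Hypothetical stand-in for one full evaluation run of a variant."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_items))
    return correct / n_items

def summarize_runs(true_accuracy: float, n_runs: int, n_items: int, seed: int) -> tuple[float, float]:
    """Mean and run-to-run standard deviation of accuracy over repeated runs."""
    rng = random.Random(seed)
    runs = [simulated_run_accuracy(true_accuracy, n_items, rng) for _ in range(n_runs)]
    return statistics.mean(runs), statistics.stdev(runs)

# Assumed numbers: two variants whose underlying accuracy differs by 2 points.
mean_a, sd_a = summarize_runs(0.70, n_runs=10, n_items=100, seed=1)
mean_b, sd_b = summarize_runs(0.72, n_runs=10, n_items=100, seed=2)

gap = mean_b - mean_a
print(f"A: {mean_a:.3f} +/- {sd_a:.3f}   B: {mean_b:.3f} +/- {sd_b:.3f}   gap: {gap:+.3f}")
```

If the gap between variants is smaller than the spread within a single variant, one run per variant tells you very little about which one is better.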
A practitioner observation from the Hacker News discussion on this topic captures the failure mode precisely: "You can't use correct unit tests or evaluation sets to prove anything about inputs you haven't tested." This generalizes directly to A/B tests. A result on a sampled set of outputs doesn't generalize to the full input distribution, especially when context sensitivity means the same user query can produce different outputs depending on what preceded it in the session.
Ignoring output variance compounds every other experimental error
The downstream consequences are predictable once you understand the mechanics. Teams that apply traditional A/B testing logic to LLM outputs without accounting for output variance will tend to attribute sampling noise to the treatment, under-power experiments because they've underestimated variance, and produce false positives from comparing unstable distributions.
A single-run evaluation of any LLM variant, the kind of quick check that's perfectly reasonable for a UI change, is an unreliable basis for a shipping decision when that variant's accuracy can swing by up to 15% across runs under nominally identical conditions.
The standard pitfalls of A/B testing — Twyman's Law, measurement error, inflated false positive rates from running multiple metrics — don't disappear in LLM experiments. They compound. Higher structural variance means the signal-to-noise ratio is worse to begin with, which means every other source of experimental error matters more, not less. The solution isn't to abandon statistical rigor; it's to adapt the experiment design to account for what LLMs actually are.
Defining what you're comparing when A/B testing LLM outputs: prompts, models, and configurations
When engineers say they want to "A/B test their LLM," they're usually describing a goal, not an experiment. The goal is reasonable — figure out which version of the AI system works better. But "the AI system" is not a single variable.
It's a stack of independently configurable layers, and treating them as one undifferentiated thing to test is how you end up with results you can't act on.
The discipline that makes LLM experimentation valid is the same one that makes any experiment valid: change one thing at a time. In an LLM context, that requires first being clear about what the distinct "things" actually are.
Five independently testable layers in any LLM system
An LLM system has at least five layers that can be independently varied and therefore independently tested. Prompt wording and structure is the most obvious — the specific instructions, examples, and framing you pass to the model. Entirely separate from that is model version or provider: comparing GPT-4o against Claude against Gemini is a different experiment from comparing two prompt variants on the same model.
Braintrust explicitly treats these as parallel test surfaces in its evaluation tooling: you can run different prompt versions, models, or parameters side by side, not as a single combined test but as distinct comparisons.
Hyperparameters — temperature, top-p, max tokens — constitute a third layer, controlling the sampling behavior of the model in ways that can meaningfully shift output characteristics independent of any prompt change. A fourth layer worth isolating is system prompt configuration and fine-tuning, particularly relevant for teams that have customized model behavior at the training or instruction level. Rounding out the five is the end-to-end user experience — how the AI output is surfaced, formatted, and integrated into the product — which often gets conflated with the others when it deserves its own test.
Latitude's framework for LLM evaluation maps these layers to different measurement concerns: technical performance metrics, user experience metrics, and business impact metrics. The implication is that different test surfaces tend to move different downstream indicators, which is another reason to keep them separate.
The isolation principle is not optional
Conflating these layers doesn't just make your results messy — it makes them causally uninterpretable. Braintrust puts the problem concretely: "The fundamental challenge with prompt development is that changes have unpredictable effects. Adding an example might improve one scenario while breaking another." If you simultaneously change the prompt and swap the model, you cannot know which one drove the change in outcomes.
You've run two experiments at once and produced evidence for neither.
GrowthBook's experimentation documentation states this as a general law: "The fewer variables that are involved in the experiment, the more causality can be implied in the results." In traditional A/B testing, violating this principle is a mistake you can sometimes recover from with more data. In LLM A/B testing, where outputs are already variable by design, it's closer to fatal — the background variation is already high enough that you cannot afford to add more confusion by changing two things at once. You need every methodological advantage you can get, and variable isolation is the cheapest one available.
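The isolation principle can be enforced mechanically before an experiment launches. The following is a minimal sketch, assuming you describe each variant as a small config object; `VariantConfig`, its field names, and the model identifiers are all hypothetical and not taken from any particular platform.

```python
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class VariantConfig:
    prompt_id: str
    model: str
    temperature: float
    top_p: float
    max_tokens: int

def changed_fields(control: VariantConfig, treatment: VariantConfig) -> list[str]:
    """Names of the config fields that differ between the two variants."""
    return [f.name for f in fields(control)
            if getattr(control, f.name) != getattr(treatment, f.name)]

def assert_single_change(control: VariantConfig, treatment: VariantConfig) -> str:
    """Return the one changed field, or raise if zero or several fields differ."""
    diff = changed_fields(control, treatment)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed field, found {diff}")
    return diff[0]

control = VariantConfig("support-v1", "model-a", 0.2, 1.0, 512)
prompt_only = replace(control, prompt_id="support-v2")
confounded = replace(control, prompt_id="support-v2", model="model-b")

print(assert_single_change(control, prompt_only))  # prompt_id
print(changed_fields(control, confounded))         # ['prompt_id', 'model']
```

A check like this is cheap to run in CI or at experiment creation time, and it turns "we changed two things at once" from a post-mortem finding into a launch blocker.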
The Hacker News community has felt this failure mode from the user side. When Anthropic ran undisclosed A/B tests on their API, one practitioner observed: "Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test." That's not just a complaint about transparency — it's a precise description of what happens when variable isolation breaks down. The signal becomes untrustworthy for everyone downstream.
From output quality to user behavior
The most important reframe in LLM experimentation is shifting the question from "which prompt produces better text?" to "which configuration produces better user outcomes?" These are not the same question, and optimizing for the first doesn't reliably produce improvements in the second.
Character.AI's use of GrowthBook for model-level comparison illustrates the distinction. Landon Smith, Head of Post-Training at Character.AI, describes the approach as comparing "different modeling techniques from the perspective of our users" — not evaluating output quality in isolation, but measuring how modeling choices translate into user experience. The test surface is the model layer; the measurement layer is user behavior.
This reframe also clarifies how to choose which layer to test first. If your hypothesis is about engagement, you're probably testing UX or prompt presentation. If your hypothesis is about output accuracy or cost, you're probably testing model version or hyperparameters.
The test surface should be determined by which downstream behavior you're trying to move — and that determination has to happen before you design the experiment, not after you look at the results.
Choosing metrics that can actually detect signal through LLM output variability
The measurement problem in LLM experiments isn't just technical — it's conceptual. Most teams approach LLM evaluation the same way they'd approach any software quality check: read some outputs, form an impression, maybe run a few spot-checks. That approach collapses under experimental conditions.
When you're trying to detect whether prompt variant A outperforms prompt variant B across thousands of interactions, impressions don't aggregate. You need metrics that do.
Why output text is a poor primary metric
Because of the output variance established in the first section, comparing text strings directly across variants doesn't produce signal — it produces noise dressed up as data.
Traditional text similarity metrics like BLEU and ROUGE make this worse, not better. They measure surface-level token overlap, which fails to capture semantic equivalence. A response that says "the account balance is insufficient" and one that says "you don't have enough funds" score poorly against each other despite being functionally identical.
Confident AI's evaluation guidance is explicit on this point: semantic nuance in LLM outputs is not captured by statistical scorers, which is precisely why they're unreliable as primary metrics in experiments.
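The account-balance example is easy to check by hand. The sketch below uses a deliberately crude unigram overlap score as a stand-in for surface-overlap metrics; it is not an implementation of BLEU or ROUGE, and the third sentence is an added example to show the opposite failure.

```python
def unigram_overlap(reference: str, candidate: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)

reference = "the account balance is insufficient"
paraphrase = "you don't have enough funds"          # same meaning, different words
contradiction = "the account balance is sufficient"  # opposite meaning, same words

print(unigram_overlap(reference, paraphrase))     # 0.0
print(unigram_overlap(reference, contradiction))  # 0.8
```

The paraphrase scores zero and the contradiction scores high, which is exactly backwards from what you'd want a quality metric to report.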
The practical consequence is that if you anchor your experiment to output text quality — even with automated scoring — you're measuring the wrong thing. What you actually want to know is whether one variant produces better downstream outcomes for users. That requires different instrumentation entirely.
Quantitative behavioral and operational metrics worth tracking
The metrics that aggregate cleanly in LLM experiments are the ones that describe what users do in response to outputs, and what the system costs to produce them. Latency, token usage, cost per query, task completion rate, user retention, and conversion events are all measurable, stable, and statistically tractable in ways that output text is not.
The trade-off between these dimensions is real and often non-obvious. Adding a Chain-of-Thought instruction to a prompt might improve factual accuracy while doubling latency and token cost. If you're only tracking accuracy, you'll ship a change that degrades user experience and inflates your inference bill. Tracking both dimensions simultaneously is what makes the trade-off visible before it becomes a production problem.
For teams whose behavioral metrics — retention, conversion, session depth — live in a data warehouse rather than a real-time event stream, warehouse-native experimentation removes the need to re-instrument data collection. GrowthBook connects directly to your existing data warehouse — Snowflake, BigQuery, Redshift, ClickHouse, Databricks, Athena, Postgres, and more — which matters when LLM experiment outcomes are measured in days or weeks, not seconds.
Latitude's framework for mapping metrics to measurement methods is useful here: technical performance metrics like response time and F1 score are best captured through automated evaluation; user experience metrics like satisfaction ratings come from feedback mechanisms and system logs; business impact metrics like retention come from analytics pipelines. Each category requires different instrumentation, and all three are necessary for a complete picture.
One important constraint: tracking too many metrics simultaneously inflates your false positive rate in ways that can make experiments misleading. GrowthBook's A/A test analysis quantifies this directly — with five unrelated metrics in an experiment, the probability of at least one false positive reaches 41%. With two metrics, it's 19%.
The implication for LLM experiments is that metric selectivity isn't just good hygiene; it's a statistical necessity. Pick the metrics most causally connected to your experiment goal, and resist the temptation to track everything because you can.
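The arithmetic behind those percentages is worth seeing once. For independent metrics, the chance of at least one false positive is 1 - (1 - alpha)^k. The quoted 41% and 19% figures are consistent with a per-metric false positive rate of 10%; that rate is an inference from the numbers, not something stated in the source, and real metrics are rarely fully independent.

```python
def prob_any_false_positive(alpha: float, n_metrics: int) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics

def bonferroni_alpha(target_alpha: float, n_metrics: int) -> float:
    """Per-metric threshold that keeps the overall rate at or below the target."""
    return target_alpha / n_metrics

for k in (1, 2, 5):
    print(k, round(prob_any_false_positive(0.10, k), 2))
# 1 0.1
# 2 0.19
# 5 0.41

# With a Bonferroni-corrected threshold the overall rate stays under the target.
corrected = bonferroni_alpha(0.10, 5)
print(round(prob_any_false_positive(corrected, 5), 3))  # 0.096
```

A correction like Bonferroni is one conservative option; the cheaper fix is the one described above, which is to commit to fewer metrics in the first place.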
What quantitative metrics miss that users notice immediately
Quantitative metrics tell you what happened; qualitative feedback tells you why. Thumbs up/down signals, satisfaction ratings, session logs, and targeted surveys are harder to aggregate statistically, but they provide ground truth that automated metrics routinely miss.
Users notice when LLM quality degrades — in tone, in relevance, in the subtle ways a response fails to address what they actually asked — even when task completion rates hold steady.
The Hacker News community's reaction to undisclosed LLM A/B testing makes this concrete: practitioners are acutely aware of quality variation and attribute negative experiences to experimentation when they notice inconsistency. That perception matters, and it won't show up in your latency dashboard.
LLM-as-judge and automated scoring approaches
For teams that need scalable quality assessment without manual review at every step, LLM-as-judge has emerged as the most practically useful approach. The pattern is straightforward: use a separate LLM to evaluate outputs against natural language rubrics — criteria like faithfulness to source material, relevance to the user's query, or absence of hallucination — and produce structured scores that can be aggregated across experiment arms.
Named approaches in this space include G-Eval, Prometheus (the evaluator model, distinct from the infrastructure monitoring tool of the same name), GPTScore, SelfCheckGPT, and QAG Score. These model-based scorers are more semantically aware than statistical alternatives, though they carry higher inference cost and introduce their own consistency questions.
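The LLM-as-judge pattern reduces to a small amount of plumbing: a rubric, a judge callable, and per-arm aggregation. The sketch below shows only that plumbing. `placeholder_judge` is a hypothetical stand-in so the example runs offline; it is not how G-Eval or any of the named scorers work, and the rubric criteria are just the ones mentioned above.

```python
from statistics import mean
from typing import Callable

# (question, answer, criterion) -> score in [0, 1]
Judge = Callable[[str, str, str], float]
RUBRIC = ("faithfulness", "relevance")

def placeholder_judge(question: str, answer: str, criterion: str) -> float:
    """Hypothetical stand-in for a judge-model call.

    A real judge would prompt a separate LLM with the rubric text for
    `criterion` and parse a structured score from its reply.
    """
    return 1.0 if answer.strip() else 0.0

def score_arm(samples: list[tuple[str, str]], judge: Judge) -> dict[str, float]:
    """Average each rubric criterion over all (question, answer) pairs in one arm."""
    return {
        criterion: mean(judge(q, a, criterion) for q, a in samples)
        for criterion in RUBRIC
    }

arm_a = [("How do I reset my password?", "Use the reset link."), ("Where is my order?", "")]
arm_b = [("How do I reset my password?", "Use the reset link."), ("Where is my order?", "It ships today.")]

print(score_arm(arm_a, placeholder_judge))
print(score_arm(arm_b, placeholder_judge))
```

Keeping the judge behind a plain callable makes it easy to swap evaluators, and to test the aggregation logic without paying for inference.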
The right metrics also depend on what you're building. For RAG pipelines, the relevant dimensions are faithfulness, answer relevancy, and contextual precision and recall. For agent systems, task completion rate, tool correctness, and step efficiency matter more than fluency. Applying generic quality metrics to specialized systems produces generic, unhelpful results.
The practical synthesis: a measurement stack for LLM experiments should combine operational metrics (latency, cost, token usage) with behavioral outcomes (retention, task completion, conversion), qualitative feedback signals, and automated scoring on the dimensions most relevant to your specific system. Experiment platforms that surface metric correlations are worth noting here — the ability to visualize how experiments jointly affect two metrics simultaneously is particularly useful for surfacing the trade-offs that LLM experiments routinely produce.
Landon Smith, Head of Post-Training at Character.AI, describes the value directly: "GrowthBook has been an invaluable tool for Character.AI, helping us develop our models into a great consumer experience. We can compare different modeling techniques from the perspective of our users — guiding our research in the direction that best serves our product." That framing — comparing modeling techniques from the perspective of users — is the right orientation for LLM measurement generally. The text of the output is an intermediate artifact. What users experience, and what they do next, is the actual signal.
When to use offline prompt evaluation versus live A/B tests in production
Offline evaluations and live A/B tests are not two names for the same thing. They answer different questions, and conflating them is one of the more reliable ways to ship a prompt change that looks good in your notebook and quietly degrades user experience in production.
The practical question most teams face isn't which method to use — it's understanding what each one actually tells you, and in what order to use them.
What offline evaluation is good for — and where it breaks down
Offline evaluation means running your prompt or model against a fixed, labeled dataset and scoring outputs against defined criteria. The appeal is real: it's fast, repeatable, and makes regression testing tractable. When you have a stable evaluation set, you can quickly see whether a new prompt version improves or degrades performance on known cases before any user sees the change.
The limitation is equally real. As Label Studio puts it directly: "A model can look improved offline and still cause new issues in real usage." Your evaluation dataset is always an approximation of production. It underrepresents edge cases, misses novel user behaviors, and reflects historical patterns that may have already shifted.
Offline evals also struggle with subjective quality dimensions — tone, helpfulness, policy adherence — unless you've built a rubric that reliably captures those qualities, which is harder than it sounds. A model that clears every benchmark you've defined can still fail the moment it encounters the actual distribution of things your users ask.
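The offline loop described in this section, a fixed dataset, a grader, and a baseline to compare against, fits in a few lines. This is a minimal sketch under assumed names: the dataset, the exact-match grader, the two stub generators, and the tolerance are all hypothetical, and a real setup would call a model and usually need a more forgiving grader.

```python
from typing import Callable

LOCKED_DATASET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]

def exact_match_grader(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def evaluate(generate: Callable[[str], str], n_runs: int = 3) -> float:
    """Mean grade over the locked dataset, repeated to smooth run-to-run variance."""
    scores = [
        exact_match_grader(generate(prompt), expected)
        for _ in range(n_runs)
        for prompt, expected in LOCKED_DATASET
    ]
    return sum(scores) / len(scores)

def is_regression(candidate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag the candidate only if it falls below baseline by more than the tolerance."""
    return candidate < baseline - tolerance

# Stub generators standing in for the current and the new configuration.
answers = dict(LOCKED_DATASET)
baseline_generate = lambda prompt: answers[prompt]
candidate_generate = lambda prompt: "unknown" if prompt == "3 * 3" else answers[prompt]

baseline_score = evaluate(baseline_generate)
candidate_score = evaluate(candidate_generate)
print(baseline_score, round(candidate_score, 3), is_regression(candidate_score, baseline_score))
```

The tolerance matters: with non-deterministic outputs, a candidate that scores slightly below baseline on one evaluation may simply be inside the noise.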
What live A/B testing adds that offline evals cannot
Live tests expose your LLM to production inputs — the full, unpredictable distribution of real user queries, including the long tail of phrasings and edge cases that no curated dataset fully represents. More importantly, they measure downstream user behavior: engagement, task completion, satisfaction signals. Those are the outcomes that actually matter to the product, and they're invisible to offline scoring.
Statsig frames this plainly: "A model can look great in a notebook, then fall apart once it meets production traffic and real users." Offline gains are hypotheses. Live tests are where you confirm whether those hypotheses hold.
The two-stage workflow: validate offline, then route live traffic
Both Label Studio and Statsig independently converge on the same sequencing logic, which makes it worth treating as consensus rather than one team's opinion: use offline evaluation to iterate quickly, and use online testing to confirm that improvements are real.
In practice, the trigger for starting offline evals is any meaningful change — a new model version, a major prompt rewrite, or a change that puts cost, latency, or safety targets at risk. Lock a dataset, define a grader, anchor your decisions on that fixed setup. When offline gains look steady, that's the signal to move — not when they look perfect.
Statsig's guidance here is worth internalizing: "Do not wait for perfection. Move from sandbox to production when gaps narrow, not vanish."
The transition to live testing doesn't mean full exposure. Start with shadow scoring — run both variants against live traffic, score them silently, and compare grades before any user sees a difference. Then open a small traffic flight. Scale as wins persist.
Why controlled rollout requires traffic routing, not just deployment
The two-stage workflow is only practical if you have a reliable mechanism for routing a fraction of live traffic to a new variant without a full deployment. Feature flags are that mechanism. They let you expose a new prompt or model to, say, five percent of users, monitor behavioral metrics in real time, and roll back instantly if guardrail metrics degrade — without touching your deployment pipeline.
This is where the architecture of your experimentation platform matters. GrowthBook SDKs download flag rules as a locally cached payload and evaluate every flag check in-process with zero network latency, which means flag evaluation itself doesn't add latency to LLM inference — a meaningful consideration when you're already managing token generation time.
Consistent user bucketing (the same user always receives the same variant) prevents within-user contamination that would otherwise make behavioral metrics uninterpretable. Platforms that connect guardrail metric monitoring directly to traffic controls close the loop between observation and action — GrowthBook's Safe Rollouts monitor guardrail metrics like error rates and latency, and surface warnings directly in the rollout dashboard, within the same interface used to manage the experiment itself.
The offline-to-live sequence isn't a theoretical best practice. It's the operational response to the non-determinism problem: because you can't fully anticipate what production inputs will look like, no amount of offline testing substitutes for real signal — but shipping without offline validation first is how you expose users to regressions you could have caught in an afternoon.
Five design decisions that determine whether your LLM experiment statistics are valid
Standard statistical principles don't stop working when you're testing LLM outputs — but they do require deliberate adaptation. The core challenge isn't that rigor becomes impossible; it's that the assumptions baked into standard experiment design (stable variance, clean randomization, interpretable output comparisons) quietly break in ways that produce confident-looking results that are actually wrong.
One team ran a two-week A/B test on an improved prompt, saw p=0.03 at +1.2% improvement, called it a win, and rolled out fully — only to discover six months later, through a customer audit, that the prompt had been producing subtly incorrect summaries the entire time. Click-through rates and session lengths looked fine. The semantic quality had drifted in ways the metrics couldn't see.
The fix isn't to abandon statistical testing. It's to make five specific design decisions correctly.
Power analysis when variance isn't constant
Standard power analysis calculations assume roughly constant variance across your input distribution. LLM output variance doesn't work that way — it's heteroskedastic, scaling with task difficulty. Simple lookup queries may show only 5–10% output divergence across runs; complex reasoning tasks can show 40–60% divergence.
When you run a single aggregate power calculation across a mixed workload, your sample size estimate will be accurate for the easy queries and wildly wrong for the hard ones.
The practical implication: treat any standard power calculation as a floor, not a target. GrowthBook's general rule of thumb — at least 200 conversion events per variation — is a reasonable minimum starting point, but in high-variance LLM contexts it's almost certainly insufficient for detecting meaningful differences in complex task performance. Where possible, segment your power analysis by task complexity and size for the hardest segment. If you can't segment, apply a conservative multiplier and accept that you'll need more data than a standard calculator suggests.
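To see how much the hardest segment can dominate, here is a sketch of the textbook two-sample size formula, n = 2 * ((z_alpha + z_beta) * sigma / delta)^2 per arm, applied to two segments. The segment names and standard deviations are assumed numbers for illustration, and the formula itself assumes a normally distributed metric with equal variance in both arms.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(sigma: float, min_effect: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect `min_effect` with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / min_effect) ** 2)

# Hypothetical per-segment standard deviations of the outcome metric.
segments = {"simple lookups": 1.0, "complex reasoning": 2.0}

for name, sigma in segments.items():
    print(name, n_per_arm(sigma, min_effect=0.2))
# simple lookups 393
# complex reasoning 1570
```

Doubling the standard deviation quadruples the required sample, which is why sizing for the aggregate workload tends to leave the hard segment badly under-powered.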
Randomize on users, not requests
This is the most common structural mistake in LLM experiment design. If you randomize on individual requests, the same user can receive variant A in one session and variant B in another. That contaminates the experiment and makes your results uninterpretable — you're no longer measuring the effect of a treatment on a person, you're measuring a noisy mixture of both treatments on the same people.
The correct randomization unit is the user. Latitude explicitly recommends this, and the implementation pattern matters: assignment needs to be stable across sessions without requiring stored state. GrowthBook handles this through deterministic hashing (its SDKs use a 32-bit FNV-1a hash) that maps user identifiers to variants: the same user always gets the same variant as long as experiment settings don't change, with no additional cookies required.
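Stateless, stable assignment is simple to sketch. The version below hashes the experiment key together with the user ID and uses SHA-256 from the standard library as the hash; that choice is an assumption made for portability, and this is a generic illustration rather than any SDK's actual bucketing code.

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Map a user to a variant deterministically, with no stored state."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    # Integer arithmetic avoids float rounding at the bucket boundaries.
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]

# The same user gets the same answer on every call, in every session.
first = assign_variant("user-42", "prompt-rewrite")
second = assign_variant("user-42", "prompt-rewrite")
print(first == second)  # True

# Across many users the split is close to even.
counts = {"control": 0, "treatment": 0}
for i in range(1000):
    counts[assign_variant(f"user-{i}", "prompt-rewrite")] += 1
print(counts)
```

Including the experiment key in the hash input means a user's bucket in one experiment doesn't determine their bucket in the next one.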
The Hacker News community has flagged exactly this concern from the other side: when LLM providers run silent A/B tests with unstable assignment, any longitudinal research on tool effectiveness becomes uninterpretable. Stable, transparent user-level assignment is what makes results defensible.
Test duration and novelty effects
The minimum duration principle exists in standard A/B testing because traffic patterns vary across days and weeks. In LLM experiments, there's an additional reason to enforce it: novelty effects. Users often behave differently when they first encounter a new AI response style — they may engage more, explore more, or rate responses more generously simply because the experience is new. That early signal doesn't reflect steady-state behavior.
GrowthBook's documentation recommends 1–2 weeks as a typical minimum, with some tests requiring a month or more. For LLM experiments, treat two weeks as the floor and be especially skeptical of results that look strong in the first few days. If your experiment shows a strong positive signal in days one through three and then flattens, you're likely looking at novelty, not a real effect.
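A rough way to operationalize that skepticism is to compare the average lift in the first few days against the lift afterwards. The sketch below is a heuristic only: the three-day window, the 2x ratio threshold, and the daily lift values are assumed for illustration, and it is no substitute for looking at the time series itself.

```python
from statistics import mean

def looks_like_novelty(daily_lift: list[float], early_days: int = 3,
                       ratio_threshold: float = 2.0) -> bool:
    """True if early lift is positive and far larger than the later lift."""
    if len(daily_lift) <= early_days:
        raise ValueError("need data beyond the early window to compare against")
    early = mean(daily_lift[:early_days])
    late = mean(daily_lift[early_days:])
    return early > 0 and early > ratio_threshold * max(late, 0.0)

fading = [0.06, 0.05, 0.05] + [0.01] * 11   # strong start, then flat
steady = [0.02] * 14                         # same lift throughout

print(looks_like_novelty(fading))  # True
print(looks_like_novelty(steady))  # False
```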
Statistical methods for aggregated behavioral outcomes
Because you're measuring user-level behavioral outcomes rather than output-level quality scores, the statistical methods that apply to LLM A/B testing are the same ones that apply to any behavioral experiment — with one important adaptation. Variance reduction techniques matter more here than in typical product experiments, because LLM output variance inflates the noise in your downstream metrics.
GrowthBook supports CUPED (Controlled-experiment Using Pre-Experiment Data) and post-stratification, both of which reduce variance by controlling for pre-experiment user characteristics. In practice, CUPED can meaningfully reduce the sample size required to reach significance — which matters when you're running experiments that need weeks of data to accumulate.
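The core of CUPED is a one-line adjustment: subtract from each user's in-experiment metric the part that their pre-experiment metric already predicts. Below is a minimal sketch of that adjustment on made-up data. It is the textbook form, not GrowthBook's implementation, and in a real analysis the coefficient would be estimated on users pooled across both arms.

```python
from statistics import mean, pvariance

def cuped_adjust(pre: list[float], post: list[float]) -> list[float]:
    """Adjust post-period values using the pre-period covariate.

    theta = cov(pre, post) / var(pre); adjusted = post - theta * (pre - mean(pre)).
    """
    mean_pre, mean_post = mean(pre), mean(post)
    var_pre = pvariance(pre, mu=mean_pre)
    if var_pre == 0:
        return list(post)  # covariate carries no information
    cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post)) / len(pre)
    theta = cov / var_pre
    return [y - theta * (x - mean_pre) for x, y in zip(pre, post)]

# Made-up per-user values: sessions before and during the experiment.
pre = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
post = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

adjusted = cuped_adjust(pre, post)
print(round(pvariance(post), 3), round(pvariance(adjusted), 3))
```

The adjustment leaves the mean unchanged and shrinks the variance by however much the pre-period behavior explains, which is where the sample size savings come from.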
One more filter matters: users who were assigned to an experiment but never actually triggered the LLM feature should be excluded from analysis, or they'll dilute your effect estimate. GrowthBook's activation metric feature handles this explicitly.
GrowthBook offers both frequentist and Bayesian engines, with Bayesian as the default. For LLM experiments where you expect small effect sizes and high variance, Bayesian methods have a practical advantage: they let you reason about the probability that one variant is better, rather than forcing a binary significant/not-significant decision on noisy data. Consistent user bucketing and activation metric filtering are worth enforcing here, specifically because they handle the structural mistakes that contaminate LLM experiment results most often.
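To show what "the probability that one variant is better" means in practice, here is a generic Beta-Binomial sketch for a conversion-style metric. It assumes a uniform Beta(1, 1) prior and uses Monte Carlo draws; it illustrates the idea only and is not a description of how GrowthBook's Bayesian engine is implemented. The conversion counts are made up.

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 10_000, seed: int = 0) -> float:
    """Estimate P(rate_B > rate_A) from Beta posteriors with a uniform prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

clear_win = prob_b_beats_a(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
coin_flip = prob_b_beats_a(conv_a=100, n_a=1000, conv_b=100, n_b=1000)
print(clear_win, coin_flip)
```

A result like "B is better with probability 0.8" is often a more useful input to a shipping decision than a p-value that narrowly misses a threshold.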
Canary deployments as a risk management layer
The fifth design decision isn't statistical — it's operational. Before you commit to a full experiment, a canary deployment lets you expose a small slice of traffic to the new variant and monitor guardrail metrics before the experiment proper begins. This is particularly valuable for LLM changes because the failure modes are often qualitative and slow to surface: a prompt that produces subtly worse responses won't show up in latency metrics, but it will eventually show up in retention and satisfaction signals.
Feature flag rollout controls support gradual exposure with percentage-based rollouts, targeted rollout by user segment, and instant kill switches — which maps directly to what a canary pattern requires in practice. Start at one to five percent of traffic, watch your guardrail metrics for 24–48 hours, and only open the full experiment if nothing alarming appears. This doesn't replace the experiment — it de-risks the experiment by catching catastrophic failures before they affect your full user base.
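The decision at the end of a canary window can be written down as an explicit rule instead of a judgment call. The sketch below compares canary metrics to control against relative thresholds; the `Guardrail` type, metric names, thresholds, and values are all hypothetical, and it assumes every guardrail is a "lower is better" metric.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    name: str
    max_relative_increase: float  # 0.20 allows the canary to be up to 20% worse

def canary_decision(control: dict[str, float], canary: dict[str, float],
                    guardrails: list[Guardrail]) -> tuple[str, list[str]]:
    """Return ("proceed" or "rollback", names of breached guardrails)."""
    breached = [
        g.name for g in guardrails
        if canary[g.name] > control[g.name] * (1 + g.max_relative_increase)
    ]
    return ("rollback" if breached else "proceed", breached)

guardrails = [Guardrail("error_rate", 0.25), Guardrail("p95_latency_ms", 0.20)]
control = {"error_rate": 0.010, "p95_latency_ms": 1200.0}

healthy = {"error_rate": 0.011, "p95_latency_ms": 1250.0}
slow = {"error_rate": 0.011, "p95_latency_ms": 1900.0}

print(canary_decision(control, healthy, guardrails))  # ('proceed', [])
print(canary_decision(control, slow, guardrails))     # ('rollback', ['p95_latency_ms'])
```

Writing the thresholds down before the canary starts keeps the decision from being renegotiated once the numbers are in.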
Putting it together: the sequence that makes LLM experimentation defensible
The five sections above each address a distinct failure mode. The non-determinism section explains why standard assumptions break. The variable isolation section explains what you're actually testing. The metrics section explains what to measure. The offline-versus-live section explains the sequencing. The statistical design section explains how to make the numbers trustworthy. None of these is optional, but they're also not independent — they form a sequence, and the sequence matters.
Start with offline evals before you touch production traffic
Before any live traffic is involved, you need a stable evaluation setup: a locked dataset, a defined grader, and a baseline score for your current configuration. This doesn't need to be elaborate — it needs to be stable. The purpose isn't to prove your new variant is better; it's to establish a reference point that makes regression detectable.
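A stable setup can be very small. The sketch below shows one way to arrange the three pieces: a fingerprint that locks the dataset, a toy rubric grader, and a baseline recorded over several runs. Every name here is an assumption made for illustration, and `stub_generate` stands in for whatever model client you actually call.

```python
import hashlib
import json
from statistics import mean


def dataset_fingerprint(examples):
    """Hash the eval set so any edit to it is detectable."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def rubric_grader(output, required_terms):
    """Toy grader: fraction of required terms that appear in the output."""
    text = output.lower()
    return sum(term.lower() in text for term in required_terms) / len(required_terms)


def run_eval(generate, examples, runs=3):
    """Score the same configuration several times; return the mean score per run."""
    return [
        mean(rubric_grader(generate(ex["input"]), ex["required_terms"]) for ex in examples)
        for _ in range(runs)
    ]


examples = [
    {"input": "How do I reset my password?", "required_terms": ["reset", "email"]},
    {"input": "How do I cancel my plan?", "required_terms": ["cancel", "billing"]},
]


def stub_generate(prompt):
    # Stand-in for a real model call; replace with your own client.
    if "password" in prompt:
        return "Use the reset link we email you."
    return "Open settings to cancel."


baseline = {
    "dataset_sha256": dataset_fingerprint(examples),
    "run_scores": run_eval(stub_generate, examples),
}
```

Storing the fingerprint next to the scores means a later comparison can refuse to run if the dataset has changed. Running the eval more than once shows how much the score moves with no change at all, which is the noise floor any candidate variant has to clear.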
The metric selectivity principle discussed earlier — where five simultaneous metrics produce a 41% false positive rate — applies with even more force when you're scaling to multiple sequential experiments. Lock your primary metric before you start, and treat secondary metrics as diagnostic rather than decisive.
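The arithmetic behind that kind of figure is short enough to check yourself. Assuming the metrics are independent, the chance of at least one false positive across k metrics is 1 - (1 - alpha)^k. The 41% quoted here matches five metrics each tested at a per-metric threshold of 0.10; at 0.05 the same formula gives roughly 23%. The helper names below are illustrative.

```python
def family_wise_error_rate(num_metrics, alpha):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics, alpha):
    """Per-metric threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_metrics


rate_at_010 = family_wise_error_rate(5, 0.10)  # about 0.41
rate_at_005 = family_wise_error_rate(5, 0.05)  # about 0.23
corrected = family_wise_error_rate(5, bonferroni_alpha(5, 0.05))
```

Real metrics are usually correlated, which makes the true rate somewhat lower than the independent case, but the direction of the problem is the same: each additional decisive metric raises the odds that something looks significant by chance.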
When offline gains look steady across multiple evaluation runs, that's the signal to move to live traffic. Not when they look perfect — when they look consistent.
Build a measurement stack that survives output variability
The measurement infrastructure needs to be in place before the experiment starts, not assembled during it. That means instrumentation for behavioral outcomes (task completion, retention, conversion), operational metrics (latency, token usage, cost), and at least one qualitative feedback signal.
For teams running warehouse-native experimentation, this means confirming that your behavioral events are flowing into your data warehouse and that your experiment platform can query them directly — avoiding the need to re-instrument or duplicate data pipelines.
The LLM-as-judge layer, if you're using one, should be validated offline before it's used to score live traffic. An evaluator that produces inconsistent scores is worse than no evaluator, because it adds noise to a signal that's already noisy.
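One rough way to validate a judge offline is to score the same outputs several times and measure how often it agrees with itself, then compare its verdicts with a small set of human labels. The sketch below assumes integer rubric scores and pass/fail human labels; the function names and data are hypothetical, and what counts as "consistent enough" is a judgment call for your team.

```python
from statistics import mean


def judge_consistency(repeated_scores):
    """repeated_scores: one inner list per item, holding that item's repeated scores."""
    exact = mean(1.0 if len(set(scores)) == 1 else 0.0 for scores in repeated_scores)
    spread = mean(max(scores) - min(scores) for scores in repeated_scores)
    return {"exact_agreement": exact, "mean_spread": spread}


def human_agreement(judge_labels, human_labels):
    """Fraction of items where the judge's label matches the human label."""
    return mean(1.0 if j == h else 0.0 for j, h in zip(judge_labels, human_labels))


# Hypothetical data: four outputs, each scored three times on a 1-5 rubric.
repeated = [[4, 4, 4], [3, 4, 3], [5, 5, 5], [2, 4, 3]]
report = judge_consistency(repeated)

# Hypothetical pass/fail labels for the same four outputs.
agreement = human_agreement([1, 0, 1, 1], [1, 0, 0, 1])
```

Here the judge gives identical scores on only half the items and drifts by 0.75 points on average, which would be worth fixing (a tighter rubric, a lower temperature, or averaging repeated scores) before it grades live traffic.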
Scale to live experiments with proper statistical controls
The live experiment phase is where the statistical design decisions from the previous section become operational. User-level randomization needs to be confirmed before traffic opens. The activation metric filter needs to be set so that unexposed users don't dilute the analysis. The minimum duration needs to be enforced even if early results look strong.
Canary first, then full experiment. Watch guardrail metrics during the canary phase. If nothing alarming surfaces, open the full experiment and let it run to the pre-specified duration. Resist the temptation to call it early — the novelty effect is real, and results that look strong in the first week often flatten by week two.
Where to begin depending on where you are
The right entry point depends on where your team currently sits.
Teams that have no existing evaluation infrastructure should start with offline evals, not live experiments. Build a dataset of 50–100 representative inputs, define a grader (even a simple rubric scored manually), and establish a baseline score for your current prompt or model. That baseline is what makes future changes interpretable. Without it, you have no reference point, and no experiment result can be read as an improvement or a regression.
Teams that have offline evals but have never run a live LLM experiment should focus on the randomization and duration decisions before anything else. Confirm that your experiment platform assigns users — not requests — to variants, and that assignment is stable across sessions. Set a minimum duration before you start, and commit to it. These two decisions prevent the most common structural errors that produce false positives in LLM experiments.
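If you want to verify what "assigns users, not requests" looks like, deterministic hashing is the usual pattern: the variant is a pure function of the user ID and the experiment key, so it cannot change between requests or sessions. The sketch below is a generic illustration with made-up names, not the hashing algorithm any particular platform uses.

```python
import hashlib


def assign_variant(user_id, experiment_key, variants=("control", "treatment")):
    """Map a user to a variant deterministically from user_id and experiment_key."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    index = min(int(bucket * len(variants)), len(variants) - 1)
    return variants[index]


# The same user always lands in the same variant for a given experiment.
first = assign_variant("user-42", "prompt-v2-test")
second = assign_variant("user-42", "prompt-v2-test")

# Across many users the split should be close to even.
assignments = [assign_variant(f"user-{i}", "prompt-v2-test") for i in range(10000)]
treatment_share = assignments.count("treatment") / len(assignments)
```

Including the experiment key in the hash keeps assignments independent across experiments, so a user who is in treatment for one test is not systematically in treatment for the next.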
For teams already running live experiments but getting results that feel unreliable, the most likely culprits are metric selection and sample size. Check whether you're treating more than two or three metrics as decisive without correcting for multiple comparisons; if so, your false positive rate is almost certainly elevated. Check whether your power calculation used the variance you actually observe in your specific task distribution, or a generic assumed value. Both of these are fixable without changing your experiment infrastructure.
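To see how much the variance assumption matters, here is the standard normal-approximation sample size formula for comparing two means, n = 2(z_(1-alpha/2) + z_power)^2 * sigma^2 / delta^2 per variant. It is a sketch for a two-sided test with equal group sizes; the function name and input numbers are illustrative, and your experiment platform's calculator may use a different method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(std_dev, min_detectable_diff, alpha=0.05, power=0.8):
    """Users needed per variant to detect min_detectable_diff in a metric's mean."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) ** 2) * (std_dev ** 2) / (min_detectable_diff ** 2)
    return ceil(n)


# Same target effect, two variance assumptions (hypothetical values).
n_generic = sample_size_per_variant(std_dev=1.0, min_detectable_diff=0.1)
n_observed = sample_size_per_variant(std_dev=1.5, min_detectable_diff=0.1)
```

Because the required sample grows with the square of the standard deviation, a metric that is 1.5 times noisier than assumed needs 2.25 times the users. An experiment sized with the generic figure would be badly underpowered.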