Offline vs online evaluation for AI models: what actually matters

A model that scores well on your eval set and then degrades a production metric isn't a fluke — it's what happens when teams treat offline and online evaluation as two ways of measuring the same thing.

They aren't. Offline evaluation answers whether your model predicts historical data well. Online evaluation answers whether it actually improves outcomes for real users. Conflating them is the most common way AI evaluation programs produce confident answers to the wrong questions.

This article is for engineers, PMs, and data teams who are building or improving evaluation programs for AI models — especially those who have felt the gap between strong benchmark scores and disappointing production results.

Whether you're working with recommendation systems, LLM-powered features, or classification pipelines, the core problem is the same: eval datasets, automated scoring, and human review tell you one thing, and live user behavior tells you another. Here's what you'll learn:

  • Why offline metrics don't guarantee real-world performance — and the specific structural reasons they fail
  • What online evaluation actually costs in infrastructure, risk, and interpretive complexity
  • How to sequence offline and online evaluation as decision gates, not parallel options
  • How to choose the right metrics at each stage, including guardrail metrics that catch wins that aren't really wins

The article moves in that order — from understanding what each method measures, to why offline scores mislead, to the real demands of production experiments, to the sequencing and metric discipline that makes both stages reliable.

By the end, you'll have a clear mental model for when to trust your eval results and when to be skeptical of them.

What offline and online AI evaluation actually measure

Most evaluation debates inside engineering and product teams get stuck because people are arguing about methods when they should be arguing about questions.

Offline evaluation and online evaluation don't measure the same thing with different tools — they answer fundamentally different questions about fundamentally different realities. Until a team internalizes that distinction, their evaluation program will produce confident answers to the wrong questions.

Offline evaluation measures model performance against a fixed past

Offline evaluation measures model performance against a fixed dataset using a defined scoring method. You hold out a slice of historical data, score your model's predictions against labeled ground truth, and compare results across model versions. The question it answers is: does this model predict past data well?

This is the default starting point for most ML projects because it's repeatable, fast, and easy to compare. For tasks where correctness is clearly definable — classification, entity extraction, ranking, regression — offline evaluation works cleanly.

It's particularly valuable for regression testing: once you have a stable evaluation set, you can quickly detect whether a new model version improves or degrades performance before anything touches production. In recommender systems, for example, offline metrics like precision, recall, mean average precision (MAP), and normalized discounted cumulative gain (NDCG) give teams a consistent signal for iterating on model architecture without running live experiments.
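As a concrete reference point, NDCG can be computed in a few lines. The sketch below is a minimal illustration, assuming graded relevance labels and the linear-gain form of DCG (some implementations use an exponential gain instead). The function names and the two example rankings are hypothetical and not taken from any particular library.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels, listed in the order each model ranked the items.
baseline_ranking = [0, 1, 0, 2, 0]
candidate_ranking = [2, 1, 0, 0, 0]

print(f"baseline NDCG@5:  {ndcg_at_k(baseline_ranking, 5):.3f}")
print(f"candidate NDCG@5: {ndcg_at_k(candidate_ranking, 5):.3f}")
```

Because the evaluation set is fixed, rerunning the same scoring on each model version gives directly comparable numbers, which is what makes this kind of metric useful for regression testing.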

The key constraint is baked into the definition: offline evaluation is always operating against historical data. It measures model quality in a controlled, static approximation of reality.

Online evaluation measures whether the model improves outcomes for real users

Online evaluation measures model performance in a live environment against real user inputs and real interaction patterns. The question it answers is: does this model improve outcomes for actual users?

That shift in question is not subtle. Online evaluation captures dynamics that no static dataset can represent: how recommendations influence user behavior and vice versa, how the model performs under real latency conditions, how it handles the long tail of inputs that never appeared in your training or evaluation data.

The canonical online method is A/B testing — one user segment receives the new model, another receives the existing system, and you track behavioral metrics like click-through rate, conversion rate, or time on platform. These metrics connect model behavior directly to user outcomes and business objectives.

Online evaluation requires infrastructure to segment users, track interactions, and ensure statistical validity. It is slower, costlier, and riskier than offline evaluation. But it answers the question that actually determines whether a model ships.

The gap is ontological, not methodological

The gap between offline and online AI evaluation is not methodological — it's ontological. They are measuring different things. As Shaped.ai puts it directly: "predicting a hold-out set of interactions is not the same as predicting what your users will actually interact with."

Here's what that gap looks like in practice. A team trains a new recommendation model, evaluates it on a chronologically split hold-out set, and sees improved NDCG scores. They conclude the model is ready and ship it. In production, CTR drops.

What happened? The model overfit to historical interaction patterns that have since shifted — seasonal behavior changed, the content catalog expanded, user intent evolved. The offline scores were real. The production degradation was also real. Both can be true simultaneously because they're measuring different things.

This failure mode — mistaking strong offline scores for production readiness — is the most common and consequential mistake in AI evaluation programs.

Conflating the two questions fails teams in opposite directions

Teams that conflate these two questions tend to fail in one of two directions. They either stop at offline evaluation and ship prematurely, trusting that good scores on historical data predict good behavior in production. Or they skip offline evaluation entirely and run expensive, risky online experiments on model candidates that could have been filtered cheaply before touching users.

The correct mental model treats offline and online evaluation as sequential decision gates, not interchangeable alternatives. Offline evaluation is the filter that keeps weak candidates from consuming online experiment slots. Online evaluation is the confirmation that improvements actually hold when real users encounter them. Neither works without the other, and neither substitutes for the other.

GrowthBook's framing captures the relationship well: "When it comes to AI, evals are just the tip of the iceberg. A/B testing is where real value gets created." Landon Smith, Head of Post-Training at Character.AI, describes using A/B testing to "compare different modeling techniques from the perspective of our users" — which is precisely the move from model-quality evaluation to user-outcome evaluation that the offline-to-online transition represents.

The rest of this article builds on that distinction. If you take one thing from this section, take this: offline eval answers whether your model is good at predicting the past. Online eval answers whether it makes things better for the people using your product. These are not the same question, and treating them as such is where evaluation programs go wrong.

Why strong offline metrics don't guarantee real-world performance

If you've ever shipped a model that looked great on your eval set and then watched it land flat — or worse, degrade a production metric — you've already encountered the core problem this section is about.

The frustration is real, and it has specific causes. Understanding them is more useful than simply running more offline tests.

The dataset approximation problem

Every offline evaluation dataset is a historical artifact. It captures what users did interact with, under the conditions that existed when the data was collected, with the system that was running at the time. That's a fundamentally different thing from what your users will interact with under a new model.

The gap isn't a flaw in your data collection — it's structural. Fixed datasets underrepresent edge cases by definition, because edge cases are rare. They omit new user behaviors that didn't exist when the data was collected. And they reflect the decisions of the previous system, not a neutral sample of user intent.

Two biases compound this problem in recommendation systems, but the underlying dynamic applies broadly. First, your eval set only contains interactions with items that were actually shown to users — if the old model never surfaced a particular item, that item has no interaction data, so the new model is being evaluated against a sample the old model already shaped.

Second, users click on things in prominent positions more than things buried lower on the page, regardless of actual quality — which means your historical data overrepresents whatever the previous system ranked highly. Both biases point in the same direction: offline data reflects past system decisions, not neutral ground truth about what users actually want.

Distribution shift over time

Even a well-constructed offline dataset decays. User preferences shift. Content catalogs change. Product contexts evolve. A dataset that accurately represented your production distribution six months ago may be meaningfully stale today.

The practical discipline here is chronological hold-out splitting — evaluating on data that comes after your training window rather than randomly sampled from the same period. Many teams skip this, which introduces time-based data leakage and produces optimistic offline scores that don't survive contact with live traffic.

And even teams that do split chronologically face a compounding problem: the gap between the most recent training data and live production widens continuously after deployment. There's no static fix.
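A chronological split is simple to express. The following is a minimal sketch, assuming each logged interaction carries a timestamp; the field names, the example events, and the cutoff date are all hypothetical.

```python
from datetime import datetime

def chronological_split(events, cutoff):
    """Train on events before the cutoff, evaluate on events at or after it."""
    train = [e for e in events if e["ts"] < cutoff]
    holdout = [e for e in events if e["ts"] >= cutoff]
    return train, holdout

# Hypothetical interaction log; field names are illustrative only.
events = [
    {"user": "u1", "item": "a", "ts": datetime(2024, 1, 5)},
    {"user": "u2", "item": "b", "ts": datetime(2024, 2, 11)},
    {"user": "u1", "item": "c", "ts": datetime(2024, 3, 2)},
    {"user": "u3", "item": "a", "ts": datetime(2024, 3, 20)},
]

train, holdout = chronological_split(events, cutoff=datetime(2024, 3, 1))
print(len(train), "training events,", len(holdout), "hold-out events")
```

A random split over the same log could place the March events in training and the January events in the hold-out set, which is the time-based leakage described above.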

The rubric design challenge for subjective outputs

For tasks with clear ground truth — classification, extraction, named entity recognition, ranking against explicit relevance judgments — offline evaluation is on solid footing. The problem sharpens considerably when the output is subjective.

As Label Studio notes: "Offline evaluation also struggles with metrics that require subjective judgment (tone, helpfulness, policy adherence) unless you've designed a rubric and labels that capture those qualities reliably." For chatbot responses, summaries, or any output where quality involves nuance, automated scoring produces numbers that may have no meaningful relationship to what users actually experience.

Practitioners building with tools like Promptfoo, Braintrust, and Opik are actively working on this — but the Hacker News community that uses these tools is candid about the limits. One practitioner reported that Promptfoo "broke down when we wanted evaluations with too many independent variables." Another summarized the state of agentic AI evaluation at many organizations as "vibes and logs." That's not a dismissal of offline eval — it's an honest acknowledgment that the methodology hasn't kept pace with the complexity of modern AI outputs.

When offline metrics are and aren't trustworthy

The practical decision rule is roughly this: offline metrics are reliable when you can define correctness independently of user behavior. Classification tasks, extraction tasks, ranking tasks with stable relevance judgments — these fit well.

Offline metrics become unreliable when the dataset isn't chronologically split, when output quality is inherently subjective, when the deployment environment differs significantly from the training distribution, or when "correct" is itself a function of what users do.

A model can score better on every offline metric and still cause new problems in production — particularly when the deployment environment is noisy, user inputs have shifted, or the evaluation rubric never captured what users actually care about. Offline metrics are necessary. They're not sufficient, and treating them as sufficient is the most common way evaluation programs mislead the teams relying on them.

The real costs and risks of running AI evaluation in production

Online evaluation is not a free upgrade from offline evaluation. It captures things offline cannot — but it does so at a cost in infrastructure, risk, and interpretive complexity that teams consistently underestimate until they've been burned by it.

The gap between a model that looks great in a notebook and one that holds up under production traffic and real users is precisely what online evaluation exists to close, and precisely why it demands more care than most teams initially plan for.

What online evaluation captures that offline cannot

The fundamental advantage of online evaluation is that it operates on real inputs from real users in real time. Offline datasets, no matter how carefully curated, are static approximations of a distribution that is constantly shifting. Online evaluation, by contrast, reflects current user behavior — including long-tail inputs, adversarial edge cases, and interaction patterns that never appeared in your training or evaluation data.

This matters for several reasons that don't show up in benchmark scores. Latency under load behaves differently than latency in a test harness. Users respond to model outputs in ways that create feedback loops — clicking, abandoning, rephrasing — that reveal quality signals no labeled dataset can replicate.

And distribution shift, the gradual drift between the world your model was trained on and the world it now operates in, is only detectable once the model is actually running against live traffic. Offline evaluation is structurally blind to all of this.

Shadow deployments, gradual rollouts, and A/B tests carry different risk profiles

There is no single approach to online evaluation — the right method depends on how much risk you're willing to accept and how much signal you need.

Shadow deployments are the lowest-risk entry point. The new model runs silently alongside the production model, scoring the same inputs without exposing users to its outputs. You compare results across real traffic before any user sees a difference. This is useful for catching gross failures before they reach production, but it doesn't tell you how users would actually respond to the new model's behavior.
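To make the pattern concrete, here is a minimal sketch of a shadow call, assuming both models can be invoked as plain functions. The names `serve_with_shadow` and `comparison_log` are hypothetical, and a real system would more likely run the shadow call asynchronously so that it adds no latency for the user.

```python
def serve_with_shadow(request, production_model, shadow_model, comparison_log):
    """Return the production output; record the shadow output for later comparison."""
    response = production_model(request)
    try:
        shadow_response = shadow_model(request)
        comparison_log.append({
            "request": request,
            "production": response,
            "shadow": shadow_response,
            "agree": shadow_response == response,
        })
    except Exception as exc:
        # A failing shadow model must never affect what the user receives.
        comparison_log.append({"request": request, "production": response, "shadow_error": repr(exc)})
    return response

# Hypothetical stand-ins for two model versions.
def production_model(text):
    return "positive" if "good" in text else "negative"

def shadow_model(text):
    return "positive" if "good" in text or "great" in text else "negative"

log = []
served = [serve_with_shadow(t, production_model, shadow_model, log) for t in ["good", "great", "bad"]]
disagreements = [entry for entry in log if not entry.get("agree", True)]
print(served, len(disagreements))
```

The user only ever sees the production output; the log is what you review before deciding whether the candidate deserves real exposure.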

Gradual rollouts are the middle step, and not one to skip: they are the primary mechanism for learning from real behavior while containing risk. A phased rollout exposes the new model to a small, controlled percentage of real users, then expands incrementally as engagement data confirms the model is performing as expected, balancing real-world signal against a limited blast radius for any failure.
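One common way to implement percentage-based exposure is deterministic hashing of the user ID. The sketch below is an illustration under that assumption, not a description of any specific feature-flag product; `in_rollout` and the experiment key are hypothetical names.

```python
import hashlib

def in_rollout(user_id, experiment_key, rollout_percent):
    """Deterministically decide whether a user falls inside the rollout percentage."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < rollout_percent / 100

users = [f"user-{i}" for i in range(10_000)]
exposed_at_5 = {u for u in users if in_rollout(u, "new-ranker", 5)}
exposed_at_20 = {u for u in users if in_rollout(u, "new-ranker", 20)}
print(len(exposed_at_5), len(exposed_at_20))
```

Because the bucket depends only on the user and the experiment key, a user exposed at 5% stays exposed when the rollout expands to 20%, and setting the percentage back to 0 removes everyone at once.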

A/B tests are the highest-signal method and the highest-risk. A defined treatment group receives the new model while a control group receives the existing one, and you measure the difference in business metrics — impressions, clicks, conversion, task completion rates, depending on your product context.

For recommender systems specifically, this is the standard gate between offline validation and production deployment. A/B tests give you the cleanest causal read on business impact, but they require sufficient traffic, sufficient time, and sufficient discipline to interpret correctly.
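For a conversion-style metric, the simplest frequentist read on an A/B test is a two-proportion z-test. The sketch below assumes a fixed sample size decided in advance and uses made-up counts; it is one standard analysis, not necessarily the one your experimentation platform applies.

```python
import math

def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_control = conv_control / n_control
    p_treatment = conv_treatment / n_treatment
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 5.0% conversion in control, 5.8% in treatment.
z, p_value = two_proportion_z_test(500, 10_000, 580, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```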

Novelty effects, small samples, and confounds make online results easier to misread than they look

Running online experiments cleanly is harder than it looks. Three failure modes recur often enough to treat as defaults rather than edge cases.

Novelty effects distort early results. Users behave differently when they first encounter a changed model — sometimes better, sometimes worse — in ways that don't reflect steady-state behavior. Scaling a rollout too quickly based on early signal is a common way to lock in a misleading result. The guidance to start with small traffic percentages and scale only as wins persist reflects hard-won experience with this pattern.

Insufficient sample sizes are a related problem. The pressure to ship creates pressure to call experiments early. An experiment that looks like a win at 10% traffic may reverse at 50% when the sample becomes more representative of the full user population.
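A rough sample-size estimate before launch makes the pressure to call early easier to resist. The sketch below uses the standard normal-approximation formula for comparing two proportions; the baseline rate, the minimum detectable lift, and the default alpha and power are illustrative assumptions.

```python
import math
from statistics import NormalDist

def required_sample_per_arm(baseline_rate, absolute_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift in a rate."""
    p1 = baseline_rate
    p2 = baseline_rate + absolute_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical scenario: 5% baseline conversion, detect a 1 point absolute lift.
n_one_point = required_sample_per_arm(0.05, 0.01)
n_half_point = required_sample_per_arm(0.05, 0.005)
print(n_one_point, n_half_point)
```

Halving the lift you want to detect roughly quadruples the required sample, which is why small expected improvements need long experiments.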

Confounding variables are harder to control in production than in a notebook. Seasonality, concurrent product changes, and external events can all contaminate results in ways that are difficult to detect after the fact. Online experiments work on averages, which means they can also mask harm to specific user subsets even when aggregate metrics look healthy.

Guardrails and rollback capability are what separate responsible online eval from naive online eval

Given these risks, responsible online evaluation requires more than good intentions. Infrastructure load testing before any online eval begins is a prerequisite, not an afterthought — you need to know your system can handle the new model's latency profile under real traffic before users depend on it.

Gradual rollout percentages should start small and expand only when the data supports it. Guardrail metrics, the metrics that must not degrade even if the primary metric improves, provide an automated check against wins that hide a regression elsewhere. If your primary metric is task completion rate and your guardrail is error rate, a model that lifts completions while pushing the error rate up will trip the guardrail before it ships widely.

Instant rollback capability is the final safety layer. The ability to deactivate an underperforming model immediately, without a deployment cycle, is what makes gradual rollouts genuinely low-risk rather than just incrementally slow.
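The guardrail and rollback logic can be sketched as a small decision function. This is a simplified illustration that assumes every guardrail metric is "lower is better" and compares point estimates only; the metric names, thresholds, and the `rollout_decision` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_increase: float  # 0.05 means the metric may rise at most 5% over control

def breached_guardrails(control, treatment, guardrails):
    """List guardrail metrics where treatment exceeds the allowed limit."""
    breached = []
    for g in guardrails:
        limit = control[g.metric] * (1 + g.max_relative_increase)
        if treatment[g.metric] > limit:
            breached.append(g.metric)
    return breached

def rollout_decision(control, treatment, primary_metric, guardrails):
    """Guardrails are checked first: a primary-metric win cannot override a breach."""
    if breached_guardrails(control, treatment, guardrails):
        return "rollback"
    if treatment[primary_metric] > control[primary_metric]:
        return "expand"
    return "hold"

guardrails = [Guardrail("error_rate", 0.10), Guardrail("p95_latency_ms", 0.05)]
control = {"task_completion": 0.62, "error_rate": 0.020, "p95_latency_ms": 800}
treatment = {"task_completion": 0.66, "error_rate": 0.026, "p95_latency_ms": 810}

print(rollout_decision(control, treatment, "task_completion", guardrails))
```

In this example the treatment improves task completion but raises the error rate by 30%, so the decision is a rollback. A production version would also account for statistical uncertainty before triggering.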

Teams at companies like Character.AI have used this combination of gradual exposure and real-time model comparison to guide research decisions based on actual user outcomes — not just eval scores. That combination of controlled exposure, live measurement, and fast rollback is what separates online evaluation done responsibly from online evaluation done naively.

The sequence between offline and online evaluation has a directional logic that teams routinely ignore

Offline and online evaluation are not competing philosophies you choose between based on team preference or resource constraints. They are sequential stages of a single evaluation program, and the sequence has a directional logic that matters.

Offline is fast and cheap — it gives you iteration speed and regression safety. Online is slow and expensive — it gives you confirmation that improvements actually hold against real users and real business outcomes. Running them out of order, or treating them as parallel options without a gate between them, wastes resources in both directions.

As Label Studio puts it directly: "Offline testing gives you repeatability and fast iteration. Online testing gives you confidence that those improvements hold in production." That's not a preference statement — it's a description of what each method is structurally capable of delivering.

Treat the sequence as a discipline, not a default

The reason teams collapse offline and online evaluation into a single undifferentiated process is usually speed pressure. A new model version looks promising, the team wants to ship, and the offline eval feels like a formality before the "real" test in production.

This logic inverts the cost structure. Online experiment slots are finite, slow to run, and carry real user risk. Spending them on candidates that could have been eliminated with a cheap offline regression test is a process failure, not a shortcut.

The sequence should be treated as a gate, not a guideline. Candidates that regress on the offline evaluation set don't advance to production testing. Full stop. This discipline is what makes the online stage meaningful — by the time you're running a production experiment, you've already filtered out the noise.

The offline filtering stage: eliminate before you validate

Offline evaluation's primary role in the sequence is elimination. Its value is not in telling you which model is best — it's in telling you which candidates are not worth testing against real users. Because offline scoring is repeatable and fast, you can evaluate multiple model versions in parallel before any user is exposed to any of them.

This is where regression benchmarks earn their keep. A stable evaluation set lets you quickly surface whether a new model version degrades on behaviors the previous version handled correctly — a capability that's genuinely difficult to replicate in production without exposing users to regressions.

The offline stage is also where automated scoring for well-defined tasks (classification accuracy, extraction quality, ranking performance) provides the most reliable signal, because the ground truth is clear and the measurement is consistent across runs.

The key discipline here is defining the offline pass threshold before you start scoring. Teams that define "good enough" after seeing the results are optimizing for confirmation, not evaluation.
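One way to enforce that discipline is to commit the pass criteria to code before any candidate is scored. The sketch below is a minimal illustration; the threshold values, model names, and scores are hypothetical placeholders, not recommended defaults.

```python
# Pass criteria are fixed before any candidate is scored.
MIN_SCORE = 0.40             # absolute floor on the offline metric
MAX_DROP_VS_BASELINE = 0.01  # tolerated regression relative to the current model

def passes_offline_gate(candidate_score, baseline_score):
    """True only if the candidate clears the floor and does not regress."""
    clears_floor = candidate_score >= MIN_SCORE
    no_regression = candidate_score >= baseline_score - MAX_DROP_VS_BASELINE
    return clears_floor and no_regression

def candidates_for_online_test(scores, baseline_score):
    """Filter a {name: score} mapping down to candidates worth an experiment slot."""
    return [name for name, score in scores.items() if passes_offline_gate(score, baseline_score)]

offline_scores = {"model_a": 0.46, "model_b": 0.41, "model_c": 0.38}
advancing = candidates_for_online_test(offline_scores, baseline_score=0.43)
print(advancing)
```

Here `model_b` clears the floor but regresses against the baseline, and `model_c` misses the floor, so only `model_a` advances to a production experiment.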

The online validation stage: confirm against real behavior

Once candidates pass the offline gate, online experiments answer the question that offline cannot: do these improvements hold when real users, with their actual distribution of inputs and intentions, interact with the model under production conditions?

This stage should begin conservatively — a small initial traffic percentage, expanded incrementally based on engagement data. Gradual rollout tooling is built for exactly this pattern, allowing teams to roll out model changes incrementally while monitoring whether the expected improvements materialize at scale.

Equally important is the ability to roll back instantly if a candidate that looked strong offline starts degrading production metrics — that rollback capability is what makes teams willing to run online experiments at all.

One practical tool for the online stage is sequential testing. In a standard fixed-horizon A/B test, you set a required sample size before you start and wait until you hit it; checking repeatedly and stopping at the first significant result inflates your false positive rate.

Sequential testing adjusts the analysis so that you can check results continuously and stop as soon as the evidence is strong enough, without that statistical penalty. For teams running many model experiments, this can meaningfully compress decision timelines.
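To show the idea of a stopping rule that is valid at every look, here is a sketch of Wald's sequential probability ratio test (SPRT) on a stream of conversions. This is a deliberately simple stand-in: it tests one arm against two fixed candidate rates, and it is not the method any particular experimentation platform uses for two-arm comparisons. The rates and error levels are hypothetical.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: rate = p0 versus H1: rate = p1, with p1 > p0.

    Returns (decision, observations_used). The decision is "accept_h1",
    "accept_h0", or "continue" if neither boundary was crossed.
    """
    upper = math.log((1 - beta) / alpha)  # crossing this favors H1
    lower = math.log(beta / (1 - alpha))  # crossing this favors H0
    llr = 0.0  # running log-likelihood ratio
    for n, converted in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if converted else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)

# Hypothetical stream: baseline rate 5%, hoped-for rate 8%.
decision, used = sprt_bernoulli([0] * 200, p0=0.05, p1=0.08)
print(decision, used)
```

The boundaries are fixed up front from the chosen error rates, so checking after every observation does not inflate the false positive rate the way repeated peeking at a fixed-horizon test does.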

Define stage-appropriate metrics before you start

The sequencing only works if the metrics governing each stage are defined before evaluation begins. Offline metrics — automated scoring, regression benchmarks, ranking quality measures — govern the filtering stage. Online metrics — task completion rates, engagement signals, conversion — govern the validation stage. These are different questions answered by different instruments, and conflating them at either stage produces noise.

The transition between stages is also the moment to define guardrail metrics for the online experiment: the metrics that must not degrade even if the primary metric improves. A model that lifts average engagement while degrading performance for a user subset, or that improves a headline metric while increasing latency, has not passed online validation — it has passed a partial test. Guardrail metrics are what make the online stage a genuine confirmation rather than a selective one.

Put simply, offline evals are necessary, but they are the filter, not the verdict. The production experiment is where the value of a model change is actually confirmed.

Metric selection is where AI evaluation programs quietly fall apart

The usual cause is convenience: teams pick metrics that are easy to compute, not metrics that answer the question they actually care about.

The result is a feedback loop that optimizes for measurement convenience rather than model quality or user outcomes. Getting this right requires treating offline vs. online AI evaluation as distinct measurement contexts — each with its own appropriate signals, failure modes, and design constraints.

Offline metrics work best when correctness is definable

Offline evaluation produces reliable signals when you can draw a clear line between right and wrong. Classification, extraction, detection, and ranking tasks fit this profile well: you can build a labeled evaluation set, score model predictions against it, and trust that an improved score reflects a genuinely better model.

The critical qualifier is that "correctness" must be unambiguous. The moment outputs become interpretive — a summary that's technically accurate but poorly structured, a recommendation that's relevant but not personalized — offline metrics require additional scaffolding to remain valid. Without it, you're scoring something that doesn't map cleanly to what you care about.

The rubric design problem for subjective AI outputs

Generative AI outputs break the assumptions that make offline metrics tractable. Tone, helpfulness, and policy adherence can't be scored reliably without a deliberately designed rubric and labeled examples that capture those qualities.

The rubric doesn't emerge from the dataset — it requires intentional investment in defining what "good" looks like, building labeled examples through human review, and validating that raters apply the rubric consistently.

Skip any of those steps and your automated scoring produces a number that feels rigorous but measures something loosely correlated with quality at best. This is the core reason offline benchmarking for LLM outputs remains genuinely hard: the measurement problem isn't solved by having more data, it's solved by having better-defined criteria.
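Rater consistency can be checked with an agreement statistic before you trust rubric-based labels. The sketch below computes Cohen's kappa for two raters; the labels are invented for illustration, and what counts as acceptable agreement is a judgment call for your team.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n ** 2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical "helpfulness" labels from two reviewers on the same eight responses.
rater_a = ["good", "good", "good", "bad", "bad", "good", "bad", "good"]
rater_b = ["good", "good", "bad", "bad", "bad", "good", "good", "good"]

print(f"kappa = {cohens_kappa(rater_a, rater_b):.3f}")
```

The two raters agree on 75% of items here, but kappa is much lower because a good share of that agreement would be expected by chance. A low value suggests the rubric needs tightening before its scores are automated.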

Online metrics must connect to user outcomes

Online metrics should not measure model behavior in isolation. The question isn't whether the model produces better outputs in the abstract — it's whether users accomplish more, engage more meaningfully, or achieve better outcomes because of the model. The right orientation is data that links directly to user outcomes: the metric should be downstream of the model, not a proxy for it.

What that means in practice depends on product context. A search system cares about whether users find what they're looking for — impressions and clicks are a starting point, but task completion and query reformulation rates tell a more honest story. A conversational assistant should be measured on whether users successfully accomplish what they came to do.

A content recommendation system might track session depth or return visits. The specific metric matters less than the discipline of asking: does this number reflect what users actually got out of the interaction?

Landon Smith, Head of Post-Training at Character.AI, described the value of this orientation directly: "We can compare different modeling techniques from the perspective of our users — guiding our research in the direction that best serves our product." That framing — model decisions evaluated through user outcomes — is the right mental model for online metric selection.

The averaging problem and subset harm

Aggregate experiment results are averages, and averages hide a lot. A model change that improves the primary metric across the full user population can simultaneously degrade the experience for a specific subgroup — and that harm won't surface unless you're explicitly measuring for it.

GrowthBook's experimentation documentation is unusually candid on this point: "Experimentation results work on averages, and this can hide a lot of systemic biases that may exist. There can be a tendency for algorithmic systems to 'learn' or otherwise encode real-world biases in their operation, and then further amplify/reinforce those biases."

Short-term metric gains compound this problem. Optimizing for engagement or conversion without tracking longer-term signals can produce what looks like a win but is actually a dark pattern — "where you inadvertently exploit user trust to boost numbers temporarily at the expense of long-term retention." The metric was right; the measurement window was wrong.

The practical response is to segment experiment results by meaningful user dimensions and define in advance which subgroup outcomes you're committed to protecting. This isn't just an ethical consideration — it's a measurement validity concern. A result that improves the average while harming a subgroup is not a reliable signal that the model change was good.
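Segmenting results can be as simple as computing the lift per segment alongside the overall lift. The sketch below uses invented counts and hypothetical segment names, and it reports raw differences only; a real analysis would add uncertainty estimates for each segment.

```python
def conversion_rate(conversions, users):
    return conversions / users

def segment_report(results):
    """Absolute lift (treatment minus control) per segment and overall.

    results maps segment -> {"control": (conversions, users),
                             "treatment": (conversions, users)}.
    """
    report = {}
    totals = {"control": [0, 0], "treatment": [0, 0]}
    for segment, arms in results.items():
        for arm, (conversions, users) in arms.items():
            totals[arm][0] += conversions
            totals[arm][1] += users
        report[segment] = conversion_rate(*arms["treatment"]) - conversion_rate(*arms["control"])
    report["overall"] = conversion_rate(*totals["treatment"]) - conversion_rate(*totals["control"])
    return report

# Hypothetical experiment: the average improves while new users are worse off.
results = {
    "returning_users": {"control": (300, 1000), "treatment": (360, 1000)},
    "new_users": {"control": (50, 500), "treatment": (40, 500)},
}
report = segment_report(results)
harmed = [seg for seg, lift in report.items() if seg != "overall" and lift < 0]
print(report["overall"] > 0, harmed)
```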

Guardrail metric design

Guardrail metrics exist to catch the cases where a primary metric improves while something else quietly degrades. They're not optional secondary metrics; they're constraints on what counts as a valid win.

The clearest illustration of why they matter is the revenue-versus-retention pattern: an experiment drives purchasing behavior, the revenue metric goes up, and the team ships the change — but retention is declining in parallel, indicating that the purchasing behavior was extracted at the cost of user trust.

GrowthBook's Metric Correlations feature surfaces exactly this kind of trade-off, flagging when experiments are driving purchasing behavior while simultaneously making the product worse. Without a guardrail metric on retention, that signal gets missed entirely.

Guardrail metrics also play a structural role in safe rollout design. Connecting rollout decisions to warehouse-backed guardrail metrics — with automated rollback triggers when those metrics breach defined thresholds — turns guardrails from a post-hoc analysis concern into an active safety mechanism.

The discipline of defining guardrail metrics before an experiment runs, not after results come in, is what separates evaluation programs that catch problems from those that rationalize them.

Putting it together: the evaluation logic that holds under pressure

The core argument of this article is simple, even if the execution isn't: offline and online evaluation answer different questions, and your program only works if you treat them that way.

Offline eval is a filter — fast, cheap, and essential for catching regressions before they reach users. Online eval is a confirmation — slow, expensive, and the only way to know whether your improvements actually hold against real behavior. The sequence matters. The gate between them matters. And the metrics you define at each stage, before you start scoring, are what determine whether your results are trustworthy or just reassuring.

Most evaluation programs fail because they're rigorous about the wrong question

Before you pick a method or a metric, get clear on what you're actually asking. "Is this model better?" is not a question — it's two questions wearing the same coat. Better at predicting historical data? Better at producing outcomes for real users?

Those require different instruments, different timelines, and different standards of evidence. Most evaluation programs go wrong not because teams lack rigor, but because they're rigorous about the wrong question.

When offline signal is trustworthy and when it isn't: a decision rule

If your output has a clear ground truth and your dataset is chronologically split, offline metrics will give you reliable signal. If your output is generative or subjective, you need a rubric before your offline scores mean anything.

And if your offline metrics look strong but you haven't confirmed against live traffic, you have a promising candidate — not a shippable model. The transition from offline to online isn't a formality; it's the moment you find out whether your eval set was actually measuring what matters.

The minimum viable evaluation stack for AI product teams

You don't need a sophisticated evaluation program to start. You need a stable offline eval set with a defined pass threshold, a way to run gradual rollouts against real users, and at least one guardrail metric that tells you when a headline win is hiding a loss somewhere else.

A combination of feature flags, gradual rollouts, and warehouse-native metric tracking gives teams a practical foundation for the online stage without requiring a bespoke experimentation platform.

This article is meant to be genuinely useful — not just a framework to nod at, but a way of thinking you can apply to the next model decision your team faces.

What to do next: Start by auditing your current offline eval set. Ask two questions: Is it chronologically split, or randomly sampled? And does your pass threshold exist before you run scoring, or does it emerge after you see the numbers? If either answer is uncomfortable, fix that before you touch your online experimentation setup. A well-constructed offline gate is the highest-leverage improvement most teams can make, and it costs very little to run.
