How to measure "quality" in AI outputs (beyond accuracy)

Your accuracy score is green.

Your users are quietly churning. Both things can be true at the same time — and if you're only watching one metric to measure AI output quality, you probably won't notice until the retention data catches up.

The core problem is that AI output quality isn't a single thing. It's at least seven independent dimensions: usefulness, relevance, hallucination rate, task completion, safety, latency, and user satisfaction. Each one can move in a different direction after a model update.

A change that improves response speed can quietly degrade groundedness. A safety tuning pass can tank task completion rates. A model that scores well on benchmarks can still fail three out of four experienced users in production — which is exactly what a METR study of AI coding tools found. Single-score evaluation doesn't surface these trade-offs. It hides them.

This article is for engineers, PMs, and data teams who are shipping AI features and want a more honest picture of what's actually working. Here's what it covers:

  • Why accuracy and other traditional metrics break down for generative AI
  • The seven quality dimensions that actually matter — and why they move independently
  • The three-layer evaluation stack (automated metrics, human review, and behavioral signals) and what each layer can and can't measure
  • How to close the gap between model-level scores and real user outcomes
  • How to build continuous monitoring so quality degradation is detectable before users report it

Each section builds on the last, moving from the structural problem with single-metric evaluation to the practical infrastructure for measuring AI output quality in production — including where tools like GrowthBook fit into the monitoring and rollback layer. By the end, you'll have a framework you can actually act on.

Why accuracy alone is a misleading proxy for AI output quality

If your team is currently evaluating AI outputs using accuracy, F1-score, precision, or recall, that's a reasonable place to have started. These metrics have decades of research behind them, they're well-understood, and they integrate cleanly into existing ML pipelines.

The problem isn't that accuracy is a bad metric — it's that it was built for a fundamentally different kind of system, and applying it to generative AI creates blind spots that can quietly undermine your product while your dashboards look healthy.

How accuracy metrics were designed — and what they were designed for

Accuracy measures the ratio of correct predictions to total predictions. It was designed for classification tasks: spam or not spam, fraud or not fraud, cat or dog. These are bounded-output systems where every prediction can be checked against a discrete ground-truth label. The math is clean, the interpretation is intuitive, and for the right problem, it works.

Even in its native context, though, accuracy is fragile. Consider a fraud detection model that predicts "no fraud" for every single transaction. On a dataset where 99% of transactions are legitimate, that model achieves 99% accuracy and completely fails at its actual purpose, because it never flags a single fraudulent transaction.

This isn't a generative AI problem; it's a structural flaw in accuracy as a metric when class distributions are imbalanced. Generative AI doesn't introduce a new problem so much as it amplifies an existing one into something much harder to ignore.
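The fraud example can be reproduced in a few lines. This is a minimal sketch on a made-up dataset (10,000 transactions, 1% fraud); the `accuracy` and `recall` helpers are written here for illustration and are not taken from any particular library.

```python
# Hypothetical dataset: 10,000 transactions, 1% fraudulent (label 1 = fraud).
labels = [1] * 100 + [0] * 9_900

# A "model" that predicts "no fraud" for every transaction.
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
    """Share of predictions that match the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives the model caught."""
    actual_positives = sum(t == positive for t in y_true)
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    return true_positives / actual_positives if actual_positives else 0.0

print(f"accuracy: {accuracy(labels, predictions):.2%}")  # accuracy: 99.00%
print(f"recall:   {recall(labels, predictions):.2%}")    # recall:   0.00%
```

The accuracy number looks excellent while recall on the class you care about is zero, which is the whole problem in miniature.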

Why generative AI breaks the assumptions accuracy relies on

Classical accuracy requires a ground-truth label to compare against. Generative AI rarely has one. When you ask a language model to summarize a document, draft a response to a customer complaint, or explain a technical concept, there are thousands of valid outputs — and no single correct answer. The entire premise of the metric collapses.

This isn't just a theoretical concern. A Deloitte survey from mid-2025 found that one-third of generative AI users had already encountered incorrect or misleading answers from models that score well in controlled lab evaluations. The lab-to-production gap is real, and accuracy scores don't surface it.

A model can be "95% accurate" in benchmark conditions and still fail users through slowness, inconsistency, hallucination, or contextual inappropriateness — none of which the accuracy score will flag.

The METR study of experienced open-source developers using AI coding tools reinforces this from a practitioner angle: three out of four participants saw reduced performance when using AI assistance, despite the productivity gains the tools promise. Model capability, as measured internally, did not predict real-world user outcomes.

The gap between what a model scores and what it delivers is not a rounding error — it's a structural feature of how these systems work.

The hidden cost of relying on a single score

When a single metric becomes the primary signal for AI output quality, it doesn't just fail to capture what it misses — it actively creates a false sense of security. A composite accuracy score can remain stable while latency degrades, hallucination rates climb, and user trust quietly erodes.

The erosion part is particularly dangerous. Most users won't file a bug report when an AI output is subtly wrong, off-tone, or unhelpful. They'll just stop using the feature. By the time that shows up in retention data, the accuracy score has been green for months. The signal you were monitoring was never measuring the failure mode that mattered.

This is the core structural problem: accuracy collapses multiple independent failure modes — safety, relevance, groundedness, task completion, latency — into a single number. Optimizing that number doesn't mean you've addressed those dimensions. It means you've made them invisible.

The implication isn't that accuracy should be discarded. It's that it needs to be understood as one narrow signal within a broader measurement stack. What that stack looks like — and which dimensions it needs to cover — is where the real work of measuring AI output quality begins.

The seven dimensions that actually define AI output quality

The instinct to reduce quality to a single score is understandable. Single scores are easy to track, easy to report upward, and easy to compare across model versions. The problem is that AI output quality is not a single thing.

It is a composite of at least seven independent dimensions — usefulness, relevance, hallucination rate, task completion, safety, latency, and user satisfaction — each of which can move in a different direction at the same time. A model update that improves one dimension can silently degrade another, and if you're only watching one metric, you won't see it happening.

This is not a new insight. In 1996, researchers Richard Y. Wang and Diane M. Strong published "Beyond Accuracy: What Data Quality Means to Data Consumers," identifying fifteen distinct quality dimensions for data and arguing that accuracy alone was an insufficient proxy for the rest.

Thirty years later, teams building AI-powered products are rediscovering the same structural problem: the dimensions are independent, they trade off against each other, and collapsing them into one number destroys the signal you need to make good decisions.

Each quality dimension can degrade without moving the others

Each quality dimension captures something the others don't. Usefulness measures whether the output actually helps the user accomplish what they came to do. Relevance measures whether the response addresses the specific question asked rather than a plausible-sounding adjacent one. Hallucination rate tracks how often the model generates confident-sounding content that is factually wrong or unsupported.

Task completion is distinct from relevance: a response can be on-topic and still fail to complete the user's underlying goal. Safety captures whether the output avoids harmful, biased, or policy-violating content. Latency measures how quickly the response is delivered. And user satisfaction aggregates the user's subjective experience, which often diverges from any of the objective dimensions above.

The critical point is that these dimensions are not correlated by default. A model tuned aggressively for safety may refuse valid requests, driving task completion rates down. A model optimized for low latency by truncating responses may produce outputs that score well on speed but fail on usefulness. A model with a low hallucination rate on benchmark datasets may still generate contextually inappropriate responses that tank user satisfaction in production.

The trade-off problem in practice

When teams optimize for a single dimension without monitoring the others, they tend to create what looks like improvement but is actually a transfer of degradation to an unmeasured axis. This is the same dynamic that GrowthBook's Metric Correlations feature is designed to surface: the platform explicitly asks whether maximizing one key metric is coming at the cost of another.

The docs describe a scenario where an experiment drives purchasing behavior — revenue goes up — while retention goes down, flagging this as a potential dark pattern. The same logic applies directly to AI quality dimensions. An experiment that improves response speed while quietly reducing groundedness is not a quality improvement; it's a trade-off that hasn't been acknowledged yet.
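One way to make that kind of trade-off visible is to compare every dimension between a control and a variant and list what improved next to what degraded. The sketch below is illustrative only: the dimension names, the sample scores, and the 2% change threshold are assumptions, and it does not represent how any particular platform computes metric correlations.

```python
# Direction of "better" for each dimension (assumed names and conventions).
HIGHER_IS_BETTER = {
    "usefulness": True,
    "relevance": True,
    "hallucination_rate": False,
    "task_completion": True,
    "safety": True,
    "latency_ms": False,
    "user_satisfaction": True,
}

def find_tradeoffs(control, variant, min_relative_change=0.02):
    """Return (improved, degraded) dimension lists for a variant vs. control."""
    improved, degraded = [], []
    for dim, higher_better in HIGHER_IS_BETTER.items():
        base, new = control[dim], variant[dim]
        if base == 0:
            continue  # relative change is undefined; handle separately in real use
        change = (new - base) / abs(base)
        if abs(change) < min_relative_change:
            continue  # treat small movements as noise
        got_better = (change > 0) == higher_better
        (improved if got_better else degraded).append(dim)
    return improved, degraded

# Made-up scores: the variant is faster but hallucinates more often.
control = {"usefulness": 0.71, "relevance": 0.88, "hallucination_rate": 0.04,
           "task_completion": 0.64, "safety": 0.99, "latency_ms": 1200,
           "user_satisfaction": 4.1}
variant = dict(control, latency_ms=900, hallucination_rate=0.06)

improved, degraded = find_tradeoffs(control, variant)
print("improved:", improved)  # improved: ['latency_ms']
print("degraded:", degraded)  # degraded: ['hallucination_rate']
```

A result with entries in both lists is the signal to stop and decide whether the trade is acceptable before calling the change an improvement.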

The METR study of experienced open-source developers using AI coding tools illustrates this at the practitioner level. Three out of four participants saw reduced performance when using AI assistance, despite the expectation of productivity gains. The developers were, in effect, optimizing for one dimension — code generation speed — while degrading others, including actual task quality and accuracy.

One developer who did see positive results had more than fifty hours of experience with the tool, suggesting that even "task completion" as a dimension is context-sensitive and cannot be read from throughput metrics alone.

Volume metrics as a masking problem

High-volume metrics — tokens generated, responses per second, code lines written — are particularly dangerous proxies because they measure activity, not quality. As Collibra notes in the context of data quality: "You cannot rely on just one metric."

When applied to AI outputs, this means that a team watching throughput while ignoring hallucination rate, task completion, and user satisfaction is not measuring quality at all. They are measuring production volume, which can increase while quality across every meaningful dimension simultaneously declines. The dimensions don't average out. They have to be tracked separately, with explicit attention to the trade-offs between them, or the picture you're building is incomplete by design.

Three evaluation methods, each measuring what the others miss

The instinct to reach for automated metrics when evaluating AI output is correct — but it's incomplete. No single evaluation method covers the full surface area of quality dimensions like groundedness, contextual appropriateness, task completion, and user satisfaction.

What teams actually need is a layered stack where each method is assigned to the quality signals it can genuinely measure, with defined handoffs between layers. The question isn't which method to use. It's understanding what each one captures, where each one breaks, and how they work together.

Automated evaluation frameworks: fast, scalable, and narrow

Automated metrics come from two traditions. Precision, recall, and F1 were designed for bounded-output tasks such as classification and structured extraction. BLEU, ROUGE, and perplexity come from machine translation, summarization, and language modeling, where outputs are scored against reference texts or a held-out corpus. For those tasks, they provide objective performance data and enable direct comparison between model versions. They run continuously, integrate cleanly into CI/CD pipelines, and scale without human intervention.

The problem is structural. When you ask a generative model to summarize a document or answer a question, there are dozens of valid responses — all slightly different, all potentially correct. That flexibility is what makes generative models useful. It's also what breaks automated scoring.

Most automated metrics work by comparing the model's output against a reference answer and measuring how closely they match. BLEU, for example, counts how many words or phrases the model's response shares with the reference. That tells you something about surface similarity — but it can't tell you whether the response actually makes sense in context, whether it's appropriate for the specific user asking, or whether it would help someone get their job done.

A model that scores well on these metrics might still produce outputs that confuse users, miss the point of the question, or confidently state something false. Automated scoring is fast and scalable, but it's measuring a narrow slice of quality — not the full picture.
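To see why surface overlap is a weak proxy, here is a deliberately simplified overlap score. It is clipped unigram precision, one ingredient of BLEU, not the full BLEU metric (no higher-order n-grams, no brevity penalty), and the sentences are invented for the example.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference (clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / total

reference = "the refund was issued on monday"
wrong_but_similar = "the refund was denied on monday"
right_but_reworded = "we sent your money back at the start of the week"

print(unigram_precision(wrong_but_similar, reference))   # 5 of 6 words match
print(unigram_precision(right_but_reworded, reference))  # only one "the" is credited, 1 of 11
```

The factually wrong sentence scores far higher than the correct paraphrase, because the score only sees shared words.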

Human-in-the-loop review: the calibration layer

Human review is the only method that can reliably evaluate dimensions like tone, nuanced usefulness, safety in context, and whether a response is appropriate for a specific user situation. Automated metrics cannot score these signals. For that reason, human review remains the gold standard for evaluating the quality dimensions that matter most.

The constraint is scalability. Human review cannot scale across large prompt datasets or high-output production systems — it's expensive, slow, and inconsistent without structure. This means human review should function as a calibration layer, not a primary monitoring mechanism. When automated scores shift unexpectedly, human review validates what's actually happening. When a new model variant ships, human review samples outputs across representative prompt categories before full rollout.
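Sampling across representative prompt categories can be as simple as a stratified draw. The following is a rough sketch, assuming each logged output carries a `category` field; the field name, categories, and sample size are placeholders rather than a recommended schema.

```python
import random
from collections import defaultdict

def sample_for_review(outputs, per_category: int, seed: int = 0):
    """Pick up to `per_category` outputs from each prompt category."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for item in outputs:
        by_category[item["category"]].append(item)
    sample = []
    for category in sorted(by_category):
        items = by_category[category]
        k = min(per_category, len(items))  # small categories contribute what they have
        sample.extend(rng.sample(items, k))
    return sample

# Hypothetical logged outputs awaiting review.
outputs = [
    {"id": 1, "category": "billing"},
    {"id": 2, "category": "billing"},
    {"id": 3, "category": "billing"},
    {"id": 4, "category": "troubleshooting"},
    {"id": 5, "category": "account"},
    {"id": 6, "category": "account"},
]
review_batch = sample_for_review(outputs, per_category=2)
print(len(review_batch))  # 5: two billing, two account, one troubleshooting
```

Sampling per category rather than uniformly keeps rare but risky prompt types from being drowned out by the most common ones.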

When human review is conducted, it should be systematic. The VALID-AI framework from Toronto Metropolitan University offers a practitioner-facing checklist: validate data quality and representativeness, analyze algorithm behavior including bias in reinforcement learning from human feedback, account for legal and ethical considerations, assess interpretability and transparency, and evaluate AI-specific behaviors like refusal patterns and prompt sensitivity.

Ad hoc human review produces inconsistent signals. Structured human review produces calibration data that improves the automated layer over time.

Behavioral and engagement signals: the closest proxy to real value

Behavioral signals — what users actually do after receiving AI output — are the most direct proxy for quality dimensions like usefulness, task completion, and user satisfaction. Session engagement, retry and regeneration rates, feature adoption, task completion rates, and downstream retention metrics all reflect whether AI output is actually working in the context users care about.
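Two of these signals, regeneration rate and task completion rate, can be derived from a plain event log. This is a sketch under assumed conventions: the event names (`response_shown`, `regenerate`, `task_completed`, `abandoned`) and the log shape are hypothetical and would need to match whatever your product actually emits.

```python
from collections import defaultdict

# Hypothetical event log: one row per user action in an AI-assisted session.
events = [
    {"session": "s1", "type": "response_shown"},
    {"session": "s1", "type": "regenerate"},
    {"session": "s1", "type": "response_shown"},
    {"session": "s1", "type": "task_completed"},
    {"session": "s2", "type": "response_shown"},
    {"session": "s2", "type": "abandoned"},
    {"session": "s3", "type": "response_shown"},
    {"session": "s3", "type": "task_completed"},
]

def behavioral_signals(events):
    """Compute regenerations per shown response and the share of sessions completed."""
    counts = defaultdict(int)
    sessions, completed = set(), set()
    for e in events:
        counts[e["type"]] += 1
        sessions.add(e["session"])
        if e["type"] == "task_completed":
            completed.add(e["session"])
    shown = counts["response_shown"]
    return {
        "regeneration_rate": counts["regenerate"] / shown if shown else 0.0,
        "task_completion_rate": len(completed) / len(sessions) if sessions else 0.0,
    }

signals = behavioral_signals(events)
print(signals["regeneration_rate"])     # 0.25 (1 regenerate over 4 responses shown)
print(signals["task_completion_rate"])  # 2 of 3 sessions completed
```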

This is also the layer where the most dangerous assumptions live. The METR study finding holds here too: high-volume behavioral signals from early adoption periods can be systematically misleading, because users unfamiliar with a tool depress performance independently of model quality. Aggregate engagement metrics that aren't segmented by task type mislead in a second way: AI output quality looks very different on well-documented, common tasks versus complex, niche ones.

A feature flag rollout approach to AI evaluation is built around this behavioral layer — using gradual rollouts with engagement data monitoring, A/B testing model variants against user behavior outcomes, and statistical guardrails to trigger deactivation if quality signals degrade.

Landon Smith, Head of Post-Training at Character.AI, describes the value directly: "We can compare different modeling techniques from the perspective of our users — guiding our research in the direction that best serves our product." That framing — quality as experienced by users, not scored by automated metrics — is what the behavioral layer is designed to capture.
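The gradual rollout mechanic itself is simple to sketch. The code below is a generic illustration of deterministic percentage bucketing with a kill switch. It is not the GrowthBook SDK, and the function names, feature key, and hashing scheme are all assumptions made for the example.

```python
import hashlib

def bucket(user_id: str, feature_key: str) -> float:
    """Map a user to a stable value in [0, 1) for a given feature."""
    digest = hashlib.sha256(f"{feature_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x1_0000_0000  # first 32 bits, scaled to [0, 1)

def is_enabled(user_id: str, feature_key: str, rollout_fraction: float,
               killed: bool = False) -> bool:
    """Serve the feature to a fixed fraction of users unless it has been deactivated."""
    if killed:
        return False
    return bucket(user_id, feature_key) < rollout_fraction

users = [f"user-{i}" for i in range(10_000)]
share = sum(is_enabled(u, "ai-summary", 0.10) for u in users) / len(users)
print(f"enabled for {share:.1%} of users")  # close to 10%; exact value depends on the hash
```

Because the bucket depends only on the user and the feature key, raising the fraction from 10% to 50% keeps the original 10% enabled, and flipping `killed` turns the feature off for everyone without a deploy.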

Building the stack: assigning each method to what it can actually measure

The practical synthesis is straightforward, even if the execution isn't. Automated metrics run continuously and provide scalable coverage of bounded quality signals — they're the monitoring layer. Human review runs on sampled outputs at defined intervals or when automated scores shift — it's the calibration layer. Behavioral signals captured through production A/B tests and feature flag rollouts provide ground truth on real-world value — they're the outcome layer.

Each layer has a defined role and a defined limitation. Automated scoring catches regressions fast but misses judgment-dependent quality signals. Human review evaluates those signals but can't scale to production volume. Behavioral signals reflect actual user value but can be confounded by adoption curves, task distribution, and user familiarity.

The stack works because each layer compensates for the others' blind spots — not because any one of them is sufficient on its own.

Closing the gap between model-level metrics and real user outcomes

The three-layer evaluation stack described above can tell you a great deal about how a model performs — but there's a measurement crisis hiding inside most AI deployments that even a well-instrumented stack can miss. The problem isn't that organizations lack data — it's that the data they're collecting answers the wrong question. Model-level metrics tell you how the AI system is performing. They don't tell you whether users are actually better off.

The benchmark-to-reality gap

The most clarifying data point on this problem comes from DX's analysis of over 400 companies. Despite a 65% increase in AI tool usage across those organizations, median pull request throughput increased by only 8%. Most companies landed somewhere in the 5–15% range.

That's a real gain — but it's a far cry from the 3x to 10x productivity improvements that some vendors imply in their marketing materials, or the claims that AI is now writing 20–30% of production code.

The gap isn't a fluke. It's structural. Benchmarks are run on controlled tasks with experienced users operating under favorable conditions. Production environments involve fragmented tool use, varied skill levels, and the kind of messy, context-dependent work that controlled evaluations are specifically designed to exclude.

A model that performs well on a benchmark has demonstrated that it can perform well on that benchmark — nothing more.

The METR finding reinforces this directionally: adoption metrics and outcome metrics are different phenomena, and treating one as a proxy for the other is a measurement error.

How operational metrics mislead

Google Cloud has flagged this directly: most organizations are "confusing operational efficiency gains with end state goals." Faster response times, lower token costs, and reduced latency are real improvements to the AI system's performance. They are not evidence that users are accomplishing more meaningful work.

The DX defect rate finding makes this concrete. Some teams have been shipping up to 50% more defects since AI adoption. At the same time, daily AI users are merging 60% more pull requests than non-users. Both of those statistics are true simultaneously — which is precisely the problem.

Volume metrics and quality outcomes can move in opposite directions, and a measurement stack that only captures throughput will read the situation as a success while the product degrades.

Operational metrics measure how the AI system is running. Outcome metrics measure whether users and the business are better off. These are not the same measurement, and they don't reliably correlate.

Instrumenting for downstream outcomes

The practical question is what to measure instead. DX's framework argues that AI performance requires tracking utilization, impact, and cost together — that any single layer in isolation produces a misleading picture. Glean's approach to measuring AI's impact on decision-making goes further, identifying decision quality, time to decision, user confidence, and business outcomes as the metrics that actually matter, explicitly positioning these as a step beyond simple adoption rates.

Google Cloud's four-layer KPI framework offers a useful organizing structure: model quality, system performance, user engagement, and financial impact. Most teams instrument layers one and two. The argument here is that layers three and four are where the actual accountability lives.

The practical implication is that teams need to define what downstream success looks like before deploying AI features — what user behaviors should change, what business metrics should move, and over what time horizon.

Platforms that enable gradual rollouts tied to engagement data and surface metric correlations across experiments — throughput up, quality down, for instance — give teams the ability to observe the trade-offs that single-metric measurement would obscure. For teams whose outcome metrics live in a data warehouse, warehouse-native experiment analysis allows results to be computed directly against that data, keeping the outcome layer connected to the same platform managing rollouts and guardrails.

This is the orientation shift in the Landon Smith quote cited earlier: comparing modeling techniques from the perspective of users. That's the move this section is arguing for: from model-level signals to user-level outcomes, measured in production, with the ability to act on what you find.

Quality is not a one-time eval: building continuous monitoring for AI outputs

Closing the gap between model-level metrics and user outcomes requires not just better measurement at launch, but measurement that persists. Running a thorough evaluation before you ship an AI feature is necessary. It is not sufficient.

The structural problem is that LLM outputs are non-deterministic, model providers push updates without announcement, and the distribution of real user inputs drifts away from your test set the moment you go live. A passing eval score is a snapshot of quality at a specific moment under specific conditions — not a guarantee of what users will experience tomorrow.

Why static evals fail in production

Researchers studying clinical AI deployment found that even after successful integration, ML and AI algorithms are "sensitive to changes in the environment and liable to performance decay." Their recommendation was unambiguous: continuous monitoring and updating, not one-time evaluation. The clinical context is different from a product LLM, but the performance decay logic is identical.

The Claude Code regression in January 2026 is the clearest concrete example of what this looks like in practice. A harness issue was introduced on January 26th and not caught until January 28th. That meant two full days of degraded outputs reaching users before Anthropic's team identified and rolled back the change. The regression was confirmed publicly by an Anthropic team member.

Users lost tokens to lower-quality outputs during that window, and the incident generated significant frustration around both the detection lag and the inconsistent handling of refund requests. This was not a hypothetical failure mode. It involved a named product, specific dates, and documented user impact, and it happened to one of the most closely watched AI products in the industry.

The developer community has drawn the obvious conclusion: practitioners are now building daily benchmark pipelines specifically to track model degradation, because they cannot rely on providers to surface regressions proactively. The tooling is still largely custom-built. The need is real and immediate.

Monitoring for model drift and prompt sensitivity

What teams should watch for in production falls into two categories. The first is model drift — gradual or sudden changes in output quality that result from provider-side updates, fine-tuning changes, or shifts in the underlying model behavior. The second is prompt sensitivity at scale: prompts that performed well in evaluation may behave differently when exposed to the full variance of real user inputs, edge cases your test set never covered, or subtle rephrasing patterns that shift output quality in ways that aggregate metrics won't catch.

A monitoring stack for this needs end-to-end tracing across prompts, model calls, retrieval steps, and tool use — not just output-level scoring. It also needs prompt versioning and evaluation management so that changes to prompts are treated with the same rigor as code changes. The goal is to make quality degradation detectable before users report it, not after.
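One small piece of such a stack is a drift check on a daily benchmark score. The sketch below flags a day whose pass rate falls well below a trailing baseline. The window length, the three-standard-deviation rule, and the noise floor are illustrative defaults, not tuned recommendations, and the history values are made up.

```python
from statistics import mean, pstdev

def drift_alert(history, today, window=14, k=3.0, min_std=0.01):
    """Flag today's score if it falls more than k standard deviations below the trailing mean."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough history to set a baseline
    baseline = mean(recent)
    spread = max(pstdev(recent), min_std)  # floor keeps a very quiet history from over-alerting
    return today < baseline - k * spread

# Hypothetical daily pass rates on a fixed prompt suite.
history = [0.82, 0.81, 0.83, 0.82, 0.80, 0.82, 0.81]

print(drift_alert(history, today=0.80))  # False: within normal variation
print(drift_alert(history, today=0.74))  # True: well below the trailing baseline
```

A check like this only covers sudden drops. Slow drift would need a longer comparison window or a fixed reference baseline alongside it.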

Guardrail metrics and rollback capabilities

Detection without the ability to act quickly is incomplete. The operational complement to monitoring is rollback — and the infrastructure layer that makes rollback practical is guardrail metrics.

Within a feature flagging and experimentation platform, guardrail metrics, gradual rollouts, and rollback capabilities operate as integrated parts of the same workflow — not separate tools bolted together. Guardrail metrics can be configured per-metric with customizable thresholds, so a regression in error rate or response quality triggers an alert before it propagates to your full user base.
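As a rough picture of what a guardrail check does underneath, the sketch below compares failure counts between a control group and a rollout group with a one-sided two-proportion z-test. This is a textbook test chosen for illustration, with an assumed 0.05 threshold; it is not a description of how any specific platform implements its guardrails.

```python
import math

def one_sided_p_value(bad_control: int, n_control: int,
                      bad_treatment: int, n_treatment: int) -> float:
    """P-value for 'treatment failure rate is higher than control' (pooled z-test)."""
    p_c = bad_control / n_control
    p_t = bad_treatment / n_treatment
    pooled = (bad_control + bad_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if se == 0:
        return 1.0  # no failures (or only failures) in both groups: nothing to compare
    z = (p_t - p_c) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability of a standard normal

def guardrail_breached(bad_control, n_control, bad_treatment, n_treatment,
                       alpha: float = 0.05) -> bool:
    """True when the rollout group's failure rate is significantly worse than control."""
    return one_sided_p_value(bad_control, n_control, bad_treatment, n_treatment) < alpha

# Made-up counts of flagged responses out of 2,000 per group.
print(guardrail_breached(40, 2000, 42, 2000))  # False: 2.0% vs 2.1% is within noise
print(guardrail_breached(40, 2000, 70, 2000))  # True: 2.0% vs 3.5% is a clear regression
```

Checking the same guardrail repeatedly during a rollout inflates false alarms, so a production setup would typically add a sequential-testing correction on top of this.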

Suspicious uplift detection flags when a metric moves by an implausible amount in a single experiment — often a signal of a bug rather than a genuine effect. Gradual rollouts limit blast radius by exposing AI features to a subset of users first, using engagement data to validate quality before full deployment. And when something goes wrong, instant feature deactivation means you can pull a feature without a code deploy.

These capabilities work together because they share the same targeting, segmentation, and metrics infrastructure — which is what makes the detection-to-rollback cycle fast enough to be useful.

Building a feedback loop from user behavior to model evaluation

The final piece is closing the loop between what happens in production and what you test against. User behavior — task completion, engagement patterns, abandonment — is a quality signal that no pre-launch eval can fully replicate. That signal should feed back into your evaluation process, turning production failures into stronger test coverage and surfacing the real-world conditions your benchmarks missed.
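A minimal version of that loop is a job that copies prompts from failed production interactions into the eval set. The sketch below assumes a failure means the user regenerated or abandoned; the field names and the in-memory dictionary standing in for an eval set are hypothetical.

```python
import hashlib

def add_failures_to_eval_set(eval_set: dict, interactions: list) -> int:
    """Add prompts from failed production interactions to an eval set, skipping duplicates."""
    added = 0
    for item in interactions:
        failed = item.get("regenerated") or item.get("abandoned")
        if not failed:
            continue
        # Normalize case and whitespace so trivially different prompts share one key.
        normalized = " ".join(item["prompt"].lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()[:16]
        if key not in eval_set:
            eval_set[key] = {"prompt": item["prompt"], "source": "production"}
            added += 1
    return added

eval_set = {}
interactions = [
    {"prompt": "How do I reset my password?", "regenerated": True},
    {"prompt": "how do I  reset my password?", "abandoned": True},  # duplicate after normalizing
    {"prompt": "Summarize this invoice", "regenerated": False, "abandoned": False},
]
added = add_failures_to_eval_set(eval_set, interactions)
print(added)  # 1
```

In practice the new cases would still need a human-written expected behavior before they are useful as tests, which is where the calibration layer comes back in.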

Character.AI's head of post-training described this directly: using GrowthBook to compare different modeling techniques from the perspective of users, guiding research in the direction that best serves the product. That framing — model evaluation driven by user outcomes rather than internal benchmarks — is what continuous monitoring is actually for.

As GrowthBook's AI testing playbook puts it, evals are just the tip of the iceberg. The ongoing loop between user behavior and model evaluation is where quality compounds over time.

Continuous monitoring is not just a safety net. It is the mechanism by which teams build organizational confidence in AI outputs — and the only way to know whether the quality you measured before launch is the quality your users are actually experiencing.

Building a measurement program that catches what single scores hide

The argument this article has made is simple, even if the execution isn't: AI output quality is multidimensional, the dimensions move independently, and any measurement program that collapses them into a single score is hiding the failure modes that matter most.

A green accuracy score and quietly churning users are not contradictory — they're what happens when your measurement stack is narrower than the problem it's supposed to catch.

Start with the dimensions most likely to mask risk in your specific use case

Not every quality dimension carries equal risk for every product. A customer-facing support bot has a different risk profile than an internal code assistant — safety and hallucination rate may dominate in one context, while task completion and latency dominate in the other.

Before you build anything, identify which two or three dimensions, if they degraded silently, would cause the most damage. Those are your first monitoring targets, not the ones that are easiest to automate.

Layer your evaluation stack: automated scoring, human review, and behavioral signals

The three-layer stack — automated metrics as the monitoring layer, human review as the calibration layer, behavioral signals as the outcome layer — works because each layer compensates for what the others miss. The temptation is to pick one and call it done. Resist it.

Automated scoring without behavioral signals means you're measuring model performance, not user outcomes. Behavioral signals without human review means you can see that something is wrong but not why. The layers are interdependent, and the stack only works when all three are running.

Instrument for continuous monitoring before you scale AI features

The Claude Code regression — two days of degraded outputs reaching users before detection — is a useful forcing function. Ask yourself honestly: if your model provider pushed a silent update tonight, how long would it take you to notice? If the answer is "when users complain," your monitoring is behind your deployment.

Set up guardrail metrics and gradual rollouts before you scale, not after. Feature flagging and guardrail metric infrastructure exists precisely for this: catch regressions when the blast radius is still small enough to recover from.

The tension worth holding onto is this: building a complete measurement program takes real investment, but the cost of not having one compounds quietly. You won't see it in your dashboards. You'll see it in retention.

This article is meant to give you a framework you can actually act on — not a theoretical model, but a practical starting point grounded in how these systems fail in the real world.

What to do next: Map your current measurement stack against the seven dimensions. If you're only tracking one or two — accuracy, latency, or throughput — identify which unmeasured dimensions carry the most risk for your specific use case. Start there. Add one behavioral signal tied to a downstream user outcome (task completion, retention, regeneration rate) and one guardrail metric with a defined threshold. That combination — one outcome signal, one guardrail — is enough to start catching the failure modes your current stack is making invisible.
