How to run experiments on AI features without breaking UX

Shipping an AI feature to 100% of your users on day one is the fastest way to find out your model behaves very differently in production than it did in your evals.
The failure mode isn't always dramatic — it's often subtle. A model that performs well on benchmarks starts producing outputs that feel slightly off to real users. Trust erodes quietly, and by the time aggregate metrics surface the problem, the damage is already done.
The core argument of this article is simple: you can experiment on AI features without breaking UX, but only if you treat AI features as fundamentally different from UI changes — because the failure modes are different, the metrics that catch them are different, and the cost of getting it wrong is higher.
This guide is for engineers, PMs, and data teams who are shipping AI features and want a practical system for testing changes safely. It covers the four building blocks you need before any AI experiment goes live:
- Feature flags — how to gate AI features at runtime so you can control exposure without a deploy
- Incremental rollouts — how to ramp traffic gradually and catch quality regressions before they reach your full user base
- Guardrail metrics and fallback behaviors — what to monitor, what to commit to not harming, and how to define a safe fallback before launch
- Measurement — which metrics actually capture AI feature quality, and what to do when you realize mid-experiment you tracked the wrong thing
Each section builds on the last. By the end, you'll have a concrete framework for how to experiment on AI features without breaking UX — one that gives you the speed to ship and the control to pull back fast when something goes wrong.
Why AI features demand a different approach to experimentation
When a Google Pixel Watch update locked core watch functionality behind an AI data-sharing consent screen, users didn't file bug reports — they posted to Hacker News. A thread titled "Don't push AI down our throats" accumulated 431 points and 358 comments, with practitioners describing AI updates that silently broke working features: music playback that stopped responding, assistant behaviors that changed without warning, functionality that disappeared behind consent walls.
None of these failures would have shown up as a visible UI regression in a standard A/B test. All of them eroded trust in ways that a miscolored button never could.
That asymmetry is the starting point for understanding why AI features require a fundamentally different approach to experimentation.
The non-determinism problem
Standard A/B testing rests on a foundational assumption: the treatment is fixed. Variant B is always the same button color, the same headline copy, the same checkout flow. Every user in the variant group sees the same thing, which is what makes the statistical comparison valid.
AI features break this assumption by design. The same user, the same prompt, and the same context can produce meaningfully different outputs across requests. When you run an experiment on a conversational interface or a generative assistance feature, you're not testing a fixed treatment — you're testing a distribution of possible outputs.
Standard A/B frameworks measure the average difference between two stable things. With AI, Variant B is a model that might produce ten different responses to the same prompt, depending on context, phrasing, or random sampling.
A standard A/B test can tell you the average difference between the control and the AI variant, but it can't tell you whether that AI variant is behaving consistently across the full range of what it might say — or whether it's occasionally producing outputs that are wrong, off-topic, or damaging to user trust. A statistically significant result on the average can coexist with serious quality problems in the tail — and that's precisely why internal evals alone are insufficient. Internal evals can tell you how a model performs on a benchmark. They cannot tell you how users actually experience the variance in its outputs across millions of real interactions.
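To make the average-versus-tail point concrete, here is a minimal sketch. The quality scores are invented for illustration, and the 0.3 cutoff for an "unacceptable" response is an assumption, not a recommended value.

```python
from statistics import mean

# Hypothetical per-response quality scores in [0, 1] for each arm.
control_scores = [0.70] * 100
variant_scores = [0.80] * 95 + [0.10] * 5  # mostly better, occasionally very bad

BAD_THRESHOLD = 0.3  # assumed cutoff for an "unacceptable" response


def bad_rate(scores, threshold=BAD_THRESHOLD):
    """Share of responses scoring below the threshold."""
    return sum(score < threshold for score in scores) / len(scores)


for name, scores in (("control", control_scores), ("variant", variant_scores)):
    print(f"{name}: mean={mean(scores):.3f}, bad rate={bad_rate(scores):.0%}")
```

In this toy data the variant wins on the mean while one response in twenty is far worse than anything the control produces. A comparison of averages alone would call that a clean win.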
Trust erosion is asymmetric and sticky
A layout bug gets noticed and forgiven. A bad AI response gets remembered. The Pixel Watch incident and the broken Android assistant examples from that Hacker News thread share a common structure: an AI feature change degraded a non-AI core experience, with no visible interface regression to catch.
Users didn't experience a broken UI — they experienced a broken product that used to work. The backlash was disproportionate to the technical severity of the failure because it felt like a betrayal, not a bug.
This asymmetry matters for how you think about experiment risk. The downside of a bad UI experiment is a temporary conversion dip you can reverse with a rollback. The downside of a bad AI experiment can be lasting user resentment, public backlash, and churn from users who simply stop trusting the product.
There is no quantified recovery timeline for AI-driven trust erosion, but community reactions like the thread above suggest that recovery is slower and less predictable than it is for visual regressions.
Standard metrics can mask downstream harm
Even when teams do run controlled experiments on AI features, the metrics they reach for — click-through rate, session length, conversion — are poorly suited to detecting the failure modes that matter most. GrowthBook's experimentation documentation makes this explicit: "Relying solely on short-term metrics can encourage dark patterns in A/B testing, where you inadvertently exploit user trust to boost numbers temporarily at the expense of long-term retention."
With AI, this risk is structural. A model that produces more sensational or emotionally provocative responses may drive short-term engagement while accelerating long-term churn.
Aggregate metrics compound the problem further: a model that performs well on average may be producing degraded or harmful outputs for specific user segments — demographic, behavioral, or contextual — that the aggregate numbers simply don't surface.
The compounding cost of a poorly planned AI experiment
When the feature under test is an AI model, the costs of a bad experiment stack in ways they don't for UI tests. There are inference costs for a model that is underperforming at scale, user churn from trust erosion during the exposure window, and measurement costs when teams discover mid-experiment that they tracked the wrong proxies entirely.
The point is not that A/B testing is the wrong tool for AI features. The point is that standard A/B testing, applied naively to AI features without guardrails, gradual exposure, and metrics designed for probabilistic outputs, is insufficient. And the cost of getting that wrong is not proportional to the cost of getting a UI experiment wrong.
Feature flags are the control layer that makes everything else in AI rollouts possible
Before gradual rollouts, guardrail metrics, or fallback behaviors can do their job, you need a control layer that operates faster than your deploy pipeline. For AI features, that layer is feature flags — and the reason they matter isn't philosophical. It's operational.
A hardcoded branch can't close a four-minute incident window. A feature flag can.
A conditional if statement in your application code is static the moment it ships. Changing it requires a pull request, a review, a build, and a deploy — a cycle that, under normal circumstances, might take an hour. Under incident conditions, it takes longer.
A feature flag is evaluated at runtime, which means the individual feature — not the deployment — becomes the unit of control. You can change who sees what, and how much of your traffic is exposed, from a dashboard, in seconds, without touching the codebase.
For AI features, this distinction is not a convenience. Model behavior can degrade unpredictably. A prompt change, a model version update, or an upstream API shift can produce outputs that are wrong, slow, or actively harmful to user trust. The window between "something is wrong" and "users are affected" is measured in minutes. A deploy cycle cannot close that window.
One practical concern worth naming is latency. GrowthBook SDKs download flag rules as a locally cached JSON payload and evaluate every flag check in-process with zero network latency, handling over 100 billion flag lookups per day. Adding a flag layer to an AI feature that already carries latency variance does not compound that variance; the evaluation overhead is negligible.
Controlled rollout mechanics and JSON flag payloads
Feature flags support more than on/off states. Flag types include Boolean, String, Number, and JSON — and for AI features, JSON flags are particularly useful. A single flag can carry a full model configuration as a structured payload: model name, temperature, retry count, fallback behavior.
That means a team can swap model parameters across user segments without deploying new code, and can version those configurations independently of the application.
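As a rough sketch of what consuming such a payload might look like, the snippet below parses a JSON flag value into a typed config and falls back to safe defaults when the value is missing or malformed. The field names (`model`, `temperature`, `max_retries`, `fallback`) and the default values are assumptions chosen for illustration; they are not a GrowthBook schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model: str
    temperature: float
    max_retries: int
    fallback: str


# Safe defaults used when the flag is off, missing, or malformed (assumed values).
DEFAULT_CONFIG = ModelConfig(
    model="baseline", temperature=0.0, max_retries=0, fallback="static_template"
)


def parse_model_config(raw_json):
    """Parse a JSON flag value into a ModelConfig, falling back to defaults."""
    if raw_json is None:
        return DEFAULT_CONFIG
    try:
        data = json.loads(raw_json)
        return ModelConfig(
            model=str(data["model"]),
            temperature=float(data["temperature"]),
            max_retries=int(data["max_retries"]),
            fallback=str(data["fallback"]),
        )
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing field, or the wrong shape: use the safe config.
        return DEFAULT_CONFIG
```

Validating the payload at the edge means a typo in a dashboard field degrades to the baseline configuration instead of reaching the model call.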
Targeting rules let you define exactly who receives a given flag state — by user ID, account type, geography, plan tier, or any custom attribute your application surfaces. This is how you expose an AI feature to internal users first, then a beta cohort, then a controlled percentage of production traffic, each stage gated by explicit criteria rather than a coin flip.
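The idea behind attribute-based targeting can be sketched in a few lines. The rule format here is hypothetical and much simpler than what a real flag platform supports; it only illustrates ordered rules with a default.

```python
def matches(rule, attributes):
    """A rule matches when every listed attribute has one of the allowed values."""
    return all(
        attributes.get(key) in allowed
        for key, allowed in rule["conditions"].items()
    )


# Rules are checked in order and the first match wins (hypothetical format).
RULES = [
    {"conditions": {"is_employee": [True]}, "value": "ai-v2"},
    {"conditions": {"plan": ["beta"], "country": ["US", "CA"]}, "value": "ai-v2"},
]
DEFAULT_VALUE = "ai-off"


def resolve(attributes, rules=RULES, default=DEFAULT_VALUE):
    """Return the value of the first matching rule, or the default."""
    for rule in rules:
        if matches(rule, attributes):
            return rule["value"]
    return default


print(resolve({"is_employee": True}))
print(resolve({"plan": "beta", "country": "DE"}))
```

Internal users match the first rule, beta users in the listed countries match the second, and everyone else stays on the default. Widening exposure is then a matter of editing rules, not code.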
Kill switches and instant rollback
When a GrowthBook flag is disabled, it is excluded from the API response entirely. The feature evaluates to null, targeting rules are ignored, and the SDK renders the fallback value — the safe, deterministic default you defined when you created the flag. No deploy required.
This is the kill switch mechanism in practice, and it is the reason kill switches are a near-universal requirement for AI feature management rather than a premium capability.
The fallback value matters as much as the kill switch itself. If your AI feature flag is off and your application has no defined default, users see a broken experience. If the fallback is a deterministic, non-AI version of the same functionality, users see a degraded but functional one. Defining that fallback before launch is not optional — it is the difference between a contained incident and a visible outage.
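On the application side, a defined fallback might look something like the sketch below. `summarize_with_model` is a stand-in for a real model call and `summarize_deterministic` is a hypothetical non-AI baseline; both names are invented for this example.

```python
def summarize_with_model(text):
    """Stand-in for a real model call (hypothetical)."""
    return f"[AI summary of {len(text.split())} words]"


def summarize_deterministic(text, max_words=20):
    """Non-AI fallback: return the first few words of the text."""
    words = text.split()
    suffix = "..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix


def summarize(text, flag_value, model=summarize_with_model):
    """Use the AI path only when the flag is on; otherwise degrade gracefully."""
    if not flag_value:
        # Flag is off or evaluated to null: serve the deterministic experience.
        return summarize_deterministic(text)
    try:
        return model(text)
    except Exception:
        # A model failure also lands on the deterministic path.
        return summarize_deterministic(text)
```

The same deterministic function serves both cases: the flag being switched off and the model call failing. Users get a plainer result, not an empty panel.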
Prerequisite targeting for AI variants
GrowthBook's prerequisite flags capability allows you to gate one feature on the state of another. In practice, this means an advanced AI variant — say, a multi-turn reasoning interface — can be restricted to users who already have the base AI feature enabled and have demonstrated stable engagement with it.
You prevent exposing users to a complex model behavior before they've had a stable experience with the simpler version — one of the more common sources of UX breakage in AI rollouts.
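The gating logic itself is simple to sketch. The flag names and data layout below are hypothetical, and the sketch assumes the prerequisite graph has no cycles; it only shows the rule that a dependent flag is on when its own rules pass and every prerequisite is also on.

```python
# Hypothetical flag definitions: each flag lists the flags it depends on.
FLAGS = {
    "ai-assistant": {"prerequisites": []},
    "ai-multi-turn": {"prerequisites": ["ai-assistant"]},
}


def is_enabled(flag_key, passing_flags, flags=FLAGS):
    """True only if the flag's own rules pass AND every prerequisite is enabled.

    `passing_flags` is the set of flags whose own targeting rules pass for
    this user. Assumes the prerequisite graph has no cycles.
    """
    if flag_key not in passing_flags:
        return False
    return all(
        is_enabled(parent, passing_flags, flags)
        for parent in flags[flag_key]["prerequisites"]
    )


print(is_enabled("ai-multi-turn", {"ai-multi-turn"}))
print(is_enabled("ai-multi-turn", {"ai-assistant", "ai-multi-turn"}))
```

A user who matches the advanced variant's rules but does not have the base feature never sees the advanced variant, and turning off the base feature turns off everything that depends on it.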
Separating deployment from exposure
The principle underlying all of this is straightforward: code can be deployed to production and validated before any user sees it. This separation of deployment from exposure is what makes gradual rollouts, guardrail metrics, and fallback behaviors operationally viable. Without it, the team is choosing between shipping to everyone and not shipping at all.
Feature flags create the middle ground — and in GrowthBook, any flag can be converted into a structured A/B test with one click, bridging the gap between safe deployment and rigorous experimentation without requiring a separate instrumentation pass.
Roll out AI changes incrementally to catch quality regressions early
The instinct to ship an AI feature to all users at once is understandable — you've tested it internally, the eval scores look good, and the team is confident. But internal testing doesn't surface the long tail of real-world inputs that will stress your model in production, and by the time a quality regression shows up in aggregate metrics, it may have already damaged trust with thousands of users.
Incremental rollouts are the practical answer: limit initial exposure, monitor what actually happens, and only widen access when the data supports it.
This isn't just operational caution. When done with the right tooling, a staged rollout generates statistical signal at each stage — you're not just hoping nothing breaks, you're running a controlled comparison between the new AI behavior and the baseline.
The staged rollout pattern for AI features
The mechanics are straightforward in principle: expose a small percentage of users to the new AI variant, monitor guardrail metrics, and advance to the next tier only when the data is clean. GrowthBook's Safe Rollout feature formalizes this into a documented ramp-up schedule — 10% → 25% → 50% → 75% → 100% — where the entire ramp completes within the first 25% of the configured monitoring duration.
If you set a four-day monitoring window, traffic reaches 100% by the end of day one; the remaining three days monitor the fully-deployed feature for guardrail metric failures before the rollout is marked complete.
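The arithmetic can be worked through in a short sketch. The percentages and the 25% ramp fraction come from the description above; the even spacing between steps is an assumption made here for illustration, since the per-step timing is not specified.

```python
from datetime import timedelta

RAMP_STEPS = [10, 25, 50, 75, 100]  # percentages named in the text
RAMP_FRACTION = 0.25  # ramp finishes within the first 25% of the window


def ramp_schedule(monitoring, steps=RAMP_STEPS, ramp_fraction=RAMP_FRACTION):
    """Return (offset, percent) pairs, ASSUMING evenly spaced steps."""
    ramp_window = monitoring * ramp_fraction
    interval = ramp_window / (len(steps) - 1)
    return [(interval * i, percent) for i, percent in enumerate(steps)]


for offset, percent in ramp_schedule(timedelta(days=4)):
    print(f"t+{offset}: {percent}%")
```

With a four-day window the ramp window is one day, so under the even-spacing assumption each step lands six hours after the previous one and the last step lands at the 24-hour mark.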
Under the hood, Safe Rollout runs as a short-term A/B test: the control group receives the existing feature value, the rollout group receives the new AI behavior. This means you're not just watching error dashboards — you're getting statistically grounded comparisons between the two experiences at every stage of the ramp.
For teams that want gradual exposure without automated monitoring, a simpler Percentage Rollout rule (available on all plans) supports manual ramp-ups like 10% → 50% → 100%. The tradeoff is that go/no-go decisions become entirely manual, which requires pre-defined criteria and the discipline to actually enforce them.
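A percentage rollout depends on assigning each user to a stable bucket so that the same user stays in or out as the percentage grows. The sketch below uses SHA-256 as a generic illustration of that idea; it is not a description of the hashing any particular SDK uses.

```python
import hashlib


def bucket(user_id, flag_key):
    """Map a user to a stable value in [0, 1) from a hash of flag key + user id."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def in_rollout(user_id, flag_key, percent):
    """True when the user falls inside the first `percent` of the bucket range."""
    return bucket(user_id, flag_key) < percent / 100


users = [f"user-{i}" for i in range(1000)]
for percent in (10, 50, 100):
    exposed = sum(in_rollout(u, "ai-summary", percent) for u in users)
    print(f"{percent}% rollout exposes {exposed} of {len(users)} users")
```

Because the bucket is a pure function of the user and flag, anyone exposed at 10% is still exposed at 50%. Ramping up adds users without reshuffling the ones already in.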
Defining go/no-go criteria at each stage
The most common question from teams new to staged rollouts is: how do I know when I'm ready to move from 10% to 25%? For teams using automated guardrail monitoring, the answer is built into the tooling.
GrowthBook's Safe Rollout analyzes guardrail metrics using one-sided sequential testing, which means it can flag a regression as soon as statistical significance is reached — without inflating false positive rates from repeated looks at the data. The status dashboard surfaces clear signals: "Ready to ship" means no regressions detected and the monitoring duration has completed; "Guardrails Failing" means a regression has been detected and the rollout should be reconsidered.
For teams on manual rollout rules, the go/no-go decision requires explicit pre-commitment: define which metrics constitute a regression, set the threshold before the rollout begins, and treat a breach as a hard stop rather than a discussion point. The discipline to hold that line — especially under product pressure — is what separates a staged rollout from a staged rollout in name only.
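One way to make that pre-commitment concrete is to write the criteria down as data before the rollout starts. The metric names and thresholds below are illustrative assumptions. Note that this is a plain threshold comparison on observed values, not a statistical test, so it does not account for noise the way sequential testing does.

```python
# Thresholds agreed before the rollout starts (illustrative values).
CRITERIA = {
    "error_rate": {"max_relative_increase": 0.05},
    "p95_latency_ms": {"max_relative_increase": 0.10},
}


def go_no_go(control, rollout, criteria=CRITERIA):
    """Return ("go", []) or ("no-go", [names of breached metrics]).

    Assumes every control value is non-zero.
    """
    breaches = []
    for metric, rule in criteria.items():
        baseline = control[metric]
        change = (rollout[metric] - baseline) / baseline
        if change > rule["max_relative_increase"]:
            breaches.append(metric)
    return ("no-go", breaches) if breaches else ("go", [])


control = {"error_rate": 0.010, "p95_latency_ms": 800}
print(go_no_go(control, {"error_rate": 0.0102, "p95_latency_ms": 820}))
print(go_no_go(control, {"error_rate": 0.0120, "p95_latency_ms": 820}))
```

Keeping the criteria in a reviewed file gives the team something to point to when a breach happens, which makes "hard stop" easier to enforce than a threshold that lives in someone's head.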
Instant deactivation as a non-negotiable for AI
AI regressions don't always announce themselves immediately. A model that produces subtly lower-quality responses may not spike your error rate — it will surface gradually in session depth, task completion, or return behavior. By the time the regression is obvious in aggregate, the damage is already distributed across a significant portion of your user base.
This is why the ability to instantly deactivate a rollout without a code deploy is a hard requirement for AI features, not a nice-to-have. GrowthBook's Auto Rollback toggle makes this automated: if any guardrail metric fails at statistical significance during monitoring, the rollout rule is disabled automatically. Teams that prefer to retain manual control can leave Auto Rollback off and act on the "Guardrails Failing" status themselves.
Using engagement data to drive rollout decisions
Error rates and latency are necessary signals, but they're not sufficient for AI features. A model can respond quickly and without errors while still producing outputs that users find unhelpful, confusing, or off-topic — and that kind of quality regression will show up in engagement signals before it shows up in infrastructure metrics.
Session depth, task completion rates, and return visit frequency in the days following initial exposure are the leading indicators worth watching as you advance through rollout stages.
GrowthBook's Safe Rollout monitoring surfaces metric boundary trends as a time series — not just a single snapshot — which means you can see whether a guardrail metric is drifting toward failure or stabilizing as the rollout progresses. That trajectory matters as much as the current value, particularly for AI features where output quality can shift as the model encounters a wider distribution of real-world inputs.
Define guardrail metrics and fallback behaviors before you launch any AI experiment
Incremental rollouts tell you when to slow down. Guardrail metrics tell you why. Without the second piece, a staged rollout is just a slower way to ship a feature you don't fully understand.
A team ships a new AI-powered recommendation feature. Primary engagement metrics climb past their goal within the first week. Everyone celebrates. Three months later, someone notices infrastructure costs have tripled — a direct consequence of the feature's unintended behavior at scale. This scenario, as Statsig's engineering team has documented, is not unusual. It's what happens when teams optimize for the metric they're watching and ignore the ones they're not.
For AI features, the stakes compound. A UI change that performs poorly on engagement metrics wastes engineering time. An AI feature that performs well on engagement metrics while silently degrading error rates, inflating latency, or generating harmful outputs can erode user trust before any monitoring system catches it.
The probabilistic nature of AI — where the same input can produce different outputs across sessions, user segments, and time — means damage can be inconsistent and hard to attribute without systematic monitoring in place.
Guardrail metrics are the answer to this problem. They are not the metrics you're trying to improve. They are the metrics you commit to not harming, regardless of what your primary goal metric does.
What guardrail metrics are and why they're non-negotiable for AI
The clearest way to think about guardrail metrics is as a safety net that runs in parallel to your primary success metric. You might be optimizing for task completion rate on an AI assistant feature. Your guardrail metrics are the signals that would tell you the feature is winning in a way that's unsustainable or harmful — error rates spiking, response latency degrading, downstream conversion dropping.
For AI specifically, the guardrail set should include metrics that reflect model health, not just business outcomes. Error rates and API failure rates capture model reliability. Latency tracks inference cost and user experience degradation simultaneously. Conversion rates downstream of the AI interaction capture whether the feature is actually serving users or just capturing their attention. GrowthBook's documentation on Safe Rollouts names exactly these three — error rates, latency, and conversions — as the canonical starting point.
One critical constraint: keep the guardrail set focused. Adding too many guardrail metrics increases false positive rates, which means you'll trigger rollbacks on noise rather than signal. Define the smallest set of metrics that would tell you the feature is genuinely causing harm.
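The effect is easy to quantify under simplifying assumptions. If each guardrail is tested independently at the same significance level with no correction for multiple comparisons, the chance that at least one fires on noise alone grows quickly with the number of metrics. Real guardrail metrics are usually correlated, so treat this as a rough guide, not an exact figure.

```python
def family_false_positive_rate(alpha, num_metrics):
    """Chance that at least one of `num_metrics` independent tests fires by noise alone."""
    return 1 - (1 - alpha) ** num_metrics


for count in (1, 3, 10, 20):
    rate = family_false_positive_rate(0.05, count)
    print(f"{count:>2} guardrails at alpha=0.05 -> {rate:.1%} chance of a spurious failure")
```

Going from three guardrails to ten roughly triples the odds of a rollback triggered by noise, which is the practical argument for a small, deliberate set.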
Defining fallback behaviors in code before launch
A fallback is not an error message. It is a pre-engineered alternative experience that activates when the AI component fails, regresses, or gets rolled back — and it must exist in code before the experiment launches, not after a guardrail fires.
testRigor's framing of AI unpredictability is worth sitting with here: in traditional software, 2 + 2 always equals 4. In AI systems, it usually equals 4, but occasionally equals "purple," and every once in a while the model replies that it's not authorized to discuss mathematics. That unpredictability is not a bug you can patch — it's a property of the system you have to design around. The fallback is how you design around it.
The practical implication: your fallback behavior should be the control experience — a known-good, deterministic baseline that users can always be returned to. GrowthBook's Safe Rollout architecture enforces this structurally. The control receives the existing value; the rollout receives the new value.
When Auto Rollback is enabled and a guardrail metric fails, GrowthBook automatically disables the rollout rule and returns all users to the control. The failure threshold is set at zero: as soon as there is statistically significant evidence that any guardrail metric is being harmed, even by a small amount, the rollout is flagged for rollback. That design choice reflects a deliberate safety-first stance: the cost of a false positive rollback is lower than the cost of a harmful AI output reaching your full user base.
The ethics of short-term metric optimization
There is a practical ethical dimension to guardrail metrics that teams often miss until it's too late. An AI feature can appear to win on aggregate engagement metrics while producing harmful or degraded outputs for specific user subgroups. Probabilistic AI outputs don't fail uniformly — they can fail disproportionately for users with certain query patterns, languages, or contexts. Aggregate metrics won't surface this. Only guardrail metrics designed to catch distributional harm will.
This is not an abstract concern. It's the concrete reason why optimizing for short-term engagement in an AI experiment is a different kind of risk than optimizing for short-term engagement in a button color test. AI models must constantly adapt to evolving policies and ethical guidelines, and experimentation without guardrails is not responsible shipping — it's just fast shipping.
The practical takeaway is straightforward: before any AI experiment launches, define your guardrail metrics, configure automated monitoring against them, write your fallback behavior into the codebase, and ensure your control variant is a stable, deterministic baseline. If a guardrail fires, you want the system to act — not wait for someone to notice.
AI experiments produce the wrong answers when you ask the wrong questions
Getting the rollout mechanics right — flags, gradual exposure, guardrails — only solves half the problem. Once users are actually seeing your AI variant, you still need to know whether it's working. And for AI features, "working" is harder to define than most teams expect.
Why standard metrics fall short
Click-through rate tells you whether a user acted. It doesn't tell you whether the AI response that prompted that click was accurate, helpful, or quietly eroding their trust in your product. Conversion rate has the same blind spot. These metrics measure downstream behavior, but AI quality problems often don't show up in downstream behavior immediately — they accumulate.
A user who gets a misleading AI-generated answer might still complete their session. They just don't come back next week.
There's also the averaging problem. Experiment results are aggregate statistics, and aggregates can hide harm to specific user subsets. An AI feature that performs well on average might be producing poor outputs for users in a particular context, language, or use case — and your top-line metrics won't surface that. GrowthBook's own experimentation documentation flags this directly: standard metrics "might not capture harm that is being done to some subsets of your population."
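A small sketch shows how an acceptable top-line number can sit on top of a badly served segment. The segments and counts are invented for illustration.

```python
from collections import defaultdict


def completion_by_segment(events):
    """events: iterable of (segment, completed) pairs -> {segment: completion rate}."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [completed, total]
    for segment, completed in events:
        totals[segment][0] += int(completed)
        totals[segment][1] += 1
    return {segment: done / total for segment, (done, total) in totals.items()}


# Hypothetical sessions: 900 English-language, 100 Spanish-language.
events = (
    [("en", True)] * 810 + [("en", False)] * 90
    + [("es", True)] * 40 + [("es", False)] * 60
)

overall = sum(completed for _, completed in events) / len(events)
print(f"overall completion: {overall:.0%}")
for segment, rate in completion_by_segment(events).items():
    print(f"  {segment}: {rate:.0%}")
```

The overall rate is 85%, which looks fine, while the smaller segment completes only 40% of the time. Breaking results down by the dimensions where the model is most likely to struggle is what exposes that gap.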
Metrics that actually capture AI feature quality
The metrics worth tracking for AI experiments are ones that connect model behavior to user outcomes, not just user actions. Task completion rate — did the user accomplish what they came to do? — is more informative than whether they clicked something. Session depth, measured as meaningful engagement rather than raw page views, can signal whether an AI response moved the conversation forward or dead-ended it.
User correction rate is a direct signal of output quality that click-based metrics completely miss. When users frequently rephrase, retry, or override an AI suggestion, that behavior tells you the model's first response wasn't good enough — a signal that aggregate engagement numbers will never surface.
Response relevance, where you can instrument it, captures whether the model's output was appropriate to the query at all.
These aren't exotic metrics. They're the kind of signals that require you to define what "success" means for your specific AI feature before you run the experiment — which is exactly the discipline that AI experimentation demands.
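User correction rate, for example, can be derived from event logs with very little code. The event names below (`shown`, `rephrase`, `retry`, `override`) are assumptions; what counts as a correction depends on how your product is instrumented.

```python
CORRECTION_EVENTS = {"rephrase", "retry", "override"}  # assumed event names


def correction_rate(events):
    """events: list of (response_id, event_type) pairs.

    Returns the share of AI responses followed by at least one correction.
    """
    responses = {response_id for response_id, _ in events}
    corrected = {
        response_id
        for response_id, event_type in events
        if event_type in CORRECTION_EVENTS
    }
    return len(corrected) / len(responses) if responses else 0.0


events = [
    ("r1", "shown"), ("r1", "retry"), ("r1", "rephrase"),
    ("r2", "shown"),
    ("r3", "shown"), ("r3", "override"),
    ("r4", "shown"),
]
print(f"correction rate: {correction_rate(events):.0%}")
```

Counting each response at most once keeps a single frustrating exchange with many retries from dominating the metric.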
Matching metric types to what you're measuring
Not every AI signal fits a standard "did this event happen?" measurement. Depending on what you're trying to capture, you need different metric structures — and choosing the wrong one will give you a misleading result.
Latency is the clearest example. If you measure AI response time as an average, the many fast responses will pull the number down and hide the fact that 5% of your users are waiting four seconds for every reply. A percentile metric, specifically p95 or p99, shows you what the slowest users are actually experiencing, which is where trust breaks.
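The gap between the mean and the tail is easy to see in a toy example. The latencies below are invented, and the percentile uses the simple nearest-rank definition; analytics tools may interpolate differently.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in seconds: 92 fast responses and 8 slow ones.
latencies = [0.4] * 92 + [4.0] * 8

print(f"mean: {mean(latencies):.2f}s")
print(f"p50:  {percentile(latencies, 50):.1f}s")
print(f"p95:  {percentile(latencies, 95):.1f}s")
```

The mean stays under a second and the median looks excellent, while p95 reports the four-second wait that a meaningful slice of users is actually sitting through.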
For task completion per session, a ratio metric normalizes for session length, so a user who completed three tasks in a long session and a user who completed one task in a short session aren't treated identically. For AI signals that don't map to any standard event type, SQL-defined metrics let you write the measurement logic directly against your data warehouse — which is the practical escape hatch for novel AI features where the output doesn't fit a button click or a page view.
When you realize mid-experiment you measured the wrong thing
It happens. You launch an AI experiment tracking one metric, and two weeks in you realize the signal you actually needed was something you didn't instrument. On most experimentation platforms, that's a sunk experiment. You either live with the wrong data or restart.
GrowthBook supports retroactive metric addition — you can add metrics to a past experiment and compute results against historical exposure data without re-running the test. Merritt Aho, Digital Analytics Lead at Breeze Airways, put it plainly: "Being able to spin up new metrics mid-experiment is a game changer. This was simply never possible before." For AI experiments, where unexpected behaviors are the norm rather than the exception, this capability is a meaningful safety net.
Tracking cumulative impact across many AI model variants
AI teams don't run one experiment. They run a continuous stream of model comparisons — different prompts, different retrieval strategies, different fine-tuning approaches — and the learning compounds over time. Measuring each experiment in isolation misses the larger picture.
This is exactly the use case Character.AI describes. Landon Smith, Head of Post-Training at Character.AI, explains: "GrowthBook has been an invaluable tool for Character.AI, helping us develop our models into a great consumer experience. We can compare different modeling techniques from the perspective of our users — guiding our research in the direction that best serves our product."
The key phrase is "from the perspective of our users." Model-side metrics — loss, perplexity, benchmark scores — don't tell you whether users are better served. An experimentation layer that surfaces user-outcome data across many model variants does.
Cumulative impact dashboards, win rate tracking across experiments, and a longitudinal learning library let teams find what worked and what didn't over time. For AI teams running frequent model iterations, that longitudinal view is often more valuable than any single experiment result.
The four pieces only work as a stack — pull one out and the others lose their teeth
The four pieces covered in this article — feature flags, incremental rollouts, guardrail metrics, and measurement — aren't independent best practices you can adopt in any order. They're a stack. Flags make gradual rollouts possible. Gradual rollouts make guardrail monitoring meaningful. Guardrail monitoring is only useful if your metrics are actually connected to model quality, not just downstream clicks. Pull one piece out and the others lose their teeth.
Consider the failure mode when guardrail metrics are in place but the fallback behavior was never defined. A guardrail fails because error rate crosses the threshold, and the flag is disabled in response. The SDK returns null. Your application has no fallback value configured, so users see a broken interface: not a degraded one, a broken one.
The rollout system worked exactly as designed. The outcome was still an incident. That's what "pull one piece out" looks like in practice: each remaining piece performs correctly and the system still fails.
The minimum viable guardrail stack: flags, fallbacks, and metrics you need before launch
Before any AI experiment goes live, three things need to exist in your system: a feature flag with a defined fallback value, at least one guardrail metric that reflects model health rather than just business outcome, and a control variant that is a stable, deterministic baseline. That's the floor. Everything else — staged ramp schedules, automated rollback, retroactive metric addition — builds on top of it.
Sequencing matters more than speed: what to do before you open traffic
Write your fallback behavior before you write your experiment config. Define your guardrail thresholds before you open traffic. Start with internal users, then a small cohort, then a percentage of production, and treat the go/no-go decision at each stage as a real decision, not a formality.
The teams that skip these steps aren't moving faster; they're just deferring the cost to a moment when it's harder to absorb.
What AI experimentation requires that general-purpose analytics platforms weren't built for
The tooling question matters because AI experimentation has specific requirements that general-purpose analytics platforms weren't built for: runtime flag evaluation without latency overhead, sequential testing that lets you look at guardrail metrics without inflating false positive rates, retroactive metric addition for when your instrumentation doesn't match what the model actually surfaced, and longitudinal tracking across many model variants.
GrowthBook was built with these requirements in mind, which is why teams like Character.AI use it specifically to compare modeling techniques from the user's perspective — not just from benchmark scores.
One honest tension worth naming: the system described here takes real upfront investment. Defining fallbacks, instrumenting the right metrics, and building the discipline to hold go/no-go criteria under product pressure all require work before you see any benefit. The temptation is to skip the setup and ship.
That tradeoff is real — but so is the asymmetry this article opened with. Trust erosion from a bad AI experiment doesn't recover on the same timeline as a bad button color test.
If you're early in this process, the goal isn't to build the full system on your first experiment. It's to build enough of it that you can pull back fast if something goes wrong. That's a much more achievable starting point than it sounds.
What to do next
If you don't yet have feature flags on your AI features, that's where to start — not because it's the most interesting piece, but because nothing else in this system works without it. Create a flag for your next AI feature, define the fallback value before you enable it for anyone, and make sure your team knows how to disable it without a deploy.
Once that's in place, add one guardrail metric — error rate or latency, not engagement — and configure monitoring against it before the first user sees the variant. That's a complete, functional first version of this system. Build from there.