How to A/B test AI-powered search or recommendations

A/B testing an AI search or recommendation system with the same setup you'd use for a button color test will produce results you can't trust — and nothing in your experiment pipeline will warn you.

The model learns from user behavior during the experiment, which means control-group clicks can quietly shape the treatment model's training, and vice versa. By the time you call a winner, the two groups may no longer be independent.

This guide is for engineers, PMs, and data teams who are building or iterating on AI-powered search or recommendation features and need a reliable way to measure what's actually working. Here's what it covers:

  • Why AI search experiments break the assumptions standard A/B testing is built on
  • Which metrics to track — relevance, click-through rate, conversion, session success, and deflection — and why each one catches something the others miss
  • How to structure the experiment correctly: user bucketing, experiment ID tagging, and group isolation
  • The statistical pitfalls specific to AI search tests, including novelty effects, sample size requirements, and false positives from multiple metrics
  • What to do after the first test, including how to iterate on ranking configs and personalization parameters over time

The article moves in that order — from why this is different, to what to measure, to how to set it up, to how to keep improving. Each section is self-contained, so if you already have a test running and just need the metric guidance or the statistical safeguards, you can jump straight there.

Why A/B testing AI-powered search and recommendations is fundamentally different

Traditional A/B testing rests on a deceptively simple assumption: the control and treatment are static, independent, and do not influence each other. Show half your users a blue button and half a green one, measure clicks, declare a winner. The variants don't change based on who interacts with them. That assumption breaks completely when the thing you're testing is an AI model.

AI-powered search and recommendation systems learn from user behavior in real time. The model you're testing today is not the same model you'll be measuring next week — it has been shaped by the behavioral signals your experiment generated.

This creates a category of failure mode that doesn't exist in conventional experimentation: your test can appear to be running correctly while producing results that are structurally uninterpretable.

The contamination risk: how feedback loops make missing experiment IDs a silent failure

When a user submits a query, clicks a result, or abandons a session, that event becomes training signal. In a standard web experiment, a button click is just a data point. In an AI search experiment, that same click can influence how the model ranks results for the next user.

Google's documentation for AI Commerce Search makes this explicit: when running an A/B experiment, you must include which group a user belongs to when recording user events, because that information is used to refine the model and provide metrics. Both the control group and the experimental group must log events — not just the treatment arm. This is a direct acknowledgment that the model is actively learning during your experiment, and that behavioral data from both groups is feeding back into model state.

This is the core structural difference from a button-color test. The button does not update itself based on who clicks it. The AI model does.

If experiment IDs are not attached to every user event log, the model cannot distinguish which behavioral signals came from which group. Control-group clicks can influence the treatment model's training. Treatment-group behavior can bleed into the baseline. The two groups are no longer independent — but nothing in your experiment infrastructure will tell you that.

This failure mode is silent. There is no sample ratio mismatch alert, no error in your event pipeline, no anomaly in your dashboards. The experiment continues to run, metrics continue to accumulate, and at the end you have a result that looks like a result but is not interpretable as one.

Google's solution is precise: every user event must include the experiment ID in a dedicated field, traffic splitting must be handled by a third-party platform (not the AI system itself), and the two groups must be isolated at the model-training level, not just at the UI layer. The requirement exists because the risk is real and the failure mode is invisible.

The moving target problem

Even setting aside contamination, AI search results are probabilistic and personalized in a way that static variants are not. The same query from the same user can return different results depending on session context, prior behavior, and current model state. "The treatment" in an AI search experiment is not a fixed thing you can point to — it is a distribution of outputs that shifts over the duration of the experiment.

This matters for measurement. Tools designed to compare static variants assume the thing being measured stays constant between exposure and outcome. In practice, this means that if your AI search model updates its ranking weights on Tuesday and your experiment doesn't end until Friday, the "treatment" your Thursday users experienced is measurably different from the one your Monday users saw — but your metrics treat them as the same. That inconsistency introduces error you cannot correct retroactively by reprocessing the data.

The infrastructure requirements that have no equivalent in standard experimentation

The design requirements that follow from these dynamics have no equivalent in standard experimentation. Every event from both groups must be tagged with an experiment ID. Traffic assignment belongs outside the AI system entirely. And group isolation has to be enforced at the model-training layer, not just the serving layer.

Because AI search experiments typically require multiple metrics (relevance, engagement, conversion, session success), the statistical risk compounds: evaluating five independent metrics at the standard 5% significance threshold produces roughly a 23% chance of at least one false positive (1 - 0.95^5 ≈ 0.23). None of these requirements is optional. A contaminated AI search experiment doesn't just produce a wrong answer; it can actively degrade the model for users in both groups without producing any visible signal that something went wrong. That's a higher-stakes failure mode than a misconfigured button test, and it requires a correspondingly more deliberate approach to every layer of the experiment.

Which metrics actually matter when testing AI search: relevance, click-through, conversion, and beyond

Metric selection is the single most consequential decision in an AI search or recommendations experiment. Get it wrong, and you'll ship regressions you can't see — optimizing a number that looks healthy while the actual user experience quietly deteriorates. This isn't a hypothetical risk. Practitioners discussing LLM A/B testing have noted exactly this failure mode: "Any negative result can now be thrown straight into the trash because of the chance [the platform] put you on the wrong side of an A/B test." That's what happens when your metric coverage has gaps.

The GrowthBook experimentation docs put it plainly: "The simplest way this can go wrong is if your metrics are tracking the wrong things, in which case you'll have garbage in and garbage out." AI search experiments need a layered metric stack — not because more metrics are always better, but because each layer captures something the others structurally cannot.

Relevance: the upstream signal everything else depends on

Relevance is the foundation metric, and it's the one teams most often skip because it's the hardest to instrument. In AI search, relevance has two components: finding matching records, and ranking them so the best results appear first. But it extends further — into personalization signals, popularity weighting, and merchandising rules that shape what "best" means for a given user in a given context.

The reason relevance matters upstream is that if your ranking is wrong, every downstream metric is measuring the wrong thing. A model that surfaces irrelevant results with compelling titles will generate clicks. Those clicks will not convert. And by the time you see the conversion signal, you've already run the experiment and shipped the change.

Relevance is qualitative by nature, which is why teams default to proxies. The right approach is to instrument it directly — through human relevance judgments, offline evaluation sets, or model-level signals — and treat it as a leading indicator that validates what your engagement metrics are actually measuring.

Engagement metrics: what CTR and dwell time capture (and miss)

Click-through rate is the most commonly used proxy for search quality, and it has real value — but it has a specific failure mode that matters in AI search. A result set with sensational but irrelevant titles can drive high CTR with low conversion. Goodhart's Law applies directly here: when CTR becomes the target, it ceases to be a good measure of search quality.

Dwell time adds a dimension CTR misses. If a user clicks a result and immediately returns to the search page, that's a signal of failure that CTR records as a success. The GrowthBook docs flag this pattern explicitly, warning that relying on short-term engagement metrics can "encourage dark patterns in A/B testing, where you inadvertently exploit user trust to boost numbers temporarily at the expense of long-term retention." In AI search, that dynamic is especially dangerous because the model can learn from those clicks and reinforce the pattern.

Use CTR and dwell time as engagement signals, not as success criteria. They belong in your metric stack, but not at the top of it.
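One way to keep short clicks from being counted as successes is to report a dwell-qualified click rate next to raw CTR. The sketch below assumes a hypothetical per-impression record with `clicked` and `dwell_seconds` fields, and the 10-second cutoff is an illustrative placeholder rather than a recommended value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff: clicks with less dwell than this count as short clicks.
MIN_DWELL_SECONDS = 10.0


@dataclass
class SearchImpression:
    clicked: bool
    dwell_seconds: Optional[float] = None  # None when there was no click


def click_rates(impressions):
    """Return (raw CTR, dwell-qualified CTR) for a list of impressions."""
    if not impressions:
        return 0.0, 0.0
    clicks = sum(1 for i in impressions if i.clicked)
    long_clicks = sum(
        1
        for i in impressions
        if i.clicked
        and i.dwell_seconds is not None
        and i.dwell_seconds >= MIN_DWELL_SECONDS
    )
    total = len(impressions)
    return clicks / total, long_clicks / total


sample = [
    SearchImpression(True, 2.0),   # quick bounce back to search
    SearchImpression(True, 45.0),  # user stayed on the result
    SearchImpression(False),
    SearchImpression(True, 3.5),   # quick bounce back to search
]
raw_ctr, qualified_ctr = click_rates(sample)
```

In this sample, raw CTR counts three clicks out of four impressions, while the dwell-qualified rate counts only one. A widening gap between the two numbers is the pattern the paragraph above warns about.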

Conversion and revenue: the ground-truth validation layer

Conversion rate and revenue are the metrics that validate whether relevance and engagement improvements are real. Frameworks for evaluating search quality consistently name these as the most business-meaningful signals, and GrowthBook's approach to AI experimentation is built around linking model performance directly to user outcomes rather than stopping at surface engagement.

Because conversion is high-stakes, it warrants stricter statistical controls than softer engagement metrics. GrowthBook supports per-metric significance thresholds — applying tighter significance requirements to revenue metrics than to CTR — which is the right configuration for an AI search experiment where a false positive on conversion is far more costly than a false positive on click rate.

Session success and deflection: the metrics that catch what others miss

Session-level metrics capture whether the user actually accomplished their goal — something CTR and conversion can both miss in multi-step journeys. Query reformulation rate is the clearest signal: if a user submits a second query immediately after the first, the first result set failed. Pogo-sticking back to search after clicking a result is the same signal at the click level.
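Query reformulation rate can be computed from an ordered event log. This is a minimal sketch under stated assumptions: events arrive as hypothetical `(session_id, timestamp_seconds, event_type)` tuples, and "immediately" is taken to mean within 30 seconds with no click in between. Both the schema and the window are illustrative.

```python
from collections import defaultdict

REFORMULATION_WINDOW_SECONDS = 30  # hypothetical definition of "immediately"


def reformulation_rate(events):
    """Share of searches followed by another search, within the window,
    with no click in between. event_type is "search" or "click"."""
    by_session = defaultdict(list)
    for session_id, ts, event_type in events:
        by_session[session_id].append((ts, event_type))

    searches = 0
    reformulated = 0
    for session_events in by_session.values():
        session_events.sort()  # order each session by timestamp
        searches += sum(1 for _, kind in session_events if kind == "search")
        # Adjacent search -> search pairs have no click between them.
        for (ts, kind), (next_ts, next_kind) in zip(session_events, session_events[1:]):
            if (
                kind == "search"
                and next_kind == "search"
                and next_ts - ts <= REFORMULATION_WINDOW_SECONDS
            ):
                reformulated += 1
    return reformulated / searches if searches else 0.0


events = [
    ("s1", 0, "search"), ("s1", 8, "search"), ("s1", 12, "click"),
    ("s2", 0, "search"), ("s2", 5, "click"),
    ("s3", 0, "search"), ("s3", 90, "search"),
]
rate = reformulation_rate(events)
```

Only the first search in session `s1` counts as reformulated here: the second search in `s3` comes 90 seconds later, outside the window.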

For support and help-center search, zero-result rate is the most direct indicator of model failure. A user who gets zero results doesn't click, doesn't convert, and often doesn't come back. This metric should be configured as a guardrail: a threshold that flags or halts an experiment even when primary metrics look positive. GrowthBook's guardrail metric system is designed for exactly this use case: ensuring that a win on engagement doesn't mask a simultaneous degradation in core experience quality.
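The guardrail logic can be sketched in a few lines. This is not GrowthBook's implementation; the function names and the 10% relative-increase tolerance are hypothetical, and the check compares point estimates only, so a production guardrail would add a statistical test on top.

```python
def zero_result_rate(result_counts):
    """Share of queries that returned no results."""
    if not result_counts:
        return 0.0
    return sum(1 for count in result_counts if count == 0) / len(result_counts)


def guardrail_breached(control_counts, treatment_counts, max_relative_increase=0.10):
    """True when the treatment zero-result rate exceeds the control rate
    by more than the tolerated relative increase (hypothetical default)."""
    control_rate = zero_result_rate(control_counts)
    treatment_rate = zero_result_rate(treatment_counts)
    if control_rate == 0.0:
        return treatment_rate > 0.0
    return (treatment_rate - control_rate) / control_rate > max_relative_increase


# Number of results returned for each query, per group (made-up data).
control = [5, 0, 3, 8, 12, 0, 4, 7, 9, 2]    # 2 of 10 queries returned nothing
treatment = [6, 0, 0, 8, 11, 0, 4, 7, 9, 2]  # 3 of 10 queries returned nothing
breached = guardrail_breached(control, treatment)
```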

The concern about AI systems degrading UX without detection is precisely the problem these session-level signals are built to catch. Instrument them from the start, not as an afterthought when something looks wrong.

One practical advantage worth noting: warehouse-native experimentation architectures allow teams to define new metrics retroactively against existing experiment data. As Merritt Aho, Digital Analytics Lead at Breeze Airways, put it: "Being able to spin up new metrics mid-experiment is a game changer. This was simply never possible before." In AI search experiments — where you often discover mid-run that you're missing a critical signal — that flexibility matters more than it does in a standard UI test.

Three non-negotiable infrastructure requirements for AI search experiments

Clean experiment infrastructure is what separates interpretable results from noise in AI search testing — and the consequences of getting it wrong are worse than in a standard feature experiment. When a button-color test is misconfigured, you get unreliable data. When an AI search experiment is misconfigured, you corrupt the behavioral signals the model uses to train itself, and the damage compounds silently over the experiment's lifetime.

There are three non-negotiable requirements: consistent user bucketing, experiment ID tagging on every logged event, and strict group isolation. Each one is load-bearing.

Why inconsistent user assignment corrupts AI model training, not just metrics

Every user must be deterministically assigned to the same group — control or treatment — across every session and every surface where the search experience appears. This sounds obvious, but AI search creates a specific failure mode that simpler experiments don't face: if a user sees AI search results on one session and legacy results on another, their behavioral signals (clicks, dwell time, purchases) get attributed to the wrong model. The AI system learns from contaminated data, and your metric comparison becomes meaningless.

The choice of randomization unit matters and must be made deliberately. For authenticated products, user ID is the right anchor. For anonymous experiences, a persistent device or cookie ID is usually the better choice, because a session ID keeps the assignment stable only within a single session and gives up the cross-session consistency described above. The key is that the assignment logic is deterministic: given the same identifier and experiment configuration, it always produces the same group assignment. GrowthBook supports flexible randomization units (user, location, postal code, URL path) precisely because different product architectures require different bucketing strategies.
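Deterministic assignment is usually implemented by hashing the identifier together with an experiment key. The sketch below is a generic illustration of that idea, not GrowthBook's actual hashing scheme; `assign_group`, the key format, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def assign_group(unit_id: str, experiment_key: str, treatment_share: float = 0.5) -> str:
    """Map an identifier to a group. The same inputs always give the same group."""
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# Repeated calls for the same user and experiment never disagree.
first_call = assign_group("user-123", "ai-search-v1")
second_call = assign_group("user-123", "ai-search-v1")
```

Including the experiment key in the hash input means the same user can land in different groups in different experiments, which keeps one experiment's split from lining up with another's.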

Experiment ID tagging in every user event log

This is the most technically specific requirement in the section, and it comes directly from how AI search systems use behavioral data. Google Cloud's AI Commerce Search documentation states explicitly that when recording a user event, teams must specify which group the user belongs to by including the experiment ID in the experimentIds field. This is a hard requirement, not a suggestion.

The reason is dual-purpose. First, the experiment ID enables metric comparison between groups — without it, you cannot attribute a click or conversion to a specific treatment. Second, it allows the AI model to refine itself using correctly attributed behavioral data. If events are logged without experiment IDs, the model cannot distinguish which behavioral signals came from which treatment. Without attribution, results are unrecoverable.
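A small guard at the point where events are built can stop untagged events from ever being logged. In the sketch below, only the `experimentIds` field name comes from the documentation cited above; the helper name, the experiment ID format, and the other field names are illustrative assumptions, so check them against the event schema you actually use.

```python
import json
from datetime import datetime, timezone


def build_search_event(visitor_id: str, query: str, experiment_id: str) -> dict:
    """Build a search event payload, refusing to create one without a group tag."""
    if not experiment_id:
        raise ValueError("every user event must carry an experiment ID")
    return {
        "eventType": "search",       # illustrative field name
        "visitorId": visitor_id,     # illustrative field name
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "searchQuery": query,        # illustrative field name
        "experimentIds": [experiment_id],
    }


event = build_search_event("visitor-1", "running shoes", "exp-7-treatment")
payload = json.dumps(event)  # what would be sent to the event pipeline
```

Raising an error on a missing ID turns the silent failure described earlier into a loud one at development time.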

Group isolation must happen at the model-training layer, not the UI layer

Both groups must receive identical application experiences except for the single variable under test — in this case, which search system generates the results. Any other difference, whether a UI change, a pricing update, or a promotional banner that only one group sees, invalidates the comparison. Google Cloud's documentation is explicit on this point: both versions of the application must be the same, except that users in the experimental group see results generated by AI Commerce Search.

One practical implication that surprises some teams: Google Cloud's AI Commerce Search does not handle traffic splitting internally. The documentation specifically calls out third-party experiment platforms — naming VWO and AB Tasty as examples — as the mechanism for group assignment. The third-party platform assigns unique experiment IDs to each group, which then flow into the event logs. Any experiment platform that can handle server-side assignment and pass experiment IDs into your event stream can fill this role.

Why keeping experiment data in your own warehouse makes the event tagging requirement tractable

The event logging requirement — tagging every user event with an experiment ID — only works if those events flow somewhere queryable. GrowthBook's experimentation platform analyzes data directly in your existing data warehouse — Snowflake, BigQuery, Redshift, or Postgres — rather than in a proprietary system. This matters for three practical reasons.

First, all tagged events can be joined with other product data and queried with SQL, making metric definitions transparent and reproducible. Second, metrics can be created or modified retroactively without re-running the experiment — a capability that matters specifically for AI search tests, where teams often discover mid-experiment that they should have been tracking query reformulation rate or zero-result rate from the start. Third, keeping experiment data in your own infrastructure means no end-user PII needs to leave your systems — a meaningful consideration for teams logging detailed search behavior at scale.

The warehouse-native approach is what makes the experiment ID tagging requirement tractable operationally. Tag every event, store it where you already keep your data, and define metrics against it in SQL. That's the implementation loop.
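To make that loop concrete, here is a toy version using an in-memory SQLite database as a stand-in for a real warehouse. The `events` table, its columns, and the sample rows are all hypothetical; the point is only that once every event carries an experiment ID, a conversion metric is a plain GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Snowflake, BigQuery, etc.
conn.execute(
    "CREATE TABLE events (user_id TEXT, experiment_id TEXT, event_type TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "control", "search"),
        ("u1", "control", "purchase"),
        ("u2", "control", "search"),
        ("u3", "treatment", "search"),
        ("u3", "treatment", "purchase"),
        ("u4", "treatment", "search"),
        ("u4", "treatment", "purchase"),
    ],
)

# Users and converting users per group, defined after the fact in SQL.
CONVERSION_QUERY = """
    SELECT experiment_id,
           COUNT(DISTINCT user_id) AS users,
           COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) AS converters
    FROM events
    GROUP BY experiment_id
    ORDER BY experiment_id
"""
rows = conn.execute(CONVERSION_QUERY).fetchall()
```

A metric added mid-experiment, such as zero-result rate, would be another query over the same table rather than a new tracking deployment.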

Statistical pitfalls in AI search experiments: novelty effects, sample size, and false positives

AI search experiments inherit all the statistical hazards of standard A/B tests and then add several of their own. Teams that borrow their statistical practices from UI experiments — where the treatment is a static change to a button or headline — will find those practices break down when the treatment is a probabilistic, adaptive system with high output variance and multiple correlated metrics. The failure mode is not always obvious: you can ship a false positive on an AI search test and spend months optimizing a model that was never actually better.

Novelty effects and why your first two weeks of data lie

When users encounter a new search interface or recommendation system, their behavior changes simply because it's new. They explore more, click more, and engage differently than they will after the experience becomes familiar. This inflates early engagement metrics — CTR especially — and can push a mediocre AI model over a significance threshold it doesn't deserve to cross.

The standard mitigations are straightforward but frequently skipped: run the experiment long enough for novelty effects to decay (a minimum of two to four weeks is a common baseline, though the right duration depends on your traffic volume and how frequently your users search), or restrict the experiment to new users who have no prior behavior to compare against. AI search compounds this problem because the model itself may be warming up on new behavioral data during the early experiment window, creating a double novelty effect — the user is adjusting to the interface while the model is adjusting to the user.

Sample size, variance, and the low-traffic reality

Conversion and session success are the metrics that matter most for AI search, and they are also the hardest to measure precisely. They are low-frequency, high-variance events. A power calculation sized for a button-color test will dramatically underestimate what you need for a search relevance experiment.
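A rough power calculation shows how quickly the requirement grows for rare events. The sketch uses the textbook normal-approximation formula for comparing two proportions; the function name and the example rates are illustrative, and a real plan should use your own baseline numbers.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect a relative lift
    in a rate metric with a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Same 5% relative lift, very different baselines (made-up rates).
needed_for_ctr = users_per_group(baseline_rate=0.30, relative_lift=0.05)
needed_for_conversion = users_per_group(baseline_rate=0.02, relative_lift=0.05)
```

With these example inputs, the low-baseline conversion metric needs many times more users per group than the click metric to detect the same relative lift.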

This is not a theoretical concern. The practitioner community is direct about it: most teams simply don't have enough traffic to run statistically valid tests on low-frequency outcomes, and running an underpowered test is not a neutral act — it produces noise you might mistake for signal. GrowthBook's Minimum Data Thresholds guardrail addresses this at the platform level by flagging conclusions drawn from implausibly small samples (the docs give the example of 5 vs. 2 conversions). But the more powerful technique is CUPED — Controlled-experiment Using Pre-Experiment Data — which reduces metric variance by controlling for pre-experiment user behavior. CUPED effectively increases statistical power without requiring more users, which matters enormously when your conversion events are sparse.
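The core of CUPED fits in a few lines. This is a minimal sketch of the standard adjustment, not GrowthBook's implementation: each user's metric is shifted by how far their pre-experiment value sits from the average, scaled by a coefficient `theta`. The data is made up, and in a real analysis `theta` would be estimated on users from both groups pooled together before comparing the adjusted values by group.

```python
def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted metric values.

    metric: in-experiment value per user.
    pre_metric: the same users' pre-experiment value, in the same order.
    """
    n = len(metric)
    x_bar = sum(pre_metric) / n
    y_bar = sum(metric) / n
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric))
    variance_x = sum((x - x_bar) ** 2 for x in pre_metric)
    theta = covariance / variance_x if variance_x else 0.0
    # Subtract the part of each value that pre-experiment behavior explains.
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


pre = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # e.g. searches in the prior month
metric = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]     # e.g. searches during the test
adjusted = cuped_adjust(metric, pre)
```

The adjustment leaves the mean unchanged and shrinks the variance whenever the pre-experiment values are correlated with the metric, which is where the extra statistical power comes from.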

Multiple comparisons and the false positive stack

AI search tests don't evaluate a single metric. A properly instrumented experiment tracks relevance signals, CTR, conversion, session success, and deflection rates simultaneously: five or more metrics evaluated at the same significance threshold. The problem is that false positive probability compounds across metrics. GrowthBook's documentation states it directly: if you test the same hypothesis at a 5% significance level across 20 different metrics, the probability of finding at least one statistically significant result by chance alone is approximately 64%. For a more realistic AI search stack of five independent metrics, the same calculation gives roughly a 23% chance of at least one false positive, which should still make any team uncomfortable. Search metrics are usually correlated with one another, which changes the exact figure but does not make the problem go away.
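The arithmetic behind these figures is short enough to check yourself. The sketch assumes the metrics are independent and that no real effect exists on any of them; the function name is illustrative.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent metrics
    when each is tested at significance level alpha."""
    return 1 - (1 - alpha) ** num_metrics


risk_with_5_metrics = family_wise_error_rate(5)
risk_with_20_metrics = family_wise_error_rate(20)
```

Five metrics give a risk of about 0.23 and twenty give about 0.64, matching the 64% figure quoted above.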

Multiple-comparison corrections work by raising the significance bar for individual metrics when you're evaluating several at once: the more metrics you test, the higher each individual metric's bar needs to be to keep your overall false positive rate under control. The standard approaches fall into two families. Family-wise error rate corrections such as Bonferroni are conservative: they tighten the threshold uniformly across all metrics to limit the chance of even one false positive. False discovery rate corrections, of which the Benjamini-Hochberg procedure is the standard example, are more permissive: they control the expected proportion of false positives among the results you declare significant rather than trying to eliminate them entirely. GrowthBook ships both Benjamini-Hochberg and Bonferroni as built-in corrections.
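The two corrections are easy to compare side by side. The sketch below implements the textbook versions of each procedure, which may differ in detail from what any particular platform ships; the p-values are made up for a five-metric stack.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag a metric as significant only if p <= alpha / number of metrics."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant


# Hypothetical p-values: relevance, CTR, conversion, session success, deflection.
p_values = [0.004, 0.012, 0.028, 0.039, 0.20]
bonferroni_result = bonferroni(p_values)
bh_result = benjamini_hochberg(p_values)
```

On these inputs Bonferroni keeps only the first metric, while Benjamini-Hochberg keeps the first four. That gap is the practical meaning of "conservative" versus "more permissive".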

The practical implication is that you should decide which metrics are primary before the experiment runs, apply corrections to your full metric stack, and resist the temptation to go hunting through segments and time windows after the fact — that path leads to the Texas Sharpshooter problem, where you find significance by looking in enough places.

Sequential testing and SRM detection

Two additional tools deserve specific attention for AI search experiments. Sequential testing allows valid inference at any point during an experiment without inflating false positive rates — it addresses the peeking problem, which is acute for AI search tests where teams feel pressure to call results early when a new model appears to be performing well. Standard fixed-horizon tests are not designed for mid-experiment reads; sequential testing is.

Sample Ratio Mismatch detection catches a different class of failure: traffic splits that don't match what you configured. With thousands of users per group, a 48/52 split when you expected 50/50 is a signal that something in your bucketing or event logging is broken, and results from a test with an SRM are not trustworthy regardless of what the metrics show. In AI search experiments, SRM can occur if experiment ID tagging is inconsistent across event types or if the model itself influences which user interactions get logged. GrowthBook runs SRM detection automatically as a data quality check, alongside suspicious uplift detection, which flags metric changes that are implausibly large and likely indicate an instrumentation bug rather than a real effect. For AI search, where a model error can produce dramatic metric swings, that guardrail is worth having.
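An SRM check is a chi-squared goodness-of-fit test on the group counts. The sketch below handles the two-group case using only the standard library; the function name is illustrative, and the very low p-value cutoff often used for SRM alerts (such as 0.001) is a convention to tune, not a rule from the sources cited here.

```python
from math import erfc, sqrt


def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """p-value for the observed split against the configured split
    (chi-squared test with one degree of freedom)."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi_squared = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return erfc(sqrt(chi_squared / 2))


large_test = srm_p_value(4800, 5200)  # 48/52 split across 10,000 users
small_test = srm_p_value(48, 52)      # 48/52 split across 100 users
```

The same 48/52 ratio is strong evidence of a broken split at 10,000 users and entirely unremarkable at 100, which is why SRM is evaluated with a test rather than by eyeballing percentages.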

The broader point is that statistical rigor in AI search testing is not a single decision about significance thresholds. It's a layered system of countermeasures — novelty mitigation, variance reduction, multiple-comparison corrections, and traffic integrity checks — each targeting a distinct failure mode. Skipping any one of them doesn't just weaken your results; it can produce confident, wrong conclusions about which model to ship.

After the baseline wins: shifting from AI vs. legacy to config-vs-config iteration

"When it comes to AI, evals are just the tip of the iceberg. A/B testing is where real value gets created." That framing captures something important about where most teams go wrong: they treat their first AI search experiment as a decision, when it's actually the foundation for a program.

Once your AI-powered search or recommendation system has beaten the legacy baseline and been fully adopted, the experiment that got you there has served its purpose. The infrastructure you built — user bucketing, experiment ID tagging, a layered metric stack — is now the engine for something more valuable: continuous iteration on the system itself.

From baseline to config-vs-config

The shift in framing matters. You're no longer asking "does AI search beat keyword search?" You're asking "which version of AI search wins?" That's a fundamentally different experimental posture, and it opens up a much richer hypothesis space.

In practice, config-vs-config testing means running controlled experiments on the parameters that govern how your AI search system behaves: retrieval strategy weights, reranking model choices, embedding model versions, boost specifications for certain product categories, personalization signal inclusion or exclusion. Each of these is a testable hypothesis with measurable downstream effects on the metrics you've already instrumented. The experiment infrastructure doesn't change — you're still bucketing users consistently, tagging events with experiment IDs, and measuring against the same metric stack. What changes is that both variants are AI search configurations rather than AI versus legacy.

Landon Smith, Head of Post-Training at Character.AI, describes exactly this kind of iteration in practice: "GrowthBook has been an invaluable tool for Character.AI, helping us develop our models into a great consumer experience. We can compare different modeling techniques from the perspective of our users — guiding our research in the direction that best serves our product." That's the model: user-facing metrics, not internal evals, as the arbiter of which configuration wins.

One practical caution here, drawn from practitioner experience: config changes tested during iteration can silently harm users if you're not watching guardrail metrics closely. The same rigor that protected your initial experiment — pre-registered guardrails, kill switches, SRM detection — applies equally to follow-on tests. Iteration done carelessly is how teams ship regressions they can't explain.

Tracking cumulative impact across the program

Individual AI search experiments often produce modest effect sizes. A 1–2% improvement in click-through rate or session success rate is meaningful, but it doesn't look impressive in isolation. The compounding value of an iteration program only becomes visible when you measure across experiments, not within them.
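The compounding is simple arithmetic. This sketch makes an optimistic assumption, labeled here because real programs rarely satisfy it fully: each shipped lift persists and multiplies independently of the others. The lift values are made up.

```python
from math import prod


def cumulative_lift(relative_lifts):
    """Combined relative lift from a series of shipped wins,
    assuming each lift persists and the effects multiply."""
    return prod(1 + lift for lift in relative_lifts) - 1


# Four modest wins on the same metric (hypothetical values).
shipped_lifts = [0.01, 0.02, 0.015, 0.01]
total_lift = cumulative_lift(shipped_lifts)
```

Four wins between 1% and 2% combine to a lift of roughly 5.6% under these assumptions, a number that no single experiment in the series would have shown.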

This is the argument for tracking cumulative impact as a first-class metric for your experimentation program. Some experimentation platforms offer a dashboard that shows the cumulative impact of all experiments on a given metric, with North Star metric performance tracked over time so teams can see how each experiment contributed to the trend.

Tracking win rate and experiment frequency adds another dimension. Teams should monitor whether they're running the right mix of bold bets (new retrieval architectures, embedding model swaps) and incremental config tweaks. A program that only runs safe, small experiments will compound slowly. One that only swings for large changes will have high variance in outcomes and learn less per cycle.

Building a learning library to prevent redundant testing

As an AI search experiment program matures, institutional knowledge becomes both a competitive asset and a liability if it isn't captured. Teams that don't document what they've tested will re-run experiments already answered — wasting cycles, confusing stakeholders, and potentially reversing decisions that were made for good reasons no one remembers.

A learning library solves this. GrowthBook's implementation surfaces experiments that worked and those that didn't, making past results searchable and usable for informing future hypotheses. Some platforms go further, using vector embeddings to flag potentially redundant experiments before they're launched — catching "we already tested this" before it becomes a wasted sprint. AI-generated result summaries help teams quickly process and archive learnings from completed tests, which matters when iteration velocity is high and the volume of completed experiments starts to accumulate.

The broader point is that a learning library is what transforms a collection of individual tests into a compounding organizational capability. Without it, every new team member, every new quarter, every new model release starts from scratch. With it, your experiment history becomes infrastructure — the same way your event logs and metric definitions are infrastructure.

The teams that will extract the most value from AI-powered search aren't the ones who ran the best single experiment. They're the ones who built the program.

The infrastructure decisions that determine whether your AI search results are interpretable

The through-line of this article is that AI search experiments fail silently. The feedback loop problem, the contamination risk, the moving target — none of them produce an alert. They produce a result that looks like a result, and you ship something based on it. That's the actual risk, and it's why the infrastructure decisions covered here aren't optional overhead. They're what makes the output of your experiment mean something.

The non-negotiables: experiment design decisions you can't skip

The three things that will break an AI search experiment before it starts are inconsistent user bucketing, missing experiment IDs on event logs, and traffic splitting handled by the AI system itself. Get any one of these wrong and your groups are no longer independent — and you won't know. These aren't configuration preferences; they're the structural requirements that make everything else in the experiment interpretable.

Pre-registering your metric stack is what separates an interpretable result from a number you'll second-guess

The temptation is to start with CTR because it's easy to instrument and moves quickly. Resist it. Relevance is the upstream signal that validates whether your engagement metrics are measuring anything real. Conversion and session success are the ground-truth layer. Zero-result rate and query reformulation rate are the guardrails that catch regressions your primary metrics will miss.

Decide which metrics are primary before the experiment runs — not after you've seen the data — and apply multiple-comparison corrections to the full stack. Per-metric significance thresholds and built-in multiple-comparison corrections make this tractable without requiring a statistics degree.

The first experiment is an infrastructure investment, not a decision

The tension worth holding onto: your first experiment is not the goal. It's the infrastructure investment that makes every subsequent experiment cheaper and more reliable. The teams that compound value from AI search aren't the ones who ran one careful test — they're the ones who built the program, documented what they learned, and kept iterating on ranking configs and personalization parameters with the same rigor they applied to the baseline comparison.

The honest caveat is that this takes longer than most teams expect. Novelty effects alone require two to four weeks of runtime before your data stabilizes. Low-traffic products may need CUPED variance reduction just to reach adequate power on conversion metrics. That's not a reason to skip the work — it's a reason to size the timeline correctly from the start.

This article was written to give you a complete picture of what reliable AI search experimentation actually requires — not to make it sound simpler than it is, but to make it approachable enough that you can start with confidence.

What to do next: If you don't have a test running yet, your first action is to audit your event logging. Check whether every user event your search system generates can be tagged with an experiment ID and stored somewhere queryable. If it can't, that's your blocking dependency — fix it before you design the experiment. If your logging is already in place, define your metric stack before you start: a relevance signal as your upstream validator, CTR and dwell time as engagement proxies, conversion as your primary success criterion, and zero-result rate as a guardrail. The specific metrics matter less than the commitment to pre-registering them — deciding what counts as a win before you see the data is what makes the result interpretable.
