How much traffic do you need to test AI features reliably?

Jun 8, 2026

Your standard sample size calculator will give you a confident number for an AI feature experiment — and that number will probably be wrong.

Not because the math is broken, but because AI features violate the core assumptions the math depends on. They're non-deterministic: the same user can see meaningfully different outputs across sessions, which adds a second layer of variance on top of normal user behavior differences. That extra noise makes your pre-experiment variance estimates too low and your effect size assumptions unreliable.

The result is an experiment that looks properly designed but is structurally underpowered.

This article is for engineers, PMs, and data teams who are shipping or planning to ship AI features — LLM-powered recommendations, AI writing tools, conversational interfaces — and need to know how much traffic to plan for and why the usual approach falls short. Here's what you'll learn:

Why AI feature non-determinism inflates metric variance and breaks standard power calculations
How to estimate variance and effect size when you don't have clean historical analogs
How to calculate the AI feature test traffic you actually need, with a worked example
How to allocate and segment traffic so your sample is valid, not just large enough
What to do when you genuinely don't have enough traffic to run a fully powered experiment

The article moves in that order — from the root cause, through the math, through the operational details, and finally to practical fallbacks for low-traffic situations. Each section builds on the last, but you can also jump to whichever problem you're facing right now.

Why AI features break standard sample size assumptions

When you're planning an experiment on a new checkout flow or a redesigned navigation element, the statistical setup is relatively straightforward. You pull historical conversion data to estimate variance, make a reasonable assumption about the lift you're trying to detect, and run the numbers through a power calculator.

The result is a sample size you can trust — because the feature you're testing behaves the same way for every user who sees it.

AI features don't work that way. And if you apply the same planning process to an LLM-powered recommendation engine or an AI-generated content feature, you'll end up with a sample size estimate that looks credible but is systematically too small.

The experiment will run, reach your calculated threshold, and return a result that may be structurally underpowered — without any obvious sign that something went wrong.

Non-determinism is the root cause

The defining characteristic of AI features is that they are non-deterministic: the same input can produce different outputs across sessions. A button color is identical for every user who sees it. An LLM-generated product recommendation shown to User A on Monday may be substantively different from what that same user sees on Wednesday — different phrasing, different items surfaced, different implicit framing.

The "treatment" in your experiment is not a fixed stimulus. It's a distribution of stimuli, varying across users and across time.

This isn't a bug or an implementation detail. It's the nature of how these systems work. Traditional software features "can be thoroughly tested with unit tests and integration tests". Once those tests pass, teams can trust the feature will continue working as expected.

AI features require quality thresholds instead — and as Stack Overflow discovered when it rolled back its conversational search feature for failing to reach a 70% accuracy threshold, a feature can pass every functional test while still failing to deliver consistent user-facing value.

How non-determinism inflates metric variance

In a standard A/B test, metric variance is driven primarily by differences in user behavior. For a conversion rate experiment, some users convert and some don't — that behavioral heterogeneity is the variance you're estimating from historical data.

When you add an AI feature, you introduce a second source of variance: output variability. Now your metric reflects user behavior differences plus the variability in what the AI actually showed each user plus the interaction between those two factors.

The practical consequence is that pre-experiment variance estimates based on historical metric data will be too low. Your historical engagement data on "recommendation clicks" doesn't capture the additional noise introduced by non-deterministic outputs.

Standard power calculators assume variance is a stable, knowable input — for AI features, it isn't. The calculator doesn't warn you. It just returns a number that's smaller than it should be.
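One way to make the two sources of variance concrete is the law of total variance. This is a sketch under simplifying assumptions: Y stands for a user's metric value and O for the AI output that user happened to receive. Both symbols are introduced here for illustration and are not part of any calculator's notation.

```latex
\operatorname{Var}(Y)
  = \underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid O)\big]}_{\text{user behavior, output held fixed}}
  + \underbrace{\operatorname{Var}\big(\mathbb{E}[Y \mid O]\big)}_{\text{added by output variability}}
```

The second term is never negative. If your historical data comes from features whose output did not vary, it mostly informs the first term (an assumption, not a guarantee), so a variance estimate taken from it is best read as a lower bound.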

Effect sizes are harder to predict

Sample size calculations also require an assumed minimum detectable effect. For a UI change, you can often anchor this estimate to analogous past experiments. For an AI feature, the effect size depends on model quality, prompt design, user segment, and context — none of which have clean historical analogs.

The relationship between a model's technical performance metrics and its actual user-facing impact is non-linear and difficult to anticipate in advance.

Effect sizes for AI features also tend to be highly heterogeneous across user segments. Power users and casual users often respond very differently to AI-generated suggestions, which means a single MDE assumption can mask a situation where the feature has a strong effect on one segment and a null or negative effect on another.

Averaging across that heterogeneity produces an effect size estimate that may not represent any real subpopulation accurately.

Why your sample size calculator will fail you

The failure mode here is subtle. Standard sample size calculators aren't broken — their math is correct. The problem is that the inputs they require (stable variance, a single predictable effect size, a deterministic treatment) are assumptions that AI features violate.

The calculator produces a confident, plausible-looking number, and nothing in the output signals that the inputs were wrong.

Teams run the experiment, hit the calculated sample size, and declare a result. What they've actually run is an underpowered experiment built on variance estimates that were too optimistic and effect size assumptions that may not reflect the feature's actual behavior in production.

GrowthBook's framing captures the failure pattern well: the industry has a recognized tendency to ship AI features based on "vibe checks" rather than rigorous measurement, and applying a standard sample size formula to a non-standard problem is a more sophisticated version of the same mistake.

The rest of this article addresses how to fix the inputs, not just the formula.

High variance and uncertain effect sizes: the core AI feature test traffic problem

Every sample size calculation reduces to two inputs: variance and effect size. Get either one wrong, and your AI feature test traffic estimate is wrong — and with AI features, both are genuinely hard to pin down before the experiment runs.

This isn't a minor calibration issue you can correct with a slightly larger safety margin. It's a structural problem that makes standard pre-experiment planning unreliable.

Why AI outputs inflate metric variance

When the treatment itself is heterogeneous across users and sessions — as it is for any LLM-powered feature — the metric distribution widens, and the sample size required to detect a given effect grows with it. Different user segments — power users versus casual users, early adopters versus mainstream — react to AI-generated content in ways that diverge substantially from each other.

GrowthBook's AI testing playbook frames the challenge as "tuning AI responses across thousands of use cases." That breadth isn't incidental — it's precisely what drives variance. Add the novelty effect on top: tech-savvy early adopters tend to respond more positively to AI features than mainstream users, which means the variance you observe early in an experiment often understates the variance you'll see once the full user population is exposed.

Estimating variance before the experiment runs: three defensible approaches

The honest answer is that you can't know your AI feature's metric variance until you've run the experiment — but you can make a defensible estimate. Three approaches are worth considering.

The most reliable starting point is historical data on the closest analogous metric you have. A power analysis tool that uses the last eight weeks of historical data to estimate baseline metric values and expected traffic for a given population gives you a grounded starting point rather than a guess.

If you're launching an AI-powered recommendation feature, look at the variance in engagement metrics for your existing recommendation system, even if it's rule-based. The distributions won't be identical, but they'll be closer than starting from nothing.

If historical analogues are weak, a limited pilot rollout — even to a small percentage of traffic — gives you empirical variance data before you commit to a full experiment design. The variance estimate from a pilot will be noisy, but it's real signal.

When neither option is available, apply a conservative multiplier to your initial estimate. The direction of the error matters: underestimating variance produces an underpowered experiment, which is a worse outcome than overestimating it and running slightly longer than necessary.
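To see what a conservative multiplier costs, the sketch below scales a historical variance estimate and recomputes the per-arm sample size for a difference in means. The standard deviation, the target lift, and the multipliers are hypothetical values chosen for illustration, and the formula is the usual two-sided normal approximation rather than any specific platform's calculation.

```python
import math
from statistics import NormalDist

def users_per_arm(sd, abs_effect, alpha=0.05, power=0.80):
    """Per-arm sample size for comparing two means (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z * sd / abs_effect) ** 2)

# Hypothetical inputs: historical standard deviation of 4.0 clicks per user
# and a target absolute lift of 0.2 clicks per user.
historical_sd = 4.0
target_lift = 0.2

for variance_multiplier in (1.0, 1.25, 1.5, 2.0):
    # A multiplier on the variance scales the standard deviation by its square root.
    inflated_sd = historical_sd * math.sqrt(variance_multiplier)
    n = users_per_arm(inflated_sd, target_lift)
    print(f"variance x{variance_multiplier:<4}: {n:,} users per arm")
```

Required traffic grows in direct proportion to the variance, so a 1.5x multiplier means roughly 1.5x the runtime at the same traffic level. That is the price of the insurance, and it is usually cheaper than an inconclusive experiment.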

Why effect sizes are harder to predict for AI features

For a conventional product change, you often have a reasonable prior on expected lift — historical experiments on similar UI changes, industry benchmarks, or a tight relationship between the change and the metric. For an AI feature, the effect on user behavior depends on model quality, prompt design, user context, and the interaction between all three. These factors don't produce a stable prior.

GrowthBook's power analysis documentation puts it directly: "Obviously you do not have complete information about the true lift, otherwise you would not be running the experiment." Their explicit recommendation is to run power analysis across a range of effect sizes — optimistic, expected, and pessimistic — rather than anchoring to a single point estimate.

For AI features, the pessimistic scenario deserves particular attention, because the population-average effect is often smaller than the effect observed in your most engaged users, who tend to be overrepresented in early data.
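A range-based power analysis can be sketched in a few lines. The baseline rate, the three relative lifts, and the weekly traffic below are hypothetical planning inputs, and the calculation uses the standard unpooled normal approximation for two proportions, which may differ from what a given platform's power calculator does.

```python
import math
from statistics import NormalDist

def users_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided test of two proportions."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance_sum / (p2 - p1) ** 2)

# Hypothetical planning inputs, not taken from any real experiment.
baseline_rate = 0.10
weekly_users_per_arm = 5_000
scenarios = {"optimistic": 0.15, "expected": 0.08, "pessimistic": 0.04}

for name, lift in scenarios.items():
    n = users_per_arm(baseline_rate, lift)
    weeks = math.ceil(n / weekly_users_per_arm)
    print(f"{name:<12} lift={lift:.0%}  n/arm={n:,}  weeks={weeks}")
```

Halving the assumed lift roughly quadruples the required sample, which is why the pessimistic row tends to dominate the planning conversation.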

The cost of getting either wrong

Underestimating variance means your minimum detectable effect is larger than you planned for. In GrowthBook's documented power analysis example, the MDE at Week 1 is 34.5% — only very large effects are detectable that early. If your AI feature produces a real but modest improvement, an underpowered experiment will simply miss it.

Overestimating effect size produces the same outcome through a different path: you design for a lift that doesn't materialize, and the experiment lacks the power to detect what's actually there.

The standard 80% power threshold means a correctly specified experiment still misses real effects one time in five. Underspecification makes that rate worse in ways that are hard to recover from mid-experiment without inflating your false positive risk.

Setting conservative assumptions as a feature, not a bug

The practical implication is straightforward: when you're uncertain about variance or effect size for an AI feature — which is most of the time — assume higher variance and smaller effects than your intuition suggests. GrowthBook's guidance frames this as asking, "Suppose your feature impact is smaller than you think — how small could it be?" and designing the experiment to have adequate power at that pessimistic estimate.

Conservative assumptions translate into longer runtimes or larger traffic allocations. That's not a failure of planning — it's the correct response to genuine uncertainty. An experiment designed for realistic conditions is far more valuable than one that reaches a conclusion quickly but can't be trusted.

Calculating the AI feature test traffic you actually need

The inputs that make AI feature experiments hard to plan — elevated variance, uncertain effect sizes, non-deterministic treatments — don't change the mechanics of the sample size calculation itself. The formula still requires four inputs: your baseline metric value, the effect size you want to detect, the variance of your metric, and your desired statistical power.

What changes is how difficult it is to specify those inputs honestly. The goal of this section is to walk through those mechanics with AI-specific inputs, then show concrete techniques that can reduce the traffic requirement when your numbers look daunting.

The standard sample size formula and its inputs

Power is the probability of detecting a real effect when one exists. The conventional threshold is 80%, meaning that if the true effect is as large as the one you planned for and you ran the same experiment 100 times with different randomizations, you'd expect to observe a statistically significant result in about 80 of them.

The minimum detectable effect (MDE) is the smallest effect size for which your experiment achieves that 80% power threshold. Smaller MDEs are better: they mean you can detect subtler improvements.

The relationship between these inputs is straightforward but unforgiving. Sample size grows as effect size shrinks, as variance increases, or as you raise your power requirement. For a simple revenue metric, if your control averages $100 and your treatment averages $102, your effect size is 2%.

Detecting a 2% lift requires substantially more traffic than detecting a 10% lift at the same variance and power level.
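For the simplest case, the relationship can be written out explicitly. This is a sketch assuming a two-sided test on a difference in means, equal allocation, and a common standard deviation σ in both arms. The symbols n, σ, and δ (the absolute difference to detect) are introduced here for illustration.

```latex
n_{\text{per arm}} \approx
  \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}},
\qquad
\frac{n(\text{2\% lift})}{n(\text{10\% lift})} = \left(\frac{10}{2}\right)^{2} = 25
```

With α = 0.05 and 80% power, the z terms sum to about 1.96 + 0.84 = 2.80. Because n scales with 1/δ², the 2% lift needs about 25 times the traffic of the 10% lift at the same variance, and any increase in σ² raises n in direct proportion.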

For AI features, the practical challenge is that you often don't know your effect size going in, and your metric variance is likely higher than historical baselines suggest. The right response is to run your power analysis across a range of effect sizes — an optimistic scenario, a realistic best guess, and a pessimistic smaller-than-expected case — so you understand the runtime implications under each assumption rather than anchoring on a single number.

CUPED reduces traffic requirements by absorbing pre-existing behavioral variance

CUPED is a statistical technique that strips out the noise that was already present before your experiment started. Here's the intuition: some users are just heavier users of your product than others, and that pre-existing difference shows up in your metrics regardless of which variant they're assigned to.

CUPED identifies those pre-existing behavioral patterns and removes their contribution from the result — leaving behind a cleaner signal of what the treatment actually caused. Less noise in the estimate means you need fewer users to reach the same level of confidence.

For AI features specifically, CUPED is particularly valuable because user engagement level is one of the biggest drivers of heterogeneous reactions to AI-generated content. A power user who already relies heavily on the product will respond to an AI suggestion feature differently than a casual user who opens the app twice a month.

CUPED can absorb much of that pre-existing behavioral difference, tightening your estimate.
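The core adjustment is small enough to sketch directly. The example below uses a single pre-experiment covariate and made-up per-user session counts. In practice the coefficient is typically estimated on data pooled across both arms and the adjustment is then applied within each arm, and platform implementations differ in their details.

```python
from statistics import mean, variance

def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted metric values using one pre-experiment covariate."""
    x_bar, y_bar = mean(pre_metric), mean(metric)
    # Sample covariance between the pre-period covariate and the metric.
    cov_xy = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)
    # Subtract the part of each value explained by pre-period behavior.
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]

# Hypothetical data: sessions per user before the experiment (pre) and
# during it (post). Heavy users before tend to be heavy users during.
pre = [2, 3, 5, 8, 12, 1, 4, 9, 15, 6]
post = [3, 3, 6, 9, 14, 2, 4, 11, 16, 7]

adjusted = cuped_adjust(post, pre)
print(f"raw variance:      {variance(post):.2f}")
print(f"adjusted variance: {variance(adjusted):.2f}")
print(f"means:             {mean(post):.2f} vs {mean(adjusted):.2f}")
```

The adjustment leaves the mean unchanged and shrinks the variance by a factor of one minus the squared correlation between the covariate and the metric, so the benefit depends entirely on how predictive pre-experiment behavior is.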

Post-stratification serves a similar variance-reduction purpose and is reported to run experiments approximately 20% faster. Both techniques are built into some experimentation platforms, so they don't require custom statistical infrastructure.

Sequential testing for AI experiments

Sequential testing lets you check results at multiple points during an experiment without inflating your false positive rate — a meaningful capability for AI experiments where you may want to stop early if a model is causing harm, producing degraded outputs, or clearly underperforming.

The trade-off is direct: enabling sequential testing reduces power, which means you need more traffic or a longer runtime to reach the same 80% power threshold you'd achieve with a fixed-horizon test. This isn't a reason to avoid sequential testing — it's a reason to model the power impact before you commit.

A power calculator with sequential testing support lets you toggle this in analysis settings and see the effect on your projected runtime before the experiment launches.

Bayesian vs. frequentist — a practical choice for AI tests

Modern experimentation platforms support both Bayesian (with custom priors) and frequentist statistical engines, and the choice between them is a practical one, not a philosophical one. The relevant question is: do you have reliable data about how the AI model performs before it goes live?

This might come from internal model evaluations, from running the model in the background without showing results to users (sometimes called shadow mode), or from benchmark tests on historical data.

If you have that kind of data, a Bayesian approach lets you feed it into the analysis as a starting assumption — which can shorten the time to a confident conclusion because you're not starting from scratch. If you don't have reliable pre-launch data, or if your model behaves differently in production than it did in testing, the frequentist approach is the safer default.
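To illustrate how prior data shortens the path to a conclusion, here is a minimal Beta-Binomial sketch for a single acceptance rate. It is a generic textbook update, not a description of how any particular platform's Bayesian engine works, and every number in it is hypothetical.

```python
def beta_posterior(prior_a, prior_b, successes, trials):
    """Update a Beta(prior_a, prior_b) prior with binomial data."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    post_mean = a / (a + b)
    post_var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return post_mean, post_var ** 0.5

# Hypothetical numbers. A shadow-mode evaluation suggested about 20% of
# suggestions would be accepted, based on roughly 200 judged sessions.
informative_prior = (40, 160)  # Beta(40, 160): mean 0.20, worth ~200 sessions
flat_prior = (1, 1)            # Beta(1, 1): uniform, no prior information

# Early live data: 30 accepted suggestions out of 120 sessions.
for label, (a, b) in (("informative", informative_prior), ("flat", flat_prior)):
    m, sd = beta_posterior(a, b, successes=30, trials=120)
    print(f"{label:<12} posterior mean={m:.3f}  sd={sd:.3f}")
```

The informative prior yields a tighter posterior from the same live data, but it also pulls the estimate toward the pre-launch figure. If production behavior differs from the evaluation, that pull is a bias, which is the risk the frequentist default avoids.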

The practitioner community debates this actively, with some arguing Bayesian is strictly superior and others treating frequentist as a more defensible standard for organizational buy-in. The honest answer is that both are valid, and the right choice depends on the quality of your prior data.

A worked example: AI reply suggestions for customer support

Consider a concrete AI feature test scenario: an AI reply-suggestion tool for customer support agents. The primary metric is agent adoption rate — the share of sessions where an agent accepts at least one AI suggestion. Baseline adoption is 18%. The team wants to detect a 3 percentage point absolute improvement (to 21%), with α = 0.05 and 80% power. Eligible sessions per week: 8,400, split 50/50 between control and treatment.

Using the standard normal approximation for a two-sided test of two proportions with p₁ = 0.18 and p₂ = 0.21, the required sample size works out to approximately 2,735 sessions per group, or roughly 5,470 total. At 8,400 eligible sessions per week with a 50/50 split, that's about 4,200 sessions per arm per week, meaning the experiment could reach 80% power in under a week under these assumptions. That's a favorable scenario.
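The calculation can be reproduced from the stated inputs with a short sketch. It uses the unpooled normal approximation for two proportions. Other variants (pooled variance, continuity correction) give slightly different figures, and none of them account for the extra output variability discussed earlier.

```python
import math
from statistics import NormalDist

def sessions_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Two-sided, two-proportion sample size (unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

n = sessions_per_arm(0.18, 0.21)  # baseline 18%, target 21%
weekly_per_arm = 8_400 / 2        # 50/50 split of eligible sessions

print(f"sessions per arm:   {n:,}")
print(f"total sessions:     {2 * n:,}")
print(f"weeks to 80% power: {n / weekly_per_arm:.2f}")
```

Rerunning it with a smaller target, such as 0.18 versus 0.19, shows how quickly the favorable scenario disappears when the expected improvement shrinks.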

Now compare it to a tighter traffic situation. GrowthBook's own power calculator documentation illustrates an experiment with approximately 2,195 users per week, a 14.41% baseline conversion rate, and a 20% expected relative effect size. At that traffic level, the experiment reaches 80% power at three weeks.

In week one, the MDE is 34.5% — meaning you can only detect very large effects early on. As users accumulate, the MDE shrinks and power increases. If your weekly traffic is lower than this, or your expected effect size is smaller, you may be looking at runtimes measured in months rather than weeks.

The practical takeaway: run your power analysis before you launch, use realistic variance estimates from analogous historical metrics, and model multiple effect size scenarios. If the numbers show a 12-week runtime to detect a plausible effect, that's not a calculation to ignore — it's a signal to either find variance reduction techniques, reconsider your primary metric, or set honest expectations about what the experiment can and cannot tell you.

Traffic allocation and segmentation for AI feature experiments

Hitting your target sample size is necessary but not sufficient. In AI feature experiments, the composition of your test population shapes your results as much as its size. A statistically adequate sample drawn from the wrong segment — or exposed in the wrong sequence — can produce results that are internally valid but externally meaningless, or worse, actively misleading.

Defining who actually belongs in the experiment

Not every user in your system is a meaningful participant in an AI feature test. The AI feature needs to produce substantive output for the experiment to measure anything real, which means participants generally need sufficient interaction history, relevant context, and technical eligibility — logged-in state, the right platform, enough session depth for the feature to engage.

Exposing users who barely trigger the AI feature dilutes your signal and inflates variance, because you're averaging over a large population where most of the treatment effect is zero.

Defining this eligible population precisely — and consistently — matters more than it does in a standard button-color test. Experiment targeting rules let you specify eligibility using attribute-based rules with AND/OR logic, saved groups, and org-level targeting for B2B contexts.

The key word is consistently: the same eligibility rules must apply to both control and treatment at assignment time. Eligibility criteria that are ambiguous, loosely defined, or applied differently across variations are a documented cause of Sample Ratio Mismatch.

Gradual rollout limits blast radius before full exposure

AI features carry higher risk than deterministic features precisely because their outputs are unpredictable at scale. A gradual rollout — exposing a small fraction of eligible users first, then expanding as you gain confidence — lets you catch unexpected behavior before it affects your entire user base.

The practitioner consensus on this is clear: as one engineer put it in a discussion on production testing, "every new change goes into your codebase behind a flag. You can flip the flag for a limited set of users".

The important implementation detail is that gradual rollout phases should be planned upfront, not adjusted reactively mid-experiment. Feature flags that use deterministic hashing ensure the same user consistently receives the same experience across sessions, which is a prerequisite for valid measurement. GrowthBook's percentage-based rollouts hash the user identifier deterministically (the SDKs use an FNV-1a hash) so the same user always gets the same variant, without requiring server-side session storage.
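The idea behind hash-based assignment can be sketched as follows. This is an illustration of the general technique using a plain 32-bit FNV-1a hash. The key format, the bucket count, and the function names are assumptions made for the example, not the exact procedure of any SDK.

```python
def fnv1a_32(text: str) -> int:
    """32-bit FNV-1a hash of a UTF-8 string."""
    h = 0x811C9DC5
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

def bucket(user_id: str, flag_key: str) -> float:
    """Map a user to a stable position in [0, 1) for a given flag."""
    return (fnv1a_32(f"{flag_key}:{user_id}") % 10_000) / 10_000

def in_rollout(user_id: str, flag_key: str, percent: float) -> bool:
    """True if the user falls inside the first `percent` of the bucket range."""
    return bucket(user_id, flag_key) < percent / 100

users = [f"user-{i}" for i in range(1_000)]
at_10 = {u for u in users if in_rollout(u, "ai-reply-suggestions", 10)}
at_50 = {u for u in users if in_rollout(u, "ai-reply-suggestions", 50)}

print(len(at_10), len(at_50))
print(at_10 <= at_50)  # everyone exposed at 10% is still exposed at 50%
```

Because the bucket depends only on the user identifier and the flag key, no assignment needs to be stored, and widening the rollout only adds users without reshuffling the ones already exposed.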

When you're ready to move from rollout to formal experiment, converting a feature flag directly into an A/B test keeps the rollout and the measurement instrument unified rather than managing them as separate systems.

Avoiding sample ratio mismatch

Sample Ratio Mismatch is the failure mode where the observed traffic split doesn't match the configured split. It invalidates your results silently unless you're checking for it. Experiment platforms surface SRM as a warning in results views, but understanding why it happens in AI feature contexts is what lets you prevent it.
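A basic SRM check is a chi-square goodness-of-fit test on the observed counts. The sketch below handles the two-arm case using only the standard library. The counts and the 0.001 threshold are illustrative assumptions, and platforms may use a different threshold or test.

```python
import math

def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a two-arm split."""
    total = control_n + treatment_n
    expected = (total * expected_control_share, total * (1 - expected_control_share))
    observed = (control_n, treatment_n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for an experiment configured as 50/50.
p = srm_p_value(control_n=5_200, treatment_n=4_800)
print(f"SRM p-value: {p:.6f}")
if p < 0.001:  # strict threshold chosen for illustration
    print("Likely sample ratio mismatch: investigate before reading results.")
```

A 52/48 split looks harmless by eye, but at 10,000 users it is very unlikely under a true 50/50 assignment, which is why the check belongs in the pipeline and not in a visual inspection.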

AI feature architectures create elevated SRM risk for several specific reasons. Server-side assignment combined with client-side tracking — common when an AI feature runs inference on the backend — creates opportunities for tracking to fire inconsistently.

Misconfigured Activation Metrics can trigger at different rates across variations if the AI feature itself changes session behavior (users who see an AI recommendation may stay longer, generating more events). Mid-experiment targeting changes are another documented cause: if you tighten or expand eligibility criteria after the experiment starts without creating a new phase with re-randomization, your traffic split will drift.

The most reliable mitigation is running an A/A test before the real experiment launches. This validates that traffic is flowing correctly and that your metrics are instrumented consistently across both groups — before any treatment effect is in play.

Managing the novelty effect

AI features are particularly susceptible to novelty effects. Users who are first exposed to an AI-powered experience often respond more positively simply because it's new and different, not because it's better.

If your initial rollout population skews toward power users or early adopters — which is common when you're doing a gradual rollout and your most engaged users are most likely to encounter the feature first — your early results may look stronger than they'll prove to be at scale.

The practical response is to run the experiment long enough for novelty to decay, and to monitor whether the treatment effect is stable over time or trending downward as users acclimate. Segment composition should also include a representative mix of user types, not just your most active cohort.

An experiment that proves an AI feature works for power users but was never exposed to casual users hasn't answered the question you actually need to answer before a full launch.

When traffic is genuinely insufficient: strategies that don't pretend otherwise

Correcting for the segmentation and allocation problems described in the previous section gets you a valid experiment — but it doesn't conjure traffic that doesn't exist. Insufficient AI feature test traffic is often a genuine structural constraint, not a planning failure you can engineer your way out of.

Your product may simply not have the volume needed to reach statistical significance in a reasonable timeframe — especially given the elevated variance that AI-generated outputs introduce. That's a real situation, and it deserves real strategies rather than a lecture about running your power calculations earlier.

What follows is a prioritized toolkit. Proxy metrics and sequential testing are the highest-leverage technical interventions. Holdout groups are best suited for AI features with compounding, longitudinal effects. Qualitative evaluation is the fallback when quantitative methods genuinely can't deliver signal. Feature flags are the operational layer that makes all of the above safer to execute.

Lower-variance proxy metrics reduce the traffic threshold without sacrificing causal validity

When your primary outcome metric — 30-day retention, revenue per user, long-term engagement — requires enormous sample sizes because of high variance or low base rates, a lower-variance leading indicator can dramatically reduce the traffic you need. For an AI-powered recommendation feature, that might mean measuring immediate click-through on AI suggestions rather than downstream conversion.

For an AI writing assistant, it might mean session depth or output acceptance rate rather than subscription renewal.

The critical design constraint here is causal validity. A proxy metric is only useful if it's genuinely connected to the outcome you care about — not just correlated in historical data, but plausibly on the causal path. Choosing a proxy for convenience rather than validity is how teams end up optimizing for something that doesn't move the needle downstream.

If your experimentation platform runs analysis directly against your data warehouse — querying BigQuery, Snowflake, Redshift, or similar sources where your event data already lives — you can often identify candidate proxy metrics from existing event streams without new instrumentation, which matters when you're already resource-constrained.

Sequential testing makes longer experiments statistically valid, not just longer

Running an experiment longer to accumulate traffic is legitimate. Peeking at results repeatedly while it runs is where the problem starts. Here's the issue: each time you check a metric for significance, there's a small chance you'll see a false positive just by chance — that's what a 95% confidence threshold means. It controls that chance for a single check.

The chances compound as soon as you give chance more than one opportunity. Across five independent checks at α = 0.05, the probability that at least one shows a false positive is roughly 23%, not because anything went wrong, but because chance had five opportunities instead of one. Even two checks gets you to about 10%. The arithmetic is exact for five separate, independent metrics. Repeated looks at the same accumulating data are correlated, so the inflation from peeking is somewhat smaller but still substantial. Multiple-comparison corrections address the multiple-metric case. Sequential testing is the tool that controls the inflation when you need to check results more than once.
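The compounding is easy to verify. The sketch below assumes the checks are independent, which holds for unrelated metrics but only approximately for repeated looks at the same data.

```python
def chance_of_any_false_positive(checks, alpha=0.05):
    """P(at least one false positive) across independent checks at level alpha."""
    return 1 - (1 - alpha) ** checks

for k in (1, 2, 5, 10):
    print(f"{k:>2} independent checks: {chance_of_any_false_positive(k):.1%}")
```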

Sequential testing provides the guardrails that make longer experiments statistically valid. Rather than applying a fixed threshold at a single decision point, sequential testing adjusts significance boundaries dynamically as data accumulates, preserving your error rates whether you stop early or run long.

The practical implication: commit to a maximum experiment duration upfront, use sequential testing to allow early stopping if a clear signal emerges, and don't treat "we'll just run it longer" as a free option. It isn't — not without the right statistical infrastructure in place.

Holdout groups measure cumulative AI impact that two-week tests miss

Some AI features don't produce their full effect in a two-week experiment window. A recommendation engine that adapts to user behavior, or a personalization layer that improves with interaction history, may show modest short-term lift but substantial long-term value. Standard A/B tests measure point-in-time lift; holdout groups measure cumulative, longitudinal impact.

A holdout group is a set of users permanently excluded from the AI feature who serve as a long-running control. The tradeoff is real: those users never receive the feature, which means you're permanently withholding something potentially valuable from a slice of your audience.

That cost is worth bearing when the AI feature is expected to compound over time and you need evidence of that compounding to justify continued investment. Experimentation platforms support holdouts as a named capability, and feature scheduling — the ability to activate and deactivate features at specific times — enables structured measurement windows within holdout designs.

Qualitative evaluation as a complement to quantitative testing

When traffic is genuinely insufficient for statistical significance, qualitative methods provide directional signal that quantitative testing cannot. User interviews, session recordings, human rater studies, and LLM evaluation frameworks (evals) can tell you whether the AI feature is producing coherent, useful outputs — and whether users are engaging with it in the way you intended.

This isn't a replacement for A/B testing. It's the evidence that justifies keeping a feature live while traffic accumulates, or killing it before you sink further resources into something that isn't working.

Feature flags enable safe incremental exposure, but don't replace experimental design

Feature flags let you expose an AI feature to a small percentage of traffic — say, 5 to 10% — monitor for regressions, and gradually increase exposure as confidence builds, without requiring a fully powered experiment from day one. Attribute-based targeting rules with AND/OR logic, saved groups, and org-level targeting for B2B contexts give you control over who sees the feature and when.

One distinction matters here: gradual rollout is not the same as a controlled experiment. Without randomization and a held-out control group, you can't attribute metric changes to the AI feature with any confidence. Feature flags enable safe exposure and operational control; they don't replace experimental design.

What they do provide — particularly when combined with warehouse-connected guardrail metrics — is an automated safety net. If a guardrail metric degrades during rollout, the exposure can be halted automatically before a problem compounds. That's a meaningful protection when you're shipping without full statistical power, which is exactly the situation this section is about.

Planning your AI feature test traffic: matching method to constraint

The through-line of this article is simple: the inputs matter more than the formula. Standard power calculators aren't broken — they just require stable variance and a predictable effect size, and AI features provide neither.

Non-deterministic outputs inflate your metric variance in ways that historical data won't capture, and effect sizes for AI features depend on model quality, prompt design, and user segment in ways that resist clean prior estimates. The result is that a plausible-looking sample size can mask a structurally underpowered experiment.

Estimating traffic requirements before launch: three honest inputs

Before you run a single user through your experiment, you need three honest estimates: your baseline metric variance (drawn from the closest historical analog you have, not from the AI feature itself), a range of plausible effect sizes from pessimistic to optimistic, and a realistic weekly eligible user count that reflects your actual targeting criteria — not your total user base.

Run those numbers through a power analysis at each effect size scenario, and let the runtime implications tell you whether you need variance reduction techniques like CUPED, a lower-variance proxy metric, or a frank conversation about what the experiment can actually detect.
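A minimal sketch of that scenario analysis, assuming a binary metric and the usual normal-approximation formula for a two-sided two-proportion test. The function names, the 10% baseline rate, the 20,000 weekly eligible users, and the `variance_inflation` safety factor are all hypothetical placeholders; substitute your own estimates.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80,
                  variance_inflation: float = 1.0) -> int:
    """Approximate users per arm for a two-sided two-proportion z-test.

    variance_inflation is an assumed safety factor (> 1) for the extra
    variance that non-deterministic AI outputs may add.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = variance_inflation * (p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


def runtime_weeks(n_per_arm: int, weekly_eligible_users: int, arms: int = 2) -> int:
    """Weeks needed to fill every arm from the eligible weekly traffic."""
    return ceil(n_per_arm * arms / weekly_eligible_users)


# Hypothetical relative lifts, from optimistic to pessimistic.
scenarios = {"optimistic": 0.10, "expected": 0.05, "pessimistic": 0.025}
for label, lift in scenarios.items():
    n = users_per_arm(baseline_rate=0.10, relative_lift=lift, variance_inflation=1.25)
    print(f"{label}: {n} users per arm, about {runtime_weeks(n, 20_000)} weeks")
```

Because required sample size scales with the inverse square of the effect, halving the assumed lift roughly quadruples the traffic you need, which is why the pessimistic scenario usually decides whether the experiment is feasible.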

Matching your testing approach to traffic volume and feature type

The right testing method depends on your specific constraints. Use this decision framework to identify your situation:

If you have adequate traffic and a short-horizon metric: Run a standard fixed-horizon experiment with conservative variance assumptions. This is your cleanest path.
If your traffic is tight: Switch to a lower-variance proxy metric earlier in the causal chain — output acceptance rate rather than 30-day retention, for example.
If the AI feature compounds over time: Use a holdout group, not a two-week A/B test. Short-horizon experiments will understate the feature's longitudinal value.
If you're in genuinely low-traffic territory with no good proxy: Qualitative evaluation is not a consolation prize — it's the appropriate tool for the evidence level you can actually achieve.
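The framework above can be written down as a small function, which is sometimes useful for making the decision explicit in a planning doc or review. This is only a sketch: the `Method` enum, the parameter names, and especially the order in which the conditions are checked are assumptions, since the list does not say which rule wins when several apply.

```python
from enum import Enum


class Method(Enum):
    FIXED_HORIZON = "standard fixed-horizon experiment"
    PROXY_METRIC = "experiment on a lower-variance proxy metric"
    HOLDOUT = "long-running holdout group"
    QUALITATIVE = "qualitative evaluation"


def choose_method(weeks_needed: float, max_acceptable_weeks: float,
                  compounds_over_time: bool, has_proxy_metric: bool) -> Method:
    """Map planning constraints to a testing approach (assumed precedence)."""
    if compounds_over_time:
        # Short-horizon tests understate value that accumulates over time.
        return Method.HOLDOUT
    if weeks_needed <= max_acceptable_weeks:
        return Method.FIXED_HORIZON
    if has_proxy_metric:
        return Method.PROXY_METRIC
    return Method.QUALITATIVE


# Hypothetical plan: 14 weeks needed, 6 acceptable, a proxy metric exists.
print(choose_method(14, 6, compounds_over_time=False, has_proxy_metric=True).value)
```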

The tooling layer: variance reduction and flag-to-experiment unification

When your experimentation platform includes a power calculator that models runtime across multiple effect size scenarios before you commit, along with native support for CUPED, post-stratification, and sequential testing, the variance reduction techniques described in this article don't require custom statistical infrastructure.

The ability to convert a feature flag directly into a formal experiment also keeps your gradual rollout and your measurement instrument unified. That matters because the gap between "we're testing this" and "we're measuring this" is where most AI experiment validity problems originate.

Where to start: power analysis before experiment design

Start with your power analysis, not your experiment design. Pull eight weeks of historical data on the closest metric analog you have, estimate your weekly eligible user count using your actual targeting criteria, and run the numbers at three effect size scenarios: the lift you're hoping for, half that lift, and a quarter of it.

If the pessimistic scenario shows a runtime you can't accept, that's your signal to either find a lower-variance proxy metric or apply CUPED before you launch — not after you've been running for six weeks and the results look inconclusive.
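To get a feel for how much CUPED could help before you launch, you can run the adjustment on historical data alone. The sketch below is a minimal version under stated assumptions: `cuped_adjust` is a hypothetical helper, the two lists are made-up per-user values (an earlier period as the covariate, a later period as the metric), and a real analysis would estimate the coefficient on pooled data from all arms using a covariate measured strictly before exposure.

```python
from statistics import mean, variance


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """Subtract the part of each user's metric explained by pre-period behavior.

    Assumes pre_metric is not constant (its variance must be non-zero).
    """
    pre_mean = mean(pre_metric)
    metric_mean = mean(metric)
    # Sample covariance between the pre-period covariate and the metric.
    cov = sum((x - pre_mean) * (y - metric_mean)
              for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov / variance(pre_metric)
    return [y - theta * (x - pre_mean) for x, y in zip(pre_metric, metric)]


# Made-up per-user values for illustration only.
pre = [3.0, 5.0, 2.0, 8.0, 6.0, 4.0]
post = [4.0, 6.5, 2.5, 9.0, 6.0, 5.5]
adjusted = cuped_adjust(post, pre)
remaining = variance(adjusted) / variance(post)
print(f"Variance remaining after CUPED: {remaining:.0%}")
```

Required sample size is proportional to metric variance, so the remaining-variance fraction is also a first estimate of the fraction of traffic you would still need. The adjustment leaves the metric's mean unchanged, which is why it reduces noise without biasing the measured lift.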

