
Switchback Experiments vs A/B Tests

A/B tests fail in two ways.

The first kind is fixable: bad sample sizes, early peeking, broken event tracking. The second kind isn't fixable at all — because the test design itself is wrong for the system being tested. If you're building on top of a two-sided marketplace, a shared dispatch system, an ad auction, or any infrastructure where users compete for the same underlying resource, a standard A/B test will give you confident, wrong answers. Not noisy answers. Wrong ones.

This article is for engineers, data teams, and PMs who work on systems where users aren't isolated from each other — logistics platforms, rideshare apps, AdTech infrastructure, ML ranking systems, or anything with shared supply pools.

If that's your context, you've likely seen A/B test results that didn't hold up after full rollout. This piece explains why that happens structurally, and what to do instead. Here's what we'll cover:

  • Why A/B tests break down in networked and marketplace systems — and why it's a bias problem, not a noise problem
  • How switchback experiments work, using time-based randomization instead of user-based splits
  • Where switchbacks are essential: delivery logistics, rideshare, ad auctions, and ML-driven systems
  • The key design tradeoffs in switchback experiments — period length, carryover effects, and statistical validity
  • A practical decision framework for choosing between a switchback experiment vs A/B test for your specific system

The article moves in that order — from the structural problem, to the method that solves it, to the real-world systems where it applies, to the design decisions that make it work, and finally to a set of diagnostic questions you can apply to your own system today.

Why traditional A/B tests break down in networked and marketplace systems

Most A/B testing failures are execution problems: insufficient sample size, peeking at results too early, misconfigured event tracking. These are fixable.

But there's a category of A/B test failure that isn't fixable through better execution — it's structural. In two-sided marketplaces, shared infrastructure systems, and real-time algorithmic environments, the foundational mathematical requirement for a valid A/B test is violated before the experiment even runs. The result isn't noisy data. It's systematically biased data that produces confident wrong answers.

The independence assumption is load-bearing, not optional

Every A/B test rests on a structural requirement known in causal inference literature as the Stable Unit Treatment Value Assumption, or SUTVA. In plain terms: the treatment applied to one user must have no effect on the outcomes of any other user. Groups must be genuinely independent.

GrowthBook's own documentation on A/B test fundamentals reflects this requirement directly — it defines valid experimentation as randomly splitting an audience into "persistent, independent groups" and tracking outcomes after exposure.

The word "persistent" is doing real work here. It assumes that users stay cleanly in their assigned group and that their behavior doesn't bleed into the other group's environment. When that assumption holds, the measured difference between groups is a valid estimate of the treatment effect. When it doesn't hold, the measurement itself is broken.

In a standard web product — a checkout flow, a landing page headline, a notification cadence — SUTVA usually holds. Users don't share a resource pool. Showing one user a new button design doesn't change what another user experiences. But in networked systems, this assumption collapses.

Shared resource pools: how driver supply and auction inventory break the independence assumption

The interference problem emerges whenever treatment and control groups draw from the same underlying resource.

Consider a rideshare platform testing a new dispatch algorithm. If 50% of drivers are assigned to the treatment condition, those drivers are still operating in the same geographic supply pool as drivers in the control condition. A treatment-group driver accepting a ride in a given neighborhood reduces supply availability for control-group riders in that same area. The two groups are not independent — they're competing for the same resource.

The same structure appears in ad auction systems. LinkedIn's engineering teams have documented this problem: when treatment-group advertisers use a new bidding algorithm, they participate in the same auction as control-group advertisers.

The algorithm changes clearing prices for everyone, including the control group. You haven't created two isolated experiments — you've run one distorted market and called it two.

DoorDash encountered the same dynamic with surge pricing experiments. Testing SOS pricing on a subset of orders affects driver availability across the entire pool, not just for the treatment segment. The control group's outcomes are contaminated by the treatment's effect on shared driver supply.

Why this produces bias, not just noise

This distinction matters more than it might initially seem. Noise produces uncertainty — wide confidence intervals, inconclusive results, the need for more data. Bias produces false certainty.

In supply-constrained marketplace systems, giving a better algorithm to the treatment group typically degrades the control group's experience by pulling shared resources toward treatment users. The measured treatment effect is therefore larger than the true treatment effect at full deployment.

The practical consequence: teams ship features that appear to win in A/B tests but underperform once rolled out to 100% of users. The "win" was partially an artifact of stealing resources from the control group. At full deployment, there's no control group left to steal from, and the effect shrinks or disappears.

GrowthBook's documentation on experimentation problems notes that teams sometimes experience "cognitive dissonance" when A/B test results don't match intuition — and respond by trusting their gut over the data.

In marketplace systems, this instinct is often correct. The results genuinely shouldn't be trusted, not because the team ran the test poorly, but because the test design was structurally invalid for the system being tested.

The interference problem isn't a reason to abandon experimentation. It's a reason to use a different experimental design — one that doesn't require independent groups in the first place.

Switchback experiments randomize across time, not users — and that difference is structural

The core insight behind switchback experiments is a reframe of the randomization problem itself. Instead of asking "which users get treatment A versus treatment B?", a switchback experiment asks "during which time periods does the entire system run treatment A versus treatment B?"

That shift in axis — from population-space to time-space — is what makes switchbacks structurally different from A/B tests, not just operationally different.

Replacing population splits with time slices

In a standard A/B test, treatment and control exist simultaneously. Half your users see the new pricing algorithm right now while the other half sees the old one.

In a switchback experiment, there is no simultaneous split. At any given moment, the entire network is in a single treatment state. As Statsig describes it, the experiment works by "switching between test and control treatments based on time, rather than randomly splitting the population" — and at any given time, everyone in the same network receives the same treatment.

This design eliminates cross-group contamination by construction. There is no control group to contaminate because there is no control group running at the same time.

The comparison happens across time periods, not across user segments. The interference problem that makes A/B tests unreliable in networked systems simply does not arise, because the two treatment states never coexist.

The experimental unit is a time period, not a user

This reframe changes what counts as an experimental unit. In an A/B test, the unit is a user (or session, or device). In a switchback experiment, the unit is a time period — and sometimes a geographic region paired with a time period.

The Nextmv delivery driver assignment example makes this concrete. When a dispatch algorithm assigns an order to a driver, you cannot assign the same order to two different drivers as a test for comparison. The same driver must complete the order they picked up.

As Nextmv puts it directly: "there isn't a way to isolate treatment and control within a single unit of time" using a traditional A/B framework. The only tractable solution is to randomize across time — run the candidate model during some time windows and the baseline model during others, then compare outcomes across those windows. The time window becomes the unit of analysis.
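To make the time-window-as-unit idea concrete, here is a minimal sketch of how treatment assignment works when the unit is a time slice. All the names (`treatment_for`, `assignments`, the one-hour period) are illustrative assumptions, not any vendor's API: the point is that every request arriving in the same window resolves to the same model.

```python
from datetime import datetime, timedelta

PERIOD = timedelta(hours=1)  # assumed window length for this sketch

# Pre-generated assignment: window index -> model name (illustrative).
assignments = ["baseline", "candidate", "candidate",
               "baseline", "candidate", "baseline"]

def treatment_for(ts: datetime, start: datetime) -> str:
    """All traffic in the same window runs under one model."""
    window = int((ts - start) // PERIOD)
    return assignments[window]

start = datetime(2024, 1, 1, 0, 0)
# Two orders arriving in the same hour get the same model:
treatment_for(datetime(2024, 1, 1, 1, 5), start)   # window 1
treatment_for(datetime(2024, 1, 1, 1, 55), start)  # window 1
```

Because assignment is a pure function of time, there is never a moment when two models serve the same pool of orders, which is exactly the property the A/B split cannot provide here.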

Period length: the foundational design decision

Once you accept that time periods are your experimental units, the most consequential design decision is how long each period should be. Statsig identifies the time interval as the minimum requirement for setting up a switchback experiment, and the choice is hard because it involves a genuine tradeoff.

Periods that are too short create carryover problems: the system hasn't had enough time to stabilize under the new treatment before you flip back. A rideshare marketplace running a new driver incentive for ten minutes hasn't given drivers time to change their behavior in response to it, so the measured effect reflects a transient state, not a steady-state outcome.

Periods that are too long introduce temporal confounders: if one treatment runs primarily on weekday mornings and another runs primarily on weekend evenings, you're no longer comparing treatments — you're comparing time-of-day demand patterns. The period length has to be long enough for the system to reach a meaningful steady state, but short enough that the switching schedule samples evenly across the natural cycles in your data.

Treatment sequence randomization: preventing order effects

The sequence in which treatment and control periods are assigned must itself be randomized. If treatment always runs first and control always runs second, any time trend in your underlying metrics — a gradual increase in demand, a seasonal pattern, a product launch happening mid-experiment — will be systematically attributed to treatment.

Nextmv's implementation addresses this directly: the platform "randomly assigns units of production runs to each model," applying randomization to the time slots rather than to users.

This is the same logic as randomization in any experiment, applied one level up. Just as you randomize which users get treatment to prevent selection bias, you randomize which time slots get treatment to prevent temporal bias. Without this step, a switchback experiment can produce results that are just as misleading as the A/B tests it was designed to replace.
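One common way to get both randomization and balance is to shuffle treatments within small blocks of consecutive slots, so each treatment appears equally often overall but the order within each block is random. This is a generic sketch of that idea, not Nextmv's implementation; the function name and block size are assumptions.

```python
import random

def randomized_schedule(n_slots: int, seed: int = 7) -> list[str]:
    """Assign treatments to time slots in shuffled pairs: each
    consecutive block of two slots contains one A and one B, so the
    schedule is balanced overall but randomly ordered within blocks."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_slots // 2):
        block = ["A", "B"]
        rng.shuffle(block)
        schedule.extend(block)
    return schedule

sched = randomized_schedule(12)
# Equal counts of A and B by construction; order varies block to block.
```

Blocked randomization like this prevents the pathological case where a purely random draw happens to put most treatment slots in, say, weekday mornings.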

Where switchback experiments are essential: marketplaces, decision models, and adaptive infrastructure

If you work in logistics, ride-hailing, AdTech, or any system built around shared resources and real-time algorithmic decisions, the question isn't whether switchback experiments apply to your work. It's whether you've been running A/B tests on systems that structurally can't support them. The examples below aren't edge cases — they're the default operating conditions for a wide class of production systems.

Singular-unit systems: when one order can only have one driver

The clearest structural argument for switchback experiments comes from delivery logistics. Nextmv, which builds optimization tooling for operational decision models, frames the problem precisely: "you cannot assign the same order to two different drivers as a test for comparison. The same driver must deliver the order that they picked up, so a traditional A/B test would not be effective as there isn't a way to isolate treatment and control within a single unit of time."

This isn't a matter of statistical inconvenience. The experimental unit — a single order matched to a single driver — is operationally indivisible. There is no treatment group and control group. There is only one assignment decision, made once, in real time.

Switchback experiments resolve this by shifting the experimental unit from the individual transaction to a window of time (or a time-and-location combination). The routing model applied during a given window handles all orders within that window.

The system alternates between models across windows, and the measured difference in outcomes reflects the true effect of one model versus the other. The same structural constraint applies to fleet dispatch, warehouse slotting, last-mile delivery, and any system where a shared resource serves requests sequentially rather than in parallel.

Two-sided marketplaces: shared supply pools and rideshare contamination

The rideshare case is the canonical two-sided marketplace example, and Statsig describes the interference mechanism directly: all riders in a given area share the same pool of available drivers.

A test that increases booking probability for the treatment group — say, a discount code — draws down driver supply for everyone, including the control group. The control group's metrics are now artificially depressed not because the treatment failed, but because the experiment itself created a resource scarcity that wouldn't exist in production.

As Statsig puts it: "Since the test and control groups are not independent, a simple A/B test will produce inaccurate results." The bias here isn't random noise that washes out with a larger sample. It's directional and systematic — the treatment group looks better than it is, and the control group looks worse. Running the experiment longer doesn't fix it; it compounds it.

Switchbacks eliminate this contamination by ensuring the entire market operates under a single treatment condition at any given time. There's no cross-group interference because there are no simultaneous groups.

Adaptive algorithmic systems: AdTech, auction mechanics, and ML ranking

Programmatic advertising surfaces the same interference problem in a different form. When a new bidding algorithm is tested via user-level A/B split, the treatment group's more aggressive bids win impressions — but they win them by displacing the control group within the same shared inventory pool. The measured lift is real in a narrow sense, but the underlying dynamic is displacement, not creation of value.

The same dynamic plays out across campaign pacing engines, ML-driven ranking models, and retail media platforms with constrained inventory. Any system where treatment and control groups compete for the same finite resource — ad slots, search positions, recommendation surface area — will produce interference by design when split at the user level.

Switchback experiments allow the full system to operate under one algorithm at a time, so measured differences between periods reflect genuine treatment effects rather than competitive displacement within the experiment.

This is why switchback testing is gaining traction at sophisticated platforms for validating changes in auctions, pacing, ranking, and ML-driven decisions.

The pattern across all three of these domains is the same: when the system is singular, shared, or adaptive, the independence assumption that A/B testing requires simply doesn't hold. Switchbacks aren't a workaround — they're the correct design.

Key design tradeoffs in switchback experiments: period length, carryover effects, and statistical validity

Switchback experiments solve the interference problem that makes A/B tests unreliable in networked systems — but they don't solve it for free. By applying both treatments to the same experimental unit over time, switchbacks introduce a new primary challenge: carryover effects. Understanding this tradeoff is what separates a well-designed switchback from one that produces results you can't trust.

Carryover effects: what you're trading interference for

When you switch a system from treatment A to treatment B, the system doesn't instantly reset. Algorithmic state, queued decisions, user behavior patterns, and downstream feedback loops all carry residual influence from the previous treatment period into the next one. This is carryover — and it's the direct consequence of the design choice that makes switchbacks work in the first place.

As ibojinov.com frames it explicitly: switchback experiments transform the problem of interference into one of carryover effects. That's not a flaw in the method — it's the fundamental tradeoff.

Interference in A/B tests produces systematic bias that's nearly impossible to detect or correct after the fact. Carryover in switchbacks is a known, manageable problem that good design can control. The entire design process is an exercise in that management.

Period length calibration: the central design decision

Period length — how long each treatment runs before switching — is the primary lever for controlling carryover, and it pulls in two directions simultaneously.

If periods are too short, the system hasn't stabilized under the new treatment before you switch again. What you're measuring isn't the steady-state effect of treatment B; it's the noise of the transition itself. A rideshare pricing algorithm might reach equilibrium in minutes, but a recommendation engine that influences downstream engagement may need hours before its effects are observable in outcome metrics.

If periods are too long, you introduce a different problem: temporal confounders. The longer a single treatment period runs, the more likely it is that time-of-day patterns, day-of-week traffic shifts, or external events are driving outcome differences rather than the treatment itself.

The schedule ibojinov.com uses illustrates what a short-period design looks like in practice: Hour 1: A | Hour 2: B | Hour 3: B | Hour 4: A | Hour 5: B | Hour 6: A. This kind of schedule works for systems with fast feedback loops.

The right period length for your system depends on how quickly your system reaches a new steady state after a treatment change — which is something you need to understand before you can design a valid experiment.
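A common mitigation for carryover, alongside choosing longer periods, is to discard a burn-in window at the start of each period before computing period-level metrics, so transition noise never enters the analysis. The sketch below is illustrative (the function name, data shape, and burn-in length are assumptions).

```python
def period_means(events, period_len, burn_in):
    """events: list of (t_seconds, metric) tuples, t measured from the
    experiment start. Drops the first `burn_in` seconds of each period
    so transition noise doesn't contaminate the period mean."""
    by_period = {}
    for t, value in events:
        if (t % period_len) < burn_in:
            continue  # still equilibrating after the last switch
        by_period.setdefault(int(t // period_len), []).append(value)
    return {p: sum(v) / len(v) for p, v in by_period.items()}

# Period length 3600s with a 600s burn-in: an event at t=100 is dropped,
# while one at t=700 counts toward period 0.
```

The cost of the burn-in is data: the longer your system takes to settle, the more of each period you throw away, which is another reason equilibration time drives period length.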

Controlling for temporal confounders

Even with well-calibrated period lengths, temporal confounders remain a real validity threat. GrowthBook's documentation on experiment duration makes this point directly in the context of standard A/B tests: if you start a test on Friday and end it Monday, you may not capture a representative picture of weekday traffic patterns, and the results will reflect that gap. The same principle applies with more force to switchback design.

The mitigation is even temporal sampling: treatment A and treatment B each need to run across equivalent distributions of time slots — mornings and evenings, weekdays and weekends, peak and off-peak periods.

This is where treatment sequence randomization matters. Randomly assigning which treatment runs in which period, rather than following a fixed alternating pattern, is what prevents systematic time-of-day bias from accumulating across the experiment.

Why standard A/B analysis tools produce wrong numbers on switchback data

The number of switches determines statistical power: more periods mean more observations, which increases your ability to detect real effects. But more switches also mean more transitions, and each transition is a window where carryover noise can contaminate your measurements — especially if period lengths are short.

The deeper issue is that standard A/B test analysis tools can't be applied directly to switchback data. In plain terms: the data points in a switchback experiment are not independent of each other the way individual users in an A/B test are.

What happened in period 3 influences what you observe in period 4, because the system carries state across time. Observations within the same experimental unit across time therefore have a temporal autocorrelation structure that violates the assumptions built into most significance testing frameworks — a standard t-test will give you numbers, but those numbers won't mean what you think they mean.

The same rigor that governs good A/B test design — ensuring adequate sample sizes, covering representative time windows, avoiding truncated observation periods — is the foundation you need before extending to switchback experiments. The methods differ, but the underlying discipline of controlling for what you can't randomize away is identical.
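One standard way to sidestep the autocorrelation problem is to aggregate before testing: collapse each period to a single summary statistic and treat the periods, not the raw events, as the observations. This is a sketch of that idea under a Welch-style two-sample setup, with hypothetical function and variable names; real analyses may instead use regression with period effects or a block bootstrap.

```python
from statistics import mean, stdev
from math import sqrt

def period_level_diff(a_periods, b_periods):
    """Compare treatments using one aggregate per period rather than raw
    events: within-period autocorrelation is absorbed into the period
    mean, so the periods themselves are the (approximately) independent
    observations. Returns the difference in means and its standard
    error under a Welch-style two-sample setup."""
    diff = mean(b_periods) - mean(a_periods)
    se = sqrt(stdev(a_periods) ** 2 / len(a_periods)
              + stdev(b_periods) ** 2 / len(b_periods))
    return diff, se

# a_periods / b_periods hold per-period metric means, e.g. average
# delivery time for every period the baseline (A) or candidate (B) ran.
```

Note how this reframes the power question from the article: your effective sample size is the number of periods, not the number of users or events, which is why the number of switches matters so much.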

Switchback experiment vs. A/B test: three structural conditions that make the choice for you

The choice between a switchback experiment and an A/B test is not a matter of sophistication or preference. It is determined by a single structural question: can the experimental units be made independent of each other?

If yes, A/B is valid. If no, switchback is required. Everything else in this decision follows from that.

When A/B tests are valid

A valid A/B test has a precise structural requirement: you must be able to randomly assign your audience into persistent, non-interacting groups. GrowthBook's A/B testing fundamentals capture this in plain terms — the test anatomy requires that you "randomly split your audience into persistent groups."

That phrase does a lot of work. It presupposes that groups can be made distinct, that assignment is stable over the test window, and that what happens to users in one group does not affect outcomes for users in the other.

When those conditions hold, A/B testing is the right tool. UI changes, copy variations, onboarding flow experiments, pricing page layouts — any change where one user's experience is structurally isolated from every other user's outcome.

Standard experimentation platforms support flexible randomization units — user, session, location, postal code, URL path — precisely because these are all cases where clean isolation is achievable without interference. When you can segment the affected population and contain the treatment within that segment, A/B produces trustworthy results.

The validity boundary is also worth stating explicitly. GrowthBook's experiment guidance notes that including users who cannot actually see the experiment "would increase the noise and reduce the ability to detect an effect." The corollary is that including users whose outcomes are structurally entangled with each other doesn't just add noise — it introduces systematic bias. That's a different problem entirely.

When switchbacks are required

Three structural conditions break the independence assumption and make A/B testing not just imprecise but actively misleading.

The first is a shared resource pool — the rideshare driver supply problem described earlier. When treatment and control groups draw from the same underlying resource, the groups are not independent regardless of how cleanly you assigned them.

Auction systems introduce a related problem: cross-group treatment spillover. In ad auction systems, running two bidding strategies simultaneously means both strategies compete in the same auctions. As ibojinov.com puts it, this creates "marketplace interference that is incredibly difficult to analyze." The strategies interact at the auction level regardless of which users are assigned to which group.

The most decisive condition is a singular experimental unit. When the unit of experimentation is a city, a seller account, a server, or a shared data pipeline, there is no population to split.

ibojinov.com states this directly: "the unit of experimentation — the city, the seller's account, the server — is singular." You cannot divide a city into treatment and control halves and expect riders and drivers to respect that boundary. The unit is indivisible, so user-level randomization is structurally impossible.

Statsig frames the consequence clearly: when test and control groups are not independent, "a simple A/B test will produce inaccurate results." Not noisy results — inaccurate ones. The bias is systematic, not random, which means running more traffic or extending the test duration will not fix it.

Three questions that determine whether your system can support a valid A/B test

Apply these questions to your system before choosing a method.

First: can your experimental units be made independent? If treatment applied to one unit can affect the outcomes of another unit — through shared supply, pricing dynamics, auction competition, or any other coupling mechanism — independence fails and switchback is required.

The second question: is your experimental unit singular or shared? If the entity you're testing on is a marketplace, a city, a server, or any resource that cannot be partitioned without creating interaction effects, you cannot run a valid A/B test. Switchback is the only structurally sound option.

The final question: does your system involve a shared resource pool? If users in different treatment groups compete for or draw from the same underlying inventory — drivers, ad slots, fulfillment capacity — their outcomes are entangled regardless of how you assign them.

If all three answers are no, A/B is valid and standard experimentation tooling is the appropriate choice. If any answer is yes, you are operating in territory where user-level randomization produces systematically biased estimates, and time-based switching is not an alternative approach — it is the correct one.
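The three-question framework can be written down as a trivial predicate, which is sometimes useful as a checklist embedded in an experiment-design doc. This is purely illustrative; the function name and argument names are mine, not from any tool.

```python
def needs_switchback(shared_resource_pool: bool,
                     cross_group_spillover: bool,
                     singular_unit: bool) -> bool:
    """The three diagnostic questions as a predicate: any 'yes' means
    user-level randomization is structurally invalid and a switchback
    design is required."""
    return shared_resource_pool or cross_group_spillover or singular_unit

needs_switchback(False, False, False)  # isolated UI test: A/B is valid
needs_switchback(True, False, False)   # shared driver supply: switchback
```

The logic is deliberately an OR, not a weighted score: a single structural violation is enough to bias a user-level split, no matter how clean the rest of the design is.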

Choosing between switchback and A/B testing: what to do if your last rollout didn't hold

Your experimental design has to match your system's structure, not the other way around

The whole argument of this article reduces to one sentence: your experimental design has to match the structure of your system, not the other way around. If your users share a resource pool, compete in the same auction, or can't be partitioned without creating interference, a user-level A/B split doesn't produce noisy results — it produces wrong ones. Switchbacks aren't a more sophisticated version of A/B testing. They're a structurally different tool for a structurally different problem.

Auditing your current experiments for interference risk: start with the rollouts that didn't hold

The fastest way to assess your exposure is to look at your last three A/B test results that didn't hold up after full rollout. If the treatment effect shrank or disappeared at 100% deployment, that's the fingerprint of resource contamination — the "win" was partly borrowed from the control group.

Ask whether the users in your experiment shared any underlying resource: driver supply, ad inventory, fulfillment capacity, auction clearing prices. If they did, the independence assumption was violated before the experiment started, and no amount of statistical rigor would have saved it.

Before picking a tool, determine your system's equilibration time

The design work for a switchback experiment starts with one question you have to answer honestly: how long does your system take to reach a new steady state after a treatment change? That determines your minimum period length, which determines everything else.

Before you think about statistical analysis — which requires accounting for temporal autocorrelation, not a standard t-test — get that number right. If your system has fast feedback loops, you can run short periods and accumulate many switches. If it's slow to stabilize, you need longer periods and will have fewer experimental units to work with.
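One rough way to estimate equilibration time empirically is to flip the treatment once, record the metric over time, and measure how long it takes to stay near its eventual steady-state value. The sketch below is a simple settling-time heuristic with assumed names and thresholds, not a rigorous method; real systems may need change-point detection or domain knowledge instead.

```python
def time_to_settle(trace, tolerance=0.05, window=3):
    """trace: metric samples (in order) taken after a treatment switch.
    Returns the index at which the metric first stays within `tolerance`
    (relative) of its final value for `window` consecutive samples --
    a rough equilibration estimate. Returns None if it never settles."""
    final = trace[-1]
    run = 0
    for i, v in enumerate(trace):
        if abs(v - final) <= tolerance * abs(final):
            run += 1
            if run >= window:
                return i - window + 1
        else:
            run = 0
    return None

# A metric that overshoots after a switch, then converges:
time_to_settle([10.0, 6.0, 5.2, 5.05, 5.0, 5.0, 5.0])  # -> 2
```

Whatever estimate you land on, your minimum period length should comfortably exceed it; the burn-in you discard at the start of each period should cover it too.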

For systems where user-level isolation is achievable, standard A/B experimentation platforms handle the standard case well, with flexible randomization units and statistical guardrails. But if your three diagnostic questions point toward interference, the right starting point is understanding your system's equilibration time — not picking a tool.

This article was written to give you a clear mental model for a problem that's easy to misdiagnose. Hopefully it saves you from shipping a few false wins.

What to do next: Run the three diagnostic questions against your most recent experiment. If any answer is yes, pull up the last rollout where your A/B results didn't hold — there's a good chance the interference mechanism described here explains the gap. That's your starting point for making the case internally that, for your system, a switchback isn't an optional alternative. It's the correct design.
