The Uplift Blog
Chasing Velocity in A/B Testing: Why More Experiments Can Mean Less Learning
Experiment velocity is a useful diagnostic, not a goal. The moment you make it a KPI, teams start optimizing for the count instead of the learning, and the program quietly drifts toward trivial tests that answer trivial questions. The healthier target is the rate at which you learn, which usually requires a portfolio of easy, medium, and hard experiments.


12 Common Feature Flag Mistakes to Avoid
Feature flags almost always start as a matter of convenience. Maybe you’d like to control a feature’s release, or reduce the risk associated with every deployment. But the ones you ship today keep running tomorrow, and the ones you ship next quarter run alongside them.
Next thing you know, you’re operating a distributed system inside your distributed system—one with its own state, its own consistency problems, and its own failure modes.
The trouble is that most of these failures are quiet, and you won’t notice them until they’ve done real damage to your infrastructure.
In this article, we’ll walk you through 12 feature flagging mistakes and how you can avoid them irrespective of your team size and usage.
Why feature flags fail in real-world systems
Like any other tool, feature flags are prone to incidents because of how they’re used in the first place. If you look at public postmortems from engineering teams that run thousands of flags, the damage traces back to how a flag was created or managed.
Now think about how you treat a feature flag compared with the rest of your production infrastructure. It gets a pull request, a Slack emoji, and a long, quiet life nobody tracks. That asymmetry is where things break down. At 10 flags, you carry the full picture in your head. At 100, you rely on a spreadsheet and good intentions. At 1,000, ownership becomes tribal knowledge, and stale flags pile up faster than anyone can clean them.
It’s also why Uber built Piranha, an automated refactoring tool that has been used to remove roughly 2,000 stale flags. Its teams realized that manual cleanup processes could never keep up with the pace at which flags were created.
You don’t know what you don’t know. Incidents also happen because you’re not sure what problems flags can create in the first place. Unless you know the pitfalls, it’s hard to put the right governance measures in place to prevent them.
12 common feature flag mistakes that reduce their efficacy
Here are some of the most common mistakes engineering teams make while using feature flags. These mistakes fall into three broad categories:
- Implementation mistakes: These issues live in your code and are introduced when you create the flags themselves. They usually stay invisible until something breaks in production.
- Operational mistakes: These are process gaps that widen over time and turn manageable flag counts into unmanageable debt.
- Strategic mistakes: These are the larger missed opportunities: the ways your flag practice could generate more value but doesn't, because nobody designed for it.
Implementation mistakes with feature flagging
1. Reusing feature flags
You shipped a flag six months ago. The feature is live, everyone’s happy, and the flag name is just sitting there in the codebase. So when you’re adding a toggle to a new feature, you might think about reusing the name. But that’s a bad idea.
Knight Capital learned this in the most expensive way possible. In 2012, the firm repurposed a flag that had previously activated an obsolete trading function. The new code never reached one of its servers, so on that server the flag activated the old code path instead of the new one, and within 45 minutes, the firm lost $460 million.
How to avoid it: Treat flag names as immutable. Once you’ve deployed a flag, its name is taken for good: when you retire the flag, retire the name with it. If you’re using a feature flagging platform like GrowthBook, it enforces this with regex-based naming validation that catches duplications before they reach production.
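One way to make "names are immutable" concrete is a registry that remembers every name ever used, including retired ones. The sketch below is a minimal in-memory illustration, not how any particular platform implements it; `FlagRegistry` and its methods are hypothetical names.

```javascript
// Hypothetical in-memory registry: a flag name can be created once, ever.
class FlagRegistry {
  constructor() {
    this.active = new Set();
    this.retired = new Set();
  }

  create(name) {
    if (this.active.has(name) || this.retired.has(name)) {
      throw new Error(`Flag name "${name}" has already been used`);
    }
    this.active.add(name);
  }

  retire(name) {
    if (!this.active.delete(name)) {
      throw new Error(`Flag "${name}" is not active`);
    }
    this.retired.add(name); // the name stays reserved after retirement
  }
}

const registry = new FlagRegistry();
registry.create("checkout-redesign-release");
registry.retire("checkout-redesign-release");

let reuseRejected = false;
try {
  registry.create("checkout-redesign-release");
} catch (err) {
  reuseRejected = true;
}
console.log(reuseRejected); // true
```

In practice the set of retired names would live wherever your flag definitions are stored, so the check survives restarts and applies to every team.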
2. Using client-side flags for security
You use a feature flag to gate access to premium features or admin functionality on the client side. But the problem is that client-side flags are visible to users. Anyone with browser DevTools can:
- Inspect the SDK payload and see every flag and its rules
- Figure out exactly what’s being gated
- Modify the local flag state
- Call your API directly to bypass it
Feature flags control visibility, not access. They decide what users see—not what they’re authorized to do.
How to avoid it: Keep your authorization logic server-side. Use feature flags for UI presentation but enforce actual access control through your backend. For an additional layer of protection, GrowthBook supports encrypted SDK payloads that obfuscate client-side flag configurations, which makes it much harder to reverse-engineer your flag rules.
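The split between visibility and access can be sketched as a server-side handler in which the flag and the entitlement check are separate decisions. Everything here is illustrative: `isEntitled`, `handleExportRequest`, the `premium-export` flag, and the shape of the user object are assumptions, not part of any SDK.

```javascript
// Hypothetical server-side handler: the flag decides rollout, the entitlement
// check decides access.
function isEntitled(user, feature) {
  return Array.isArray(user.entitlements) && user.entitlements.includes(feature);
}

function handleExportRequest(user, flags) {
  // The flag only says whether the feature is rolled out at all.
  if (!flags["premium-export"]) {
    return { status: 404 };
  }
  // Authorization is checked on the server, whatever the client's flag state says.
  if (!isEntitled(user, "premium-export")) {
    return { status: 403 };
  }
  return { status: 200 };
}

const flags = { "premium-export": true };
const freeUser = { id: "u1", entitlements: [] };
const paidUser = { id: "u2", entitlements: ["premium-export"] };

console.log(handleExportRequest(freeUser, flags).status); // 403
console.log(handleExportRequest(paidUser, flags).status); // 200
```

Even if a user flips the flag locally in the browser, the request still fails the entitlement check on the server.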

3. Not testing all flag states
Your CI/CD pipeline tests your application with the current production flag configuration. But does it test what happens when the new flag is turned on or off? If you’re only testing one state, you’re assuming the other works, and that assumption may not always hold up in reality.
That’s how Slack ended up with a 6-hour outage back in 2020. When its team rolled out a feature flag, it triggered a performance bug. Even though they caught the bug and rolled back within 3 minutes, the rollback left a stale HAProxy state that caused the outage.
How to avoid it: For every flag you deploy, test the current production state, the new state you’re rolling out, and the rollback state. GrowthBook offers a simulation tool that lets you see how different rules impact what users see, so you can check a flag in different states.
That said, it’s a simulation of what users see, not of how the flag will behave, so you still need to run an actual experiment for that purpose.
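Testing every state is easier when the flag value is an explicit input to the code under test. The sketch below loops the same checks over both states of a boolean flag; `checkoutTotal` and its pricing rules are hypothetical and only serve as something to test.

```javascript
// Hypothetical function under test: the flag value is passed in explicitly so
// each state can be exercised without touching production configuration.
function checkoutTotal(cart, useNewPricing) {
  const subtotal = cart.reduce((sum, item) => sum + item.price * item.qty, 0);
  // New path: free shipping from 50 upward; old path: flat shipping of 5.
  const shipping = useNewPricing ? (subtotal >= 50 ? 0 : 5) : 5;
  return subtotal + shipping;
}

const cart = [{ price: 30, qty: 2 }];

// Run the same checks for every state the flag can be in,
// including the state you would roll back to.
const totals = {};
for (const flagState of [true, false]) {
  const total = checkoutTotal(cart, flagState);
  if (!Number.isFinite(total) || total < 60) {
    throw new Error(`Unexpected total with flag=${flagState}: ${total}`);
  }
  totals[flagState ? "on" : "off"] = total;
}
console.log(totals.on, totals.off); // 60 65
```

The same loop extends to multivariate flags by listing every variation instead of `[true, false]`.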

4. Overloading a single flag with too much logic
Let’s say you created a flag called new-dashboard. It was meant as a toggle for the new UI. But over time, your product team asked you to also display the new analytics panel if the user is in the Enterprise tier. Now the flag controls two behaviors and you can’t change one without risking the other.
Even if you have 10 flags with clean boolean logic, you’ve already created 1,024 (2^10) possible combinations of code paths. Overloading any of those flags with extra logic multiplies the paths further and makes them harder to test.
How to avoid it: Apply the same single-responsibility principle you’d use for any function or class. If a feature requires multiple independent toggles, create separate flags and use prerequisite flags to define their dependencies.
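The arithmetic and the prerequisite idea can both be sketched in a few lines. The flag definitions and the `isOn` helper below are hypothetical and evaluated locally; they are not the API of any flagging SDK.

```javascript
// Number of on/off combinations grows as 2^n with n independent boolean flags.
const combinations = (n) => 2 ** n;
console.log(combinations(10)); // 1024

// Hypothetical flag definitions: each flag controls one behavior, and
// "prerequisite" names another flag that must be on first.
const flagDefs = {
  "new-dashboard": { enabled: true, prerequisite: null },
  "new-dashboard-analytics-panel": { enabled: true, prerequisite: "new-dashboard" },
};

function isOn(name, defs, seen = new Set()) {
  const def = defs[name];
  if (!def || !def.enabled) return false;
  if (seen.has(name)) return false; // guard against circular prerequisites
  seen.add(name);
  return def.prerequisite ? isOn(def.prerequisite, defs, seen) : true;
}

console.log(isOn("new-dashboard-analytics-panel", flagDefs)); // true

// Turning the parent off disables the dependent flag without editing it.
flagDefs["new-dashboard"].enabled = false;
console.log(isOn("new-dashboard-analytics-panel", flagDefs)); // false
```

Each flag keeps a single responsibility, and the dependency between them is declared in one place instead of being buried in conditional logic.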

Operational mistakes with feature flagging
5. Letting flags become zombie flags
A zombie flag is a flag that’s still in your codebase but no longer serves any useful purpose. It increases your technical debt and the problem doesn’t stop there. Every zombie flag adds a conditional branch that your team has to debug or manage in the future. That’s why you need the right governance measures in place to stop flags from accumulating in the first place.
How to avoid it: The simplest way is to define the flag type and the time it should be live. For example, if it’s a release toggle, set a calendar reminder or Jira ticket to clean it up after 30 days.
Or use a feature flagging platform that offers stale feature flag detection. For instance, GrowthBook identifies stale flags automatically, and when you use it with Code References, you’ll know where these flags are present—making cleanup easier.
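If you track flag type and creation date yourself, a stale-flag check can be a short script. The sketch below is an assumption-laden illustration: the metadata shape and `findStaleFlags` are hypothetical, the 30-day limit for release toggles follows the example above, and the other limits are made up.

```javascript
// Hypothetical per-type age limits, in days.
const MAX_AGE_DAYS = { release: 30, experiment: 60, killswitch: Infinity };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the names of flags that have outlived the limit for their type.
function findStaleFlags(flagList, now) {
  return flagList
    .filter((flag) => {
      const ageDays = (now - flag.createdAt) / MS_PER_DAY;
      return ageDays > MAX_AGE_DAYS[flag.type];
    })
    .map((flag) => flag.name);
}

const now = new Date("2024-06-01T00:00:00Z");
const flagList = [
  { name: "checkout-redesign-release", type: "release", createdAt: new Date("2024-03-01T00:00:00Z") },
  { name: "pricing-page-v2-experiment", type: "experiment", createdAt: new Date("2024-05-15T00:00:00Z") },
  { name: "eu-compliance-widget-killswitch", type: "killswitch", createdAt: new Date("2022-01-01T00:00:00Z") },
];

console.log(findStaleFlags(flagList, now)); // [ 'checkout-redesign-release' ]
```

Long-lived flags such as kill switches get an unlimited lifetime here, so the report only lists flags that were meant to be temporary.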

6. Poor naming conventions
Compare these two lists:
// Typical naming convention
gb.isOn("ff-123")
gb.isOn("test")
gb.isOn("experiment_2")
gb.isOn("new-thing")

// Self-documenting
gb.isOn("new-checkout-flow")
gb.isOn("holiday-2024-promo-banner")
gb.isOn("pricing-page-v2-experiment")
gb.isOn("premium-analytics-entitlement")

The first set is vague at best. If something goes wrong, you’ll spend too much time wading through your commit history and Slack threads to figure out what it means. The second one, however, tells you why the flag was made and for which rollout or experiment.
How to avoid it: Establish a naming convention early and enforce it. A good pattern includes the feature area, intent, and optionally a timestamp or version:
{feature-area}-{description}-{type}
So: checkout-redesign-release, pricing-page-v2-experiment, eu-compliance-widget-killswitch. GrowthBook lets you enforce naming patterns with regex validation. If you accidentally reuse an older flag’s name, it’ll reject it and force you to create another one.
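A naming pattern like this can be checked with a regular expression, for example in a CI step or pre-commit hook. The pattern below is one possible encoding and an assumption on our part: the list of allowed type suffixes is illustrative, and `isValidFlagName` is a hypothetical helper.

```javascript
// Illustrative pattern for {feature-area}-{description}-{type}.
// The list of allowed type suffixes is an assumption; adjust it to your own taxonomy.
const FLAG_NAME_PATTERN =
  /^[a-z0-9]+(?:-[a-z0-9]+)+-(?:release|experiment|killswitch|entitlement)$/;

function isValidFlagName(name) {
  return FLAG_NAME_PATTERN.test(name);
}

const candidates = [
  "checkout-redesign-release",       // valid
  "pricing-page-v2-experiment",      // valid
  "eu-compliance-widget-killswitch", // valid
  "ff-123",                          // invalid: no type suffix
  "test",                            // invalid: single segment
  "experiment_2",                    // invalid: underscore, no type suffix
];

for (const name of candidates) {
  console.log(name, isValidFlagName(name));
}
```

Note that this stricter pattern requires a type suffix, so a descriptive name without one, such as new-checkout-flow, would also be rejected. Whether to require the suffix is a team decision.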
7. No ownership or lifecycle management
If you don’t require your team to own a flag when they create or use it, you’ll end up with a codebase full of decisions nobody can explain. It makes the cleanup and auditing process almost impossible because no one has the necessary context for the flag’s purpose and usage.
Without it, you can’t answer basic questions:
- Who do you page when this flag behaves unexpectedly?
- Who decides when it’s safe to retire?
- Who’s accountable if it causes an incident?
How to avoid it: Assign an owner to every flag when it’s created. It has to be an individual, so there’s a clear line of accountability, and ownership should be handed over explicitly when that person moves out of the role and someone else steps in. GrowthBook supports flag-level ownership and project-based organization, so you can filter by owner and quickly see who’s responsible for what.
8. Ignoring rollback procedures
Most teams think about rollback as “just turn the flag off.” And for simple boolean flags, that might work. But you also need to remember that flags don’t exist in isolation. A rollout can trigger side effects that don’t reverse when you roll it back.
How to avoid it: For every flag rollout, document what happens beyond the flag itself. Ask:
- Does this rollout trigger any irreversible writes?
- Will caches, queues, or third-party integrations retain state from the rolled-out version?
- Does the rollback path need its own deployment, or is flipping the flag truly enough?
This is where testing comes into the picture. But also, you should have a way to gradually roll out features so that when you see an inkling of something problematic, you can roll back before the blast radius expands.
For instance, GrowthBook offers Feature Diagnostics, which lets you inspect how flags are actually evaluated in production. As a result, you can verify in one place what’s happening or has happened.

Strategic mistakes with feature flagging
9. Treating feature flags as a short-term tool only
Most teams adopt feature flags for one reason: safer releases. And that’s a perfectly good reason. But that’s never the end of it. If you only think of flags as temporary release wrappers, you never build the governance or engineering mindset you need to sustain them at scale.
Over time, you’ll end up with thousands of ad hoc flags that become a pain to manage or even clean up because nobody designed a system to handle them. That’s why you need to treat feature flags as a critical part of your code’s infrastructure.
How to avoid it: Build a governance system that acts like you’ll be managing 100 flags in a week, even if you’re not right now. GrowthBook gives you the scaffolding for this. Here are a few ways it does that:
- Enforces naming conventions through regex validation
- Supports flag- and project-level ownership
- Lets you schedule flags to roll out and back
- Lets you build approval workflows to control who can deploy flags
- Lets you create kill switches that work as long-lived flags

10. Lack of observability and metrics
Unless you have observability tied to your flags, you’re flying blind. Most engineering teams monitor infrastructure metrics like error rates but don’t connect the flag’s state to the product’s metrics. Let’s say there’s a 5% drop in a payment feature’s performance after a flag is turned on. Without that connection, you won’t notice it in real time.
How to avoid it: Tie your flags to the metrics that matter. Every flag rollout should have at least one success metric and one guardrail metric defined before you flip the switch. GrowthBook’s Safe Rollouts does this natively. You can select the guardrail metrics and the platform monitors behavior in real time.
You can even set the rollout cadence using Ramp Schedules where you can define the percentage thresholds and the platform handles the increments automatically. And because the platform is warehouse-native, it analyzes your metrics directly in your data warehouse—reducing latency.
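The idea of a guardrail-driven ramp can be sketched as a function that decides the next rollout percentage from a guardrail metric. This is a deliberately simplified illustration and not how Safe Rollouts or Ramp Schedules work internally: the steps, the 10% tolerance, and `nextRolloutStep` are all assumptions, and a real check would use a statistical test instead of a raw comparison.

```javascript
// Hypothetical ramp: percentages and the guardrail threshold are illustrative.
const RAMP_STEPS = [5, 25, 50, 100];

// Returns the next rollout percentage, or 0 to roll back when the guardrail
// metric in the flagged group is worse than control by more than the tolerance.
function nextRolloutStep(currentPercent, metrics, maxRelativeIncrease = 0.1) {
  const { controlErrorRate, treatmentErrorRate } = metrics;
  const allowed = controlErrorRate * (1 + maxRelativeIncrease);
  if (treatmentErrorRate > allowed) {
    return 0; // guardrail breached: roll back
  }
  const next = RAMP_STEPS.find((step) => step > currentPercent);
  return next === undefined ? currentPercent : next;
}

console.log(nextRolloutStep(5, { controlErrorRate: 0.02, treatmentErrorRate: 0.021 }));  // 25
console.log(nextRolloutStep(25, { controlErrorRate: 0.02, treatmentErrorRate: 0.03 }));  // 0
console.log(nextRolloutStep(100, { controlErrorRate: 0.02, treatmentErrorRate: 0.02 })); // 100
```

The point of the sketch is the ordering: the guardrail is evaluated before every increment, so a regression stops the ramp while exposure is still small.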

11. No segmentation or controlled rollouts
Gone are the days when Big Bang deployments were the only way to release a feature or system. You no longer have to wait with bated breath for every deployment, because controlled rollouts let you test a deployment with a small segment of users before rolling the feature out completely.
You start with 5% of traffic, monitor the results, and expand gradually. If something goes wrong, you’ve affected a fraction of your users instead of all of them.
How to avoid it: Use percentage-based rollouts as the default for every flag, and use targeting rules to control who sees the feature, not just how many.
GrowthBook supports attribute-based targeting with AND/OR logic, where you can segment by geography, subscription tier, device type, company ID, or any custom attribute you define. You can use that with Saved Groups for reusable audience segments and percentage rollouts with deterministic hashing, so the same user always gets the same experience.
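Deterministic hashing is what makes a percentage rollout sticky: hash the flag key and user ID into a stable bucket, then compare the bucket with the rollout percentage. The sketch below uses 32-bit FNV-1a purely as an illustrative hash; it is not a claim about any platform's exact bucketing scheme, and `bucketFor` and `inRollout` are hypothetical helpers.

```javascript
// 32-bit FNV-1a hash, used here only as an illustrative, stable hash function.
// Assumes ASCII identifiers; other characters are hashed per UTF-16 code unit.
function fnv1a32(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // convert to an unsigned 32-bit integer
}

// Maps a user to a stable bucket in [0, 100) for a given flag.
function bucketFor(flagKey, userId) {
  return (fnv1a32(`${flagKey}:${userId}`) % 10000) / 100;
}

function inRollout(flagKey, userId, percent) {
  return bucketFor(flagKey, userId) < percent;
}

const user = "user-42";
console.log(inRollout("new-checkout-flow", user, 50) === inRollout("new-checkout-flow", user, 50)); // true
console.log(inRollout("new-checkout-flow", user, 0));   // false
console.log(inRollout("new-checkout-flow", user, 100)); // true
```

Because the bucket depends only on the flag key and the user ID, a user who is included at 5% stays included as the percentage grows, and including the flag key keeps the same users from always landing in the early cohort of every rollout.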

12. Not connecting feature flags to experimentation
This problem stems from a lack of observability. You already have the delivery mechanism and the underlying infrastructure in place to experiment. But many engineering teams still don’t measure performance using this system.
They ship the feature, confirm it doesn’t break anything, and move on. But they never ask: “Did it actually improve anything?”
How to avoid it: When you create a flag for a new feature, ask whether it’s also a candidate for an experiment. If so, attach metrics to the flag and measure the difference between the old and new experiences. Many engineering teams call these “do no harm experiments”: you’re not running a full-blown experiment, but you’re attaching a few guardrail metrics to every rollout to see if a release affects something that matters.
Alternatively, if you’re already running experiments, start using feature flags to make the process easier.
GrowthBook makes this simpler by keeping feature flags and experiments on the same platform. You can turn any feature flag into an experiment with a few clicks by:
- Assigning users to control and variation groups
- Defining success metrics
- Using GrowthBook’s stats engine for analysis

How lack of proper feature flagging practices can break the system at scale
At 20 flags, these mistakes are minor inconveniences. But at 200 flags, they’re systemic risks.
When you start implementing feature flags at scale, they turn from a simple coding tool into a critical part of your infrastructure. At that level, even small mistakes can balloon into complicated (and expensive) incidents if you don’t manage them well.
Facebook’s 2021 outage is one example of this pattern. During routine maintenance, an engineer issued a command to assess backbone capacity via a functional global ops flag.
Unfortunately, it unintentionally severed all connections between Facebook’s data centers. Even though the internal audit system should’ve blocked the command, a bug allowed it through. The 6-hour outage resulted in an estimated $100 million in lost revenue and affected an estimated 3.5 billion users.
That’s why you need to implement these best practices carefully: the risk of incidents doesn’t scale linearly with the number of flags. If you’d like to learn how to build the right feature flagging infrastructure at scale, check out this guide.
Where most teams go wrong with feature flags
You don’t experience high-risk incidents because of feature flags themselves, but because of how you use them.
Ultimately, feature flags are a distributed system that’s constantly growing within your infrastructure. That’s why you need the same discipline you apply to any other production infrastructure. Measures like feature flag governance and observability are table stakes today—and the only way to prevent technical debt in the long run.
If you’re looking for a feature flagging platform that operationalizes this line of thinking, try out GrowthBook for free.
A Practitioner's Guide to Treatment Effects in Experimentation: ATE, CATE, ITT, LATE & ATT Explained
A practitioner's guide to the treatment effects hiding behind your experiment’s number — ATE, CATE, ITT, LATE, ATT — with the vocabulary for telling them apart.
What treatment effect is your experiment actually measuring?
Your junior data analyst reports from the free trial experiment: customers who activated the trial ordered 2.0 more times per month than those who didn't (+2.0).
Your senior data scientist investigates the same data and reports: the Average Treatment Effect of the trial offer is +0.9 orders per month per customer.
Same experiment, two numbers. Which one goes in the deck?
Both sound reasonable on the surface. Only one of them is a fair answer to a well-defined question. The other conflates the effect of the offer with the pre-existing differences between customers who engage with offers and customers who don't.
Every experiment looks like a single number on a dashboard, but the same data can be summarized in more than one way. The number you see depends on who you're averaging across, what you actually randomized, and whether you count the customers who ignored the treatment. This piece walks through those choices using a framework called potential outcomes, and gives you the vocabulary for asking which number actually answers your business question.
Two customers, two treatment effects
You run a food delivery platform. There's a subscription: pay a monthly fee, get free delivery on every order. It pays for itself at around two orders per month, but take-up is poor. Most customers don't find it, or they just don't want yet another subscription to manage. To enlist them, you run a 30-day free trial. The outcome you care about is each customer's monthly order frequency: the number of deliveries they place in a month.
Think about two customers in your sample. Adiya orders four or five times a month. She's a high-frequency customer who knows the app inside out; she'd probably try the subscription on her own eventually. Marco orders once or twice a month. He doesn't explore features beyond what he needs.
If the free trial landed in front of them, what would happen?
Adiya is already ordering a lot, and free delivery wouldn't meaningfully change her habits. Her frequency goes up by about +0.5. Marco is different. He'd never bother with the subscription on his own, but if the trial shows up in the app, he tries it, and discovers that free delivery feels better than he expected. He starts ordering more often. His frequency goes up by +1.5.
Same offer, very different effects.
Two potential outcomes per customer
Look carefully at what we just said about Adiya. Her frequency goes up by +0.5 "if the trial lands in her app." That's a statement about a hypothetical. It implies there are two versions of Adiya's purchasing behavior: one where she receives the trial offer, and one where she doesn't. Each version produces a different order frequency.
Adiya's two versions: about 4.5 orders per month without the offer, about 5 with it. Marco's two versions: about 1.5 without, about 3 with. The difference between the two versions is the effect of the trial on that customer.
These two versions are called potential outcomes. We label them Y(0) for "without the offer" and Y(1) for "with the offer." For each customer, Y(1) − Y(0) is the effect of the offer on them.
In the real world, each customer only lives one of their two versions. If Adiya is offered the trial, we observe her Y(1); her Y(0), the world where she wasn't offered, never happens. We can never compute her individual Y(1) − Y(0). This is the fundamental problem of causal inference (Holland, 1986): you never see both potential outcomes for the same customer, so you can never directly measure a causal effect for any individual.
What you can do, and what the rest of this post is about, is measure averages across many customers. Randomization is what makes those averages fair to compare.
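The notation is easy to make concrete with the two customers from the text. The sketch below uses their approximate numbers; the `observed` helper is a hypothetical name for the rule that assignment reveals exactly one of the two potential outcomes.

```javascript
// Potential outcomes from the text: Y(0) without the offer, Y(1) with it.
const customers = [
  { name: "Adiya", y0: 4.5, y1: 5.0 },
  { name: "Marco", y0: 1.5, y1: 3.0 },
];

// Individual effect: only computable in this hypothetical view where both
// potential outcomes are known.
const effects = customers.map((c) => ({ name: c.name, effect: c.y1 - c.y0 }));
console.log(effects);
// [ { name: 'Adiya', effect: 0.5 }, { name: 'Marco', effect: 1.5 } ]

// What an experiment reveals: one outcome per customer, depending on assignment.
function observed(customer, offered) {
  return offered ? customer.y1 : customer.y0;
}
console.log(observed(customers[0], true));  // 5
console.log(observed(customers[1], false)); // 1.5
```

Once `observed` has returned one value for a customer, the other potential outcome is gone for good, which is why the individual effects in `effects` can never be computed from real data.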
Picturing potential outcomes
Now let's visualize Adiya and Marco alongside 18 other customers.

Each dot is a customer. The x-axis is their Y(0), their frequency without the trial offer. The y-axis is Y(1), their frequency with the offer. A dot on the 45-degree line means the offer changes nothing for that customer. A dot above the line means the offer moves them to order more. The vertical distance from the line is that customer's treatment effect.
Adiya sits a little above the diagonal, +0.5. Marco sits noticeably higher, +1.5. Most other customers are in the same ballpark. A few sit right on the line because the offer did nothing for them.
This is the hypothetical view from the previous section: both potential outcomes for every customer, side by side. In the real world, you only ever see one outcome per customer.
What could this look like with a larger sample? The next figure shows the same hypothetical pairs of potential outcomes for 10,000 customers. The cloud tells you something the 20-customer version couldn't. Most customers sit above the 45-degree line, so the offer works for most of them. But the vertical distance varies wildly from customer to customer. Some customers float more than two orders above the diagonal, others sit right on it. The free trial's effect isn't a single number. It's almost as many different numbers as there are customers.

The dashed gray lines mark the averages. Without the offer, this population would order about two times per month. With it, almost three. That's the level we'll come back to.
Selection bias: mistaking intent for treatment effect
Before we get to randomization, look at what happens when you try to measure the effect the obvious way, without randomization.
A junior analyst looks at customers who were offered the trial and splits them into two groups: activators who took it up, and non-activators who didn't. He compares their orders. On average, activators placed 2.0 more orders per month than non-activators.

Activators are in purple; non-activators are in orange. The non-activators cluster along the 45-degree line because the offer didn't change their behavior; they never engaged with it. Many activators float above the diagonal, most of them noticeably higher.
But look at where each group sits on the x-axis. Non-activators are concentrated on the low end. These are the customers who would not order much regardless. Activators skew toward higher no-trial order rates. They are more engaged anyway.
The junior analyst's +2.0 isn't the effect of the trial offer. It's the effect of the offer on activators plus the baseline difference in order intent. He's conflating two things at once: the effect of the offer, and the fundamental difference in intent between customers who engage with offers and customers who don't. This is called selection bias, and in this analysis it's inflating the estimate by more than a factor of two.
Why randomization fixes the bias
Random assignment breaks the selection bias. Instead of splitting customers by a decision they made in response to the experiment, you split them before they've decided anything.
Offer the trial to a random half of your customers and hold it back from the other half. Now you can expect the groups to have the same baseline intent on average, because the only thing that decided who landed in which group was chance. As a result, you observe Y(0)'s in the control group and Y(1)'s in the treatment group for two comparable sets of customers.
Now the comparison is fair. Any difference in outcomes between the groups has to come from the offer itself, because the offer is the only thing the groups differ on.
The fundamental problem still applies to individuals: you'll never see both of Adiya's potential outcomes. Randomization solves the group version of it: treatment and control are balanced on everything except the offer, which means you can fairly attribute the difference in their averages to the trial offer.
The Average Treatment Effect (ATE)
Let's take a small slice to see the magic of randomization concretely. Here are ten customers from the experiment. For each we've drawn both potential outcomes and noted which variation they were randomly assigned to. In reality you only see the outcome under their assignment. The rest are "?".
The true average effect across these ten, if you could see everything, is +0.9 orders per month. In the experiment you only see one outcome per customer, but you can take the mean of what you do see in each variation. The observed treatment-group mean is 3.0; the observed control-group mean is 2.0. The difference is +1.0, close to the true effect. With such a small sample, luck can drive the estimate in either direction. But in expectation, they're the same.
Under random assignment, the difference in observed means is an estimate of the average of those individual effects. Therefore, it is called an estimate of the Average Treatment Effect (ATE).
What is the Average Treatment Effect (ATE)?
The ATE is the average of Y(1) − Y(0) across all customers in your experiment. You can never compute it directly because you never have both potential outcomes for any one customer. But under random assignment, the difference in observed group means is an unbiased estimator of the ATE.
The ATE is what a standard A/B test is designed to estimate. Multiply it by your customer base and you have the business case to deploy.
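The difference-in-means estimator can be sketched on a toy sample. The ten customers below are hypothetical values chosen for illustration, not the article's own table; they are constructed so that the true average effect is 0.9 and the observed group means are 3.0 and 2.0.

```javascript
// Ten hypothetical customers (illustrative values, not the article's table).
// y0/y1 are both potential outcomes; "offered" is the random assignment.
const sample = [
  { y0: 4.5, y1: 5.0, offered: true },
  { y0: 1.5, y1: 3.0, offered: true },
  { y0: 2.0, y1: 2.0, offered: false },
  { y0: 1.0, y1: 2.0, offered: true },
  { y0: 1.0, y1: 2.5, offered: false },
  { y0: 3.0, y1: 3.0, offered: false },
  { y0: 1.0, y1: 3.0, offered: true },
  { y0: 2.5, y1: 3.5, offered: false },
  { y0: 1.5, y1: 2.0, offered: true },
  { y0: 1.5, y1: 2.5, offered: false },
];

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// True ATE: needs both potential outcomes, so it is never available in practice.
const trueAte = mean(sample.map((c) => c.y1 - c.y0));

// Estimate: difference in observed means between the two variations.
const treatedMean = mean(sample.filter((c) => c.offered).map((c) => c.y1));
const controlMean = mean(sample.filter((c) => !c.offered).map((c) => c.y0));
const estimatedAte = treatedMean - controlMean;

console.log(trueAte.toFixed(1), treatedMean, controlMean, estimatedAte); // 0.9 3 2 1
```

The estimate of 1.0 misses the true 0.9 because ten customers is a tiny sample. With a different random assignment the estimate would land elsewhere, and averaged over all possible assignments it would match the true ATE.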
Why the ATE isn't the full story
The ATE is by definition an average. Like any average, it hides the shape of the distribution behind it.

Here's the distribution of the individual treatment effects illustrated in the scatter plot above. A visible chunk of customers sit at exactly zero: the offer did nothing for them, the dots that sat on the diagonal earlier. Many others have a positive effect. A few gain more than two orders a month. The ATE of +0.9, marked as the dashed line, is the mean of this whole distribution.
Two customers at the same Y(0) can have very different individual effects, and two customers with the same individual effect can be at very different Y(0)s. The ATE collapses all of that variation into one number.
The Conditional Average Treatment Effect (CATE)
A nice average. Is that really all there is? Nothing more to say about the offer?
Back to Adiya and Marco. Her effect is +0.5, his is +1.5. The ATE of +0.9 falls between them, but it doesn't really describe either of them. It's an artifact of mixing two very different customer types into a single number.
Figure 4 shows the spread, but nothing about who sits where on it. To dig further, we can split the sample into subgroups we believe respond differently. For example, take the same 10,000 customers from Figure 2 and split them by their order frequency in the three months before the experiment. Split at the median, and you get two groups: low-frequency customers (the Marcos) and high-frequency customers (the Adiyas).
The average effect within each subgroup has a name of its own: the Conditional Average Treatment Effect, or CATE. Every dimension split in your experiment scorecard is a CATE.
What is the Conditional Average Treatment Effect (CATE)?
A CATE is the ATE conditional on one or more characteristics. It's the answer to "what's the treatment effect for customers who look like this?", where "this" can range from a broad subgroup to a specific individual.

The orange cloud floats higher above the diagonal than the purple. Low-frequency customers' CATE is +1.0 orders per month. High-frequency customers come in at +0.7. The overall ATE of +0.9 is the weighted average of the two.
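The weighted-average relationship can be checked with quick arithmetic. The sketch below assumes each subgroup holds half of the customers, which follows from splitting at the median; the CATE values are the rounded ones quoted above.

```javascript
// CATEs from the text. Equal shares are an assumption that follows from the
// median split: each subgroup holds about half of the customers.
const subgroups = [
  { label: "low-frequency", cate: 1.0, share: 0.5 },
  { label: "high-frequency", cate: 0.7, share: 0.5 },
];

// ATE as the share-weighted average of the subgroup CATEs.
const weightedAte = subgroups.reduce((sum, g) => sum + g.cate * g.share, 0);
console.log(weightedAte.toFixed(2)); // 0.85
```

Because the inputs are rounded, the result lands at 0.85, which is consistent with the reported ATE of about +0.9.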
In this experiment, low-frequency customers benefit most from the offer, yet Figure 3 shows many of them never activate. The hint here is that there could be potential in these low-frequency non-activators that never gets unlocked. Combining the pre-experiment split with activation gives you four informal segments (high/low frequency × activated/didn't) and a richer targeting picture — though note activation is only observable after the fact, while frequency is targetable upfront. Capturing the low-frequency potential probably requires upgrading the offer for that group: more visible placement, a longer trial window, messaging that emphasizes no commitment. When CATEs vary like this, the treatment has heterogeneous (different) treatment effects on different types of customers; for more on surfacing and acting on them, see Your experiment lift is an average — which users actually benefited?
The Intention-to-Treat effect (ITT)
Something keeps nagging from Figure 3. The experiment offered the trial to a random half of customers. About 40% of them activated; the other 60% didn't bother. The ATE estimation uses all of them, while most of them obviously aren't affected. Is this really all we can muster?
The experiment randomized the offer, not the trial experience itself. Customers decided on their own whether to activate. The +0.9 is therefore the effect of being offered the trial, averaged across everyone who received the offer, activators and non-activators alike.
The question the junior analyst is trying to answer is: what's the effect of actually trialing? The +0.9 doesn't answer that question. It's the Intention-to-Treat effect, or ITT: the effect of the assignment, regardless of whether the assigned customer actually took up the treatment. When only some of the assigned take it up — partial compliance — the ITT dilutes toward zero as the non-takers pull the average down. When everyone assigned takes it up, ITT equals the effect of the treatment itself on everyone.
What is the Intention-to-Treat effect (ITT)?
The ITT is the effect of assignment, not of actually receiving the treatment. It includes customers assigned to treatment who never took it up — in our case, the 60% who dismissed it. When everyone assigned complies, ITT equals the effect of the treatment itself; when compliance is partial, the ITT dilutes. In experiments, this is relevant when what you consider treatment is not what you can assign directly, and take-up is thus voluntary.
The ITT is an honest number for the thing the experiment actually varied. But it isn't always the number the business needs. They might want the effect of the trial on customers who actually used it: the Marcos who changed their habits, not the ones who didn't bother.
The Local Average Treatment Effect (LATE)
Under some assumptions¹, you can get at it without running anything new. Divide the ITT by the activation rate, and the ratio estimates the Local Average Treatment Effect (LATE): the effect on compliers, customers who activated because the offer moved them to. In our experiment, LATE ≈ 0.9 / 0.4 ≈ +2.3 orders per month. Much bigger than the ITT, because the ITT averaged in zeros from everyone the offer didn't move. That could be part of the story too.
What is the Local Average Treatment Effect (LATE)?
The LATE is the average treatment effect on compliers: customers who took up the treatment because of their assignment, but wouldn't have on their own. It excludes always-takers (who'd take it regardless) and never-takers (who'd never take it). LATE is identified from experimental data using instrumental variables, when assignment is random and only affects outcomes through take-up.
Proper LATE estimation uses instrumental variables, which most experimentation platforms don't provide out of the box. The ratio formula above gives you a point estimate; IV gives you confidence intervals too.
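The ratio formula is often called the Wald estimator: the effect of assignment on the outcome divided by the effect of assignment on take-up. A minimal sketch follows; `lateEstimate` is a hypothetical helper that returns a point estimate only, and it relies on the assumptions mentioned above holding.

```javascript
// Ratio (Wald) estimate of the LATE: the effect of assignment on the outcome
// divided by the effect of assignment on take-up.
function lateEstimate({ itt, takeUpTreatment, takeUpControl = 0 }) {
  const takeUpDifference = takeUpTreatment - takeUpControl;
  if (takeUpDifference <= 0) {
    throw new Error("Assignment must increase take-up for the ratio to be meaningful");
  }
  return itt / takeUpDifference;
}

// Numbers from the text: ITT of +0.9 and a 40% activation rate. The default
// takeUpControl of 0 assumes control customers could not activate the trial.
const late = lateEstimate({ itt: 0.9, takeUpTreatment: 0.4 });
console.log(late.toFixed(2)); // 2.25
```

The unrounded ratio is 2.25, which the text reports as roughly +2.3. If some control customers could also get the trial, the denominator shrinks to the difference in take-up rates and the estimate grows accordingly.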
For most practitioners, the takeaway is simpler. Your A/B test gives you an ATE of the thing you randomized. That's often the most actionable number for a ship decision, because it bakes in the drop-off: no matter how persuasive the prompt, some customers won't bother. If what you really care about is something downstream (the effect of actually trialing, not just being offered the trial), your ATE is an ITT of that downstream thing. LATE strips out the dilution and tells you the effect on the customers the offer actually moved. That is useful when you're decomposing the mechanism, less so for sizing the launch.
Five estimands at a glance
We've covered four estimands rooted in randomization — ATE, CATE, ITT, LATE. The table below adds a fifth, ATT, that you'll meet when randomization isn't on the table.
ATT, the Average Treatment Effect on the Treated, is the version you'll meet the moment you can't randomize — observational studies, geo experiments, marketing campaigns deployed without a control. The randomization that gives the ATE its meaning is unavailable, and analysts fall back on identifying the effect specifically on the units that ended up treated.
What is the Average Treatment Effect on the Treated (ATT)?
The ATT is the average of Y(1) − Y(0) restricted to units that actually received the treatment. Under random assignment, ATT and ATE coincide because the treated and control groups look alike in expectation. Outside randomization — matching, difference-in-differences, synthetic control, GeoLift — that equivalence breaks, and the ATT is what those methods are designed to estimate.
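For reference, the five estimands can be written side by side. The notation is an assumption added here, not the article's own: Z is the randomized assignment (the offer), D is actual take-up (activating the trial), X is a set of covariates, and Y(1), Y(0) are the potential outcomes used in the definition above. The last line is the ratio from the LATE section, which equals the effect on compliers only under the footnote's assumptions.

```latex
\begin{align*}
\text{ATE}  &= \mathbb{E}[\,Y(1) - Y(0)\,] \\
\text{CATE} &= \mathbb{E}[\,Y(1) - Y(0) \mid X = x\,] \\
\text{ATT}  &= \mathbb{E}[\,Y(1) - Y(0) \mid D = 1\,] \\
\text{ITT}  &= \mathbb{E}[\,Y \mid Z = 1\,] - \mathbb{E}[\,Y \mid Z = 0\,] \\
\text{LATE} &= \frac{\mathbb{E}[\,Y \mid Z = 1\,] - \mathbb{E}[\,Y \mid Z = 0\,]}
                    {\mathbb{E}[\,D \mid Z = 1\,] - \mathbb{E}[\,D \mid Z = 0\,]}
\end{align*}
```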
Which treatment effect goes in the deck?
Your junior analyst had +2.0. Your senior data scientist had +0.9. We now have a vocabulary for what each one is, and what each one isn't.
The junior analyst's +2.0 isn't on the list. He compared the orders of customers who activated the trial with those who didn't, inside the treatment group, and reported the difference as the trial's effect. That's a biased comparison of two subpopulations with different baselines. Easy to compute, easy to misinterpret, and more common than it should be.
The senior data scientist's +0.9 is an ATE in the offer framing, and at the same time an ITT in the trial-experience framing. It's the effect of the thing the experiment actually varied. It isn't the effect of the trial itself on customers who used it, and if the launch discussion treats it that way, that's when you come in.
The point isn't to use more jargon with your stakeholders. It's to think more clearly yourself about what a number represents, and to frame the discussion around that.
When you absolutely, positively need to know exactly what your experiment measured — accept no substitutes.
References
Holland, P. W. (1986). "Statistics and Causal Inference." Journal of the American Statistical Association, 81(396), 945–960.
Imbens, G. W., & Angrist, J. D. (1994). "Identification and Estimation of Local Average Treatment Effects." Econometrica, 62(2), 467–475.
¹ Formally, the Wald estimator (ITT ÷ activation rate) identifies LATE under two key assumptions. Monotonicity: assignment (offer) doesn't make anyone less likely to take up treatment (trial). In our setup this is automatic because the trial is only available through the offer. Exclusion restriction: the offer affects orders only via activation, not directly; seeing the offer doesn't itself motivate orders. These sit on top of the usual random-assignment setup.

Chasing Velocity in A/B Testing: Why More Experiments Can Mean Less Learning
TL;DR: Experiment velocity is a useful diagnostic, not a goal. The moment you make it a KPI, teams start optimizing for the count instead of the learning, and the program quietly drifts toward trivial tests that answer trivial questions. The healthier target is the rate at which you learn, which usually requires a portfolio of easy, medium, and hard experiments.
There is a particular kind of dysfunction that only shows up in experimentation programs that are working well. The team has the tooling. The events and data pipelines are working, and the metrics and results are trustworthy. Experiments ship every week. By every visible measure, the program is mature. But the experiments themselves keep getting smaller, the hypotheses keep getting safer, and the insights keep getting shallower. The program is producing more experiments than ever and somehow learning less. This is what happens when velocity stops being a diagnostic and starts being the goal.
It is a quiet failure mode, which is part of why it persists. Nothing breaks. No one misses a deadline, and the dashboards still tell a flattering story. The shift from "what is the most valuable thing to learn?" to "what is the fastest thing we can test?" happens one prioritization meeting at a time, and by the time anyone notices, it has become the way the team works. The trap is not that velocity is a bad thing to care about. It is that velocity, treated as the main metric, will reliably get you a program that runs more experiments and understands its users less.
Why experiment velocity became the main metric
Experimentation maturity is usually described in terms of throughput. The popular crawl, walk, run, fly framework puts a handful of experiments at one end and thousands at the other, with every feature change running as an A/B test at the top tier. It is easy to extrapolate: if the most advanced teams run the most experiments, then running more experiments must be how you become advanced.
Teams that hit the “fly” level have done a lot right. They launch features behind a feature flag. They have the data piped so the marginal cost of an additional experiment is close to zero. They trust their experimentation platform and know how to interpret results. Culturally, they are more comfortable learning from users than arguing in planning meetings. Throughput matters.
If your team can only launch one experiment a quarter, you have a velocity problem. But it does not follow that maximizing the number of experiments is the right objective forever. That is where teams get into trouble.
The problem with velocity as a KPI
Velocity is a proxy for the real goal, which is learning what your users want from your product. A team running a lot of experiments looks like a team asking lots of questions, challenging assumptions, and replacing opinion with evidence. More tests, more learning. The logic is not wrong. It is just incomplete.
The moment a proxy becomes a target, people optimize for the measurement instead of the thing the measurement was supposed to represent. This is Goodhart’s law. When a measure becomes a target, it stops being a good measure.
Experiment velocity is a textbook example. If leaders say “we want to learn faster,” teams build better systems, reduce friction, and ask sharper questions. If leaders say “we need to run more experiments this quarter,” the behavior changes. Teams look for the fastest path to a count, not the fastest path to insight.
That makes the metric easy to game. A team can inflate velocity by prioritizing simple tests over difficult, high-value ones, or by launching half-formed tests just to show they are testing. From the dashboard, this looks like progress. In reality, the organization may be learning less.
The irony is that this produces the opposite of what leaders wanted. The original goal was faster learning and a better-performing product. Setting velocity as the headline KPI lowers the quality of that learning. Teams become more active but less ambitious. They generate fewer durable insights. The program optimizes for motion rather than understanding.
The hidden cost is shallower learning
Trivial experiments answer trivial questions. A button-color test tells you which variant got more clicks. It rarely tells you anything about your users, their motivations, or the constraints in your product.
A harder experiment often teaches something more generalizable. Suppose you move the paywall later in onboarding, and conversion to paid goes up. That result is not just about one UI sequence. It challenges an institutional belief about when users are ready for purchasing. It suggests users need more confidence before being asked to commit. It can influence pricing, packaging, and other related projects.
That is a qualitatively different kind of learning. The best experimentation programs do not just produce winners. They produce insight. They learn from experiments that fail and use those learnings to drive future iterations. They help an organization understand which assumptions were wrong. Chasing easy velocity sacrifices that deeper layer.
Think like a portfolio manager
The fix is not to reject velocity. It is to stop treating velocity as the only mark of success. A healthy experimentation program runs a portfolio.
Some experiments should be fast and cheap. These iterative tests improve local conversion points, clarify messaging, and tune workflows. They keep momentum high and help teams build instincts.
Some should be medium-scope bets. A bigger design change, a more consequential workflow adjustment, a new targeting rule.
Some should be hard. These force you to instrument something new, redesign a key system, or challenge an assumption that has been sitting untouched for years.
If your portfolio only contains the first category, you are under-reaching. If it only contains the third, you are overloading the system. The right balance depends on team maturity, traffic, engineering capacity, and risk tolerance. But balance is the point. Not every experiment should be bold. Not every experiment should be easy.
What to measure instead of raw velocity
If you want teams to behave differently, measure differently. Velocity should still be measured, because it is a useful operational signal, but it should sit alongside other, harder-to-quantify measures:
• How many strategically important surfaces are actually being tested?
• What percentage of experiments target known bottlenecks in activation, retention, or monetization?
• Did the experiment teach us something reusable, even if the variant did not win?
• Are hypotheses clear and well-structured? Are they trying to understand user behavior?
• Did the team define guardrails and power the test appropriately?
Experiment quality is not about launching tests. It is about designing them well enough that the result is trustworthy and useful.
Questions for experiment review
One practical fix is to change the review conversation. Instead of asking only “how many experiments did we run,” ask:
• What important question did this experiment answer?
• What assumption was being challenged?
• What did we learn that changes future roadmap decisions?
• What meaningful experiment are we currently avoiding because it is hard?
That last question is especially useful. Every mature product org has a backlog of experiments it is quietly avoiding. They are hard to instrument. They cross team boundaries. They touch pricing, relevance, or onboarding logic and feel risky. Those are often exactly the areas leaders should pay attention to.
Conclusion: Velocity matters, but it is not the goal
The mission is to increase the rate at which your organization learns important truths about users, products, and business tradeoffs.
Sometimes that means running more experiments because your current process is too slow. Sometimes it means resisting the urge to inflate the count and putting real effort into the experiments that are harder, riskier, and more consequential.
A mature experimentation culture does not ask only “how can we run more tests?” It asks, “how can we run more of the right tests?” That is a much better optimization target. If you are building toward that kind of program, GrowthBook gives you the feature flagging, metric definitions, and analysis layer to make harder experiments easier to run.
What Makes Experimentation Unique at Chess.com
Chess.com has users who cannot move a pawn and users who play at FIDE competitive ratings. Both groups open the same app. For anyone running an experimentation program, that kind of skill variance changes almost every decision you make.
On the latest episode of The Experimentation Edge, I sat down with Nafis Shaikh, Director of Product Management at Chess.com, to talk about how his team designs experiments for a 10 million daily active user base that spans absolute beginners and rated competitors. Chess.com ran 400 experiments in 2024 and has set a goal of 1,000 in 2025, already 195 deep in Q1. The scale is impressive, but what makes their program genuinely different isn't the volume. It's what they've had to learn about designing for users whose needs pull in opposite directions, and their willingness to push past surface-level test results into something more useful.
Here's what stood out.
One Product, Wildly Different Users
Chess.com turned 20 last year. For most of its history, the product was built by Chess players for Chess players, which worked because the user base was relatively homogeneous. Starting in 2020, the population exploded. Today the app serves roughly 10 million daily active users, and the demographic spread inside that number is extreme.
Some users have never played Chess in their lives. They don't know how the pieces move. They don't know what a pin is. At the other end of the spectrum, Chess.com has FIDE-rated competitive players, people with tournament histories and formal ratings who are using the product to prepare for real games.
This is where one-size-fits-all quietly falls apart. Nafis gave a specific example: the app has an AI coach that talks to users during play, explaining what's happening and offering tips. The way the coach speaks to a rated player is, and has to be, completely different from how it speaks to someone who just learned how the knight moves. Throw advanced concepts at a beginner and you confuse them and make the experience worse. Dumb down the feedback for an expert and it's worthless.
For an experimentation program, this has a concrete implication: every test needs to think about skill segments, not just aggregate results. A feature that lifts engagement overall might be destroying the experience for 20% of users while thrilling another 20%. If you're only looking at the average, you miss it. You also pick the wrong winner.
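To make the point concrete, here is a small sketch that computes relative lift per skill segment next to the aggregate. The rows, segment names, and engagement values are invented for illustration and are not Chess.com data. In a real analysis each segment would also need its own uncertainty estimate, since slices are noisier than the whole.

```python
from collections import defaultdict


def lift_by_segment(rows):
    """rows: (segment, variant, value) tuples; variant is 'control' or 'treatment'.

    Returns {segment: relative lift of treatment mean over control mean}.
    Assumes every segment has at least one row in each variant.
    """
    cells = defaultdict(lambda: {"control": [0.0, 0], "treatment": [0.0, 0]})
    for segment, variant, value in rows:
        cell = cells[segment][variant]
        cell[0] += value   # running sum
        cell[1] += 1       # running count
    lifts = {}
    for segment, by_variant in cells.items():
        control_mean = by_variant["control"][0] / by_variant["control"][1]
        treatment_mean = by_variant["treatment"][0] / by_variant["treatment"][1]
        lifts[segment] = (treatment_mean - control_mean) / control_mean
    return lifts


# Hypothetical engagement values: 20% beginners, 60% intermediate, 20% experts.
rows = (
    [("beginner", "control", 10.0)] * 2 + [("beginner", "treatment", 13.0)] * 2
    + [("intermediate", "control", 10.0)] * 6 + [("intermediate", "treatment", 10.0)] * 6
    + [("expert", "control", 10.0)] * 2 + [("expert", "treatment", 8.0)] * 2
)

segment_lifts = lift_by_segment(rows)
overall_lift = lift_by_segment([("all", v, x) for _, v, x in rows])["all"]
print(segment_lifts, overall_lift)
```

In this toy data the aggregate shows a small positive lift while the expert segment drops sharply, which is the pattern an average-only readout would hide.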
A Four-Dimension Framework for Deciding What to Measure
Nafis organizes every metric he cares about across four dimensions, and the order matters:
- Inflows. How effectively does the product bring new users in?
- Engagement. Once they're here, do they do the core thing? For Chess.com that means playing games, doing puzzles, using the coach.
- Retention. Do they come back? Measured at D1, D7, and D30, with weekly active users segmented into new, current, and returning cohorts.
- Monetization. Do they start a trial, and do they end up paying for the subscription?
The order is deliberate. Nafis is explicit that most products cannot short-circuit to revenue. "You actually have to give people a really solid product that they find value in. They'll come back and use the product more often. And when that tipping point hits, they're more likely to pay for your product because they've found the value in it."
Inside Chess.com this shows up as a deliberate division of labor. The monetization team optimizes monetization metrics. The gameplay team optimizes the core experience. These groups do not get confused about who owns what, and, crucially, the gameplay team isn't pressured to justify every experiment through a revenue lens. The bet is that a better core experience eventually lifts everything downstream.
If you're setting up an experimentation program, this is worth copying. Deciding which metrics each team owns, and which ones they explicitly do not, removes a huge source of noise from experiment results.
Going Beyond "Did the KPI Move?"
The part of our conversation that stuck with me most was Nafis's push to evolve how Chess.com treats experiment results.
A lot of experimentation programs live at the level of "we ran test X, metric Y moved Z%." That's fine. It's necessary. But it's not enough. Nafis calls the next question "so what?" What does this result actually say about how users behave? What were they doing in the control condition that makes them respond this way to the treatment? What does the side effect in that other metric tell you about the kind of user this feature attracts?
He also has strong feelings about write-ups. A result that ships as "yeah, this improves retention" is not worth much. The pattern Chess.com is moving toward is narrative: we launched this specific feature on this date, we saw this lift at this step of the funnel, it carried through to this downstream behavior, and here is what we now believe about our users that we did not believe before.
That discipline is what turns a test count into organizational learning. 1,000 experiments per year is a meaningless number if the team cannot tell you what it learned from them. The writing is where the learning gets captured.
Key Learning: Chess.com Users Prefer to Celebrate Their Wins
Now for the specific experiment that made me laugh out loud on the recording.
Chess.com has a feature called Game Review. After a game ends, the coach walks you through each of your moves and explains where you played well, where you blundered, and where you could have done something different. Game Review is Chess.com's freemium hook: everyone gets one free per day, and if you want more, you need a subscription. It's a huge driver of paid conversions.
The original design assumed something that felt obvious. When a player loses, they want to understand what went wrong. So the entry point to Game Review led with the things they had done poorly: here are your blunders, here are your misses, let's figure out what to fix.
Then the team looked at the data. 80% of Game Reviews were happening on wins.
Think about that for a second. Four out of five times a user reached for Game Review, they weren't trying to debug a loss. They were savoring a victory. The feature was introducing itself with a list of their mistakes, and people were opening it anyway, because what they actually wanted to see was the game where they won.
So they ran a test. Same feature. Same analysis engine. Same subscription gate. The only thing that changed was the entry point: instead of leading with "here are your blunders and misses," they led with "here are the good moves you made."
Game Review starts jumped 25%. Subscription conversions went up meaningfully. Same product, completely different framing, significant lift on the metric that actually pays the bills.
Nafis said he was "somewhat dumbfounded" by the magnitude, but the lesson lines up with something he has seen across every game he has worked on at Zynga, Prodigy, and now Chess.com: "People just want to feel good. Focus on the things that make people feel better about themselves. The world's a hard place and people have difficult lives, and when you come to play a game that's supposed to be enjoyable, focus on the things that are enjoyable."
If you are running any consumer-facing product, this is a test worth trying yourself. Look at every surface where you currently lead with user failure: error states, empty states, review flows, retry prompts, churn emails. Ask whether the default framing could be flipped to celebrate what the user got right instead. Then put the reframe in a test. You will probably be surprised.
Listen to the Full Episode
Chess.com's program isn't unique because of its size or tooling. It's unique because of two things: a user base whose skill range forces segment-level thinking on every test, and a team that refuses to stop at "the metric moved." That combination is what turns a test count into genuine understanding.
You can hear my full conversation with Nafis Shaikh, including why experimentation velocity is itself a productivity metric and the strange challenge of measuring whether users are actually listening to an audio coach, on this episode of The Experimentation Edge.
Listen to the full episode: The Experimentation Edge with Nafis Shaikh, Chess.com

What I Learned from Khan Academy About A/B Testing AI
Every team building on top of LLMs faces the same fundamental question: how do you know if your AI feature is actually good? For some products, the answer is straightforward. For others, it requires inventing an entirely new way to measure quality. Khan Academy's journey to A/B testing their AI tutor, Khanmigo, is one of the best examples I've seen of a team solving this hard measurement problem and then using experimentation to dramatically accelerate how fast they improve their product.
Dr. Kelli Hill, Head of Data at Khan Academy, recently joined us for a GrowthBook webinar to walk through their three-year journey from vibes-based prompt testing to rigorous A/B testing of GenAI features in production. Here's what stood out.
Sometimes Measuring AI Impact Is Easy
Sometimes the impact of AI on a product is straightforward to measure. When Typeform introduced an AI-powered form builder, their Chief Product and Technology Officer Alex Bass told us on The Experimentation Edge that it doubled their activation rate, the percentage of users who go from signing up all the way through to publishing a form and collecting data. Out of roughly 50 experiments Typeform ran, nothing else came close to that kind of impact.
In cases like Typeform, the metrics are clear. A user either publishes a form or they don't. The signal is clear and happens quickly. And you can measure it with the same metrics you were already tracking.
What Happens When the Output Is Harder to Evaluate
Khan Academy faced a fundamentally different challenge. Khanmigo is a generative AI-powered tutor that helps students work through math and other subjects. It's not a chatbot for entertainment. It's an educational tool used by students in classrooms. The bar is high: Khanmigo needs to be accurate, it needs to actually teach (not just give answers), and its tutoring quality needs to be measurable at scale.
That last part is the hard part. The same prompt can produce a dozen different responses. The underlying model changes regularly. A response that looks polished might actually reflect poor tutoring practice. And with nearly 200 million registered users and roughly a million daily active users on Khan Academy, they needed measurement that could operate at massive scale.
When Khanmigo first launched, the team had no way to rigorously evaluate quality. They started where everyone starts: reading outputs and making gut judgments. Kelli described their earliest eval work as "vibes-based prompt engineering" in Slack threads. It was useful for building intuition, but it didn't scale, it wasn't repeatable, and it couldn't tell them whether a change actually improved anything.
Turning Something Hard to Measure Into a Real Metric
The breakthrough was deciding to measure cognitive engagement, a construct from learning science research. Khan Academy adapted the ICAP framework (Interactive, Constructive, Active, Passive) published by Chi and Wylie in 2014. The original framework was designed for classrooms, so the team adapted it for AI tutoring interactions, focusing on questions like: who has the agency in help requests? How is the student processing Khanmigo's feedback? Who's driving the ownership of the learning?

The key insight was that cognitive engagement isn't just an abstract academic concept. Khan Academy's prior efficacy research had already demonstrated that students who are more cognitively engaged on the platform get more skills to proficient, and that increased proficiency on Khan Academy transfers to higher scores on third-party assessments. So if they could measure cognitive engagement in Khanmigo conversations, they'd have a metric that actually predicted real learning outcomes.
Building the metric was the hardest part. Kelli was emphatic about this. The team defined a rubric, brought in subject matter experts, and had those experts hand-label student chat transcripts. They iterated on the rubric until they achieved 85% inter-rater agreement on a test dataset. Then they used the agreed-upon labels to create a ground truth dataset.
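The webinar summary does not say which agreement statistic sits behind the 85% figure, so the sketch below shows two common options: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The labels are invented, and the single-letter codes simply stand for the four ICAP categories.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for 20 transcripts: I, C, A, P = the four ICAP levels.
rater_a = ["I"] * 5 + ["C"] * 5 + ["A"] * 5 + ["P"] * 5
rater_b = ["I"] * 4 + ["C"] * 6 + ["A"] * 4 + ["P"] * 5 + ["A"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(round(agreement, 2), round(kappa, 2))
```

The same two functions can compare an automated labeler against human ground-truth labels, which is the validation step described next.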
With ground truth in hand, they built an LLM-as-judge: an AI system that could automatically label transcripts using the same rubric. They fed the judge examples from the ground truth data, iterated on the prompt until the LLM judge's labels matched the human experts with high accuracy, and then scaled it. Today, they process about 20% of Khanmigo's chat data every night through this pipeline, feeding results into dashboards that the team monitors continuously.
Why This Unlocked A/B Testing for GenAI
Once Khan Academy had a reliable metric, they could finally do what they couldn't before: run controlled experiments on Khanmigo and measure whether changes actually improved tutoring quality.
Khan Academy uses GrowthBook for both feature flags and experimentation, self-hosted on top of their existing BigQuery data warehouse. They built additional infrastructure to randomize not just at the user level, but at the individual chat thread level, so each new Khanmigo conversation could be independently assigned to a treatment. This was critical because the unit of analysis for tutoring quality is a conversation, not a user.
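The difference between user-level and thread-level randomization comes down to which identifier gets hashed. The sketch below is a generic deterministic bucketing function using SHA-256. It is not GrowthBook's hashing algorithm or Khan Academy's implementation, and the experiment key and ID formats are hypothetical.

```python
import hashlib


def assign_variation(experiment_key, unit_id, variations=("control", "treatment")):
    """Deterministically map a randomization unit to a variation.

    Hashing the experiment key together with the unit ID keeps assignments
    stable for a given unit and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(variations)
    return variations[index]


# User-level: every conversation from this user lands in the same variation.
user_level = assign_variation("tutor-prompt-v2", "user-123")

# Thread-level: each new conversation is assigned independently, so one user
# can see different variations in different threads.
thread_level = [
    assign_variation("tutor-prompt-v2", f"user-123:thread-{i}") for i in range(5)
]
print(user_level, thread_level)
```

When the unit is a thread, the analysis has to count threads as well. Repeated threads from the same user are not fully independent, so it is worth checking how your analysis handles that.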
The experiments they run aren't typical feature tests. They're testing different versions of a prompt, changes to system instructions, and even head-to-head model comparisons (Gemini vs. OpenAI models, for example). Kelli described it as "hill climbing": making very small, deliberate changes, sometimes just a single sentence in a prompt, and measuring whether cognitive engagement moves.
Their primary metrics are cognitive engagement and performance (are students getting more skills to proficient?). Their secondary and guardrail metrics include non-desirable behaviors (like giving the answer away), thread length, verbosity, and response latency. This layered approach ensures they're not accidentally improving one dimension while degrading another.
From Speed Bump to Safety Net
One of the most striking things Kelli shared was how the culture around experimentation shifted at Khan Academy. Before they had this infrastructure in place, experimentation was sometimes perceived as a speed bump, an extra hurdle before shipping. That's a common tension in product organizations.
But with GenAI, the calculus changed. LLM outputs are non-deterministic. A small prompt change can shift output dramatically. A response that looks better to a human reviewer might not reflect better tutoring. The AI tutor quality team at Khan Academy became the heaviest users of GrowthBook specifically because they realized that without A/B testing, they were relying on intuition in a domain where intuition consistently fails.
Kelli put it directly: experimentation went from being perceived as something that slows down shipping to being "a safety net" for understanding how changes actually perform across millions of users and prompts. The team now sees it as essential infrastructure, not overhead.
What This Means for Teams Building on LLMs
Khan Academy's journey illustrates a pattern that applies broadly. If you're building AI features, your path to effective experimentation runs through measurement. Sometimes you'll have a Typeform situation where existing metrics already capture the impact. But often, especially when the AI's output is complex or subjective, you'll need to invest in building new evaluation frameworks first.
The process Khan Academy followed is replicable: define a rubric grounded in domain expertise, get humans to agree on labels, build a ground truth dataset, train an LLM-as-judge, validate it, and scale it. It's not fast. Kelli described a three-year evolution from vibes testing to production A/B testing. But once you have that metric in place, the standard toolkit of A/B testing becomes incredibly powerful for improving AI features.
If you want to hear the full story, you can watch the webinar recording or read the Khan Academy research paper. And if you're looking for an experimentation platform that can handle GenAI testing at scale, give GrowthBook a try.

Designing A/B Testing Experiments for Long-Term Growth
Ronny Kohavi — Stanford PhD, Ex-VP and Technical Fellow at Airbnb, formerly Microsoft and Amazon — is one of the top cited researchers in Computer Science and a leading voice in experimentation. He recently joined Luke Sonnet, Head of Experimentation at GrowthBook, for a webinar sharing best practices, mistakes to avoid, and surprising insights into how often experiments actually succeed. Watch Designing Experiments for Long-Term Growth on demand.
This article covers the key principles Ronny and Luke shared for designing experiments that drive long-term growth: why experimentation matters, why you shouldn’t ship on flat results, which key metrics to track, and how to create a shipping criteria framework. Whether you're just getting started with experimentation or looking to sharpen how your team makes decisions, these are the foundational concepts that separate programs that deliver real impact from those that don't.
In science, randomized controlled experiments are the gold standard, sitting at the top of the hierarchy of evidence. A/B tests are the online equivalent and the most reliable tools teams have for determining whether a change actually has an effect — whether that's a new feature, a UI change, a pricing change, or a backend optimization.
The problem is that most teams haven't done the harder work first: agreeing on what success actually looks like before the data comes in. Without that foundation, even a well-run experiment produces a result nobody knows how to act on.
Stop guessing: embrace the high failure rate
Humans are systematically bad at predicting what will work and assessing the value of ideas. You cannot reliably judge which ideas are valuable before testing and will be wrong far more often than most teams expect. An effective experimentation program is critical for focusing effort toward what actually works.
Here is some surprising success rate data from across the industry:
Microsoft's 33% success rate stands out, but this came at a cost. Significant upfront work went into scoping and refining ideas before they ever entered an experiment, which directly impacted that number.
The median organization sees roughly 10% of experiments move the metrics they were designed to improve. Given this success rate, we can compute the False Positive Risk (FPR) — the probability that a statistically significant result is actually a false positive. At a 10% success rate with standard thresholds (𝛼=0.05, 80% power), that risk is around 22%, meaning roughly 1 in 5 'successful' experiments are actually false positives. Most teams assume p < 0.05 means they will rarely make mistakes, but the math shows otherwise.
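The 22% figure can be reproduced with a few lines of Bayes' rule. This is a sketch under two simplifying assumptions that the text does not spell out: every idea that is not a true success has exactly zero effect, and only significant results in the winning direction count, so the relevant error rate is half of the two-sided 𝛼. The function name is illustrative.

```python
def false_positive_risk(success_rate, alpha=0.05, power=0.80):
    """P(no real effect | statistically significant win).

    success_rate: prior share of ideas that truly improve the metric.
    alpha: two-sided significance threshold; only the positive tail
           (alpha / 2) can produce a false "win".
    """
    p_null = 1 - success_rate
    false_wins = (alpha / 2) * p_null   # null ideas that look like wins
    true_wins = power * success_rate    # real improvements that get detected
    return false_wins / (false_wins + true_wins)


fpr_at_10_percent = false_positive_risk(0.10)
print(round(fpr_at_10_percent, 2))  # about 0.22
```

Raising the prior success rate, increasing power, or tightening 𝛼 all pull the risk down.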
The most impactful teams are the ones with the infrastructure to test fast and realign priorities based on evidence. A $120M improvement at Bing sat in the backlog for months because nobody thought it was worth testing. At Airbnb, the biggest win was a one-line code change. Neither of these could have been predicted. Both required running the experiment.
The importance of building and aligning on A/B testing key metrics
An experimentation program is only as good as the metrics it optimizes for. These metrics include:
- Success or goal metrics: Define why an organization or product exists and what success looks like (stock price, revenue, market share, etc.). These are the real objectives, but they are not easy to move or measure in the short term.
- Driver metrics: Short-term metrics that are the signals believed to predict movement in success metrics. These are what you actually measure to signal success.
- Overall Evaluation Criterion (OEC): The weighted combination the organization agrees to optimize for, typically composed of a few success and driver metrics. Defining a good OEC is one of the hardest and most important things an experimentation program does.
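One simple way to picture a weighted OEC is as a weighted sum of each metric's relative lift. The sketch below uses hypothetical search-style metric names, weights, and lifts. Choosing the actual metrics and weights is the business decision the text describes, and nothing here comes from the webinar.

```python
def oec_score(relative_lifts, weights):
    """Weighted sum of relative lifts (treatment vs. control).

    A negative weight marks a metric the organization wants to push down.
    """
    missing = set(weights) - set(relative_lifts)
    if missing:
        raise KeyError(f"no lift reported for: {sorted(missing)}")
    return sum(weights[name] * relative_lifts[name] for name in weights)


# Hypothetical weights: reward repeat usage, penalize effort per session.
weights = {"sessions_per_user": 0.7, "queries_per_session": -0.3}

# Hypothetical experiment readout: +2% sessions per user, -5% queries per session.
lifts = {"sessions_per_user": 0.02, "queries_per_session": -0.05}

score = oec_score(lifts, weights)
print(round(score, 3))
```

A positive score means the variant moved the agreed combination in the desired direction. A single number like this is only as trustworthy as the weights behind it.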
What goes wrong without a good OEC
Real-world scenarios from search engines (Bing, Google) and booking sites (Airbnb, VRBO) illustrate how badly things can go wrong with poor OECs, despite well-meaning intentions.
The search engine example
At Bing, naively using queries per user as the OEC would have led to very poor decisions. The example Ronny gave was a ranking bug that returned terrible search results. Queries rose by 10% because users had to reformulate them several times, and ad revenue rose by 30%. The short-term metrics looked great, but the product was broken.
A better objective here is to minimize queries per session (users should be able to find answers quickly) and maximize sessions per user (repeat usage indicates high value). Bing actually tracks a suite of metrics, including sessions per user, queries per user, time to success, revenue per user, and more. We'll cover a framework for identifying and aligning on good OECs later in the article.
A booking site example
Similarly, a booking platform such as Airbnb that ignores satisfaction signals like user rating and instead optimizes purely for conversion rate is optimizing for the wrong thing. If users book listings they end up hating, they don't return.
A better OEC would also include a measure of satisfaction, such as the user's star rating, so you can build machine learning models that predict whether this user will book a listing they love and rate five stars. Deciding on the trade-off between multiple metrics, such as revenue and user satisfaction, is a key business decision.
The flat result trap: the most expensive mistake in product
Getting your OEC right is important, but only if you're willing to act on what the data actually tells you. A flat result means an experiment didn't produce a statistically significant improvement in the OEC. Shipping flat means deploying that feature anyway. In nearly every case, that is a decision error.
One example from Bing was a major effort, involving roughly 100 engineers, to introduce a third pane to the search window. The experiments failed to show value, but the feature shipped to all users anyway because it was deemed a strategic business move. A year later, after countless additional experiments also failed to show value, the third pane was rolled back at significant cost to Bing.
Had Bing acted on what the data told them to begin with, they could have failed much faster, avoided months of sunk cost, and redirected their engineering resources toward something that actually moved the needle.
Debunked: common justifications for shipping flat
Ronny shared the four primary reasons he has seen teams use to rationalize the decision to ship flat and dug into the real implications of each.
Justification #1: It’s flat, we’re not hurting the users or business
A flat result doesn't mean no effect exists. All it tells you is “we didn’t find enough evidence of an effect.” The experiment could simply be underpowered. "Not statistically significantly worse" is not the same as safe to ship. The true effect could still be negative.
Justification #2: Team morale depends on shipping
Shipping a flat feature to protect morale means celebrating shipping rather than moving goal metrics. The shipped code also complicates the codebase and carries ongoing maintenance costs. The culture should be results-oriented and recognize that many ideas fail. Hold a learning review, share what was discovered, and move on. Failures that generate learning are worth celebrating.
Justification #3: It’s an enabler for future work
You can cut through this justification with one question: if we ship this and deprecate the old version, would we ever roll it back? At Bing, the answer was yes. Every flat enabler that ships becomes code that must be maintained and a foundation you'll keep building on even when the follow-on value never arrives.
Justification #4: It’s strategic
Strategic conviction is not a substitute for evidence, and as the data shows, even small changes are hard to predict correctly. Set a vision, but move toward it in small, testable steps. Test a meaningful component first, get data, then adjust.
A framework for making better experimentation decisions
With the importance of good metrics and understanding of what can go wrong without them clearly laid out, the conversation then shifts to a practical approach for building a decision framework that connects short-term measurements to long-term goals without overcomplicating the process.
Bridging short-term experimentation metrics to long-term goals
The messy reality is that most measurable short-term metrics don’t align 1:1 with business goals, so you have to build a framework that connects the two.
Start by identifying your long-term goals and what you can actually measure. From there, identify the short-term metrics that are the strongest indicators of those long-term goals. These are the signals that move in the right direction when the product or business is genuinely improving.
Once you've identified the right metrics, put guardrails in place. Guardrails are secondary metrics you monitor to ensure that improving your primary metric isn't coming at the expense of something else that matters, such as revenue, retention, or user satisfaction. They don't have to move, but they can't go backward.
A word of caution: overcomplicating things and tracking too many metrics can make it difficult to act. Before running an experiment, think critically about what you would do if your metrics told conflicting stories afterward. This exercise forces clarity around prioritization and how you make business decisions around tradeoffs. The goal is to identify the key signals you can build a decision framework around so you know exactly how you'll act on them.
A real-world shipping example: LLM chatbots
An example that highlights this concept is an AI chatbot company. They can't measure customer lifetime value in a two-week experiment. Instead, they’ll need to look at the short-term metrics that signal value, such as distinct sessions per user, topic breadth, short-term subscription conversion, and how often responses are copied externally. Build a framework connecting these to the long-term goal, validate against historical data, and you have an OEC you can actually experiment on.
But throwing all of these metrics into your results dashboard can complicate the picture. If some results are flat or vaguely negative, others are statistically significant and negative, and others are statistically significant and positive, how do you make a shipping decision?

This is exactly where clearly defined shipping criteria earn their value.
Shipping criteria: enabling independent shipping decisions at scale
Translate your metrics into explicit shipping criteria that are determined prior to an experiment launching. This is a decision framework that enables independent shipping decisions and eliminates bias from decision-making during the evaluation phase.
Some decisions are very straightforward. If two experiments produce the same revenue change, you would choose the one with higher Daily Active Users.

However, a clearly defined framework for shipping criteria becomes increasingly necessary when metrics conflict, for example when DAU is higher in one experiment but revenue is higher in the other. In this situation, you need to understand the tradeoff between these metrics that you’re willing to accept when shipping.
This approach encodes your decision-makers' preferences into a repeatable framework so shipping decisions are consistent, defensible, and free from bias.

Luke’s Twitter example
An example from Twitter highlights how this works in practice. Daily Active Users (DAU) was a key metric for Twitter, but they wanted to make sure that people were using the product repeatedly and over time to see that they're getting value out of it in a wide variety of applications. Some of the measured indicators included tweets created, likes, and other forms of engagement. They used the decision framework below to determine when to ship:
- If DAU is up and stat sig → ship
- If DAU is negative → rollback
- If DAU is up, not stat sig and no guardrails are negative:
  - If engagement metrics are up (tweets created, likes, etc.) → ship
  - Otherwise → experiment review
- Murky results → rollback
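The rules above can be written down as a small function so the decision is made by the criteria rather than by whoever is looking at the dashboard. This is a minimal sketch: the `Result` fields and function name are hypothetical, and it assumes "murky" means any case the explicit rules don't cover (for example, DAU up but not significant with a guardrail negative).

```python
from dataclasses import dataclass


@dataclass
class Result:
    dau_delta: float           # relative change in DAU, treatment vs. control
    dau_stat_sig: bool         # is the DAU change statistically significant?
    guardrails_negative: bool  # did any guardrail metric move the wrong way?
    engagement_up: bool        # tweets created, likes, etc. trending up?


def shipping_decision(r: Result) -> str:
    if r.dau_delta > 0 and r.dau_stat_sig:
        return "ship"
    if r.dau_delta < 0:
        return "rollback"
    if r.dau_delta > 0 and not r.guardrails_negative:
        # DAU is up but not significant: let engagement break the tie.
        return "ship" if r.engagement_up else "experiment review"
    # Assumption: everything else counts as a murky result.
    return "rollback"


print(shipping_decision(Result(0.01, False, False, True)))  # ship
```

Writing the criteria as code before launch makes the agreed tradeoffs explicit and easy to audit later.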
This type of framework scales. It forces tradeoffs to be agreed on before you're under pressure from a live result.
A key note to remember is that your metric models will likely drift over time. This is something teams need to revisit regularly as their product and business evolve. The metrics that predicted success a few months ago may not be the right ones today.
Closing: shift the experimentation culture
Ronny and Luke close with a shared belief: the teams that win at experimentation aren’t always the ones with the most resources or sophisticated tools, but the ones that have built a culture around learning.
The most important piece of advice is to shift the organizational mindset from celebrating shipping to celebrating learning. Most ideas will fail. The teams that internalize this stop treating failed experiments as something to hide and start treating them as the mechanism by which they get smarter and faster over time.
That cultural shift is supported by the practical framework Luke outlined. When you have clearly defined metrics, explicit shipping criteria, and a shared understanding of your tradeoffs, experimentation becomes the foundation for confident, independent decision-making at scale.
Key takeaways
- Most experiments fail. The median industry success rate is ~10%, meaning you will be wrong far more often than you expect. An effective experimentation program is how you find what actually works.
- False positive risk is higher than most teams realize. At a 10% success rate, roughly 1 in 5 "winning" experiments are actually false positives, even when running at p < 0.05.
- Your experimentation program is only as good as the metrics it optimizes for. Poorly defined OECs lead to decisions that look good on paper, but break the product.
- Shipping flat is a decision error in nearly every case. "Not statistically significantly worse" is not the same as safe to ship. The true effect could be negative and the code will have maintenance costs.
- Short-term metrics rarely align 1:1 with long-term business goals. Build an explicit framework connecting the two and put guardrails in place to protect what actually matters.
- Define your shipping criteria before the experiment runs, not after. This eliminates bias, enables independent decision-making, and forces tradeoffs to be agreed on in advance.
- Shift the culture from celebrating shipping to celebrating learning. The teams that win at experimentation are the ones that treat failed experiments as the mechanism by which they get smarter.
Want to go deeper? Ronny teaches two online courses on Maven
Accelerating Innovation with A/B Testing: Ronny’s flagship course and recommended starting point for most practitioners
Advanced Topics in A/B Testing: A follow-on to Accelerating Innovation with A/B Testing for practitioners with a solid foundation in p-values, statistical power, and OEC design

How a Team of 4 Used A/B Testing to Help Fyxer Grow from $1M to $35M ARR in 1 Year
A team of 4 growth engineers ran 360 experiments in a year, helping Fyxer grow from $1M to $35M ARR. Here's how they combined a growth engineering mindset with AI-powered coding to test at a pace most companies can't match.
How Fyxer used AI coding and GrowthBook to run 541 experiments in 1 year
Something remarkable is happening at Fyxer. The AI email assistant grew from $1M to $35M in annual recurring revenue last year. This year, they’re targeting $100M to $150M. Behind that trajectory is a company-wide culture of experimentation that produced 541 experiments in twelve months, more than two per working day. The growth engineering team alone, just four people led by Kameron Tanseli, accounted for 360 of those.
The story of how they did it comes down to two things: the right mindset and an AI-first approach to experimentation. The mindset meant treating every product change as a hypothesis to validate, not a feature to ship. The AI-first approach meant using tools like Cursor, Claude, and GrowthBook to compress the entire experimentation loop, from research to development to analysis, so a small team could operate at a scale that would have been impossible even two years ago.
Kameron joined Fyxer when the company had $1M in ARR. He brought a discipline he’d honed across B2C healthtech, B2B SaaS, and now prosumer AI: measure everything, share everything, and learn as fast as possible. One of his first moves was creating a public Slack channel where every experiment result, win, and loss was visible to the entire company. The founders loved it. It became the company’s central nervous system for understanding what was working and what wasn’t.
Kameron recently joined The Experimentation Edge podcast to share the full story. The key takeaways are below. The real unlock came when the team combined that learning culture with AI-powered development. That combination is what made 541 experiments possible across the company, and it’s what turned a high volume of losses into the wins that turbo-charged Fyxer’s trajectory.
The Growth Engineering Mindset: Why Learning Speed Beats Intuition
Here’s something Kameron will tell you openly: he’s bad at his job for the first few months every time he starts somewhere new. And it’s not just him. It’s everyone in growth.
When Kameron joined Fyxer, his instincts were calibrated to B2C healthtech, his previous role. He defaulted to discount-heavy messaging, pricing-focused copy, and the kind of urgency-driven language that works for consumer subscription boxes. At a B2B SaaS company selling an AI productivity tool to professionals, none of it landed. The only way to close that gap was to get experiments in front of real users and let the data teach him what his intuition couldn’t.
This is the core of the growth engineering mindset at Fyxer: A/B testing isn’t just an optimization tool. It’s a learning tool. And when you’re new to a product, a market, or a customer base, it’s the fastest way to develop the intuition you don’t have yet.
The numbers back this up. Fyxer’s win rate in GrowthBook is 25%. That means 75% of their experiment ideas failed. If they had shipped every idea to 100% of users without testing, the cumulative damage would have been severe. A 50/50 test, even with imperfect sample sizes, beats shipping blind every time.
Kameron pushes back hard on the common startup objection that “we’re not big enough to A/B test yet.” His view: you may not be able to detect 5% lifts, but you can detect 20% or 30% effects, and at a startup, those are exactly the kinds of changes you should be testing. Pricing models, usage limits, core product flows. The risk of getting those wrong without testing is far greater than the cost of running an imperfect experiment.
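To see why large effects are within reach of small samples, here is a back-of-the-envelope sample size calculation using the standard normal approximation for comparing two conversion rates. The 5% baseline conversion rate is a hypothetical placeholder, and the formula is a common textbook approximation rather than anything specific to Fyxer or GrowthBook.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect a relative lift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    absolute_delta = baseline_rate * relative_lift
    return ceil(2 * (z_alpha + z_power) ** 2 * variance / absolute_delta ** 2)


BASELINE = 0.05  # hypothetical baseline conversion rate
for lift in (0.05, 0.20, 0.30):
    print(f"{lift:.0%} lift: ~{users_per_arm(BASELINE, lift):,} users per arm")
```

Because the required sample scales with one over the lift squared, detecting a 20% lift needs roughly one sixteenth of the traffic that a 5% lift does.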
A key element of Fyxer’s approach is how they think about iteration. Rather than waiting for a fully polished feature, they ship the core experience and then immediately run experiments to improve adoption and engagement. As Kameron puts it, almost nobody uses your new feature on day one. The real work starts after launch, when you test messaging, onboarding flows, and nudges to find what actually drives usage. This iterative approach was central to their PLG breakthroughs later in the year.
Kameron uses a simple framework to evaluate which features could drive viral growth. First, he identifies the actions users are already repeating within the product. Then he asks: every time a user sends an email, schedules a meeting, or triggers a confirmation, is there a way to use that touchpoint to introduce Fyxer to someone new? When the answer is yes, the team builds and tests a loop around it.
Not every loop works. Fyxer has a scheduling feature, similar to Calendly, and Kameron hypothesized that sending booking confirmations could drive recipients back to Fyxer to sign up. In theory, it was a clean growth loop. In practice, users pushed back immediately. Fyxer’s entire value proposition is reducing inbox noise, and here they were adding another email on top of the Google Calendar and Outlook invites that people have already received. They killed the experiment and pivoted to a different approach. That willingness to test assumptions, even ones that look great on a whiteboard, is what separates a growth-minded team from one that ships on conviction alone.
Using AI to Scale Experimentation from Weeks to Hours
The mindset gets you to the right experiments. AI is what lets a team of four run them at startup speed.
Fyxer’s experimentation stack is built around a few key tools, with Claude as the central hub. The growth team shares Claude skills across the team, so common workflows, like turning a GrowthBook experiment result into a Slack post or generating a hypothesis from a data analysis, are reusable and consistent. They’ve connected Claude to their internal systems through MCP integrations, including GrowthBook’s API, so experiment data flows directly into their AI workflows.
For development, they use Cursor across the full stack. But the real unlock has been Cursor’s desktop mode with virtual environments. Here’s why that matters: traditionally, even a simple experiment requires a developer to write the code, pull it down locally, run the app, and manually check that the new upsell panel or copy change looks right. With Cursor desktop, the tool runs the app in a virtual environment and shows Kameron a video of what the experiment will look like. He reviews it, signs off, and moves on, without ever pulling down the code himself.
This means he can run five or six experiments in parallel, as long as they’re relatively contained changes. For even simpler experiments, like backend configuration changes or one-line feature flag adjustments, they use Claude Opus, Codex, and Tembo to one-shot the implementation entirely.
The AI acceleration extends beyond development. On the data side, Fyxer uses Dot, an AI data analyst that connects to their BigQuery warehouse and lives in Slack. The data team documented their table schemas, columns, and relationships, and Dot uses that context to answer complex questions (segmentation analysis, survival curves, custom queries) from anyone on the team. Non-technical stakeholders can get answers in seconds without waiting for the data team, removing a bottleneck that plagues almost every growing company.
The experimentation lifecycle itself is increasingly automated. Cursor automations fire when PRs are opened, daily jobs check for stale experiment code that should be cleaned up, and product release docs are generated automatically. When a key metric dips unexpectedly, the data team uses the GrowthBook API combined with Claude to cross-reference recent experiment launches and diagnose whether an experiment caused the problem.
The net effect: AI compresses the entire experimentation loop. Research that took days happens in hours. Development that took a week happens in an afternoon. Analysis that required a data scientist can be done by anyone on the team through Slack. That’s how four engineers run 360 experiments in a year.
What 541 Experiments Actually Produced
Volume without results is just busywork. Here’s what Fyxer’s experimentation program actually delivered:
- Increasing free-to-paid conversion from 5% to 35% by adding a credit card gate before the free trial.
- 2.3x-ing the share of paying customers on annual plans, which now account for 50% of subscribers.
- Increasing the trial start rate for personal email users by 65% by segmenting trial lengths based on signup type.
- Creating a referral growth loop in which 33% of invites are accepted.
None of these were obvious in advance. The credit card gate, for example, contradicts conventional wisdom about reducing friction in signup flows. But Kameron noticed that many AI apps were already asking for credit cards upfront, and Fyxer’s users had high intent because they were connecting their email. They also made the paywall optional during the experiment, drawing design inspiration from Canva’s checkout flow by showing users a clear timeline: what happens today, in 5 days, and in 7 days. The result was essentially free revenue on existing traffic.
The annual plan shift followed a similar pattern. The original UI defaulted to monthly billing with a modest 8% annual discount. Kameron tested defaulting to the yearly plan, increasing the discount to 25%, and displaying the effective monthly price. It’s the kind of change that takes a few hours to implement and test, but has a massive compounding effect on retention and cash flow.
That’s the compounding advantage of high-velocity experimentation: you find the counterintuitive wins that your competitors are leaving on the table because they’re still debating whether to test.
Where Fyxer’s Growth Team Is Headed Next
Fyxer is scaling the growth engineering team from 6 to 13 this year, with a target of 1,000 experiments. But the real multiplier isn’t headcount. It’s a continued investment in AI-powered developer performance: more reusable skills, more automated workflows, and tighter integration between their experimentation platform and their AI tooling.
Their revenue target of $100M to $150M ARR would represent another 3–4x leap. If the pattern holds, that growth won’t come from a single breakthrough. It will come from the compounding effect of hundreds of experiments, most of which will fail, but the ones that win will change the trajectory of the business.
Key Takeaways
- You don’t need to be big to experiment. You need to be disciplined about testing the things that carry the most risk.
- A/B testing at a startup is primarily a learning tool. It’s how you build customer intuition fast, especially when you’re new to a market.
- AI doesn’t just make development faster. It compresses the entire experimentation loop, from hypothesis to analysis, making high-velocity testing possible with a small team.
- A 25% win rate is a feature, not a bug. It means you’re testing bold ideas and catching the failures before they ship to everyone.
- The combination of the right mindset and an AI-first approach to tooling is a genuine competitive advantage, and one that’s accessible to any team willing to invest in both.
Want to hear the full conversation? Watch Kameron’s episode on The Experimentation Edge podcast, where he goes deeper on Fyxer’s growth loops, AI tooling stack, and advice for growth engineers starting at a new company.
Fyxer runs its entire experimentation program on GrowthBook, the open-source feature flagging and A/B testing platform. If your team is looking to scale experimentation without scaling headcount, get started for free or request a demo.

How to Migrate from Statsig to GrowthBook
The industry's only open-source, warehouse-native experimentation platform gives you predictable pricing, full data ownership, and results you can verify. Here's why Statsig customers are switching to GrowthBook.
When OpenAI acquired Statsig, engineering leaders at hundreds of companies started asking the same question: what happens to our data?
It's a fair question. Statsig routes all event data through its own servers. With that infrastructure now under OpenAI's control — and Statsig's CEO gone — teams that cared about data governance found themselves re-evaluating a platform they'd built their experimentation programs on. Add event-based pricing that climbs as you scale, and the calculus tilts further.
GrowthBook is where many of them land. This post explains what GrowthBook offers, what the migration looks like in practice, and how to decide whether moving from Statsig to GrowthBook makes sense for your team.
What is the GrowthBook open-source platform?
GrowthBook is an open-source feature flag, experimentation, and product analytics platform. It is the original warehouse-native platform, trusted by more than 3,000 companies, including Dropbox, Khan Academy, Upstart, Sony, and Wikipedia. It handles over 100 billion feature flag lookups per day.
The warehouse-native architecture is the defining design choice. Rather than copying your data into its own system, GrowthBook queries your data where it already lives — Snowflake, BigQuery, Databricks, Redshift, ClickHouse, Postgres, and more. Analysis runs in your warehouse with read-only access. Every SQL query is visible. Every result is reproducible.
That's the short version. The longer version explains why it matters when you're evaluating a replacement.
4 reasons teams switch from Statsig
1. Lack of data ownership
Statsig's architecture requires sending event data to Statsig's servers for analysis. That worked when Statsig was an independent company. Under OpenAI ownership, with no published data firewall policy between Statsig customer data and OpenAI's AI training, the risk profile changed.
GrowthBook inverts the model. Your data warehouse holds the data. GrowthBook reads aggregate statistics from it, using read-only credentials. Raw PII stays in your environment, under your control, and subject to your compliance policies. For teams operating under GDPR, HIPAA, CCPA, COPPA, or data residency requirements, this distinction is operational, not philosophical.
John Resig, Chief Software Architect at Khan Academy, described exactly this concern: the ability to retain data ownership was, in his words, "very, very important," because most platforms require passing user data to a third-party service.
GrowthBook's self-hosted deployment takes it further. Deploy within your own infrastructure, behind your own firewall, with zero external data egress. Fully air-gapped deployments are supported for the most sensitive environments.
2. Expensive to scale
Statsig prices on events and traffic. That structure makes sense when you're running a handful of experiments on modest traffic. At scale, it penalizes the behavior you want to encourage: more experiments, more feature flags, more coverage.
Teams using Statsig often end up managing their experimentation volume to manage their bill — sampling down traffic, avoiding flagging minor changes, skipping experiments on low-stakes features. That's the opposite of a healthy experimentation culture.
GrowthBook uses per-seat pricing. A team that runs 10 experiments a month pays the same as one running 100. Feature flag evaluations don't generate a cost event. The experimentation ROI calculator can model your specific usage to show expected savings.
3. Limited visibility into underlying results
Statsig's statistics engine is proprietary. You can see the outputs, but not the logic that produced them. When a result is surprising, your options for investigation are limited to the interfaces Statsig exposes.
GrowthBook's engine is fully open source on GitHub (7,000+ stars). Every calculation is inspectable. Every query is visible in your warehouse. If a result looks off, you can drill into the underlying SQL, check the raw data, and confirm or refute the calculation on your own terms.
Diego Accame, Director of Engineering at Upstart, put it this way: "Our strength is as an AI-powered lending marketplace, not an experimentation framework company. GrowthBook lets us focus our resources where they matter most — on growing our core business."
That confidence comes partly from owning the infrastructure and partly from being able to verify what the infrastructure is doing.
4. Limited statistical depth
GrowthBook supports Bayesian, frequentist, and sequential testing, with CUPED variance reduction and post-stratification. Statsig supports a similar range, but without post-stratification or the ability to inspect and reproduce the calculations.
For data science teams that care about methodology — particularly at companies where an experiment result drives a significant product or business decision — the ability to validate the math is a meaningful advantage.
What the Statsig migration kit covers
GrowthBook ships a migration kit specifically for Statsig customers, including an AI-powered assistant that can transform your existing codebase. Here's what migrates:
Projects, teams, and tags carry over cleanly, preserving your workspace organization so teams can keep working without rebuilding their context.
Feature gates from Statsig map to GrowthBook feature flags, which support multiple environments, targeting rules, gradual rollouts, and instant kill switches.
SDKs migrate automatically. The AI migration assistant points at your codebase and handles the transformation — feature gates, dynamic configs, and user attributes converted to GrowthBook equivalents. JavaScript, TypeScript, and React are supported today, with more coming.
This is the step that usually takes weeks; the assistant reduces it to minutes.
Experiments transfer, including past experiments run on Statsig. You can generate custom reports from past Statsig experiments in GrowthBook, which preserves institutional knowledge.
Targeting rules transfer with full visibility into conditions and rollouts. GrowthBook includes debugging tools that simulate flag values for specific audiences, making it straightforward to verify that migration behavior matches pre-migration behavior.
Safe rollouts remain a first-class concept. GrowthBook supports gradual exposure with automatic monitoring of guardrail metrics, so regressions trigger alerts before they reach your full user base.
The SDKs themselves don't require replacement during migration. If you're moving from Statsig cloud to GrowthBook cloud, or from Statsig to self-hosted GrowthBook, your feature flag configuration and experiment setup carry over without requiring SDK changes or redeployment of your application code.
The full GrowthBook platform you're migrating to
Migration is the starting line, not the finish line. Here's what GrowthBook offers beyond Statsig feature parity.
Feature flagging that doesn't cost per evaluation
GrowthBook's feature flags run through zero-network-call SDKs. The SDK downloads a payload at startup and evaluates flags locally, so each flag evaluation adds sub-millisecond latency without generating a billable event. You can flag every feature in your product — including low-traffic, experimental, and internal-use features — without worrying about cost.
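To make the idea of local evaluation concrete, here is a generic sketch of the pattern: a rules payload is held in memory and each flag check is a dictionary lookup plus a deterministic hash. This is not GrowthBook's SDK or its hashing algorithm; the payload shape, flag names, and functions are all hypothetical.

```python
import hashlib

# Hypothetical payload, standing in for what an SDK would download once at startup.
PAYLOAD = {
    "new-checkout": {"enabled": True, "rollout": 0.25},
    "dark-mode": {"enabled": False, "rollout": 1.0},
}


def bucket(user_id: str, flag_key: str) -> float:
    """Map a (flag, user) pair to a stable number in [0, 1)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x1_0000_0000


def is_on(flag_key: str, user_id: str, payload: dict = PAYLOAD) -> bool:
    """Evaluate a flag entirely in memory, with no network call per check."""
    rule = payload.get(flag_key)
    if rule is None or not rule["enabled"]:
        return False
    return bucket(user_id, flag_key) < rule["rollout"]


print(is_on("new-checkout", "user-123"))
```

Because the hash is deterministic, the same user always gets the same answer for a given flag, and no per-evaluation request is needed to keep that consistent.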
GrowthBook supports 24+ SDKs: JavaScript, React, React Native, Node.js, Python, Ruby, Go, PHP, Java, Kotlin, Swift, and more. The Chrome debugger lets you inspect flag state and experiment assignment in real time without touching application code.
Experimentation with SQL you write and own
GrowthBook's metric system is SQL-first. You write metrics using your warehouse's SQL dialect, join against any tables in your schema, and apply whatever business logic your team uses. A metric for revenue per activated user might join your experiment assignment table to your payments table to your activation events — all using the same logic your data team uses everywhere else.
Forgot to add a metric before an experiment started? Add it retroactively. The data is already in your warehouse. Just define the metric and run the analysis against the historical assignment data.
Metrics can be standardized in a library, enabling every team to measure success consistently. They can be scoped to specific experiments or applied globally as guardrails.
Deployment on your terms
GrowthBook Cloud runs on AWS with automatic updates, encrypted data at rest and in transit, a 99.99% uptime SLA at the Enterprise tier, and SOC 2 Type II, ISO 27001, GDPR, COPPA, and CCPA compliance.
GrowthBook Self-Hosted runs on your infrastructure: choose any major cloud provider or on-premises, and deploy with Kubernetes or any container platform. Same codebase. Same features. Same development roadmap. The only difference is who manages the infrastructure.
Many teams start on GrowthBook Cloud for the fastest path to running experiments, then migrate to self-hosted when compliance requirements or internal policy require it. GrowthBook's SDK and configuration structure don't change in that migration, so the transition preserves everything you've built.
If you don't have a data warehouse yet, GrowthBook's Managed Warehouse gives you a fully functional environment immediately, with the option to migrate to your own warehouse at any time.
AI-ready experimentation
Three of the five leading AI infrastructure companies use GrowthBook to test and optimize their products. The platform handles the non-deterministic, high-variance nature of AI feature testing well.
- Sequential testing reduces false positives
- CUPED variance reduction accelerates decision-making
- Fully custom SQL metrics capture what matters for AI outputs (task completion, output acceptance, engagement depth) rather than just clicks
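For readers unfamiliar with CUPED, the core idea fits in a few lines: use a pre-experiment covariate to subtract predictable noise from the in-experiment metric. The sketch below is the textbook adjustment on simulated data, not GrowthBook's implementation. The function name and the simulated numbers are assumptions for illustration.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """Textbook CUPED adjustment: y_i - theta * (x_i - mean(x)).

    y: in-experiment metric per user; x: the same user's pre-experiment covariate.
    In practice theta is usually estimated on data pooled across variants.
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

# Simulated users whose in-experiment spend correlates with pre-experiment spend.
random.seed(0)
pre = [random.gauss(100, 20) for _ in range(2000)]
post = [0.8 * p + random.gauss(0, 10) for p in pre]

adjusted = cuped_adjust(post, pre)
print(f"mean before/after: {mean(post):.2f} / {mean(adjusted):.2f}")
print(f"variance before/after: {variance(post):.1f} / {variance(adjusted):.1f}")
```

The adjustment leaves the mean unchanged while shrinking the variance, which is why the same experiment can reach a decision with less traffic.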
GrowthBook's MCP server connects to Cursor, VS Code, Claude Code, and any other MCP-compatible IDE. Create feature flags and experiments in natural language, query past results, and build agents with your experimentation data as context — all without leaving your editor.
Getting started for free
GrowthBook's offer for current Statsig customers: use GrowthBook for free through your current renewal date, up to one year (up to $100,000 value). The migration kit, including the AI-powered SDK migration assistant, is available immediately.
The practical starting path for most teams:
- Connect GrowthBook to your data warehouse. Pre-built SQL templates get you to first results without custom data engineering. Customize from there.
- Run the AI migration assistant against your codebase. It transforms Statsig feature gates to GrowthBook equivalents and generates a diff for your team to review.
- Import your Statsig experiments. Historical results carry over so you don't lose the record of what you've learned.
- Start your first GrowthBook experiment. The Chrome debugger and visual editor make the first experiment accessible to non-engineers.
The decision to switch to GrowthBook
If your team depends on Statsig and the OpenAI acquisition raises questions you can't yet get answered (about data governance, roadmap continuity, or long-term pricing), then GrowthBook is the switch that costs the least to evaluate and offers the most structural independence.
Open source lets you inspect and audit what you're running. Warehouse-native architecture gives you data ownership that doesn't depend on a vendor relationship. Per-seat pricing gives you the freedom to run more experiments without watching a meter.
The migration kit makes the practical barriers manageable. The question is whether the reasons to switch outweigh the friction of switching. For most Statsig customers evaluating the post-acquisition landscape, that math is becoming clearer.

Your Experiment Lift Is An Average — Which Users Actually Benefited?
The case for looking beyond the Average Treatment Effect
One number, many stories
You moved the recommendations carousel higher on the product page. After two weeks, the experiment comes back: +1.6% on conversion rate. Stakeholders are happy. You ship. You celebrate. You move on.
That workflow is fine. The Average Treatment Effect (ATE) is the right first thing to look at. It's what experiments in GrowthBook are typically designed to estimate. If you're going to act on a single number, that's probably the one. But if you stop there, you are leaving money on the table.
What is the Average Treatment Effect (ATE)?
The ATE is the difference in average outcome between users in the treatment group and users in the control group. It's the standard summary statistic from a randomized experiment — and the right first number to look at. But it summarises across all individual responses in your experimental sample.
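As a rough sketch of that definition for a binary conversion metric, the snippet below computes the difference in conversion rates with a simple normal-approximation interval. The counts are made up, and reading the +1.6% as a relative lift on a 10% baseline is an assumption. This is a generic frequentist calculation, not a description of GrowthBook's statistics engine.

```python
from math import sqrt

def average_treatment_effect(conv_t, n_t, conv_c, n_c, z=1.96):
    """Difference in conversion rates with a normal-approximation 95% interval."""
    p_t = conv_t / n_t  # treatment conversion rate
    p_c = conv_c / n_c  # control conversion rate
    ate = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return {
        "ate": ate,                   # absolute difference in rates
        "relative_lift": ate / p_c,   # the same effect relative to control
        "ci": (ate - z * se, ate + z * se),
    }

# Hypothetical counts: 10.00% control vs 10.16% treatment conversion.
result = average_treatment_effect(conv_t=50_800, n_t=500_000, conv_c=50_000, n_c=500_000)
print(result)
```

Everything in that output is a property of the two groups as a whole. Nothing in it describes any individual user.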
That 1.6% is an average. Your user base includes many different types of people: varying usage patterns, needs, and baseline behaviors. You already know this. You probably already segment users for marketing and personalization, or you wish you did. Yet when the experiment result comes back, all of that diversity collapses into a single number.
The question is: what is that single number hiding? In this post, we look at what the average treatment effect actually represents, why the same average can mask very different realities, and what that means for how you act on your results.
What the Average Treatment Effect (ATE) really means
The average treatment effect is exactly that — an average. Behind it sits a distribution of individual responses: some users who gained a lot, some who barely noticed, some who were actively put off by the change. It is a summary across your entire experiment sample, not necessarily the effect on any particular user. If your metric is binary — the user converts, or they don't — nobody converted 1.6% more times. Some users were pushed over the edge and converted when they otherwise wouldn't have. Others were unaffected. Some may have been put off by the change and thus did not convert. What you observe as +1.6% is the net result after all of these individual responses are averaged together.
You cannot observe any individual user's treatment effect — that's the fundamental problem of causal inference. You only ever see what actually happened to a user, never what would have happened without the treatment. But the underlying distribution of those individual effects is real. The average effect tells you where the center is, but it tells you nothing about the rest.
Why the same experiment result can hide very different realities
To see why this matters, consider three scenarios — all with the same average effect of +1.6%. The distributions below are conceptual illustrations of what might be hiding behind that average. But the question of which one you're in is very real.
Scenario (a): Nearly everyone benefits a little

Think of a pure copy change: rewording a headline on a product page. There's no structural change, no new functionality. The tweak lands roughly the same way for everyone (not all copy changes do, of course, but this one did). A small, diffuse lift. This is what most people implicitly picture when they hear "1.6% lift." It's also the easy case. When the effect is similar for most users, the average tells the whole story, and you can act on it with confidence.
Scenario (b): One subgroup drives the entire effect

Let's take the carousel for another spin. You moved it higher on the product page. Two types of users now have very different experiences. Browsers, the ones who enjoy discovering new products, engage with the carousel and convert more. Searchers, users who came for a specific item, now have to scroll past content they never asked for. Slightly annoying. The browsers see a meaningful positive effect. The searchers see a zero or slightly negative one. Most users barely notice. The +1.6% average is real, but it's driven by a single user type, and you are shipping the change to everyone.
Scenario (c): Winners and losers

You raised the free shipping threshold from $25 to $50. Users who were comfortable buying one or two small items with free shipping now face a delivery charge that feels offensive. Some abandon their carts. Some find the same items at a competitor. Meanwhile, users with larger baskets add a few extra items to clear the new threshold, pushing average order value up. The overall effect is positive, but a sizeable share of users are notably worse off.
The bottom line: from the average effect alone, you cannot tell which of these scenarios you are in. The decision to ship looks the same in all three cases. The implications are very different.¹
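A toy calculation shows how three very different worlds collapse into the same number. The individual effects below are hypothetical and, in a real experiment, unobservable. The group sizes and effect sizes are assumptions, chosen only so that each scenario averages to +1.6.

```python
from statistics import mean

N = 1000  # hypothetical users; effects are in percentage points

scenarios = {
    "a_everyone_a_little": [1.6] * N,
    "b_one_subgroup": [8.0] * 200 + [0.0] * 800,         # 20% browsers carry the lift
    "c_winners_and_losers": [6.0] * 600 + [-5.0] * 400,  # 40% are worse off
}

summary = {
    name: {
        "ate": round(mean(effects), 6),
        "share_helped": sum(e > 0 for e in effects) / N,
        "share_harmed": sum(e < 0 for e in effects) / N,
    }
    for name, effects in scenarios.items()
}

for name, stats in summary.items():
    print(name, stats)
```

The `ate` column is identical in all three rows. The shares of users helped and harmed are not, and those are the numbers the experiment readout never shows you.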
Why experiment results vary across user segments — and why it matters
This isn't an academic exercise. The scenario you are in changes what you should do next.
Is the signal real? An effect driven entirely by one segment is either a discovery or a warning sign. If the subgroup is large and the effect is real, you may have found something worth doubling down on. If the group is small and their effect is noisy, your positive result may not replicate.
Why, not just what? Understanding who benefits also generates hypotheses about why the treatment works. That's how you build on experiment results rather than just collecting them. A team that knows the carousel helped browsers but irritated searchers can design a better version: show it on category pages, suppress it on search results. A team that only sees +1.6% moves on to the next test.
Will it last? Your experiment ran at a specific point in time, on the users who happened to be active during that window. If the effect is similar for all your users, it is more likely to hold as your user base evolves. If it's concentrated in one segment, the result is only as durable as that segment's share of your traffic. If the lift came from a seasonal cohort of holiday shoppers, it may not survive into Q1. What will your user base look like next quarter, or next year? An effect that's similar across users and one that's concentrated in one particular segment ages very differently.
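The durability argument is simple arithmetic: the overall effect is a traffic-weighted average of segment effects, so it moves whenever the mix moves. Here is a small sketch with made-up segment names, shares, and effects (in percentage points), assuming the per-segment effects themselves stay fixed.

```python
def blended_effect(segment_shares, segment_effects):
    """Overall effect as the share-weighted average of per-segment effects."""
    if abs(sum(segment_shares.values()) - 1.0) > 1e-9:
        raise ValueError("segment shares must sum to 1")
    return sum(share * segment_effects[name] for name, share in segment_shares.items())

# Hypothetical: the whole lift comes from holiday shoppers.
concentrated = {"holiday_shoppers": 8.0, "everyone_else": 0.0}
# Hypothetical: the same lift spread evenly across users.
uniform = {"holiday_shoppers": 1.6, "everyone_else": 1.6}

during_test = {"holiday_shoppers": 0.20, "everyone_else": 0.80}
next_quarter = {"holiday_shoppers": 0.05, "everyone_else": 0.95}

for label, effects in [("concentrated", concentrated), ("uniform", uniform)]:
    now = blended_effect(during_test, effects)
    later = blended_effect(next_quarter, effects)
    print(f"{label}: {now:.2f} during the test, {later:.2f} next quarter")
```

With these numbers the concentrated effect falls from 1.6 to 0.4 as the seasonal cohort shrinks, while the uniform effect does not move.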
From average results to individual insights in experimentation
You don't have to use fancy machine-learning frameworks to start asking these questions. The simplest version is to look at your experiment results across dimensions you already have: geography, platform, user tenure, and purchase frequency. In GrowthBook, that's what dimension splits are for. From a different angle, quantile treatment effects let you compare different percentiles of the outcome distribution across variants — for example, did the free shipping change hurt users at the low end of spend while benefiting those at the top? And with Experiment Dashboards, you can make these breakdowns a default part of every experiment readout, so looking beyond the average becomes standard procedure.
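If you want to see what these two views compute before reaching for the product features, here is a hand-rolled sketch. It is not how GrowthBook implements dimension splits or quantile treatment effects. The function names and the toy data are assumptions, and real per-segment estimates need uncertainty intervals before you act on them.

```python
from collections import defaultdict
from statistics import quantiles

def lift_by_dimension(rows):
    """rows: (dimension_value, variant, converted) tuples, variant in {control, treatment}.

    Returns the treatment-minus-control conversion rate per dimension value.
    """
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for dim, variant, converted in rows:
        counts[dim][variant][0] += converted  # conversions
        counts[dim][variant][1] += 1          # users
    return {
        dim: v["treatment"][0] / v["treatment"][1] - v["control"][0] / v["control"][1]
        for dim, v in counts.items()
    }

def quantile_treatment_effects(control, treatment, percentiles=(10, 50, 90)):
    """Difference between variants at selected percentiles of the outcome."""
    c_cuts = quantiles(control, n=100, method="inclusive")
    t_cuts = quantiles(treatment, n=100, method="inclusive")
    return {p: t_cuts[p - 1] - c_cuts[p - 1] for p in percentiles}

# Toy carousel data: browsers convert more under treatment, searchers do not.
rows = (
    [("browser", "control", 1)] * 10 + [("browser", "control", 0)] * 90
    + [("browser", "treatment", 1)] * 18 + [("browser", "treatment", 0)] * 82
    + [("searcher", "control", 1)] * 10 + [("searcher", "control", 0)] * 90
    + [("searcher", "treatment", 1)] * 10 + [("searcher", "treatment", 0)] * 90
)
lifts = lift_by_dimension(rows)

# Toy shipping data: small baskets shrink, larger baskets grow.
control_spend = list(range(10, 110))
treatment_spend = [s - 5 if s < 40 else s + 8 for s in control_spend]
qte = quantile_treatment_effects(control_spend, treatment_spend)

print(lifts)
print(qte)
```

In the toy shipping data, the effect is negative at the 10th percentile of spend and positive at the 90th, which is the winners-and-losers pattern the average alone would hide.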
For teams willing to go further, more advanced methods can produce effect estimates at the individual level — the closest you'll get to making those conceptual distributions real.
And once you know the effect varies by segment, you don't have to ship the same experience to everyone. Most experimentation platforms, GrowthBook included, let you target features to specific user segments. The experiment told you who benefits. Targeting lets you act on it.
Slicing data post-hoc does come with real statistical risks, but there are well-established ways to handle them. The next post in this series covers how to navigate these waters: how to slice your experiment data, what to watch out for, and how to tell a real finding from a lucky split.
In the meantime, explore dimension splits and think about what dimensions might be interesting in your experiments. And next time you're building a feature, think about how your different users might respond to it, and who might not like it at all.
As always, beware and have fun!
This is part of a series on treatment effect heterogeneity. The next post is about how to uncover the different ways users respond to the same treatment, without fooling yourself.
¹ This idea is developed more formally by Gelman, Hullman, and Kennedy (2024) as "causal quartets" — different data-generating processes that produce identical average effects. The American Statistician, 78(3), 267–272.


