The Uplift Blog

What I Learned from Khan Academy About A/B Testing AI
Experiments
AI

Apr 21, 2026

Every team building on top of LLMs faces the same fundamental question: how do you know if your AI feature is actually good? For some products, the answer is straightforward. For others, it requires inventing an entirely new way to measure quality. Khan Academy's journey to A/B testing their AI tutor, Khanmigo, is one of the best examples I've seen of a team solving this hard measurement problem and then using experimentation to dramatically accelerate how fast they improve their product.

Dr. Kelli Hill, Head of Data at Khan Academy, recently joined us for a GrowthBook webinar to walk through their three-year journey from vibes-based prompt testing to rigorous A/B testing of GenAI features in production. Here's what stood out.

Sometimes Measuring AI Impact Is Easy

Sometimes the impact of AI on a product is straightforward to measure. When Typeform introduced an AI-powered form builder, their Chief Product and Technology Officer Alex Bass told us on The Experimentation Edge that it doubled their activation rate, the percentage of users who go from signing up all the way through to publishing a form and collecting data. Out of roughly 50 experiments Typeform ran, nothing else came close to that kind of impact.

In cases like Typeform's, the metric is unambiguous: a user either publishes a form or they don't. The signal arrives quickly, and you can measure it with the metrics you were already tracking.

What Happens When the Output Is Harder to Evaluate

Khan Academy faced a fundamentally different challenge. Khanmigo is a generative AI-powered tutor that helps students work through math and other subjects. It's not a chatbot for entertainment. It's an educational tool used by students in classrooms. The bar is high: Khanmigo needs to be accurate, it needs to actually teach (not just give answers), and its tutoring quality needs to be measurable at scale.

That last part is the hard part. The same prompt can produce a dozen different responses. The underlying model changes regularly. A response that looks polished might actually reflect poor tutoring practice. And with nearly 200 million registered users and roughly a million daily active users on Khan Academy, they needed measurement that could operate at massive scale.

When Khanmigo first launched, the team had no way to rigorously evaluate quality. They started where everyone starts: reading outputs and making gut judgments. Kelli described their earliest eval work as "vibes-based prompt engineering" in Slack threads. It was useful for building intuition, but it didn't scale, it wasn't repeatable, and it couldn't tell them whether a change actually improved anything.

Turning Something Hard to Measure Into a Real Metric

The breakthrough was deciding to measure cognitive engagement, a construct from learning science research. Khan Academy adapted the ICAP framework (Interactive, Constructive, Active, Passive) published by Chi and Wylie in 2014. The original framework was designed for classrooms, so the team adapted it for AI tutoring interactions, focusing on questions like: Who has the agency in help requests? How is the student processing Khanmigo's feedback? Who is driving ownership of the learning?

Khan Academy adapted the ICAP framework, originally designed for classrooms, to score cognitive engagement in Khanmigo conversations.

The key insight was that cognitive engagement isn't just an abstract academic concept. Khan Academy's prior efficacy research had already demonstrated that students who are more cognitively engaged on the platform get more skills to proficient, and that increased proficiency on Khan Academy transfers to higher scores on third-party assessments. So if they could measure cognitive engagement in Khanmigo conversations, they'd have a metric that actually predicted real learning outcomes.

Building the metric was the hardest part. Kelli was emphatic about this. The team defined a rubric, brought in subject matter experts, and had those experts hand-label student chat transcripts. They iterated on the rubric until they achieved 85% inter-rater agreement on a test dataset. Then they used the agreed-upon labels to create a ground truth dataset.

With ground truth in hand, they built an LLM-as-judge: an AI system that could automatically label transcripts using the same rubric. They fed the judge examples from the ground truth data, iterated on the prompt until the LLM judge's labels matched the human experts with high accuracy, and then scaled it. Today, they process about 20% of Khanmigo's chat data every night through this pipeline, feeding results into dashboards that the team monitors continuously.
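The validation step, checking the judge against the expert-labeled ground truth before trusting it at scale, can be sketched roughly like this. The labels and data below are illustrative, not Khan Academy's actual rubric or pipeline:

```python
# Hypothetical sketch of validating an LLM-as-judge: compare the judge's
# labels against expert-labeled ground truth before running it at scale.

def agreement_rate(human_labels, judge_labels):
    """Fraction of transcripts where the judge matches the human label."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Ground-truth labels from subject-matter experts (illustrative data),
# using ICAP-style engagement levels.
human = ["interactive", "constructive", "passive", "active", "constructive"]
# Labels produced by the LLM judge on the same transcripts.
judge = ["interactive", "constructive", "active", "active", "constructive"]

score = agreement_rate(human, judge)
print(f"judge/human agreement: {score:.0%}")  # 4 of 5 match -> prints 80%
# Only once agreement clears the bar (Khan Academy targeted 85% among
# human raters) would the judge be run nightly over sampled chat data.
```

The same loop applies each time the rubric or judge prompt changes: re-validate against ground truth, then redeploy.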

Why This Unlocked A/B Testing for GenAI

Once Khan Academy had a reliable metric, they could finally do what they couldn't before: run controlled experiments on Khanmigo and measure whether changes actually improved tutoring quality.

Khan Academy uses GrowthBook for both feature flags and experimentation, self-hosted on top of their existing BigQuery data warehouse. They built additional infrastructure to randomize not just at the user level, but at the individual chat thread level, so each new Khanmigo conversation could be independently assigned to a treatment. This was critical because the unit of analysis for tutoring quality is a conversation, not a user.
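Thread-level randomization can be sketched as a deterministic hash of the conversation ID. The scheme below illustrates the idea only; it is not GrowthBook's actual assignment algorithm:

```python
# Minimal sketch of thread-level experiment assignment, assuming a
# deterministic hash of the conversation (thread) ID rather than the
# user ID. Hashing scheme is illustrative, not GrowthBook's.
import hashlib

def assign_variant(thread_id: str, experiment_key: str, n_variants: int = 2) -> int:
    """Deterministically map a chat thread to a variant bucket."""
    digest = hashlib.sha256(f"{experiment_key}:{thread_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

# The same thread always lands in the same bucket, so a conversation
# never switches treatments mid-stream...
assert assign_variant("thread-123", "tutor-prompt-v2") == assign_variant("thread-123", "tutor-prompt-v2")
# ...while different threads from the same user are assigned independently,
# which is what makes the conversation the unit of analysis.
```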

The experiments they run aren't typical feature tests. They're testing different versions of a prompt, changes to system instructions, and even head-to-head model comparisons (Gemini vs. OpenAI models, for example). Kelli described it as "hill climbing": making very small, deliberate changes, sometimes just a single sentence in a prompt, and measuring whether cognitive engagement moves.

Their primary metrics are cognitive engagement and performance (are students getting more skills to proficient?). Their secondary and guardrail metrics include non-desirable behaviors (like giving the answer away), thread length, verbosity, and response latency. This layered approach ensures they're not accidentally improving one dimension while degrading another.

From Speed Bump to Safety Net

One of the most striking things Kelli shared was how the culture around experimentation shifted at Khan Academy. Before they had this infrastructure in place, experimentation was sometimes perceived as a speed bump, an extra hurdle before shipping. That's a common tension in product organizations.

But with GenAI, the calculus changed. LLM outputs are non-deterministic. A small prompt change can shift output dramatically. A response that looks better to a human reviewer might not reflect better tutoring. The AI tutor quality team at Khan Academy became the heaviest users of GrowthBook specifically because they realized that without A/B testing, they were relying on intuition in a domain where intuition consistently fails.

Kelli put it directly: experimentation went from being perceived as something that slows down shipping to being "a safety net" for understanding how changes actually perform across millions of users and prompts. The team now sees it as essential infrastructure, not overhead.

What This Means for Teams Building on LLMs

Khan Academy's journey illustrates a pattern that applies broadly. If you're building AI features, your path to effective experimentation runs through measurement. Sometimes you'll have a Typeform situation where existing metrics already capture the impact. But often, especially when the AI's output is complex or subjective, you'll need to invest in building new evaluation frameworks first.

The process Khan Academy followed is replicable: define a rubric grounded in domain expertise, get humans to agree on labels, build a ground truth dataset, train an LLM-as-judge, validate it, and scale it. It's not fast. Kelli described a three-year evolution from vibes testing to production A/B testing. But once you have that metric in place, the standard toolkit of A/B testing becomes incredibly powerful for improving AI features.

If you want to hear the full story, you can watch the webinar recording or read the Khan Academy research paper. And if you're looking for an experimentation platform that can handle GenAI testing at scale, give GrowthBook a try.

Designing A/B Testing Experiments for Long-Term Growth
Experiments

Apr 6, 2026

Ronny Kohavi — Stanford PhD, Ex-VP and Technical Fellow at Airbnb, formerly Microsoft and Amazon — is one of the top cited researchers in Computer Science and a leading voice in experimentation. He recently joined Luke Sonnet, Head of Experimentation at GrowthBook, for a webinar sharing best practices, mistakes to avoid, and surprising insights into how often experiments actually succeed. Watch Designing Experiments for Long-Term Growth on demand.

This article covers the key principles Ronny and Luke shared for designing experiments that drive long-term growth — from understanding the importance of experimentation, why you shouldn’t ship on flat results, the key metrics you should track, and how to create a shipping criteria framework. Whether you're just getting started with experimentation or looking to sharpen how your team makes decisions, these are the foundational concepts that separate programs that deliver real impact.

In science, randomized controlled experiments are the gold standard, sitting at the top of the hierarchy of evidence. A/B tests are the online equivalent and the most reliable tools teams have for determining whether a change actually has an effect — whether that's a new feature, a UI change, a pricing change, or a backend optimization.

The problem is that most teams haven't done the harder work first: agreeing on what success actually looks like before the data comes in. Without that foundation, even a well-run experiment produces a result nobody knows how to act on.

Experimentation is How You Stop Guessing: Embrace the High Failure Rate

Humans are systematically bad at predicting what will work and at assessing the value of ideas. You cannot reliably judge which ideas are valuable before testing them, and you will be wrong far more often than most teams expect. An effective experimentation program is critical for focusing effort on what actually works.

Here is some surprising success rate data from across the industry:

Company | Experiment Success Rate | False Positive Risk* | Reference
Microsoft | 33% | 5.9% | Kohavi, Crook and Longbotham 2009
Avinash Kaushik | 20% | 11.1% | Kaushik 2006
Bing | 15% | 15% | Kohavi, Deng and Longbotham, et al. 2014
Booking.com | 10% | 22% | Manzi 2012, Thomke 2020, Moran 2007
Google Ads | 10% | 22% | Manzi 2012, Thomke 2020, Moran 2007
Netflix | 10% | 22% | Manzi 2012, Thomke 2020, Moran 2007
Airbnb Search | 8% | 26.4% | Kohavi

*False Positive Risk: the probability that a statistically significant result is a false positive.

Microsoft's 33% success rate stands out, but it came at a cost: significant upfront work went into scoping and refining ideas before they ever entered an experiment, which is what pushed the number that high.

The median organization sees roughly 10% of experiments move the metrics they were designed to improve. Given this success rate, we can compute the False Positive Risk (FPR) — the probability that a statistically significant result is actually a false positive. At a 10% success rate with standard thresholds (𝛼=0.05, 80% power), that risk is around 22%, meaning roughly 1 in 5 'successful' experiments are actually false positives. Most teams assume p < 0.05 means they will rarely make mistakes, but the math shows otherwise.
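The arithmetic behind these figures can be reproduced in a few lines. The numbers in the table match if only the positive tail of a two-sided test (α/2) counts as a "win", which appears to be the convention Ronny uses; that interpretation is my assumption:

```python
# Sketch of the False Positive Risk (FPR) calculation. Assumes a
# two-sided test where only the positive tail (alpha / 2) counts as a
# "successful" experiment -- the convention that reproduces the table.

def false_positive_risk(success_rate, alpha=0.05, power=0.80):
    """P(effect is actually null | statistically significant positive result)."""
    false_pos = (alpha / 2) * (1 - success_rate)  # nulls that luck into the winning tail
    true_pos = power * success_rate               # real effects we actually detect
    return false_pos / (false_pos + true_pos)

for rate in (0.33, 0.15, 0.10, 0.08):
    print(f"success rate {rate:.0%} -> FPR {false_positive_risk(rate):.1%}")
# prints 6.0%, 15.0%, 22.0%, 26.4% -- closely matching the table above
```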

The most impactful teams are the ones with the infrastructure to test fast and realign priorities based on evidence. A $120M improvement at Bing sat in the backlog for months because nobody thought it was worth testing. At Airbnb, the biggest win was a one-line code change. Neither of these could have been predicted. Both required running the experiment.

The Importance of Building, and Aligning on, Key A/B Testing Metrics

An experimentation program is only as good as the metrics it optimizes for. These metrics include:

  • Success or goal metrics: Define why an organization or product exists and what success looks like (stock price, revenue, market share, etc.). These are the real objectives, but they are not easy to move or measure in the short term.
  • Driver metrics: Short-term metrics that are the signals believed to predict movement in success metrics. These are what you actually measure to signal success.
  • Overall Evaluation Criterion (OEC): The weighted combination the organization agrees to optimize for, typically composed of a few success and driver metrics. Defining a good OEC is one of the hardest and most important things an experimentation program does.
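As a rough illustration, an OEC can be computed as a weighted combination of standardized metric deltas. The metric names and weights below are hypothetical, not from the webinar:

```python
# Illustrative sketch of an OEC as a weighted combination of driver
# metrics. Names and weights are hypothetical examples.

def oec(deltas: dict, weights: dict) -> float:
    """Weighted sum of standardized metric deltas (treatment - control)."""
    return sum(weights[name] * delta for name, delta in deltas.items())

# Deltas expressed in standard-deviation units so they are comparable
# across metrics with different scales.
deltas = {"sessions_per_user": 0.8, "revenue_per_user": 0.2, "time_to_success": -0.1}
weights = {"sessions_per_user": 0.5, "revenue_per_user": 0.3, "time_to_success": -0.2}
# time_to_success gets a negative weight: lower is better.

print(round(oec(deltas, weights), 2))  # 0.5*0.8 + 0.3*0.2 + (-0.2)*(-0.1) -> 0.48
```

The hard part, as the webinar stresses, is not the arithmetic but getting the organization to agree on the weights in advance.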

What Goes Wrong Without a Good OEC

Real-world scenarios from search engines (Bing, Google) and booking sites (Airbnb, VRBO) illustrate how badly things can go wrong with poor OECs, despite well-meaning intentions.

The Search Engine Example

At Bing, naively using queries per user as the OEC would have led to very poor decisions. The example Ronny gave was a ranking bug that returned terrible search results: it increased queries per user by 10%, because users reformulated their queries several times, and increased ad revenue by 30%. The short-term metrics looked great, but the product was broken.

Better metrics here would minimize queries per session (users should find answers quickly) and maximize sessions per user (repeat usage indicates high value). Bing actually tracks a suite of metrics, including sessions per user, queries per user, time to success, revenue per user, and more. We'll cover a framework for identifying and aligning on good OECs later in the article.

A Booking Site Example

Similarly, a booking platform such as Airbnb that ignores satisfaction signals like user rating and instead optimizes purely for conversion rate is optimizing for the wrong thing. If users book listings they end up hating, they don't return.

A better OEC would also include a measure of satisfaction, such as the user's star rating, so you can build machine learning models that predict whether this user will book a listing they love and rate five stars. Deciding on the trade-off between multiple metrics, such as revenue and user satisfaction, is a key business decision.

The Flat Result Trap: The Most Expensive Mistake in Product

Getting your OECs right is important, but only if you're willing to act on what the data actually tells you. A flat result means an experiment didn't produce a statistically significant improvement in the OEC. Shipping flat means deploying the feature anyway. Ronny argues this is a decision error in nearly every case.

One example from Bing was a major effort, roughly 100 engineers, to introduce a third pane to the search window. The experiments failed to show value, but it shipped to all users anyway because it was deemed a strategic business move. A year later, after countless additional experiments still failed to show value, the third pane was rolled back at significant cost to Bing.

Had Bing acted on what the data told them to begin with, they could have failed much faster, avoided months of sunk cost, and redirected their engineering resources toward something that actually moved the needle.

Debunked: Common Justifications for Shipping Flat

Ronny shared the four primary reasons he has seen teams use to rationalize the decision to ship flat and dives into the real implications of each.

Justification #1: It’s flat, we’re not hurting the users or business

A flat result doesn't mean no effect exists. All it tells you is “we didn’t find enough evidence of an effect.” The experiment could simply be underpowered. "Not statistically significantly worse" is not the same as safe to ship. The true effect could still be negative.

Justification #2: Team morale depends on shipping

Shipping a flat feature to protect morale means celebrating shipping rather than actually moving goal metrics, which can also complicate the codebase and require maintenance costs. The culture should be results-oriented and simply recognize that many ideas fail. Hold a learning review, share what was discovered, and move on. Failures that generate learning are worth celebrating.

Justification #3: It’s an enabler for future work

You can cut through this justification with one question: if we ship this and deprecate the old version, would we ever roll it back? At Bing, the answer was yes. Every flat enabler that ships becomes code that must be maintained and a foundation you'll keep building on even when the follow-on value never arrives.

Justification #4: It’s strategic

Strategic conviction is not a substitute for evidence, and as the data shows, even small changes are hard to predict correctly. Set a vision, but move toward it in small, testable steps. Test a meaningful component first, get data, then adjust.

A Framework for Making Better Experimentation Decisions

With the importance of good metrics, and what can go wrong without them, clearly laid out, the conversation shifted to a practical approach: building a decision framework that connects short-term measurements to long-term goals without overcomplicating the process.

Bridging Short-Term Experimentation Metrics to Long-Term Goals

The messy reality is that most measurable short-term metrics don’t align 1:1 with business goals, so you have to build a framework that bridges the two.

Start by identifying your long-term goals and what you can actually measure. From there, identify the short-term metrics that are the strongest indicators of those long-term goals. These are the signals that move in the right direction when the product or business is genuinely improving.

Once you've identified the right metrics, put guardrails in place. Guardrails are secondary metrics you monitor to ensure that improving your primary metric isn't coming at the expense of something else that matters, such as revenue, retention, or user satisfaction. They don't have to move, but they can't go backwards.

A word of caution: overcomplicating things and tracking too many metrics can make it difficult to act. Before running an experiment, think critically about what you would do if your metrics told conflicting stories afterward. This exercise forces clarity around prioritization and how you make business decisions around tradeoffs. The goal is to identify the key signals you can build a decision framework around so you know exactly how you'll act on them.

A Real-World Shipping Example: LLM Chatbots

An example that highlights this concept is an AI chatbot company. They can't measure customer lifetime value in a two-week experiment. Instead, they’ll need to look at the short-term metrics that signal value, such as distinct sessions per user, topic breadth, short-term subscription conversion, and how often responses are copied externally. Build a framework connecting these to the long-term goal, validate against historical data, and you have an OEC you can actually experiment on.

But throwing all of these metrics into your results dashboard can complicate the picture. If some results are flat or vaguely negative, while others are statsig negative, and others are statsig positive, then how do you make a shipping decision?

Most metrics in a typical experiment cluster near zero. A single strong signal stands out, illustrating why clear shipping criteria matter before results come in.

This is exactly where clearly defined shipping criteria earn their value.

Shipping Criteria: Enabling Independent Shipping Decisions at Scale

Translate your metrics into explicit shipping criteria that are determined prior to an experiment launching. This is a decision framework that enables independent shipping decisions and eliminates bias from decision-making during the evaluation phase.

Some decisions are very straightforward, as in the example below: with revenue equal across the two variations, you would choose the one with higher Daily Active Users.

When revenue is equal, the choice is clear: ship the variation with higher DAU. Defined criteria make this decision automatic.

However, a clearly defined framework for shipping criteria becomes increasingly necessary in situations where metrics conflict, such as in the example below, where DAU is higher in the first experiment, but revenue is higher in the second. In this situation, you need to understand the tradeoff between these metrics that you’re willing to accept when shipping.

This approach encodes your decision-makers' preferences into a repeatable framework so shipping decisions are consistent, defensible, and free from bias.

When DAU and revenue point in different directions, you need a pre-agreed framework to make the call. This is exactly where shipping criteria earn their value.

Luke’s Twitter Example

An example from Twitter highlights how this works in practice. Daily Active Users (DAU) was a key metric for Twitter, but they also wanted evidence that people were using the product repeatedly over time, and in a wide variety of ways, as a sign they were getting real value from it. The measured indicators included tweets created, likes, and other forms of engagement. They used the decision framework below to determine when to ship:

  • If DAU is up and stat sig → ship
  • If DAU is negative → rollback
  • If DAU is up, not stat sig and no guardrails are negative:
    • If engagement metrics are up (tweets created, likes, etc.) → ship
    • Otherwise → experiment review
  • Murky results → rollback
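The rules above translate almost directly into code. This is a minimal sketch; the function and field names are illustrative, not Twitter's actual implementation:

```python
# Sketch of Twitter's shipping-criteria framework as a decision function.
# Names are illustrative; the point is that the rules are fixed up front.

def ship_decision(dau_delta: float, dau_stat_sig: bool,
                  guardrails_ok: bool, engagement_up: bool) -> str:
    if dau_delta > 0 and dau_stat_sig:
        return "ship"                      # DAU up and stat sig
    if dau_delta < 0:
        return "rollback"                  # DAU negative
    if dau_delta > 0 and not dau_stat_sig and guardrails_ok:
        # DAU up but not stat sig, no guardrails negative:
        # fall back to engagement metrics (tweets created, likes, etc.)
        return "ship" if engagement_up else "experiment review"
    return "rollback"                      # murky results default to rollback

print(ship_decision(0.4, True, True, False))   # ship
print(ship_decision(0.1, False, True, True))   # ship
print(ship_decision(-0.2, False, True, True))  # rollback
```

Because the function is fixed before any experiment launches, every team applies the same tradeoffs to a live result, which is what makes independent shipping decisions possible at scale.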

This type of framework scales. It forces tradeoffs to be agreed on before you're under pressure from a live result.

One key caveat: your metric models will likely drift over time, so teams need to revisit them regularly as the product and business evolve. The metrics that predicted success a few months ago may not be the right ones today.

Closing: Shift the Experimentation Culture

Ronny and Luke close with a shared belief: the teams that win at experimentation aren’t always the ones with the most resources or sophisticated tools, but the ones that have built a culture around learning.

The most important piece of advice is to shift the organizational mindset from celebrating shipping to celebrating learning. Most ideas will fail. The teams that internalize this stop treating failed experiments as something to hide and start treating them as the mechanism by which they get smarter and faster over time.

That cultural shift is supported by the practical framework Luke outlined. When you have clearly defined metrics, explicit shipping criteria, and a shared understanding of your tradeoffs, experimentation becomes the foundation for confident, independent decision-making at scale.

Key Takeaways

  • Most experiments fail. The median industry success rate is ~10%, meaning you will be wrong far more often than you expect. An effective experimentation program is how you find what actually works.
  • False positive risk is higher than most teams realize. At a 10% success rate, roughly 1 in 5 "winning" experiments are actually false positives, even when running at p < 0.05.
  • Your experimentation program is only as good as the metrics it optimizes for. Poorly defined OECs lead to decisions that look good on paper, but break the product.
  • Shipping flat is a decision error in nearly every case. "Not statistically significantly worse" is not the same as safe to ship. The true effect could be negative and the code will have maintenance costs.
  • Short-term metrics rarely align 1:1 with long-term business goals. Build an explicit framework connecting the two and put guardrails in place to protect what actually matters.
  • Define your shipping criteria before the experiment runs, not after. This eliminates bias, enables independent decision-making, and forces tradeoffs to be agreed on in advance.
  • Shift the culture from celebrating shipping to celebrating learning. The teams that win at experimentation are the ones that treat failed experiments as the mechanism by which they get smarter.

Want to go deeper? Ronny teaches two online courses on Maven:

Accelerating Innovation with A/B Testing: Ronny’s flagship course and recommended starting point for most practitioners

Advanced Topics in A/B Testing: A follow-on to Accelerating Innovation with A/B Testing for practitioners with a solid foundation in p-values, statistical power, and OEC design

How a Team of 4 Used A/B Testing to Help Fyxer Grow from $1M to $35M ARR in 1 Year
Experiments
AI

Apr 4, 2026

A team of 4 growth engineers ran 360 experiments in a year, helping Fyxer grow from $1M to $35M ARR. Here's how they combined a growth engineering mindset with AI-powered coding to test at a pace most companies can't match.

How Fyxer used AI coding and GrowthBook to run 541 experiments in 1 year

Something remarkable is happening at Fyxer. The AI email assistant grew from $1M to $35M in annual recurring revenue last year. This year, they’re targeting $100M to $150M. Behind that trajectory is a company-wide culture of experimentation that produced 541 experiments in twelve months, more than two per working day. The growth engineering team alone, just four people led by Kameron Tanseli, accounted for 360 of those.

The story of how they did it comes down to two things: the right mindset and an AI-first approach to experimentation. The mindset meant treating every product change as a hypothesis to validate, not a feature to ship. The AI-first approach meant using tools like Cursor, Claude, and GrowthBook to compress the entire experimentation loop, from research to development to analysis, so a small team could operate at a scale that would have been impossible even two years ago.

Kameron joined Fyxer when the company had $1M in ARR. He brought a discipline he’d honed across B2C healthtech, B2B SaaS, and now prosumer AI: measure everything, share everything, and learn as fast as possible. One of his first moves was creating a public Slack channel where every experiment result, win, and loss was visible to the entire company. The founders loved it. It became the company’s central nervous system for understanding what was working and what wasn’t.

Kameron recently joined The Experimentation Edge podcast to share the full story. Below are the key takeaways, but the real unlock came when they combined that learning culture with AI-powered development. That combination is what made 541 experiments possible across the company, and it’s what turned a high volume of losses into the wins that turbo-charged Fyxer’s trajectory.

The Growth Engineering Mindset: Why Learning Speed Beats Intuition

Here’s something Kameron will tell you openly: he’s bad at his job for the first few months every time he starts somewhere new. And it’s not just him. It’s everyone in growth.

When Kameron joined Fyxer, his instincts were calibrated to B2C healthtech, his previous role. He defaulted to discount-heavy messaging, pricing-focused copy, and the kind of urgency-driven language that works for consumer subscription boxes. At a B2B SaaS company selling an AI productivity tool to professionals, none of it landed. The only way to close that gap was to get experiments in front of real users and let the data teach him what his intuition couldn’t.

This is the core of the growth engineering mindset at Fyxer: A/B testing isn’t just an optimization tool. It’s a learning tool. And when you’re new to a product, a market, or a customer base, it’s the fastest way to develop the intuition you don’t have yet.

The numbers back this up. Fyxer’s win rate in GrowthBook is 25%. That means 75% of their experiment ideas failed. If they had shipped every idea to 100% of users without testing, the cumulative damage would have been severe. A 50/50 test, even with imperfect sample sizes, beats shipping blind every time.

Kameron pushes back hard on the common startup objection that “we’re not big enough to A/B test yet.” His view: you may not be able to detect 5% lifts, but you can detect 20% or 30% effects, and at a startup, those are exactly the kinds of changes you should be testing. Pricing models, usage limits, core product flows. The risk of getting those wrong without testing is far greater than the cost of running an imperfect experiment.

A key element of Fyxer’s approach is how they think about iteration. Rather than waiting for a fully polished feature, they ship the core experience and then immediately run experiments to improve adoption and engagement. As Kameron puts it, almost nobody uses your new feature on day one. The real work starts after launch, when you test messaging, onboarding flows, and nudges to find what actually drives usage. This iterative approach was central to their PLG breakthroughs later in the year.

Kameron uses a simple framework to evaluate which features could drive viral growth. First, he identifies the actions users are already repeating within the product. Then he asks: every time a user sends an email, schedules a meeting, or triggers a confirmation, is there a way to use that touchpoint to introduce Fyxer to someone new? When the answer is yes, the team builds and tests a loop around it.

Not every loop works. Fyxer has a scheduling feature, similar to Calendly, and Kameron hypothesized that sending booking confirmations could drive recipients back to Fyxer to sign up. In theory, it was a clean growth loop. In practice, users pushed back immediately. Fyxer’s entire value proposition is reducing inbox noise, and here they were adding another email on top of the Google Calendar and Outlook invites that people have already received. They killed the experiment and pivoted to a different approach. That willingness to test assumptions, even ones that look great on a whiteboard, is what separates a growth-minded team from one that ships on conviction alone.

Using AI to Scale Experimentation from Weeks to Hours

The mindset gets you to the right experiments. AI is what lets a team of four run them at startup speed.

Fyxer’s experimentation stack is built around a few key tools, with Claude as the central hub. The growth team maintains a shared set of Claude skills, so common workflows, like turning a GrowthBook experiment result into a Slack post or generating a hypothesis from a data analysis, are reusable and consistent. They’ve connected Claude to their internal systems through MCP integrations, including GrowthBook’s API, so experiment data flows directly into their AI workflows.

For development, they use Cursor across the full stack. But the real unlock has been Cursor’s desktop mode with virtual environments. Here’s why that matters: traditionally, even a simple experiment requires a developer to write the code, pull it down locally, run the app, and manually check that the new upsell panel or copy change looks right. With Cursor desktop, the tool runs the app in a virtual environment and shows Kameron a video of what the experiment will look like. He reviews it, signs off, and moves on, without ever pulling down the code himself.

This means he can run five or six experiments in parallel, as long as they’re relatively contained changes. For even simpler experiments, like backend configuration changes or one-line feature flag adjustments, they use Claude Opus, Codex, and Tembo to one-shot the implementation entirely.

The AI acceleration extends beyond development. On the data side, Fyxer uses Dot, an AI data analyst that connects to their BigQuery warehouse and lives in Slack. The data team documented their table schemas, columns, and relationships, and Dot uses that context to answer complex questions — segmentation analysis, survival curves, custom queries — from anyone on the team. Non-technical stakeholders can get answers in seconds without waiting for the data team, which unlocked a bottleneck that plagues almost every growing company.

The experimentation lifecycle itself is increasingly automated. Cursor automations fire when PRs are opened, daily jobs check for stale experiment code that should be cleaned up, and product release docs are generated automatically. When a key metric dips unexpectedly, the data team uses the GrowthBook API combined with Claude to cross-reference recent experiment launches and diagnose whether an experiment caused the problem.

The net effect: AI compresses the entire experimentation loop. Research that took days happens in hours. Development that took a week happens in an afternoon. Analysis that requires a data scientist can be done by anyone on the team through Slack. That’s how four engineers run 360 experiments in a year.

What 541 Experiments Actually Produced

Volume without results is just busywork. Here’s what Fyxer’s experimentation program actually delivered:

  • Increasing free-to-paid conversion from 5% to 35% by adding a credit card gate before the free trial.
  • 2.3x-ing the share of paying customers on annual plans, which now accounts for 50% of subscribers.
  • Increasing the trial start rate for personal email users by 65% by segmenting trial lengths based on signup type.
  • Creating a referral growth loop in which 33% of invites are accepted.

None of these were obvious in advance. The credit card gate, for example, contradicts conventional wisdom about reducing friction in signup flows. But Kameron noticed that many AI apps were already asking for credit cards upfront, and Fyxer’s users had high intent because they were connecting their email. They also made the paywall optional during the experiment, drawing design inspiration from Canva’s checkout flow by showing users a clear timeline: what happens today, in 5 days, and in 7 days. The result was essentially free revenue on existing traffic.

The annual plan shift followed a similar pattern. The original UI defaulted to monthly billing with a modest 8% annual discount. Kameron tested defaulting to the yearly plan, increasing the discount to 25%, and displaying the effective monthly price. It’s the kind of change that takes a few hours to implement and test, but has a massive compounding effect on retention and cash flow.

That’s the compounding advantage of high-velocity experimentation: you find the counterintuitive wins that your competitors are leaving on the table because they’re still debating whether to test.

Where Fyxer’s Growth Team Is Headed Next

Fyxer is scaling the growth engineering team from 6 to 13 this year, with a target of 1,000 experiments. But the real multiplier isn’t headcount. It’s a continued investment in AI-powered developer performance: more reusable skills, more automated workflows, and tighter integration between their experimentation platform and their AI tooling.

Their revenue target of $100M to $150M ARR would represent another 3–4x leap. If the pattern holds, that growth won’t come from a single breakthrough. It will come from the compounding effect of hundreds of experiments, most of which will fail, but the ones that win will change the trajectory of the business.

Key Takeaways

  • You don’t need to be big to experiment. You need to be disciplined about testing the things that carry the most risk.
  • A/B testing at a startup is primarily a learning tool. It’s how you build customer intuition fast, especially when you’re new to a market.
  • AI doesn’t just make development faster. It compresses the entire experimentation loop, from hypothesis to analysis, making high-velocity testing possible with a small team.
  • A 25% win rate is a feature, not a bug. It means you’re testing bold ideas and catching the failures before they ship to everyone.
  • The combination of the right mindset and an AI-first approach to tooling is a genuine competitive advantage, and one that’s accessible to any team willing to invest in both.

Want to hear the full conversation? Watch Kameron’s episode on The Experimentation Edge podcast, where he goes deeper on Fyxer’s growth loops, AI tooling stack, and advice for growth engineers starting at a new company.

Fyxer runs its entire experimentation program on GrowthBook, the open-source feature flagging and A/B testing platform. If your team is looking to scale experimentation without scaling headcount, get started for free or request a demo.

How to Migrate from Statsig to GrowthBook
Guides
Feature Flags
Experiments

How to Migrate from Statsig to GrowthBook

Apr 2, 2026
x
min read

The industry's only open-source, warehouse-native experimentation platform gives you predictable pricing, full data ownership, and results you can verify. Here's why Statsig customers are switching to GrowthBook.

When OpenAI acquired Statsig, engineering leaders at hundreds of companies started asking the same question: what happens to our data?

It's a fair question. Statsig routes all event data through its own servers. With that infrastructure now under OpenAI's control — and Statsig's CEO gone — teams that cared about data governance found themselves re-evaluating a platform they'd built their experimentation programs on. Add event-based pricing that climbs as you scale, and the calculus tilts further.

GrowthBook is where many of them land. This post explains what GrowthBook offers, what the migration looks like in practice, and how to decide whether GrowthBook compared to Statsig makes sense for your team.

What is the GrowthBook open-source platform?

GrowthBook is an open-source feature flag, experimentation, and product analytics platform. It is the original warehouse-native platform, trusted by more than 3,000 companies, including Dropbox, Khan Academy, Upstart, Sony, and Wikipedia. It handles over 100 billion feature flag lookups per day.

The warehouse-native architecture is the defining design choice. Rather than copying your data into its own system, GrowthBook queries your data where it already lives — Snowflake, BigQuery, Databricks, Redshift, ClickHouse, Postgres, and more. Analysis runs in your warehouse with read-only access. Every SQL query is visible. Every result is reproducible.

That's the short version. The longer version explains why it matters when you're evaluating a replacement.

4 reasons teams switch from Statsig

1. Lack of data ownership

Statsig's architecture requires sending event data to Statsig's servers for analysis. That worked when Statsig was an independent company. Under OpenAI ownership, with no published data firewall policy between Statsig customer data and OpenAI's AI training, the risk profile changed.

GrowthBook inverts the model. Your data warehouse holds the data. GrowthBook reads aggregate statistics from it, using read-only credentials. Raw PII stays in your environment, under your control, and subject to your compliance policies. For teams operating under GDPR, HIPAA, CCPA, COPPA, or data residency requirements, this distinction is operational, not philosophical.

John Resig, Chief Software Architect at Khan Academy, described exactly this concern: the ability to retain data ownership was, in his words, "very, very important," because most platforms require passing user data to a third-party service.

GrowthBook's self-hosted deployment takes it further. Deploy within your own infrastructure, behind your own firewall, with zero external data egress. Fully air-gapped deployments are supported for the most sensitive environments.

2. Expensive to scale

Statsig prices on events and traffic. That structure makes sense when you're running a handful of experiments on modest traffic. At scale, it penalizes the behavior you want to encourage: more experiments, more feature flags, more coverage.

Teams using Statsig often end up managing their experimentation volume to manage their bill — sampling down traffic, avoiding flagging minor changes, skipping experiments on low-stakes features. That's the opposite of a healthy experimentation culture.

GrowthBook uses per-seat pricing. A team that runs 10 experiments a month pays the same as one running 100. Feature flag evaluations don't generate a cost event. The experimentation ROI calculator can model your specific usage to show expected savings.

3. Limited visibility into underlying results

Statsig's statistics engine is proprietary. You can see the outputs, but not the logic that produced them. When a result is surprising, your options for investigation are limited to the interfaces Statsig exposes.

GrowthBook's engine is fully open source on GitHub (7,000+ stars). Every calculation is inspectable. Every query is visible in your warehouse. If a result looks off, you can drill into the underlying SQL, check the raw data, and confirm or refute the calculation on your own terms.

Diego Accame, Director of Engineering at Upstart, put it this way: "Our strength is as an AI-powered lending marketplace, not an experimentation framework company. GrowthBook lets us focus our resources where they matter most — on growing our core business."

That confidence comes partly from owning the infrastructure and partly from being able to verify what the infrastructure is doing.

4. Limited statistical depth

GrowthBook supports Bayesian, frequentist, and sequential testing, with CUPED variance reduction and post-stratification. Statsig supports a similar range, but without post-stratification, and without the ability to inspect or reproduce the calculations.

For data science teams that care about methodology — particularly at companies where an experiment result drives a significant product or business decision — the ability to validate the math is a meaningful advantage.

What the Statsig migration kit covers

GrowthBook ships a migration kit specifically for Statsig customers, including an AI-powered assistant that can transform your existing codebase. Here's what migrates:

Projects, teams, and tags carry over cleanly, preserving your workspace organization so teams can keep working without rebuilding their context.

Feature gates from Statsig map to GrowthBook feature flags, which support multiple environments, targeting rules, gradual rollouts, and instant kill switches.

SDKs migrate automatically. The AI migration assistant points at your codebase and handles the transformation — feature gates, dynamic configs, and user attributes converted to GrowthBook equivalents. JavaScript, TypeScript, and React are supported today, with more coming. 

This is the step that usually takes weeks; the assistant reduces it to minutes.

Experiments transfer, including past experiments run on Statsig. You can generate custom reports from past Statsig experiments in GrowthBook, which preserves institutional knowledge.

Targeting rules transfer with full visibility into conditions and rollouts. GrowthBook includes debugging tools that simulate flag values for specific audiences, making it straightforward to verify that migration behavior matches pre-migration behavior.

Safe rollouts remain a first-class concept. GrowthBook supports gradual exposure with automatic monitoring of guardrail metrics, so regressions trigger alerts before they reach your full user base.

The SDKs themselves don't require replacement during migration. If you're moving from Statsig cloud to GrowthBook cloud, or from Statsig to self-hosted GrowthBook, your feature flag configuration and experiment setup carry over without requiring SDK changes or redeployment of your application code.

The full GrowthBook platform you're migrating to

Migration is the starting line, not the finish line. Here's what GrowthBook offers beyond Statsig feature parity.

Feature flagging that doesn't cost per evaluation

GrowthBook's feature flags run through zero-network-call SDKs. The SDK downloads a payload at startup and evaluates flags locally, so each flag evaluation adds sub-millisecond latency without generating a billable event. You can flag every feature in your product — including low-traffic, experimental, and internal-use features — without worrying about cost.
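To make local evaluation concrete, here is a minimal Python sketch of the general idea. This is not GrowthBook's actual hashing scheme or payload format (the flag fields and SHA-256 bucketing below are illustrative assumptions), but it shows why no network call is needed: a deterministic hash maps each user to a stable bucket, evaluated against a locally cached payload.

```python
import hashlib

def bucket(user_id: str, flag_key: str) -> float:
    """Deterministically map a (user, flag) pair to [0, 1)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def evaluate(flag: dict, user_id: str):
    """Evaluate a flag from a locally cached payload -- no network call."""
    if not flag.get("enabled", False):
        return flag["default"]
    if bucket(user_id, flag["key"]) < flag.get("rollout", 1.0):
        return flag["variation"]
    return flag["default"]

# Payload is fetched once at startup; every evaluation after that is local
payload = {
    "new-checkout": {"key": "new-checkout", "enabled": True,
                     "rollout": 0.10, "variation": "v2", "default": "v1"},
}
value = evaluate(payload["new-checkout"], "user-123")
```

Because the hash is deterministic, the same user always lands in the same bucket across sessions and services, with no billable event generated per lookup.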

GrowthBook supports 24+ SDKs: JavaScript, React, React Native, Node.js, Python, Ruby, Go, PHP, Java, Kotlin, Swift, and more. The Chrome debugger lets you inspect flag state and experiment assignment in real time without touching application code.

Experimentation with SQL you write and own

GrowthBook's metric system is SQL-first. You write metrics using your warehouse's SQL dialect, join against any tables in your schema, and apply whatever business logic your team uses. A metric for revenue per activated user might join your experiment assignment table to your payments table to your activation events — all using the same logic your data team uses everywhere else.

Forgot to add a metric before an experiment started? Add it retroactively. The data is already in your warehouse. Just define the metric and run the analysis against the historical assignment data.
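As a sketch of what that retroactive analysis does, the following mimics the warehouse join in plain Python data structures. The table shapes and numbers are invented for illustration; in practice GrowthBook runs the equivalent SQL in your warehouse.

```python
from collections import defaultdict

# Hypothetical tables, as rows, for illustration:
# assignments: (user_id, variation); payments: (user_id, amount)
assignments = [(1, "control"), (2, "treatment"), (3, "control"), (4, "treatment")]
payments = [(1, 20.0), (2, 35.0), (2, 15.0), (4, 40.0)]

# Sum payments per user (the join key)
revenue = defaultdict(float)
for user_id, amount in payments:
    revenue[user_id] += amount

# Left-join assignments to revenue: non-payers count as 0 in the denominator
totals, counts = defaultdict(float), defaultdict(int)
for user_id, variation in assignments:
    totals[variation] += revenue.get(user_id, 0.0)
    counts[variation] += 1

# The retroactively defined metric: mean revenue per user, per variation
per_variation = {v: totals[v] / counts[v] for v in counts}
# e.g. {"control": 10.0, "treatment": 45.0}
```

The key point is that nothing needed to be instrumented in advance: the assignment and payment rows were already in the warehouse, so the metric definition is just a new query over historical data.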

Metrics can be standardized in a library, enabling every team to measure success consistently. They can be scoped to specific experiments or applied globally as guardrails.

Deployment on your terms

GrowthBook Cloud runs on AWS with automatic updates, encrypted data at rest and in transit, 99.99% uptime SLA at Enterprise tier, and SOC 2 Type II, ISO 27001, GDPR, COPPA, and CCPA compliance.

GrowthBook Self-Hosted runs on your infrastructure: any major cloud provider or on-premises, deployed with Kubernetes or any container platform. Same codebase. Same features. Same development roadmap. The only difference is who manages the infrastructure.

Many teams start on GrowthBook Cloud for the fastest path to running experiments, then migrate to self-hosted when compliance requirements or internal policy require it. GrowthBook's SDK and configuration structure don't change in that migration, so the transition preserves everything you've built.

If you don't have a data warehouse yet, GrowthBook's Managed Warehouse gives you a fully functional environment immediately, with the option to migrate to your own warehouse at any time.

AI-ready experimentation

Three of the five leading AI infrastructure companies use GrowthBook to test and optimize their products. The platform handles the non-deterministic, high-variance nature of AI feature testing well. 

  • Sequential testing reduces false positives
  • CUPED variance reduction accelerates decision-making
  • Fully custom SQL metrics capture what matters for AI outputs (task completion, output acceptance, engagement depth) rather than just clicks
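To illustrate the second bullet, here is a self-contained sketch of the core CUPED adjustment on simulated data. The data and effect size are invented, and a production implementation has more machinery around it, but the mechanism is this: each user's outcome is adjusted by a pre-experiment covariate, shrinking variance without biasing the estimated lift.

```python
import random

random.seed(0)
n = 20000
effect = 0.5  # true lift we want to detect

# Simulated users: pre-experiment activity strongly predicts the outcome
pre = [random.gauss(10, 3) for _ in range(n)]
treated = [random.random() < 0.5 for _ in range(n)]
y = [x + random.gauss(0, 2) + (effect if t else 0.0)
     for x, t in zip(pre, treated)]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# CUPED: y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)
mx, my = mean(pre), mean(y)
theta = sum((x - mx) * (v - my) for x, v in zip(pre, y)) / n / variance(pre)
y_adj = [v - theta * (x - mx) for x, v in zip(pre, y)]

def lift(outcomes):
    t = [v for v, g in zip(outcomes, treated) if g]
    c = [v for v, g in zip(outcomes, treated) if not g]
    return mean(t) - mean(c)

raw_lift, adj_lift = lift(y), lift(y_adj)   # same estimate, on average
var_ratio = variance(y_adj) / variance(y)   # but far less noise (~0.3 here)
```

Lower variance on the adjusted outcome means tighter confidence intervals for the same sample size, which is what "accelerates decision-making" in practice.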

GrowthBook's MCP server connects to Cursor, VS Code, Claude Code, and any other MCP-compatible IDE. Create feature flags and experiments in natural language, query past results, and build agents with your experimentation data as context — all without leaving your editor.

Getting started for free

GrowthBook's offer for current Statsig customers: use GrowthBook for free through your current renewal date, up to one year (up to $100,000 value). The migration kit, including the AI-powered SDK migration assistant, is available immediately.

The practical starting path for most teams:

  1. Connect GrowthBook to your data warehouse. Pre-built SQL templates get you to first results without custom data engineering. Customize from there.
  2. Run the AI migration assistant against your codebase. It transforms Statsig feature gates to GrowthBook equivalents and generates a diff for your team to review.
  3. Import your Statsig experiments. Historical results carry over so you don't lose the record of what you've learned.
  4. Start your first GrowthBook experiment. The Chrome debugger and visual editor make the first experiment accessible to non-engineers.

The decision to switch to GrowthBook

If your team depends on Statsig and the OpenAI acquisition raises questions you can't yet get answered about data governance, roadmap continuity, or long-term pricing, then GrowthBook is the switch that costs the least to evaluate and offers the most structural independence.

Open source lets you inspect and audit what you're running. Warehouse-native architecture gives you data ownership that doesn't depend on a vendor relationship. Per-seat pricing gives you the freedom to run more experiments without watching a meter.

The migration kit makes the practical barriers manageable. The question is whether the reasons to switch outweigh the friction of switching. For most Statsig customers evaluating the post-acquisition landscape, that math is becoming clearer.

Ready to get started?

Schedule a consultation →

Estimate your savings →

Read the GrowthBook vs. Statsig comparison →

Your Experiment Lift Is An Average — Which Users Actually Benefited?
Experiments
Analytics

Your Experiment Lift Is An Average — Which Users Actually Benefited?

Apr 1, 2026
x
min read

The case for looking beyond the Average Treatment Effect

One number, many stories

You moved the recommendations carousel higher on the product page. After two weeks, the experiment comes back: +1.6% on conversion rate. Stakeholders are happy. You ship. You celebrate. You move on.

That workflow is fine. The Average Treatment Effect (ATE) is the right first thing to look at. It's what experiments in GrowthBook are typically designed to estimate. If you're going to act on a single number, that's probably the one. But if you stop there, you are leaving money on the table.

What is the Average Treatment Effect (ATE)?

The ATE is the difference in average outcome between users in the treatment group and users in the control group. It's the standard summary statistic from a randomized experiment — and the right first number to look at. But it summarizes across all individual responses in your experimental sample.
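In code, the estimate is just a difference of group means. A toy example with invented conversion data:

```python
# Toy experiment: 1 = converted, 0 = did not (invented data)
control = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
treatment = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# The ATE estimate is simply the difference in group means
ate = sum(treatment) / len(treatment) - sum(control) / len(control)
# 0.5 - 0.3: a 20-percentage-point lift in this toy sample
```

That single subtraction is the whole summary. Everything the rest of this post discusses is about what it collapses away.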

That 1.6% is an average. Your user base includes many different types of people: varying usage patterns, needs, and baseline behaviors. You already know this. You probably already segment users for marketing and personalization, or you wish you did. Yet when the experiment result comes back, all of that diversity collapses into a single number.

The question is: what is that single number hiding? In this post, we look at what the average treatment effect actually represents, why the same average can mask very different realities, and what that means for how you act on your results.

What the Average Treatment Effect (ATE) really means

The average treatment effect is exactly that — an average. Behind it sits a distribution of individual responses: some users who gained a lot, some who barely noticed, some who were actively put off by the change. It is a summary across your entire experiment sample, not necessarily the effect on any particular user. If your metric is binary — the user converts, or they don't — nobody converted 1.6% more times. Some users were pushed over the edge and converted when they otherwise wouldn't have. Others were unaffected. Some may have been put off by the change and thus did not convert. What you observe as +1.6% is the net result after all of these individual responses are averaged together.

You cannot observe any individual user's treatment effect — that's the fundamental problem of causal inference. You only ever see what actually happened to a user, never what would have happened without the treatment. But the underlying distribution of those individual effects is real. The average effect tells you where the center is, but it tells you nothing about the rest.

Why the same experiment result can hide very different realities

To see why this matters, consider three scenarios — all with the same average effect of +1.6%. The distributions below are conceptual illustrations of what might be hiding behind that average. But the question of which one you're in is very real.

Scenario (a): Nearly everyone benefits a little

Distribution of individual treatment effects showing nearly all users benefiting equally from a 1.6% average lift

Think of a pure copy change: rewording a headline on a product page. There's no structural change, no new functionality. The tweak lands roughly the same way for everyone (not all copy changes do, of course, but this one did). A small, diffuse lift. This is what most people implicitly picture when they hear "1.6% lift." It's also the easy case. When the effect is similar for most users, the average tells the whole story, and you can act on it with confidence.

Scenario (b): One subgroup drives the entire effect

Bimodal distribution of individual treatment effects, where one user subgroup drives the entire 1.6% average experiment result

Let's take the carousel for another spin. You moved it higher on the product page. Two types of users now have very different experiences. Browsers, the ones who enjoy discovering new products, engage with the carousel and convert more. Searchers, users who came for a specific item, now have to scroll past content they never asked for. Slightly annoying. The browsers see a meaningful positive effect. The searchers see zero or slightly negative. Most users barely notice. The +1.6% average is real, but it's driven by a single user type, and you are shipping the change to everyone.

Scenario (c): Winners and losers

Skewed treatment effect distribution showing 35% of users negatively affected despite a positive 1.6% average experiment result

You raised the free shipping threshold from $25 to $50. Users who were comfortable buying one or two small items with free shipping now face a delivery charge that feels offensive. Some abandon their carts. Some find the same items at a competitor. Meanwhile, users with larger baskets add a few extra items to clear the new threshold, pushing average order value up. The overall effect is positive, but a sizeable share of users are notably worse off.

The bottom line: from the average effect alone, you cannot tell which of these scenarios you are in. The decision to ship looks the same in all three cases. The implications are very different.¹
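A quick simulation makes the point tangible. The three distributions below are invented to average to the same +1.6 points, mirroring scenarios (a), (b), and (c); only the share of users left worse off tells them apart.

```python
import random

random.seed(1)
n = 10000

# (a) nearly everyone gains a little
uniform_lift = [random.gauss(1.6, 0.3) for _ in range(n)]
# (b) a 20% subgroup drives the entire effect: 0.2 * 8.0 = 1.6
subgroup = [random.gauss(8.0, 1.0) if random.random() < 0.2
            else random.gauss(0.0, 0.3) for _ in range(n)]
# (c) winners and losers: 0.5 * 6.0 + 0.5 * (-2.8) = 1.6
winners_losers = [random.gauss(6.0, 1.0) if random.random() < 0.5
                  else random.gauss(-2.8, 1.0) for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

# Identical if you only report the average treatment effect...
averages = [mean(uniform_lift), mean(subgroup), mean(winners_losers)]

# ...wildly different in how many users are left worse off
harmed = [sum(x < 0 for x in xs) / n
          for xs in (uniform_lift, subgroup, winners_losers)]
```

In scenario (a) almost nobody is harmed; in scenario (c) roughly half your users are, even though all three report the same headline lift.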

Why experiment results vary across user segments — and why it matters

This isn't an academic exercise. The scenario you are in changes what you should do next.

Is the signal real? An effect driven entirely by one segment is either a discovery or a warning sign. If the subgroup is large and the effect is real, you may have found something worth doubling down on. If the group is small and their effect is noisy, your positive result may not replicate.

Why, not just what? Understanding who benefits also generates hypotheses about why the treatment works. That's how you build on experiment results rather than just collecting them. A team that knows the carousel helped browsers but irritated searchers can design a better version: show it on category pages, suppress it on search results. A team that only sees +1.6% moves on to the next test.

Will it last? Your experiment ran at a specific point in time, on the users who happened to be active during that window. If the effect is similar for all your users, it is more likely to hold as your user base evolves. If it's concentrated in one segment, the result is only as durable as that segment's share of your traffic. If the lift came from a seasonal cohort of holiday shoppers, it may not survive into Q1. What will your user base look like next quarter, or next year? An effect that's similar across users and one that's concentrated in one particular segment age very differently.

From average results to individual insights in experimentation

You don't have to use fancy machine-learning frameworks to start asking these questions. The simplest version is to look at your experiment results across dimensions you already have: geography, platform, user tenure, and purchase frequency. In GrowthBook, that's what dimension splits are for. From a different angle, quantile treatment effects let you compare different percentiles of the outcome distribution across variants — for example, did the free shipping change hurt users at the low end of spend while benefiting those at the top? And with Experiment Dashboards, you can make these breakdowns a default part of every experiment readout, so looking beyond the average becomes standard procedure.
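A dimension split is conceptually simple: compute the treatment-versus-control difference within each segment instead of overall. A minimal sketch with invented data, echoing the browser/searcher example:

```python
from collections import defaultdict

# Hypothetical per-user experiment rows: (segment, variation, converted)
rows = [
    ("browser", "control", 0), ("browser", "treatment", 1),
    ("browser", "control", 1), ("browser", "treatment", 1),
    ("searcher", "control", 1), ("searcher", "treatment", 0),
    ("searcher", "control", 1), ("searcher", "treatment", 1),
]

totals = defaultdict(lambda: [0, 0])   # (segment, variation) -> [conversions, users]
for segment, variation, converted in rows:
    totals[(segment, variation)][0] += converted
    totals[(segment, variation)][1] += 1

def rate(segment, variation):
    conversions, users = totals[(segment, variation)]
    return conversions / users

# Per-segment lift: the overall average can hide opposite signs
lift_by_segment = {seg: rate(seg, "treatment") - rate(seg, "control")
                   for seg in {seg for seg, _, _ in rows}}
# e.g. browsers up, searchers down, even when the pooled lift is positive
```

The same grouping logic runs as SQL in your warehouse when you add a dimension to a GrowthBook experiment readout; the sketch just makes the arithmetic explicit.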

For teams willing to go further, more advanced methods can produce effect estimates at the individual level — the closest you'll get to making those conceptual distributions real.

And once you know the effect varies by segment, you don't have to ship the same experience to everyone. Most experimentation platforms, GrowthBook included, let you target features to specific user segments. The experiment told you who benefits. Targeting lets you act on it.

Slicing data post-hoc does come with real statistical risks, but there are well-established ways to handle them. The next post in this series covers how to navigate these waters: how to slice your experiment data, what to watch out for, and how to tell a real finding from a lucky split.

In the meantime, explore dimension splits and think about what dimensions might be interesting in your experiments. And next time you're building a feature, think about how your different users might respond to it, and who might not like it at all.

As always, beware and have fun!

This is part of a series on treatment effect heterogeneity. The next post is about how to uncover the different ways users respond to the same treatment, without fooling yourself.

¹ This idea is developed more formally by Gelman, Hullman, and Kennedy (2024) as "causal quartets" — different data-generating processes that produce identical average effects. The American Statistician, 78(3), 267–272.

What Is A/B Testing? A Practical Guide for Product & Engineering Teams
Experiments
Guides

What Is A/B Testing? A Practical Guide for Product & Engineering Teams

Apr 1, 2026
x
min read

A/B testing is simple in concept. Split your users, show them different experiences, and measure what happens.

In practice, A/B testing for product teams is rarely that clean. Real products have real constraints in tracking, assignment, and metric definition, quickly making a straightforward test complicated.

While low-velocity teams can absorb slow, isolated mistakes, high-volume experimentation requires mastering the fundamentals: at scale, small flaws compound into confident product decisions built on bad data. Fortunately, these failure modes are well understood and avoidable.

What Is A/B Testing

A/B testing, sometimes called split testing, is a randomized experiment in which multiple versions of something are shown to different groups simultaneously. Each group is measured against a defined metric to determine which performs better.

By randomly assigning units to each version, you control for external factors like seasonality, changes in traffic mix, and broader market conditions, so any difference in outcomes can be attributed to your change and nothing else.
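A small simulation shows why randomization neutralizes those external factors. In the invented scenario below, baseline conversion drifts down over a two-week "season," yet the difference in means still recovers the true effect, because both groups experience the drift equally.

```python
import random

random.seed(2)
n = 100000
true_effect = 0.02  # treatment adds 2 percentage points of conversion

conv_treat = conv_control = n_treat = n_control = 0
for i in range(n):
    day = i % 14
    baseline = 0.30 - 0.01 * day        # external factor: seasonal decline
    treated = random.random() < 0.5     # random assignment, independent of day
    p = baseline + (true_effect if treated else 0.0)
    converted = random.random() < p
    if treated:
        conv_treat += converted
        n_treat += 1
    else:
        conv_control += converted
        n_control += 1

# Both groups saw the same seasonal mix, so the difference isolates the treatment
lift = conv_treat / n_treat - conv_control / n_control
```

A before/after comparison over the same fortnight would have attributed the seasonal decline to the change itself; the randomized split does not.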

What Does A/B Testing Look Like in Practice

In product development, an A/B test runs alongside your normal release process. Rather than shipping a change to everyone at once, you expose a subset of your users to the new experience while the rest continue seeing the existing one. Both groups run simultaneously, and you measure the difference.

  1. Define a hypothesis including the metric you're testing against.
  2. Randomly split your audience into groups, each exposed to a different version.
  3. Analyze the difference between groups using a statistical framework.
  4. Ship the winning variant, or go back to the drawing board with what you learned.
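Step 3 usually means a statistical test. As a minimal sketch, here is a classical two-proportion z-test on invented checkout numbers; real platforms, GrowthBook included, typically layer Bayesian or sequential methods on top of this.

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF
    return p_b - p_a, p_value

# Invented numbers: 5.0% -> 5.6% completed purchases (a 12% relative lift)
lift, p = two_proportion_z(conv_a=1000, n_a=20000, conv_b=1120, n_b=20000)
```

The same relative lift on a tenth of the traffic would not reach significance, which is why step 2 (how you split, and how much traffic you commit) matters as much as step 3.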

Without that structure, you're left comparing against historical data. Consider a team that ships a new feature and watches new signups drop 8% over the following two weeks. They blame the release and roll it back, but signups stay flat. It turns out it was a seasonal dip that would have happened regardless of what was shipped, and now the team has spent a week in firefighting mode reverting a change that had nothing to do with the decline.

Or consider a team deciding between two redesigns of the same checkout flow. Rather than debating which one to ship, they test both against the current experience simultaneously. One variant performs similarly to the control. The other increases completed purchases by 12%. Without the test, that call comes down to whoever argues most convincingly in the design review.

Why Does A/B Testing Matter

For product teams, the value of A/B testing isn't just finding winning variants. It's making consequential decisions about how your product works based on what users actually do, rather than what your team thinks they'll do. 

It's also one of the few tools that gives teams the ability to push back on the HiPPO (the highest paid person's opinion) with something more than a gut feeling of their own. When the data says otherwise, it says so for everyone in the room.

The Critical Difference: A/B Testing vs Gut/Intuition

Without A/B testing, product decisions tend to default to a familiar set of inputs:

  • HiPPO (Highest Paid Person's Opinion). The person with the most seniority in the room has the most influence over what ships. Experience and instinct have value, but they're not a substitute for knowing what your users actually do.
  • Best practices that may not apply to your audience. What worked for another product, in another market, with a different user base is a starting point at best. Your users are not their users.
  • Assumptions about user behavior. Intuition about how users will respond to a change is useful for generating hypotheses, but assumptions are often wrong.
  • Competitor copying without context. You can see what your competitors ship, but you can't see whether it worked or what they had to give up to get there.

With A/B testing, product decisions are grounded in more reliable inputs:

  • Actual user behavior from your specific audience. Benchmarks and case studies tell you what worked somewhere else. This tells you what works for your users, in your product.
  • Statistically validated results. Results you can trust, reproduce, and build on rather than ones you have to take on faith.
  • Measurable business impact. You can tie the outcome of an experiment directly to the metrics the business cares about, whether that's retention, revenue, or engagement.
  • Continuous learning. Every experiment, whether it wins or loses, tells you something about how your users behave. 

What are the Benefits of A/B Testing in 2026

For modern product teams, the benefits of A/B testing go well beyond finding a winning variant. In 2026, with AI accelerating the pace of product development and raising the bar for what teams can ship, the cost of making bad product decisions has never been higher. Done consistently and rigorously, experimentation touches how teams make decisions, allocate resources, and understand their users.

1. Get More Value from Your Existing Traffic

Customer acquisition costs have climbed by as much as 60% since 2023:

  • Paid channels are getting more expensive as competition for inventory increases and AI-driven bidding pushes auction prices up. 
  • Organic search is delivering fewer clicks as LLMs answer queries before users leave the results page. 
  • Social platforms are increasingly designed to keep users on-platform rather than send them to yours.

Getting more value out of the traffic you already have is increasingly a business necessity, and A/B testing is how you do it systematically.

2. Reduce the Risk of Rolling Out Major Changes

Every product change carries risk. A change can perform worse than expected for a variety of reasons: a bug that only surfaces under certain conditions, user behavior that didn't match your assumptions, or a change that worked well for one segment while degrading the experience for another. Without feature experimentation, you find out about these issues after the fact, when it has already reached your entire user base.

By feature flagging and exposing a change to a subset of users first, you limit the damage if something goes wrong. A variant that damages an important metric affects 10% of your traffic, not 100%. If it performs well, you can roll it out knowing what to expect. If it doesn't, you can roll it back before most of your users ever see it.

3. Speed Up Product Decision Making 

Product decisions are slow when they rely on opinion. Design reviews stretch into hours as stakeholders debate, and the person with the most seniority often wins, not because they're right, but because they're the loudest voice in the room.

Product experimentation changes how those conversations go. When you have data on how users actually behaved, the debate shifts from "I think" to "here's what we know." As one PM put it: "A/B testing turned our three-hour design debates into 30-minute data reviews."

That speed compounds over time. Teams that can make and validate product decisions faster than their competitors ship more, learn more, and course-correct before small mistakes become expensive ones.

4. Develop a Deeper Understanding of Your Users 

Every experiment tells you something about your users, whether it wins or loses. A variant that underperforms is still evidence. It tells you what your users don't respond to, which is often just as useful as knowing what they do.

Over time, that body of evidence becomes more valuable than any single test result. Teams that maintain a searchable archive of past experiments (GrowthBook does this automatically) stop asking "Didn't we already test this?" and start forming better hypotheses from the outset. This process builds a richer understanding of their users and how they actually behave, leading to better prioritization as the most impactful initiatives become clearer.

5. Uncover Surprising Insights

Not every valuable idea looks valuable before it's tested. A Microsoft engineer once ran a quick A/B test on a low-priority change to how Bing displayed ad headlines (an idea that had sat untouched for over six months). The test showed a 12% increase in revenue, which translated to more than $100 million annually in the US alone. It turned out to be the best revenue-generating idea in Bing's history and it almost never got tested at all.

These insights only surface when you have an A/B testing framework that makes it easy to ship any product change as a controlled experiment.

6. Build a Competitive Advantage

The teams that consistently outperform their competitors aren't necessarily the ones with the best ideas. They're the ones who can validate ideas faster and learn from failures.

Netflix is a well-documented example. The company runs experiments across virtually every aspect of its product, optimizing everything from thumbnails to recommendation algorithms to ensure that data (rather than opinion) drives decisions. That commitment to experimentation at scale is part of what allows a company of that size to keep iterating as fast as it does.

The more consistently you test, the better your decisions get, and the harder that advantage is for competitors to close.

Who Should Use A/B Testing (and Who Shouldn’t)

Most teams can benefit from A/B testing in some form. But the teams that get the most out of it tend to share a few things in common: enough volume to reach statistically meaningful results, the technical infrastructure to instrument changes correctly, and decisions that are frequent enough to make a testing practice worthwhile.

  • Product teams should run experiments to make confident decisions about what to build and how to build it. Does this feature change improve engagement? Does this new experience keep users on the platform longer? Experimentation answers those questions before a change is fully committed. Smaller tests can also validate hypotheses early, before significant product development investment is made, informing broader product strategy along the way. It also gives product teams a clearer read on the actual impact of their work, which is often harder to measure than it looks.
  • Engineering teams should run experiments to ship changes with confidence and get direct visibility into the impact of their work on product outcomes, not just system performance. Does this algorithm change actually improve the metric it was designed to improve? Does this infrastructure change affect user behavior in ways that weren't anticipated? Rather than shipping to everyone at once, changes can be rolled out to a subset of users first, catching unexpected behavior before it reaches your entire user base.
  • Growth and marketing teams should run experiments to validate what actually resonates with their audience before committing to a direction. Does this landing page copy increase signups? Does this email subject line improve open rates? The feedback loops are short, and the metrics are clear, making experimentation a natural fit for fast iteration.
  • Design teams should run experiments to resolve design debates with data (rather than opinion alone) and validate changes before they're fully built. Does this layout change make the key action more obvious? Does this navigation pattern reduce friction or just introduce unfamiliarity? A/B testing gives design teams a way to move forward on contested decisions without waiting for consensus.

A/B testing isn't the right tool in every situation. There are a few conditions where it will either produce unreliable results or simply isn't worth the investment.

  • Product is too early stage. If you're still searching for product-market fit, optimizing individual features is a distraction. The priority at that stage is to learn whether the core product solves a real problem, which requires qualitative research and iteration, not controlled experiments.
  • Not enough units. A/B testing requires enough users moving through the experience you're testing to produce statistically meaningful results within a reasonable timeframe. If your sample size is too small, you'll either run tests for months or make decisions based on underpowered results that don't hold up.
  • Decisions with an obvious right answer. Some changes don't need a test. Accessibility improvements, critical bug fixes, and security patches should be shipped because they're the right thing to do for your users, not because an experiment validated them. Testing these changes introduces unnecessary delay and in some cases raises ethical questions about deliberately exposing a subset of users to an inferior or broken experience. However, it can still be valuable to run non-inferiority tests to ensure changes don't introduce new issues that affect the customer experience.
  • No internal alignment. A/B testing only produces value if the results get acted on. If your team can't agree on what success looks like before the experiment starts, or if stakeholders routinely override data-driven conclusions with opinion, the infrastructure of experimentation exists without the culture to support it. The tool is only as useful as the organization's willingness to trust and act on what it finds. Getting alignment on the OEC (Overall Evaluation Criterion) is usually a critical first step: if your teams can't agree on a north star metric, it's very difficult to grow the business effectively.
  • Significant brand changes. A/B testing works well for changes with measurable behavioral outcomes, but brand identity isn't that kind of decision. Testing radically different brand expressions simultaneously means that different users see different versions of who you are, creating inconsistency that's difficult to undo. For changes to core brand messaging, tone, or visual identity, market research and qualitative methods are better inputs than a randomized experiment.
  • Regulated industries with constraints on user treatment. In some industries, randomly assigning users to different experiences raises legal or ethical issues. Healthcare, financial services, and edtech all require additional care: in a classroom, for example, you don't want half the students to get one learning experience and the other half another, which would be hard on both the teacher and the students. Data privacy is also critical in these industries. None of this means regulated teams can't run A/B tests (GrowthBook's self-hosting and privacy-first architecture is specifically designed for teams operating in regulated environments); it means the standard experimentation framework needs to be adapted before it can be applied safely.

What Can You A/B Test?

Most teams start experimenting with the most visible parts of their product and stop there. The reality is that if a change can be measured and randomly assigned, it can be tested. That applies as much to a ranking algorithm or a model prompt as it does to a button label or a checkout flow, and the most sophisticated experimentation programs treat almost every product change as a candidate for a controlled experiment.

User-Facing Product Experiences

Changes to what users see and interact with directly are often the easiest to instrument, the most straightforward to design a clean experiment around, and the most immediately connected to the metrics product teams care about.

Copy and Messaging

The words you use to describe your product, explain a feature, or prompt an action affect how users respond in ways that are hard to predict without testing. This includes headlines, body copy, error messages, empty states, and tooltips. Copy that works well in one context often fails in another, which makes experimentation more reliable than intuition.

Visual Design Elements

Colors, typography, imagery, iconography, and visual hierarchy all affect how users perceive and engage with a product. These elements are worth testing on high-traffic acquisition surfaces where visual choices directly affect first impressions and conversion.

Social Proof and Trust Signals

The placement, format, and type of social proof affects how users evaluate whether to take action. Testimonials, review counts, trust badges, and case study callouts are all worth testing at high-stakes moments in the user journey, like pricing pages or checkout flows, where trust is a meaningful factor in the decision.

Calls to Action

Button text, placement, size, and visual weight all affect whether users take the action you want. The difference between "Start free trial" and "Get started" may seem trivial, but it can produce measurable differences in click-through and conversion rates.

Forms and Data Collection

The number of fields, their order, their labels, and how validation errors are presented all affect completion rates. For teams with signup flows, checkout processes, or any other form-gated experience, this is a productive area for experimentation.

Layout and Navigation

How you organize and present information affects how users move through a product and what they do next. Single versus multi-column layouts, card versus list views, menu structure, and the placement of key actions relative to supporting content are structural decisions that are harder to get right through intuition alone.

Onboarding Flows

What happens in a user's first few sessions shapes everything that comes after. Changes to the number of steps, the order of actions, or the point at which users are asked to commit to something can have measurable downstream effects on activation and retention metrics.

Pricing and Packaging Display

How you present pricing affects conversion without changing the underlying price. Tier ordering, anchoring, and the framing of free versus paid features are all worth testing for any team with a monetization surface, though the effects can take time to manifest.

Backend and Infrastructure

The most impactful experiments a product team can run are often invisible to users. A change to a ranking algorithm or a model prompt can affect user behavior just as much as a redesigned interface, and without a controlled experiment, the effect is nearly impossible to isolate.

Infrastructure and Performance

Performance improvements are generally good for users, but testing them as controlled experiments lets you quantify exactly how much they matter for the metrics you care about. Knowing which specific infrastructure investments moved conversion by 3% and which didn't gives teams a more reliable basis for deciding where to invest next.

Default Settings and Configurations

Most users never change defaults, which means the state you ship with has an outsized effect on how a feature gets used. Testing different default configurations is low-cost to implement and can meaningfully affect adoption and engagement.

Notification Timing and Content

Both the notification you send and what it says affect whether users engage with it. Testing send timing, message length, and the specific action you're prompting can improve open rates and click-through without increasing notification volume.

Product Features and Functionality

Beyond how a feature looks, you can test how it behaves. The results often reveal that users interact with the functionality in ways that don't align with the original design assumptions, which is useful information regardless of which variant wins.

Search and Discovery

Search ranking, autocomplete behavior, and filtering defaults all affect whether users find what they're looking for. Search is often a high-intent surface where small improvements in relevance or presentation directly affect conversion or engagement.

Algorithms and Ranking

Ranking and recommendation algorithms affect every user simultaneously, which makes them worth testing carefully. Small changes to the underlying logic can produce meaningful differences in engagement and retention that aren't visible until you measure them.

AI and ML Models

AI and ML models are particularly hard to evaluate without controlled experiments. A model that scores better on benchmarks doesn't always perform better in production, which makes A/B testing the only way to know for sure. Performance, quality, and speed are all worth testing, and even slight changes to a system prompt warrant in-depth testing.

Growth and Acquisition Surfaces

Growth and acquisition surfaces are where most teams first encounter A/B testing, and for good reason. The metrics are clear, the feedback loops are short, and the tests are relatively cheap to run compared to changes deeper in the product.

Email Campaigns

Subject lines, send timing, message length, preview text, and calls to action all affect whether users open, click, and convert. Email is one of the more forgiving surfaces for experimentation because tests are cheap to run and results come in quickly, making it a good starting point before moving into more complex product surfaces.

Paid Ads

Ad creative, copy, targeting parameters, and landing page destinations all affect cost per acquisition and return on ad spend. Testing these systematically rather than relying on platform optimization alone gives teams more control over what's actually driving performance and makes it easier to apply what you learn across campaigns.

Landing Pages

Landing pages connect acquisition and product, which makes them worth testing carefully. Headline copy, hero imagery, social proof placement, form length, and page structure all affect conversion, and improvements here affect the efficiency of every upstream acquisition channel.

Mobile App Stores (ASO)

App store listings are a testable surface that many teams overlook. Screenshots, preview videos, descriptions, and icon design all affect install rates, and both the App Store and Google Play offer native tools for running controlled tests on these elements.

Internal Tools and Systems

Most teams think of A/B testing as something you do on user-facing surfaces. Internal tooling is worth the same rigor. The workflows your team uses, the interfaces they navigate, and the systems that handle billing and support all affect business outcomes in measurable, improvable ways.

Billing Systems

When and how you charge users affects conversion, retention, and revenue in ways that aren't always intuitive. Credit charging timing, trial length, grace periods, and dunning flows are all worth testing, and the effects can be substantial even when the changes seem minor.

Customer Success

The interfaces and workflows your support team uses directly affect both resolution times and the experience customers receive on the other end. Testing different queue structures, response templates, or escalation flows can surface improvements that are invisible from the outside but meaningful to the people doing the work and the customers they're helping.

Dashboard and Reporting Interfaces

How data is presented to internal users affects the decisions they make. Testing different visualizations, metric groupings, or alert thresholds can improve how quickly teams identify issues and act on them.

Internal Search and Navigation

How employees find information and move through internal tools affects productivity in ways that are easy to underestimate. Testing search ranking, navigation structure, and information hierarchy in internal tools follows the same principles as product experimentation, just with a different user base.

Workflow and Process Design

Internal processes are testable too. Whether it's the order of steps in an approval flow, the default assignee for a task, or the trigger conditions for an automated action, small changes to how work moves through a system can have measurable effects on speed and accuracy.

Different Types of A/B Tests

Not all experiments are structured the same way. The standard A/B test is the right tool for most situations, but there are different types of A/B tests for different situations.

A/A Test

An A/A test runs two identical variants against each other. The purpose isn't to find a winner but to confirm your experimentation infrastructure is working correctly: check a number of metrics to confirm data is flowing correctly and that an equal number of users is being assigned to each variant. At a 95% confidence level, you should expect about 1 in 20 metrics to show a statistically significant result by chance alone.
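That 1-in-20 expectation is easy to verify by simulation. The sketch below (stdlib Python; the traffic numbers and seed are invented for illustration) runs repeated A/A comparisons through a standard two-proportion z-test and counts how often chance alone produces a "significant" result:

```python
import math
import random

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (success_a / n_a - success_b / n_b) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulate_aa_tests(n_tests=400, n_users=1000, base_rate=0.10,
                      alpha=0.05, seed=42):
    """Run identical variants against each other; count false positives."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both "variants" draw from the same underlying conversion rate
        a = sum(rng.random() < base_rate for _ in range(n_users))
        b = sum(rng.random() < base_rate for _ in range(n_users))
        if two_proportion_p_value(a, n_users, b, n_users) < alpha:
            false_positives += 1
    return false_positives / n_tests

rate = simulate_aa_tests()
print(f"Share of A/A tests 'significant' at p < 0.05: {rate:.1%}")
```

With a 0.05 threshold, the simulated false positive rate lands near 5%, which is exactly what a healthy experimentation pipeline should show.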

A/B/n test

An A/B/n test extends the standard A/B test to include multiple variants tested simultaneously against a single control. You evaluate several hypotheses in one experiment rather than running them sequentially. Each additional variant requires more units to reach significance, so population requirements scale with the number of variants. If you have enough traffic, though, multi-variant tests are a great way to accelerate learning.

Multivariate Test

A multivariate test changes multiple elements simultaneously and tests combinations of them. If you're testing two headlines and two button colors, a multivariate test runs all four combinations to understand not just which elements perform better individually, but how they interact. The tradeoff is that you need considerably more traffic than a standard A/B test, because the population is split across every combination.

Holdouts

A holdout test withholds a feature from a group of users after it has been fully rolled out to everyone else. The holdout group continues to see the old experience, which lets you measure the long-term effect on retention and engagement that takes time to manifest. A new onboarding flow might look neutral in a two-week test but show meaningful differences in retention at 90 days. Holdouts are also useful for measuring the cumulative effect of many experiments running simultaneously. By comparing the holdout group to the fully treated population over 3–6 months, you can measure the combined effect of all your experiments.

Statistical Approaches to A/B Testing

Most modern experimentation platforms, like GrowthBook, give you a choice between Bayesian and frequentist statistics. Both are good options but understanding the differences can help you decide which approach is best for you. 

Bayesian Statistics

Bayesian statistics handles hypothesis testing by expressing results as probabilities. Instead of a binary significant/not-significant decision, you get a probability distribution: what's the chance variant B is better than variant A, and by how much? This makes results easier to interpret and communicate to non-technical stakeholders. Bayesian methods can also incorporate prior beliefs about the metric being tested, helping avoid over-interpreting results from small samples.
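For a conversion metric, the probability that one variant beats another can be estimated by sampling from each variant's posterior distribution. Here's a minimal sketch using uniform Beta(1, 1) priors (stdlib Python; the conversion counts are invented for illustration):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=50_000, seed=7):
    """Estimate P(rate_B > rate_A) by Monte Carlo sampling from
    Beta posteriors, assuming a uniform Beta(1, 1) prior on each rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        # Posterior for a binomial rate with a Beta(1, 1) prior
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples

# Hypothetical results: 500/5000 conversions (control) vs. 560/5000 (variant)
p = prob_b_beats_a(500, 5000, 560, 5000)
print(f"P(B > A) = {p:.1%}")
```

The output reads exactly the way the section describes: a single probability ("there's a ~97% chance the variant is better") rather than a binary significant/not-significant verdict. An informed prior would replace the Beta(1, 1) parameters with counts reflecting historical conversion data.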

Benefits of Bayesian Statistics

  • Results are expressed as probabilities that are intuitive to act on, like "There's a 92% chance variant B is best."
  • The probability distribution shows the full range of likely outcomes, not just a point estimate.
  • Probabilities are well-suited for communicating results to non-technical stakeholders.
  • Using informed priors can help reduce uncertainty in smaller samples.

Drawbacks of Bayesian Statistics

  • Poorly calibrated priors can skew results, particularly with small sample sizes.
  • Not immune to peeking; stopping rules should be defined upfront and followed.

Frequentist Statistics

Frequentist statistics is the more traditional approach to hypothesis testing. It calculates the probability of observing your results if there were no real difference between variants. That probability is the p-value, and it gets compared against a predetermined significance threshold, typically 0.05.
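To make the mechanics concrete, here's a minimal sketch of that calculation for a conversion metric (stdlib Python; the conversion counts are invented for illustration). It returns both the p-value and a 95% confidence interval on the absolute lift:

```python
import math

def z_test_with_ci(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test, plus a 95% confidence
    interval for the absolute difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled standard error for the hypothesis test
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (p_b - p_a - 1.96 * se, p_b - p_a + 1.96 * se)
    return p_value, ci

# Hypothetical results: 500/5000 (control) vs. 570/5000 (variant)
p_value, ci = z_test_with_ci(500, 5000, 570, 5000)
print(f"p = {p_value:.3f}, 95% CI for lift: [{ci[0]:.3%}, {ci[1]:.3%}]")
```

Because the p-value falls below the 0.05 threshold and the confidence interval excludes zero, a frequentist analysis would call this result significant. Production stats engines layer more on top of this (sequential corrections, CUPED, multiple-metric handling), so treat this as the core idea rather than a full implementation.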

Benefits of Frequentist Statistics

  • Widely understood, with transparent math and familiar outputs.
  • Results are easy to audit or present in contexts where frequentist methods are the established standard.
  • Sequential testing can be used for continuous monitoring without inflating false positive rates.
  • A good fit when your team is more comfortable with p-values and confidence intervals.

Drawbacks of Frequentist Statistics

  • The binary nature of significance decisions can lead to misinterpretation. Teams also frequently misread "not significant" as "no effect" rather than "insufficient evidence to detect an effect."
  • Without sequential testing enabled, results are only valid if you don't peek before reaching a pre-determined sample size.

Concepts Shared by Both Bayesian and Frequentist Statistics

Despite their differences, Bayesian and frequentist statistics share many common concepts:

  • CUPED Variance Reduction: CUPED uses pre-experiment data to reduce noise in metric estimates, allowing you to detect an effect faster with the same sample size. It works alongside both frameworks.
  • Random Assignments: Random assignment is what makes an experiment causal. Violations (users assigned to multiple variants, or assignment correlated with the metric) can invalidate results regardless of which framework you use.
  • Statistical Significance and Confidence Level: Both approaches use a threshold to determine when a result is reliable enough to act on. In frequentist statistics this is the significance level, while in Bayesian statistics it's expressed as a probability threshold. In both cases, set the threshold before the experiment starts.
  • Statistical Power and Sample Size: Power is the probability of detecting a real effect when one exists. Most teams aim for 80% as a minimum. Before starting an experiment, both approaches benefit from a power analysis to determine the sample size you need to detect the effect you're looking for. Without one, you risk either stopping too early and acting on noise, or running longer than necessary. Power analysis is less prevalent in Bayesian statistics, but if you have a stopping criterion, computing your power to reach it is still valuable.
  • Peeking and False Positive Risk: Both approaches are susceptible to inflated false positive rates if you stop early based on favorable results. (GrowthBook's frequentist stats engine enables sequential testing to safely allow early stopping.)

Which Statistical Approach Should You Use?

Use Bayesian when you want probability-based results or have well-established priors that can reduce uncertainty in smaller samples.

Use Frequentist when results need to meet an established statistical standard, or when you want to enable sequential testing. 

Step-By-Step A/B Testing Process

How you plan and run a test determines whether the results can actually be trusted. Here’s the step-by-step process from developing a hypothesis all the way through to implementing a winning variant. 

Step 1: Research and Identify Opportunities

Good experiments start with a clear understanding of where the opportunity is. For product development teams, that usually means looking at where users drop off, where engagement is lower than expected, or where there's a meaningful gap between how a feature was designed to be used and how it's actually used.

Start with quantitative data like funnel drop-off rates and feature adoption rates to identify potential opportunities, then use qualitative data like user interviews, support tickets, and session recordings to better understand the situation.

How to Prioritize Experiments

Not every problem is worth testing. The best starting point is your team's current roadmap and goals. If you're focused on improving activation this quarter, test things that affect activation. Experiments that don't connect to what your team is actively solving are a distraction, however interesting the hypothesis.

Before committing, use an objective scoring system or prioritization framework like ICE to evaluate each opportunity:

  • Impact: How much could this improve the metrics your team cares about?
  • Confidence: How sure are you it will work, based on the data and research you have?
  • Ease: How much engineering effort does it require to implement and instrument?
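As a sketch, ICE scoring can be as simple as averaging the three ratings and sorting the backlog. The items and scores below are invented for illustration (some teams multiply the ratings instead of averaging; either works as long as you apply it consistently):

```python
# Hypothetical experiment backlog, each dimension rated 1-10
backlog = [
    {"idea": "Shorten onboarding to 3 steps", "impact": 8, "confidence": 6, "ease": 7},
    {"idea": "Rewrite pricing page copy",     "impact": 6, "confidence": 7, "ease": 9},
    {"idea": "New recommendation model",      "impact": 9, "confidence": 4, "ease": 3},
]

for item in backlog:
    # ICE score: simple average of the three 1-10 ratings
    item["ice"] = (item["impact"] + item["confidence"] + item["ease"]) / 3

ranked = sorted(backlog, key=lambda i: i["ice"], reverse=True)
for item in ranked:
    print(f'{item["ice"]:.1f}  {item["idea"]}')
```

Note how the high-impact recommendation model drops to the bottom: low confidence and high effort outweigh raw upside, which is exactly the discipline the framework is meant to enforce.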

Step 2: Form a Strong Hypothesis

A good hypothesis forces you to be specific about what you're changing, why you expect it to work, and how you'll know if it did.

A weak hypothesis sounds like: "Let's try a shorter onboarding flow." 

A strong one sounds like: "Reducing the onboarding flow from five steps to three will increase 7-day activation because users are dropping off at step three."

Here are a few more examples of weak and strong hypotheses.

Weak: The recommendation widget will increase sales.
Strong: Adding a personalized recommendation widget below the product description will increase average order value because users who see relevant product suggestions will add more items to their cart before checkout.

Weak: We will surface errors more clearly.
Strong: Replacing generic error messages with specific guidance on how to fix the issue will increase form completion rates, because users currently abandon forms at the error state without understanding what went wrong.

Weak: We think the pricing page is confusing.
Strong: Replacing the feature comparison table with a use-case based pricing guide will increase trial conversion, because users in exit surveys say they can't determine which plan is right for them.

Weak: Dark mode would probably improve engagement.
Strong: Adding a dark mode option to the dashboard will increase daily active usage among power users because power users spend more than 4 hours per day in the product and have requested this feature in support tickets.

Use this structure as a starting point for writing your own hypotheses:

[Specific change] will cause [measurable effect] because [reasoning based on research].

Step 3: Design Your Experiment

Most of the work in running a good experiment happens before you launch. The decisions you make at the experimental design stage will determine how useful your experiment is.

Define Your Measurement Criteria

Before you build anything, be clear on what you're measuring and why. Your primary metric should flow directly from your hypothesis. It's the specific effect you expect to see. If your hypothesis is that reducing onboarding steps will improve 7-day activation, then 7-day activation is your primary metric. 

  • Primary Metric: The single metric that determines whether the variant wins or loses, defined before the test starts and tied directly to your hypothesis. 
  • Secondary Metrics: Metrics that you’re not specifically trying to improve but may help you further understand your experiment's impact including related metrics and lagging indicators.
  • Guardrail Metrics: Metrics that you’re specifically not trying to hurt. 

Here’s what each metric might be for our onboarding experiment example. 

Hypothesis: Reducing the onboarding flow from five steps to three will increase 7-day activation rate by 15%, because users are dropping off at step three.

Primary Metric: 7-day activation rate

Secondary Metrics:
  • 30-day retention rate
  • Time to complete onboarding flow
  • Number of core features activated within 7 days

Guardrail Metrics:
  • Invited users (the number of referrals should at least stay the same compared with the control)
  • Support tickets related to onboarding confusion
  • Error rate on onboarding steps

Calculate Your Required Sample Size

The best way to ensure good decision making with experiments is to know how much data you need up front. Running an experiment without a sample size calculation is one way to end up not knowing if you can trust your results or when to end an experiment. Most modern experimentation platforms include a power calculator. You'll need four inputs:

  • Baseline Metric Value: Your current metric value, from your recent historical data. For conversion rates, this is a percentage; for continuous metrics like revenue or session duration, it's an average. In GrowthBook, we can compute this for you on historical data filtered down to your likely experiment population.
  • Minimum Meaningful Effect: The smallest improvement worth shipping. In other words, you don't need to detect anything smaller, because it wouldn't justify the extra sample size.
  • Confidence Level: Typically 95%
  • Statistical Power: Typically 80%, meaning a 20% chance of missing a real effect.

The calculator will tell you how many units you need per variant. Divide by your average daily volume of that unit to get your required duration. That might be daily active users, daily email sends, or accounts, depending on what you're randomizing on. Make sure you're calculating based only on the population that meets your targeting criteria, not your total user base.

Many tests should run for at least two full business cycles, typically two weeks minimum, to account for day-of-week behavior patterns even if you reach your sample size sooner.

Designing for Trustworthy Results

Experiment implementation is a crucial part of running a clean causal experiment and learning what you actually set out to learn. 

  • Test one feature at a time so results can be attributed to a specific cause.
  • Ensure random, equal traffic distribution between variants.
  • Keep everything else identical between versions.
  • Think upfront about which user segments might respond differently to the change and whether they should be tested separately.
  • Account for novelty effect. Users sometimes behave differently simply because something is new, which can cause early results to look better than they are.
  • Document the experiment in your log before launch, including hypothesis, metrics, targeting criteria, implementation details, and expected end date.

Step 4: Set Up Your Experiment and Validate the Implementation

Before you launch, validate that your experiment is configured correctly. Problems caught here are easy to fix. Problems caught after two weeks of bad data are not.

  • Confirm the events you need are firing correctly and consistently across platforms and devices.
  • Run an A/A test if you're setting up a new experimentation platform or making changes to your assignment logic. 
  • Check that both variants are free from bugs and function as expected. 

Step 5: Launch and Monitor

Once your experiment is live, your job is mostly to leave it alone. The temptation to check results early is real, especially when there's pressure to ship, but acting on interim results is one of the most common ways teams produce conclusions they can't trust.

Monitor only for:

  • Technical Errors or Bugs: If something is broken, stop the test and fix it.
  • Guardrail Metric Violations: If an important metric is getting meaningfully worse, it may be worth stopping early regardless of significance.
  • Sample Ratio Mismatch: An uneven traffic split is a signal that something is wrong with your assignment logic.

Everything else can wait until the test reaches its required sample size. If you need the flexibility to act on results before that point, enable sequential testing.
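A sample ratio mismatch check is simple enough to sketch. For a two-variant test with a planned split, a normal approximation to the binomial works; the function name and counts below are illustrative, and the p < 0.001 alarm threshold is a common convention rather than a universal rule:

```python
from math import sqrt
from statistics import NormalDist


def srm_p_value(n_control, n_variant, expected_share=0.5):
    """Two-sided test: does the observed split deviate from the planned split?"""
    n = n_control + n_variant
    expected = n * expected_share
    z = (n_control - expected) / sqrt(n * expected_share * (1 - expected_share))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 10,000 control users vs 10,600 variant users under a planned 50/50 split
if srm_p_value(10_000, 10_600) < 0.001:  # conventional SRM alarm threshold
    print("Sample ratio mismatch: check your assignment logic before trusting results")
```

A 3% imbalance sounds small, but at this volume it is far too large to be random chance, which is exactly why SRM is such a sensitive canary for assignment bugs.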

Step 6: Analyze Results Properly

When your experiment reaches its required sample size, resist the urge to declare a winner immediately. Good analysis goes beyond the binary question of whether the variant beats the control.

Some metrics require additional waiting time even after the experiment ends. For example, if you're measuring 7-day activation, you need to wait seven days after the last user was exposed before you can analyze that metric. Build this into your timeline upfront.

When it’s time to analyze the results:

  • Confirm the experiment ran as designed and reached the sample size needed to power it properly.
  • Check the confidence level against your predetermined threshold. A result that doesn't reach 95% confidence isn't automatically worthless. If a variant shows a 70% chance of being best with no meaningful guardrail violations and low implementation cost, many teams will ship it.
  • Verify there was no sample ratio mismatch that could invalidate the results.
  • Check secondary and guardrail metrics to confirm the variant didn't improve the primary metric while quietly harming something else.
  • Analyze results by key segments to check whether the overall result is hiding meaningful differences between groups.
  • Look at practical significance along with statistical significance. Statistical significance on its own doesn't tell you if this was a big win or a small win; you can learn a lot about what worked or didn't by looking at the lift directly, and considering how it compares to the cost of building and maintaining this feature.
  • Document regardless of outcome. Losses are often where the most learning happens. Take the time to understand why your users behaved in a way you didn't expect.
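For a conversion-rate primary metric, the core significance check is a two-proportion z-test, reported alongside the lift so practical significance stays in view. A sketch with made-up figures (not any particular platform's statistics engine):

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing conversion rates of control (a) and variant (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = (p_b - p_a) / p_a
    return lift, p_value


# 500 of 10,000 control conversions vs 565 of 10,000 variant conversions
lift, p = two_proportion_test(500, 10_000, 565, 10_000)
# report the relative lift (practical significance) alongside the p-value
```

Returning both numbers together is deliberate: a tiny but statistically significant lift and a large but noisy lift call for very different shipping decisions.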

Step 7: Implement and Iterate

Every experiment produces an outcome worth acting on, even when your hypothesis is proven wrong.

  • If your new variant wins, fully implement it and monitor post-launch performance. Use what you learned to sharpen the next hypothesis, a winning experiment often reveals opportunities for further improvement. 
  • If your new variant loses, roll back to your control. A losing experiment is still valuable. Analyze why your hypothesis was wrong, what the data suggests about user behavior, and whether a different approach is worth testing. Document the result so the same test doesn't get run again six months later by a different team.
  • If the result is inconclusive, iterate a few times with a learning mindset. An inconclusive result usually means one of three things: the sample size wasn't large enough, the effect is smaller than your minimum detectable effect, or there genuinely isn't a meaningful difference between the variants.

Advanced A/B Testing Strategies 

Once the fundamentals are in place, these advanced techniques can create additional value as your program matures and the questions you're trying to answer get harder.

CUPED 

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses pre-experiment metric data to improve the accuracy of your results. By accounting for pre-existing differences between users before the experiment starts, it reduces the noise in your estimates, meaning you can detect smaller effects with the same traffic, or reach the same level of confidence faster.

GrowthBook's implementation extends CUPED with post-stratification, which uses user attributes like country or plan tier to further reduce variance by isolating the treatment effect from natural differences between groups. The more correlated your pre-experiment data and attributes are with the metric you're measuring, the more variance reduction you'll see.

The main requirement is that you have pre-experiment data for the metric you're testing. It works best for metrics that are frequently observed (engagement rates, session counts, revenue) and is less effective for new users or rare events where there's little pre-experiment history to draw on.

Example: Netflix reported CUPED reduced variance by roughly 40% for some key engagement metrics. Microsoft reported it was equivalent to adding 20% more traffic for a majority of metrics on one product team.
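At its core, CUPED regresses each user's in-experiment metric on their pre-experiment value and subtracts the explained part. A minimal sketch on simulated data; the helper below illustrates the idea and is not GrowthBook's implementation:

```python
import random


def cuped_adjust(post, pre):
    """Subtract the part of the metric explained by pre-experiment behavior."""
    mean_pre = sum(pre) / len(pre)
    mean_post = sum(post) / len(post)
    cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    var = sum((x - mean_pre) ** 2 for x in pre)
    theta = cov / var
    return [y - theta * (x - mean_pre) for x, y in zip(pre, post)]


random.seed(0)
pre = [random.gauss(10, 3) for _ in range(5_000)]    # pre-experiment sessions per user
post = [0.8 * x + random.gauss(2, 1) for x in pre]   # correlated in-experiment metric
adjusted = cuped_adjust(post, pre)
# variance(adjusted) is far smaller than variance(post), so the same traffic
# can detect a much smaller treatment effect
```

The adjustment leaves the mean untouched, so the treatment effect estimate is unchanged; only the noise around it shrinks.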

Quantile Testing

Most A/B tests compare means across variants, which works well when the effect is evenly distributed across users. Quantile testing compares percentiles instead, making it the right tool when you care about what's happening at the extremes. A change that improves average page load time by 50ms might look neutral on a mean test while actually fixing a severe performance problem affecting your slowest 1% of users.

The main consideration is sample size. Extreme quantiles (P99, P99.9) require large samples to produce reliable estimates. It also works best when you have a clear hypothesis about which part of the distribution you're trying to move.

Example: An engineering team testing a backend optimization uses a P99 latency metric to confirm the change reduced worst-case load times by 7ms, even though the mean improvement was too small to detect.
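The idea is easy to see on simulated latency data where a fix trims only the extreme tail. The numbers here are invented for illustration:

```python
import random
from statistics import mean, quantiles

random.seed(7)
control = [random.lognormvariate(5, 0.6) for _ in range(50_000)]  # latency in ms
treatment = [min(x, 500) for x in control]                        # fix caps the worst tail


def p99(xs):
    """99th percentile via the standard library's quantile cut points."""
    return quantiles(xs, n=100)[98]


mean_shift = mean(control) - mean(treatment)  # a few ms: easy to miss on a mean test
p99_shift = p99(control) - p99(treatment)     # large: the tail is where the fix lives
```

A mean comparison would call this change nearly neutral, while the P99 comparison shows exactly what the fix accomplished for the worst-affected users.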

Multi-Armed Bandits

A multi-armed bandit is an adaptive experiment that shifts traffic toward better-performing variants as data comes in, rather than maintaining a fixed split throughout. Unlike a standard A/B test, which waits until the end to declare a winner, a bandit continuously reallocates traffic based on which variant is performing best on a single decision metric. GrowthBook uses Thompson sampling, a Bayesian algorithm that balances exploration (testing all variants) with exploitation (sending more traffic to the best performer).

Bandits work best when you have a clear single metric to optimize, five or more variants to test, and care more about minimizing exposure to poor-performing variants than understanding why each one performed the way it did. They're less suited to situations with long feedback loops, multiple goal metrics, or where statistical rigor matters more than speed.

Example: An ecommerce team testing five different product page layouts uses a bandit to automatically shift traffic toward the best-performing variant. This allows them to quickly capitalize on a winner during time-sensitive, days-long promotions like a Black Friday sale, while also reducing the number of users exposed to lower-converting layouts.
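Thompson sampling itself is only a few lines: keep a Beta posterior per arm, draw once from each, and serve the arm with the highest draw. A toy simulation with invented conversion rates (not GrowthBook's production algorithm):

```python
import random


def thompson_pick(successes, failures):
    """Draw a conversion-rate belief per arm; serve the arm with the highest draw."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return draws.index(max(draws))


random.seed(42)
true_rates = [0.030, 0.032, 0.028, 0.045, 0.031]  # five layouts; index 3 is best
wins, losses = [0] * 5, [0] * 5
for _ in range(20_000):
    arm = thompson_pick(wins, losses)
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
# traffic concentrates on the best-performing arm as evidence accumulates
```

Early on the Beta draws overlap heavily, so all arms get traffic; as data accumulates, the weaker arms' draws rarely win and their traffic share falls away.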

Cluster Experiments

Most experiments randomize at the user level, but some products require randomization at a coarser level of granularity. In B2B software, for example, you might need everyone at a company to see the same experience. Showing different variants to different users within the same organization would create confusion and contaminate results. Cluster experiments solve this by randomizing at the group level (the organization, the school, the household) while still analyzing outcomes at the individual level.

The main challenge is that cluster-level randomization reduces your effective sample size. You're randomizing across a smaller number of clusters than individual users, which means you need more clusters to reach significance. GrowthBook supports cluster experiments natively; its statistics engine handles the statistical complexity of analyzing at a different level than the one you randomize at.

Example: A B2B SaaS team testing a new dashboard layout randomizes at the organization level so every user within a company sees the same variant, then analyzes individual user engagement to measure impact.
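The sample-size penalty can be quantified with the standard design-effect formula, 1 + (m − 1) × ICC, where m is the average cluster size and ICC is the intraclass correlation (how similar users within a cluster behave). The numbers below are illustrative:

```python
def effective_sample_size(n_users, avg_cluster_size, icc):
    """How much cluster randomization shrinks your effective sample size."""
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n_users / design_effect


# 50,000 users in organizations of ~25, with modest within-org correlation
n_eff = effective_sample_size(50_000, 25, 0.05)  # about 22,700 effective users
```

Even a modest ICC of 0.05 cuts the effective sample size by more than half here, which is why cluster experiments need noticeably more clusters than a naive user-level power calculation would suggest.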

Full-Funnel Testing

Most experiments measure a single metric at a single point in the user journey. Full-funnel testing measures the effect of a change across multiple stages, from initial conversion through to retention, revenue, and long-term engagement. This matters because a change that looks positive at the top of the funnel can have neutral or negative downstream effects that a single-metric test would miss entirely.

The main requirement is having metrics instrumented across the full user journey and enough traffic to detect meaningful differences at each stage. It also requires patience — downstream metrics like 30-day retention take time to manifest, which means full-funnel tests run longer than standard conversion tests.

Example: A team testing 7-day versus 14-day free trial lengths measures not just trial starts but 30-day conversion to paid, finding that the longer trial increased signups but reduced urgency to convert, producing a net negative revenue impact.

Long-Term Holdouts

Individual experiments measure the impact of a single change. Long-term holdouts measure the cumulative impact of all your changes over time. A small group of users is withheld from new features and experiments for an extended period, typically a quarter, while the rest of the product moves forward. Comparing the holdout group to the general population reveals the true long-term value of everything you shipped, including any unexpected interactions between features that individual tests couldn't detect.

The main tradeoff is that a small percentage of users (typically around 5%) experience a degraded product for the duration. 

Example: A product team runs a quarterly holdout and discovers that the cumulative lift from five experiments, each with a 1% lift, is only 3% relative to the holdout group because of diminishing returns.

Incorporating Research

A/B tests tell you what happened, but they can’t tell you why. Combining quantitative experiment results with supplemental research (user interviews, session recordings, surveys, usability testing) gives you both. A variant that wins on conversion but generates support tickets is a signal worth investigating. A variant that loses might reveal through user interviews that the hypothesis was right but the execution was wrong.

Supplemental research is most valuable at two points: before an experiment, to sharpen the hypothesis, and after an inconclusive or surprising result, to understand what the data couldn't tell you and help generate your next hypothesis.

Example: A team runs an experiment on a new onboarding flow that produces an inconclusive result. User interviews reveal that users understood the new flow better but felt uncertain about committing without seeing the product first, leading to a new hypothesis worth testing.

Common A/B Testing Mistakes (and How to Avoid Them)

Every experimentation program makes mistakes. Learning to recognize common experimentation pitfalls is the first step to not repeating them.

1. Running Experiments Without Big Enough Samples

An underpowered experiment is one that doesn't have enough units to reliably detect the effect you're looking for. It happens when teams skip the power analysis and launch tests without knowing whether their population size or expected traffic is sufficient. Without enough data, you'll either get an inconclusive result or one that looks real but isn't stable enough to act on.

Example: Running tests on pages with fewer than 1,000 visitors per week.

Solution: Focus on the highest-traffic pages or make bolder changes that require smaller samples to detect.

2. Testing Changes That Are Too Small

A change that's too small to produce a detectable effect is a change that's too small to test. It happens when teams focus on incremental tweaks (like a slightly different button shade or a minor copy change) rather than changes that are likely to meaningfully affect user behavior. When these tests do reach significance, the effect size is often too small to justify implementing, and every underpowered test on a trivial change is a test you didn't run on something that could actually move your metrics.

Example: Testing the order of navigation menu dropdowns when users don't understand what your product does.

Solution: Match the boldness to your traffic volume. Smaller sample sizes need bigger swings.

3. Stopping Tests Early

Stopping a test before it reaches its required sample size is one of the most common ways teams produce results they can't trust. It happens when interim results look promising, and there's pressure to ship. The numbers seem to confirm the hypothesis, so stopping feels justified. The problem is that early data is noisier than final data, and a result that looks significant at day five may look very different at day twenty. Stopping early inflates your false positive rate, meaning you'll ship changes that don't actually work.

Example: Ending tests as soon as the p-value hits 0.05.

Solution: Predetermine the sample size and duration, and stick to them. If you need the flexibility to monitor continuously, enable sequential testing.

4. Ignoring External Factors

External factors are events or conditions outside your product that affect user behavior during an experiment. It happens when teams run tests during atypical periods like a seasonal sale, a major product launch, or a news cycle, without accounting for how those conditions might skew results. A winning variant during an unusual period may reflect the context more than the change itself, and applying those results year-round can lead to poor decisions.

Example: Testing during Black Friday and assuming results can be replicated year-round.

Solution: Note external factors and retest important changes during normal periods before permanently implementing them.

5. Shopping Metrics for Significant Results

Metric shopping is when teams run an experiment against many metrics and report whichever ones show significance after the fact. It happens when teams don't define their primary metric before the experiment starts, leaving the door open to interpret results selectively once the data comes in. The more metrics you test, the more likely you are to find a false positive by chance, and a result that emerges from fishing through metrics is not a result you can act on confidently.

Example: Testing 20 metrics, hoping one shows significance.

Solution: Choose your primary metric before starting. Treat others as directional insights rather than stable conclusions.

6. Shipping a Winner Without Checking Segment Performance

Aggregate results can hide meaningful differences in how different groups of users respond to a change. It happens when teams declare a winner based on overall performance without breaking results down by segment. A new checkout flow might increase conversion overall but frustrate returning users who've built habits around the old one, or perform well on desktop while degrading the experience on mobile. Shipping without checking may help some users at the expense of others.

Example: The overall winner performs worse for important segments.

Solution: Always analyze results by key segments, like new versus returning users or mobile versus desktop, before implementing.

7. Ignoring Implementation Cost

Not all winning variants are worth shipping. It happens when teams evaluate experimental outcomes purely by metric lift, without accounting for the actual cost of building and maintaining the change. A variant that requires significant refactoring, introduces dependencies on other systems, or creates an ongoing maintenance burden may not be worth implementing even if the results are strong. The lift needs to justify not just the initial build but the long-term cost of owning the change.

Example: A lead categorization model can improve onboarding success, but implementing it requires rebuilding the underlying data model and all downstream dependencies.

Solution: Factor implementation cost into your hypothesis prioritization.

8. Optimizing for Short-Term Metrics

A test can show an increase in conversion while masking longer-term damage. It happens when teams optimize for metrics that look good in a two-week experiment window without considering what happens to users afterward. Dark patterns that trick users into taking actions they might not otherwise take can boost immediate conversion rates while increasing refund rates, reducing retention, and eroding brand trust over time. A confused user and a genuinely converted user can look identical in the short term.

Example: An edtech team's test that shortened a mandatory course tutorial showed a 20% gain in time to first lesson completion but decreased the final exam pass rate by 7%.

Solution: Use guardrail metrics to catch downstream damage. If a winning variant hurts retention or drives up support volume, it's not a real win.

9. Running One Test and Moving On

Experimentation compounds over time, but only if teams treat it as a continuous practice rather than a series of isolated events. The mistake happens when shipping a winner feels like the finish line. The variant performed better, the change ships, and the team moves on to the next project without asking what they learned or what to test next. A single experiment answers a single question, but the real magic comes from using that answer to sharpen the next hypothesis, and the next one after that.

Example: A new recommendation algorithm increases platform engagement, but no one investigates which content types drove the increase in order to refine it further.

Solution: Create a regular testing cadence with iteration built in.

10. Not Documenting Results

Without a record of what was tested, what the hypothesis was, and what the outcome was, institutional knowledge disappears when people leave, and teams waste time re-running experiments that have already been answered. It happens when documentation is treated as an afterthought rather than part of the process.

Example: A variant everyone was confident would win loses. A few months later, a different team has the same idea and runs the same test.

Solution: Maintain a searchable experiment archive and share learnings broadly.

How to Build a Culture of Experimentation

Winning a single A/B test is straightforward. The harder work is building an organization where evidence, not opinion, drives decisions consistently across teams, product areas, and levels of seniority.

An experimentation culture means controlled experiments are the default way of resolving uncertainty. Companies with strong experimentation cultures recognize that their win rate is only around 20% and that they are bad at predicting which features will win or lose. That insight creates humility and a determination to test everything. They recognize that the results of a test carry more weight than the instinct of the most senior person in the room, and they treat losing an experiment as useful information rather than a failure.

What a Strong Experimentation Culture Looks Like in Practice

In his book Experimentation Works, Harvard Business School Professor Stefan Thomke identifies seven attributes that characterize organizations where experimentation is genuinely embedded. They're worth understanding not as a checklist but as a description of what the mature state actually looks like.

  1. A Learning Mindset: Experimentation is treated as a continuous process, not a one-time validation. Most experiments won't produce dramatic results, and teams that have internalized this don't treat inconclusive results as wasted effort. They treat them as the cost of learning.
  2. Rewards Consistent With Values: Teams are rewarded for running good experiments, not just winning ones. When compensation is tied to metrics that make experimentation difficult, or when people are punished for null results, the culture quietly dies regardless of what leadership says about it.
  3. Humility: In a true experimentation organization, even the most senior person's assumptions get tested. Leadership's job shifts from making top-down calls to creating the conditions for good experiments and accepting what they find.
  4. Experiments Have Integrity: Strict guidelines govern how experiments are designed and run. This means pre-registered hypotheses, appropriate sample sizes, and agreed-upon metrics before the experiment starts, not after the results come in.
  5. Tools are Trusted: Experimentation only works if people trust what it produces. If teams routinely question the validity of results or find workarounds to avoid acting on them, the infrastructure exists, but the culture doesn't.
  6. Exploration and Exploitation are Balanced: There's an inherent tension between running experiments to learn and shipping product to grow. Organizations that only exploit what they already know stop learning. Organizations that only explore never ship. Senior leadership has to manage that balance deliberately.
  7. Leadership Actively Promotes It: Companies tend to become less innovative as they grow, as the distance between senior leadership and the teams doing the work increases. Experimentation cultures require leaders who actively champion the practice, not just endorse it in all-hands presentations.

Where Does Your Team Sit on the Experimentation Maturity Model?

Thomke and his colleagues describe five stages of experimentation maturity. These are a great way to honestly assess where your organization is and what it would take to move forward.

Stage 1: Awareness

At the awareness stage, leadership values experimentation, but there are no processes, tools, or infrastructure in place. Decisions are still mostly based on experience and intuition. If your team occasionally runs an experiment when a decision is particularly contested, this is probably where you are.

Stage 2: Belief

At the belief stage, leadership accepts that a more disciplined approach is needed and starts investing in tools and dedicated teams. The impact on day-to-day decision-making is still minimal, but the direction is set.

Stage 3: Commitment

At the commitment stage, experimentation becomes core to how the team operates. Some product decisions and roadmap calls now require data from experiments, and the impact on business outcomes is becoming measurable. 

Stage 4: Diffusion

At the diffusion stage, large-scale experimentation is recognized as necessary, and formal standards are rolled out across the organization, supported by tooling and training. Individual teams are no longer the bottleneck.

Stage 5: Embeddedness

At the embeddedness stage, experimentation is fully democratized. Teams design and run their own experiments without central oversight, results are shared automatically across the organization, and the institutional memory of past experiments actively informs new ones.

When Harvard Business School's Baker Research Services compared the stock performance of companies with strong experimentation cultures against the S&P 500 over ten years, those companies outperformed the index by a wide margin. The group included Amazon, Etsy, Facebook, Google, Microsoft, Booking Holdings, and Netflix, organizations that had spent years building the infrastructure and culture for large-scale experimentation.

The Future of A/B Testing

Experimentation is evolving fast. The tools and techniques available today look very different from what existed five years ago, and the next five years will likely bring even more significant shifts. A few trends worth paying attention to:

AI-Powered Testing

AI coding tools are accelerating development velocity in ways that are changing how product teams need to think about experimentation. When engineers can ship features faster, the volume of changes hitting production increases. More features shipping faster means more opportunities for something to hurt retention, conversion, or engagement before anyone catches it. Gradual rollouts and a rigorous experimentation practice matter more as shipping velocity increases.

There's also an entirely new category of things to test. Teams building AI-powered features like recommendation systems, content generation tools, and AI tutors face a challenge that standard A/B testing wasn't designed for. LLMs are non-deterministic: the same input doesn't always produce the same output, and measuring quality requires different metrics than measuring clicks or conversions. Testing whether one model prompt produces better learning outcomes than another requires an experimentation platform that can handle that kind of measurement.

GrowthBook's approach is to accelerate every step of the experimentation lifecycle from directly within the tools developers already use. AI integrations built into the platform include automatic results summaries, hypothesis validation before a test launches, similar experiment detection using vector embeddings, metric definition generation, and SQL generation for data exploration. MCP integration lets you connect your own tools and agents directly to GrowthBook via the MCP server.

Learn more about how to test AI with this practical guide.

Real-Time Personalization

Traditional A/B testing delivers the same experience to everyone in a variant. The next evolution is moving beyond fixed variants toward delivering the optimal experience for each individual user in real time, based on their behavior, context, and predicted response. Multi-armed bandits are an early version of this idea, but the direction is toward much more granular personalization.

Causal Inference

As experimentation programs mature, teams are increasingly using advanced statistical methods to understand cause and effect more precisely, particularly in situations where traditional randomized experiments are difficult or impossible to run. Techniques like difference-in-differences, synthetic control, and instrumental variables are becoming more accessible to product and data teams.

Cross-Channel Orchestration

Many experimentation programs are siloed by channel. A web team runs web experiments, a mobile team runs mobile experiments, and the combined effect of changes across both is rarely measured. The direction is toward experimentation infrastructure that can orchestrate and measure tests across web, mobile, email, and other touchpoints simultaneously.

Privacy-First Experimentation

Privacy regulations and the deprecation of third-party tracking are forcing experimentation platforms to adapt. The approaches gaining traction are those that minimize data movement, work with aggregated rather than individual-level data, and can operate within strict compliance requirements. Platforms that support self-hosting are well-positioned for this shift.

How to Get Started with A/B Testing

Getting started with A/B testing doesn't require a mature experimentation platform or a dedicated data science team, but the setup decisions you make early will either accelerate or constrain your program as it grows.

Get Your Instrumentation Right

Before you can run reliable experiments, you need confidence that the metrics you care about are being tracked correctly and consistently. This means checking that your event logging is complete, that events fire consistently across platforms and devices, and that your data pipeline is reliable.

Skipping this step is one of the most common reasons early experimentation programs produce results no one trusts. A test result is only as good as the data behind it, and discovering instrumentation gaps after a test has run is a frustrating way to learn that lesson.

Start With One Test

Pick a high-traffic surface where you have a clear hypothesis and a metric you can measure. Don't start with the most complex change or the most ambitious idea. Start with something where the feedback loop is short, the instrumentation is straightforward, and you have a reasonable chance of seeing a result. Early wins help build organizational buy-in. Run the test for long enough to reach your required sample size, analyze the results honestly, and document what you learned regardless of the outcome.

Chances are, your team already has hypotheses worth testing. Look at support tickets, session recordings, and drop-off points in your funnel. If you want external inspiration, resources like GoodUI.org and the Baymard Institute publish evidence-based UX patterns that can serve as a starting point for simple but effective test ideas.

Find a Leadership Sponsor

Experimentation programs that stick have a senior champion: someone with enough organizational influence to protect the team's time, push back when results are inconvenient, and make the case for investing in the infrastructure. Without one, a single bad test result or a quarter of inconclusive experiments is enough to kill the program before it gets traction.

It’s ok if your sponsor isn’t technical, but they need to believe that making decisions based on evidence is worth the investment, and be willing to say so publicly when the HiPPO in the room disagrees with the data.

Go Deeper

One of the most useful things you can do early is learn from teams that have already built mature experimentation programs.

  • Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu: The most rigorous and practical book on running experiments at scale, written by the people who built experimentation programs at Microsoft, Google, and LinkedIn.
  • Experimentation Works by Stefan Thomke: This book takes a broader look at building an experimentation culture, grounded in research across dozens of organizations.
  • GrowthBook Docs: Detailed guides covering everything from getting started to advanced statistical methods, with a practical guide to scaling experimentation at your company.

Join the Community

Having access to people who have already solved the problems you're facing is one of the most underrated resources in experimentation, and the experimentation community loves to share knowledge and help each other out. 

  • Trustworthy A/B Patterns: GrowthBook is partnering with industry pioneers Ronny Kohavi, Lukas Vermeer, and Jakub Linowski to offer e-commerce companies with over 1 million monthly active users free expert assistance in designing and executing high-impact A/B tests in exchange for the right to publish the results.
  • GrowthBook Slack: A free Slack community where you can ask questions, share learnings, and connect with other teams running experiments from those just getting started to those running robust programs at scale. 
  • Test & Learn Community: A free community of over 2,000 practitioners across experimentation, product, analytics, and research. Members meet regularly on Zoom to discuss topics, hear from industry leaders, and help each other solve real problems. 

How to Choose an A/B Testing Platform

The experimentation platform you choose now will shape your experimentation program for years. It determines what you can test, how fast you can move, and how much you can trust your results. And because experimentation infrastructure becomes deeply embedded in your codebase and data pipelines over time, switching platforms is expensive and disruptive enough that most teams avoid it, so it's worth getting right the first time.

Technical Fit and Developer Experience

The platform needs to work with how your team already builds. A tool that requires significant engineering lift to integrate, or doesn't support your tech stack, will create friction from day one and limit who can actually run experiments.

  • Target Use Cases: Was this platform built for product and engineering teams or marketing-led CRO? The answer shapes everything from the SDK architecture to the statistical methods available, and a tool designed for visual editing and landing page optimization will quickly run into issues testing algorithms or server-side features.
  • SDK Coverage: The platform should slot into your stack without significant backend engineering every time a new test is created. Look for SDKs that evaluate locally with no network requests (keeping performance impact minimal and ensuring experiments work regardless of connectivity) and that cover your full stack. GrowthBook offers 24+ SDKs covering frontend, backend, mobile, and edge environments, working with virtually any stack.
  • Feature Flag Integration: The best experimentation platforms combine feature flags and experiments in a single tool. This lets you use the same flag to run an experiment, do a phased rollout, and kill a change if something goes wrong, without switching between systems.
  • Integration Complexity: How long does it take to instrument your first experiment? A good platform should have clear documentation and a quick start path that doesn't require backend engineering every time a new test is created.
  • Scalability: Can it handle your traffic volume without degrading performance or requiring you to limit how much of your user base is exposed to experiments? Per-event pricing models can create pressure to under-test at scale.
  • Environment and Release Management: Does it support separate staging and production environments, and can you roll out changes incrementally without redeploying code?
  • AI and MCP integration: Does the platform support AI-assisted workflows so your team can work smarter and faster, or support MCP so you can integrate your own tools and agents?

Statistical Rigor and Data Ownership

The platform needs to produce results you can actually trust, and that means being transparent about how the statistics work.

  • Statistical Transparency: Can you see the methodology behind the results? Look for platforms that support both Bayesian and frequentist approaches, publish their stats engine openly, and don't hide their calculations behind proprietary black boxes.
  • Warehouse-Native Analysis: The best platforms let you analyze data directly in your existing warehouse (Snowflake, BigQuery, Redshift, Databricks) rather than requiring you to send data to a third-party system. This means your experiment data lives alongside your product data, you define metrics using SQL you control, and there's no duplicate data pipeline to maintain.
  • Managed Warehouse: If you don’t have your own data warehouse, some vendors offer a pre-configured one. This lets you start on day one with an industry-standard data warehouse without needing to set one up or build your own data pipelines.
  • Metrics Definition and Governance: Who defines the metrics and how? Look for platforms that let your data team define metrics centrally using your own data definitions, rather than forcing you to redefine them inside the tool.
  • Data Ownership: When you stop using the platform, do you keep your experiment history and learnings? Proprietary platforms that hold your data hostage create switching costs that go beyond the tool itself.
  • Targeting and Segmentation: Can you randomize at the level that makes sense for your product (user, account, organization, session) and analyze results by segment without the platform limiting how you slice the data?

Security, Compliance, and Deployment Options

Security and compliance requirements vary widely across industries, but the cost of getting this wrong is high regardless. Data residency issues, compliance violations, and PII exposure can all stem from choosing a platform that wasn't built with these constraints in mind.

  • Self-Hosting: Can you run the platform on your own infrastructure? Cloud-only platforms create data residency issues for teams with strict compliance requirements and require you to send user data to a third-party system. Self-hosting gives you full control over where your data lives.
  • Privacy and PII handling: Does the platform require you to send personally identifiable information to its servers to run experiments? Look for platforms where experiments are assigned locally, with no user data leaving your infrastructure.
  • Open Source vs Proprietary: Open-source platforms allow you to audit the code, customize the platform to your needs, and avoid vendor lock-in, but they require engineering resources for maintenance.
  • Compliance and Regulatory Requirements: If you operate in healthcare, financial services, education or other regulated industries, the platform you choose should support your compliance requirements out of the box.

Accessibility and Collaboration

Experimentation only scales when the whole team can participate. A platform that creates friction for non-technical users will limit how often experiments get run and who benefits from the results.

  • Ease of Use for Non-Technical Teams: Can a product manager set up and launch an experiment without engineering support? Look for intuitive interfaces, clear result summaries, and workflows that don't require SQL or statistics knowledge to navigate.
  • Result Sharing and Reporting: How easy is it to share experiment results with stakeholders? Look for shareable dashboards, exportable reports, and result summaries that translate statistical outcomes into plain language.
  • Experiment Documentation: Does the platform make it easy to document hypotheses, decisions, and learnings in a way that's searchable and accessible to the whole team? A searchable experiment archive is one of the most valuable things an experimentation program can build over time.
  • Permissions and Governance: As your program grows, you need the ability to control who can create, approve, and ship experiments. Look for role-based permissions and approval workflows that you can tailor to how your organization actually operates.

Pricing and Total Cost of Ownership

With experimentation platforms, the sticker price is rarely the full cost. How a platform charges you shapes how much you can experiment, and the wrong pricing model can quietly constrain your program as it grows.

  • Pricing Model: Does the platform charge per event or based on traffic volume? These models create a direct conflict between running more experiments and controlling costs, often forcing teams to test on a fraction of their traffic to avoid overage fees. Look for predictable pricing that doesn't penalize you for growing.
  • Build vs Buy: Are you better off building or buying an experimentation platform? Building in-house gives you full control, but most companies underestimate the complexity and risk involved, often until they're already committed, and the opportunity cost of those engineers not working on the product is rarely factored in.
  • Modular vs All-in-one Pricing: Some platforms charge separately for server-side, client-side, and feature flag capabilities. What starts as one tool quickly becomes multiple SKUs with compounding costs.
  • Switching Costs: What happens if you outgrow the platform or want to move? Proprietary data formats, locked-in experiment history, and deep SDK integrations all make switching painful. Factor this into your evaluation upfront rather than after you've committed.

Why GrowthBook

GrowthBook is the open-source feature flagging and experimentation platform built for product and engineering teams. It's used by over 3000 companies, from early-stage startups running their first experiments to enterprises processing billions of feature flag evaluations per day. Here’s why you should consider using GrowthBook:

  • No per-traffic or per-event pricing, so you can run experiments on as much of your traffic as you want without watching costs balloon. Learn more about GrowthBook pricing
  • Analysis runs directly on your existing data warehouse (Snowflake, BigQuery, Redshift, Databricks), with no need to send data to a third-party system.
  • The platform is open source, your data stays yours, and you can self-host if you need full control.
  • Feature flags and experimentation are unified in a single platform, so you're not juggling separate tools for rollouts and tests. 
  • The stats engine supports both Bayesian and frequentist approaches, CUPED, post-stratification, sequential testing, and advanced techniques like cluster experiments and holdouts.
  • Built-in tools for experimentation culture and deep insights: a searchable experiment archive, shareable dashboards, and an interface designed for the whole team, not just data scientists.
  • With 24+ SDKs covering frontend, backend, mobile, and edge environments, it works with virtually any stack.

You can start for free and scale from there. The free tier gives you everything you need to run your first experiments, while the Enterprise plan adds advanced features like holdouts and the governance tools that mature programs need.

Start A/B Testing the Right Way

A/B testing doesn't replace judgment, but it gives judgment something solid to work with. The teams that get the most out of it aren't running the most experiments. They're asking sharper questions, defining better metrics, and building enough rigor into their process that results can actually be trusted.

That's harder to build than it sounds. But the organizations that do it stop having the same arguments about what to ship. They stop reverting changes based on noise. They stop leaving product decisions to whoever made the most compelling case in the last meeting.

In 2026, the companies pulling ahead are the ones replacing guesswork with evidence.

How Khan Academy Optimizes AI Tutoring with Experimentation
Experiments
Platform

Mar 22, 2026

Kelli Hill gave a standout presentation at The Conference known as Experimentation Island on February 24, 2026, walking the audience through Khan Academy's evolution from intuition-based testing to running A/B tests on generative AI features in production. If you missed it, the good news is Kelli will be joining us for a webinar on April 16, 2026. I'd highly encourage you to register here. Below are my key takeaways from her talk.

A Quick Word on Khan Academy

Khan Academy is a nonprofit with a mission to provide a free, world-class education for anyone, anywhere. They have nearly 200 million registered users and have logged over 63 billion learning minutes on their platform. In 2023, they launched Khanmigo, a generative AI-powered tutor and teaching assistant built on top of their massive library of exercises, articles, and instructional content. Khanmigo is the focus of much of their current experimentation work, and the context for everything Kelli shared.

From Homegrown to a Real Experimentation Stack

Khan Academy has been running experiments since 2011, when they built their first in-house platform on Google App Engine. At their peak, they had hundreds of A/B tests running simultaneously. But over time, the homegrown system slowed down, and when they rewrote their entire backend in 2019 (a million lines of code, migrating off Python 2), they made a deliberate decision not to port their old experimentation tooling.

Instead, they evaluated what was available. Building a new platform in-house was tempting, but they recognized that experimentation infrastructure wasn't their core competency. Buying an enterprise solution would have required downsampling their data, which was a non-starter. They ultimately chose GrowthBook, self-hosting it and connecting it to their existing data warehouse and eventing pipelines. Their chief architect's top priority was that the tool not slow down a site serving a million daily active users, and GrowthBook delivered on that.

The lesson here is one we see repeatedly: organizations that try to build their own experimentation platform almost always end up spending more than expected, moving slower than they'd like, and eventually switching to something purpose-built. Khan Academy's journey is a textbook case of making that transition well.

How Evals Evolved from Vibes to Automated A/B Testing

The most fascinating part of Kelli's talk was the four-phase journey Khan Academy went through to figure out how to measure AI quality. When you're building an AI tutor, you can't just measure click-through rates. The goals are harder: 

  • Increased cognitive engagement
  • An increase in skills on their way to proficiency 
  • Measurable learning gains on external assessments. 

And LLMs make measurement even harder because they're non-deterministic. The same prompt can produce wildly different outputs each time.

How Khan Academy evolved their AI evaluation techniques

Phase 1: Intuition-driven testing. In September 2022, before ChatGPT had even launched publicly, OpenAI gave Khan Academy early access to GPT-4 via Slack. The team's first experiments were literally typing prompts into Slack and reading the outputs. They quickly discovered problems (GPT-4 confidently told a user that 9 + 5 = 15, then gave the correct answer ten minutes later). Good enough for building intuition about how LLMs behave, but not for building a product.

Phase 2: Structured manual testing. With a deadline to launch alongside GPT-4's public announcement in March 2023, they built an internal prompt playground for more repeatable testing. Faster than Slack, but still relied on humans to read outputs and judge quality.

Phase 3: Automated post-hoc evals. This is where things got serious. They assembled a team of PhDs in education to define what good tutoring actually looks like, then had human raters apply that rubric to chat transcripts, targeting 85% inter-rater agreement. Once they had that ground-truth dataset, they used it to train an LLM-as-judge to label transcripts at scale. The key insight: many teams spin up LLM-as-judge systems with no ground truth, resulting in unreliable results. Khan Academy invested in the hard work of human annotation first. Once the machine matched human accuracy, they scaled it to process thousands of interactions nightly.

Phase 4: A/B testing in production. With reliable automated evals in place, they could finally run controlled experiments on prompt changes, system instructions, and even entire model swaps, all measured against metrics like cognitive engagement, item performance, undesirable tutoring behaviors (like giving away answers), and latency as a guardrail. This is the stage they're in now, with 64 completed experiments, 29 running, and 13 queued as of February 2026.

The takeaway: as AI products mature, your evaluation methods need to mature with them. You can't skip straight to production A/B testing without the foundation of knowing what "good" looks like.

The Math Agent Story: What Iterative AI Experimentation Actually Looks Like

Kelli shared a concrete example that perfectly illustrates how A/B testing enables teams to "hill climb" toward better AI quality. The problem: Khanmigo had a math agent, essentially a calculator it could call to verify computations. Great for accuracy, but it added latency that was painful in classroom settings.

Here's how the iterations played out:

Iteration 1: Remove the math agent entirely. Latency improved, but math errors doubled. Rolled back immediately.

Iteration 2: Switch to GPT-5. Latency decreased, but math accuracy still suffered. Rolled back.

Iteration 3: Optimize the math agent's prompts. They tightened the system instructions to be more efficient. Latency dropped by three seconds, and math accuracy held. A real win.

Iteration 4: Give the math agent a faster model. Reduced latency by another 300 milliseconds with stable accuracy.

Iteration 5: Time-box the math agent's execution. Further latency reduction, accuracy still stable.

Without A/B testing, the team might have shipped Iteration 1 or 2 and unknowingly degraded the learning experience. The experiments gave them the confidence to reject changes that looked good on one metric but failed on the one that mattered most. This is what "hill climbing" looks like in practice: hypothesis, test, measure, iterate. No single change was transformative. The cumulative effect was.

From Speed Bump to Safety Net: The Cultural Shift

Perhaps the most important takeaway from Kelli's talk was about culture. Before Khanmigo, experimentation at Khan Academy was seen as a speed bump. Product teams wanted to ship based on strong founder intuition and internal conviction. Running an A/B test felt like an obstacle to velocity.

Generative AI changed that completely. LLMs are unpredictable enough that even small changes to prompts or system instructions can produce dramatically different outputs. Teams quickly learned that shipping without testing was genuinely risky. The same engineers who once resisted experimentation now actively request it.

Experimentation went from being perceived as something that slows you down to being the safety net that gives teams the confidence to move fast. That cultural transformation, more than any individual experiment result, may be the most valuable outcome of Khan Academy's journey.

Want to hear the full story directly from Kelli? She'll be joining us for a live webinar on April 16, 2026.

Feature Flags: What They Are, How They Work, and Why They Matter
Feature Flags

Mar 2, 2026

It’s Friday, quarter to 5:00 PM. Your team deploys a major checkout redesign to production. Within minutes, error rates start spiking.

Your Slack’s on fire and your CEO is asking a ton of questions. Next thing you know, you’re staring down a long night of reverting commits and explaining what happened.

Now imagine the same scenario with one change. You disable the feature in 10 seconds with a single click, without redeploying code. 

That’s the difference between deploying code and deploying code behind a feature flag.

In this guide, we’ll cover what feature flags are and how product and engineering teams can use them successfully.

What Are Feature Flags?

A feature flag is a conditional mechanism in your code that lets you toggle application behavior at runtime, without deploying new code.

You wrap a feature in a flag, deploy it in an “off” state, and turn it on when you’re ready. If something goes wrong, you turn it off, which rolls back the deployed feature.

Feature flags are also referred to as feature toggles or feature switches, but they all describe the same mechanism. At their simplest, feature flags are if/else blocks that check a configuration value to decide which code path runs:

const newCheckout = useFeatureIsOn("new-checkout");

if (newCheckout) {
  return <NewCheckoutFlow />;
} else {
  return <LegacyCheckout />;
}

You deploy this code with the feature turned off, so everyone sees the legacy checkout. When you’re ready, you flip the flag on to release the feature, and flip it back off if something goes wrong or when you’re done testing.

How Do Feature Flags Work?

Feature flag systems have three main components that work together to give you complete control over your features:

  1. Flag Configuration
  2. Flag Delivery
  3. Flag Evaluation

1. Flag Configuration

Flag configuration is where you define your flags and the rules that govern them. That can be as simple as a config file or as sophisticated as a dedicated feature management platform with a user interface (UI), audit logs, and role-based access.

Here’s an example configuration for a new-checkout flag:

{
  "new-checkout": {
    "defaultValue": false,
    "rules": [
      {
        "condition": {
          "beta": true
        },
        "force": true
      }
    ]
  }
}


This config enables the new checkout only for beta testers, so you can test the new flow with a small set of users before rolling out the final version.

Note: If you’re wondering why you can’t set environment variables for this, it’s because those variables require redeployment to change. Flags don’t.

2. Flag Delivery

Once you’ve defined your flags, the configuration needs to reach your application. This can be through an included file at build time, API calls, streaming updates, or a mix of all three.

The method you choose decides how fast the changes propagate. That’s why platforms like GrowthBook deliver via server-sent events (SSE) to push changes immediately. The changes come through within milliseconds.
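To make the push model concrete, here's a minimal in-memory sketch of the idea: subscribers are notified the moment a flag value changes, instead of polling for updates. (Illustrative only; real SSE delivery happens over HTTP, and the class below is not part of any SDK.)

```javascript
// Minimal sketch of push-based flag delivery (illustrative, not an SDK API).
// Subscribers receive new flag values the moment they change, instead of polling.
class FlagStore {
  constructor() {
    this.flags = {};
    this.listeners = [];
  }
  subscribe(listener) {
    this.listeners.push(listener);
  }
  set(key, value) {
    this.flags[key] = value;
    // Push the change to every subscriber immediately
    this.listeners.forEach((fn) => fn(key, value));
  }
}
```

The polling alternative works too, but propagation is only as fast as your polling interval, which is why streaming delivery matters for kill switches.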

3. Flag Evaluation

The final component is where your application actually resolves a flag’s value. The SDK (or your custom code) takes the user’s attributes and evaluates them against the flag’s rules. Based on that, the SDK returns the appropriate value.

// Beta user
const gb = new GrowthBook({
  attributes: { id: "user_123", beta: true }
});
gb.isOn("new-checkout"); // Returns: true

// Regular user
const gb2 = new GrowthBook({
  attributes: { id: "user_456", beta: false }
});
gb2.isOn("new-checkout"); // Returns: false (uses defaultValue)

In this example, the beta user’s attributes match the rule, so the flag evaluates to true and the new checkout renders. For the regular user, no rule matches, so the flag falls back to the default value of false.

Note: When a flag is disabled (not just set to false, but turned off entirely), most platforms, including GrowthBook, evaluate it as null rather than false. If you’ve provided a fallback value, the SDK returns that instead. A boolean check like isOn() won’t throw an error; it simply treats the null as false.

What Are the Types of Feature Flags?

Not all feature flags serve the same purpose, and they aren’t meant to live for the same length of time either.

We can categorize feature flags along four axes:

  1. Their time span
  2. Their purpose
  3. Their scope
  4. Their value type

Here are the types of feature flags:

| Feature flag type | Lifespan | Primary purpose | Value type | Scope | Retirement |
| --- | --- | --- | --- | --- | --- |
| Deployment | 2–4 weeks | Testing and debugging code | Boolean | System-level | Yes, after testing code |
| Release | 2–8 weeks | Progressive rollout of new features | Boolean | User-level | Yes, always |
| Experiment | 2–6 weeks | A/B/n testing and variation assignment | String or JSON | User-level | Yes, after shipping the winner |
| Ops / Kill switch | Long-lived | Emergency control, runtime config | Boolean or number | System or user-level | No, adjust as needed |
| Permission | Permanent | Access control by tier, role, or geography | Boolean | User-level | No, part of business logic |

By Time Span

There are two types of feature flags based on when or if you retire them:

  1. Short-lived flags: These toggles exist for days to weeks. You create them for a specific purpose (for instance, to ship a feature or run a test) and remove them when you’re done. These flags are a common source of technical debt because people tend to forget to clean them up.
  2. Long-lived flags: These flags live in your codebase for months or permanently. They’re part of your app’s ongoing behavior. For example, you might need a kill switch to turn off a part of the app or an entitlement flag to control which users see certain features.

By Purpose

Feature flags can be classified into five types based on usage:

1. Release Flags

You can use release flags to control the rollout of new features during the release process.

if (gb.isOn("new-checkout-flow")) {
  return <NewCheckoutFlow />;
}


return <LegacyCheckout />;

A typical lifecycle looks like this:

  1. Create the flag
  2. Test with internal users
  3. Run a progressive rollout (5%, 25%, 50%, 100%)
  4. Confirm everything is stable
  5. Remove the flag and the old code path entirely

Most release flags should live for 2 to 8 weeks. If yours has been around longer, it’s time to clean up. Platforms like GrowthBook include stale feature flag detection to surface flags that haven't been evaluated recently, so you know which ones are overdue for cleanup.
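Extending the configuration format shown earlier, a staged rollout can be expressed as a rule that covers only a fraction of traffic. This is a sketch; exact key names vary by platform (GrowthBook's feature rules, for example, use a coverage field):

```json
{
  "new-checkout-flow": {
    "defaultValue": false,
    "rules": [
      {
        "force": true,
        "coverage": 0.25,
        "hashAttribute": "id"
      }
    ]
  }
}
```

Bumping coverage from 0.05 to 0.25 to 0.5 to 1.0 walks through the rollout stages above without any code changes.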

2. Experiment Flags

You should use experiment flags to assign users to variations for A/B testing. The main difference is that consistent assignment matters because the same user should always see the same variation for both UX consistency and accurate measurement. Platforms like GrowthBook handle this automatically by hashing the user's ID, so assignment is stable without any extra work on your end.

For A/B testing, it’s also critical that users are randomly assigned to the control and variant groups.

const variant = gb.getFeatureValue("checkout-cta-experiment", "control");
// Returns "control", "variation_a", or "variation_b"

switch (variant) {
  case "variation_a":
    return <Button>Complete Purchase</Button>;
  case "variation_b":
    return <Button>Place Order</Button>;
  default: // control
    return <Button>Buy Now</Button>;
}

Typically, you’ll leave these on for as long as your experiments run—usually 2 to 6 weeks. But it depends on your traffic volume and the level of statistical power required. Once you’ve shipped the winning variant, remove the flag.

3. Operational Flags

Operational flags control system behavior and provide emergency shutoffs. Think about cases like circuit breakers, graceful degradation under load, and runtime configuration changes that don’t warrant a full deployment.

if (gb.isOn("enable-recommendation-engine")) {
  recommendations = await fetchRecommendations(userId);
}

const cacheTimeout = gb.getFeatureValue("redis-cache-ttl", 300);
const rateLimit = gb.getFeatureValue("api-rate-limit", 1000);

These flags are mostly permanent in nature. You can trigger them manually during incidents or automatically through monitoring or feature flagging systems. A kill switch is an excellent example of such a flag.

4. Permission or Entitlement Flags

Permission flags control feature access based on subscription tier, user role, geography, or account status. These depend on the business logic and aren’t necessarily used for development or testing purposes.

// Flag targeting evaluates user's plan attribute
if (gb.isOn("advanced-analytics")) {
  return <AdvancedAnalyticsDashboard />;
}

return <BasicAnalyticsDashboard />;

A permission flag evaluates the user’s attributes (like their plan tier) to determine access, but your application database remains the source of truth for those attributes. The flag doesn’t store data; it just evaluates conditions.

A warehouse-native platform like GrowthBook can evaluate these attributes directly from your existing data without requiring data duplication or schema changes. Your warehouse is already the source of truth for plan tiers and user roles, so you don’t have to duplicate them in another platform.

Tip: Always evaluate entitlement flags server-side using verified data. If you do this client-side, the user can inspect your flag configuration in the browser’s dev tools and potentially change it to bypass controls.
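Here's a minimal sketch of that server-side pattern. getUserFromDb and the plan names are hypothetical stand-ins; the point is that the entitlement decision uses data the client can't tamper with.

```javascript
// Sketch: evaluate an entitlement server-side against trusted data.
// getUserFromDb is a hypothetical stand-in for a real database lookup.
const USERS = {
  user_123: { plan: "enterprise" },
  user_456: { plan: "free" },
};

function getUserFromDb(userId) {
  return USERS[userId] ?? { plan: "free" };
}

function canUseAdvancedAnalytics(userId) {
  // Never trust plan attributes supplied by the client
  const user = getUserFromDb(userId);
  return user.plan === "enterprise" || user.plan === "pro";
}
```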

5. Development Flags

Development flags are usually used to turn a feature on or off to test and debug code. These are short-lived, and you should turn them off after completing the QA or testing process.

By Scope

Depending on the scope, you can categorize flags into two types:

  1. System-level flags: These flags affect your entire application uniformly. A kill switch that disables a service for all users, or a config flag that changes your cache TTL globally. These don’t care who the user is—they’re binary for the whole system.
  2. User-level flags: These flags evaluate differently per user based on their attributes—for example, user ID, plan tier, geography, device type, or behavioral signals. User-level flags are used in targeting, rollout, and experimentation because these are based on user attributes. Let’s say you’re launching a 10% rollout, you’re hashing user IDs to consistently assign each person to a cohort.
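The hashing mentioned in the second bullet can be sketched in a few lines. This uses a simple FNV-1a hash purely for illustration (platforms like GrowthBook use their own validated hashing); the important property is that the same user ID always lands in the same bucket, so a 10% rollout exposes a stable 10% of users.

```javascript
// Deterministic bucketing sketch (FNV-1a hash, illustrative only).
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep 32 bits
  }
  return hash;
}

// Map a user to a stable bucket in [0, 100), then check the rollout percentage.
function inRollout(userId, percent) {
  const bucket = fnv1a(userId) % 100;
  return bucket < percent;
}
```

Because the bucket is derived from the user ID rather than stored anywhere, increasing the percentage keeps everyone who was already in the rollout inside it.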

By Value Type

Value types describe what a flag returns. Most flags start as simple booleans, but as your use cases mature, you'll reach for more advanced types, such as:

  1. Boolean flags: These flags return true or false. This is the default for most feature flags, such as on/off toggles and kill switches. If you’re wrapping a feature in a flag for the first time, this is where you start.
  2. String flags: These flags return a text value. Use these when you need to serve different variations of content like button text in an A/B test ("Buy Now" vs. "Add to Cart"), or a theme identifier ("dark" vs. "light").
  3. Number flags: These flags return a numeric value. They’re useful for tuning runtime parameters such as cache TTLs, rate limits, pagination sizes, and retry counts without redeploying.
  4. JSON flags: These flags return structured data. A single JSON flag can control an entire component's behavior. For instance, returning { "layout": "grid", "rows": 10, "showFilters": true } to configure a UI layout without deploying new code. They’re also useful for complex experiment variations where each variant needs multiple parameters, or for configuration bundles that you want to manage as a single unit.
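One practical detail with JSON flags: merge the flag value over a hard-coded default, so a missing or disabled flag (which may evaluate to null, as noted earlier) still yields a complete configuration. A small sketch, with illustrative key names:

```javascript
// Sketch: safe handling of a JSON flag value (key names are illustrative).
const DEFAULT_LAYOUT = { layout: "list", rows: 5, showFilters: false };

function resolveLayout(flagValue) {
  // Spread the flag value over the defaults: null/undefined falls back entirely,
  // and any missing keys are filled in from DEFAULT_LAYOUT.
  return { ...DEFAULT_LAYOUT, ...(flagValue ?? {}) };
}
```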

Who Uses Feature Flags and For What Purpose?

Even though feature flags started out as a developer practice, they’re no longer limited to the engineering team. With the right platform, even non-technical users can work with them.

Here’s how that works:

Technical Teams

  1. Developers and engineers: Development teams implement feature flags in code, manage rollouts, and use kill switches during incidents. The goal is to deploy with confidence by making every release reversible. They’re also responsible for maintaining and cleaning up unused or old flags.
  2. QA and testing: These teams use flags to validate features in production with real data, traffic patterns, and third-party integrations. Since staging environments can never fully replicate those conditions, feature flags allow them to get a sense of what will actually happen when the feature is live.
  3. DevOps and site reliability engineering (SRE): These teams rely on operational flags for circuit breakers, infrastructure migrations, and system configuration changes. For instance, if a service degrades, they can disable non-critical features to preserve core functionality.
  4. Data analysts: Analysts use flags to launch experiments, create targeting rules (who will be part of the experiment), and randomly assign users to variations. When a feature is launched as an experiment, analysts get clean, randomized data in their warehouse: assignment records alongside behavioral events, without extra instrumentation work.
  5. Security and compliance: These teams audit flag changes to maintain a record of who released what, to whom, and when. Features like approval workflows and audit logs matter the most so that they can access that information. Also, if new regulatory requirements take effect, they can disable non-compliant features immediately.

Business and GTM Teams

  1. Product managers: Product teams use feature flags to control release timing and manage beta programs. Feature flags give product managers autonomy to ship code when the business is ready, not just when the code is. They also help data science teams with experimentation—for example, when they need to test how features perform with different audiences.
  2. Marketing: Typically, product marketing teams use flags to time feature releases to campaigns or run promotional experiments. Personalization is another key use case where they offer curated experiences based on audience (user attributes).

Note: While feature flags help non-technical users time feature releases, not every feature flagging platform is intuitive enough for them to use. Consider a platform like GrowthBook that lets non-technical team members create and manage feature flags without writing code or filing engineering tickets.

What Are the Benefits of Using Feature Flags?

Most product and engineering teams adopt feature flags to address a specific problem. Usually, it’s a painful deployment that went sideways. But there are several benefits of using feature flags, including:

Decouple Deployment from Release

Feature flags break the assumption that deploying code means releasing a feature. Your main branch can contain unreleased features safely wrapped in flags. And engineers can merge continuously without worrying about exposing incomplete work.

In short, your engineering team can deploy 10 times a day while releasing features weekly, or at whatever cadence the business needs. This is particularly beneficial when different teams contribute to a feature. For example, if the back-end team delivers new functionality ahead of the front-end team, it checks that code in behind a feature flag instead of keeping it in a branch.

Enable Instant Rollbacks

When things go wrong (and they will, usually at the worst possible time), you can disable a feature immediately. You don’t have to deploy new code or revert commits, so you recover from incidents much faster. Without a feature flag, engineering teams are often forced to create a new build that removes the buggy code while keeping the stable features from the previous build. That can be a painful, time-consuming process, especially if the bug is hurting the live customer experience.

In fact, the State of DevOps 2024 report found that only 19% of engineering organizations recover from failed deployments in less than an hour. These “elite” teams tend to focus on continuous delivery practices, which are usually enabled by feature flags.


Reduce Risk with Progressive Rollouts

Instead of releasing to everyone at once, you can start small. First, roll out to 5% of users and monitor key metrics like error rates and performance. If everything looks good, gradually increase to 25%, then 50%, then 100%. If issues arise, the blast radius is limited to a small subset of users.

Test in Production Safely

Staging environments never perfectly mirror production. They lack real user behavior, real data volumes, and real traffic patterns, so decisions based on staging tests alone are educated guesses.

For instance, if you’re testing a new payment processor integration, the staging environment can’t replicate the complexity of real payment flows or peak traffic loads. But if you use feature flags, you can test it with real user transactions, which gives you concrete data on what’s working (and not).

Increase Team Velocity

Velocity is a second-order benefit of feature flags: nobody waits on shared release windows anymore, so each team deploys code when it makes sense for them. Over time, teams ship faster and with more confidence.

In fact, research from DORA shows that higher deployment frequency correlates with higher software quality and stability. Feature flags make that deployment frequency safe by enabling continuous delivery.

Enable Trunk-Based Development

Long-lived feature branches are a tax on your engineering team. They diverge from the main branch and accumulate merge conflicts over time, which causes more issues the longer they live.

That’s why sophisticated engineering teams have started adopting trunk-based development. 

In this method, they merge incomplete code to main behind a flag where it’s deployed but never executed. So you get the benefit of continuous integration without the risk of shipping unfinished features to users.

Build a Foundation for Experimentation

Once you can control who sees what, the next question is: which version is actually better?

Feature flags give you the ability to test that. While they act as the delivery mechanism, experiments act as the measurement layer.

const variant = gb.getFeatureValue("checkout-cta-text", "Buy Now");
// Returns: "Buy Now", "Purchase", or "Add to Cart"
// Assignment is stable per user for valid experiment results

Together, they move your team from “We shipped it and hope it works” to “We shipped it, measured it, and know it works.”

What Are the Use Cases of Feature Flags?

Here are the most common use cases of feature flags for product and engineering teams:

Release Management

You wrap a new feature in a flag, deploy it to production in an off-state, and progressively roll it out. You can use it for:

  • Internal dogfooding with your team
  • Beta access for a select group of users
  • 5% canary release to catch issues early
  • Gradual ramp to 25%, 50%, 100%

At each stage, you monitor metrics and can halt or roll back if problems appear. This transforms launches from high-stakes events into controlled, iterative processes.

Kill Switches and Operational Control

Sometimes the most important thing a feature flag does is turn something off. When an incident happens, a kill switch lets you respond in seconds, which drastically reduces your mean time to recovery (MTTR).
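The pattern is simple enough to sketch in a few lines. This is a minimal illustration with a hypothetical in-memory flag store and made-up feature names, not a real SDK:

```javascript
// Minimal kill-switch sketch. `flags` stands in for your flag platform;
// in production, ops flips the value from a dashboard, not in code.
const flags = { "recommendations-enabled": true };

function isOn(key) {
  return flags[key] === true;
}

function renderHomepage() {
  const sections = ["core-feed"]; // core functionality always renders
  if (isOn("recommendations-enabled")) {
    sections.push("recommendations"); // non-critical: safe to drop under load
  }
  return sections;
}

// During an incident, flipping the flag off takes effect on the next request:
flags["recommendations-enabled"] = false;
```

The key property is that disabling the non-critical section requires no deploy, no revert, and no restart.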

Infrastructure Migrations

Big-bang deployments are becoming a thing of the past. You don’t need a whole ceremony to move from one database to another.

Let’s say you’re migrating from PostgreSQL to CockroachDB. Route 1% of read queries to the new database and monitor its performance. If everything looks good, ramp up to 10%, then higher, until the migration is complete.
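Under some assumptions (an FNV-1a hash for bucketing, illustrative function and database names rather than a real driver API), the routing logic could look like this sketch:

```javascript
// Sketch of percentage-based read routing for a database migration.
// Hashing the query key keeps each key on a consistent side as you ramp up.
function bucket(key) {
  // FNV-1a 32-bit hash, reduced to a 0-99 bucket
  let h = 2166136261;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % 100;
}

// newDbPercent follows the ramp schedule: 1 -> 10 -> 50 -> 100
function pickReadTarget(queryKey, newDbPercent) {
  return bucket(queryKey) < newDbPercent ? "cockroachdb" : "postgresql";
}
```

Because the bucket is deterministic, a key routed to the new database at 10% keeps going there at 50%, which makes performance comparisons between the two sides meaningful.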

A/B Testing and Experimentation

Feature flags are the natural foundation for experimentation. Once you can consistently assign users to different feature variations, you can measure which version performs better with statistical rigor.

This is becoming especially relevant for teams building AI and GenAI features. When your recommendation engine uses a large language model (LLM) or your search results rely on an embedding model, you can’t just eyeball whether the new version is better.

You need controlled experiments with guardrail metrics and feature flags that provide the infrastructure to run them in production safely.

Personalization and Targeting

Feature flags let you deliver different experiences based on user attributes, geography, device type, or behavioral signals. You don’t need to maintain separate codebases for each attribute because the targeting rules handle the variation.

Target users based on specific attributes in GrowthBook

Entitlements and Access Control

If you run a multi-tier SaaS product, feature flags can manage which plans can access certain features. For example, you can automatically offer a premium integration for Enterprise users when they upgrade.
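A minimal sketch of plan-based gating, with hypothetical plan names and feature keys. In a real flag platform you would express this as targeting rules on the flag rather than a hard-coded map:

```javascript
// Entitlements sketch: which plans can access which features.
// Plan names and feature keys are illustrative.
const entitlements = {
  free:       { "premium-integration": false, "audit-logs": false },
  pro:        { "premium-integration": false, "audit-logs": true },
  enterprise: { "premium-integration": true,  "audit-logs": true },
};

function hasFeature(user, featureKey) {
  // Unknown plans fall back to the most restrictive tier
  const plan = entitlements[user.plan] || entitlements.free;
  return plan[featureKey] === true;
}
```

When a customer upgrades, changing their plan attribute unlocks the features immediately, with no code change per account.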

Also, if you need control over your data, use a feature flag platform that’s self-hosted or air-gapped. That way, your flag evaluation data never leaves your network, which helps you stay compliant with regulations like HIPAA, GDPR, and SOC 2.

Refactoring Code

Feature flags reduce the risk of large-scale refactors by letting you run old and new implementations side by side. Route 5% of traffic to the refactored code path, compare outputs and performance against the original, and gradually shift over once you’re confident.

This is especially useful during monolith-to-microservices migrations, where you can flag-control which service handles each request and roll back individual routes without reverting the entire migration.

Compliance and Regulatory Control

Regulatory requirements change, sometimes quickly. Feature flags let you respond without waiting for a development cycle.

When a new data protection rule takes effect, you can disable a non-compliant feature across affected jurisdictions immediately. When your compliance team needs a four-eyes approval process for production changes, approval workflows on flag modifications implement that principle directly.

What Advanced Feature Flagging Strategies Can You Use?

Once you’re comfortable with basic on/off flags, you’ll quickly run into situations where a simple toggle isn’t enough. You need to roll out to a specific percentage of users. Or target enterprise accounts in a particular region. 

These strategies build on each other. Here are a few examples:

Percentage Rollouts with Persistent Assignment

Percentage rollouts let you gradually release a feature to a random sample of users (5%, then 25%, then 50%) while monitoring for issues at each stage. The critical detail is persistent assignment.

When a user lands in the 10% cohort, they need to stay there as you ramp up to 50% and eventually 100%. Most platforms handle this by hashing the user’s ID against the flag key, which produces a consistent, deterministic assignment without storing state.
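Here’s a sketch of that idea using an FNV-1a hash. This is illustrative, not GrowthBook’s exact algorithm:

```javascript
// Deterministic percentage assignment: hash(userId + flagKey) -> bucket 0-99.
// No state is stored; the same inputs always produce the same bucket.
function bucketFor(userId, flagKey) {
  const input = `${userId}:${flagKey}`;
  let h = 2166136261; // FNV-1a 32-bit offset basis
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 16777619); // FNV prime
  }
  return (h >>> 0) % 100;
}

function inRollout(userId, flagKey, percent) {
  return bucketFor(userId, flagKey) < percent;
}
```

Because the bucket never changes, every user inside the 10% cohort is automatically inside the 50% cohort when you ramp up. Hashing on the flag key as well means different flags get independent random samples.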

Release features using percentage rollouts in GrowthBook

Use percentage rollouts when you’re releasing a new feature and want to limit your blast radius. If something breaks at 5%, you’ve affected 5% of users.

Force Rules and Complex Targeting

Force rules let you target specific user segments based on combinations of attributes. For example, geography, device type, account age, company name, subscription tier, or any custom property you pass to your SDK.

Target specific user segments using Force Rules in GrowthBook

For example, you might want to enable a feature for enterprise accounts in Australia with an account age greater than three months. Or certain tax rules might only apply in a few countries.
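The first example could be written as a simple predicate over user attributes. Attribute names here are illustrative; in practice you’d define this as a declarative condition in your flag platform rather than in application code:

```javascript
// Force-rule sketch: enterprise accounts in Australia older than three months.
function matchesRule(attributes) {
  return (
    attributes.plan === "enterprise" &&
    attributes.country === "AU" &&
    attributes.accountAgeDays > 90 // roughly three months
  );
}
```

Users who match the rule get the forced value; everyone else falls through to the flag’s other rules or its default.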

Safe Rollouts with Guardrail Metrics

A safe rollout combines a percentage rollout with automatic metric monitoring. You define guardrail metrics like page load time, click rate, conversion rate, error rate, revenue per user, or whatever matters for this feature. And the system watches them as you ramp up.

If guardrails breach your thresholds, the rollout automatically reverses. The feature goes back to 0% while you investigate.
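Conceptually, the control loop looks something like this sketch, with made-up metric names and thresholds. Platforms like GrowthBook automate this check for you:

```javascript
// Guardrail-driven rollout sketch: ramp up on healthy metrics, revert on breach.
const guardrails = [
  { name: "error_rate",     max: 0.01 }, // abort above 1% errors
  { name: "p95_latency_ms", max: 800 },  // abort above 800ms p95
];

function nextRolloutPercent(current, metrics) {
  const breached = guardrails.some((g) => metrics[g.name] > g.max);
  if (breached) return 0; // auto-rollback while you investigate

  const steps = [5, 25, 50, 100];
  return steps.find((s) => s > current) ?? 100; // advance to the next step
}
```

The important property is that the unhealthy path always wins: a single breached guardrail sends the rollout back to 0%, no matter how far along it was.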

Monitor metrics in real time when using percentage rollouts in GrowthBook

Multi-Environment Flag Management

Your new checkout feature might need to be:

  • Always on in development (so your team can build against it)
  • 50% rollout in staging (to test the progressive rollout logic itself)
  • Off in production (not ready for customers yet)

This is where the relationship between projects, environments, and SDK connections matters. In GrowthBook, projects are the top-level organizational units (e.g., your mobile app vs. your web app). 

Within each project, you have environments (production, staging, development). Each flag can have different values and rules per environment, and the SDK connection determines which flags your application actually receives.

When Should a Company Adopt Feature Flags?

The short answer: earlier than you think. Most teams wait until they’ve been burned by a botched deployment or a release that broke critical functionality. By then, everything you do is reactive, and you’re retrofitting flags into a codebase that’s already complex.

If you’re seeing these signals, it’s definitely time to adopt feature flags:

  • Every deployment feels high-stakes: Everyone’s on Slack watching dashboards, ready to hit rollback. If deploying makes your team nervous, you have a release process problem that flags can solve.
  • Rollbacks take hours to complete: If your recovery time is measured in hours, a single toggle would have saved you time.
  • Multiple teams are blocked on release windows: “We can’t ship until Backend deploys” shouldn’t be a weekly conversation. Using feature flags helps you decouple these dependencies.
  • You can’t test with real production traffic: If you don’t have a way to expose features to real users in a controlled way before launching them, you’re guessing.
  • Product decisions are based on opinions: You want to run A/B tests but lack the infrastructure to do so. In these cases, feature flags act as the delivery mechanism to make experimentation possible.
  • You’re growing the engineering team: As your team grows, so does its deployment complexity. It’s easier to coordinate releases with two engineers, but when you add more to the mix, the room for errors increases.
  • You deploy more than once a week (or want to): High-frequency deployment without feature flags is high-frequency risk. Flags make it safe.

If you checked 2+ of the above criteria, feature flags will immediately improve your workflow.

When Should You Not Use Feature Flags?

Knowing when not to use a flag is just as important as knowing when to use it. Here are a few reasons why you shouldn’t:

  • Don’t use flags for static configuration: If changing the value requires a full restart, it belongs in your config, not your flag system. Feature flags are for runtime decisions, so mixing the two adds unnecessary complexity.
  • Don’t use flags for secrets or sensitive data: You should never pass personally identifiable data (PII), API keys, or tokens through your feature flag system. This is especially critical for client-side applications because your configurations and targeting rules can be sent to the user’s browser, where anyone can inspect them. If you need to target based on sensitive attributes like email addresses, evaluate the flag server-side using verified data, or use hashed and anonymized attributes for client-side evaluation.
  • Don’t use flags for core business logic: If your subscription tier logic or pricing rules permanently live inside a feature flag, your core business functionality now depends on the availability of an external flag service. Once an experiment or rollout is complete, migrate the winning variant into your application code or a dedicated entitlement service.
  • Be cautious if your app traffic is low: Feature flags for simple on/off releases work at any scale. But if you’re planning to run A/B tests and your app has 100 users a month, you won’t reach statistical significance in any reasonable timeframe. The flag infrastructure still has value for release management and kill switches—just don’t expect experimentation to pay off until your traffic can support it.
  • Don’t adopt flags without clear processes: Until you have naming conventions, ownership docs, governance controls, and cleanup processes in place, hold off. Otherwise, you’ll accumulate serious technical debt in the long run.

What Are The Best Practices for Using Feature Flags?

To avoid spending months cleaning up avoidable issues, follow these best practices:

Use Clear, Descriptive Names

Six months from now, nobody will remember what ff-123 or test-flag means. So, choose a clear naming convention and stick to it. For example, {feature-name}-{type} works well (checkout-redesign-release, cta-color-experiment).

// ❌ Unclear - what does this control?
gb.isOn("ff-123")
gb.isOn("test")
gb.isOn("experiment_2")
gb.isOn("new-thing")

// ✅ Self-documenting
gb.isOn("new-checkout-flow")
gb.isOn("holiday-2024-promo-banner")
gb.isOn("pricing-page-v2-experiment")
gb.isOn("premium-analytics-entitlement")

Note: Platforms like GrowthBook let you enforce naming patterns with regex validation to prevent duplication and enforce governance.

Clean Up Old Flags Ruthlessly

Every flag in your codebase adds a conditional branch, and the combinations multiply fast: 10 independent flags create 2^10 = 1,024 possible code paths, and 20 flags create over a million. Those untested combinations are blind spots, so after you fully roll out a feature, do the following:

  1. Remove the flag check from your code
  2. Remove the old code path entirely
  3. Delete the flag from your management platform
  4. Document why it was removed

Turn cleanup into a team ritual: a monthly or quarterly sweep keeps technical debt from accumulating. If you’re using a platform like GrowthBook, it’ll automatically detect stale flags and show you where they live in your codebase.

Remove old flags with automatic staleness detection in GrowthBook

Set Expiration Dates on Temporary Flags

It’s easy for seemingly temporary flags to become permanent. If there’s no clear deadline to clean it up, it’ll continue to sit in your codebase unnoticed—while its cleanup gets deprioritized sprint after sprint.

That’s why we recommend setting a calendar reminder for 30 or 60 days whenever you create a new flag. Better yet, create a Jira ticket or GitHub issue linked to the flag due two weeks after the target completion date.

Note: GrowthBook also supports flag scheduling, so you can set flags to automatically enable or disable at a specific date and time. This is useful for both feature launches and scheduled cleanup. If you prefer creating a Jira ticket, our Jira integration lets you link flags directly to tickets, so you can track these cleanup tasks within your existing workflow.

Start Small With Your Rollouts

It’s easy to skip steps when you’re confident about a feature. Resist the urge and default to a progressive delivery method.

Start with your internal team, then 1% of the traffic, then 10%, and so on. Continuously monitor changes or unusual behavior at each step—and only remove the flag when you confirm stability.

Monitor Business Metrics Too

A feature can do everything right technically. It can have zero errors and sub-100ms response times, and still tank your conversion rate.

When you set up monitoring for a rollout, watch both layers:

  • Technical guardrails: Error rate, response time (p95 and p99), resource usage, API failures
  • Business guardrails: Conversion rate, revenue per user, support ticket volume

If a new feature is technically flawless but users keep raising tickets right after launch, something’s wrong. You’ll have to look under the hood to understand what happened.

Document Flag Purpose and Ownership

You don’t want to be rummaging through hundreds of Slack threads or Jira tickets to find out what a flag does. At a minimum, every flag should have:

  • What it controls (one sentence)
  • Who owns it (team or individual)
  • Expected cleanup date
  • What metrics indicate a problem
  • Rollback procedure (usually “set to 0%” or “disable“)

Template:

Flag: new-checkout-flow

Purpose: Progressive rollout of redesigned checkout experience

Owner: @growth-team (Primary: @jane)

Created: 2026-01-15

Expected cleanup: 2026-03-01

Rollback procedure: Set to 0% immediately if conversion drops >5%

Success metrics:

  • Checkout completion rate improves by 3%+
  • P95 checkout latency stays under 2s
  • Support tickets don’t increase

Current status: 25% rollout, monitoring for 1 week before increasing

Use Role-Based Access Control

Role-based access control (RBAC) allows you to control which user can access specific flags. Use RBAC to define roles that map to your risk model, including who can:

  • Create flags
  • Modify targeting rules
  • Approve changes to production
  • Publish

When you combine RBAC with four-eyes approval workflows and audit logs, you’ll have everything you need to remain compliant.

Understand How Feature Flags Affect Performance

Feature flags add an evaluation step to every request, so you need to know where that evaluation happens and what it costs. Most modern SDKs, including GrowthBook’s, run flag evaluations locally.

On client-side implementations, the SDK initializes asynchronously, which means users may briefly see the default experience before flags are evaluated (a “flicker”). You can mitigate this with server-side rendering and anti-flicker support.

Similarly, if you have hundreds of flags with complex targeting rules, it can bloat the initial SDK payload. Within GrowthBook, you can use project-scoping so each SDK connection receives only the relevant flags, and use Saved Groups to reference large ID lists rather than inlining them.

What Mistakes To Avoid While Using Feature Flags?

Here are the most common ways teams shoot themselves in the foot (and how to avoid them):

Reusing Flag Names

In 2012, Knight Capital suffered one of the most expensive software failures in history. An engineer repurposed the name of a deprecated feature flag for a new feature, but the old flag’s code was still present on a server that hadn’t received the new deployment. When the flag was activated, that server started running trades through the long-dead code path. The mistake cost the company roughly $440 million and forced it to seek an emergency rescue within a week.

It remains one of the most infamous coding errors in the industry. That’s why we recommend creating a new flag for every feature you roll out. It takes 30 seconds and eliminates the risk of activating code paths your team has forgotten about.

Using Client-Side Flags for Security

Feature flags control what to show. They don’t control who has permission. This distinction matters for client-side applications where flag values are visible in browser dev tools.

// ❌ WRONG: Anyone can enable this in browser dev tools
if (gb.isOn("admin-panel")) {
  showAdminPanel();
}

// ✅ RIGHT: Verify permissions server-side
const isAdmin = await checkAdminPermissions(user.id);
if (isAdmin) {
  showAdminPanel();
}

If you’re doing anything involving money, data access, permission, or privileged APIs, you’re better off using server-side flags to do it.

Ignoring Rollback Procedures

Typically, rollbacks seem simple: you flip the flag back to off, and the problem is solved. But it’s not always that easy. In 2020, Slack experienced an outage because a feature flag rollout triggered a performance bug. Even though the team rolled the feature back in 3 minutes, stale HAProxy state left behind by the rollout led to a six-hour outage.

Before rolling back a flag, you should know:

  • What metrics indicate a problem
  • Who has permission to roll back
  • What the downstream effects of rollback might be
  • Whether the rollback itself has been tested

Not Testing Both Flag States

Your Continuous Integration and Continuous Delivery (CI/CD) pipeline probably tests your application with your current production flag configuration. But does it test with the new flag turned on? Does it test with the new flag turned off again (the rollback scenario)?

If you only test one state, you’re assuming the other works. So, test three configurations:

  • Current production state
  • Intended release state
  • Rollback state

If you can’t test all three in Continuous Integration (CI), at least smoke test the rollback in staging before you push the flag live.
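A sketch of what that looks like in practice, with a hypothetical `checkoutFlow` standing in for the code under test:

```javascript
// Run the same check under all three flag configurations.
// `checkoutFlow` and the flag semantics are illustrative stand-ins.
function checkoutFlow(flagOn) {
  return flagOn ? "streamlined-checkout" : "legacy-checkout";
}

const configurations = [
  { name: "current production", flagOn: false },
  { name: "intended release",   flagOn: true },
  { name: "rollback",           flagOn: false }, // must behave like production
];

for (const cfg of configurations) {
  const result = checkoutFlow(cfg.flagOn);
  if (typeof result !== "string" || result.length === 0) {
    throw new Error(`checkout broken in "${cfg.name}" state`);
  }
}
```

In CI, the flag state would typically be forced per test run via your SDK’s override mechanism rather than passed as a parameter, but the matrix of states to cover is the same.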

How to Use Feature Flags for Experimentation

Most teams start with feature flags for release safety. But once you can control who sees what, a natural question follows: which version is actually better?

Without experimentation, you’re essentially shipping features based on intuition. You might think a signup form would convert better with fewer fields, but only a real test can tell you whether conversions actually go up or down. Feature flags make these tests easy to run.

How It Works

In GrowthBook, an experiment is a rule you add to an existing feature flag. You don’t need to migrate SDKs or add new code.

const variant = gb.getFeatureValue("checkout-optimization", "control");

switch (variant) {
  case "control":
    return <StandardCheckout />;
  case "streamlined":
    return <StreamlinedCheckout />;
  case "express":
    return <ExpressCheckout />;
}

Users are randomly assigned to a variation and their assignment is stable—they always see the same version. GrowthBook tracks which variation each user saw, then joins that with your existing analytics events (purchases, signups, clicks) to calculate which version performed best.

Run experiments using feature flags as the mechanism

The progression from flag to experiment typically looks like this:

  1. Simple toggle: Feature is on or off for everyone
  2. Percentage rollout: Feature reaches a growing slice of users
  3. Safe rollout: Percentage rollout with guardrail monitoring and auto-rollback
  4. Full A/B test: Controlled experiment with statistical analysis and winner selection

By the time you reach step 4, you already know the feature doesn’t break anything. Now you’re asking a different question: does it actually improve anything?

GrowthBook’s Warehouse-Native Approach

Most experimentation platforms require you to export data to their system, send tracking events to their infrastructure, or download results and crunch them in spreadsheets. All of these create data silos and increase costs.

That’s why GrowthBook connects directly to your existing data warehouse. You can integrate with platforms like Snowflake, BigQuery, Redshift, or Databricks and run the analysis there. Your data never leaves your infrastructure, which simplifies SOC 2, GDPR, and HIPAA compliance significantly. 

And because it has access to your full warehouse, you can segment experiment results by any dimension you already track. For example, LTV cohort, acquisition channel, device type.

Should You Build or Buy a Feature Flagging Tool?

The answer depends on your organization’s size and needs. Here’s an easy framework to help you decide:

| Capability | Build Your Own | Use a Platform |
| --- | --- | --- |
| Setup time | Hours | Minutes |
| Basic on/off flags | ✅ Easy | ✅ Easy |
| Percentage rollouts | ⚠️ Custom code | ✅ Built-in |
| User targeting | ⚠️ Custom code | ✅ Built-in |
| A/B testing | ❌ Requires analytics integration | ✅ Built-in |
| Non-engineer access | ❌ Not without building UI | ✅ Web dashboard |
| Audit logs | ❌ Custom implementation | ✅ Built-in |
| Multi-environment | ⚠️ Manual management | ✅ Built-in |
| SDKs for multiple languages | ❌ You build them | ✅ Provided |
| Ongoing maintenance | ⏰ Significant | ⏰ Minimal |
| Monthly cost | $0 (but engineering time) | $50-500+ (depends on scale) |
| Time to advanced features | Months of development | Immediate |

Legend: ✅ Fully supported | ⚠️ Possible with effort | ❌ Not practical

When To Build Your Own Feature Flag Tool

Building makes sense when your needs are genuinely simple:

  • You need fewer than 10–20 simple on/off flags.
  • You have strict compliance requirements preventing any third-party services.
  • You have dedicated engineering time for ongoing maintenance.
  • You only need basic on/off functionality without targeting or experimentation.
  • You want full control and have the resources to maintain it.

A config file or a database table can work fine at this scale. But in our experience, before you know it, you’ll be building dashboards and complex functionality just to maintain the flags.

When To Use a Feature Flagging Platform

A platform pays for itself quickly once any of these apply:

  • You need targeting beyond simple on/off (user segments, percentages, complex conditions).
  • You want experimentation and A/B testing capabilities.
  • You’d rather spend engineering time on your product than on internal infrastructure.
  • Non-engineers (PMs, marketing, data analysts) need to manage flags.
  • You require audit logs, role-based access, or compliance features.
  • You want debugging tools, flag lifecycle management, or third-party integrations.
  • You plan to scale flag usage across multiple teams and services.

Note: If you need to run experiments, it’s always better to go with a feature flagging platform. It’ll give you full control over what’s being tested, and you can be sure of its statistical rigor. For instance, GrowthBook includes a suite of developer tools for testing and debugging feature flags. The DevTools Chrome Extension lets you inspect flag evaluations and simulate different user attributes directly in your browser.

How To Create Feature Flags in GrowthBook

GrowthBook supports 24+ languages and frameworks. But here’s how to implement your first feature flag in under 10 minutes using React:

1. Get Your SDK Client Key

Go to SDK Configuration in GrowthBook, create a new SDK Connection, and copy the Client Key (it starts with sdk-).

2. Install the SDK

npm install @growthbook/growthbook-react

3. Wrap Your App With GrowthBook Provider

import { GrowthBook, GrowthBookProvider } from "@growthbook/growthbook-react";
import {
  thirdPartyTrackingPlugin,
  autoAttributesPlugin,
} from "@growthbook/growthbook/plugins";

// Create a GrowthBook instance
const gb = new GrowthBook({
  clientKey: "sdk-abc123", // Your SDK client key
  enableDevMode: true, // Shows helpful debug info in development
  plugins: [
    // Optional: sends "Experiment Viewed" events via GrowthBook Managed
    // Warehouse, Google Analytics, Google Tag Manager, and Segment
    thirdPartyTrackingPlugin(),
    // Optional: sets common attributes (browser, session_id, etc.)
    autoAttributesPlugin(),
  ],
});

// Load feature definitions from the GrowthBook API
gb.init();

export default function App() {
  return (
    <GrowthBookProvider growthbook={gb}>
      <MyApp />
    </GrowthBookProvider>
  );
}

4. Create a Flag in GrowthBook

In GrowthBook’s dashboard:

  1. Navigate to Features → Add Feature
  2. Set a unique feature key: new-onboarding
  3. Choose value type: boolean
  4. Default value is false (off by default)

Your flag is now live.

5. Use the Flag in Your Code

import { useFeatureIsOn } from "@growthbook/growthbook-react";

function OnboardingFlow() {
  const showNewOnboarding = useFeatureIsOn("new-onboarding");

  if (showNewOnboarding) {
    return <NewOnboardingFlow />;
  }

  return <LegacyOnboardingFlow />;
}

That’s it. The flag defaults to false, so everyone sees the classic onboarding. Toggle it to true in the dashboard, and the new version appears instantly. Toggle it back, and you’ve rolled back in seconds.

From here, you can add targeting rules, percentage rollouts, safe rollouts with guardrail metrics, or full A/B experiments within the same dashboard, without changing your code.

Ready to start? Try GrowthBook Cloud free, or check out the documentation for integration guides across all 24+ SDKs. For self-hosting, the GitHub repo has everything you need.

Frequently Asked Questions

1. What is the difference between feature flags and feature management?

Feature flags are the technical mechanism—the if/else statements in your code that check configuration values. Feature management is the broader practice of using flags strategically across the software lifecycle, including targeting rules, progressive rollouts, experimentation, governance, and lifecycle management.

2. What is the difference between feature flags and feature toggles?

“Feature flags” and “feature toggles” are synonyms for the same concept. You’ll also see “feature switches,” “feature flippers,” and “feature gates.”

3. What is the difference between feature flags and experiments?

Feature flags control who sees what. Experiments measure which version performs better. So, flags act as the delivery mechanism to run your experiments, and experiments give you the measurement layer to see the results.

4. What is the difference between feature flags and branches?

Git branches manage code versions during development, while feature flags manage feature visibility in production. With branches alone, you can’t deploy a feature until the branch merges and deploys. But with feature flags, the code merges to main immediately, but the flag keeps it hidden until you’re ready to release.

5. What is feature testing?

Feature testing means validating that a feature works correctly before releasing it broadly. With feature flags, you can enable a feature only for QA accounts or internal users and test it in production with real data and traffic patterns.

6. How do feature flags help with continuous delivery?

Feature flags separate deployment from release, so you can merge code continuously and deploy multiple times a day with new features safely wrapped in flags. Without them, you can’t deploy continuously because the feature itself might be incomplete or unvalidated.

7. What is progressive delivery, and how do feature flags enable it?

Progressive delivery is the practice of gradually releasing features to larger user segments while monitoring for issues at each stage. Instead of a binary release (off for everyone, then on for everyone), you incrementally increase exposure. For example, releasing it to the internal team first, then 5% of real users, until you reach 100% of users.

8. How do feature flags differ from configuration files?

Configuration files are static: changing them means redeploying the code or restarting the service. Feature flags evaluate at runtime. You flip a switch, and the change propagates to your application within seconds via streaming updates.

9. How can you deploy and manage feature flags at scale?

To deploy flags at scale, you need the following features and capabilities:

  • Centralized feature flag management across all services
  • SDKs for every language in your stack
  • Streaming updates where changes propagate instantly
  • Governance controls like audit logs, RBAC, and approval workflows
  • Lifecycle management, such as stale flag detection, ownership tracking, and enforcement of cleanup cadence

10. What are client-side feature flags?

Client-side flags evaluate in the browser or mobile app rather than on your server. They’re useful for UI experiments, frontend rollouts, and A/B tests on visual elements. They’re usually visible to users, so don’t use them for PII, sensitive data, or access control.

11. What are the benefits of an open source feature flag platform?

Here are the key benefits of open source platforms:

  • You can inspect exactly how flags are evaluated, how experiments are analyzed, and how your data is processed. Plus, you can audit security practices before deploying to production.
  • You can fork the codebase if the project changes direction. You can also self-host indefinitely without an ongoing vendor relationship.
  • You can deploy the app within your own infrastructure. This is critical for compliance with frameworks like HIPAA, FedRAMP, SOC 2, and GDPR.
  • With open source platforms, you pay for the infrastructure you choose, so you don’t rely on the vendor’s infrastructure pricing.

You can take advantage of OpenFeature, a Cloud Native Computing Foundation (CNCF) incubating project that creates a vendor-agnostic API standard for feature flagging.

Your React Feature Flags Are Probably Broken (Here's How to Fix Them with TypeScript)
Feature Flags

Feb 23, 2026

Your checkout component renders perfectly. The layout looks right, the tax rate loads, the payment methods appear. Everything passes your visual check — and then you deploy, and users get the experimental beta layout when they shouldn't.

The bug? A feature flag with the value "false" — a string, not a boolean. In JavaScript, a non-empty string is truthy. So the flag that was supposed to disable the experimental UI was quietly enabling it for every single user. TypeScript had no idea, because it had no idea your feature flags existed at all.
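A short snippet makes the failure mode concrete:

```typescript
// The misconfigured flag value arrives as a string, not a boolean:
const flagValue: unknown = "false";

// Any non-empty string is truthy, so this evaluates to true for every user,
// the exact opposite of what "false" was supposed to mean:
const experimentalEnabled = Boolean(flagValue); // true!

// A strict comparison against a real boolean behaves as intended:
const experimentalEnabledStrict = flagValue === true; // false
```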

This is the quiet danger of untyped feature flags. They fail silently, they're hard to reproduce in tests, and they tend to surface at the worst possible moment. Here's how to close that gap with generated TypeScript types — and a few additional best practices that make your flags easier to maintain as your codebase grows.

Why TypeScript Doesn't Save You (By Default)

Most React developers working with feature flags write something like this:

const { isExperimental, taxRate, headline, paymentMethods } = useGrowthBook();

This looks reasonable. But TypeScript has no way of knowing:

  • Whether isExperimental is a real flag name in GrowthBook
  • Whether it should be a boolean, a string, or a number
  • Whether your fallback value type matches the default defined in your dashboard

Without type definitions, the SDK treats everything as any. You can pass a string where a boolean belongs, misspell a flag name, or reference a flag that's been deleted — and your code compiles without complaint. The result is a whole category of bugs that are genuinely hard to catch: everything looks fine at the TypeScript layer, but the runtime behavior is wrong.

The Fix: Generated Type Definitions for Your Feature Flags

The solution is to give the GrowthBook React SDK a TypeScript interface that describes all your flags — their names and their value types — so the compiler can enforce correctness for you.

GrowthBook provides a CLI tool to generate these types. But if you're using Cursor or another AI-assisted editor with MCP support, there's an even faster path: you can generate the types directly through GrowthBook's MCP server without leaving your editor.

Either way, the result is a file called app-features.ts that contains a complete TypeScript interface for every flag in your GrowthBook account:

export interface AppFeatures {
  checkout_experimental_layout: boolean;
  headline: string;
  shipping_tax: number;
  payment_methods: string[];
}

Every flag. Every type. Automatically generated from your actual GrowthBook configuration — not hand-written and left to drift.

Using the Flag Types in Your Component

Once you have app-features.ts, import it and pass it to the useGrowthBook hook as a generic type parameter:

import { AppFeatures } from './app-features';
import { useGrowthBook } from '@growthbook/growthbook-react';

const gb = useGrowthBook<AppFeatures>();

That one change unlocks the full power of TypeScript's type checker against your feature flags. The moment you do this, errors that were previously invisible become immediately visible — right in your editor, before you run anything.

The Errors You'll Actually See

When we applied types to a real checkout component with four flags, TypeScript surfaced several problems immediately:

Wrong flag name. The component was using isExperimental as a flag key. The actual flag in GrowthBook is checkout_experimental_layout. Without types, this compiled and ran fine — it just returned the fallback value every time, silently. With types, it's a compiler error on the spot.

Wrong value type. The fallback for checkout_experimental_layout was "false" — a string. The actual flag type is boolean. This is the bug from the opening: because "false" is a truthy string, the experimental layout was enabled for every user. TypeScript catches this the moment you add the type definition.

Mismatched default values. The component assumed payment_methods defaulted to just credit card. The actual default in GrowthBook includes Bitcoin. With the MCP server, you can verify that your fallback values match your GrowthBook defaults directly in the editor — and even have the agent update the code for you.

These aren't hypothetical bugs. They're the kind of thing that gets deployed on a Friday.

Keeping Flag Types in Sync

Generating types once is useful. Keeping them current is what makes this a real system.

When you generate types via the GrowthBook CLI or MCP server, it also adds a script to your package.json:

"scripts": {
  "generate-flag-types": "growthbook features generate-types"
}

Run this any time you add, remove, or change a flag in GrowthBook. It takes seconds and ensures your TypeScript definitions never drift from your actual configuration. A good practice: add it to your CI pipeline, or at minimum to your pre-release checklist. Stale type definitions are better than none, but fresh ones are what give you the full safety guarantee.

Three More Feature-Flag Best Practices Worth Adding

Type safety solves the hardest category of feature flag bugs, but there are a few additional practices that will save you headaches as your flag usage grows.

Handle Loading States Explicitly

When your app initializes, GrowthBook fetches flag values from the server. During this brief window, the SDK relies on local fallback values. If not handled explicitly, this can result in a "flash of unstyled content" (FOUC) where users see the wrong UI state for a split second.

To solve this, the GrowthBook React SDK provides the <FeaturesReady> helper component. It allows you to render a loading state until your features are fully loaded:

<FeaturesReady timeout={500} fallback={<LoadingSpinner/>}>
  <ComponentThatUsesFeatures/>
</FeaturesReady>

Don't skip this. While loading is often near-instant in development, the "flash" becomes painfully obvious for production users on slower connections.

Use Descriptive, Consistent Flag Names

Flag names like ff-123 or new-ui become unmaintainable fast. When you have 50 flags, you need to know at a glance what each one controls, which team owns it, and whether it's still active.

A naming convention that works well: {scope}-{description}-{date}

  • Example: checkout-experimental-layout-2025-03, pricing-annual-discount-enabled-2026-01, onboarding-video-modal-shown-2026-02.

It's more characters, but it's searchable, scannable, and self-documenting.

Paired with TypeScript autocomplete (which you now have), good naming means you can find the right flag in seconds rather than hunting through a dashboard.
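If you want to enforce the convention mechanically, a small lint check works. The regex below encodes the {scope}-{description}-{date} pattern from the examples above; it's an assumption for illustration, not an official GrowthBook rule:

```typescript
// Hypothetical lint check for {scope}-{description}-{YYYY}-{MM} flag names.
const FLAG_NAME_PATTERN = /^[a-z]+(?:-[a-z0-9]+)*-\d{4}-\d{2}$/;

function isValidFlagName(name: string): boolean {
  // Accepts "checkout-experimental-layout-2025-03"; rejects "ff-123" and "new-ui".
  return FLAG_NAME_PATTERN.test(name);
}
```

Run it over your flag list in CI and the convention stays enforced as the team grows.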

Know What Client-Side Flags Can and Can't Do

Feature flags evaluated in the browser are visible to users — anyone with DevTools can inspect the flag values your app receives. This is fine for UI experiments and gradual rollouts, but it means you should never use client-side feature flags to gate access to sensitive features or enforce permissions.

For anything security-sensitive — premium features, admin capabilities, access control — validate on the server. Client-side flags are for experience control, not authorization.
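As a sketch of that split (names and shapes are illustrative assumptions): the client-side flag decides what UI to render, while the server independently verifies entitlement before doing anything sensitive:

```typescript
// Illustrative sketch: client flags control presentation, the server controls access.
interface Account {
  id: string;
  plan: "free" | "premium";
}

// Client side: fine to use a flag to decide whether to render the premium UI.
function shouldRenderPremiumUi(clientFlag: boolean): boolean {
  return clientFlag;
}

// Server side: the real gate, evaluated where users can't tamper with it.
function canUsePremiumFeature(account: Account): boolean {
  return account.plan === "premium";
}

// Even if a user flips the client flag in DevTools, the server still refuses:
function handlePremiumRequest(account: Account): number {
  return canUsePremiumFeature(account) ? 200 : 403;
}
```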

Getting Started

If you're using GrowthBook with React, here's the short path to type-safe flags:

  1. Generate your types using the GrowthBook CLI (npx growthbook features generate-types) or through the MCP server in Cursor
  2. Import and apply the types to your useGrowthBook hook
  3. Fix the errors TypeScript surfaces — treat each one as a bug caught before production
  4. Add loading state handling so users don't see flashes of the wrong UI
  5. Standardize your flag naming convention before your flag count grows
  6. Add the generation script to package.json and run it whenever your flags change

The type setup takes under 10 minutes. The bugs it prevents can take hours to diagnose after the fact — and the ones that reach users can quietly drag down conversions for days before anyone notices.

GrowthBook has full documentation on TypeScript type generation for React and every other supported SDK. If you run into questions, the GrowthBook Slack community is active and helpful for anything experimentation-related.

Ready to ship faster?

No credit card required. Start with feature flags, experimentation, and product analytics—free.