The Uplift Blog

What Is A/B Testing? A Practical Guide for Product & Engineering Teams

Apr 1, 2026

A/B testing is simple in concept. Split your users, show them different experiences, and measure what happens.

In practice, A/B testing for product teams is rarely that clean. Real products have real constraints in tracking, assignment, and metric definition, quickly making a straightforward test complicated.

Low-velocity teams can absorb slow, isolated mistakes. High-volume experimentation cannot: flaws in the fundamentals compound across tests and lead to bad product decisions made with high confidence. Fortunately, these failure modes are well understood and avoidable.

What Is A/B Testing?

A/B testing, sometimes called split testing, is a randomized experiment in which multiple versions of something are shown to different groups simultaneously. Each group is measured against a defined metric to determine which performs better.

By randomly assigning units to each version, you control for external factors like seasonality, changes in traffic mix, and broader market conditions, so any difference in outcomes beyond random noise can be attributed to your change.

What Does A/B Testing Look Like in Practice?

In product development, an A/B test runs alongside your normal release process. Rather than shipping a change to everyone at once, you expose a subset of your users to the new experience while the rest continue seeing the existing one. Both groups run simultaneously, and you measure the difference.

  1. Define a hypothesis including the metric you're testing against.
  2. Randomly split your audience into groups, each exposed to a different version.
  3. Analyze the difference between groups using a statistical framework.
  4. Ship the winning variant, or go back to the drawing board with what you learned.
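
Step 2 above, the random split, is often implemented by hashing a stable user identifier rather than by calling a random number generator on every request, so the same user always sees the same version. The following is a minimal sketch of that idea, not any particular platform's implementation; the function name `assign_variant`, the experiment key, and the equal 50/50 weighting are all assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to one of several equally weighted variants."""
    # Hash the experiment key together with the user ID so that the same user
    # can land in different buckets in different experiments.
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex characters (32 bits) as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    # Split [0, 1) into equal-width slices, one per variant.
    return variants[int(bucket * len(variants))]


if __name__ == "__main__":
    counts = {"control": 0, "treatment": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}", "checkout-redesign")] += 1
    print(counts)  # the two counts should be close to each other
```

Because the assignment depends only on the user ID and the experiment key, it stays stable across sessions and servers without storing any state.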

Without that structure, you're left comparing against historical data. Consider a team that ships a new feature and watches new signups drop 8% over the following two weeks. They blame the release and roll it back, but signups stay flat. It turns out it was a seasonal dip that would have happened regardless of what was shipped, and now the team has spent a week in firefighting mode reverting a change that had nothing to do with the decline.

Or consider a team deciding between two redesigns of the same checkout flow. Rather than debating which one to ship, they test both against the current experience simultaneously. One variant performs similarly to the control. The other increases completed purchases by 12%. Without the test, that call comes down to whoever argues most convincingly in the design review.

Why Does A/B Testing Matter?

For product teams, the value of A/B testing isn't just finding winning variants. It's making consequential decisions about how your product works based on what users actually do, rather than what your team thinks they'll do. 

It's also one of the few tools that gives teams the ability to push back on the HiPPO (the highest paid person's opinion) with something more than a gut feeling of their own. When the data says otherwise, it says so for everyone in the room.

The Critical Difference: A/B Testing vs Gut/Intuition

Without A/B testing, product decisions tend to default to a familiar set of inputs:

  • HiPPO (Highest Paid Person's Opinion). The person with the most seniority in the room has the most influence over what ships. Experience and instinct have value, but they're not a substitute for knowing what your users actually do.
  • Best practices that may not apply to your audience. What worked for another product, in another market, with a different user base is a starting point at best. Your users are not their users.
  • Assumptions about user behavior. Intuition about how users will respond to a change is useful for generating hypotheses, but assumptions are often wrong.
  • Competitor copying without context. You can see what your competitors ship, but you can't see whether it worked or what they had to give up to get there.

With A/B testing, product decisions are grounded in more reliable inputs:

  • Actual user behavior from your specific audience. Benchmarks and case studies tell you what worked somewhere else. This tells you what works for your users, in your product.
  • Statistically validated results. Results you can trust, reproduce, and build on rather than ones you have to take on faith.
  • Measurable business impact. You can tie the outcome of an experiment directly to the metrics the business cares about, whether that's retention, revenue, or engagement.
  • Continuous learning. Every experiment, whether it wins or loses, tells you something about how your users behave. 

What Are the Benefits of A/B Testing in 2026?

For modern product teams, the benefits of A/B testing go well beyond finding a winning variant. In 2026, with AI accelerating the pace of product development and raising the bar for what teams can ship, the cost of making bad product decisions has never been higher. Done consistently and rigorously, experimentation touches how teams make decisions, allocate resources, and understand their users.

1. Get More Value from Your Existing Traffic

Customer acquisition costs have climbed by as much as 60% since 2023:

  • Paid channels are getting more expensive as competition for inventory increases and AI-driven bidding pushes auction prices up. 
  • Organic search is delivering fewer clicks as LLMs answer queries before users leave the results page. 
  • Social platforms are increasingly designed to keep users on-platform rather than send them to yours.

Getting more value out of the traffic you already have is increasingly a business necessity, and A/B testing is how you do it systematically.

2. Reduce the Risk of Rolling Out Major Changes

Every product change carries risk. A change can perform worse than expected for a variety of reasons: a bug that only surfaces under certain conditions, user behavior that didn't match your assumptions, or a change that worked well for one segment while degrading the experience for another. Without feature experimentation, you find out about these issues after the fact, when the change has already reached your entire user base.

By feature flagging and exposing a change to a subset of users first, you limit the damage if something goes wrong. A variant that damages an important metric affects 10% of your traffic, not 100%. If it performs well, you can roll it out knowing what to expect. If it doesn't, you can roll it back before most of your users ever see it.

3. Speed Up Product Decision Making 

Product decisions are slow when they rely on opinion. Design reviews stretch into hours as stakeholders debate, and the person with the most seniority often wins, not because they're right, but because they're the loudest voice in the room.

Product experimentation changes how those conversations go. When you have data on how users actually behaved, the debate shifts from "I think" to "here's what we know." As one PM put it: "A/B testing turned our three-hour design debates into 30-minute data reviews."

That speed compounds over time. Teams that can make and validate product decisions faster than their competitors ship more, learn more, and course-correct before small mistakes become expensive ones.

4. Develop a Deeper Understanding of Your Users 

Every experiment tells you something about your users, whether it wins or loses. A variant that underperforms is still evidence. It tells you what your users don't respond to, which is often just as useful as knowing what they do.

Over time, that body of evidence becomes more valuable than any single test result. Teams that maintain a searchable archive of past experiments (GrowthBook does this automatically) stop asking "Didn't we already test this?" and start forming better hypotheses from the outset. This process builds a richer understanding of their users and how they actually behave, leading to better prioritization as the most impactful initiatives become clearer.

5. Uncover Surprising Insights

Not every valuable idea looks valuable before it's tested. A Microsoft engineer once ran a quick A/B test on a low-priority change to how Bing displayed ad headlines (an idea that had sat untouched for over six months). The test showed a 12% increase in revenue, which translated to more than $100 million annually in the US alone. It turned out to be the best revenue-generating idea in Bing's history and it almost never got tested at all.

These insights only surface when you have an A/B testing framework that makes it easy to ship any product change as a controlled experiment.

6. Build a Competitive Advantage

The teams that consistently outperform their competitors aren't necessarily the ones with the best ideas. They're the ones who can validate ideas faster and learn from failures.

Netflix is a well-documented example. The company runs experiments across virtually every aspect of its product, optimizing everything from thumbnails to recommendation algorithms to ensure that data (rather than opinion) drives decisions. That commitment to experimentation at scale is part of what allows a company of that size to keep iterating as fast as it does.

The more consistently you test, the better your decisions get, and the harder it is for competitors to close that advantage.

Who Should Use A/B Testing (and Who Shouldn’t)

Most teams can benefit from A/B testing in some form. But the teams that get the most out of it tend to share a few things in common: enough volume to reach statistically meaningful results, the technical infrastructure to instrument changes correctly, and decisions that are frequent enough to make a testing practice worthwhile.

Product teams

Product teams should run experiments to make confident decisions about what to build and how to build it. Does this feature change improve engagement? Does this new experience keep users on the platform longer? Experimentation answers those questions before a change is fully committed. Smaller tests can also validate hypotheses early, before significant product development investment is made, informing broader product strategy along the way. It also gives product teams a clearer read on the actual impact of their work, which is often harder to measure than it looks.

Engineering teams

Engineering and dev teams should run experiments to ship changes with confidence and get direct visibility into the impact of their work on product outcomes, not just system performance. Does this algorithm change actually improve the metric it was designed to improve? Does this infrastructure change affect user behavior in ways that weren't anticipated? Rather than shipping to everyone at once, changes can be rolled out to a subset of users first, catching unexpected behavior before it reaches your entire user base.

Growth and marketing teams

Growth and marketing teams should run experiments to validate what actually resonates with their audience before committing to a direction. Does this landing page copy increase signups? Does this email subject line improve open rates? The feedback loops are short, and the metrics are clear, making experimentation a natural fit for fast iteration.

Design teams

Design teams should run experiments to resolve design debates with data (rather than opinion alone) and validate changes before they're fully built. Does this layout change make the key action more obvious? Does this navigation pattern reduce friction or just introduce unfamiliarity? A/B testing gives design teams a way to move forward on contested decisions without waiting for consensus.

When NOT to Use A/B Testing

A/B testing isn't the right tool in every situation. There are a few conditions where it will either produce unreliable results or simply isn't worth the investment.

  • Product is too early stage. If you're still searching for product-market fit, optimizing individual features is a distraction. The priority at that stage is to learn whether the core product solves a real problem, which requires qualitative research and iteration, not controlled experiments.
  • Not enough units. A/B testing requires enough users moving through the experience you're testing to produce statistically meaningful results within a reasonable timeframe. If your sample size is too small, you'll either run tests for months or make decisions based on underpowered results that don't hold up.
  • Decisions with an obvious right answer. Some changes don't need a test. Accessibility improvements, critical bug fixes, and security patches should be shipped because they're the right thing to do for your users, not because an experiment validated them. Testing these changes introduces unnecessary delay and, in some cases, raises ethical questions about deliberately exposing a subset of users to an inferior or broken experience. However, it can still be valuable to run non-inferiority tests to ensure changes don't introduce new issues that affect the customer experience.
  • No internal alignment. A/B testing only produces value if the results get acted on. If your team can't agree on what success looks like before the experiment starts, or if stakeholders routinely override data-driven conclusions with opinion, the infrastructure of experimentation exists without the culture to support it. The tool is only as useful as the organization's willingness to trust and act on what it finds. Getting alignment on the Overall Evaluation Criterion (OEC) is usually a critical first step. If your teams can't agree on a north star metric, it's very difficult to grow the business effectively.
  • Significant brand changes. A/B testing works well for changes with measurable behavioral outcomes, but brand identity isn't that kind of decision. Testing radically different brand expressions simultaneously means that different users see different versions of who you are, creating inconsistency that's difficult to undo. For changes to core brand messaging, tone, or visual identity, market research and qualitative methods are better inputs than a randomized experiment.
  • Regulated industries with constraints on user treatment. In some industries, randomly assigning users to different experiences raises legal or ethical issues. Healthcare, financial services, and edtech all require additional thought. For example, you don't want half the students in a class to have one learning experience while the other half has another, which could be hard on both the teacher and the students. Data privacy is also extremely important in these industries. None of this means these industries can't run A/B tests, but it does mean the standard experimentation framework needs to be adapted before it can be applied safely. (GrowthBook's self-hosting and privacy-first architecture is specifically designed for teams operating in regulated environments.)
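
The "not enough units" condition in the list above can be checked with a rough calculation before committing to a test. The sketch below uses the standard normal-approximation formula for comparing two conversion rates; the function names, the example baseline rate, and the traffic figures are hypothetical, and a real experimentation platform may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect a relative lift
    in a conversion rate with a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_variant: int, daily_users: int, n_variants: int = 2) -> int:
    """Days of traffic needed if every daily user enters the experiment."""
    return math.ceil(n_per_variant * n_variants / daily_users)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, hoping to detect a 10% relative lift.
    n = sample_size_per_variant(baseline_rate=0.10, relative_lift=0.10)
    print(n, "users per variant")
    print(days_to_run(n, daily_users=2_000), "days at 2,000 users per day")
```

If the resulting duration runs to months, that is a sign the surface does not yet have enough traffic for the effect size you care about.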

What Can You A/B Test?

Most teams start experimenting with the most visible parts of their product and stop there. The reality is that if a change can be measured and randomly assigned, it can be tested. That applies as much to a ranking algorithm or a model prompt as it does to a button label or a checkout flow, and the most sophisticated experimentation programs treat almost every product change as a candidate for a controlled experiment.

User-Facing Product Experiences

Changes to what users see and interact with directly are often the easiest to instrument, the most straightforward to design a clean experiment around, and the most immediately connected to the metrics product teams care about.

Copy and Messaging

The words you use to describe your product, explain a feature, or prompt an action affect how users respond in ways that are hard to predict without testing. This includes headlines, body copy, error messages, empty states, and tooltips. Copy that works well in one context often fails in another, which makes experimentation more reliable than intuition.

Visual Design Elements

Colors, typography, imagery, iconography, and visual hierarchy all affect how users perceive and engage with a product. These elements are worth testing on high-traffic acquisition surfaces where visual choices directly affect first impressions and conversion.

Social Proof and Trust Signals

The placement, format, and type of social proof affects how users evaluate whether to take action. Testimonials, review counts, trust badges, and case study callouts are all worth testing at high-stakes moments in the user journey, like pricing pages or checkout flows, where trust is a meaningful factor in the decision.

Calls to Action

Button text, placement, size, and visual weight all affect whether users take the action you want. The difference between "Start free trial" and "Get started" may seem trivial, but it can produce measurable differences in click-through and conversion rates.

Forms and Data Collection

The number of fields, their order, their labels, and how validation errors are presented all affect completion rates. For teams with signup flows, checkout processes, or any other form-gated experience, this is a productive area for experimentation.

Layout and Navigation

How you organize and present information affects how users move through a product and what they do next. Single versus multi-column layouts, card versus list views, menu structure, and the placement of key actions relative to supporting content are structural decisions that are harder to get right through intuition alone.

Onboarding Flows

What happens in a user's first few sessions shapes everything that comes after. Changes to the number of steps, the order of actions, or the point at which users are asked to commit to something can have measurable downstream effects on activation and retention metrics.

Pricing and Packaging Display

How you present pricing affects conversion without changing the underlying price. Tier ordering, anchoring, and the framing of free versus paid features are all worth testing for any team with a monetization surface, though the effects can take time to manifest.

Backend and Infrastructure

The most impactful experiments a product team can run are often invisible to users. A change to a ranking algorithm or a model prompt can affect user behavior just as much as a redesigned interface, and without a controlled experiment, the effect is nearly impossible to isolate.

Infrastructure and Performance

Performance improvements are generally good for users, but testing them as controlled experiments lets you quantify exactly how much they matter for the metrics you care about. Knowing which specific infrastructure investments moved conversion by 3% and which didn't gives teams a more reliable basis for deciding where to invest next.

Default Settings and Configurations

Most users never change defaults, which means the state you ship with has an outsized effect on how a feature gets used. Testing different default configurations is low-cost to implement and can meaningfully affect adoption and engagement.

Notification Timing and Content

Both when you send a notification and what it says affect whether users engage with it. Testing send timing, message length, and the specific action you're prompting can improve open rates and click-through without increasing notification volume.

Product Features and Functionality

Beyond how a feature looks, you can test how it behaves. The results often reveal that users interact with the functionality in ways that don't align with the original design assumptions, which is useful information regardless of which variant wins.

Search and Discovery

Search ranking, autocomplete behavior, and filtering defaults all affect whether users find what they're looking for. Search is often a high-intent surface where small improvements in relevance or presentation directly affect conversion or engagement.

Algorithms and Ranking

Ranking and recommendation algorithms affect every user simultaneously, which makes them worth testing carefully. Small changes to the underlying logic can produce meaningful differences in engagement and retention that aren't visible until you measure them.

AI and ML Models

AI and ML models are particularly hard to evaluate without controlled experiments. A model that scores better on benchmarks doesn't always perform better in production, so A/B testing it on real traffic is the most reliable way to find out. Performance, quality, and speed are all worth testing, and even slight changes to a system prompt deserve a controlled test.

Growth and Acquisition Surfaces

Growth and acquisition surfaces are where most teams first encounter A/B testing, and for good reason. The metrics are clear, the feedback loops are short, and the tests are relatively cheap to run compared to changes deeper in the product.

Email Campaigns

Subject lines, send timing, message length, preview text, and calls to action all affect whether users open, click, and convert. Email is one of the more forgiving surfaces for experimentation because tests are cheap to run and results come in quickly, making it a good starting point before moving into more complex product surfaces.

Paid Ads

Ad creative, copy, targeting parameters, and landing page destinations all affect cost per acquisition and return on ad spend. Testing these systematically rather than relying on platform optimization alone gives teams more control over what's actually driving performance and makes it easier to apply what you learn across campaigns.

Landing Pages

Landing pages connect acquisition and product, which makes them worth testing carefully. Headline copy, hero imagery, social proof placement, form length, and page structure all affect conversion, and improvements here affect the efficiency of every upstream acquisition channel.

Mobile App Stores (ASO)

App store listings are a testable surface that many teams overlook. Screenshots, preview videos, descriptions, and icon design all affect install rates, and both the App Store and Google Play offer native tools for running controlled tests on these elements.

Internal Tools and Systems

Most teams think of A/B testing as something you do on user-facing surfaces. Internal tooling is worth the same rigor. The workflows your team uses, the interfaces they navigate, and the systems that handle billing and support all affect business outcomes in measurable, improvable ways.

Billing Systems

When and how you charge users affects conversion, retention, and revenue in ways that aren't always intuitive. Credit charging timing, trial length, grace periods, and dunning flows are all worth testing, and the effects can be substantial even when the changes seem minor.

Customer Success

The interfaces and workflows your support team uses directly affect both resolution times and the experience customers receive on the other end. Testing different queue structures, response templates, or escalation flows can surface improvements that are invisible from the outside but meaningful to the people doing the work and the customers they're helping.

Dashboard and Reporting Interfaces

How data is presented to internal users affects the decisions they make. Testing different visualizations, metric groupings, or alert thresholds can improve how quickly teams identify issues and act on them.

Internal Search and Navigation

How employees find information and move through internal tools affects productivity in ways that are easy to underestimate. Testing search ranking, navigation structure, and information hierarchy in internal tools follows the same principles as product experimentation, just with a different user base.

Workflow and Process Design

Internal processes are testable too. Whether it's the order of steps in an approval flow, the default assignee for a task, or the trigger conditions for an automated action, small changes to how work moves through a system can have measurable effects on speed and accuracy.

Different Types of A/B Tests

Not all experiments are structured the same way. The standard A/B test is the right tool for most situations, but there are different types of A/B tests for different situations.

A/A Test

An A/A test runs two identical variants against each other. The purpose isn't to find a winner but to confirm your experimentation infrastructure is working correctly. Check several metrics to confirm that data is flowing correctly and that a roughly equal number of users is assigned to each variant. Because the variants are identical, any statistically significant difference is a false positive, and at a 95% confidence level you should expect about 1 in 20 comparisons to show one by chance.
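
One of the A/A checks described above, that each variant receives roughly the expected share of users, is commonly called a sample ratio mismatch (SRM) check. The sketch below is one way to run it, using a chi-square goodness-of-fit test with one degree of freedom; the function name and the example counts are hypothetical, and the cutoff you alert on is a choice for your team rather than something this article specifies.

```python
import math


def srm_p_value(count_a: int, count_b: int, expected_share_a: float = 0.5) -> float:
    """P-value for the hypothesis that users were split as intended."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # For a chi-square statistic with 1 degree of freedom,
    # P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    print(srm_p_value(5_000, 5_050))  # small imbalance: large p-value, nothing to flag
    print(srm_p_value(5_000, 5_300))  # larger imbalance: small p-value, investigate
```

A very small p-value here suggests a problem with assignment or tracking, which should be fixed before trusting any metric from the experiment.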

A/B/n Test

An A/B/n test extends the standard A/B test to include multiple variants tested simultaneously against a single control. You evaluate several hypotheses in one experiment rather than running them sequentially. Each additional variant requires more units to reach significance, so population requirements scale with the number of variants. If you have enough traffic, multi-variant tests are a great way to accelerate learning.

Multivariate Test

A multivariate test changes multiple elements simultaneously and tests combinations of them. If you're testing two headlines and two button colors, a multivariate test runs all four combinations to understand not just which elements perform better individually, but how they interact. The tradeoff is that you need considerably more traffic than a standard A/B test, because the population is split across every combination.
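
To see how quickly the population gets split, the combinations in a multivariate test can be enumerated directly. This is a small illustrative sketch; the factor names, the values, and the assumption of an even split across cells are made up for the example.

```python
from itertools import product


def multivariate_cells(factors: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return every combination of factor values as one test cell."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in product(*(factors[name] for name in names))]


def users_per_cell(total_users: int, factors: dict[str, list[str]]) -> int:
    """Users available to each cell if traffic is split evenly."""
    return total_users // len(multivariate_cells(factors))


if __name__ == "__main__":
    factors = {
        "headline": ["Headline A", "Headline B"],
        "button_color": ["blue", "green"],
    }
    for cell in multivariate_cells(factors):
        print(cell)
    print(users_per_cell(40_000, factors), "users per cell")
```

Adding a third factor with three values would turn four cells into twelve, which is why multivariate tests are usually reserved for high-traffic surfaces.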

Holdouts

A holdout test withholds a feature from a group of users after it has been fully rolled out to everyone else. The holdout group continues to see the old experience, which lets you measure the long-term effect on retention and engagement that takes time to manifest. A new onboarding flow might look neutral in a two-week test but show meaningful differences in retention at 90 days. Holdouts are also useful for measuring the cumulative effect of many experiments running simultaneously. By comparing the holdout group to the fully treated population over 3–6 months, you can measure the combined effect of all your experiments.

Statistical Approaches to A/B Testing

Most modern experimentation platforms, like GrowthBook, give you a choice between Bayesian and frequentist statistics. Both are good options but understanding the differences can help you decide which approach is best for you. 

Bayesian Statistics

Bayesian statistics handles hypothesis testing by expressing results as probabilities. Instead of a binary significant/not-significant decision, you get a probability distribution: what's the chance variant B is better than variant A, and by how much? This makes results easier to interpret and communicate to non-technical stakeholders. Bayesian methods can also incorporate prior beliefs about the metric being tested, helping avoid over-interpreting results from small samples.
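
As a rough illustration of how "the chance variant B is better than variant A" can be computed for a conversion metric, the sketch below uses a Beta-Binomial model with a uniform prior and Monte Carlo sampling. The model choice, the function name, and the example numbers are assumptions for illustration and are not a description of any specific platform's stats engine.

```python
import random


def chance_to_beat_control(conv_a: int, n_a: int, conv_b: int, n_b: int,
                           prior_alpha: float = 1.0, prior_beta: float = 1.0,
                           draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(rate_B > rate_A) by sampling from each variant's posterior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(prior + successes, prior + failures).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Hypothetical data: 100/1000 conversions in control, 120/1000 in the variant.
    print(chance_to_beat_control(100, 1_000, 120, 1_000))
```

Changing `prior_alpha` and `prior_beta` is where an informed prior would enter; the default of 1.0 for both encodes no prior preference.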

Benefits of Bayesian Statistics

  • Results are expressed as probabilities that are intuitive to act on, like, "There's a 92% chance variant B is best."
  • The probability distribution shows the full range of likely outcomes, not just a point estimate.
  • Probabilities are well-suited for communicating results to non-technical stakeholders.
  • Using informed priors can help reduce uncertainty in smaller samples.

Drawbacks of Bayesian Statistics

  • Poorly calibrated priors can skew results, particularly with small sample sizes.
  • Not immune to peeking; stopping rules should be defined upfront and followed.

Frequentist Statistics

Frequentist statistics is the more traditional approach to hypothesis testing. It calculates the probability of observing results at least as extreme as yours if there were no real difference between variants. That probability is the p-value, which is compared against a predetermined significance threshold, typically 0.05.
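
For a conversion metric, one common way to get that p-value is a two-proportion z-test. The sketch below is a minimal version under the usual normal approximation; the function name and example counts are hypothetical, and it does not handle the edge case where no users, or all users, convert.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one pooled conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    p = two_proportion_p_value(100, 1_000, 150, 1_000)
    print(p, "significant" if p < 0.05 else "not significant")
```

The comparison against 0.05 is only valid if the sample size was fixed in advance and the result was not checked repeatedly along the way.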

Benefits of Frequentist Statistics

  • Widely understood, with transparent math and familiar outputs.
  • Results are easy to audit or present in contexts where frequentist methods are the established standard.
  • Sequential testing can be used for continuous monitoring without inflating false positive rates.
  • A good fit when your team is more comfortable with p-values and confidence intervals.

Drawbacks of Frequentist Statistics

  • The binary nature of significance decisions can lead to misinterpretation. Teams frequently misread "not significant" as "no effect" rather than "insufficient evidence to detect an effect."
  • Without sequential testing enabled, results are only valid if you don't peek before reaching a pre-determined sample size.

Concepts Shared by Both Bayesian and Frequentist Statistics

Despite their differences, Bayesian and frequentist statistics share many common concepts:

  • CUPED Compatible: CUPED uses pre-experiment data to reduce noise in metric estimates, allowing you to detect an effect sooner, or to detect a smaller effect with the same sample size. It works with both approaches.
  • Random Assignment: Random assignment is what makes an experiment causal. Violations (users assigned to multiple variants, or assignment correlated with the metric) can invalidate results regardless of which framework you use.
  • Statistical Significance and Confidence Level: Both approaches use a threshold to determine when a result is reliable enough to act on. In frequentist statistics this is the significance level, while in Bayesian statistics it's expressed as a probability threshold. In both cases, set the threshold before the experiment starts.
  • Statistical Power and Sample Size: Power is the probability of detecting a real effect when one exists. Most teams aim for 80% as a minimum. Before starting an experiment, both approaches require a power analysis to determine the sample size you need to detect the effect you're looking for. Without one, you risk either stopping too early and acting on noise, or running longer than necessary. Power analysis is less prevalent in Bayesian statistics, but if you have a stopping criterion, computing your power to reach it is still valuable.
  • Peeking and False Positive Risk: Both approaches are susceptible to inflated false positive rates if you stop early based on favorable results. (GrowthBook's frequentist stats engine enables sequential testing to safely allow early stopping.)
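
The CUPED item in the list above can be sketched in a few lines. The idea is to subtract the part of each user's metric that is predictable from their pre-experiment behavior, which lowers variance without shifting the mean. This is a simplified single-covariate sketch; the function name and the sample data are hypothetical, and in practice the adjustment coefficient is usually estimated on data pooled across all variants.

```python
from statistics import mean, variance


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """Return CUPED-adjusted metric values using one pre-experiment covariate."""
    pre_mean = mean(pre_metric)
    metric_mean = mean(metric)
    # theta = cov(pre, metric) / var(pre); the (n - 1) factors cancel out.
    cov = sum((x - pre_mean) * (y - metric_mean) for x, y in zip(pre_metric, metric))
    var = sum((x - pre_mean) ** 2 for x in pre_metric)
    theta = cov / var
    # Remove the component of the metric explained by pre-experiment behavior.
    return [y - theta * (x - pre_mean) for x, y in zip(pre_metric, metric)]


if __name__ == "__main__":
    pre = [1.0, 2.0, 3.0, 4.0, 5.0]    # e.g. purchases in the month before the test
    post = [2.1, 3.9, 6.2, 7.8, 10.0]  # purchases during the test
    adjusted = cuped_adjust(post, pre)
    print("variance before:", variance(post))
    print("variance after: ", variance(adjusted))
```

The stronger the correlation between pre-experiment and in-experiment behavior, the larger the variance reduction.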

Which Statistical Approach Should You Use?

Use Bayesian when you want probability-based results or have well-established priors that can reduce uncertainty in smaller samples.

Use Frequentist when results need to meet an established statistical standard, or when you want to enable sequential testing. 

Step-By-Step A/B Testing Process

How you plan and run a test determines whether the results can actually be trusted. Here’s the step-by-step process from developing a hypothesis all the way through to implementing a winning variant. 

Step 1: Research and Identify Opportunities

Good experiments start with a clear understanding of where the opportunity is. For product development teams, that usually means looking at where users drop off, where engagement is lower than expected, or where there's a meaningful gap between how a feature was designed to be used and how it's actually used.

Start with quantitative data like funnel drop-off rates and feature adoption rates to identify potential opportunities, then use qualitative data like user interviews, support tickets, and session recordings to better understand the situation.

How to Prioritize Experiments

Not every problem is worth testing. The best starting point is your team's current roadmap and goals. If you're focused on improving activation this quarter, test things that affect activation. Experiments that don't connect to what your team is actively solving are a distraction, however interesting the hypothesis.

Before committing, use an objective scoring system or prioritization framework like ICE to evaluate each opportunity:

  • Impact: How much could this improve the metrics your team cares about?
  • Confidence: How sure are you it will work, based on the data and research you have?
  • Ease: How much engineering effort does it require to implement and instrument?
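
The ICE scoring above can be sketched in a few lines. This is a minimal illustration, assuming each dimension is scored from 1 to 10 and the three scores are multiplied (some teams average them instead). The `Idea` class and the backlog entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10: potential improvement to the target metric
    confidence: int  # 1-10: strength of the supporting data and research
    ease: int        # 1-10: higher means less engineering effort

    @property
    def ice_score(self) -> int:
        # Assumption: multiply the three scores; averaging is also common.
        return self.impact * self.confidence * self.ease


def rank_ideas(ideas: list[Idea]) -> list[Idea]:
    """Return ideas sorted from highest to lowest ICE score."""
    return sorted(ideas, key=lambda idea: idea.ice_score, reverse=True)


# Hypothetical backlog, for illustration only.
backlog = [
    Idea("Shorten onboarding to three steps", impact=8, confidence=7, ease=5),
    Idea("Reorder navigation dropdowns", impact=2, confidence=4, ease=9),
    Idea("Use-case based pricing guide", impact=7, confidence=6, ease=6),
]

ranked = rank_ideas(backlog)
for idea in ranked:
    print(f"{idea.ice_score:>4}  {idea.name}")
```

The exact scale matters less than scoring every idea the same way, so the ranking reflects the ideas rather than who pitched them.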

Step 2: Form a Strong Hypothesis

A good hypothesis forces you to be specific about what you're changing, why you expect it to work, and how you'll know if it did.

A weak hypothesis sounds like: "Let's try a shorter onboarding flow." 

A strong one sounds like: "Reducing the onboarding flow from five steps to three will increase 7-day activation because users are dropping off at step three."

Here are a few more examples of weak and strong hypotheses.

Weak Hypothesis | Strong Hypothesis
The recommendation widget will increase sales. | Adding a personalized recommendation widget below the product description will increase average order value because users who see relevant product suggestions will add more items to their cart before checkout.
We will surface errors more clearly. | Replacing generic error messages with specific guidance on how to fix the issue will increase form completion rates, because users currently abandon forms at the error state without understanding what went wrong.
We think the pricing page is confusing. | Replacing the feature comparison table with a use-case based pricing guide will increase trial conversion, because users in exit surveys say they can't determine which plan is right for them.
Dark mode would probably improve engagement. | Adding a dark mode option to the dashboard will increase daily active usage among power users because power users spend more than 4 hours per day in the product and have requested this feature in support tickets.

Use this structure as a starting point for writing your own hypotheses:

[Specific change] will cause [measurable effect] because [reasoning based on research].

Step 3: Design Your Experiment

Most of the work in running a good experiment happens before you launch. The decisions you make at the experimental design stage will determine how useful your experiment is.

Define Your Measurement Criteria

Before you build anything, be clear on what you're measuring and why. Your primary metric should flow directly from your hypothesis. It's the specific effect you expect to see. If your hypothesis is that reducing onboarding steps will improve 7-day activation, then 7-day activation is your primary metric. 

  • Primary Metric: The single metric that determines whether the variant wins or loses, defined before the test starts and tied directly to your hypothesis. 
  • Secondary Metrics: Metrics that you’re not specifically trying to improve but that may help you further understand your experiment's impact, including related metrics and lagging indicators.
  • Guardrail Metrics: Metrics that you’re specifically not trying to hurt. 

Here’s what each metric might be for our onboarding experiment example. 

Hypothesis: Reducing the onboarding flow from five steps to three will increase 7-day activation rate by 15%, because users are dropping off at step three.

Primary Metric: 7-day activation

Secondary Metrics:

  • 30-day retention rate
  • Time to complete onboarding flow
  • Number of core features activated within 7 days

Guardrail Metrics:

  • Invited users (the number of referrals should at least stay the same compared with the control)
  • Support tickets related to onboarding confusion
  • Error rate on onboarding steps

Calculate Your Required Sample Size

The best way to ensure good decision making with experiments is to know how much data you need up front. Without a sample size calculation, you won't know whether you can trust your results or when to end the experiment. Most modern experimentation platforms include a power calculator. You'll need four inputs:

  • Baseline Metric Value: Your current metric value, from your recent historical data. For conversion rates, this is a percentage; for continuous metrics like revenue or session duration, it's an average. In GrowthBook, we can compute this for you on historical data filtered down to your likely experiment population.
  • Minimum Meaningful Effect: The smallest improvement that would be worth shipping. Anything smaller isn't worth the extra sample size needed to detect it.
  • Confidence Level: Typically 95%
  • Statistical Power: Typically 80%, meaning a 20% chance of missing a real effect.

The calculator will tell you how many units you need per variant. Multiply that by the number of variants, then divide by your average daily volume of that unit to get your required duration. The unit might be daily active users, daily email sends, or accounts, depending on what you're randomizing on. Make sure you're calculating based only on the population that meets your targeting criteria, not your total user base.

Many tests should run for at least two full business cycles, typically two weeks minimum, to account for day-of-week behavior patterns even if you reach your sample size sooner.
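
To make the calculation concrete, here is a minimal sketch of what a power calculator does for a conversion-rate metric. It assumes a two-sided test on two proportions with the standard normal approximation; real calculators, including GrowthBook's, may use different formulas. The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            confidence: float = 0.95, power: float = 0.80) -> int:
    """Approximate units per variant for a two-sided test on two proportions."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate if the minimum effect is real
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


def duration_days(n_per_variant: int, variants: int, daily_units: int,
                  minimum_days: int = 14) -> int:
    """Days needed to fill every variant, never shorter than the minimum."""
    needed = math.ceil(n_per_variant * variants / daily_units)
    return max(needed, minimum_days)


# Hypothetical inputs: 10% baseline activation, 15% relative lift worth shipping.
n = sample_size_per_variant(baseline=0.10, relative_mde=0.15)
days = duration_days(n, variants=2, daily_units=500)
print(f"{n} users per variant, about {days} days")
```

Notice how sensitive the result is to the minimum meaningful effect: because the effect is squared in the denominator, halving it roughly quadruples the sample you need.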

Designing for Trustworthy Results

Experiment implementation is a crucial part of running a clean causal experiment and learning what you actually set out to learn. 

  • Test one feature at a time so results can be attributed to a specific cause.
  • Ensure random, equal traffic distribution between variants.
  • Keep everything else identical between versions.
  • Think upfront about which user segments might respond differently to the change and whether they should be tested separately.
  • Account for novelty effect. Users sometimes behave differently simply because something is new, which can cause early results to look better than they are.
  • Document the experiment in your log before launch, including hypothesis, metrics, targeting criteria, implementation details, and expected end date.

Step 4: Set Up Your Experiment and Validate the Implementation

Before you launch your experiment, validate that it is configured correctly. Problems caught here are easy to fix. Problems caught after two weeks of bad data are not.

  • Confirm the events you need are firing correctly and consistently across platforms and devices.
  • Run an A/A test if you're setting up a new experimentation platform or making changes to your assignment logic. 
  • Check that both variants are free from bugs and function as expected. 

Step 5: Launch and Monitor

Once your experiment is live, your job is mostly to leave it alone. The temptation to check results early is real, especially when there's pressure to ship, but acting on interim results is one of the most common ways teams produce conclusions they can't trust.

Monitor only for:

  • Technical Errors or Bugs: If something is broken, stop the test and fix it.
  • Guardrail Metric Violations: If an important metric is getting meaningfully worse, it may be worth stopping early regardless of significance.
  • Sample Ratio Mismatch: An uneven traffic split is a signal that something is wrong with your assignment logic.

Everything else can wait until the test reaches its required sample size. If you need the flexibility to act on results before that point, enable sequential testing.
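
A sample ratio mismatch check is one of the few things worth automating from day one. The sketch below runs a chi-square goodness-of-fit test on a two-variant split using only the standard library. The 0.001 threshold is a commonly used convention and an assumption here, not something your platform necessarily uses, and the counts are made up.

```python
import math

SRM_THRESHOLD = 0.001  # assumed alert threshold; pick one and apply it consistently


def srm_p_value(control_count: int, variant_count: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-variant traffic split."""
    total = control_count + variant_count
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi_square = ((control_count - expected_control) ** 2 / expected_control
                  + (variant_count - expected_variant) ** 2 / expected_variant)
    # With two variants there is one degree of freedom, so the chi-square
    # survival function reduces to erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical counts from a test configured for a 50/50 split.
p_value = srm_p_value(control_count=5000, variant_count=5400)
if p_value < SRM_THRESHOLD:
    print(f"Possible SRM (p={p_value:.6f}): check assignment and tracking")
else:
    print(f"Split looks healthy (p={p_value:.3f})")
```

A very small p-value here doesn't tell you what went wrong, only that the split is unlikely to be due to chance. Treat it as a reason to investigate before trusting any metric results.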

Step 6: Analyze Results Properly

When your experiment reaches its required sample size, resist the urge to declare a winner immediately. Good analysis goes beyond the binary question of whether the variant beats the control.

Some metrics require additional waiting time even after the experiment ends. For example, if you're measuring 7-day activation, you need to wait seven days after the last user was exposed before you can analyze that metric. Build this into your timeline upfront.

When it’s time to analyze the results:

  • Confirm the experiment ran as designed and reached the sample size needed to power it properly.
  • Check the confidence level against your predetermined threshold. A result that doesn't reach 95% confidence isn't automatically worthless. If a variant shows a 70% chance of being best with no meaningful guardrail violations and low implementation cost, many teams will ship it.
  • Verify there was no sample ratio mismatch that could invalidate the results.
  • Check secondary and guardrail metrics to confirm the variant didn't improve the primary metric while quietly harming something else.
  • Analyze results by key segments to check whether the overall result is hiding meaningful differences between groups.
  • Look at practical significance along with statistical significance. Statistical significance on its own doesn't tell you if this was a big win or a small win; you can learn a lot about what worked or didn't by looking at the lift directly, and considering how it compares to the cost of building and maintaining this feature.
  • Document regardless of outcome. Losses are often where the most learning happens.  Take the time to try to learn why your users behaved in a way you didn't expect.

Step 7: Implement and Iterate

Every experiment produces an outcome worth acting on, even when your hypothesis is proven wrong.

  • If your new variant wins, fully implement it and monitor post-launch performance. Use what you learned to sharpen the next hypothesis. A winning experiment often reveals opportunities for further improvement.
  • If your new variant loses, roll back to your control. A losing experiment is still valuable. Analyze why your hypothesis was wrong, what the data suggests about user behavior, and whether a different approach is worth testing. Document the result so the same test doesn't get run again six months later by a different team.
  • If the result is inconclusive, iterate a few times with a learning mindset. An inconclusive result usually means one of three things: the sample size wasn't large enough, the effect is smaller than your minimum detectable effect, or there genuinely isn't a meaningful difference between the variants.

Advanced A/B Testing Strategies 

Once the fundamentals are in place, these advanced techniques can create additional value as your program matures and the questions you're trying to answer get harder.

CUPED 

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses pre-experiment metric data to improve the accuracy of your results. By accounting for pre-existing differences between users before the experiment starts, it reduces the noise in your estimates, meaning you can detect smaller effects with the same traffic, or reach the same level of confidence faster.

GrowthBook's implementation extends CUPED with post-stratification, which uses user attributes like country or plan tier to further reduce variance by isolating the treatment effect from natural differences between groups. The more correlated your pre-experiment data and attributes are with the metric you're measuring, the more variance reduction you'll see.

The main requirement is that you have pre-experiment data for the metric you're testing. It works best for metrics that are frequently observed (engagement rates, session counts, revenue) and is less effective for new users or rare events where there's little pre-experiment history to draw on.

Example: Netflix reported CUPED reduced variance by roughly 40% for some key engagement metrics. Microsoft reported it was equivalent to adding 20% more traffic for a majority of metrics on one product team.
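
The core CUPED adjustment is short enough to sketch. This is the textbook single-covariate form, not GrowthBook's implementation: each user's metric is shifted by how far their pre-experiment value sits from the average, scaled by a coefficient theta. The data below is made up, and in practice theta is estimated on users pooled across all variants.

```python
from statistics import fmean, pvariance


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    covariance = fmean([(x - mean_x) * (y - mean_y)
                        for x, y in zip(pre_metric, metric)])
    # theta = cov(x, y) / var(x) minimizes the variance of the adjusted metric.
    theta = covariance / pvariance(pre_metric)
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


# Hypothetical per-user session counts before and during the experiment.
pre_sessions = [10, 12, 8, 15, 9, 11]
sessions = [11, 14, 9, 16, 10, 12]

adjusted = cuped_adjust(sessions, pre_sessions)
print(f"variance before: {pvariance(sessions):.3f}")
print(f"variance after:  {pvariance(adjusted):.3f}")
```

The adjustment leaves the mean untouched and only removes the part of the variance that pre-experiment behavior already explains, which is why it helps most when past and current behavior are strongly correlated.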

Quantile Testing

Most A/B tests compare means across variants, which works well when the effect is evenly distributed across users. Quantile testing compares percentiles instead, making it the right tool when you care about what's happening at the extremes. A change that barely moves average page load time might look neutral on a mean test while actually fixing a severe performance problem affecting your slowest 1% of users.

The main consideration is sample size. Extreme quantiles (P99, P99.9) require large samples to produce reliable estimates. It also works best when you have a clear hypothesis about which part of the distribution you're trying to move.

Example: An engineering team testing a backend optimization uses a P99 latency metric to confirm the change reduced worst-case load times by 7ms, even though the mean improvement was too small to detect.
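
To see why a percentile can reveal what a mean hides, here is a small sketch that computes a nearest-rank P99 alongside the mean. The latency values are invented, and the code only computes the point estimates. A real quantile test also needs a confidence interval around the difference.

```python
import math
from statistics import fmean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: most requests are fast, 2% hit a slow path.
control = [100] * 980 + [2000] * 20
variant = [100] * 980 + [1200] * 20  # the change only speeds up the slow path

for name, latencies in (("control", control), ("variant", variant)):
    print(f"{name}: mean={fmean(latencies):.0f}ms  "
          f"p50={percentile(latencies, 50)}ms  p99={percentile(latencies, 99)}ms")
```

The median is identical and the mean moves modestly, while P99 drops sharply, which is the kind of tail improvement a mean-only analysis can easily miss.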

Multi-Armed Bandits

A multi-armed bandit is an adaptive experiment that shifts traffic toward better-performing variants as data comes in, rather than maintaining a fixed split throughout. Unlike a standard A/B test, which waits until the end to declare a winner, a bandit continuously reallocates traffic based on which variant is performing best on a single decision metric. GrowthBook uses Thompson sampling, a Bayesian algorithm that balances exploration (testing all variants) with exploitation (sending more traffic to the best performer).

Bandits work best when you have a clear single metric to optimize, five or more variants to test, and care more about minimizing exposure to poor-performing variants than understanding why each one performed the way it did. They're less suited to situations with long feedback loops, multiple goal metrics, or where statistical rigor matters more than speed.

Example: An ecommerce team testing five different product page layouts uses a bandit to automatically shift traffic toward the best-performing variant. This allows them to quickly capitalize on a winner during time-sensitive, days-long promotions like a Black Friday sale, while also reducing the number of users exposed to lower-converting layouts.
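
Thompson sampling is easier to understand in code than in prose. The sketch below is a generic Beta-Bernoulli version for a conversion metric, not GrowthBook's implementation: it samples a plausible conversion rate for each variant from its posterior and gives each variant a traffic share equal to how often it comes out on top. The counts and the uniform Beta(1, 1) prior are assumptions for illustration.

```python
import random


def thompson_shares(successes: list[int], failures: list[int],
                    draws: int = 10_000, seed: int = 0) -> list[float]:
    """Estimate each arm's traffic share as the fraction of posterior draws it wins."""
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        # Sample a plausible conversion rate per arm from Beta(1 + s, 1 + f).
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical results so far: conversions out of 1,000 users per layout.
conversions = [30, 45, 60]
non_conversions = [970, 955, 940]

shares = thompson_shares(conversions, non_conversions)
for layout, share in zip(("A", "B", "C"), shares):
    print(f"layout {layout}: {share:.1%} of upcoming traffic")
```

Weaker variants still receive a little traffic, which is the exploration half of the tradeoff: if early results were a fluke, the bandit has a chance to correct itself.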

Cluster Experiments

Most experiments randomize at the user level, but some products require randomization at a coarser level of granularity. In B2B software, for example, you might need everyone at a company to see the same experience. Showing different variants to different users within the same organization would create confusion and contaminate results. Cluster experiments solve this by randomizing at the group level (the organization, the school, the household) while still analyzing outcomes at the individual level.

The main challenge is that cluster-level randomization reduces your effective sample size. You're randomizing across far fewer clusters than individual users, which means you need more clusters to reach significance. GrowthBook supports cluster experiments natively: its statistics engine handles the complexity of analyzing at a different level than you randomize.

Example: A B2B SaaS team testing a new dashboard layout randomizes at the organization level so every user within a company sees the same variant, then analyzes individual user engagement to measure impact.
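
Two pieces of a cluster experiment can be sketched briefly: assigning by organization rather than by user, and estimating how much clustering shrinks your effective sample size. The hashing scheme below is a generic illustration, not GrowthBook's hashing algorithm, and the design effect uses the standard formula 1 + (m - 1) × ICC with a made-up intra-cluster correlation.

```python
import hashlib


def assign_variant(experiment_key: str, cluster_id: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Deterministically map a cluster (e.g. an organization) to a variant."""
    digest = hashlib.sha256(f"{experiment_key}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


def effective_sample_size(total_users: int, avg_cluster_size: float,
                          icc: float) -> float:
    """Shrink the user count by the design effect 1 + (m - 1) * ICC."""
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return total_users / design_effect


# Every user looks up the variant with their organization ID, not their user ID,
# so colleagues always land in the same variant.
for user, org in (("ana", "org_42"), ("ben", "org_42"), ("cy", "org_7")):
    print(user, assign_variant("new-dashboard", org))

# Hypothetical: 10,000 users in organizations of about 50, ICC of 0.1.
print(round(effective_sample_size(10_000, avg_cluster_size=50, icc=0.1)))
```

Even a modest correlation between users in the same organization can cut the effective sample size dramatically, which is why cluster experiments usually need to run longer than user-level ones.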

Full-Funnel Testing

Most experiments measure a single metric at a single point in the user journey. Full-funnel testing measures the effect of a change across multiple stages, from initial conversion through to retention, revenue, and long-term engagement. This matters because a change that looks positive at the top of the funnel can have neutral or negative downstream effects that a single-metric test would miss entirely.

The main requirement is having metrics instrumented across the full user journey and enough traffic to detect meaningful differences at each stage. It also requires patience — downstream metrics like 30-day retention take time to manifest, which means full-funnel tests run longer than standard conversion tests.

Example: A team testing 7-day versus 14-day free trial lengths measures not just trial starts but 30-day conversion to paid, finding that the longer trial increased signups but reduced urgency to convert, producing a net negative revenue impact.

Long-Term Holdouts

Individual experiments measure the impact of a single change. Long-term holdouts measure the cumulative impact of all your changes over time. A small group of users is withheld from new features and experiments for an extended period, typically a quarter, while the rest of the product moves forward. Comparing the holdout group to the general population reveals the true long-term value of everything you shipped, including any unexpected interactions between features that individual tests couldn't detect.

The main tradeoff is that a small percentage of users (typically around 5%) experience a degraded product for the duration. 

Example: A product team runs a quarterly holdout and discovers that the cumulative lift from five experiments, each with a 1% lift, is only 3% relative to the holdout group because of diminishing returns.
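
The arithmetic behind that example is worth spelling out. If five 1% lifts were fully independent and compounded, you would expect slightly more than 5% in total. The short sketch below computes that naive expectation and compares it with the 3% observed against the holdout. The function name is hypothetical.

```python
def compounded_lift(lifts: list[float]) -> float:
    """Total lift if every individual lift stacked multiplicatively."""
    total = 1.0
    for lift in lifts:
        total *= 1 + lift
    return total - 1


expected = compounded_lift([0.01] * 5)  # what the individual tests imply
observed = 0.03                         # what the holdout comparison shows
shortfall = expected - observed

print(f"expected {expected:.2%}, observed {observed:.2%}, "
      f"shortfall {shortfall:.2%}")
```

The gap between the two numbers is what a holdout measures and individual experiments cannot: overlap between features, fading effects, and wins that were partly noise.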

Incorporating Research

A/B tests tell you what happened, but they can’t tell you why. Combining quantitative experiment results with supplemental research (user interviews, session recordings, surveys, usability testing) gives you both. A variant that wins on conversion but generates support tickets is a signal worth investigating. A variant that loses might reveal through user interviews that the hypothesis was right but the execution was wrong.

Supplemental research is most valuable at two points: before an experiment, to sharpen the hypothesis, and after an inconclusive or surprising result, to understand what the data couldn't tell you and to help generate your next hypothesis.

Example: A team runs an experiment on a new onboarding flow that produces an inconclusive result. User interviews reveal that users understood the new flow better but felt uncertain about committing without seeing the product first, leading to a new hypothesis worth testing.

Common A/B Testing Mistakes (and How to Avoid Them)

Every experimentation program can make mistakes. Learning to recognize common experimentation pitfalls is the first step to not repeating them.

1. Running Experiments Without Big Enough Samples

An underpowered experiment is one that doesn't have enough units to reliably detect the effect you're looking for. It happens when teams skip the power analysis and launch tests without knowing whether their population size or expected traffic is sufficient. Without enough data, you'll either get an inconclusive result or one that looks real but isn't stable enough to act on.

Example: Running tests on pages with fewer than 1,000 visitors per week.

Solution: Focus on the highest-traffic pages or make bolder changes that require smaller samples to detect.

2. Testing Changes That Are Too Small

A change that's too small to produce a detectable effect is a change that's too small to test. It happens when teams focus on incremental tweaks (like a slightly different button shade or a minor copy change) rather than changes that are likely to meaningfully affect user behavior. When these tests do reach significance, the effect size is often too small to justify implementing, and every underpowered test on a trivial change is a test you didn't run on something that could actually move your metrics.

Example: Testing the order of navigation menu dropdowns when users don't understand what your product does.

Solution: Match the boldness to your traffic volume. Smaller sample sizes need bigger swings.

3. Stopping Tests Early

Stopping a test before it reaches its required sample size is one of the most common ways teams produce results they can't trust. It happens when interim results look promising, and there's pressure to ship. The numbers seem to confirm the hypothesis, so stopping feels justified. The problem is that early data is noisier than final data, and a result that looks significant at day five may look very different at day twenty. Stopping early inflates your false positive rate, meaning you'll ship changes that don't actually work.

Example: Ending tests as soon as the p-value hits 0.05.

Solution: Predetermine the sample size and duration, and stick to them. If you need the flexibility to monitor continuously, enable sequential testing.
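
You can see the cost of peeking with a quick simulation. The sketch below runs A/A tests, where both arms come from the same distribution and any "significant" result is a false positive by construction. It compares stopping at the first significant peek against checking only once at the end. The number of experiments, peeks, and users are arbitrary choices for illustration.

```python
import math
import random


def simulate_false_positive_rates(experiments: int = 300, peeks: int = 10,
                                  users_per_peek: int = 100, seed: int = 1,
                                  z_critical: float = 1.96) -> tuple[float, float]:
    """Return (peeking_rate, fixed_horizon_rate) for simulated A/A tests."""
    rng = random.Random(seed)
    stopped_early = 0
    significant_at_end = 0
    for _ in range(experiments):
        sum_a = sum_b = 0.0
        n = 0
        crossed = False
        z = 0.0
        for _peek in range(peeks):
            for _user in range(users_per_peek):
                # Both arms draw from the same distribution: no true effect.
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += users_per_peek
            # z-test for a difference in means with known unit variance.
            z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
            if abs(z) > z_critical:
                crossed = True  # a peeking team would have stopped here
        stopped_early += crossed
        significant_at_end += abs(z) > z_critical
    return stopped_early / experiments, significant_at_end / experiments


peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"false positives when peeking: {peeking_rate:.1%}")
print(f"false positives at fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate comes out several times higher, even though nothing about the data changed. Sequential testing methods exist precisely to correct for this.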

4. Ignoring External Factors

External factors are events or conditions outside your product that affect user behavior during an experiment. The mistake happens when teams run tests during atypical periods like a seasonal sale, a major product launch, or a news cycle, without accounting for how those conditions might skew results. A winning variant during an unusual period may reflect the context more than the change itself, and applying those results year-round can lead to poor decisions.

Example: Testing during Black Friday and assuming results can be replicated year-round.

Solution: Note external factors and retest important changes during normal periods before permanently implementing them.

5. Shopping Metrics for Significant Results

Metric shopping is when teams run an experiment against many metrics and report whichever ones show significance after the fact. It happens when teams don't define their primary metric before the experiment starts, leaving the door open to interpret results selectively once the data comes in. The more metrics you test, the more likely you are to find a false positive by chance, and a result that emerges from fishing through metrics is not a result you can act on confidently.

Example: Testing 20 metrics, hoping one shows significance.

Solution: Choose your primary metric before starting. Treat others as directional insights rather than stable conclusions.
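
The math behind the 20-metric example is simple enough to check yourself. Assuming the metrics are independent (real metrics are usually correlated, so treat this as a rough guide), the chance of at least one false positive is 1 - (1 - α)^k. The sketch also shows the Bonferroni correction, one common and conservative adjustment that this guide doesn't otherwise cover.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric threshold that keeps the overall rate at or below alpha."""
    return alpha / num_metrics


for k in (1, 5, 20):
    print(f"{k:>2} metrics: {family_wise_error_rate(k):.0%} chance of a false "
          f"positive, Bonferroni threshold {bonferroni_alpha(k):.4f}")
```

With 20 metrics and no true effect, you're more likely than not to see at least one "significant" result, which is why the primary metric has to be chosen before the data comes in.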

6. Shipping a Winner Without Checking Segment Performance

Aggregate results can hide meaningful differences in how different groups of users respond to a change. It happens when teams declare a winner based on overall performance without breaking results down by segment. A new checkout flow might increase conversion overall but frustrate returning users who've built habits around the old one, or perform well on desktop while degrading the experience on mobile. Shipping without checking may help some users at the expense of others.

Example: The overall winner performs worse for important segments.

Solution: Always analyze results by key segments, like new versus returning users or mobile versus desktop, before implementing.

7. Ignoring Implementation Cost

Not all winning variants are worth shipping. The mistake happens when teams evaluate experimental outcomes purely by metric lift, without accounting for the actual cost of building and maintaining the change. A variant that requires significant refactoring, introduces dependencies on other systems, or creates an ongoing maintenance burden may not be worth implementing even if the results are strong. The lift needs to justify not just the initial build but the long-term cost of owning the change.

Example: A lead categorization model can improve onboarding success, but implementing it requires rebuilding the underlying data model and all downstream dependencies.

Solution: Factor implementation cost into your hypothesis prioritization.

8. Optimizing for Short-Term Metrics

A test can show an increase in conversion while masking longer-term damage. It happens when teams optimize for metrics that look good in a two-week experiment window without considering what happens to users afterward. Dark patterns that trick users into taking actions they might not otherwise take can boost immediate conversion rates while increasing refund rates, reducing retention, and eroding brand trust over time. A confused user and a genuinely converted user can look identical in the short term.

Example: An edtech team's test that shortened a mandatory course tutorial showed a 20% gain in time-to-first-lesson completion, but a 7% decrease in the final exam pass rate.

Solution: Use guardrail metrics to catch downstream damage. If a winning variant hurts retention or drives up support volume, it's not a real win.

9. Running One Test and Moving On

Experimentation compounds over time, but only if teams treat it as a continuous practice rather than a series of isolated events. The mistake happens when shipping a winner feels like the finish line. The variant performed better, the change ships, and the team moves on to the next project without asking what they learned or what to test next. A single experiment answers a single question, but the real magic comes from using that answer to sharpen the next hypothesis, and the next one after that.

Example: A new recommendation algorithm increases platform engagement, but no one investigates which content types drove the increase to further refine it.

Solution: Create a regular testing cadence with iteration built in.

10. Not Documenting Results

Without a record of what was tested, what the hypothesis was, and what the outcome was, institutional knowledge disappears when people leave, and teams waste time re-running experiments that have already been answered. It happens when documentation is treated as an afterthought rather than part of the process.

Example: A variant everyone was confident would win loses. A few months later, a different team has the same idea and runs the same test.

Solution: Maintain a searchable experiment archive and share learnings broadly.

How to Build a Culture of Experimentation

Winning a single A/B test is straightforward. The harder work is building an organization where evidence, not opinion, drives decisions consistently across teams, product areas, and levels of seniority.

An experimentation culture means controlled experiments are the default way of resolving uncertainty. Companies with a strong experimentation culture recognize that their win rate is often only around 20% and that they are terrible at predicting which features will win or lose. That insight creates humility and a determination to test everything. They recognize that the results of a test carry more weight than the instinct of the most senior person in the room, and losing an experiment is treated as useful information rather than a failure.

What a Strong Experimentation Culture Looks Like in Practice

In his book Experimentation Works, Harvard Business School Professor Stefan Thomke identifies seven attributes that characterize organizations where experimentation is genuinely embedded. They're worth understanding not as a checklist but as a description of what the mature state actually looks like.

  1. A Learning Mindset: Experimentation is treated as a continuous process, not a one-time validation. Most experiments won't produce dramatic results, and teams that have internalized this don't treat inconclusive results as wasted effort. They treat them as the cost of learning.
  2. Rewards Consistent With Values: Teams are rewarded for running good experiments, not just winning ones. When compensation is tied to metrics that make experimentation difficult, or when people are punished for null results, the culture quietly dies regardless of what leadership says about it.
  3. Humility: In a true experimentation organization, even the most senior person's assumptions get tested. Leadership's job shifts from making top-down calls to creating the conditions for good experiments and accepting what they find.
  4. Experiments Have Integrity: Strict guidelines govern how experiments are designed and run. This means pre-registered hypotheses, appropriate sample sizes, and agreed-upon metrics before the experiment starts, not after the results come in.
  5. Tools Are Trusted: Experimentation only works if people trust what it produces. If teams routinely question the validity of results or find workarounds to avoid acting on them, the infrastructure exists, but the culture doesn't.
  6. Exploration and Exploitation Are Balanced: There's an inherent tension between running experiments to learn and shipping product to grow. Organizations that only exploit what they already know stop learning. Organizations that only explore never ship. Senior leadership has to manage that balance deliberately.
  7. Leadership Actively Promotes It: Companies tend to become less innovative as they grow, as the distance between senior leadership and the teams doing the work increases. Experimentation cultures require leaders who actively champion the practice, not just endorse it in all-hands presentations.

Where Does Your Team Sit on the Experimentation Maturity Model?

Thomke and his colleagues describe five stages of experimentation maturity. These are a great way to honestly assess where your organization is and what it would take to move forward.

Stage 1: Awareness

At the awareness stage, leadership values experimentation, but no processes, tools, or infrastructure are in place. Decisions are still mostly based on experience and intuition. If your team occasionally runs an experiment when a decision is particularly contested, this is probably where you are.

Stage 2: Belief

At the belief stage, leadership accepts that a more disciplined approach is needed and starts investing in tools and dedicated teams. The impact on day-to-day decision-making is still minimal, but the direction is set.

Stage 3: Commitment

At the commitment stage, experimentation becomes core to how the team operates. Some product decisions and roadmap calls now require data from experiments, and the impact on business outcomes is becoming measurable. 

Stage 4: Diffusion

At the diffusion stage, large-scale experimentation is recognized as necessary, and formal standards are rolled out across the organization, supported by tooling and training. Individual teams are no longer the bottleneck.

Stage 5: Embeddedness

At the embeddedness stage, experimentation is fully democratized. Teams design and run their own experiments without central oversight, results are shared automatically across the organization, and the institutional memory of past experiments actively informs new ones.

When Harvard Business School's Baker Research Services compared the stock performance of companies with strong experimentation cultures against the S&P 500 over ten years, those companies outperformed the index by a wide margin. The group included Amazon, Etsy, Facebook, Google, Microsoft, Booking Holdings, and Netflix, organizations that had spent years building the infrastructure and culture for large-scale experimentation.

The Future of A/B Testing

Experimentation is evolving fast. The tools and techniques available today look very different from what existed five years ago, and the next five years will likely bring even more significant shifts. A few trends worth paying attention to:

AI-Powered Testing

AI coding tools are accelerating development velocity in ways that are changing how product teams need to think about experimentation. When engineers can ship features faster, the volume of changes hitting production increases. More features shipping faster means more opportunities for something to hurt retention, conversion, or engagement before anyone catches it. Gradual rollouts and a rigorous experimentation practice matter more as shipping velocity increases.

There's also an entirely new category of things to test. Teams building AI-powered features like recommendation systems, content generation tools, and AI tutors face a challenge that standard A/B testing wasn't designed for. LLMs are non-deterministic: the same input doesn't always produce the same output, and measuring quality requires different metrics than measuring clicks or conversions. Testing whether one model prompt produces better learning outcomes than another requires an experimentation platform that can handle that kind of measurement.

GrowthBook's approach is to accelerate every step of the experimentation lifecycle from directly within the tools developers already use. AI integrations built into the platform include automatic results summaries, hypothesis validation before a test launches, similar experiment detection using vector embeddings, metric definition generation, and SQL generation for data exploration. MCP integration lets you connect your own tools and agents directly to GrowthBook via the MCP server.

Learn more about how to test AI with this practical guide.

Real-Time Personalization

Traditional A/B testing delivers the same experience to everyone in a variant. The next evolution is moving beyond fixed variants toward delivering the optimal experience for each individual user in real time, based on their behavior, context, and predicted response. Multi-armed bandits are an early version of this idea, but the direction is toward much more granular personalization.
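
To make the bandit idea concrete, here is a minimal epsilon-greedy sketch. It is an illustration only, not how any particular platform implements bandits: the function name `createBandit`, the default `epsilon` of 0.1, and the injectable `random` source are all assumptions made for this example.

```javascript
// Minimal epsilon-greedy bandit (illustrative; names and defaults are assumptions).
// Each arm tracks how often it was shown and how often it converted.
function createBandit(armNames, epsilon = 0.1, random = Math.random) {
  const arms = armNames.map((name) => ({ name, pulls: 0, rewards: 0 }));

  // Observed conversion rate for an arm (0 until it has been shown).
  const rate = (arm) => (arm.pulls === 0 ? 0 : arm.rewards / arm.pulls);

  return {
    // Explore a random arm with probability epsilon, otherwise exploit the best one.
    choose() {
      if (random() < epsilon) {
        return arms[Math.floor(random() * arms.length)].name;
      }
      return arms.reduce((best, arm) => (rate(arm) > rate(best) ? arm : best)).name;
    },
    // Record the outcome (true = converted) for the arm that was shown.
    record(name, converted) {
      const arm = arms.find((a) => a.name === name);
      if (!arm) throw new Error(`Unknown arm: ${name}`);
      arm.pulls += 1;
      if (converted) arm.rewards += 1;
    },
    // Snapshot of how each arm has performed so far.
    stats() {
      return arms.map((arm) => ({ name: arm.name, pulls: arm.pulls, rate: rate(arm) }));
    },
  };
}

// Usage: pick an arm per visitor, then report whether that visitor converted.
const bandit = createBandit(["control", "variation_a"]);
const shown = bandit.choose();
bandit.record(shown, true);
```

Unlike a fixed-split A/B test, traffic here drifts toward the better-performing arm as evidence accumulates, which is the property that personalization systems build on.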

Causal Inference

As experimentation programs mature, teams are increasingly using advanced statistical methods to understand cause and effect more precisely, particularly in situations where traditional randomized experiments are difficult or impossible to run. Techniques like difference-in-differences, synthetic control, and instrumental variables are becoming more accessible to product and data teams.
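
As a rough illustration of the simplest of these techniques, difference-in-differences compares the change in a treated group against the change in a comparison group over the same period. The sketch below works on group means only; the function name and all numbers are made up for the example, and the estimate is only meaningful if the two groups would have followed parallel trends without the change.

```javascript
// Difference-in-differences on group means (illustrative numbers and names).
// Assumes the two groups would have followed parallel trends without the change.
function differenceInDifferences({ treatedBefore, treatedAfter, controlBefore, controlAfter }) {
  const treatedChange = treatedAfter - treatedBefore;
  const controlChange = controlAfter - controlBefore;
  return treatedChange - controlChange;
}

const effect = differenceInDifferences({
  treatedBefore: 100,
  treatedAfter: 130,
  controlBefore: 90,
  controlAfter: 105,
});
console.log(effect); // 15
```

The treated group rose by 30 and the comparison group by 15, so the estimated effect of the change is 15.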

Cross-Channel Orchestration

Many experimentation programs are siloed by channel. A web team runs web experiments, a mobile team runs mobile experiments, and the combined effect of changes across both is rarely measured. The direction is toward experimentation infrastructure that can orchestrate and measure tests across web, mobile, email, and other touch points simultaneously.

Privacy-First Experimentation

Privacy regulations and the deprecation of third-party tracking are forcing experimentation platforms to adapt. The approaches gaining traction are those that minimize data movement, work with aggregated rather than individual-level data, and can operate within strict compliance requirements. Platforms that support self-hosting are well-positioned for this shift.

How to Get Started with A/B Testing

Getting started with A/B testing doesn't require a mature experimentation platform or a dedicated data science team, but the setup decisions you make early will either accelerate or constrain your program as it grows.

Get Your Instrumentation Right

Before you can run reliable experiments, you need confidence that the metrics you care about are being tracked correctly and consistently. This means checking that your event logging is complete, that events fire consistently across platforms and devices, and that your data pipeline is reliable.

Skipping this step is one of the most common reasons early experimentation programs produce results no one trusts. A test result is only as good as the data behind it, and discovering instrumentation gaps after a test has run is a frustrating way to learn that lesson.

Start With One Test

Pick a high-traffic surface where you have a clear hypothesis and a metric you can measure. Don't start with the most complex change or the most ambitious idea. Start with something where the feedback loop is short, the instrumentation is straightforward, and you have a reasonable chance of seeing a result. Early wins help build organizational buy-in. Run the test for long enough to reach your required sample size, analyze the results honestly, and document what you learned regardless of the outcome.
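
To get a feel for what "long enough to reach your required sample size" means, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. It assumes a two-sided test at a 5% significance level (z = 1.96), 80% power (z = 0.8416), and equal group sizes; the function names are hypothetical, and your platform's power calculator may use a different method.

```javascript
// Approximate users needed per variant for a two-sided test on conversion rates.
// Assumptions: alpha = 0.05 (z = 1.96), power = 0.80 (z = 0.8416), equal group sizes.
function sampleSizePerVariant(baselineRate, relativeLift, zAlpha = 1.96, zBeta = 0.8416) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / delta ** 2);
}

// Days needed when daily eligible traffic is split evenly across all variants.
function daysToRun(perVariant, variants, dailyUsers) {
  return Math.ceil((perVariant * variants) / dailyUsers);
}

// Example: 10% baseline conversion, hoping to detect a 10% relative lift.
const perVariant = sampleSizePerVariant(0.1, 0.1);
console.log(perVariant, daysToRun(perVariant, 2, 5000));
```

With a 10% baseline and a 10% relative lift, this works out to roughly 14,700 users per variant, which is why high-traffic surfaces make better first tests.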

Chances are, your team already has hypotheses worth testing. Look at support tickets, session recordings, and drop-off points in your funnel. If you want external inspiration, resources like GoodUI.org and the Baymard Institute publish evidence-based UX patterns that can serve as a starting point for simple but effective test ideas.

Find a Leadership Sponsor

Experimentation programs that stick have a senior champion: someone with enough organizational influence to protect the team's time, push back when results are inconvenient, and make the case for investing in the infrastructure. Without one, a single bad test result or a quarter of inconclusive experiments is enough to kill the program before it gets traction.

It’s OK if your sponsor isn’t technical, but they need to believe that making decisions based on evidence is worth the investment, and be willing to say so publicly when the HiPPO in the room disagrees with the data.

Go Deeper

One of the most useful things you can do early is learn from teams that have already built mature experimentation programs.

  • Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu: The most rigorous and practical book on running experiments at scale, written by the people who built experimentation programs at Microsoft, Google, and LinkedIn.
  • Experimentation Works by Stefan Thomke: This book takes a broader look at building an experimentation culture, grounded in research across dozens of organizations.
  • GrowthBook Docs: Detailed guides covering everything from getting started to advanced statistical methods, with a practical guide to scaling experimentation at your company.

Join the Community

Having access to people who have already solved the problems you're facing is one of the most underrated resources in experimentation, and the experimentation community loves to share knowledge and help each other out. 

  • Trustworthy A/B Patterns: GrowthBook is partnering with industry pioneers Ronny Kohavi, Lukas Vermeer, and Jakub Linowski to offer e-commerce companies with over 1 million monthly active users free expert assistance in designing and executing high-impact A/B tests in exchange for the right to publish the results.
  • GrowthBook Slack: A free Slack community where you can ask questions, share learnings, and connect with other teams running experiments, from those just getting started to those running robust programs at scale.
  • Test & Learn Community: A free community of over 2,000 practitioners across experimentation, product, analytics, and research. Members meet regularly on Zoom to discuss topics, hear from industry leaders, and help each other solve real problems. 

How to Choose an A/B Testing Platform

The experimentation platform you choose now will shape your experimentation program for years to come. It determines what you can test, how fast you can move, and how much you can trust your results. And because experimentation infrastructure becomes deeply embedded in your codebase and data pipelines over time, switching platforms is expensive and disruptive enough that most teams avoid it, so it's worth getting right the first time.

Technical Fit and Developer Experience

The platform needs to work with how your team already builds. A tool that requires significant engineering lift to integrate, or doesn't support your tech stack, will create friction from day one and limit who can actually run experiments.

  • Target Use Cases: Was this platform built for product and engineering teams or marketing-led CRO? The answer shapes everything from the SDK architecture to the statistical methods available, and a tool designed for visual editing and landing page optimization will quickly run into issues testing algorithms or server-side features.
  • SDK Coverage: The platform needs to integrate cleanly into how your team already builds, without requiring significant backend engineering every time a new test is created. Look for SDKs that evaluate locally with no network requests (keeping performance impact minimal and ensuring experiments work regardless of connectivity), and that cover your full stack. GrowthBook offers 24+ SDKs covering frontend, backend, mobile, and edge environments, working with virtually any stack.
  • Feature Flag Integration: The best experimentation platforms combine feature flags and experiments in a single tool. This lets you use the same flag to run an experiment, do a phased rollout, and kill a change if something goes wrong, without switching between systems.
  • Integration Complexity: How long does it take to instrument your first experiment? A good platform should have clear documentation and a quick-start path that doesn't require backend engineering for every new test.
  • Scalability: Can it handle your traffic volume without degrading performance or requiring you to limit how much of your user base is exposed to experiments?
  • Environment and Release Management: Does it support separate staging and production environments, and can you roll out changes incrementally without redeploying code?
  • AI and MCP Integration: Does the platform support AI-assisted workflows so your team can work smarter and faster, or support MCP so you can integrate your own tools and agents?

Statistical Rigor and Data Ownership

The platform needs to produce results you can actually trust, and that means being transparent about how the statistics work.

  • Statistical Transparency: Can you see the methodology behind the results? Look for platforms that support both Bayesian and frequentist approaches, publish their stats engine openly, and don't hide their calculations behind proprietary black boxes.
  • Warehouse-Native Analysis: The best platforms let you analyze data directly in your existing warehouse (Snowflake, BigQuery, Redshift, Databricks) rather than requiring you to send data to a third-party system. This means your experiment data lives alongside your product data, you define metrics using SQL you control, and there's no duplicate data pipeline to maintain.
  • Managed Warehouse: If you don’t have your own data warehouse, some vendors offer a pre-configured one. This lets you start from day one with an industry-standard data warehouse without setting one up or building your own data pipelines.
  • Metrics Definition and Governance: Who defines the metrics and how? Look for platforms that let your data team define metrics centrally using your own data definitions, rather than forcing you to redefine them inside the tool.
  • Data Ownership: When you stop using the platform, do you keep your experiment history and learnings? Proprietary platforms that hold your data hostage create switching costs that go beyond the tool itself.
  • Targeting and Segmentation: Can you randomize at the level that makes sense for your product (user, account, organization, session) and analyze results by segment without the platform limiting how you slice the data?

Security, Compliance, and Deployment Options

Security and compliance requirements vary widely across industries, but the cost of getting this wrong is high regardless. Data residency issues, compliance violations, and PII exposure can all stem from choosing a platform that wasn't built with these constraints in mind.

  • Self-Hosting: Can you run the platform on your own infrastructure? Cloud-only platforms create data residency issues for teams with strict compliance requirements and require you to send user data to a third-party system. Self-hosting gives you full control over where your data lives.
  • Privacy and PII Handling: Does the platform require you to send personally identifiable information to its servers to run experiments? Look for platforms that assign experiments locally, with no user data leaving your infrastructure.
  • Open Source vs Proprietary: Open-source platforms allow you to audit the code, customize the platform to your needs, and avoid vendor lock-in, but they require engineering resources for maintenance.
  • Compliance and Regulatory Requirements: If you operate in healthcare, financial services, education or other regulated industries, the platform you choose should support your compliance requirements out of the box.

Accessibility and Collaboration

Experimentation only scales when the whole team can participate. A platform that creates friction for non-technical users will limit how often experiments get run and who benefits from the results.

  • Ease of Use for Non-Technical Teams: Can a product manager set up and launch an experiment without engineering support? Look for intuitive interfaces, clear result summaries, and workflows that don't require SQL or statistics knowledge to navigate.
  • Result Sharing and Reporting: How easy is it to share experiment results with stakeholders? Look for shareable dashboards, exportable reports, and result summaries that translate statistical outcomes into plain language.
  • Experiment Documentation: Does the platform make it easy to document hypotheses, decisions, and learnings in a way that's searchable and accessible to the whole team? A searchable experiment archive is one of the most valuable things an experimentation program can build over time.
  • Permissions and Governance: As your program grows, you need the ability to control who can create, approve, and ship experiments. Look for role-based permissions and approval workflows that you can tailor to how your organization actually operates.

Pricing and Total Cost of Ownership

With experimentation platforms, the sticker price is rarely the full cost. How a platform charges you shapes how much you can experiment, and the wrong pricing model can quietly constrain your program as it grows.

  • Pricing Model: Does the platform charge per event or based on traffic volume? These models create a direct conflict between running more experiments and controlling costs, often forcing teams to test on a fraction of their traffic to avoid overage fees. Look for predictable pricing that doesn't penalize you for growing.
  • Build vs Buy: Are you better off building or buying an experimentation platform? Building an in-house solution gives you full control, but most teams underestimate the complexity, risk, and ongoing cost until they're already committed, and the opportunity cost of those engineers not working on the product is rarely factored in.
  • Modular vs All-in-one Pricing: Some platforms charge separately for server-side, client-side, and feature flag capabilities. What starts as one tool quickly becomes multiple SKUs with compounding costs.
  • Switching Costs: What happens if you outgrow the platform or want to move? Proprietary data formats, locked-in experiment history, and deep SDK integrations all make switching painful. Factor this into your evaluation upfront rather than after you've committed.

Why GrowthBook

GrowthBook is the warehouse-native feature flagging, experimentation, and product analytics platform built for product and engineering teams. It's used by over 3,000 companies, from early-stage startups running their first experiments to enterprises processing billions of feature flag evaluations per day. Here’s why you should consider using GrowthBook:

  • No per-traffic or per-event pricing, so you can run experiments on as much of your traffic as you want without watching costs balloon. Learn more about GrowthBook pricing
  • Analysis runs directly on your existing data warehouse (Snowflake, BigQuery, Redshift, Databricks), with no need to send data to a third-party system.
  • The platform is open source, your data stays yours, and you can self-host if you need full control.
  • Feature flags and experimentation are unified in a single platform, so you're not juggling separate tools for rollouts and tests. 
  • The stats engine supports both Bayesian and frequentist approaches, CUPED, post-stratification, sequential testing, and advanced techniques like cluster experiments and holdouts.
  • Built-in tools for experimentation culture and deep insights: a searchable experiment archive, shareable dashboards, and an interface designed for the whole team, not just data scientists.
  • With 24+ SDKs covering frontend, backend, mobile, and edge environments, it works with virtually any stack.

You can start for free and scale from there. The free tier gives you everything you need to run your first experiments, while the Enterprise plan adds advanced features like holdouts and the governance tools that mature programs need.

Start A/B Testing the Right Way

A/B testing doesn't replace judgment, but it gives judgment something solid to work with. The teams that get the most out of it aren't running the most experiments. They're asking sharper questions, defining better metrics, and building enough rigor into their process that results can actually be trusted.

That's harder to build than it sounds. But the organizations that do it stop having the same arguments about what to ship. They stop reverting changes based on noise. They stop leaving product decisions to whoever made the most compelling case in the last meeting.

In 2026, the companies pulling ahead are the ones replacing guesswork with evidence.

How Khan Academy Optimizes AI Tutoring with Experimentation

Mar 22, 2026

Kelli Hill gave a standout presentation at the Experimentation Island conference on February 24, 2026, walking the audience through Khan Academy's evolution from intuition-based testing to running A/B tests on generative AI features in production. If you missed it, the good news is that Kelli will be joining us for a webinar on April 16, 2026. I'd highly encourage you to register here. Below are my key takeaways from her talk.

A Quick Word on Khan Academy

Khan Academy is a nonprofit with a mission to provide a free, world-class education for anyone, anywhere. They have nearly 200 million registered users and have logged over 63 billion learning minutes on their platform. In 2023, they launched Khanmigo, a generative AI-powered tutor and teaching assistant built on top of their massive library of exercises, articles, and instructional content. Khanmigo is the focus of much of their current experimentation work, and the context for everything Kelli shared.

From Homegrown to a Real Experimentation Stack

Khan Academy has been running experiments since 2011, when they built their first in-house platform on Google App Engine. At their peak, they had hundreds of A/B tests running simultaneously. But over time, the homegrown system slowed down, and when they rewrote their entire backend in 2019 (a million lines of code, migrating off Python 2), they made a deliberate decision not to port their old experimentation tooling.

Instead, they evaluated what was available. Building a new platform in-house was tempting, but they recognized that experimentation infrastructure wasn't their core competency. Buying an enterprise solution would have required downsampling their data, which was a non-starter. They ultimately chose GrowthBook, self-hosting it and connecting it to their existing data warehouse and eventing pipelines. Their chief architect's top priority was that the tool not slow down a site serving a million daily active users, and GrowthBook delivered on that.

The lesson here is one we see repeatedly: organizations that try to build their own experimentation platform almost always end up spending more than expected, moving slower than they'd like, and eventually switching to something purpose-built. Khan Academy's journey is a textbook case of making that transition well.

How Evals Evolved from Vibes to Automated A/B Testing

The most fascinating part of Kelli's talk was the four-phase journey Khan Academy went through to figure out how to measure AI quality. When you're building an AI tutor, you can't just measure click-through rates. The goals are harder: 

  • Increased cognitive engagement
  • An increase in skills on their way to proficiency 
  • Measurable learning gains on external assessments

And LLMs make measurement even harder because they're non-deterministic. The same prompt can produce wildly different outputs each time.

How Khan Academy evolved their AI evaluation techniques

Phase 1: Intuition-driven testing. In September 2022, before ChatGPT had even launched publicly, OpenAI gave Khan Academy early access to GPT-4 via Slack. The team's first experiments were literally typing prompts into Slack and reading the outputs. They quickly discovered problems (GPT-4 confidently told a user that 9 + 5 = 15, then gave the correct answer ten minutes later). Good enough for building intuition about how LLMs behave, but not for building a product.

Phase 2: Structured manual testing. With a deadline to launch alongside GPT-4's public announcement in March 2023, they built an internal prompt playground for more repeatable testing. Faster than Slack, but still relied on humans to read outputs and judge quality.

Phase 3: Automated post-hoc evals. This is where things got serious. They assembled a team of PhDs in education to define what good tutoring actually looks like, then had human raters apply that rubric to chat transcripts, targeting 85% inter-rater agreement. Once they had that ground-truth dataset, they used it to train an LLM-as-judge to label transcripts at scale. The key insight: many teams spin up LLM-as-judge systems with no ground truth, resulting in unreliable results. Khan Academy invested in the hard work of human annotation first. Once the machine matched human accuracy, they scaled it to process thousands of interactions nightly.
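
For readers who want to see what an agreement target like this measures, here is a minimal sketch of simple percent agreement between two sets of labels. The talk summary doesn't say which agreement statistic Khan Academy used, so treat this as one plausible reading; the function name and the labels are hypothetical. Note that plain percent agreement does not correct for agreement by chance, which statistics such as Cohen's kappa do.

```javascript
// Share of transcripts where two raters assigned the same label (simple percent agreement).
function percentAgreement(raterA, raterB) {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error("Both raters must label the same non-empty set of transcripts");
  }
  const matches = raterA.filter((label, i) => label === raterB[i]).length;
  return matches / raterA.length;
}

// Agreement between an LLM judge and the human ground truth can be computed the same way.
const human = ["good", "good", "gives_answer", "good"];
const judge = ["good", "good", "gives_answer", "gives_answer"];
console.log(percentAgreement(human, judge)); // 0.75
```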

Phase 4: A/B testing in production. With reliable automated evals in place, they could finally run controlled experiments on prompt changes, system instructions, and even entire model swaps, all measured against metrics like cognitive engagement, item performance, undesirable tutoring behaviors (like giving away answers), and latency as a guardrail. This is the stage they're in now, with 64 completed experiments, 29 running, and 13 queued as of February 2026.

The takeaway: as AI products mature, your evaluation methods need to mature with them. You can't skip straight to production A/B testing without the foundation of knowing what "good" looks like.

The Math Agent Story: What Iterative AI Experimentation Actually Looks Like

Kelli shared a concrete example that perfectly illustrates how A/B testing enables teams to "hill climb" toward better AI quality. The problem: Khanmigo had a math agent, essentially a calculator it could call to verify computations. Great for accuracy, but it added latency that was painful in classroom settings.

Here's how the iterations played out:

Iteration 1: Remove the math agent entirely. Latency improved, but math errors doubled. Rolled back immediately.

Iteration 2: Switch to GPT-5. Latency decreased, but math accuracy still suffered. Rolled back.

Iteration 3: Optimize the math agent's prompts. They tightened the system instructions to be more efficient. Latency dropped by three seconds, and math accuracy held. A real win.

Iteration 4: Give the math agent a faster model. Reduced latency by another 300 milliseconds with stable accuracy.

Iteration 5: Time-box the math agent's execution. Further latency reduction, accuracy still stable.

Without A/B testing, the team might have shipped Iteration 1 or 2 and unknowingly degraded the learning experience. The experiments gave them the confidence to reject changes that looked good on one metric but failed on the one that mattered most. This is what "hill climbing" looks like in practice: hypothesis, test, measure, iterate. No single change was transformative. The cumulative effect was.
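
The decision rule behind those iterations can be sketched in a few lines: a change ships only if it improves the goal metric without breaching the guardrail. This is a simplified illustration, not Khan Academy's actual logic. The function name, the 1-point accuracy tolerance, and the inputs are assumptions, and a real decision would also account for statistical uncertainty in both deltas.

```javascript
// Decide on a change using a goal metric plus a guardrail (thresholds are assumptions).
// latencyDeltaMs: change in latency vs control (negative = faster).
// accuracyDelta: change in accuracy vs control (negative = worse).
function decide({ latencyDeltaMs, accuracyDelta }, maxAccuracyDrop = 0.01) {
  if (accuracyDelta < -maxAccuracyDrop) return "roll back"; // guardrail breached
  if (latencyDeltaMs < 0) return "ship"; // faster, and accuracy held
  return "keep iterating";
}

// Faster but less accurate (hypothetical numbers): rejected despite the latency win.
console.log(decide({ latencyDeltaMs: -2000, accuracyDelta: -0.05 })); // "roll back"

// Three seconds faster with accuracy unchanged: accepted.
console.log(decide({ latencyDeltaMs: -3000, accuracyDelta: 0 })); // "ship"
```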

From Speed Bump to Safety Net: The Cultural Shift

Perhaps the most important takeaway from Kelli's talk was about culture. Before Khanmigo, experimentation at Khan Academy was seen as a speed bump. Product teams wanted to ship based on strong founder intuition and internal conviction. Running an A/B test felt like an obstacle to velocity.

Generative AI changed that completely. LLMs are unpredictable enough that even small changes to prompts or system instructions can produce dramatically different outputs. Teams quickly learned that shipping without testing was genuinely risky. The same engineers who once resisted experimentation now actively request it.

Experimentation went from being perceived as something that slows you down to being the safety net that gives teams the confidence to move fast. That cultural transformation, more than any individual experiment result, may be the most valuable outcome of Khan Academy's journey.

Want to hear the full story from Kelli? She'll be joining us for a live webinar on April 16, 2026.

Feature Flags: What They Are, How They Work, and Why They Matter

Mar 2, 2026

It’s Friday, quarter to 5:00 PM. Your team deploys a major checkout redesign to production. Within minutes, error rates start spiking.

Your Slack’s on fire and your CEO is asking a ton of questions. Next thing you know, you’re staring down a long night of reverting commits and explaining what happened.

Now imagine the same scenario with one change. You disable the feature in 10 seconds with a single click, without redeploying code. 

That’s the difference between deploying code and deploying code behind a feature flag.

In this guide, we’ll cover what feature flags are and how product and engineering teams can use them successfully.

What Are Feature Flags?

A feature flag is a conditional mechanism in your code that lets you toggle application behavior at runtime, without deploying new code.

You wrap a feature in a flag, deploy it in an “off” state, and turn it on when you’re ready. If something goes wrong, you turn it off, which rolls back the deployed feature.

Feature flags are also referred to as feature toggles or feature switches, but they all describe the same mechanism. At their simplest, feature flags are if/else blocks that check a configuration value to decide which code path runs:

const newCheckout = useFeatureIsOn("new-checkout");

if (newCheckout) {
  return <NewCheckoutFlow />;
} else {
  return <LegacyCheckout />;
}

You deploy this code with the feature turned off, so everyone sees the legacy checkout. When you’re ready, you flip the flag on to release the feature, and flip it back off if something goes wrong or you’re done testing it.

How Do Feature Flags Work?

Feature flag systems have three main components that work together to give you complete control over your features:

  1. Flag Configuration
  2. Flag Delivery
  3. Flag Evaluation

1. Flag Configuration

Flag configuration is where you define your flags and the rules that govern them. That can be as simple as a config file or as sophisticated as a dedicated feature management platform with a user interface (UI), audit logs, and role-based access.

Here’s an example configuration for a new-checkout flag:

{
  "new-checkout": {
    "defaultValue": false,
    "rules": [
      {
        "condition": {
          "beta": true
        },
        "force": true
      }
    ]
  }
}


This config enables the new checkout only for beta testers, so you can test the new flow with a small set of users before rolling out the final version.

Note: If you’re wondering why you can’t set environment variables for this, it’s because those variables require redeployment to change. Flags don’t.

2. Flag Delivery

Once you’ve defined your flags, the configuration needs to reach your application. This can be through an included file at build time, API calls, streaming updates, or a mix of all three.

The method you choose decides how fast the changes propagate. That’s why platforms like GrowthBook deliver via server-sent events (SSE) to push changes immediately. The changes come through within milliseconds.
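
Whatever transport you pick, the application side tends to look similar: something holds the latest configuration and tells the rest of the app when it changes. The sketch below is a hypothetical in-memory store, not part of any SDK; `createFlagStore` and its methods are names invented for this example. A bundled file would call `update` once at startup, a poller would call it on an interval, and a streaming connection would call it whenever the server pushes a change.

```javascript
// In-memory flag store: whichever transport you use (bundled file, polling, or a
// streaming push) ends by calling update() with the latest configuration.
function createFlagStore(initialFeatures = {}) {
  let features = initialFeatures;
  const listeners = new Set();
  return {
    // Look up the current definition of a flag.
    get: (key) => features[key],
    // Replace the configuration and notify subscribers so the app can react.
    update(nextFeatures) {
      features = nextFeatures;
      listeners.forEach((listener) => listener(features));
    },
    // Register a change listener; returns an unsubscribe function.
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener);
    },
  };
}

// Usage: start with a bundled default, then apply a pushed update.
const flagStore = createFlagStore({ "new-checkout": { defaultValue: false } });
flagStore.subscribe(() => console.log("flags changed"));
flagStore.update({ "new-checkout": { defaultValue: true } });
```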

3. Flag Evaluation

The final component is where your application actually resolves a flag’s value. The SDK (or your custom code) takes the user’s attributes and evaluates them against the flag’s rules. Based on that, the SDK returns the appropriate value.

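import { GrowthBook } from "@growthbook/growthbook";

// Beta user: attributes match the rule, so the flag is forced on
const betaGb = new GrowthBook({
  attributes: { id: "user_123", beta: true }
});
betaGb.isOn("new-checkout"); // Returns: true

// Regular user: no rule matches
const regularGb = new GrowthBook({
  attributes: { id: "user_456", beta: false }
});
regularGb.isOn("new-checkout"); // Returns: false (uses defaultValue)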

In this example, the beta user’s attributes match the rule, so the flag evaluates to true and the new checkout renders. The regular user’s attributes match no rule, so the flag falls back to the default value of false.

Note: When a flag is disabled (turned off entirely, not just set to false), most platforms, including GrowthBook, evaluate it as null rather than false. In that case, the fallback value you passed in code is used, if you provided one. A boolean check won’t throw an error; it simply evaluates to false.
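
To show what evaluation involves under the hood, here is a deliberately simplified stand-in for what an SDK does with the new-checkout configuration from earlier. It is not GrowthBook's actual evaluation logic: `evaluateFlag`, `matches`, and the exact-match condition semantics are assumptions made for this sketch, and real SDKs support far richer targeting.

```javascript
// Simplified flag evaluation: first matching rule wins, else the default value.
// This is an illustrative stand-in, not GrowthBook's actual evaluation logic.
const features = {
  "new-checkout": {
    defaultValue: false,
    rules: [{ condition: { beta: true }, force: true }],
  },
};

// A condition matches when every key equals the corresponding user attribute.
function matches(condition, attributes) {
  return Object.entries(condition).every(([key, expected]) => attributes[key] === expected);
}

function evaluateFlag(key, attributes, fallback = null) {
  const feature = features[key];
  if (!feature) return fallback; // unknown or disabled flag: use the caller's fallback
  const rule = (feature.rules || []).find((r) => matches(r.condition, attributes));
  return rule ? rule.force : feature.defaultValue;
}

console.log(evaluateFlag("new-checkout", { id: "user_123", beta: true })); // true
console.log(evaluateFlag("new-checkout", { id: "user_456", beta: false })); // false
console.log(evaluateFlag("missing-flag", { id: "user_789" }, false)); // false
```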

What Are the Types of Feature Flags?

Not all feature flags serve the same purpose, and they aren’t all meant to live for the same length of time.

We can categorize feature flags across four axes:

  1. Their time span
  2. Their purpose
  3. Their scope
  4. Their value type

Here are the types of feature flags:

| Feature Flag Type | Lifespan | Primary Purpose | Value Type | Scope | Retirement |
| --- | --- | --- | --- | --- | --- |
| Deployment | 2–4 weeks | Testing and debugging code | Boolean | System-level | Yes, after testing code |
| Release | 2–8 weeks | Progressive rollout of new features | Boolean | User-level | Yes, always |
| Experiment | 2–6 weeks | A/B/n testing and variation assignment | String or JSON | User-level | Yes, after shipping the winner |
| Ops / Kill Switch | Long-lived | Emergency control, runtime config | Boolean or number | System or user-level | No, adjust as needed |
| Permission | Permanent | Access control by tier, role or geography | Boolean | User-level | No, part of business logic |

By Time Span

There are two types of feature flags based on when or if you retire them:

  1. Short-lived flags: These toggles exist for days to weeks. You create them for a specific purpose (for instance, to ship a feature or run a test) and then remove them when you’re done. These flags are often the source of technical debt because people tend to forget to clean them up.
  2. Long-lived flags: These flags live in your codebase for months or permanently. They’re part of your app’s ongoing behavior. For example, you might need a kill switch to turn off a part of the app or an entitlement flag to control which users see certain features.

By Purpose

Feature flags can be classified into five types based on usage:

1. Release Flags

You can use release flags to control the rollout of new features during the release process.

if (gb.isOn("new-checkout-flow")) {
  return <NewCheckoutFlow />;
}


return <LegacyCheckout />;

A typical lifecycle looks like this:

  1. Create the flag
  2. Test with internal users
  3. Run a progressive rollout (5%, 25%, 50%, 100%)
  4. Confirm everything is stable
  5. Remove the flag and the old code path entirely

Most release flags should live for 2 to 8 weeks. If yours has been around longer, it’s time to clean up. Platforms like GrowthBook include stale feature flag detection to surface flags that haven't been evaluated recently, so you know which ones are overdue for cleanup.
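
The progressive rollout step in the lifecycle above (5%, 25%, 50%, 100%) is commonly implemented by hashing the user ID into a stable bucket and comparing it to the current percentage. The sketch below shows one way to do that; it is an illustration rather than any SDK's exact algorithm, and the function names and the `flagKey:userId` seed format are assumptions. The hash is FNV-1a computed over UTF-16 code units, which matches byte-wise FNV-1a for ASCII IDs.

```javascript
// 32-bit FNV-1a hash of a string (over UTF-16 code units), as an unsigned integer.
function fnv1a32(input) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a user to a stable bucket in [0, 1) for a given flag.
function bucket(flagKey, userId) {
  return (fnv1a32(`${flagKey}:${userId}`) % 10000) / 10000;
}

// A user is included when their bucket falls below the rollout percentage.
function isInRollout(flagKey, userId, percent) {
  return bucket(flagKey, userId) < percent / 100;
}

// Usage: the same user gets the same answer on every call and every server.
console.log(isInRollout("new-checkout-flow", "user_123", 25));
```

Because each user's bucket never changes, raising the percentage only adds users: anyone included at 5% is still included at 25%, so nobody flips back to the old experience mid-rollout.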

2. Experiment Flags

Use experiment flags to assign users to variations for A/B testing. The main difference from release flags is that consistent assignment matters: the same user should always see the same variation, both for UX consistency and for accurate measurement. Platforms like GrowthBook handle this automatically by hashing the user's ID, so assignment is stable without any extra work on your end.

For A/B testing, it’s also important that users are assigned randomly to the control and variant groups.

const variant = gb.getFeatureValue("checkout-cta-experiment", "control");
// Returns "control", "variation_a", or "variation_b"

switch (variant) {
  case "variation_a":
    return <Button>Complete Purchase</Button>;
  case "variation_b":
    return <Button>Place Order</Button>;
  default: // control
    return <Button>Buy Now</Button>;
}

Typically, you’ll leave these on for as long as your experiments run—usually 2 to 6 weeks. But it depends on your traffic volume and the level of statistical power required. Once you’ve shipped the winning variant, remove the flag.
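
One practical way to check that random assignment is working is a sample ratio mismatch (SRM) test: compare the number of users actually seen in each group against the split you configured. The sketch below uses a chi-square statistic with one degree of freedom for a two-group experiment. The function name is hypothetical, and the default critical value of 10.828 corresponds to p < 0.001, a deliberately strict threshold chosen here as an assumption; your platform may use a different one.

```javascript
// Chi-square check for sample ratio mismatch between two groups (1 degree of freedom).
// critical = 10.828 corresponds to p < 0.001; the threshold is an assumption.
function srmCheck(controlUsers, variantUsers, expectedControlShare = 0.5, critical = 10.828) {
  const total = controlUsers + variantUsers;
  const expectedControl = total * expectedControlShare;
  const expectedVariant = total - expectedControl;
  const chiSquare =
    (controlUsers - expectedControl) ** 2 / expectedControl +
    (variantUsers - expectedVariant) ** 2 / expectedVariant;
  return { chiSquare, mismatch: chiSquare > critical };
}

// A 5,000 vs 5,100 split is within normal variation for a 50/50 test...
console.log(srmCheck(5000, 5100).mismatch); // false

// ...but 5,000 vs 5,500 is not, and usually points to an assignment or tracking bug.
console.log(srmCheck(5000, 5500).mismatch); // true
```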

3. Operational Flags

Operational flags control system behavior and provide emergency shutoffs. Think about cases like circuit breakers, graceful degradation under load, and runtime configuration changes that don’t warrant a full deployment.

if (gb.isOn("enable-recommendation-engine")) {
  recommendations = await fetchRecommendations(userId);
}

const cacheTimeout = gb.getFeatureValue("redis-cache-ttl", 300);
const rateLimit = gb.getFeatureValue("api-rate-limit", 1000);

These flags are mostly permanent in nature. You can trigger them manually during incidents or automatically through monitoring or feature flagging systems. A kill switch is an excellent example of such a flag.

4. Permission or Entitlement Flags

Permission flags control feature access based on subscription tier, user role, geography, or account status. These depend on the business logic and aren’t necessarily used for development or testing purposes.

// Flag targeting evaluates user's plan attribute
if (gb.isOn("advanced-analytics")) {
  return <AdvancedAnalyticsDashboard />;
}

return <BasicAnalyticsDashboard />;

A permission flag evaluates the user’s attributes (like their plan tier) to determine access, but your application database remains the source of truth for those attributes. The flag doesn’t store data; it only evaluates conditions.

A warehouse-native platform like GrowthBook can evaluate these attributes directly from your existing data without requiring data duplication or schema changes. Your warehouse is already the source of truth for plan tiers and user roles so you don’t have to bring them into another platform again.

Tip: Always evaluate entitlement flags server-side using verified data. If you do this client-side, the user can inspect your flag configuration in the browser’s dev tools and potentially change it to bypass controls.

5. Development Flags

Development flags are usually used to turn a feature on or off to test and debug code. These are short-lived, and you should turn them off after completing the QA or testing process.

By Scope

Depending on their scope, flags fall into two types:

  1. System-level flags: These flags affect your entire application uniformly. A kill switch that disables a service for all users, or a config flag that changes your cache TTL globally. These don’t care who the user is—they’re binary for the whole system.
  2. User-level flags: These flags evaluate differently per user based on attributes such as user ID, plan tier, geography, device type, or behavioral signals. Because they depend on user attributes, they’re used for targeting, rollouts, and experimentation. For example, in a 10% rollout, you hash user IDs to consistently assign each person to a cohort.

By Value Type

Value types describe what a flag returns. Most flags start as simple booleans, but as your use cases mature, you'll reach for more advanced types, such as:

  1. Boolean flags: These flags return true or false. This is the default for most feature flags, such as on/off toggles and kill switches. If you’re wrapping a feature in a flag for the first time, this is where you start.
  2. String flags: These flags return a text value. Use these when you need to serve different variations of content like button text in an A/B test ("Buy Now" vs. "Add to Cart"), or a theme identifier ("dark" vs. "light").
  3. Number flags: These flags return a numeric value. They’re useful for tuning runtime parameters such as cache TTLs, rate limits, pagination sizes, and retry counts without redeploying.
  4. JSON flags: These flags return structured data. A single JSON flag can control an entire component's behavior. For instance, returning { "layout": "grid", "rows": 10, "showFilters": true } to configure a UI layout without deploying new code. They’re also useful for complex experiment variations where each variant needs multiple parameters, or for configuration bundles that you want to manage as a single unit.
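Because a JSON flag can carry several fields at once, it helps to validate the value before using it. The sketch below is one hedged way to do that: it merges a raw flag value over safe defaults and ignores fields with unexpected types. The helper name `resolveLayoutConfig`, the allowed layouts, and the default values are all assumptions made for illustration.

```javascript
// Assumed defaults, used when the flag is missing or a field is invalid.
const DEFAULT_LAYOUT = { layout: "list", rows: 20, showFilters: false };

// Hypothetical helper: merges a raw JSON flag value over the defaults.
function resolveLayoutConfig(flagValue) {
  const value = flagValue && typeof flagValue === "object" ? flagValue : {};
  return {
    layout: ["grid", "list"].includes(value.layout) ? value.layout : DEFAULT_LAYOUT.layout,
    rows: Number.isInteger(value.rows) && value.rows > 0 ? value.rows : DEFAULT_LAYOUT.rows,
    showFilters:
      typeof value.showFilters === "boolean" ? value.showFilters : DEFAULT_LAYOUT.showFilters,
  };
}

// A well-formed value passes through unchanged.
console.log(resolveLayoutConfig({ layout: "grid", rows: 10, showFilters: true }));
// A malformed value falls back to the defaults field by field.
console.log(resolveLayoutConfig({ layout: "carousel", rows: -1 }));
```

Validating per field means a typo in one property of the flag value degrades only that property instead of breaking the whole component.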

Who Uses Feature Flags and For What Purpose?

Even though feature flags started out as a developer practice, they’re no longer limited to the engineering team. If you implement them, even non-technical users can work with them.

Here’s how that works:

Technical Teams

  1. Developers and engineers: Development teams implement feature flags in code, manage rollouts, and use kill switches during incidents. The goal is to deploy with confidence by making every release reversible. They’re also responsible for maintaining and cleaning up unused or old flags.
  2. QA and testing: These teams use flags to validate features in production with real data, traffic patterns, and third-party integrations. Since staging environments can never fully replicate those conditions, feature flags allow them to get a sense of what will actually happen when the feature is live.
  3. DevOps and site reliability engineering (SRE): These teams rely on operational flags for circuit breakers, infrastructure migrations, and system configuration changes. For instance, if a service degrades, they can disable non-critical features to preserve core functionality.
  4. Data analysts: Analysts use flags to launch experiments, create targeting rules (who will be part of the experiment), and randomly assign users to a variation. When a feature is launched as an experiment, analysts get clean, randomized data in their warehouse, such as assignment records alongside behavioral events, without having to instrument each experiment individually.
  5. Security and compliance: These teams audit flag changes to maintain a record of who released what, to whom, and when. Approval workflows and audit logs matter most to them because they provide that record. Also, if new regulatory requirements take effect, they can disable non-compliant features immediately.

Business and GTM Teams

  1. Product managers: Product teams use feature flags to control release timing and manage beta programs. Feature flags give product managers autonomy to ship code when the business is ready, not just when the code is. They also help data science teams with experimentation—for example, when they need to test how features perform with different audiences.
  2. Marketing: Typically, product marketing teams use flags to time feature releases to campaigns or run promotional experiments. Personalization is another key use case where they offer curated experiences based on audience (user attributes).

Note: While feature flags help non-technical users time feature releases, it doesn’t mean every feature flagging platform is intuitive enough to use. Consider using a platform like GrowthBook that lets non-technical team members create and manage feature flags without writing code or filing engineering tickets.

What Are the Benefits of Using Feature Flags?

Most product and engineering teams adopt feature flags to address a specific problem. Usually, it’s a painful deployment that went sideways. But there are several benefits of using feature flags, including:

Decouple Deployment from Release

Feature flags break the assumption that deploying code means releasing a feature. Your main branch can contain unreleased features safely wrapped in flags. And engineers can merge continuously without worrying about exposing incomplete work.

In short, your engineering team can deploy 10 times a day while releasing features weekly, or at whatever cadence the business needs. This is particularly beneficial when different teams contribute to a feature. For example, if the back-end team delivers new functionality ahead of the front-end team, they can check that code in behind a feature flag instead of keeping it in a branch.

Enable Instant Rollbacks

When things go wrong (and they will, usually at the worst possible time), you can disable a feature immediately. You don’t have to deploy new code or revert commits. As a result, you also recover from incidents much faster. Without a feature flag, engineering teams are often forced to create a new build that removes buggy code while keeping stable features that were included in the previous build. This can be a painful, time-consuming process, especially if the bug is hurting the live customer experience.

In fact, the State of DevOps 2024 report found that only 19% of engineering organizations recover from failed deployments in less than an hour. These “elite” teams tend to focus on continuous delivery practices, which are usually enabled by feature flags.

Reduce Risk with Progressive Rollouts

Instead of releasing to everyone at once, you can start small. First, roll out to 5% of users and monitor key metrics like error rates and performance. If everything looks good, gradually increase to 25%, then 50%, then 100%. If issues arise, the blast radius is limited to a small subset of users.

Test in Production Safely

Staging environments never perfectly mirror production. They lack real user behavior, real data volumes, and real traffic patterns, so results from staging alone are a weak basis for decisions.

For instance, if you’re testing a new payment processor integration, the staging environment can’t replicate the complexity of real payment flows or peak traffic loads. But if you use feature flags, you can test it with real user transactions, which gives you concrete data on what’s working (and not).

Increase Team Velocity

When you start using feature flags, velocity is a second-order benefit that you’ll experience eventually. Nobody’s waiting on shared release windows anymore, so they deploy code when it makes sense for them. So, teams ship faster and with more confidence in the long run.

In fact, according to research from DORA, higher deployment frequency correlates with higher software quality and stability. Feature flags are one of the practices that make that kind of continuous delivery practical.

Enable Trunk-Based Development

Long-lived feature branches are a tax on your engineering team. They diverge from the main branch and accumulate merge conflicts over time, which causes more issues the longer they live.

That’s why sophisticated engineering teams have started adopting trunk-based development. 

In this method, they merge incomplete code to main behind a flag where it’s deployed but never executed. So you get the benefit of continuous integration without the risk of shipping unfinished features to users.

Build a Foundation for Experimentation

Once you can control who sees what, the next question is: which version is actually better?

Feature flags give you the ability to test that. While they act as the delivery mechanism, experiments act as the measurement layer.

const variant = gb.getFeatureValue("checkout-cta-text", "Buy Now");
// Returns: "Buy Now", "Purchase", or "Add to Cart"
// Assignment is stable per user for valid experiment results

Together, they move your team from “We shipped it and hope it works” to “We shipped it, measured it, and know it works.”

What Are the Use Cases of Feature Flags?

Here are the most common use cases of feature flags for product and engineering teams:

Release Management

You wrap a new feature in a flag, deploy it to production in an off-state, and progressively roll it out. You can use it for:

  • Internal dogfooding with your team
  • Beta access for a select group of users
  • 5% canary release to catch issues early
  • Gradual ramp to 25%, 50%, 100%

At each stage, you monitor metrics and can halt or roll back if problems appear. This transforms launches from high-stakes events into controlled, iterative processes.

Kill Switches and Operational Control

Sometimes the most important thing a feature flag does is turn something off. It’s usually used when incidents happen, and you need to respond quickly. This drastically reduces your mean time to recovery (MTTR).

Infrastructure Migrations

Big-bang deployments are becoming a thing of the past. You don’t need a whole ceremony to move from one database to another.

Let’s say you’re migrating from PostgreSQL to CockroachDB. All you have to do is route 1% of read queries to the new database and monitor its performance. If everything looks good, ramp up to 10% and so on and so forth until it’s complete.
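As a rough sketch of that routing, the example below sends a configurable share of reads to the new database and falls back to the old one on failure. Everything here is hypothetical: `createReadRouter`, the `read` method on the database clients, and `getRolloutPercent` (which in practice would read a number flag) are stand-ins, not part of any specific SDK or driver.

```javascript
// Hypothetical read router: sends a share of reads to the new database.
// getRolloutPercent would typically read a number flag; random is injectable for tests.
function createReadRouter({ legacyDb, newDb, getRolloutPercent, random = Math.random }) {
  return async function read(query) {
    const useNewDb = random() * 100 < getRolloutPercent();
    if (!useNewDb) return legacyDb.read(query);
    try {
      return await newDb.read(query);
    } catch (err) {
      // Fall back so a failure in the new database is not user-facing.
      return legacyDb.read(query);
    }
  };
}

// Usage with in-memory stand-ins for the two databases.
const legacyDb = { read: async (q) => `legacy:${q}` };
const newDb = { read: async (q) => `new:${q}` };
const read = createReadRouter({ legacyDb, newDb, getRolloutPercent: () => 1 });
read("SELECT 1").then((row) => console.log(row));
```

Per-request randomness is a reasonable choice for reads because no user needs to stay on one database. For writes, or for anything user-visible, sticky assignment is usually the safer design.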

A/B Testing and Experimentation

Feature flags are the natural foundation for experimentation. Once you can consistently assign users to different feature variations, you can measure which version performs better with statistical rigor.

This is becoming especially relevant for teams building AI and GenAI features. When your recommendation engine uses a large language model (LLM) or your search results rely on an embedding model, you can’t just eyeball whether the new version is better.

You need controlled experiments with guardrail metrics and feature flags that provide the infrastructure to run them in production safely.

Personalization and Targeting

Feature flags let you deliver different experiences based on user attributes, geography, device type, or behavioral signals. You don’t need to maintain separate codebases for each attribute because the targeting rules handle the variation.

Target users based on specific attributes in GrowthBook

Entitlements and Access Control

If you run a multi-tier SaaS product, feature flags can manage which plans can access certain features. For example, you can automatically offer a premium integration for Enterprise users when they upgrade.

Also, if you need control over your data, use a feature flag platform that can be self-hosted or run in an air-gapped environment. That way, your flag evaluation data never leaves your network, which makes it easier to meet requirements like HIPAA, GDPR, and SOC 2.

Refactoring Code

Feature flags reduce the risk of large-scale refactors by letting you run old and new implementations side by side. Route 5% of traffic to the refactored code path, compare outputs and performance against the original, and gradually shift over once you’re confident.

This is especially useful during monolith-to-microservices migrations, where you can flag-control which service handles each request and roll back individual routes without reverting the entire migration.
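The sketch below shows a shadow-mode variant of this idea, which is a slightly different technique from routing live traffic: users always get the legacy result, while a sampled share of calls also runs the refactored implementation so the outputs can be compared. The wrapper name, its options, and the `JSON.stringify` comparison (which is sensitive to key order) are simplifying assumptions.

```javascript
// Hypothetical wrapper: always returns the legacy result, and for a sampled
// share of calls also runs the refactored implementation and reports mismatches.
function withShadowComparison({ legacyFn, refactoredFn, samplePercent, onMismatch, random = Math.random }) {
  return function (...args) {
    const legacyResult = legacyFn(...args);
    if (random() * 100 < samplePercent) {
      try {
        const newResult = refactoredFn(...args);
        if (JSON.stringify(newResult) !== JSON.stringify(legacyResult)) {
          onMismatch({ args, legacyResult, newResult });
        }
      } catch (error) {
        // An exception in the new path is recorded, never surfaced to the caller.
        onMismatch({ args, legacyResult, error });
      }
    }
    return legacyResult;
  };
}

// Usage: compare a refactored price calculation on 5% of calls.
const priceWithTax = withShadowComparison({
  legacyFn: (cents) => Math.round(cents * 1.2),
  refactoredFn: (cents) => Math.round((cents * 120) / 100),
  samplePercent: 5,
  onMismatch: (details) => console.warn("shadow mismatch", details),
});
console.log(priceWithTax(1000)); // 1200
```

Once the mismatch log stays empty for long enough, shifting real traffic to the refactored path carries much less risk.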

Compliance and Regulatory Control

Regulatory requirements change, sometimes quickly. Feature flags let you respond without waiting for a development cycle.

When a new data protection rule takes effect, you can disable a non-compliant feature across affected jurisdictions immediately. When your compliance team needs a four-eyes approval process for production changes, approval workflows on flag modifications implement that principle directly.

What Advanced Feature Flagging Strategies Can You Use?

Once you’re comfortable with basic on/off flags, you’ll quickly run into situations where a simple toggle isn’t enough. You need to roll out to a specific percentage of users. Or target enterprise accounts in a particular region. 

These strategies build on each other. Here are a few examples:

Percentage Rollouts with Persistent Assignment

Percentage rollouts let you gradually release a feature to a random sample of users (5%, then 25%, then 50%) while monitoring for issues at each stage. The critical detail is persistent assignment.

When a user lands in the 10% cohort, they need to stay there as you ramp up to 50% and eventually 100%. Most platforms handle this by hashing the user’s ID against the flag key, which produces a consistent, deterministic assignment without storing state.
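To make the hashing idea concrete, here is a minimal sketch of deterministic bucketing. It uses a 32-bit FNV-1a hash as an illustrative choice; the function names, the `flagKey:userId` seed format, and the 10,000-bucket resolution are assumptions, and a real SDK's exact algorithm may differ.

```javascript
// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a32(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash;
}

// Maps a user to a stable bucket in [0, 1) for a given flag (assumed seed format).
function bucketFor(userId, flagKey) {
  return (fnv1a32(`${flagKey}:${userId}`) % 10000) / 10000;
}

// A user is in the rollout when their bucket falls below the current percentage.
function isInRollout(userId, flagKey, percent) {
  return bucketFor(userId, flagKey) < percent / 100;
}

// The same inputs always give the same answer, with no stored state.
console.log(isInRollout("user-42", "new-checkout-flow", 10) ===
  isInRollout("user-42", "new-checkout-flow", 10)); // true
```

Because each user's bucket is fixed, raising the percentage only adds users. Anyone included at 10% is still included at 50% and 100%. Including the flag key in the hash input means a user's bucket differs from flag to flag, so the same people are not always first in line.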

Release features using percentage rollouts in GrowthBook

Use percentage rollouts when you’re releasing a new feature and want to limit your blast radius. If something breaks at 5%, you’ve affected 5% of users.

Force Rules and Complex Targeting

Force rules let you target specific user segments based on combinations of attributes, such as geography, device type, account age, company name, subscription tier, or any custom property you pass to your SDK.

Target specific user segments using Force Rules in GrowthBook

For example, you might want to enable a feature only for enterprise accounts in Australia that are more than three months old. Or certain tax rules might apply only in a few countries.

Safe Rollouts with Guardrail Metrics

A safe rollout combines a percentage rollout with automatic metric monitoring. You define guardrail metrics like page load time, click rate, conversion rate, error rate, revenue per user, or whatever matters for this feature, and the system watches them as you ramp up.

If guardrails breach your thresholds, the rollout automatically reverses. The feature goes back to 0% while you investigate.
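The decision logic can be sketched as follows. This is a simplified illustration that uses fixed thresholds; the metric names, limits, and the `checkGuardrails` helper are hypothetical, and production systems typically compare the rollout group against a control with statistical tests rather than raw cutoffs.

```javascript
// Hypothetical guardrails: `max` means higher is worse, `min` means lower is worse.
const guardrails = [
  { metric: "errorRate", max: 0.01 },
  { metric: "p95LatencyMs", max: 800 },
  { metric: "conversionRate", min: 0.03 },
];

// Returns the rollout percentage to apply next, plus any breached metrics.
function checkGuardrails(currentPercent, metrics, rules) {
  const breaches = rules
    .filter(({ metric, min, max }) => {
      const value = metrics[metric];
      if (typeof value !== "number") return false; // no data yet, not a breach
      return (max !== undefined && value > max) || (min !== undefined && value < min);
    })
    .map((rule) => rule.metric);
  // Any breach sends the rollout back to 0%; otherwise it holds steady.
  return { percent: breaches.length > 0 ? 0 : currentPercent, breaches };
}

const healthy = { errorRate: 0.002, p95LatencyMs: 640, conversionRate: 0.034 };
const degraded = { errorRate: 0.04, p95LatencyMs: 640, conversionRate: 0.034 };
console.log(checkGuardrails(25, healthy, guardrails).percent); // 25
console.log(checkGuardrails(25, degraded, guardrails).percent); // 0
```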

Monitor metrics in real time when using percentage rollouts in GrowthBook

Multi-Environment Flag Management

Your new checkout feature might need to be:

  • Always on in development (so your team can build against it)
  • 50% rollout in staging (to test the progressive rollout logic itself)
  • Off in production (not ready for customers yet)

This is where the relationship between projects, environments, and SDK connections matters. In GrowthBook, projects are the top-level organizational units (e.g., your mobile app vs. your web app). 

Within each project, you have environments (production, staging, development). Each flag can have different values and rules per environment, and the SDK connection determines which flags your application actually receives.

When Should a Company Adopt Feature Flags?

The short answer is earlier than you think. Most teams wait until they’ve been burned by a botched deployment or a release that broke critical functionality. By then, adoption is reactive, and you’re retrofitting flags into a codebase that’s already complex.

If you’re seeing these signals, it’s definitely time to adopt feature flags:

  • Every deployment feels high-stakes: Everyone’s on Slack watching dashboards, ready to hit rollback. If deploying makes your team nervous, you have a release process problem that flags can solve.
  • Rollbacks take hours to complete: If your recovery time is measured in hours, a single toggle would have saved you time.
  • Multiple teams are blocked on release windows: “We can’t ship until Backend deploys” shouldn’t be a weekly conversation. Using feature flags helps you decouple these dependencies.
  • You can’t test with real production traffic: If you don’t have a way to expose features to real users in a controlled way before launching them, you’re guessing.
  • Product decisions are based on opinions: You want to run A/B tests but lack the infrastructure to do so. In these cases, feature flags act as the delivery mechanism to make experimentation possible.
  • You’re growing the engineering team: As your team grows, so does its deployment complexity. It’s easier to coordinate releases with two engineers, but when you add more to the mix, the room for errors increases.
  • You deploy more than once a week (or want to): High-frequency deployment without feature flags is high-frequency risk. Flags make it safe.

If two or more of these signals apply to your team, feature flags will likely improve your workflow right away.

When Should You Not Use Feature Flags?

Knowing when not to use a flag is just as important as knowing when to use it. Here are a few reasons why you shouldn’t:

  • Don’t use flags for static configuration: If changing the value requires a full restart, it belongs in your config, not your flag system. Feature flags are for runtime decisions, so mixing the two adds unnecessary complexity.
  • Don’t use flags for secrets or sensitive data: You should never pass personally identifiable information (PII), API keys, or tokens through your feature flag system. This is especially critical for client-side applications because your configurations and targeting rules can be sent to the user’s browser, where anyone can inspect them. If you need to target based on sensitive attributes like email addresses, evaluate the flag server-side using verified data, or use hashed and anonymized attributes for client-side evaluation.
  • Don’t use flags for core business logic: If your subscription tier logic or pricing rules permanently live inside a feature flag, your core business functionality now depends on the availability of an external flag service. Once an experiment or rollout is complete, migrate the winning variant into your application code or a dedicated entitlement service.
  • Be cautious if your app traffic is low: Feature flags for simple on/off releases work at any scale. But if you’re planning to run A/B tests and your app has 100 users a month, you won’t reach statistical significance in any reasonable timeframe. The flag infrastructure still has value for release management and kill switches—just don’t expect experimentation to pay off until your traffic can support it.
  • Don’t adopt flags without clear processes: Put naming conventions, ownership docs, governance controls, and a cleanup routine in place before you roll flags out widely. Without them, flags pile up as technical debt.
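The low-traffic caution above can be made concrete with a rough sample-size estimate. The sketch below uses a common rule of thumb for comparing two conversion rates (about 80% power at a 5% two-sided significance level): n ≈ 16 · p(1 − p) / δ² users per group, where p is the baseline rate and δ is the absolute difference you want to detect. The function name is hypothetical, and the result is an approximation rather than a substitute for a proper power analysis.

```javascript
// Rule-of-thumb sample size per group for detecting a relative lift
// on a baseline conversion rate (approx. 80% power, 5% two-sided alpha).
function approxSampleSizePerGroup(baselineRate, relativeLift) {
  const delta = baselineRate * relativeLift; // absolute difference to detect
  return Math.ceil((16 * baselineRate * (1 - baselineRate)) / delta ** 2);
}

// Example: 5% baseline conversion, aiming to detect a 10% relative lift.
const perGroup = approxSampleSizePerGroup(0.05, 0.1);
console.log(`Roughly ${perGroup} users per group`);
```

With these example inputs the estimate comes out to roughly 30,400 users per group, which an app with 100 users a month would take decades to collect.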

What Are The Best Practices for Using Feature Flags?

To avoid spending months cleaning up avoidable issues, follow these best practices:

Use Clear, Descriptive Names

Six months from now, nobody will remember what ff-123 or test-flag means. So, choose a clear naming convention and stick to it. For example, {feature-name}-{type} works well (checkout-redesign-release, cta-color-experiment).

// ❌ Unclear - what does this control?
gb.isOn("ff-123")
gb.isOn("test")
gb.isOn("experiment_2")
gb.isOn("new-thing")

// ✅ Self-documenting
gb.isOn("new-checkout-flow")
gb.isOn("holiday-2024-promo-banner")
gb.isOn("pricing-page-v2-experiment")
gb.isOn("premium-analytics-entitlement")

Note: Platforms like GrowthBook let you enforce naming patterns with regex validation to prevent duplication and enforce governance.
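If you want to check names yourself, for example in a CI step or a code review script, a regular expression is enough. The sketch below assumes the {feature-name}-{type} convention described above; the pattern and the list of allowed type suffixes are illustrative choices, not a standard.

```javascript
// Assumed convention: lowercase kebab-case feature name ending in a flag type.
const FLAG_NAME_PATTERN =
  /^[a-z0-9]+(?:-[a-z0-9]+)*-(release|experiment|ops|entitlement)$/;

// Hypothetical helper: true when the name follows the convention.
function isValidFlagName(name) {
  return FLAG_NAME_PATTERN.test(name);
}

console.log(isValidFlagName("checkout-redesign-release")); // true
console.log(isValidFlagName("cta-color-experiment"));      // true
console.log(isValidFlagName("ff-123"));                    // false
console.log(isValidFlagName("test"));                      // false
```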

Clean Up Old Flags Ruthlessly

Every flag in your codebase adds a conditional branch, and the combinations multiply. For instance, 10 independent boolean flags allow 2^10 = 1,024 possible combinations of states, and 20 flags allow over a million. You can’t test them all, which creates blind spots, so after you roll out a feature, do the following:

  1. Remove the flag check from your code
  2. Remove the old code path entirely
  3. Delete the flag from your management platform
  4. Document why it was removed

Make cleanup a team ritual by scheduling monthly or quarterly sessions to reduce technical debt. If you’re using a platform like GrowthBook, it’ll automatically detect stale flags and show you where these flags live in your codebase.

Remove old flags with automatic staleness detection in GrowthBook

Set Expiration Dates on Temporary Flags

It’s easy for seemingly temporary flags to become permanent. If there’s no clear deadline to clean it up, it’ll continue to sit in your codebase unnoticed—while its cleanup gets deprioritized sprint after sprint.

That’s why we recommend setting a calendar reminder for 30 or 60 days whenever you create a new flag. Better yet, create a Jira ticket or GitHub issue linked to the flag due two weeks after the target completion date.

Note: GrowthBook also supports flag scheduling, so you can set flags to automatically enable or disable at a specific date and time. This is useful for both feature launches and scheduled cleanup. If you prefer creating a Jira ticket, our Jira integration lets you link flags directly to tickets, so you can track these cleanup tasks within your existing workflow.

Start Small With Your Rollouts

It’s easy to skip steps when you’re confident about a feature. Resist the urge and default to a progressive delivery method.

Start with your internal team, then 1% of the traffic, then 10%, and so on. Continuously monitor changes or unusual behavior at each step—and only remove the flag when you confirm stability.

Monitor Business Metrics Too

A feature can do everything right technically, with zero errors and sub-100ms response times, and still tank your conversion rate.

When you set up monitoring for a rollout, watch both layers:

  • Technical guardrails: Error rate, response time (p95 and p99), resource usage, API failures
  • Business guardrails: Conversion rate, revenue per user, support ticket volume

If a new feature is technically flawless but users keep raising tickets right after launch, something’s wrong. You’ll have to look under the hood to understand what happened.

Document Flag Purpose and Ownership

You don’t want to be rummaging through hundreds of Slack threads or Jira tickets to find out what a flag does. At a minimum, every flag should have:

  • What it controls (one sentence)
  • Who owns it (team or individual)
  • Expected cleanup date
  • What metrics indicate a problem
  • Rollback procedure (usually “set to 0%” or “disable”)

Template:

Flag: new-checkout-flow

Purpose: Progressive rollout of redesigned checkout experience

Owner: @growth-team (Primary: @jane)

Created: 2026-01-15

Expected cleanup: 2026-03-01

Rollback procedure: Set to 0% immediately if conversion drops >5%

Success metrics:

  • Checkout completion rate improves by 3%+
  • P95 checkout latency stays under 2s
  • Support tickets don’t increase

Current status: 25% rollout, monitoring for 1 week before increasing

Use Role-Based Access Control

Role-based access control (RBAC) allows you to control which user can access specific flags. Use RBAC to define roles that map to your risk model, including who can:

  • Create flags
  • Modify targeting rules
  • Approve changes to production
  • Publish

When you combine RBAC with four-eyes approval workflows and audit logs, you’ll have the core controls most compliance reviews look for.

Understand How Feature Flags Affect Performance

Feature flags add an evaluation step to every request, so you need to know where that evaluation happens and what it costs. Most modern SDKs, including GrowthBook’s, evaluate flags locally against a cached copy of the rules, so a flag check doesn’t require a network call.

On client-side implementations, the SDK initializes asynchronously, which means users may briefly see the default experience before flags are evaluated, an effect known as “flicker.” You can mitigate this through server-side rendering and anti-flicker support.

Similarly, if you have hundreds of flags with complex targeting rules, it can bloat the initial SDK payload. Within GrowthBook, you can use project-scoping so each SDK connection receives only the relevant flags, and use Saved Groups to reference large ID lists rather than inlining them.

What Mistakes To Avoid While Using Feature Flags?

Here are the most common ways teams shoot themselves in the foot (and how to avoid it):

Reusing Flag Names

In 2012, a deployment error at Knight Capital nearly destroyed the company. New code repurposed a flag that had previously activated a long-retired feature, and one server that never received the update still contained the old code. When the flag was switched on, that server began executing trades using the obsolete logic. The mistake cost the company about $440 million in roughly 45 minutes, forced it to accept an emergency rescue investment within days, and led to its acquisition the following year.

It was one of the biggest coding errors we’ve ever seen. That’s why we recommend creating new flags for every feature you roll out. It takes 30 seconds, and you avoid the risk of activating code paths you or your team has forgotten about.

Using Client-Side Flags for Security

Feature flags control what to show. They don’t control who has permission. This distinction matters for client-side applications where flag values are visible in browser dev tools.

// ❌ WRONG: Anyone can enable this in browser dev tools
if (gb.isOn("admin-panel")) {
  showAdminPanel();
}

// ✅ RIGHT: Verify permissions server-side
const isAdmin = await checkAdminPermissions(user.id);
if (isAdmin) {
  showAdminPanel();
}

If you’re doing anything involving money, data access, permission, or privileged APIs, you’re better off using server-side flags to do it.

Ignoring Rollback Procedures

Typically, rollbacks seem simple. You flip the flag back to off, and the problem is solved. But sometimes it’s not that simple. In 2020, Slack experienced an outage because a feature flag rollout triggered a performance bug. Even though the team rolled back the feature in 3 minutes, it left a stale HAProxy state that led to a six-hour outage.

Before you launch a flag, you should know:

  • What metrics indicate a problem
  • Who has permission to roll back
  • What the downstream effects of rollback might be
  • Whether the rollback itself has been tested

Not Testing Both Flag States

Your Continuous Integration and Continuous Delivery (CI/CD) pipeline probably tests your application with your current production flag configuration. But does it test with the new flag turned on? Does it test with the new flag turned off again (the rollback scenario)?

If you only test one state, you’re assuming the other works. So, test three configurations:

  • Current production state
  • Intended release state
  • Rollback state

If you can’t test all three in Continuous Integration (CI), at least smoke test the rollback in staging before you push the flag live.
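A table-driven test is one simple way to cover all three configurations. The sketch below is framework-free and uses a hypothetical `chooseCheckout` function as the code under test; in a real suite you would plug the same scenarios into your test runner and your own flag-dependent code.

```javascript
// Hypothetical function under test: chooses a checkout flow from flag values.
function chooseCheckout(flags) {
  return flags["new-checkout-flow"] ? "NewCheckoutFlow" : "LegacyCheckout";
}

// The three configurations to cover, with the behavior expected in each.
const scenarios = [
  { name: "current production", flags: { "new-checkout-flow": false }, expected: "LegacyCheckout" },
  { name: "intended release", flags: { "new-checkout-flow": true }, expected: "NewCheckoutFlow" },
  { name: "rollback", flags: { "new-checkout-flow": false }, expected: "LegacyCheckout" },
];

// Runs the function under each flag configuration and reports pass or fail.
function runScenarios(fn, cases) {
  return cases.map(({ name, flags, expected }) => ({ name, passed: fn(flags) === expected }));
}

const results = runScenarios(chooseCheckout, scenarios);
results.forEach(({ name, passed }) => console.log(`${passed ? "PASS" : "FAIL"}: ${name}`));
```

In this toy example the rollback case has the same flag values as current production. In practice, run it after the release case against the same data, so you verify that the old code path can still handle anything the new path wrote while the flag was on.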

How to Use Feature Flags for Experimentation

Most teams start with feature flags for release safety. But once you can control who sees what, a natural question follows: which version is actually better?

Without experimentation, you’re essentially shipping features based on intuition. For instance, you might think a signup form could be cleaner with fewer fields. But only a real test can tell you if there’s an uptick or fall in conversions. Feature flags give you the ability to run these tests easily.

How It Works

In GrowthBook, an experiment is a rule you add to an existing feature flag. You don’t need to migrate SDKs or add new code.

const variant = gb.getFeatureValue("checkout-optimization", "control");

switch (variant) {
  case "streamlined":
    return <StreamlinedCheckout />;
  case "express":
    return <ExpressCheckout />;
  default: // control
    return <StandardCheckout />;
}

Users are randomly assigned to a variation and their assignment is stable—they always see the same version. GrowthBook tracks which variation each user saw, then joins that with your existing analytics events (purchases, signups, clicks) to calculate which version performed best.

Run experiments using feature flags as the mechanism

The progression from flag to experiment typically looks like this:

  1. Simple toggle: Feature is on or off for everyone
  2. Percentage rollout: Feature reaches a growing slice of users
  3. Safe rollout: Percentage rollout with guardrail monitoring and auto-rollback
  4. Full A/B test: Controlled experiment with statistical analysis and winner selection

By the time you reach step 4, you already know the feature doesn’t break anything. Now you’re asking a different question: does it actually improve anything?

GrowthBook’s Warehouse-Native Approach

Most experimentation platforms require you to export data to their system, send tracking events to their infrastructure, or download results and crunch them in spreadsheets. All of these create data silos and increase costs.

That’s why GrowthBook connects directly to your existing data warehouse. You can integrate with platforms like Snowflake, BigQuery, Redshift, or Databricks and run the analysis there. Your data never leaves your infrastructure, which simplifies SOC 2, GDPR, and HIPAA compliance significantly. 

And because it has access to your full warehouse, you can segment experiment results by any dimension you already track, such as LTV cohort, acquisition channel, or device type.

Should You Build or Buy a Feature Flagging Tool?

The answer depends on your organization’s size and needs. Here’s an easy framework to help you decide:

| Capability | Build Your Own | Use a Platform |
| --- | --- | --- |
| Setup time | Hours | Minutes |
| Basic on/off flags | ✅ Easy | ✅ Easy |
| Percentage rollouts | ⚠️ Custom code | ✅ Built-in |
| User targeting | ⚠️ Custom code | ✅ Built-in |
| A/B testing | ❌ Requires analytics integration | ✅ Built-in |
| Non-engineer access | ❌ Not without building UI | ✅ Web dashboard |
| Audit logs | ❌ Custom implementation | ✅ Built-in |
| Multi-environment | ⚠️ Manual management | ✅ Built-in |
| SDKs for multiple languages | ❌ You build them | ✅ Provided |
| Ongoing maintenance | ⏰ Significant | ⏰ Minimal |
| Monthly cost | $0 (but engineering time) | $50-500+ (depends on scale) |
| Time to advanced features | Months of development | Immediate |

Legend: ✅ Fully supported | ⚠️ Possible with effort | ❌ Not practical

When To Build Your Own Feature Flag Tool

Building makes sense when your needs are genuinely simple:

  • You need fewer than 10–20 simple on/off flags.
  • You have strict compliance requirements preventing any third-party services.
  • You have dedicated engineering time for ongoing maintenance.
  • You only need basic on/off functionality without targeting or experimentation.
  • You want full control and have the resources to maintain it.

A config file or a database table can work fine at this scale. But in our experience, before you know it, you’ll be building dashboards and complex functionality just to maintain the flags.
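For a sense of what "simple" means here, this is roughly the whole thing. The sketch below assumes a hypothetical SimpleFlags class backed by a plain object, standing in for a config file or a database table.

```typescript
// Hypothetical in-house flag store backed by a plain object
// (standing in for a config file or a database table).
type FlagTable = Record<string, boolean>;

class SimpleFlags {
  private readonly table: FlagTable;

  constructor(table: FlagTable) {
    this.table = table;
  }

  // Unknown flags are treated as off, so a typo fails closed.
  isOn(key: string): boolean {
    return this.table[key] === true;
  }
}

const flags = new SimpleFlags({ "new-onboarding": false, "dark-mode": true });

console.log(flags.isOn("dark-mode"));      // true
console.log(flags.isOn("new-onboarding")); // false
console.log(flags.isOn("misspelled-key")); // false
```

Everything in the comparison table beyond on/off (targeting, percentages, audit logs, a UI) is code you would add on top of this.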

When To Use a Feature Flagging Platform

A platform pays for itself quickly once any of these apply:

  • You need targeting beyond simple on/off (user segments, percentages, complex conditions).
  • You want experimentation and A/B testing capabilities.
  • You’d rather spend engineering time on your product than on internal infrastructure.
  • Non-engineers (PMs, marketing, data analysts) need to manage flags.
  • You require audit logs, role-based access, or compliance features.
  • You want debugging tools, flag lifecycle management, or third-party integrations.
  • You plan to scale flag usage across multiple teams and services.

Note: If you need to run experiments, it’s always better to go with a feature flagging platform. It’ll give you full control over what’s being tested, and you can be sure of its statistical rigor. For instance, GrowthBook includes a suite of developer tools for testing and debugging feature flags. The DevTools Chrome Extension lets you inspect flag evaluations and simulate different user attributes directly in your browser.

How To Create Feature Flags in GrowthBook

GrowthBook supports 24+ languages and frameworks. But here’s how to implement your first feature flag in under 10 minutes using React:

1. Get Your SDK Client Key

Go to SDK Configuration in GrowthBook, create a new SDK Connection, and copy the Client Key (it starts with sdk-).

2. Install the SDK

npm install @growthbook/growthbook-react

3. Wrap Your App With GrowthBook Provider

import { GrowthBook, GrowthBookProvider } from "@growthbook/growthbook-react";
import { thirdPartyTrackingPlugin, autoAttributesPlugin } from "@growthbook/growthbook/plugins";

// Create a GrowthBook instance
const gb = new GrowthBook({
  clientKey: "sdk-abc123", // Your SDK client key
  enableDevMode: true, // Shows helpful debug info in development
  plugins: [
    thirdPartyTrackingPlugin(), // Optional, sends "Experiment Viewed" events via GrowthBook Managed Warehouse, Google Analytics, Google Tag Manager, and Segment.
    autoAttributesPlugin(), // Optional, sets common attributes (browser, session_id, etc.)
  ],
});

// Load feature definitions from the GrowthBook API
gb.init();

export default function App() {
  return (
    <GrowthBookProvider growthbook={gb}>
      <MyApp />
    </GrowthBookProvider>
  );
}

4. Create a Flag in GrowthBook

In GrowthBook’s dashboard:

  1. Navigate to Features → Add Feature
  2. Set a unique feature key: new-onboarding
  3. Choose value type: boolean
  4. Default value is false (off by default)

Your flag is now live.

5. Use the Flag in Your Code

import { useFeatureIsOn } from "@growthbook/growthbook-react";

function OnboardingFlow() {
  const showNewOnboarding = useFeatureIsOn("new-onboarding");

  if (showNewOnboarding) {
    return <NewOnboardingFlow />;
  }

  return <LegacyOnboardingFlow />;
}

That’s it. The flag defaults to false, so everyone sees the classic onboarding. Toggle it to true in the dashboard, and the new version appears instantly. Toggle it back, and you’ve rolled back in seconds.

From here, you can add targeting rules, percentage rollouts, safe rollouts with guardrail metrics, or full A/B experiments within the same dashboard, without changing your code.

Ready to start? Try GrowthBook Cloud free, or check out the documentation for integration guides across all 24+ SDKs. For self-hosting, the GitHub repo has everything you need.

Frequently Asked Questions

1. What is the difference between feature flags and feature management?

Feature flags are the technical mechanism—the if/else statements in your code that check configuration values. Feature management is the broader practice of using flags strategically across the software lifecycle, including targeting rules, progressive rollouts, experimentation, governance, and lifecycle management.

2. What is the difference between feature flags and feature toggles?

“Feature flags” and “feature toggles” are synonyms for the same concept. You’ll also see “feature switches,” “feature flippers,” and “feature gates.”

3. What is the difference between feature flags and experiments?

Feature flags control who sees what. Experiments measure which version performs better. So, flags act as the delivery mechanism to run your experiments, and experiments give you the measurement layer to see the results.

4. What is the difference between feature flags and branches?

Git branches manage code versions during development, while feature flags manage feature visibility in production. With branches alone, you can’t deploy a feature until the branch merges and deploys. With feature flags, the code merges to main immediately, and the flag keeps it hidden until you’re ready to release.

5. What is feature testing?

Feature testing means validating that a feature works correctly before releasing it broadly. With feature flags, you can enable a feature only for QA accounts or internal users and test it in production with real data and traffic patterns.

6. How do feature flags help with continuous delivery?

Feature flags separate deployment from release, so you can merge code continuously and deploy multiple times a day with new features safely wrapped in flags. Without them, you can’t deploy continuously because the feature itself might be incomplete or unvalidated.

7. What is progressive delivery, and how do feature flags enable it?

Progressive delivery is the practice of gradually releasing features to larger user segments while monitoring for issues at each stage. Instead of a binary release (off for everyone, then on for everyone), you incrementally increase exposure. For example, you might release a feature to the internal team first, then to 5% of real users, and keep expanding until you reach 100%.

8. How do feature flags differ from configuration files?

Configuration files are static. To change them, you have to redeploy the code or restart the whole service. Feature flags, by contrast, evaluate at runtime. You flip a switch, and the change propagates to your application within seconds via streaming updates.

9. How can you deploy and manage feature flags at scale?

To deploy flags at scale, you need the following features and capabilities:

  • Centralized feature flag management across all services
  • SDKs for every language in your stack
  • Streaming updates where changes propagate instantly
  • Governance controls like audit logs, RBAC, and approval workflows
  • Lifecycle management, such as stale flag detection, ownership tracking, and enforcement of cleanup cadence

10. What are client-side feature flags?

Client-side flags evaluate in the browser or mobile app rather than on your server. They’re useful for UI experiments, frontend rollouts, and A/B tests on visual elements. They’re usually visible to users, so don’t use them for PII, sensitive data, or access control.

11. What are the benefits of an open source feature flag platform?

Here’s why open source platforms are the better choice today:

  • You can inspect exactly how flags are evaluated, how experiments are analyzed, and how your data is processed. Plus, you can audit security practices before deploying to production.
  • You can fork the codebase if the project changes direction. You can also self-host indefinitely without an ongoing vendor relationship.
  • You can deploy the app within your own infrastructure. This is critical for regulations like HIPAA, FedRAMP, SOC 2, and GDPR.
  • With open source platforms, you pay for the infrastructure you choose, so you don’t rely on the vendor’s infrastructure pricing.

You can take advantage of OpenFeature, a Cloud Native Computing Foundation (CNCF) incubating project that creates a vendor-agnostic API standard for feature flagging.

Your React Feature Flags Are Probably Broken (Here's How to Fix Them with TypeScript)
Feature Flags

Feb 23, 2026

Your checkout component renders perfectly. The layout looks right, the tax rate loads, the payment methods appear. Everything passes your visual check — and then you deploy, and users get the experimental beta layout when they shouldn't.

The bug? A feature flag with the value "false" — a string, not a boolean. In JavaScript, a non-empty string is truthy. So the flag that was supposed to disable the experimental UI was quietly enabling it for every single user. TypeScript had no idea, because it had no idea your feature flags existed at all.
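The failure is easy to reproduce in isolation. This short sketch (variable names are illustrative) contrasts a truthiness check with a strict boolean comparison:

```typescript
// A flag value that arrived as the string "false" instead of the boolean false.
const rawFlagValue: unknown = "false";

// Buggy check: any non-empty string is truthy, so this enables the feature.
const enabledByTruthiness = rawFlagValue ? true : false;

// Safer check: only the boolean true turns the feature on.
const enabledByStrictCheck = rawFlagValue === true;

console.log(enabledByTruthiness);  // true
console.log(enabledByStrictCheck); // false
```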

This is the quiet danger of untyped feature flags. They fail silently, they're hard to reproduce in tests, and they tend to surface at the worst possible moment. Here's how to close that gap with generated TypeScript types — and a few additional best practices that make your flags easier to maintain as your codebase grows.

Why TypeScript Doesn't Save You (By Default)

Most React developers working with feature flags write something like this:

const { isExperimental, taxRate, headline, paymentMethods } = useGrowthBook();

This looks reasonable. But TypeScript has no way of knowing:

  • Whether isExperimental is a real flag name in GrowthBook
  • Whether it should be a boolean, a string, or a number
  • Whether your fallback value type matches the default defined in your dashboard

Without type definitions, the SDK treats everything as any. You can pass a string where a boolean belongs, misspell a flag name, or reference a flag that's been deleted — and your code compiles without complaint. The result is a whole category of bugs that are genuinely hard to catch: everything looks fine at the TypeScript layer, but the runtime behavior is wrong.

The Fix: Generated Type Definitions for Your Feature Flags

The solution is to give the GrowthBook React SDK a TypeScript interface that describes all your flags — their names and their value types — so the compiler can enforce correctness for you.

GrowthBook provides a CLI tool to generate these types. But if you're using Cursor or another AI-assisted editor with MCP support, there's an even faster path: you can generate the types directly through GrowthBook's MCP server without leaving your editor.

Either way, the result is a file called app-features.ts that contains a complete TypeScript interface for every flag in your GrowthBook account:

export interface AppFeatures {
  checkout_experimental_layout: boolean;
  headline: string;
  shipping_tax: number;
  payment_methods: string[];
}

Every flag. Every type. Automatically generated from your actual GrowthBook configuration — not hand-written and left to drift.

Using the Flag Types in Your Component

Once you have app-features.ts, import it and pass it to the useGrowthBook hook as a generic type parameter:

import { AppFeatures } from './app-features';
import { useGrowthBook } from '@growthbook/growthbook-react';

// Inside your component:
const gb = useGrowthBook<AppFeatures>();

That one change unlocks the full power of TypeScript's type checker against your feature flags. The moment you do this, errors that were previously invisible become immediately visible — right in your editor, before you run anything.
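If you're curious why a single generic parameter is enough, the mechanism can be sketched without the SDK at all. TypedFlags below is a hypothetical stand-in, not GrowthBook's implementation; it only shows how a generic key parameter ties each flag name to its value type.

```typescript
// Mirrors the generated AppFeatures interface shown above.
interface AppFeatures {
  checkout_experimental_layout: boolean;
  headline: string;
  shipping_tax: number;
  payment_methods: string[];
}

// Hypothetical wrapper (not the SDK itself): K must be a known flag name,
// and the fallback must match that flag's declared type.
class TypedFlags<T extends object> {
  private readonly values: Partial<T>;

  constructor(values: Partial<T>) {
    this.values = values;
  }

  get<K extends keyof T>(key: K, fallback: T[K]): T[K] {
    const value = this.values[key];
    return (value === undefined ? fallback : value) as T[K];
  }
}

const flags = new TypedFlags<AppFeatures>({ checkout_experimental_layout: false });

const layout = flags.get("checkout_experimental_layout", true); // boolean
const headline = flags.get("headline", "Welcome");              // string

// Both of these would be rejected by the compiler:
// flags.get("isExperimental", false);                 // unknown flag name
// flags.get("checkout_experimental_layout", "false"); // string is not boolean
```

The two commented-out calls are exactly the mistakes described in the next section: a misspelled key and a string fallback for a boolean flag.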

The Errors You'll Actually See

When we applied types to a real checkout component with four flags, TypeScript surfaced several problems immediately:

Wrong flag name. The component was using isExperimental as a flag key. The actual flag in GrowthBook is checkout.experimental_layout. Without types, this compiled and ran fine — it just returned the fallback value every time, silently. With types, it's a compiler error on the spot.

Wrong value type. The fallback for checkout.experimental_layout was "false" — a string. The actual flag type is boolean. This is the bug from the opening: because "false" is a truthy string, the experimental layout was enabled for every user. TypeScript catches this the moment you add the type definition.

Mismatched default values. The component assumed payment_methods defaulted to just credit card. The actual default in GrowthBook includes Bitcoin. With the MCP server, you can verify that your fallback values match your GrowthBook defaults directly in the editor — and even have the agent update the code for you.

These aren't hypothetical bugs. They're the kind of thing that gets deployed on a Friday.

Keeping Flag Types in Sync

Generating types once is useful. Keeping them current is what makes this a real system.

When you generate types via the GrowthBook CLI or MCP server, it also adds a script to your package.json:

"scripts": {
  "generate-flag-types": "growthbook features generate-types"
}

Run this any time you add, remove, or change a flag in GrowthBook. It takes seconds and ensures your TypeScript definitions never drift from your actual configuration. A good practice: add it to your CI pipeline, or at minimum to your pre-release checklist. Stale type definitions are better than none, but fresh ones are what give you the full safety guarantee.

Three More Feature-Flag Best Practices Worth Adding

Type safety solves the hardest category of feature flag bugs, but there are a few additional practices that will save you headaches as your flag usage grows.

Handle Loading States Explicitly

When your app initializes, GrowthBook fetches flag values from the server. During this brief window, the SDK relies on local fallback values. If not handled explicitly, this can result in a "flash of unstyled content" (FOUC) where users see the wrong UI state for a split second.

To solve this, the GrowthBook React SDK provides the <FeaturesReady> helper component. It allows you to render a loading state until your features are fully loaded:

<FeaturesReady timeout={500} fallback={<LoadingSpinner/>}>
  <ComponentThatUsesFeatures/>
</FeaturesReady>

Don't skip this. While loading is often near-instant in development, the "flash" becomes painfully obvious for production users on slower connections.

Use Descriptive, Consistent Flag Names

Flag names like ff-123 or new-ui become unmaintainable fast. When you have 50 flags, you need to know at a glance what each one controls, which team owns it, and whether it's still active.

A naming convention that works well: {scope}-{description}-{date}

  • Examples: checkout-experimental-layout-2025-03, pricing-annual-discount-enabled-2026-01, onboarding-video-modal-shown-2026-02.

It's more characters, but it's searchable, scannable, and self-documenting.

Paired with TypeScript autocomplete (which you now have), good naming means you can find the right flag in seconds rather than hunting through a dashboard.
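A convention only helps if it's enforced. As a rough sketch, a lint step could validate names with a regular expression like the one below. It assumes lowercase kebab-case and a YYYY-MM date suffix, and parseFlagName is a hypothetical helper rather than anything GrowthBook ships.

```typescript
// Hypothetical validator for the {scope}-{description}-{date} convention,
// assuming lowercase kebab-case and a YYYY-MM date suffix.
const FLAG_NAME =
  /^([a-z][a-z0-9]*)-([a-z0-9]+(?:-[a-z0-9]+)*)-(\d{4})-(0[1-9]|1[0-2])$/;

interface ParsedFlagName {
  scope: string;
  description: string;
  date: string; // YYYY-MM
}

// Returns the parsed parts, or null when the name breaks the convention.
function parseFlagName(name: string): ParsedFlagName | null {
  const match = FLAG_NAME.exec(name);
  if (match === null) return null;
  const [, scope, description, year, month] = match;
  return { scope, description, date: `${year}-${month}` };
}

console.log(parseFlagName("checkout-experimental-layout-2025-03"));
// { scope: "checkout", description: "experimental-layout", date: "2025-03" }
console.log(parseFlagName("ff-123")); // null
```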

Know What Client-Side Flags Can and Can't Do

Feature flags evaluated in the browser are visible to users — anyone with DevTools can inspect the flag values your app receives. This is fine for UI experiments and gradual rollouts, but it means you should never use client-side feature flags to gate access to sensitive features or enforce permissions.

For anything security-sensitive — premium features, admin capabilities, access control — validate on the server. Client-side flags are for experience control, not authorization.
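Here's a minimal sketch of what "validate on the server" means in practice. The types and the canUseAdminPanel function are hypothetical; the point is that the decision relies only on data the server controls.

```typescript
// Hypothetical server-side check: the entitlement comes from data the server
// controls, never from a flag value the browser reports.
interface User {
  id: string;
  plan: "free" | "premium";
  roles: string[];
}

interface ClientRequest {
  user: User; // loaded server-side from the session, not from the request body
  // Whatever the client claims about its flags is untrusted input.
  clientFlags: Record<string, boolean>;
}

function canUseAdminPanel(request: ClientRequest): boolean {
  // Intentionally ignores request.clientFlags.
  return request.user.roles.includes("admin");
}

const spoofed: ClientRequest = {
  user: { id: "u1", plan: "free", roles: ["member"] },
  clientFlags: { "admin-panel-enabled": true }, // edited in DevTools
};

console.log(canUseAdminPanel(spoofed)); // false
```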

Getting Started

If you're using GrowthBook with React, here's the short path to type-safe flags:

  1. Generate your types using the GrowthBook CLI (npx growthbook features generate-types) or through the MCP server in Cursor
  2. Import and apply the types to your useGrowthBook hook
  3. Fix the errors TypeScript surfaces — treat each one as a bug caught before production
  4. Add loading state handling so users don't see flashes of the wrong UI
  5. Standardize your flag naming convention before your flag count grows
  6. Add the generation script to package.json and run it whenever your flags change

The type setup takes under 10 minutes. The bugs it prevents can take hours to diagnose after the fact — and the ones that reach users can take down conversions quietly for days before anyone notices.

GrowthBook has full documentation on TypeScript type generation for React and every other supported SDK. If you run into questions, the GrowthBook Slack community is active and helpful for anything experimentation-related.

GrowthBook vs LaunchDarkly: Why Developers Choose GrowthBook for Feature Flagging
Feature Flags

Feb 22, 2026

A feature flag platform ends up in critical paths: request handlers, render paths, mobile startup flows, and incident response. Most development teams still evaluate tools by feature checklists and pricing pages. That misses what tends to matter after adoption: runtime behavior, failure modes, testability, and whether measurement becomes a second system of record.

This article compares GrowthBook and LaunchDarkly across three architectural planes that tend to matter more than feature checklists:

  1. Runtime plane: How flag definitions propagate, how targeting decisions are evaluated, update models, hot-path dependencies, and outage behavior.
  2. Measurement plane: How rollout exposure connects to outcomes, and whether measurement becomes a second system of record.
  3. Control plane: Governance, approvals, environments, enterprise integrations, and deployment model.

Each section focuses on real-world behavior and operational tradeoffs, not feature checklists. Both platforms cover the control-plane basics and provide rollout safety features, but they optimize for different priorities.

LaunchDarkly tends to win when you want observability-connected safety automation and enterprise workflow/compliance plumbing (ServiceNow, Terraform, broader certification programs).

GrowthBook tends to win when you want deterministic local evaluation, SQL-native impact measurement aligned with your database or warehouse, self-hosting options, and seat-based pricing predictability.

Pick based on whether you prioritize managed safety and workflow automation or runtime predictability and measurement alignment with your existing data systems.

The three planes of feature flagging

Every feature flag platform operates across three planes:

Diagram 1: The three planes of feature flagging

Control-plane features are easy to compare; runtime and measurement are where long-term debt accumulates.

GrowthBook vs LaunchDarkly: Runtime, Measurement, and Control Planes Compared

Plane | What matters | GrowthBook (typical) | LaunchDarkly (typical)
Runtime | Hot-path dependency, caching, failure modes, targeting flexibility | Local rule evaluation; attribute-driven targeting without schema changes | Relies on LD services for rule storage; multi-context targeting with explicit entity modeling
Measurement | Joining exposure to outcomes, avoiding "two truths" | SQL-native measurement against your database/warehouse | Event collection with export paths; warehouse-native options vary by offering
Control | Rules, environments, approvals, governance, deployment | Strong baseline; self-host option; single-stage approvals | Deeper enterprise workflow plumbing (ITSM/IaC); multi-stage approvals; broader certifications

Runtime plane: updates, evaluation, and targeting

Feature flags live in hot paths and incident loops. When you evaluate feature flagging platforms, start with four runtime questions:

  1. How do updates propagate? (polling vs streaming; how fast can you change a rollout?)
  2. What's on the hot path? (local evaluation vs remote dependency; where does latency come from?)
  3. What happens in partial outages? (what's cached; what degrades; what falls back to defaults?)
  4. How flexible is targeting? (can you add new dimensions without refactoring your identity model?)

The first three are pure runtime concerns. Targeting spans both the control plane (where you define rules) and runtime (where those rules get evaluated). Below is how GrowthBook and LaunchDarkly answer these questions in practice.

Update propagation (polling vs streaming)

How do flag changes reach running apps? This matters when you're expanding a rollout or killing a flag during an incident.

GrowthBook: SDKs fetch and cache the rules payload locally at initialization and can refresh it periodically or on demand. If you need faster propagation, the GrowthBook Proxy or GrowthBook Cloud supports streaming updates via Server-Sent Events.

Diagram 2: GrowthBook SDKs Runtime Evaluation Flow

LaunchDarkly: Server-side SDKs commonly use streaming connections for updates. Client-side SDKs may poll or stream depending on platform and configuration.

What to take away: GrowthBook is “fetch-and-cache by default, stream when needed.” LaunchDarkly is “streaming-first” in many deployments.

Hot-path dependency (local rules vs remote evaluation)

GrowthBook: Often uses a cached-rules model: SDKs fetch a rules payload, keep it locally, and evaluate in-process. If you need to keep targeting rules off the client, GrowthBook supports Remote Evaluation mode via the proxy/edge workers.

LaunchDarkly (client-side): Client-side SDKs rely on LaunchDarkly services to store flag rules and deliver flag values/updates for a specific context, reducing rule exposure but increasing network dependence during init/refresh.

What to take away: Rule secrecy and centralized evaluation usually imply more network reliance; local rules reduce dependency surface area.

Partial outages and degradation behavior

Both platforms cache locally, but what’s cached determines the failure mode:

  • GrowthBook client-side (local rules): SDK caches the ruleset. During an outage, evaluation continues from cached rules; you mainly lose propagation of new changes.
  • LaunchDarkly client-side (local values): SDK caches evaluated values for a context. During an outage, cached values continue to serve; evaluations that require fresh context updates may fall back to defaults until connectivity is restored.
  • Server-side (both): SDKs typically cache rules locally and evaluate without network calls; outages mostly affect receiving updates.

What to take away: The practical difference is whether the client can keep evaluating against rules offline (rules cached) versus only serving previously-evaluated values (values cached). 
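The distinction is easier to see in a toy model. The two classes below are deliberately simplified and hypothetical; neither reflects the real internals of either vendor's SDK. They only contrast caching a rule with caching previously computed values.

```typescript
// Illustrative model only; neither class reflects a real SDK's internals.
type Context = { id: string; plan: string };
type Rule = (ctx: Context) => boolean;

// Caches the rule itself, so any context can be evaluated offline.
class RulesCache {
  private readonly rule: Rule;

  constructor(rule: Rule) {
    this.rule = rule;
  }

  evaluate(ctx: Context): boolean {
    return this.rule(ctx);
  }
}

// Caches only results that were already computed remotely, keyed by context.
class ValuesCache {
  private readonly values = new Map<string, boolean>();

  private key(ctx: Context): string {
    return `${ctx.id}|${ctx.plan}`;
  }

  remember(ctx: Context, value: boolean): void {
    this.values.set(this.key(ctx), value);
  }

  // Falls back to the default when this exact context was never evaluated.
  evaluate(ctx: Context, fallback: boolean): boolean {
    return this.values.get(this.key(ctx)) ?? fallback;
  }
}

const proOnly: Rule = (ctx) => ctx.plan === "pro";
const rulesCache = new RulesCache(proOnly);
const valuesCache = new ValuesCache();

// Before the outage, the user was on the free plan.
valuesCache.remember({ id: "u1", plan: "free" }, false);

// During the outage, the user upgrades, so the context changes.
const upgraded: Context = { id: "u1", plan: "pro" };
console.log(rulesCache.evaluate(upgraded));         // true
console.log(valuesCache.evaluate(upgraded, false)); // false (default)
```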

That covers how flags arrive and evaluate in production. The next runtime question is what you can express with those evaluations: targeting.

Targeting: who sees what, and how flexible is the model?

Targeting determines who sees what when a flag is evaluated: by user, tenant, region, device, plan, or any other attribute.

Targeting straddles both runtime and control planes. You author rules in the control plane, but they execute at runtime. We cover it here because the runtime model (how rules get evaluated, what you can express without SDK changes) is where GrowthBook and LaunchDarkly differ most. The control-plane authoring experience is comparable; the runtime flexibility is not.

Tenant-consistent rollouts for B2B

B2B SaaS teams need rollouts that are consistent per tenant. If you're testing billing changes on 10% of organizations, User A and User B from Acme Corp need to see the same thing.

GrowthBook: hash-based bucketing on any attribute. Set the hash attribute to company_id and all users from the same company land in the same bucket. No state synchronization required.
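As a rough illustration of why no state synchronization is needed, the sketch below buckets on whichever attribute you name. The hash is a generic FNV-1a and the helper names are hypothetical; this is not a claim about the SDK's exact algorithm.

```typescript
// Sketch of attribute-based bucketing; helper names are hypothetical.
type Attributes = Record<string, string>;

// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Include a unit in the rollout based on the chosen hash attribute.
function isInRollout(
  flagKey: string,
  attributes: Attributes,
  hashAttribute: string,
  fraction: number
): boolean {
  const unit = attributes[hashAttribute] ?? "";
  const position = (fnv1a32(`${flagKey}:${unit}`) % 10000) / 10000;
  return position < fraction;
}

const userA: Attributes = { user_id: "alice", company_id: "acme" };
const userB: Attributes = { user_id: "bob", company_id: "acme" };

// Hashing on company_id: both Acme users always get the same answer.
const a = isInRollout("billing-v2", userA, "company_id", 0.1);
const b = isInRollout("billing-v2", userB, "company_id", 0.1);
console.log(a === b); // true
```

Since the result depends only on the flag key and the company_id value, every server and client computes the same answer independently.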

“We were looking to customize attributes on which we could toggle a roll-out, instead of using a percentage roll-out. Having tags for the classroom or the district a student is in, and then actually rolling out based on those, gives us a lot more power.”
John Resig, Chief Software Architect, Khan Academy (customer story)

LaunchDarkly: multi-context targeting. Define explicit context kinds (user, organization, device) and build rules that compose them. More structured but requires more upfront modeling.

Composition vs. Structural Identity

LaunchDarkly: multi-contexts model User, Organization, and Device as distinct entities. That helps when those entities have separate lifecycles, metadata, and policy rules.

GrowthBook: keeps the runtime model simpler: evaluation is driven by the attributes you pass (for example company_id, plan, device, region), and you can compose reusable targeting logic with Saved Groups (including nested groups) instead of introducing entity schemas at the SDK level.
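To show what composing targeting logic from attributes and nested groups can look like, here's a deliberately small model. The Condition and Groups types are hypothetical and much simpler than a real Saved Groups feature; the sketch also assumes groups never reference each other in a cycle.

```typescript
// Hypothetical model of attribute-driven targeting with nestable groups.
type Attributes = Record<string, string>;

type Condition =
  | { kind: "equals"; attribute: string; value: string }
  | { kind: "inGroup"; group: string };

// A group matches when any of its conditions match.
type Groups = Record<string, Condition[]>;

// Assumes group references contain no cycles.
function matches(condition: Condition, attrs: Attributes, groups: Groups): boolean {
  if (condition.kind === "equals") {
    return attrs[condition.attribute] === condition.value;
  }
  const members = groups[condition.group] ?? [];
  return members.some((c) => matches(c, attrs, groups));
}

const groups: Groups = {
  enterprise: [{ kind: "equals", attribute: "plan", value: "enterprise" }],
  // "priority" nests the "enterprise" group and adds a region condition.
  priority: [
    { kind: "inGroup", group: "enterprise" },
    { kind: "equals", attribute: "region", value: "eu" },
  ],
};

const rule: Condition = { kind: "inGroup", group: "priority" };

console.log(matches(rule, { plan: "enterprise", region: "us" }, groups)); // true
console.log(matches(rule, { plan: "free", region: "us" }, groups));       // false
```

Adding a new targeting dimension in this model means passing one more attribute; no new entity type has to be declared.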

When to choose

Choose GrowthBook if: You expect targeting dimensions to change over time and you want to add new ones without introducing new context schemas or SDK-level entity modeling.

Choose LaunchDarkly if: You need explicit, first-class separation between entity types (user vs org vs device) and you want targeting/governance to reflect those boundaries directly.

Measurement plane: proving rollout impact

Measurement determines whether you can reliably connect who saw a change with what happened next. The integration model matters because it either reuses the metrics and data pipelines your team already trusts, or creates a second analytics system that can drift from your source of truth.

When toggles aren't enough

You ship a feature behind a flag to 20% of users. A week later, your PM asks: "Did it increase conversion?" Your VP asks: "Did it slow page loads?"

Now flags become an analytics problem. You need to join flag exposure (who saw what variant) with outcomes (revenue, latency, errors).

When flag platforms become a second source of truth

Many centralized experimentation and flag platforms collect events via SDKs, store them in their infrastructure, and provide dashboards for analysis. To join rollout data with your product analytics or warehouse, you use Data Export (sometimes an add-on depending on plan) and build pipelines.

This creates two problems:

  1. Duplicate instrumentation. Sending events to your analytics platform (Amplitude, Mixpanel, your warehouse) AND your flag vendor. Duplicate tracking code, duplicate schemas, potential drift.
  2. Metric drift. Vendor analytics calculates revenue one way. BI team calculates it another. Results don't match. Trust erodes.

If your product data already lives in a central database or warehouse, a second analytics system can introduce unnecessary drift and duplication.

How the two platforms approach this:

LaunchDarkly: Collects flag exposure events into its own system and provides analysis there. To analyze outcomes using your warehouse metrics, you typically export those events and join them downstream.

GrowthBook: Reads exposure and outcome data directly from your database or warehouse and computes results with SQL, so the experiment uses the same tables and metric definitions your BI and engineering teams already rely on.

SQL as a simplifier

GrowthBook's approach: compute outcomes using SQL against your existing database. Flag exposure and outcome analysis stay aligned with the metrics and joins your team already trusts.

How it works:

  1. Define metrics in SQL against tables that already exist in your database.
  2. GrowthBook runs those queries to calculate results.
  3. All data stays in your database. No export, no pipelines, no schema mapping.

Postgres/MySQL as the practical on-ramp

"Warehouse-native" can sound like you need Snowflake or BigQuery to get started. You don't. If you're running Postgres or MySQL for your application, GrowthBook can use those directly as your measurement database. This lets engineering teams start with outcome measurement without waiting for data warehouse infrastructure or analytics team support.

Practical setup:

  1. Connect GrowthBook to a read replica (not your production primary). Cap time windows to avoid full table scans.
  2. Define a Fact Table with a single SQL query, then derive multiple metrics from it using the metric builder. For advanced cases, drop to raw SQL.

As query volume grows, the same SQL-defined metrics can move from Postgres to ClickHouse or your preferred warehouse without rewrites. GrowthBook supports multiple SQL data sources including Postgres, MySQL, ClickHouse, Snowflake, BigQuery, Redshift, and Databricks.

Practical benefits of warehouse-native measurement

Use metrics you trust. Your BI team has a revenue metric. Your data engineers validated it. Use that in rollout analysis instead of reimplementing it in a vendor dashboard.

Measure engineering outcomes without instrumentation. You have error logs in BigQuery. Write a metric that counts errors by flag variant. No SDK events, no custom tracking. Just SQL against existing tables.

Tail metrics with statistical validity. If you already store latency or error telemetry, GrowthBook can attribute changes to rollout exposure using the same analysis pipeline as your other metrics. P95/p99 and tenant-level effects get measured at the right unit, not eyeballed from a graph.

Tenant-correct measurement for B2B. GrowthBook can measure rollout impact in a way that respects tenant boundaries, so you don't accidentally treat thousands of users in one large customer as thousands of independent samples. LaunchDarkly can get you the exposure data, but tenant-correct impact measurement is something you typically implement yourself downstream.
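A small worked example shows why the unit of analysis matters. The data and helper names below are made up for illustration, and the arithmetic is a simplification rather than GrowthBook's statistics engine: it just compares a user-level conversion rate with a per-tenant one.

```typescript
// Illustrative arithmetic only: aggregate user-level outcomes to one
// observation per tenant before comparing variants.
interface UserOutcome {
  tenantId: string;
  converted: boolean;
}

// One conversion rate per tenant.
function tenantConversionRates(rows: UserOutcome[]): number[] {
  const byTenant = new Map<string, { users: number; conversions: number }>();
  for (const row of rows) {
    const agg = byTenant.get(row.tenantId) ?? { users: 0, conversions: 0 };
    agg.users += 1;
    if (row.converted) agg.conversions += 1;
    byTenant.set(row.tenantId, agg);
  }
  return Array.from(byTenant.values(), (t) => t.conversions / t.users);
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Made-up data: one large tenant where all 8 users converted,
// plus two single-user tenants where nobody did.
const treatment: UserOutcome[] = [
  ...Array.from({ length: 8 }, () => ({ tenantId: "big", converted: true })),
  { tenantId: "small-1", converted: false },
  { tenantId: "small-2", converted: false },
];

const userLevelRate = mean(treatment.map((r) => (r.converted ? 1 : 0)));
const tenantRates = tenantConversionRates(treatment);
const tenantLevelRate = mean(tenantRates);

console.log(userLevelRate);      // 0.8, from 10 "samples"
console.log(tenantRates.length); // 3 independent tenants
console.log(tenantLevelRate);    // about 0.33
```

Counted per user, this looks like an 80% conversion rate across ten observations. Counted per tenant, it's three observations, and one large customer accounts for all of the effect.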

LaunchDarkly's approach

LaunchDarkly's Data Export sends raw events to your warehouse. You can then write SQL to join them with your metrics. This approach works, but it introduces an additional pipeline to maintain.

LaunchDarkly has now started offering warehouse-native experimentation capabilities for Snowflake. Availability and feature scope may vary by plan and region.

Control plane: governance and deployment

Operational constraints determine how a flagging platform fits into your organization’s security model, change-management processes, and cost structure. These factors often matter less during early adoption, but become decisive as teams scale, enter regulated industries, or standardize release workflows across the company.

Governance and deployment

Standard controls

Both platforms support staged rollouts, approval workflows, and RBAC. GrowthBook has single-stage approvals. LaunchDarkly supports multi-stage approvals (up to five stages). Most teams find single-stage sufficient. Regulated industries or large organizations with complex change management may require multi-stage.

Environments work similarly: dev, staging, production. You can copy flags across environments and test changes before promoting to prod.

Self-hosting and data perimeter control

GrowthBook is open source and self-hostable. Run the entire platform in your VPC, keep all data in your infrastructure, never send PII to a vendor. Deployment via Docker, Kubernetes, Helm.

LaunchDarkly is a multi-tenant SaaS. Relay Proxy exists for edge caching, but the management and control plane remains SaaS. For air-gapped deployments or zero data egress requirements, GrowthBook is the option between these two.

"With the kinds of experiments we run and the sensitive data we handle, data security is paramount. The fact that GrowthBook offered us the ability to keep that data in-house was a key reason why we chose to work with them."
Diego Accame, Director of Engineering, Growth, Upstart (customer story)

Where LaunchDarkly's enterprise integrations matter

LaunchDarkly has native ServiceNow integration, a mature Terraform provider, and compliance certifications (ISO 27001, ISO 27701, FedRAMP). For organizations that require ServiceNow change management, Terraform for infrastructure-as-code, or specific compliance attestations, these are table stakes.

GrowthBook has SOC 2 Type II. LaunchDarkly’s additional certifications will matter to some teams selling into highly regulated industries. For teams without these specific requirements, these certifications won’t be necessary.

That covers the three architectural planes. But there's one more dimension that doesn't fit neatly into the model, and it often ends up mattering more than teams expect.

The hidden dimension: pricing (dis)incentives

That fourth dimension is pricing. Most teams treat it as a procurement problem, separate from technical decisions. That's a mistake.

Pricing models shape how teams use platforms. They create incentives that ripple back into architecture.

LaunchDarkly prices their product based on monthly active users (MAU) for client-side and service connections for server-side (per LaunchDarkly's pricing page). As your user base grows, your bill grows. At scale, MAU-based pricing can push teams to architect around counting and routing rather than shipping: sampling, filtering, or proxying traffic to manage cost. The platform becomes a line item that scales with success, which can create tension between "flag everything" best practices and budget constraints.

GrowthBook prices per seat with unlimited flags, traffic, and experiments (per GrowthBook's pricing page). The bill is the same whether you have 100K users or 10M users. This removes the "should we flag this?" calculation. Teams use flags more liberally because there's no marginal cost per evaluation, which can lead to cleaner release processes and faster rollbacks. It also lets teams deploy feature flags at scale far more cost-effectively.

This isn't about which model is "better." It's about recognizing that pricing affects architecture. If your team is already optimizing flag usage to manage costs, that's a signal worth examining.

How to decide

When GrowthBook is the better fit for development teams

  • Flag evaluation off the network hot path. GrowthBook's local evaluation model keeps decisions in-process with cached rules, reducing dependency surface area and making failure modes easier to reason about.
  • Identity model changes frequently. Attribute-driven targeting lets you add new dimensions (tenant, plan, cohort, region) without introducing new context schemas or SDK-level entity modeling.
  • Rollout impact measured in SQL on data you already trust. GrowthBook computes metrics directly against your Postgres, MySQL, or warehouse tables, avoiding a second analytics system and metric drift.
  • Self-hosting or strict data-perimeter control. GrowthBook can run entirely inside your VPC with no PII leaving your infrastructure.
  • Predictable, seat-based pricing. Costs stay stable as traffic grows, which removes incentives to sample or proxy requests just to manage MAU.
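
To make "local evaluation" concrete, here is a minimal Python sketch of deciding a flag in-process from cached rules, with no network call at evaluation time. It is not GrowthBook's SDK or its hashing algorithm. The names `bucket`, `evaluate`, and `CACHED_RULES`, and the rule format, are assumptions made up for this illustration.

```python
import hashlib

# Hypothetical rule cache; in a real system this would be refreshed in the background.
CACHED_RULES = {
    "new-checkout": {"match": {"plan": ["pro", "enterprise"]}, "rollout": 0.5},
}


def bucket(user_id: str, flag_key: str) -> float:
    """Map (flag, user) to a stable value in [0, 1) using a hash."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def evaluate(flag_key: str, attributes: dict, rules: dict = CACHED_RULES) -> bool:
    """Decide a flag from cached rules and user attributes only."""
    rule = rules.get(flag_key)
    if rule is None:
        return False  # unknown flag: fall back to "off"
    # Attribute-driven targeting: every listed attribute must match.
    for attr, allowed in rule.get("match", {}).items():
        if attributes.get(attr) not in allowed:
            return False
    # Percentage rollout: the same user always lands in the same bucket.
    return bucket(str(attributes["id"]), flag_key) < rule["rollout"]
```

Because the decision depends only on the cached rules and the user's attributes, repeated calls for the same user return the same answer, which is what makes failure modes easier to reason about.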

When LaunchDarkly is the better fit for development teams

  • Auto-generated metrics from observability platforms. LaunchDarkly can auto-generate metrics from OTel traces and observability tools (Dynatrace, Honeycomb, New Relic, Splunk). This reduces setup time when you want rollout safety tied immediately to production telemetry. GrowthBook has Safe Rollouts with auto-rollback as well, but guardrail metrics are SQL-defined, which means routing observability data to your database or warehouse first.
  • ServiceNow/ITSM governance is mandatory. LaunchDarkly has native ServiceNow integration. GrowthBook uses webhooks and APIs.
  • ISO 27001, ISO 27701, or FedRAMP certifications are required. For teams selling to federal agencies or highly-regulated industries, these certifications are non-negotiable.
  • Terraform provider is mandatory. LaunchDarkly has a mature Terraform provider for infrastructure-as-code workflows. GrowthBook doesn't.
  • Niche or legacy SDK coverage. LaunchDarkly supports platforms like Haskell, Erlang, and Apex. GrowthBook's 24+ SDKs cover most platforms but not these.

Decision guide

Priority | Recommended Feature Flagging Platform
Runtime dependency surface (hot path, outages, testing) | GrowthBook
Flexible targeting without refactors | GrowthBook
Self-hosting / data perimeter control | GrowthBook
SQL-native measurement (database/warehouse) | GrowthBook
Seat-based pricing predictability | GrowthBook
Experimentation rigor (quantile, CUPED, cluster) | GrowthBook
Release safety automation (guarded rollouts, auto-rollback) | Both
Auto-generated metrics from observability platforms | LaunchDarkly
Enterprise compliance (ISO, FedRAMP) | LaunchDarkly
ITSM workflows (ServiceNow) | LaunchDarkly
Infrastructure-as-code (Terraform) | LaunchDarkly
Niche SDK coverage (Haskell, Erlang, Apex) | LaunchDarkly

Bottom line

Feature flagging looks simple on the surface. Development teams discover the real differences once flags sit in their hot paths, incident response, and product metrics.

At runtime, the question is how much of your flag evaluation depends on network and vendor infrastructure. GrowthBook leans toward local, deterministic evaluation. LaunchDarkly leans toward managed infrastructure with deeper built-in safety automation.

In measurement, the question is whether rollout impact lives inside the flag vendor or stays aligned with the database and metrics your team already trusts. GrowthBook centers measurement in SQL on your existing systems. LaunchDarkly centers it in a managed event and observability pipeline, with export paths when you need them.

In operational constraints, the question is whether you need enterprise workflow plumbing and compliance programs out of the box, or control over deployment, data location, and cost structure.

Both platforms cover the control-plane basics. They optimize for different failure modes and organizational priorities. The right choice isn't about feature checklists. It's about which architecture matches how your systems fail, how your data is measured, and how your organization ships software.

A/B Testing in the Age of AI
Experiments

Feb 10, 2026

Prologue: Is A/B Testing Here to Stay?

In the age of AI, there is a growing debate about how it will transform the professions and skills we rely on today. Some view AI as a game changer, capable of completely reshaping the workforce: certain professions and skill sets may disappear entirely, while new, as-yet-unknown roles will emerge. Others argue that AI’s impact will be more evolutionary than revolutionary: the same professions will remain, but AI will accelerate and enhance the work we already do, enabling people to accomplish more in less time.

This debate naturally extends to the realm of A/B testing: will experimentation remain necessary at all? Some suggest that experimentation could become fully automated, potentially making roles like product managers, analysts, developers, and designers redundant. Others contend that while AI will fundamentally reshape these roles, it will not eliminate them. From this perspective, AI’s most significant contribution lies in speed: it can increase the volume of ideas that require testing and accelerate the analysis of results. In effect, AI has the potential to dramatically compress product development cycles, allowing teams to iterate faster and more efficiently.

From this vantage point, A/B testing is far from disappearing; it is evolving. In this blog, we explore how AI is reshaping A/B testing: highlighting the areas already transformed, those on the verge of change, and those likely to remain largely unchanged.

Already Here: How AI Powers the Building Blocks of A/B Testing

A/B testing is the standard approach for determining whether new product versions genuinely outperform existing features. A typical A/B test comprises four main stages: hypothesis generation, where the proposed change and its expected impact are defined; experiment planning, which includes setting up the test conditions and determining an appropriate sample size; data collection and analysis; and finally, drawing conclusions and sharing the results across the organization. AI usage can already be found across the different stages of this lifecycle. 

Hypothesis Generation: Defining What to Test

Keeping track of what has already been done is essential for generating strong hypotheses. Past experiments provide critical context, helping teams avoid redundant or low-value tests and focus on ideas with real potential. Yet systematically tracking prior experiments remains a major challenge for analysts. As experiment volume grows, it quickly exceeds human cognitive capacity, and documentation becomes harder to navigate, especially as teams scale and members frequently join or leave.

This is precisely the kind of problem where AI excels. Platforms like GrowthBook leverage AI to help teams build efficiently on prior experiments by surfacing what has worked, identifying opportunities for new features, and even creating new feature flags and experiments directly in the platform. Crucially, these insights are not based on generic ideas; they are grounded in the company’s own data and experimental history, producing tailored solutions for the specific user population of the product.

Activating this AI support is as natural as talking with a teammate. In GrowthBook, analysts can simply ask what has worked in previous experiments, what has failed, and what to do next. Beyond suggesting new test ideas, the platform evaluates hypothesis quality against organization-defined criteria and helps prevent duplicate experiments by surfacing similar tests related to the current hypothesis.

Planning: Setting Up the Test

Once you know what you are going to check, the next step is to design the test. The main goal at this stage is to determine the test duration, which is directly driven by the required sample size. Sample size calculation is essential to ensure the experiment has enough statistical power to detect an effect when one truly exists.

Importantly, sample size planning is tightly linked to the data and the required confidence levels. Sample size is driven by the expected improvement (the minimum effect you want to detect), the required significance level (often 0.05), and the required statistical power (often 80%). While this planning is largely data driven, AI can still add value by helping teams manage, standardize, and document metrics across the organization by generating clear, consistent definitions and descriptions.
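
As a rough illustration of how those three inputs turn into a sample size and a test duration, here is a minimal Python sketch using the standard normal-approximation formula for comparing two conversion rates. The function names, the 50/50 two-variation split, and the example numbers are assumptions for illustration, not a description of any platform's calculator.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline: float, relative_lift: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH variation for a two-sided test on conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_variation: int, daily_users: int, variations: int = 2) -> int:
    """Days needed if all daily traffic is split evenly across variations."""
    return ceil(n_per_variation * variations / daily_users)


# Hypothetical inputs: 10% baseline conversion, 10% relative lift, 5,000 users/day.
n = sample_size_per_variation(baseline=0.10, relative_lift=0.10)
print(n, duration_days(n, daily_users=5000))
```

Halving the expected lift roughly quadruples the required sample, which is why the minimum detectable effect is usually the input worth debating most.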

Analysis: Data Acquisition and Evaluation

Once data has been collected, AI can add value across the entire analysis pipeline, from data extraction to generating insights. For example, GrowthBook allows users to create SQL queries directly from plain-text descriptions, execute them, and visualize the results. But the impact of AI in this platform goes far beyond query generation. By leveraging information linked to the tests, such as the hypothesis and metrics, AI can produce a full analysis of the results. This includes generating a summary that can be attached to the experiment, with the content and style of the summary controlled through prompts that specify how the results should be described.

Beyond straightforward hypothesis testing, AI can also enable deeper exploration through segmentation analyses, helping teams understand where and for whom effects occur. AI-assisted exploration can also uncover unexpected patterns or secondary signals (such as mouse movement or click behavior) that might otherwise go unnoticed.

Documenting and Sharing: Turning Results into Knowledge

To derive impact from an A/B test, it is essential to clearly communicate the results. In the era of AI, analysts no longer need to struggle with interpreting findings or deciding how to present them to stakeholders. Given information about the experiment and the actual data, a language model can draft an interpretation and a stakeholder-ready summary.

Within Reach: Accelerating A/B Testing Automation & Learning with AI

AI enhancements are already influencing various components of A/B testing. In the near future, we believe AI has the potential to go beyond individual components, integrating the entire A/B testing lifecycle into an end-to-end process and enabling a deeper, more causal understanding of why effects occur and where errors originate.

Unifying the experimentation lifecycle

AI already supports key parts of the experimentation lifecycle. The next step is to integrate these capabilities into a unified, automated workflow. In practice, this could range from describing analysis goals in natural language to AI proactively proposing experiment ideas. For example, GrowthBook’s Weblens allows teams to upload a website URL and receive data-driven experiment recommendations.

In the future, AI-driven systems could autonomously generate product variants and run experiments end to end, while analysts retain oversight to ensure product quality, correct user allocation, and sound interpretation of results. This shift has the potential to significantly reduce the friction that commonly exists between product, engineering, and analytics teams.

Today, analysts are rarely responsible for implementing product changes or launching experiments, which often leads to misalignment, such as missing tracking events or users being allocated but never exposed to a variant. These issues can require substantial rework or, in the worst case, invalidate the experiment entirely. By centralizing the experimentation workflow within an AI-driven system, many of these coordination failures can be prevented, resulting in faster execution, cleaner data, and more reliable insights.

Automatically diagnosing experimental issues

AI can improve experiment validity not only by reducing operational friction, but also by automating validity checks and helping debug experiments when issues arise. For example, today a common validity check is Sample Ratio Mismatch (SRM), which verifies that the actual allocation of users matches the planned allocation. Beyond SRM, it is advisable to periodically conduct A/A tests, which compare the control version against itself. Ideally, no significant differences should emerge; if they do, it may indicate that some aspect of the software or testing environment is unintentionally affecting outcomes. 
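
For readers who have not implemented an SRM check, here is a minimal Python sketch for a two-variation experiment. It uses a chi-square goodness-of-fit test with one degree of freedom, computed with the standard library only. The function name `srm_p_value` and the 0.001 alert threshold are assumptions (the threshold is a common convention, not a rule stated in this post).

```python
from math import erfc, sqrt


def srm_p_value(observed_a: int, observed_b: int, expected_share_a: float = 0.5) -> float:
    """P-value of a chi-square test that the observed split matches the planned split."""
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alerting threshold

p = srm_p_value(observed_a=5000, observed_b=5400)
if p < SRM_THRESHOLD:
    print("Possible sample ratio mismatch; investigate assignment and tracking")
```

The same function doubles as a sanity check on A/A tests: an A/A test with a planned 50/50 split should not trip the threshold.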

So, what can AI contribute to these validation checks? Quite a lot. While implementing SRM and A/A tests is relatively straightforward, diagnosing the source of a problem when one arises is far more complex. Tracing the root cause often requires detailed data exploration, which can be guided by AI tools. In more advanced settings, AI can even proactively detect potential issues by continuously monitoring differences in allocation or user characteristics (e.g., identifying a higher proportion of bots in one group). This capability allows teams to catch and resolve problems earlier, reducing wasted time and resources.

Learning about your product

Understanding why an effect occurred is not only important when errors arise; it becomes even more critical when significant results are observed. Beyond simply deciding which features to ship or retire, companies are deeply interested in understanding their users and their needs. By uncovering what drove the impact in a test, companies can make more informed product decisions, identify opportunities for improvement, and tailor experiences that truly resonate with their users.

Achieving this goal today often requires substantial manual effort, from building dashboards and running follow-up analyses to iteratively exploring data to uncover the drivers of observed changes. AI can dramatically accelerate this process by enabling learning across user segments and other potential explanatory variables. While tools such as automated segmentation analysis already address part of this need, AI’s potential extends much further. In the near future, it is expected to reveal complex segment interactions, detect seasonal patterns, and analyze historical user behavior, providing a deeper understanding of the people who use our product.

Beyond Our Grasp: AI as a Replacement for Humans in A/B Testing

The existing and emerging AI-based practices naturally raise a broader question: how far can automation go? In theory, AI could fully automate the experimentation process. In such a “human-free” scenario, humans would no longer be needed in two key roles. First, they might not be required as users, as their behavior could be accurately modeled and simulated. Second, they might no longer be necessary as decision-makers, with AI autonomously generating, running, and evaluating experiments. From our perspective, however, both assumptions remain far from reality, at least for the foreseeable future. Let’s explore why.

No Need for Humans as Users?

The case for automation.

Human behavior can be modeled computationally, which raises the possibility of using synthetic users to test and iterate on product changes. In principle, such agents could enable experiments to be run, evaluated, and refined without involving real users.

Why humans still matter.

Human behavior is deeply contextual, shaped by emotions, social norms, cultural influences, and continuously evolving motivations. These nuances are difficult to capture fully in any model, and existing datasets inevitably reflect only a partial view of human decision-making. Moreover, products are ultimately designed for, and evaluated by, humans, not abstract agents or simulations. As a result, even the most sophisticated models must ultimately be validated against real human responses.

No Need for Humans as Decision-Makers?

The case for automation.

If AI could autonomously generate hypotheses, run experiments, evaluate outcomes, and draw conclusions to guide subsequent tests, human intervention might become unnecessary. In such a scenario, each experiment would naturally flow into the next, creating a continuous, fully automated experimentation workflow.

Although this vision is tempting, allowing AI algorithms to operate entirely without human oversight is unlikely in the near future; too much is at stake. While AI can assist in running experiments, organizations are unlikely to relinquish human judgment, which safeguards revenue growth, user experience, and alignment with broader business objectives.

This cautious approach is already evident in A/B testing. Fully automated methods, such as reinforcement learning and multi-armed bandits for user allocation, have existed for years. Despite their advantages, these methods are never allowed to run without human supervision. Instead, they typically complement rather than replace classical A/B testing.

This highlights a broader reality: even if AI eventually handles the entire product development lifecycle autonomously, human analysis and creativity will remain crucial for evaluating AI-generated ideas, monitoring product updates, and interpreting results and insights. Human involvement in A/B testing and product decision-making is therefore unlikely to disappear; instead, it will transform: analysts will spend less time on hands-on execution and more on supervising, guiding, ideating, and shaping AI-driven processes.

Bottom line

There is no doubt that AI is transforming A/B testing as we know it. What remains open to debate is the extent of that transformation. In this piece, we’ve shared our perspective on what has already changed and what is most likely to evolve in the near future.

Today, AI is already helping teams generate stronger hypotheses, monitor and interpret metrics, automate large parts of the analysis workflow, and communicate results more effectively. Looking ahead, AI is likely to further connect the different phases of the experimentation lifecycle, enhance debugging and validation capabilities, and strengthen segmentation analysis, unlocking deeper and more nuanced product insights.

By reducing friction, accelerating learning cycles, and lowering the cost of running and analyzing experiments, AI empowers analysts and product teams to learn faster and make better-informed decisions every day.

We invite you to join us at GrowthBook as we continue building the next generation of experimentation, where A/B testing meets the power of AI.

Announcing GrowthBook 4.3: Faster Experiments, Deeper Insights
Releases
Product Updates
4.3

Feb 4, 2026

At GrowthBook, we're focused on helping you learn faster and ship with confidence. GrowthBook 4.3 delivers on both fronts, with post-stratification to reach statistical significance sooner, metric drilldowns to understand results more deeply, and feature evaluation diagnostics to verify your flags are working correctly in production.

GrowthBook 4.3 is now available to all cloud and self-hosted users.

Experiment Analysis

Post-Stratification (Enterprise only)

Experiment analysis now supports post-stratification, a powerful variance reduction technique that produces more precise results.

Here's the idea: if you know revenue varies by country, post-stratification uses that information to isolate the treatment effect from between-group noise. The result is tighter confidence intervals from your existing traffic. In the right conditions, CUPED + post-stratification can be equivalent to running your experiment with 20%+ more traffic!
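
To show the mechanics, here is a minimal Python sketch of a textbook post-stratified estimate of the treatment effect: compute the difference in means inside each stratum (for example, each country), then weight those differences by each stratum's share of users. This is an illustration of the general technique, not GrowthBook's implementation. The function name and the row format are assumptions, and the sketch assumes every stratum contains users from both variations.

```python
from collections import defaultdict
from statistics import mean


def post_stratified_effect(rows):
    """rows: iterable of (stratum, variation, value) with variation in
    {"control", "treatment"}. Returns the stratum-weighted difference in means."""
    by_cell = defaultdict(list)
    for stratum, variation, value in rows:
        by_cell[(stratum, variation)].append(value)

    strata = {stratum for stratum, _ in by_cell}
    total_users = sum(len(values) for values in by_cell.values())

    effect = 0.0
    for stratum in strata:
        control = by_cell[(stratum, "control")]
        treatment = by_cell[(stratum, "treatment")]
        weight = (len(control) + len(treatment)) / total_users
        effect += weight * (mean(treatment) - mean(control))
    return effect


# Hypothetical revenue-per-user data where spend differs strongly by country.
rows = [
    ("US", "control", 10), ("US", "control", 12), ("US", "control", 11),
    ("US", "treatment", 13), ("US", "treatment", 15), ("US", "treatment", 14),
    ("DE", "control", 2), ("DE", "treatment", 3),
]
print(post_stratified_effect(rows))
```

Because each comparison happens within a country, the large revenue gap between countries no longer inflates the noise around the estimated effect.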

Configure post-stratification at the organization level under Settings → General, or override it at the metric or experiment level. To enable it, you'll need to have pre-computed dimensions configured in your experiment assignment query.

Post-stratification is available to Enterprise customers. CUPED (without post-stratification) is available to Pro and Enterprise customers.

Experiment Metric Drilldowns (All editions)

[Figure: GrowthBook experiment metric drilldown panel showing goal metric time series across experiment variations]

Understanding experiment results just got a lot easier. Click any metric row to open a Metric Drilldown, a focused view with everything you need to interpret that metric without jumping between pages:

  • The Overview tab shows metric details, time series, and a results table with analysis controls.
  • The Slices tab lets you see how your metric breaks down across different values.
[Figure: Metric slices view showing average Largest Contentful Paint (LCP) broken out by browser and country]
  • The Debug tab reveals how CUPED, post-stratification, capping, and priors are affecting your numbers.
[Figure: Experiment results debug view showing a goal metric before and after CUPED and capping]

Metric slices are an Enterprise feature. See Metric Slices for configuration details.

Experiment Result Filters (All editions)

Experiments with dozens or hundreds of metrics can be overwhelming to review. You can now filter results by tag, slice, or metric group to focus on what matters.

Once you find a view you like, use Add to Dashboard to save it for later and share with your team. We also cleaned up the results UI to reduce clutter and keep the focus on your data.

Daily Participation Metrics (All editions)

We added a brand new metric type: Daily Participation. For each user, this measures the fraction of days they were active while enrolled in the experiment (active days ÷ days exposed), then averages that value across users in each variation.

Think of it as DAU normalized per user and exposure window, but more stable for experiments than raw daily active user counts.
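
Here is a minimal Python sketch of that definition: a per-user fraction (active days ÷ days exposed), averaged within each variation. The function name and the record fields are assumptions for illustration; in GrowthBook the metric is computed from your warehouse data rather than from in-memory records like these.

```python
from statistics import mean


def daily_participation(users):
    """users: iterable of dicts with 'variation', 'active_days', 'days_exposed'.
    Returns {variation: mean per-user fraction of exposed days that were active}."""
    fractions = {}
    for user in users:
        if user["days_exposed"] <= 0:
            continue  # not yet exposed for a full day; nothing to measure
        fraction = user["active_days"] / user["days_exposed"]
        fractions.setdefault(user["variation"], []).append(fraction)
    return {variation: mean(values) for variation, values in fractions.items()}


# Hypothetical users enrolled at different times, so exposure windows differ.
users = [
    {"variation": "control", "active_days": 3, "days_exposed": 10},
    {"variation": "control", "active_days": 5, "days_exposed": 10},
    {"variation": "treatment", "active_days": 7, "days_exposed": 14},
    {"variation": "treatment", "active_days": 2, "days_exposed": 4},
]
print(daily_participation(users))
```

Normalizing by each user's own exposure window is what keeps late-enrolled users from dragging the metric down.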

This is a really valuable metric for any website or app that is trying to grow daily usage.

Better Fact Table Filters (All editions)

[Figure: Fact metric filter interface for defining row-level filters on a fact table without writing SQL]

Metrics are built on Fact Tables, and often you only need a subset of rows. This release adds a powerful filtering UI to define exactly which rows to include, without writing SQL.

Feature Flags

Feature Evaluation Diagnostics (All editions)

[Figure: Feature evaluation diagnostics table showing SDK evaluation events from a data warehouse]

When a flag isn't behaving as expected, debugging can be frustrating: you're left guessing whether the issue is in your targeting rules, SDK configuration, or something else entirely.

Feature evaluation diagnostics solves this by querying SDK evaluation events stored in your data warehouse. See exactly what evaluated in production, not just what the rules say should happen. Troubleshoot targeting conditions, rollouts, and experiment rules with real data instead of guesswork.

Nested Saved Groups (All editions)

Saved Groups now support nesting, letting you define groups in terms of other groups. Build complex targeting logic while keeping base definitions centralized and reusable.

For example, combine "Beta Users" AND "Enterprise Plan" to create "Beta Enterprise Users." Update the base group, and nested groups update automatically.
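
The idea can be sketched in a few lines of Python. This is a simplified model, not GrowthBook's data format: it assumes base groups are plain sets of user IDs and nested groups are an `all_of` list of other group names. Those names, and the `resolve` function, are made up for this example.

```python
def resolve(name, groups, _seen=frozenset()):
    """Return the set of member IDs for a group, following nested definitions."""
    if name in _seen:
        raise ValueError(f"cycle in saved groups at {name!r}")
    definition = groups[name]
    if isinstance(definition, (set, frozenset)):
        return set(definition)  # base group: an explicit list of IDs
    # Nested group: members must belong to every referenced group (AND).
    children = [resolve(child, groups, _seen | {name})
                for child in definition["all_of"]]
    return set.intersection(*children)


groups = {
    "beta_users": {"u1", "u2", "u3"},
    "enterprise_plan": {"u2", "u3", "u4"},
    "beta_enterprise_users": {"all_of": ["beta_users", "enterprise_plan"]},
}
print(sorted(resolve("beta_enterprise_users", groups)))

# Updating a base group changes what the nested group resolves to.
groups["beta_users"].add("u4")
print(sorted(resolve("beta_enterprise_users", groups)))
```

Resolving at read time, rather than copying members into the nested group, is what keeps the base definition as the single source of truth. The cycle check guards against two groups that reference each other.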

This makes it faster and easier to create targeting rules for feature flags.

Case-Insensitive Regex Targeting (All editions)

New targeting options for case-insensitive regex and "in list" matches—useful for matching email addresses and other values where case shouldn't matter.
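
To illustrate the matching semantics (not the SDK API or the rule syntax, which are not shown here), this is a minimal Python sketch of case-insensitive regex and "in list" checks. The helper names `matches_ci` and `in_list_ci` are hypothetical.

```python
import re


def matches_ci(pattern: str, value: str) -> bool:
    """True if the regex matches anywhere in value, ignoring case."""
    return re.search(pattern, value, flags=re.IGNORECASE) is not None


def in_list_ci(value: str, allowed: list) -> bool:
    """True if value equals any entry in allowed, ignoring case."""
    return value.casefold() in {entry.casefold() for entry in allowed}


print(matches_ci(r"@example\.com$", "Alice@Example.COM"))
print(in_list_ci("ALICE@example.com", ["alice@example.com", "bob@example.com"]))
```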

Available now in the latest JavaScript, React, Node, and Python SDKs. More SDKs coming soon.

Rust and Roku SDKs (All editions)

We're excited to announce two new official SDKs: Rust and Roku.

Rust is the language of choice for modern performance-critical applications. Special shout out to the community, who authored the initial version of this SDK.

GrowthBook, now on your TV? That’s right: the next time you watch your favorite show, GrowthBook might be working behind the scenes. Our new official SDK targets Roku, a leading smart TV platform that powers millions of streaming devices and TVs worldwide.

With these additions, GrowthBook now offers 24+ SDKs spanning client-side, server-side, mobile, and edge.

Quality-of-Life Improvements

Big thanks to all of our users who reported bugs, shared feedback, and contributed ideas to this release on GitHub or Slack.

Many small improvements add up to a big boost in usability:

  • Improved query performance for fact metrics
  • Cleaner experiment results UI with fewer distractions
  • OR targeting conditions
  • Updated SDK support
  • New API endpoints to manage experiment dashboards and custom fields
  • New Project Admin role to make it easier to manage a large distributed team
  • New Custom Hook option to only validate incremental changes
  • Kerberos auth support for Trino/Presto
  • Option to auto-update metric slice values
  • Support for additional AI models from Anthropic, Mistral, xAI, and Gemini

Plus dozens of smaller fixes and performance improvements.

How The Social Hub Cut Experimentation Costs by 82%
Experiments

Jan 24, 2026

Rudger de Groot of Mintminds shared how The Social Hub slashed its experimentation costs with GrowthBook. By driving down the incremental cost per experiment as close as possible to zero, companies can run as many experiments on as much traffic as they want.

The best experimentation programs scale cost-efficiently, so they can run more experiments, learn faster, and ship smarter. But a hidden cost killer is BigQuery query inefficiency. The more you test, the more you pay. What if there were a way to test more and pay less?

In this case study, we’ll show you how Mintminds cut experimentation costs for The Social Hub using GrowthBook with BigQuery optimizations from GA4Dataform by Superform Labs. The setup slashed BigQuery costs by 81.8% while improving data refresh speeds and monitoring capabilities. Here's how they did it.

A Scaling Advantage Built into the Cost Structure

The mission at Mintminds is simple: build high-quality experiments with reliable data and analysis. GrowthBook’s pricing model allows for a setup where the more you test, the lower your per-experiment cost. But to optimize costs, you need to understand where money actually flows. Let’s break down the pricing:

Fixed Costs (pricing, as of Nov 2025)

  • $40/month per seat for GrowthBook Pro license
  • Typical team size: 5 seats = $200/month

Variable Costs (GrowthBook Cloud):

  • 2 million CDN requests included (≈ pageviews)
  • 20 GB CDN bandwidth included
  • Overage: $10 per million requests, $1 per GB bandwidth

Self-Hosting Alternative: You can eliminate CDN costs by self-hosting GrowthBook for $11-50/month (depending on your infrastructure choice).

How Experimentation Costs Compare

To understand how GrowthBook experimentation costs compare, Mintminds shares a real-world example from a client with 2.6 million unique users per month running 5-7 experiments a month. In this example, they run the GrowthBook JS SDK on Cloudflare Pages, which means there is no limit on the number of tested visitors, at no extra cost. Yes, you read that right: for free!

The variable GrowthBook costs are:

  • 6.6 million CDN requests: 6.6 - 2 = 4.6 million billable requests (first 2 million are free), at $10 per million = $46
  • 6 GB CDN bandwidth usage: $0 (first 20 GB is free)
  • BigQuery usage cost estimation with daily updates: $300

Fixed GrowthBook Pro costs for a team of 5 members: 5 * $40 = $200
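
The arithmetic above can be reproduced with a short Python sketch. The function name and parameter names are assumptions. The prices and included allowances are the ones listed in this post (as of Nov 2025), and the $300 BigQuery figure is the estimate given above.

```python
def monthly_cost(seats, cdn_requests_m, cdn_bandwidth_gb, bigquery_usd,
                 seat_price=40.0, included_requests_m=2.0,
                 included_bandwidth_gb=20.0, request_overage_per_m=10.0,
                 bandwidth_overage_per_gb=1.0):
    """Estimated monthly cost in USD: seats + CDN overages + BigQuery usage."""
    seats_cost = seats * seat_price
    request_overage = (max(0.0, cdn_requests_m - included_requests_m)
                       * request_overage_per_m)
    bandwidth_overage = (max(0.0, cdn_bandwidth_gb - included_bandwidth_gb)
                         * bandwidth_overage_per_gb)
    return round(seats_cost + request_overage + bandwidth_overage + bigquery_usd, 2)


unoptimized = monthly_cost(seats=5, cdn_requests_m=6.6,
                           cdn_bandwidth_gb=6, bigquery_usd=300)
print(unoptimized, unoptimized * 12)
```

With these inputs the total comes to $546 per month ($200 in seats, $46 in CDN overage, $300 in BigQuery), which is the GrowthBook (Unoptimized) figure in the comparison.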

Platform | Monthly Cost | Annual Cost | vs. GrowthBook Optimized
Convert.com Pro | $3,488 | $41,856 | 1,050% more expensive
VWO Pro | $4,308 | $51,696 | 1,320% more expensive
GrowthBook (Unoptimized) | $546 | $6,552 | 80% more expensive
GrowthBook (Optimized) | $303 | $3,640 | Baseline

With BigQuery costs included, GrowthBook remains dramatically cheaper than traditional alternatives like Convert ($3,500/month) or VWO ($4,300/month) at comparable traffic levels. GrowthBook is already the smart financial choice. With optimization, it becomes unbeatable. Based on the monthly figures above, the optimized GrowthBook setup costs about 91% less than Convert.com Pro and 93% less than VWO Pro.

An 82% BigQuery reduction transforms GrowthBook from “very affordable” to an offer you simply can’t refuse.

GA4 Structure Wastes BigQuery Resources

Regardless of hosting choice, BigQuery becomes your primary variable cost when using GA4 as your data source. For companies running active experimentation programs with daily updates, Mintminds finds that unoptimized BigQuery costs can easily reach $200 to $400/month.

The default GrowthBook BigQuery integration queries GA4’s standard events_* and events_intraday_* tables. These tables store event parameters in nested structures, forcing BigQuery to process far more data than necessary.

For example, when you’re running experiments with:

  • 5 metrics (1 goal + 1 secondary + 3 guardrails)
  • 3 dimensions for segmentation
  • Daily (or more frequent) data refreshes

BigQuery has to scan through nested arrays and repeated fields to extract the specific event parameters you need. You’re paying to process gigabytes of data when you only need megabytes of relevant information.
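
To see why the nested structure is costly to work with, here is a minimal Python sketch of what "flattening" means for a single GA4 event. In the GA4 BigQuery export, `event_params` is a repeated record of `key` and `value`, where `value` holds one of `string_value`, `int_value`, `float_value`, or `double_value`. The function name and the parameter names `experiment_id` and `variation_id` are assumptions for illustration. This is not GA4Dataform's code or output schema.

```python
def flatten_event(event: dict, wanted_params: set) -> dict:
    """Turn one nested GA4-style event into a flat row with only the wanted params."""
    flat = {
        "event_date": event["event_date"],
        "event_name": event["event_name"],
        "user_pseudo_id": event["user_pseudo_id"],
    }
    for param in event.get("event_params", []):
        if param["key"] not in wanted_params:
            continue  # skip parameters the experiment queries never read
        value = param["value"]
        candidates = (value.get("string_value"), value.get("int_value"),
                      value.get("float_value"), value.get("double_value"))
        # Exactly one typed field is normally populated; take the first non-null.
        flat[param["key"]] = next((v for v in candidates if v is not None), None)
    return flat


nested = {
    "event_date": "20260101",
    "event_name": "experiment_viewed",
    "user_pseudo_id": "123.456",
    "event_params": [
        {"key": "experiment_id", "value": {"string_value": "checkout-test"}},
        {"key": "variation_id", "value": {"int_value": 1}},
        {"key": "page_location", "value": {"string_value": "https://example.com/"}},
    ],
}
print(flatten_event(nested, {"experiment_id", "variation_id"}))
```

Doing this once, up front, means later queries read plain columns instead of unnesting the parameter array on every refresh.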

GrowthBook allows custom fact tables and metrics to select only relevant events and parameters. This helps, but optimizations plateau quickly because you’re still querying nested GA4 tables.

Enterprise customers get access to:

  • Advanced fact table query optimization
  • Data pipelines (significantly improved in GrowthBook 4.2)

But Pro license users need a different approach.

How to Use GA4Dataform's Flattened Datasets to Reduce Query Costs

At #CH2024 (the conference formerly known as Conversion Hotel), Rudger connected with Jules Stuifbergen from Superform Labs about this exact challenge. Jules introduced him to GA4Dataform, which offered an elegant solution.

What GA4Dataform Does: The Core Version (free!) creates a customized, flattened dataset optimized for the type of queries that GrowthBook uses.

Feature | Benefit
Fully flattened structure | No nested fields = dramatically faster queries
Smart partitioning and clustering | Restricting queries by date and event name decreases the number of rows scanned
Smaller data footprint | Less data processed = lower BigQuery costs
Daily automated updates | Fresh data from the GA4 events table is appended to the table, using incremental logic

Key insight: Even though you’re creating a new dataset in BigQuery (which feeds from the generic GA4 table), the flattened structure makes it cheaper to generate AND cheaper to query than repeatedly querying GA4’s nested tables.

Bonus benefit: This same optimized dataset can be used for all your other BigQuery reports and dashboards, compounding the savings.

A Rigorous A/A Experiment to Test the Setup

Mintminds partnered with Laura Semeraro and the team at The Social Hub—a hybrid hospitality brand offering hotel rooms, co-living spaces, coworking facilities, and creative playgrounds across Europe—to validate this approach with real data.

"Using GA4Dataform's flattened datasets didn't just reduce GrowthBook costs—it optimized all our BigQuery reports and dashboards."
Laura Semeraro, Digital Analyst at The Social Hub    

Implementation Steps

1. GA4Dataform Setup – Laura installed GA4Dataform Core (free version). The custom event parameters from GrowthBook were added to the configuration (experiment ID and variation ID). With the daily schedule enabled, GA4Dataform automatically updates the flat events table incrementally.

2. GrowthBook Configuration – Mintminds created a new assignment query (for counting experiment visitors) and built fact tables for the key conversion events: add-to-cart and purchase.

3. A/A Test Design – They ran two identical experiments simultaneously:

Configuration:

  • Same targeting rules
  • Same 5 metrics (1 goal, 1 secondary, 3 guardrails)
  • Same 3 dimensions

The Only Difference:

Experiment A: Default GrowthBook queries (nested GA4 tables)
Experiment B: Optimized queries (flattened GA4Dataform dataset)

4. Measurement – GrowthBook usage is automatically labelled in BigQuery, allowing us to track:

  • BigQuery costs from Experiment A (old approach)
  • BigQuery costs from Experiment B (new approach)
  • BigQuery costs for daily dataset updates

Test duration: 1 week

This gave us an objective, apples-to-apples comparison.
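The comparison behind those three tracked costs comes down to simple arithmetic. Here is a minimal sketch of it, assuming the daily dataset updates are charged to the optimized side; the function name and the dollar figures are hypothetical and are not the measured values from this test.

```python
def net_cost_reduction(baseline_cost: float,
                       optimized_cost: float,
                       pipeline_cost: float = 0.0) -> float:
    """Fractional saving of the optimized setup versus the baseline.

    pipeline_cost (the daily flattened-dataset updates) is added to the
    optimized side so the comparison stays apples-to-apples.
    """
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1.0 - (optimized_cost + pipeline_cost) / baseline_cost


# Hypothetical weekly figures in USD, for illustration only.
saving = net_cost_reduction(baseline_cost=50.0, optimized_cost=6.0, pipeline_cost=4.0)
print(f"Net reduction: {saving:.1%}")  # prints "Net reduction: 80.0%"
```

Counting the pipeline cost matters: a flattened dataset that is cheap to query but expensive to build would otherwise look better than it is.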

The Social Hub Reduced BigQuery Costs by 82%

When the results came in, Rudger and his team had to verify the numbers multiple times to ensure accuracy: a whopping 81.8% cost reduction and a massive query speed improvement, too.

By using the GA4Dataform flattened dataset instead of the default GA4 nested tables, they had reduced BigQuery data processing by more than four-fifths.

Benefit | Impact
Update experiment results more frequently | Better SRM and MDE monitoring without budget concerns
Run updates faster | Flattened queries execute in a fraction of the time
Scale experiment volume | The "more you test, less you pay" promise becomes reality
Optimize other analytics | Use the same flattened dataset for all BigQuery dashboards

The compounding effect: Lower per-experiment costs + faster refresh rates = exponentially better experimentation program ROI.

Enterprise Experimentation at a Fraction of the Cost

This case study demonstrates how to achieve exceptional BigQuery efficiency with GrowthBook. By combining GrowthBook Pro, GA4Dataform Core, and strategic BigQuery optimization, you can build a cost-effective, high-performance experimentation stack that rivals Enterprise setups at a fraction of the price. The cost reduction Mintminds achieved with The Social Hub isn’t an outlier. It’s the new baseline for GrowthBook implementations.

About Our Partners

Mintminds is a Certified GrowthBook partner based in the Netherlands. Founded by Rudger de Groot, the team assists companies worldwide with hyper-scaling experimentation using GrowthBook.  

The Social Hub is a European hospitality brand that blends traditional hotel stays with a vibrant, community-focused experience. Its unique hybrid model combines premium design-led short and long-stay hotel rooms with student accommodation, coworking spaces, meeting and event facilities, restaurants and bars, 24-hour gyms, and open-to-the-public spaces like rooftops, parks, and cultural venues.

AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI
Experiments

Jan 20, 2026

Most teams building with GenAI are flying blind. They've replaced unit tests with vibes and shipped prompts that "felt right" to three engineers on a Friday afternoon.

This isn't a criticism—it's a diagnosis. For decades, we operated under a deterministic paradigm. The contract between developer and machine was explicit: Input A + Code = Output B. Always, without fail. In this world, success was binary. A unit test passed or it failed.

Generative AI has shattered this contract. We have moved from deterministic engineering to probabilistic engineering. We are no longer building binaries; we are managing stochastic agents that produce a distribution of probable outputs. You cannot assert(x == y) when x and y can change every time.

Gian Segato (Anthropic) eloquently sums up this shift: “We are no longer guaranteed what x is going to be, and we're no longer certain about the output y either, because it's now drawn from a distribution…. Stop for a moment to realize what this means. When building on top of this technology, our products can now succeed in ways we’ve never even imagined, and fail in ways we never intended” (Building AI Products In The Probabilistic Era).

As seismic as this shift may be, we’re focusing on a single aspect of it here: the shift from the domain of verification (is it correct?) to the domain of validation (is it good?).

This shift has left teams scrambling to define quality. Many have fallen into the trap of thinking AI Evaluations (Evals) are a replacement for A/B testing. They aren't.

And, for those in a hurry, here’s the point:

  • AI Evals check for competence: can the model do the job?
  • A/B testing checks for value: do users care?

You cannot ship a good AI product without both AI Evals and A/B testing.

The limits of vibe checking

In the early days of the LLM boom, “Prompt Engineering” was largely a feeling-based art. Devs would tweak a prompt, run it three times, read the output, and decide if it “felt” better.

This manual inspection, vibe checking, leverages human intuition, which is great for nuance but terrible for scale.

Vibe checking suffers from three critical flaws:

  1. Sample size: You might test 5 inputs. Production brings 50k edge cases.
  2. Regression invisibility: Making a prompt “polite” might accidentally break its ability to output valid JSON. You won’t feel that until the API breaks.
  3. Subjectivity: One engineer’s “concise” is another’s “curt.”

As ML systems researcher Shreya Shankar notes, “You can’t vibe check your way to understanding what’s going on.” Manual inspection is mathematically insufficient for understanding probabilistic systems at scale.
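The "regression invisibility" flaw is the easiest of the three to automate away. As a minimal sketch (the function name and sample outputs are hypothetical), you can measure how often a prompt still produces parseable JSON instead of reading a few replies by eye:

```python
import json


def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as valid JSON."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass  # prose wrapped around the payload counts as a failure
    return valid / len(outputs)


# Hypothetical replies from a "more polite" prompt: two wrap the JSON in prose.
sample_outputs = [
    '{"status": "ok"}',
    'Sure! Here you go: {"status": "ok"}',
    '{"status": "ok"}',
    'Happy to help! {"status": "ok"}',
]
rate = json_validity_rate(sample_outputs)
print(rate)  # prints 0.5
```

A check like this runs over hundreds of inputs in seconds, which is exactly the scale a manual vibe check cannot reach.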

To solve this, the industry turned to AI Evals.

💡 For an excellent intro to AI Evals, check out Shreya Shankar and Hamel Husain on Lenny’s Podcast.

What are AI evals?

AI evaluations are an attempt to systematize the vibe check — turning qualitative judgment into quantitative metrics. They're a way to programmatically test the probabilistic parts of your application: prompts, models, and parameters.

But the term "Eval" is overloaded. When someone says "we're running evals," they might mean any of three things.

3 types of AI evals and why they matter

Model evals

Model evals are benchmarks like MMLU or HumanEval. They're useful for choosing a provider (GPT-5 vs. Claude Opus 4.5), but they tell you almost nothing about your specific application. A model might ace GSM8K (math reasoning) and still be a terrible customer service agent. Worse, these public benchmarks are increasingly contaminated—models have seen the test questions during training, inflating scores that don't transfer to novel problems. (We wrote a whole article about why “The Benchmarks Are Lying To You.”)

System evals

System evals are what matter most. These test your end-to-end pipeline: prompt + RAG retrieval + model. The key metrics here are things like hallucination rate, faithfulness (does the answer stick to the retrieved context?), and relevance.

Many teams now use LLM-as-Judge — a strong model grading outputs on subjective criteria like tone, helpfulness, and coherence. It scales better than human review, but inherits the same limitation: it measures whether an answer seems good, not whether users act on it.

Guardrails

Guardrails are real-time safety checks—toxicity filters, PII detection, jailbreak prevention. Important, but a different concern than quality.

All three share a critical constraint: they measure competence, not value. Whether you run evals offline in your CI/CD pipeline against a curated "Golden Dataset," or online against live traffic in shadow mode, you're still asking the same question: Can this model do the job?

Some evals do capture preference — human ratings, side-by-side comparisons, thumbs up/down. But these are still proxies. A user clicking "thumbs up" in a sandbox isn't the same as a user returning to your product tomorrow. Evals measure stated preference; A/B tests measure revealed preference through behavior.

What evals can't tell you is whether users will care enough to stick around.

Where evals fall short

Even within the realm of evals, a model that looks good in controlled conditions can fall apart in production.

The DoorDash engineering team documented this problem in detail. They built a new ad-ranking model that performed well in testing—but when deployed to real users, its accuracy dropped by 4.3%. The culprit? Their test data was too clean. The model had been trained under the assumption that it would always have fresh, up-to-date information about users. But in the real world, that data was often hours or days old due to system delays. The model had been optimized for conditions that didn't exist in production.

This principle applies even more to LLM applications. LLMs are sensitive to prompt phrasing, context length, and retrieval quality—all of which behave differently in production than in curated test sets.

Consider a concrete example: you optimize a customer service prompt for faithfulness—it sticks strictly to your knowledge base and never hallucinates. Evals look great. But in production, users find the responses robotic and impersonal. Satisfaction drops. You optimized for accuracy; they wanted empathy.

This is the core limitation of evals: they measure capability, not value. Even when you run evals against live traffic, you're testing whether the model can do something—not whether that something matters to users.

Why you should use A/B testing with your AI evals

If evals are the unit test, A/B testing is the integration test with reality. It’s the only way to measure what actually matters: downstream business impact like retention, revenue, conversion, engagement, and user satisfaction.

But running A/B tests on LLMs introduces challenges that didn't exist in traditional web experimentation. (For an introduction to the topic, see our practical guide to A/B testing AI.)

Challenges of running A/B tests on AI

The Latency Confound

Intelligence usually costs speed. If you test a fast, simple model against a smart, slow one and the variant loses — why? Was the answer worse or did users just hate waiting three seconds?

Isolating "intelligence" as a variable often requires artificial latency injection: intentionally slowing the control to match the variant. Only then can you measure what you think you're measuring.
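Here is a rough sketch of what latency injection can look like, assuming a single fixed target latency; in practice the variant's latency is a distribution, so a real setup might match percentiles instead. The wrapper name and its parameters are illustrative, not part of any SDK.

```python
import time
from typing import Callable


def with_latency_floor(
    handler: Callable[[str], str],
    target_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[str], str]:
    """Wrap a handler so every call takes at least target_seconds."""

    def wrapped(prompt: str) -> str:
        start = clock()
        response = handler(prompt)
        remaining = target_seconds - (clock() - start)
        if remaining > 0:
            sleep(remaining)  # pad the fast model up to the slow model's latency
        return response

    return wrapped


# Hypothetical usage: slow the control down to a 3-second floor.
# control = with_latency_floor(fast_model, target_seconds=3.0)
```

The `sleep` and `clock` arguments are injectable only so the wrapper can be tested without actually waiting.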

High variance

LLMs are non-deterministic. Two users in the same variant might see meaningfully different responses. This noise demands larger sample sizes and longer test durations to reach statistical significance.

A button-color test might reach significance in a few thousand sessions. An LLM prompt test — where output variance is high and effect sizes are often small — might need 10x that, or weeks of runtime, to detect a meaningful difference.
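To see why small effects are so expensive, here is a back-of-the-envelope sample size calculation. It is a sketch that assumes a binary conversion metric and the standard two-sided, two-proportion z-test approximation; the function name and the example rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control: float, p_variant: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect p_variant vs. p_control."""
    if p_control == p_variant:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates: a 2-point lift versus a 1-point lift on a 10% baseline.
big_effect = sample_size_per_arm(0.10, 0.12)
small_effect = sample_size_per_arm(0.10, 0.11)
print(big_effect, small_effect)
```

Because the effect size is squared in the denominator, halving the lift you want to detect roughly quadruples the traffic you need. Noisier metrics inflate the variance term in the same way.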

Choosing the right metric

Choosing the right metric is harder for AI features than for traditional UI changes. A chatbot might increase engagement (users ask more questions) while decreasing efficiency (they take longer to get answers). Align your success metric with actual business value, not just surface activity.

These realities create a tension. A/B testing AI gives you certainty, but certainty takes time. If you have twenty prompts to evaluate, a traditional A/B test could take months. And during those months, a significant portion of your users are experiencing inferior variants.

Enter Multi-Armed Bandits

For prompt optimization, where iterations are cheap and the cost of a suboptimal variant is low, multi-armed bandits offer a different trade-off. Instead of fixed traffic allocation, they dynamically shift users toward winning variants as data accumulates. You sacrifice some statistical rigor for speed and reduced regret.
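As an illustration of that dynamic shift, here is a small simulation of Thompson sampling, one common bandit strategy. It is a sketch under simplifying assumptions (binary rewards, uniform Beta(1, 1) priors, made-up conversion rates) and is not a description of any particular platform's implementation.

```python
import random


def thompson_pick(successes: list[int], failures: list[int],
                  rng: random.Random) -> int:
    """Pick the arm whose sampled Beta(1 + s, 1 + f) rate is highest."""
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(true_rates: list[float], rounds: int, seed: int = 0) -> list[int]:
    """Return how many times each arm was served over the run."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = thompson_pick(successes, failures, rng)
        if rng.random() < true_rates[arm]:  # simulated user converts
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical prompts converting at 5% and 30%.
pulls = simulate(true_rates=[0.05, 0.30], rounds=2000)
print(pulls)
```

Most of the traffic ends up on the stronger arm, which is the "reduced regret" half of the trade-off. The weaker arm gets few samples, which is why its estimate stays imprecise.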

🎰 Check out our deep-dive on how they work in GrowthBook.

Comparing A/B testing to multi-armed bandits

Feature A/B Testing Multi-Armed Bandits
Primary goal Knowledge. Determine with statistical certainty if B is better than A Reward. Maximize total conversions during the experiment
Traffic allocation Fixed for the duration Dynamic. Automatically shifts traffic to the winner
Best use case Major model launches, pricing, UI changes Prompt optimization, headline testing

Bandits aren't a replacement for A/B testing. They're a complement — best suited for rapid iteration loops where you're optimizing within a validated direction, not making major strategic bets.

How to use AI evals and A/B testing together

[Infographic: "The LLMOps Pipeline," showing four stages that filter risk in order: Offline Evals, Shadow Mode, Safe Rollout, and A/B Test. The stages progress from low-cost, fast technical checks ("Competence") to higher-cost, more accurate business measurements ("Value"). Early stages catch technical errors like hallucinations; only the final A/B test stage proves business value such as retention and revenue.]

At GrowthBook, we see the highest-performing teams treating evals and experimentation not as separate islands, but as a continuous pipeline—each stage filtering out risk with progressively more expensive (but more accurate) methods.

Using AI evals and A/B testing together in practice

Stage 1: The Offline Filter (CI/CD)

A developer creates a new prompt branch. The CI/CD pipeline automatically runs evals against the Golden Dataset. If faithfulness drops below 90% or latency exceeds the threshold, the build fails. Bad ideas die here, costing pennies in API credits rather than user trust.
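A gate like this can be a few lines of code. The sketch below assumes each Golden Dataset example has already been scored; the `EvalResult` shape, the nearest-rank p95, and the 2-second latency threshold are assumptions for illustration. Only the 90% faithfulness floor comes from the text above.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalResult:
    faithful: bool          # did the answer stick to the retrieved context?
    latency_seconds: float  # end-to-end response time for this example


def gate_failures(results: list[EvalResult],
                  min_faithfulness: float = 0.90,
                  max_p95_latency: float = 2.0) -> list[str]:
    """Return the reasons the build should fail; an empty list means pass."""
    if not results:
        return ["no eval results"]
    failures = []
    faithfulness = sum(r.faithful for r in results) / len(results)
    if faithfulness < min_faithfulness:
        failures.append(f"faithfulness {faithfulness:.2f} < {min_faithfulness:.2f}")
    latencies = sorted(r.latency_seconds for r in results)
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]  # nearest-rank percentile
    if p95 > max_p95_latency:
        failures.append(f"p95 latency {p95:.2f}s > {max_p95_latency:.2f}s")
    return failures
```

In a CI job, you would exit with a non-zero status whenever the returned list is non-empty, so the prompt branch cannot merge.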

Stage 2: Shadow Mode (Production, Silent)

The prompt passes offline evals and gets deployed—but users never see it. The new model processes live traffic silently, logging predictions without surfacing them.

This is an online evaluation: you're still measuring competence (latency, accuracy, edge case handling), but now against real-world conditions. DoorDash's 4.3% accuracy gap between testing and production is exactly the kind of discrepancy shadow mode is designed to surface, before users experience the degraded results.
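In code, shadow mode is mostly about discipline: the user only ever receives the baseline answer, and nothing the candidate does can break the request. A minimal sketch, with hypothetical function and field names:

```python
from typing import Callable


def serve_with_shadow(
    prompt: str,
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    log: Callable[[dict], None],
) -> str:
    """Answer with the baseline; run the candidate silently and log both."""
    response = baseline(prompt)
    try:
        shadow = candidate(prompt)
        log({"prompt": prompt, "baseline": response, "shadow": shadow, "error": None})
    except Exception as exc:  # a shadow failure must never reach the user
        log({"prompt": prompt, "baseline": response, "shadow": None, "error": repr(exc)})
    return response
```

This version calls the candidate inline to keep the sketch short. A production setup would more likely run it asynchronously, so the shadow model adds no latency to the user's request.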

Stage 3: Safe Rollout

Shadow mode passes. Feature flags gradually release the new model to users. You're monitoring guardrail metrics: error rates, refusal spikes, support tickets. If something tanks, you flip the flag and revert instantly—no code rollback required.
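Under the hood, a gradual release usually relies on deterministic bucketing: each user hashes to a stable bucket, and the flag is on for buckets below the current percentage. The sketch below shows the general idea with a plain SHA-256 hash; it is an assumption-laden illustration, not the GrowthBook SDK's actual hashing scheme.

```python
import hashlib


def in_rollout(user_id: str, flag_key: str, percent: float) -> bool:
    """Deterministically place a user inside or outside a percentage rollout."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, 0.01% steps
    return bucket < percent * 100


# Hypothetical flag key and user ID.
print(in_rollout("user-42", "new-model", 10))
```

Because a user's bucket never changes, raising the percentage from 10 to 50 only adds users, and dropping it to 0 reverts everyone at once without a code rollback.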

🦺 Use GrowthBook's Safe Rollouts to monitor guardrail metrics and rollback automatically.

Stage 4: The A/B Test (Causal Proof)

The rollout survives. Now you run the real experiment: new model vs. baseline, measured on business metrics. Not "faithfulness" but retention. Not "relevance" but conversion. This is the only stage that proves value.
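For a binary business metric like conversion, the final comparison can be sketched as a pooled two-proportion z-test. This is one textbook frequentist approach, shown with hypothetical counts; an experimentation platform's statistics engine may use different methods.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (absolute lift of B over A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value


# Hypothetical counts: baseline converts 1,000 of 10,000; new model 1,200 of 10,000.
lift, p_value = two_proportion_z_test(1000, 10_000, 1200, 10_000)
print(f"lift={lift:.3f}, p={p_value:.6f}")
```

Note what goes into the function: conversions and users, not faithfulness or relevance scores.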

Conclusion: AI evals plus A/B testing for GenAI

You cannot A/B test a broken model. It’s reckless. And you cannot Eval your way to product-market fit. It’s guesswork.

To ship generative AI that's both safe and profitable, you need both: rigorous evals to ensure competence, and robust A/B testing to prove value. The pipeline between them—shadow mode, safe rollouts—is how you get from one to the other without breaking things.

As Segato warned, our products can now fail in ways we never intended. This pipeline is how we catch those failures before users do.

We've moved from is it correct? to is it good? Evals answer the first question. A/B tests answer the second. You need both.

Frequently Asked Questions

Can AI Evals replace A/B testing?
No. AI Evals and A/B testing serve different purposes in the development lifecycle. Evals measure competence—accuracy, safety, tone—whether run offline or online. A/B testing measures business value through revealed user behavior: retention, revenue, conversion. Evals tell you the model works; A/B tests tell you it's worth shipping.

What is the difference between Offline and Online Evaluation?
Offline evaluation happens pre-deployment using a static Golden Dataset to check for regressions and quality. Online evaluation happens in production using live traffic (e.g., shadow mode). Both measure competence, but online evaluation catches issues—like feature staleness or latency spikes—that don't appear in controlled conditions.

How do you handle latency when A/B testing LLMs?
Latency is a major confounding variable because "smarter" models are often slower. If a slower model performs worse, it's unclear if users disliked the answer or the wait time. To fix this, engineers use Artificial Latency Injection—intentionally slowing down the control group to match the variant's response time, isolating "intelligence" as the single variable.

What is "Vibe Checking" in AI development?
"Vibe checking" is the informal process of manually inspecting a few model outputs to see if they "feel" right. While useful for early exploration, it is unscalable and statistically flawed for production systems because it fails to account for edge cases, regressions, or large-scale user preferences.

When should I use a Multi-Armed Bandit instead of an A/B test?
Use a Multi-Armed Bandit when your goal is optimization (maximizing reward) rather than knowledge (statistical significance). MABs are ideal for testing prompt variations or content recommendations because they automatically route traffic to the winning variation, minimizing regret. Use A/B tests for major architectural changes or risky launches where you need certainty.

What is the best way to deploy AI models safely?
Use a staged pipeline. Start with offline evals in CI/CD to catch regressions. Then use shadow mode to test against live traffic silently. Next, use feature flags to release to a small percentage of users while monitoring guardrails. Finally, run a full A/B test to measure business impact. Each stage filters out risk before exposing users to problems.

What is LLM-as-Judge?
LLM-as-Judge is an evaluation technique where a strong model (like GPT-4 or Claude) grades the outputs of your system on subjective criteria such as tone, helpfulness, and coherence. It scales better than human review but shares the same limitation as other evals: it measures whether an answer seems good, not whether users will act on it.

What is the difference between stated and revealed preference in AI evaluation?
Stated preference is what users say they like—thumbs up ratings, side-by-side comparisons in a sandbox. Revealed preference is what users actually do—returning to your product, completing tasks, converting. Evals capture stated preference; A/B tests capture revealed preference. The two often diverge.
