A/B testing and feature flags: how they work together

Feature flags decide who sees a change. A/B tests decide whether the change worked. The best product teams connect the two instead of treating them as separate rituals.
Without feature flags, A/B testing can become slow and brittle. Every experiment requires a release, rollback is awkward, and engineering has to coordinate product learning with deployment timing.
Without A/B testing, feature flags can become release switches with no measurement discipline. Teams roll out gradually, watch dashboards, and call the result good without knowing whether the change caused the metric movement.
Together, feature flags and A/B testing create a tighter loop: deploy safely, expose users intentionally, measure impact, then roll out or roll back based on evidence.
Feature flags and A/B tests are different tools
The easiest way to separate them is by job: a feature flag controls who sees a change, and an A/B test measures whether the change worked.
GrowthBook's feature flag docs describe flags as a way to control application behavior without deploying new code: target users, roll out gradually, or run A/B tests on client or server. That is the connection point. A flag controls the experience; the experiment evaluates the outcome.
Feature flags are release infrastructure
Feature flags decouple deploys from releases. Engineering can merge code behind a flag, test it internally, expose it to a beta group, expand to a small percentage of production traffic, and turn it off if something breaks.
That release-control workflow is valuable even when there is no experiment. Use flags for:
- Kill switches.
- Gradual rollouts.
- Internal testing.
- Beta programs.
- Permission-based access.
- Migration control.
- Operational fallbacks.
The feature flag's primary job is control.
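As a rough illustration of that control, the sketch below evaluates a release flag in a fixed order: kill switch, internal users, beta accounts, then a percentage rollout. The ReleaseFlag class, its field names, and the SHA-256 bucketing are assumptions made for this example. They are not any vendor's SDK or hashing scheme.

```python
import hashlib
from dataclasses import dataclass, field


def rollout_bucket(flag_key: str, user_id: str) -> float:
    """Map a user to a stable number in [0, 1) for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


@dataclass
class ReleaseFlag:
    key: str
    enabled: bool = True  # kill switch: False turns the feature off for everyone
    internal_users: set = field(default_factory=set)
    beta_accounts: set = field(default_factory=set)
    rollout_percent: float = 0.0  # share of remaining users, 0 to 100

    def is_on(self, user_id: str, account_id: str) -> bool:
        if not self.enabled:
            return False  # the kill switch overrides every other rule
        if user_id in self.internal_users:
            return True
        if account_id in self.beta_accounts:
            return True
        # Everyone else is admitted by a stable hash, so ramping 10 -> 25 -> 50
        # only adds users and never flips someone from on to off.
        return rollout_bucket(self.key, user_id) < self.rollout_percent / 100


flag = ReleaseFlag(
    key="guided_onboarding_checklist",
    internal_users={"employee-1"},
    beta_accounts={"acct-beta"},
    rollout_percent=10,
)
```

Setting flag.enabled = False is the "turn it off if something breaks" path: no redeploy, and no other rule can re-enable the feature.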
A/B testing is measurement infrastructure
A/B testing is an experiment design. Users are assigned to control or treatment, and the team measures whether the treatment changes a predefined metric.
GrowthBook's A/B testing fundamentals explain the core workflow: use statistics to determine whether the effect measured on a metric differs across variations. The result is not just who saw what. It is whether the change likely moved activation, retention, revenue, engagement, latency, or another metric.
The A/B test's primary job is learning.
A rollout is not automatically an experiment
A gradual rollout reduces risk. It does not automatically create causal evidence.
If you turn a feature on for 10% of users, then 25%, then 50%, you may see metrics move. But without clean random assignment, stable eligibility, exposure logging, and a defined comparison group, that movement may be caused by seasonality, user mix, marketing campaigns, or unrelated product changes.
This is a common source of confusion. Rollout control is useful, but it is not the same as experiment measurement.
How feature flags power A/B tests
Feature flags are a practical assignment mechanism for many product experiments.
Assignment happens at the flag
In a flag-based experiment, the flag returns different values for different users. A user might receive:
- false for control.
- true for treatment.
- A string variation such as old_nav, compact_nav, or guided_nav.
- A JSON configuration that changes product behavior.
The application uses that value to render or run the right experience. The experiment analysis then compares outcomes across assigned groups.
GrowthBook's feature flag experiments docs explain this pattern directly: a flag can include an experiment rule that randomly assigns users to variations and tracks assignment through the SDK callback.
Consistency matters
Users should see the same variation across sessions unless the experiment is designed otherwise. If a user sees treatment in one session and control in the next, the analysis becomes harder to trust and the user experience can feel broken.
This is why deterministic assignment and stable identifiers matter. A B2B SaaS product also needs to decide whether assignment happens at the user level, account level, workspace level, device level, or another unit.
The assignment unit should match the product experience. If one account shares a feature, account-level assignment may be cleaner than user-level assignment.
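One common way to get deterministic assignment is to hash a stable identifier together with the experiment key. The sketch below is a minimal version of that idea. The function names, the user dictionary shape, and the use of SHA-256 are assumptions for illustration, not the algorithm of any particular SDK.

```python
import hashlib


def assign_variation(unit_id: str, experiment_key: str, variations: list) -> str:
    """Return the same variation every time for the same unit and experiment."""
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # stable value in [0, 1)
    return variations[int(bucket * len(variations))]


def assign_for_user(user: dict, experiment_key: str, variations: list,
                    unit: str = "account") -> str:
    """Pick the identifier that matches the assignment unit, then assign."""
    unit_id = user["account_id"] if unit == "account" else user["user_id"]
    return assign_variation(unit_id, experiment_key, variations)


variations = ["control", "treatment"]
alice = {"user_id": "u-alice", "account_id": "acct-42"}
bob = {"user_id": "u-bob", "account_id": "acct-42"}

# With account-level assignment, teammates always share a variation.
same_account = (
    assign_for_user(alice, "guided_nav_test", variations)
    == assign_for_user(bob, "guided_nav_test", variations)
)
```

Including the experiment key in the hash input means a unit's bucket in one experiment does not determine its bucket in another.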
Exposure logging matters
Feature flag evaluation is not always the same as experiment exposure.
A backend service may evaluate a flag before the user ever sees the UI. A React component may check a flag but fail to render because the user navigates away. A mobile app may fetch configuration at startup even if the feature screen is never opened.
Exposure should be logged when the user can actually experience the change. If exposure is logged too early, the experiment includes users who had no chance to respond. That dilutes the effect and can bias results.
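A minimal sketch of that separation follows: the flag value is evaluated ahead of time, but exposure is recorded only when the screen that contains the change is actually opened. The ExposureLog class, the open_onboarding_screen function, and the event fields are hypothetical names for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ExposureLog:
    seen: set = field(default_factory=set)
    events: list = field(default_factory=list)

    def record(self, unit_id: str, experiment_key: str, variation: str) -> bool:
        """Record one exposure per unit and experiment; return True if it was new."""
        key = (unit_id, experiment_key)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.events.append(
            {"unit_id": unit_id, "experiment": experiment_key, "variation": variation}
        )
        return True


def open_onboarding_screen(log: ExposureLog, unit_id: str, variation: str,
                           screen_opened: bool):
    """Log exposure only at the moment the user can experience the change."""
    if not screen_opened:
        return None  # flag was evaluated earlier, but nothing was shown
    log.record(unit_id, "guided_onboarding_checklist", variation)
    return "guided checklist" if variation == "treatment" else "current onboarding"


log = ExposureLog()
open_onboarding_screen(log, "acct-1", "treatment", screen_opened=False)  # no event
open_onboarding_screen(log, "acct-2", "treatment", screen_opened=True)   # one event
```

Control users are logged at the same product moment as treatment users, so both groups enter the analysis under the same condition.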
Rollout after the result is easier
When the experiment is controlled by a flag, the post-test action is straightforward:
- If the treatment wins, ramp the winning variation to more users.
- If the treatment loses, turn it off.
- If the result is inconclusive, keep the flag limited or iterate.
- If guardrails degrade, stop or roll back without redeploying.
The flag makes the decision operational. The A/B test makes it evidence-based.
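The four outcomes above can be written down as a small decision function. This is a sketch under assumptions: the result labels ("win", "loss", "inconclusive") and the rule that a guardrail breach overrides the primary metric are choices made for the example, and a real team would set them in its own decision rule.

```python
def next_action(primary_result: str, guardrails_ok: bool) -> str:
    """Translate an experiment readout into a flag action."""
    if not guardrails_ok:
        # Assumed precedence: a degraded guardrail stops the test
        # even if the primary metric looks good.
        return "stop or roll back"
    actions = {
        "win": "ramp the winning variation",
        "loss": "turn the treatment off",
        "inconclusive": "keep the flag limited or iterate",
    }
    if primary_result not in actions:
        raise ValueError(f"unknown result: {primary_result!r}")
    return actions[primary_result]
```

Each returned action is a flag change rather than a deployment, which is what makes the decision cheap to carry out.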
A practical workflow
The combined workflow is simple, but each step has a purpose.
Step 1: write the hypothesis
Do not start with the flag. Start with the hypothesis.
Example:
"For new workspace admins, showing a guided setup checklist will increase seven-day activation because it makes the next meaningful task clearer."
That hypothesis tells the team what to build, who should be eligible, and what metric should decide the test.
Step 2: create the flag
Create a flag for the product behavior you want to control. The flag should have a clear owner, description, default value, and cleanup plan.
Good flag names describe behavior, not implementation trivia:
- guided_onboarding_checklist
- new_pricing_layout
- ai_answer_quality_prompt_v2
- account_level_invite_flow
Avoid vague names like experiment_test or new_flow.
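A team could enforce those conventions with a small check at flag creation time. The sketch below is illustrative only: FlagDefinition, validate_flag, and the list of vague names are assumptions for this example, not part of any flag platform's API.

```python
from dataclasses import dataclass
from datetime import date

# Names the article calls out as too vague to describe behavior.
VAGUE_NAMES = {"experiment_test", "new_flow"}


@dataclass(frozen=True)
class FlagDefinition:
    key: str
    owner: str
    description: str
    default_value: object
    cleanup_by: date


def validate_flag(flag: FlagDefinition) -> list:
    """Return a list of problems; an empty list means the flag is well described."""
    problems = []
    if flag.key in VAGUE_NAMES:
        problems.append("key does not describe behavior")
    if not flag.owner.strip():
        problems.append("missing owner")
    if not flag.description.strip():
        problems.append("missing description")
    return problems


good = FlagDefinition(
    key="guided_onboarding_checklist",
    owner="growth-team",
    description="Shows a guided setup checklist to new workspace admins.",
    default_value=False,
    cleanup_by=date(2030, 1, 31),
)
bad = FlagDefinition("new_flow", "", "", False, date(2030, 1, 31))
```

Making cleanup_by a required field means a flag cannot be created without at least a placeholder cleanup plan.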
Step 3: define variations
Variations should map to product experiences.
For a boolean feature:
- Control: current onboarding.
- Treatment: guided checklist.
For a config feature:
- Control: standard prompt.
- Treatment A: concise prompt.
- Treatment B: context-first prompt.
The more variations you add, the more traffic you need. Keep the design as small as the decision allows.
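To make the traffic cost concrete, the sketch below uses a textbook fixed-horizon approximation for comparing two conversion rates, with a Bonferroni adjustment when there are several treatments. Both the formula choice and the example rates are assumptions for illustration. An experimentation platform may use a different statistical method, so treat the output as a planning estimate only.

```python
import math
from statistics import NormalDist


def users_per_variation(baseline_rate: float, expected_rate: float,
                        treatments: int = 1, alpha: float = 0.05,
                        power: float = 0.8) -> int:
    """Approximate users needed in each variation for a two-sided test."""
    adjusted_alpha = alpha / treatments  # Bonferroni: one comparison per treatment
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def total_users(baseline_rate: float, expected_rate: float, treatments: int) -> int:
    """Total traffic across control plus every treatment."""
    per_group = users_per_variation(baseline_rate, expected_rate, treatments)
    return per_group * (treatments + 1)


# Hypothetical activation rates: 20% today, hoping for 22%.
one_treatment = total_users(0.20, 0.22, treatments=1)
two_treatments = total_users(0.20, 0.22, treatments=2)
```

Adding a second treatment raises the total in two ways: there is one more group to fill, and each group needs more users because the significance threshold is split across comparisons.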
Step 4: define metrics and guardrails
Pick one primary metric. Add guardrails for outcomes that must not degrade.
For the guided setup checklist:
- Primary metric: seven-day activation.
- Guardrails: paid conversion, support contacts, onboarding completion time, invite rate.
Metrics should be defined before launch. If your organization uses warehouse-defined metrics, the experiment should use those definitions rather than rebuilding weaker dashboard versions.
Step 5: launch gradually
A flag lets you separate technical release from experiment exposure.
A common path:
- Enable for internal users.
- Enable for QA or beta accounts.
- Start the experiment with eligible production users.
- Monitor guardrails and assignment quality.
- Keep the experiment running until the decision rule is met.
Do not stop early just because the result looks exciting unless the statistical method supports that behavior.
Step 6: decide and clean up
At the end, make a decision:
- Ship.
- Stop.
- Iterate.
- Retest with a narrower hypothesis.
Then clean up. If the winning variation becomes default, remove old code paths when it is safe. If the treatment loses, remove the flag and treatment code. If the flag must remain, document why it is now long-lived configuration rather than a temporary release flag.
Community discussions about feature flags often warn about stale flags for a reason. Hacker News threads on feature flag reality and short-lived flags repeatedly point to the same operational issue: teams prioritize new work and postpone cleanup. A flag strategy needs an owner and lifecycle, not just a dashboard.
When to use a flag without an A/B test
Not every flag should become an experiment.
Kill switches
A kill switch exists to turn behavior off quickly if something breaks. The goal is reliability, not causal learning.
Examples:
- Disable a third-party integration.
- Turn off an AI feature if error rates spike.
- Stop a slow query path.
- Revert a risky backend migration.
You may monitor metrics, but you do not need an A/B test to justify the switch.
Permissions and entitlements
Some flags control who has access to a feature based on plan, role, beta status, region, or contract terms. These are product configuration, not temporary experiments.
Treat them differently from experiment flags. They may be long-lived, require stricter governance, and need clearer documentation.
Internal testing and beta programs
Internal releases and beta programs are useful for quality, feedback, and readiness. They are not usually randomized controlled experiments.
Use them to find bugs and qualitative issues. Use A/B testing when you need measured causal impact.
When to turn a flag into an A/B test
A flag should become an A/B test when the team needs evidence about product impact.
Good candidates:
- Onboarding changes.
- Pricing-page changes.
- Recommendation logic.
- AI prompt or model changes.
- Activation nudges.
- Search ranking changes.
- New dashboard layouts.
- Checkout or upgrade flows.
The key is measurable uncertainty. If the team already knows the feature must ship for contractual or compliance reasons, an A/B test may not be necessary. If the question is whether users benefit, convert the rollout into an experiment.
Common anti-patterns
Feature flags and experiments are powerful together, but the combined workflow has traps.
Anti-pattern 1: using rollout percentage as evidence
"We rolled it out to 50% and nothing looked bad" is not an experiment result.
It may be a valid release safety signal. It is not proof that the feature improved activation, retention, or revenue. For that, you need a comparison group and a predefined metric.
Anti-pattern 2: logging exposure too early
If exposure fires when a flag is evaluated but before the user can experience the treatment, the experiment includes users who never saw the change.
This is especially common in backend or config-heavy implementations. Audit exposure timing before trusting the result.
Anti-pattern 3: leaving experiment flags forever
Every temporary flag should have a cleanup plan.
Long-lived flags are sometimes legitimate. But experiment flags should not quietly become permanent branches in the codebase. Cleanup is part of the experiment cost.
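One lightweight way to keep cleanup visible is a periodic check for temporary flags that have passed their cleanup date. The sketch below assumes a hypothetical flag inventory with "key", "kind", and "cleanup_by" fields. The field names and the "long_lived" label are inventions for this example.

```python
from datetime import date


def stale_flags(flags: list, today: date) -> list:
    """Return keys of temporary flags whose cleanup date has passed."""
    return sorted(
        f["key"]
        for f in flags
        # Documented long-lived flags are legitimate and skipped here.
        if f["kind"] != "long_lived" and f["cleanup_by"] < today
    )


inventory = [
    {"key": "guided_onboarding_checklist", "kind": "experiment",
     "cleanup_by": date(2024, 3, 1)},
    {"key": "new_pricing_layout", "kind": "experiment",
     "cleanup_by": date(2024, 9, 1)},
    {"key": "enterprise_sso_access", "kind": "long_lived", "cleanup_by": None},
]

overdue = stale_flags(inventory, today=date(2024, 6, 1))
```

A report like this only helps if each overdue key routes to the flag's owner, which is why the owner belongs in the flag definition from the start.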
Anti-pattern 4: testing without guardrails
A test can improve the primary metric and harm something else.
Examples:
- More signups, lower paid conversion.
- More clicks, lower retention.
- Faster activation, more support tickets.
- Higher engagement, worse latency.
Guardrails keep a narrow metric from creating a bad release.
Anti-pattern 5: confusing users with changing assignments
If users bounce between variants, they may notice inconsistent product behavior. Worse, the experiment result becomes less reliable.
Use stable identifiers and choose the assignment unit carefully.
Where GrowthBook fits
GrowthBook is built around the idea that feature flags and experimentation should work together.
The feature flags product page states that any feature flag in GrowthBook can become an A/B test, assigning users to variants and tracking metrics using data from the existing warehouse. The experimentation platform extends that workflow with multiple variants, advanced statistics, warehouse-native analysis, and guardrails.
That matters because the hardest part of connecting flags and experiments is usually not toggling code. It is keeping release control, assignment, exposure, metrics, and analysis aligned.
GrowthBook is a good fit when teams want:
- Feature flags for release control.
- A/B testing for product impact.
- Warehouse-native metrics.
- Transparent analysis.
- Cloud or self-hosted deployment.
- Product Analytics connected to experiments.
The important claim is practical: GrowthBook lets teams move from "who saw the change?" to "did the change work?" without switching operating models.
Architecture patterns that work
Feature flags and A/B tests can be wired several ways. The right pattern depends on where the decision happens and where the metric lives.
Client-side flags for UI experiments
Client-side flags are useful when the experience is rendered in the browser or mobile app: button copy, page layout, onboarding modals, navigation changes, or feature discovery prompts.
The benefit is speed. Frontend teams can often create experiments without changing backend services. The risk is that client-side flags may expose some flag metadata or variation names to the client, so they should not protect sensitive business logic.
Use client-side flags when:
- The treatment is visible UI behavior.
- The flag does not protect sensitive logic.
- The page or app can handle loading states safely.
- Exposure can be logged when the user sees the experience.
Server-side flags for product logic
Server-side flags are better when the treatment affects backend behavior: search ranking, recommendation logic, pricing calculations, eligibility rules, AI prompt selection, or API behavior.
The benefit is control. Sensitive logic stays server-side, and assignment can happen close to the service that owns the behavior. The risk is exposure logging. A server may evaluate a flag for a request that does not lead to a visible treatment, so teams need to place exposure tracking carefully.
Use server-side flags when:
- The feature changes backend behavior.
- The treatment should not be exposed to the client.
- The assignment unit is account, workspace, organization, or user.
- The team can log exposure at the right product moment.
Hybrid flags for full-stack changes
Many real product changes are full-stack. A new onboarding flow may include backend eligibility, frontend layout, email timing, and analytics events. In that case, the flag becomes a shared contract across services.
The team should define:
- Which service owns assignment.
- Which surfaces read the assigned variation.
- Where exposure is logged.
- Which identifiers must match across systems.
- What the safe default is if flag data is unavailable.
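The last item in that list, the safe default, is the easiest to sketch in code. In the example below, read_variation and the fetch callable are hypothetical stand-ins for whatever client a service uses to read flag data. The point is only that every surface falls back to the same known-safe value when the lookup fails or returns nothing.

```python
def read_variation(fetch, flag_key: str, safe_default: str) -> str:
    """Read the assigned variation, falling back to a safe default on any failure."""
    try:
        value = fetch(flag_key)
    except Exception:
        # Network failure, timeout, or a broken client: serve the safe experience.
        return safe_default
    return safe_default if value is None else value


def healthy_fetch(flag_key: str):
    return {"guided_onboarding_checklist": "treatment"}.get(flag_key)


def failing_fetch(flag_key: str):
    raise ConnectionError("flag service unreachable")


normal = read_variation(healthy_fetch, "guided_onboarding_checklist", "control")
fallback = read_variation(failing_fetch, "guided_onboarding_checklist", "control")
```

If the backend and frontend use different defaults, a flag outage can put one user into a mixed experience, so the default is worth defining once as part of the shared contract.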
Hybrid experiments need more coordination, but they are often where feature flags are most useful. They let teams ship the code safely before exposing the full product experience.
Worked example: rolling out an AI feature
Imagine a SaaS company adding an AI assistant to a reporting workflow. The team wants to know whether the assistant helps users complete analysis tasks faster without lowering trust.
Release-control plan
The engineering team ships the assistant behind a server-side feature flag. The default is off. Internal employees get access first, followed by a beta cohort of trusted accounts.
The flag supports:
- Internal targeting.
- Account-level rollout.
- A kill switch.
- A treatment variation for the new assistant.
- A control variation that preserves the current workflow.
This lets engineering test production behavior without exposing every user.
Experiment plan
Once QA and beta feedback look stable, the team converts the flag into an experiment.
The hypothesis:
"For analysts building reports, showing the AI assistant will increase successful report completion because users can translate questions into chart configuration faster."
The metrics:
- Primary metric: successful report completion within the session.
- Guardrails: assistant error rate, latency, report deletion, negative feedback, support contact rate.
- Secondary learning metric: time to first useful chart.
The assignment unit is account-level because multiple users in the same workspace may collaborate on reports. Mixing variations inside the same account would create confusion.
Decision and cleanup
If the assistant improves report completion without harming guardrails, the team ramps the treatment gradually. If latency or negative feedback gets worse, the team rolls back with the flag.
After the decision, engineering does not leave the experiment flag in place indefinitely. If the assistant ships to all eligible accounts, the old control path is removed after an engineering review. If the assistant remains plan-gated, the flag is converted from a temporary experiment flag into a documented permission or entitlement rule.
That cleanup step is the difference between an experimentation program and a codebase full of forgotten conditions.
Ownership across product, engineering, and data
Flags and experiments fail when ownership is implicit.
Engineering owns safe delivery
Engineering should own SDK integration, default values, runtime behavior, rollback paths, and cleanup. If a flag can break production, engineering needs to know how it behaves under network failure, stale config, bad targeting, or missing attributes.
Engineering should also review long-lived flags. Temporary release flags and experiment flags should not quietly become permanent product logic.
Product owns the decision
Product should own the hypothesis, target audience, minimum practical effect, and action after the result. If the test wins, what ships? If it loses, what happens to the idea? If it is inconclusive, will the team run longer, narrow the audience, or move on?
Without that decision ownership, experiments become dashboards with no consequence.
Data owns measurement trust
Data teams or analytically strong product teams should own metric definitions, exposure quality, sample ratio checks, guardrails, and readout interpretation.
This does not mean every experiment needs a custom analysis project. It means the metrics and statistical method should be understood before the result is used to justify a release.
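A sample ratio check is one of the few items on that list that fits in a few lines. The sketch below runs a chi-square goodness-of-fit test on two group sizes against the planned split. With two groups the test has one degree of freedom, so the p-value can be computed with math.erfc from the standard library. The function name and the 0.001 alert threshold are assumptions for this example.

```python
import math


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """P-value for the observed split under the planned split (two groups)."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def has_sample_ratio_mismatch(control_n: int, treatment_n: int,
                              threshold: float = 0.001) -> bool:
    """Flag the experiment for investigation when the split looks implausible."""
    return srm_p_value(control_n, treatment_n) < threshold


balanced = has_sample_ratio_mismatch(5000, 5050)   # small, plausible imbalance
skewed = has_sample_ratio_mismatch(5000, 5500)     # large imbalance for a 50/50 plan
```

A mismatch does not say what went wrong. It is a prompt to audit assignment and exposure logging before reading the metric results.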
Implementation checklist
Before launch:
- Write the hypothesis.
- Choose the assignment unit.
- Create the flag with owner, description, and cleanup date.
- Define control and treatment variations.
- Choose one primary metric.
- Choose guardrails.
- Confirm exposure logging timing.
- Test internal targeting.
- Test rollback.
- Decide the stopping rule.
During the experiment:
- Monitor assignment and sample ratio.
- Watch guardrails.
- Document incidents.
- Avoid changing the decision rule.
- Keep the rollout within the planned design.
After the result:
- Decide ship, stop, iterate, or retest.
- Ramp the winning variation deliberately.
- Remove losing code paths.
- Archive or convert the flag.
- Record what the team learned.
What to do next
The cleanest workflow is simple: use feature flags to control exposure, and use A/B testing to measure impact.
Do not make every flag an experiment. Do not treat every rollout as evidence. Use the right tool for the job, then connect them when the team has a measurable hypothesis.
Start with one real product change. Put it behind a flag, assign users cleanly, define the metric before launch, and decide what will happen when the result arrives. That is the loop worth building.

