A/B testing and feature flags: how they work together

Jun 15, 2026

Feature flags decide who sees a change. A/B tests decide whether the change worked. The best product teams connect the two instead of treating them as separate rituals.

Without feature flags, A/B testing can become slow and brittle. Every experiment requires a release, rollback is awkward, and engineering has to coordinate product learning with deployment timing.

Without A/B testing, feature flags can become release switches with no measurement discipline. Teams roll out gradually, watch dashboards, and call the result good without knowing whether the change caused the metric movement.

Together, feature flags and A/B testing create a tighter loop: deploy safely, expose users intentionally, measure impact, then roll out or roll back based on evidence.

Feature flags and A/B tests are different tools

The easiest way to separate them is by job.

Dimension	Feature flags	A/B tests
What they control	Exposure to product behavior	Measurement of product impact
Main users	Engineering and product teams	Product, data, and growth teams
Primary purpose	Release control, targeting, rollout, rollback	Causal learning and decision-making
Output	A user gets variation A, B, or off	A readout about metric impact
Failure mode	Flag sprawl, unsafe defaults, stale logic	False positives, underpowered tests, bad metrics
Best together when	A flag controls assignment and rollout	The experiment measures whether to keep rolling out

GrowthBook's feature flag docs describe flags as a way to control application behavior without deploying new code: target users, roll out gradually, or run A/B tests on client or server. That is the connection point. A flag controls the experience; the experiment evaluates the outcome.

Feature flags are release infrastructure

Feature flags decouple deploys from releases. Engineering can merge code behind a flag, test it internally, expose it to a beta group, expand to a small percentage of production traffic, and turn it off if something breaks.

That release-control workflow is valuable even when there is no experiment. Use flags for:

Kill switches.
Gradual rollouts.
Internal testing.
Beta programs.
Permission-based access.
Migration control.
Operational fallbacks.

The feature flag's primary job is control.

A/B testing is measurement infrastructure

A/B testing is an experiment design. Users are assigned to control or treatment, and the team measures whether the treatment changes a predefined metric.

GrowthBook's A/B testing fundamentals explain the core workflow: use statistics to determine whether the effect measured on a metric differs across variations. The result is not just who saw what. It is whether the change likely moved activation, retention, revenue, engagement, latency, or another metric.

The A/B test's primary job is learning.

A rollout is not automatically an experiment

A gradual rollout reduces risk. It does not automatically create causal evidence.

If you turn a feature on for 10% of users, then 25%, then 50%, you may see metrics move. But without clean random assignment, stable eligibility, exposure logging, and a defined comparison group, that movement may be caused by seasonality, user mix, marketing campaigns, or unrelated product changes.

This is the most common confusion. Rollout control is useful, but it is not the same as experiment measurement.

How feature flags power A/B tests

Feature flags are a practical assignment mechanism for many product experiments.

Assignment happens at the flag

In a flag-based experiment, the flag returns different values for different users. A user might receive:

false for control.
true for treatment.
A string variation such as old_nav, compact_nav, or guided_nav.
A JSON configuration that changes product behavior.

The application uses that value to render or run the right experience. The experiment analysis then compares outcomes across assigned groups.

GrowthBook's feature flag experiments docs explain this pattern directly: a flag can include an experiment rule that randomly assigns users to variations and tracks assignment through the SDK callback.

Consistency matters

Users should see the same variation across sessions unless the experiment is designed otherwise. If a user sees treatment in one session and control in the next, the analysis becomes harder to trust and the user experience can feel broken.

This is why deterministic assignment and stable identifiers matter. A B2B SaaS product also needs to decide whether assignment happens at the user level, account level, workspace level, device level, or another unit.

The assignment unit should match the product experience. If one account shares a feature, account-level assignment may be cleaner than user-level assignment.
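Deterministic assignment can be sketched in a few lines. The example below assumes a SHA-256 hash of the experiment key plus the assignment unit's ID; the function name assign_variation and the hashing scheme are illustrative assumptions, not GrowthBook's actual bucketing algorithm.

```python
import hashlib


def assign_variation(experiment_key: str, unit_id: str, variations: list[str]) -> str:
    """Map a stable unit ID (user, account, workspace) to one variation.

    Hypothetical helper: the same inputs always return the same variation,
    so a unit sees a consistent experience across sessions.
    """
    if not variations:
        raise ValueError("at least one variation is required")
    # Salt the hash with the experiment key so different experiments
    # bucket the same unit independently of each other.
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex characters into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    index = min(int(bucket * len(variations)), len(variations) - 1)
    return variations[index]


# Account-level assignment: every user in the account passes the account ID,
# so the whole account lands in the same variation.
account_variation = assign_variation(
    "guided_onboarding_checklist", "account_123", ["control", "treatment"]
)
```

Because the bucket depends only on the experiment key and the unit ID, switching from user-level to account-level assignment is a matter of which identifier the caller passes in.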

Exposure logging matters

Feature flag evaluation is not always the same as experiment exposure.

A backend service may evaluate a flag before the user ever sees the UI. A React component may check a flag but fail to render because the user navigates away. A mobile app may fetch configuration at startup even if the feature screen is never opened.

Exposure should be logged when the user can actually experience the change. If exposure is logged too early, the experiment includes users who had no chance to respond. That dilutes the effect and can bias results.
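One way to keep evaluation and exposure apart is to record them in two separate steps. The sketch below is a hypothetical in-memory tracker (ExposureTracker, evaluate, and mark_exposed are made-up names, not part of any SDK); a real setup would send the exposure event to an analytics pipeline instead of a list.

```python
from dataclasses import dataclass, field


@dataclass
class ExposureTracker:
    """Hypothetical tracker that separates flag evaluation from exposure."""

    assignments: dict[tuple[str, str], str] = field(default_factory=dict)
    exposures: list[dict[str, str]] = field(default_factory=list)
    _logged: set[tuple[str, str]] = field(default_factory=set)

    def evaluate(self, experiment_key: str, unit_id: str, variation: str) -> str:
        # Evaluation only remembers the assignment; it does not log exposure.
        self.assignments[(experiment_key, unit_id)] = variation
        return variation

    def mark_exposed(self, experiment_key: str, unit_id: str) -> bool:
        # Call this at the moment the user can experience the change,
        # for example when the checklist is actually rendered.
        key = (experiment_key, unit_id)
        if key not in self.assignments or key in self._logged:
            return False  # never assigned, or already logged once
        self._logged.add(key)
        self.exposures.append(
            {
                "experiment": experiment_key,
                "unit_id": unit_id,
                "variation": self.assignments[key],
            }
        )
        return True
```

A backend can call evaluate as early as it likes, while only the surface that shows the treatment calls mark_exposed, so users who never reach that surface stay out of the analysis.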

Rollout after the result is easier

When the experiment is controlled by a flag, the post-test action is straightforward:

If the treatment wins, ramp the winning variation to more users.
If the treatment loses, turn it off.
If the result is inconclusive, keep the flag limited or iterate.
If guardrails degrade, stop or roll back without redeploying.

The flag makes the decision operational. The A/B test makes it evidence-based.
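The four outcomes above can be written down as an explicit rule before the test starts. This is a minimal sketch, assuming the readout provides a confidence interval for the lift on the primary metric and that higher is better; the names Action and post_test_action are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RAMP = "ramp the winning variation"
    TURN_OFF = "turn the treatment off"
    HOLD = "keep the flag limited or iterate"
    ROLL_BACK = "stop or roll back"


def post_test_action(lift_ci_low: float, lift_ci_high: float, guardrails_ok: bool) -> Action:
    """Translate an experiment readout into a flag action (illustrative only)."""
    if not guardrails_ok:
        # Guardrail degradation overrides any win on the primary metric.
        return Action.ROLL_BACK
    if lift_ci_low > 0:
        return Action.RAMP  # the whole interval is above zero
    if lift_ci_high < 0:
        return Action.TURN_OFF  # the whole interval is below zero
    return Action.HOLD  # the interval includes zero: inconclusive


print(post_test_action(0.01, 0.05, guardrails_ok=True).value)
```

Writing the rule down first makes it harder to reinterpret an inconclusive result as a win after the fact.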

A practical workflow

The combined workflow is simple, but each step has a purpose.

Step 1: write the hypothesis

Do not start with the flag. Start with the hypothesis.

Example:

"For new workspace admins, showing a guided setup checklist will increase seven-day activation because it makes the next meaningful task clearer."

That hypothesis tells the team what to build, who should be eligible, and what metric should decide the test.

Step 2: create the flag

Create a flag for the product behavior you want to control. The flag should have a clear owner, description, default value, and cleanup plan.

Good flag names describe behavior, not implementation trivia:

guided_onboarding_checklist
new_pricing_layout
ai_answer_quality_prompt_v2
account_level_invite_flow

Avoid vague names like experiment_test or new_flow.
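A lightweight check can enforce these conventions at flag creation time. The sketch below is an assumption about how a team might model flag metadata in its own tooling; FlagDefinition, validate_flag, the list of vague names, and the example owner and date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical deny-list of names that say nothing about behavior.
VAGUE_NAMES = {"experiment_test", "new_flow", "test", "temp"}


@dataclass(frozen=True)
class FlagDefinition:
    key: str
    owner: str
    description: str
    default_value: object
    cleanup_by: date


def validate_flag(flag: FlagDefinition) -> list[str]:
    """Return a list of problems; an empty list means the flag looks well-formed."""
    problems = []
    if flag.key in VAGUE_NAMES:
        problems.append(f"'{flag.key}' does not describe product behavior")
    if not flag.owner.strip():
        problems.append("flag has no owner")
    if not flag.description.strip():
        problems.append("flag has no description")
    return problems


checklist_flag = FlagDefinition(
    key="guided_onboarding_checklist",
    owner="onboarding-team",  # made-up owner for illustration
    description="Shows a guided setup checklist to new workspace admins.",
    default_value=False,
    cleanup_by=date(2026, 9, 30),  # made-up cleanup date
)
print(validate_flag(checklist_flag))
```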

Step 3: define variations

Variations should map to product experiences.

For a boolean feature:

Control: current onboarding.
Treatment: guided checklist.

For a config feature:

Control: standard prompt.
Treatment A: concise prompt.
Treatment B: context-first prompt.

The more variations you add, the more traffic you need. Keep the design as small as the decision allows.
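To see why extra variations cost traffic, here is a rough sketch using the textbook normal-approximation formula for comparing two conversion rates. It is an approximation for intuition, not necessarily the method any particular platform's power calculator uses, and the baseline and expected rates are made-up inputs.

```python
from math import ceil
from statistics import NormalDist


def users_per_variation(
    baseline: float, expected: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate users needed in each variation for a two-sided test."""
    if baseline == expected:
        raise ValueError("baseline and expected rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    return ceil((z_alpha + z_power) ** 2 * variance / (expected - baseline) ** 2)


def total_users(baseline: float, expected: float, variations: int) -> int:
    # Every variation, including control, needs its own full sample.
    return users_per_variation(baseline, expected) * variations


# Hypothetical example: 20% baseline activation, hoping to detect 22%.
two_arm = total_users(0.20, 0.22, variations=2)
three_arm = total_users(0.20, 0.22, variations=3)
print(two_arm, three_arm)
```

This sketch scales traffic linearly with the number of variations. Correcting for multiple comparisons would raise the requirement further, which is another reason to keep the design small.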

Step 4: define metrics and guardrails

Pick one primary metric. Add guardrails for outcomes that must not degrade.

For the guided setup checklist:

Primary metric: seven-day activation.
Guardrails: paid conversion, support contacts, onboarding completion time, invite rate.

Metrics should be defined before launch. If your organization uses warehouse-defined metrics, the experiment should use those definitions rather than rebuilding weaker dashboard versions.

Step 5: launch gradually

A flag lets you separate technical release from experiment exposure.

A common path:

Enable for internal users.
Enable for QA or beta accounts.
Start the experiment with eligible production users.
Monitor guardrails and assignment quality.
Keep the experiment running until the decision rule is met.

Do not stop early just because the result looks exciting unless the statistical method supports that behavior.
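A small simulation shows why peeking is risky under a fixed-horizon test. The sketch below runs A/A tests, where both arms have the same true conversion rate, and compares how often a naive z-test crosses 1.96 at any interim look versus only at the final look. All parameters are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt


def z_stat(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-proportion z statistic with a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_tests(
    experiments: int = 300,
    looks: int = 10,
    users_per_look: int = 100,
    rate: float = 0.1,
    seed: int = 7,
) -> tuple[float, float]:
    """Return (false positive rate with peeking, rate at the final look only)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        crossed = False
        significant = False
        for _ in range(looks):
            # Both arms share the same true rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant = abs(z_stat(conv_a, n, conv_b, n)) > 1.96
            crossed = crossed or significant
        peeking_hits += crossed  # stopped at the first exciting look
        fixed_hits += significant  # judged only at the planned end
    return peeking_hits / experiments, fixed_hits / experiments


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"peeking: {peeking_rate:.2%}, fixed horizon: {fixed_rate:.2%}")
```

By construction the peeking rate can never be lower than the fixed-horizon rate, and with repeated looks it is typically well above the nominal 5%. Sequential methods exist precisely to make early stopping valid; this naive test is not one of them.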

Step 6: decide and clean up

At the end, make a decision:

Ship.
Stop.
Iterate.
Retest with a narrower hypothesis.

Then clean up. If the winning variation becomes default, remove old code paths when it is safe. If the treatment loses, remove the flag and treatment code. If the flag must remain, document why it is now long-lived configuration rather than a temporary release flag.

Community discussions about feature flags often warn about stale flags for a reason. Hacker News threads about the reality of running feature flags and about keeping flags short-lived repeatedly point to the same operational issue: teams prioritize new work and postpone cleanup. A flag strategy needs an owner and a lifecycle, not just a dashboard.

When to use a flag without an A/B test

Not every flag should become an experiment.

Kill switches

A kill switch exists to turn behavior off quickly if something breaks. The goal is reliability, not causal learning.

Examples:

Disable a third-party integration.
Turn off an AI feature if error rates spike.
Stop a slow query path.
Revert a risky backend migration.

You may monitor metrics, but you do not need an A/B test to justify the switch.

Permissions and entitlements

Some flags control who has access to a feature based on plan, role, beta status, region, or contract terms. These are product configuration, not temporary experiments.

Treat them differently from experiment flags. They may be long-lived, require stricter governance, and need clearer documentation.

Internal testing and beta programs

Internal releases and beta programs are useful for quality, feedback, and readiness. They are not usually randomized controlled experiments.

Use them to find bugs and qualitative issues. Use A/B testing when you need measured causal impact.

When to turn a flag into an A/B test

A flag should become an A/B test when the team needs evidence about product impact.

Good candidates:

Onboarding changes.
Pricing-page changes.
Recommendation logic.
AI prompt or model changes.
Activation nudges.
Search ranking changes.
New dashboard layouts.
Checkout or upgrade flows.

The key is measurable uncertainty. If the team already knows the feature must ship for contractual or compliance reasons, an A/B test may not be necessary. If the question is whether users benefit, convert the rollout into an experiment.

Common anti-patterns

Feature flags and experiments are powerful together, but the combined workflow has traps.

Anti-pattern 1: using rollout percentage as evidence

"We rolled it out to 50% and nothing looked bad" is not an experiment result.

It may be a valid release safety signal. It is not proof that the feature improved activation, retention, or revenue. For that, you need a comparison group and a predefined metric.

Anti-pattern 2: logging exposure too early

If exposure fires when a flag is evaluated but before the user can experience the treatment, the experiment includes users who never saw the change.

This is especially common in backend or config-heavy implementations. Audit exposure timing before trusting the result.

Anti-pattern 3: leaving experiment flags forever

Every temporary flag should have a cleanup plan.

Long-lived flags are sometimes legitimate. But experiment flags should not quietly become permanent branches in the codebase. Cleanup is part of the experiment cost.
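Cleanup is easier to enforce when overdue flags are surfaced automatically. The sketch below assumes the team keeps a cleanup date per flag and marks documented long-lived flags with no date; the function name stale_flags and the example data are hypothetical.

```python
from datetime import date
from typing import Optional


def stale_flags(cleanup_dates: dict[str, Optional[date]], today: date) -> list[str]:
    """Return flag keys whose cleanup date has passed.

    A value of None marks a documented long-lived flag (for example an
    entitlement), which is exempt from the check.
    """
    return sorted(
        key
        for key, cleanup_by in cleanup_dates.items()
        if cleanup_by is not None and cleanup_by < today
    )


# Made-up inventory for illustration.
inventory = {
    "guided_onboarding_checklist": date(2026, 3, 1),
    "new_pricing_layout": date(2026, 12, 1),
    "account_level_invite_flow": None,  # long-lived by design
}
print(stale_flags(inventory, today=date(2026, 6, 15)))
```

Running a check like this in CI or a weekly report turns cleanup from a good intention into a visible backlog item.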

Anti-pattern 4: testing without guardrails

A test can improve the primary metric and harm something else.

Examples:

More signups, lower paid conversion.
More clicks, lower retention.
Faster activation, more support tickets.
Higher engagement, worse latency.

Guardrails keep a narrow metric from creating a bad release.

Anti-pattern 5: confusing users with changing assignments

If users bounce between variants, they may notice inconsistent product behavior. Worse, the experiment result becomes less reliable.

Use stable identifiers and choose the assignment unit carefully.

Where GrowthBook fits

GrowthBook is built around the idea that feature flags and experimentation should work together.

The feature flags product page states that any feature flag in GrowthBook can become an A/B test, assigning users to variants and tracking metrics using data from your existing data warehouse. The experimentation platform extends that workflow with multiple variants, advanced statistics, warehouse-native analysis, and guardrails.

That matters because the hardest part of connecting flags and experiments is usually not toggling code. It is keeping release control, assignment, exposure, metrics, and analysis aligned.

GrowthBook is a good fit when teams want:

Feature flags for release control.
A/B testing for product impact.
Warehouse-native metrics.
Transparent analysis.
Cloud or self-hosted deployment.
Product Analytics connected to experiments.

The important claim is practical: GrowthBook lets teams move from "who saw the change?" to "did the change work?" without switching operating models.

Architecture patterns that work

Feature flags and A/B tests can be wired several ways. The right pattern depends on where the decision happens and where the metric lives.

Client-side flags for UI experiments

Client-side flags are useful when the experience is rendered in the browser or mobile app: button copy, page layout, onboarding modals, navigation changes, or feature discovery prompts.

The benefit is speed. Frontend teams can often create experiments without changing backend services. The risk is that client-side flags may expose some flag metadata or variation names to the client, so they should not protect sensitive business logic.

Use client-side flags when:

The treatment is visible UI behavior.
The flag does not protect sensitive logic.
The page or app can handle loading states safely.
Exposure can be logged when the user sees the experience.

Server-side flags for product logic

Server-side flags are better when the treatment affects backend behavior: search ranking, recommendation logic, pricing calculations, eligibility rules, AI prompt selection, or API behavior.

The benefit is control. Sensitive logic stays server-side, and assignment can happen close to the service that owns the behavior. The risk is exposure logging. A server may evaluate a flag for a request that does not lead to a visible treatment, so teams need to place exposure tracking carefully.

Use server-side flags when:

The feature changes backend behavior.
The treatment should not be exposed to the client.
The assignment unit is account, workspace, organization, or user.
The team can log exposure at the right product moment.

Hybrid flags for full-stack changes

Many real product changes are full-stack. A new onboarding flow may include backend eligibility, frontend layout, email timing, and analytics events. In that case, the flag becomes a shared contract across services.

The team should define:

Which service owns assignment.
Which surfaces read the assigned variation.
Where exposure is logged.
Which identifiers must match across systems.
What the safe default is if flag data is unavailable.
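The last item, the safe default, is worth making explicit in code on every surface that reads the flag. A minimal sketch, assuming flag data arrives as a simple key-to-variation mapping that may be missing or stale; read_variation, SAFE_DEFAULT, and the variation names are hypothetical.

```python
from typing import Mapping, Optional

SAFE_DEFAULT = "control"
KNOWN_VARIATIONS = {"control", "guided_checklist"}


def read_variation(flag_payload: Optional[Mapping[str, str]], flag_key: str) -> str:
    """Return the assigned variation, falling back to the safe default.

    The fallback covers three cases: flag data is unavailable, the flag key
    is missing, or the value is one this surface does not know how to render.
    """
    if flag_payload is None:
        return SAFE_DEFAULT
    value = flag_payload.get(flag_key)
    return value if value in KNOWN_VARIATIONS else SAFE_DEFAULT


print(read_variation(None, "guided_onboarding_checklist"))
print(read_variation({"guided_onboarding_checklist": "guided_checklist"}, "guided_onboarding_checklist"))
```

If the backend, frontend, and email service all share the same fallback, a flag outage degrades to the current experience instead of a mix of half-enabled surfaces.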

Hybrid experiments need more coordination, but they are often where feature flags are most useful. They let teams ship the code safely before exposing the full product experience.

Worked example: rolling out an AI feature

Imagine a SaaS company adding an AI assistant to a reporting workflow. The team wants to know whether the assistant helps users complete analysis tasks faster without lowering trust.

Release-control plan

The engineering team ships the assistant behind a server-side feature flag. The default is off. Internal employees get access first, followed by a beta cohort of trusted accounts.

The flag supports:

Internal targeting.
Account-level rollout.
A kill switch.
A treatment variation for the new assistant.
A control variation that preserves the current workflow.

This lets engineering test production behavior without exposing every user.
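The release-control plan amounts to an ordered set of rules. The sketch below shows one plausible ordering (kill switch first, then targeted accounts, then percentage rollout) with a generic SHA-256 bucket; the names AssistantFlagConfig and assistant_enabled, the account IDs, and the rule order are assumptions for illustration, not a description of any specific platform.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AssistantFlagConfig:
    kill_switch_on: bool
    internal_accounts: frozenset[str]
    beta_accounts: frozenset[str]
    rollout_percent: float  # 0 to 100, applied at the account level


def account_bucket(account_id: str, salt: str) -> float:
    """Deterministically map an account to a number in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{account_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100


def assistant_enabled(config: AssistantFlagConfig, account_id: str) -> bool:
    if config.kill_switch_on:
        return False  # the kill switch wins over every other rule
    if account_id in config.internal_accounts or account_id in config.beta_accounts:
        return True  # targeted cohorts get access before any rollout
    # Everyone else is included only if their bucket falls under the rollout.
    return account_bucket(account_id, "ai_assistant") < config.rollout_percent


config = AssistantFlagConfig(
    kill_switch_on=False,
    internal_accounts=frozenset({"acme_internal"}),
    beta_accounts=frozenset({"trusted_beta_co"}),
    rollout_percent=0.0,  # default off for everyone else
)
print(assistant_enabled(config, "acme_internal"), assistant_enabled(config, "random_account"))
```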

Experiment plan

Once QA and beta feedback look stable, the team converts the flag into an experiment.

The hypothesis:

"For analysts building reports, showing the AI assistant will increase successful report completion because users can translate questions into chart configuration faster."

The metrics:

Primary metric: successful report completion within the session.
Guardrails: assistant error rate, latency, report deletion, negative feedback, support contact rate.
Secondary learning metric: time to first useful chart.

The assignment unit is account-level because multiple users in the same workspace may collaborate on reports. Mixing variations inside the same account would create confusion.

Decision and cleanup

If the assistant improves report completion without harming guardrails, the team ramps the treatment gradually. If latency or negative feedback gets worse, the team rolls back with the flag.

After the decision, engineering does not leave the experiment flag in place indefinitely. If the assistant ships to all eligible accounts, the old control path is removed after an engineering review. If the assistant remains plan-gated, the flag is converted from a temporary experiment flag into a documented permission or entitlement rule.

That cleanup step is the difference between an experimentation program and a codebase full of forgotten conditions.

Ownership across product, engineering, and data

Flags and experiments fail when ownership is implicit.

Engineering owns safe delivery

Engineering should own SDK integration, default values, runtime behavior, rollback paths, and cleanup. If a flag can break production, engineering needs to know how it behaves under network failure, stale config, bad targeting, or missing attributes.

Engineering should also review long-lived flags. Temporary release flags and experiment flags should not quietly become permanent product logic.

Product owns the decision

Product should own the hypothesis, target audience, minimum practical effect, and action after the result. If the test wins, what ships? If it loses, what happens to the idea? If it is inconclusive, will the team run longer, narrow the audience, or move on?

Without that decision ownership, experiments become dashboards with no consequence.

Data owns measurement trust

Data teams or analytically strong product teams should own metric definitions, exposure quality, sample ratio checks, guardrails, and readout interpretation.

This does not mean every experiment needs a custom analysis project. It means the metrics and statistical method should be understood before the result is used to justify a release.
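A sample ratio check is one of the simpler measurement-trust tools to automate. The sketch below uses a chi-square goodness-of-fit test for two groups; the function name srm_p_value is hypothetical, and the 0.001 alert threshold in the usage line is a common convention used here as an assumption.

```python
from math import erfc, sqrt


def srm_p_value(
    control_users: int, treatment_users: int, expected_treatment_share: float = 0.5
) -> float:
    """P-value for a sample ratio mismatch between two variations."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # With one degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)), so no extra library is needed.
    return erfc(sqrt(chi2 / 2))


# Made-up counts: a 50/50 design that ended up 5,000 vs 5,400.
p = srm_p_value(5000, 5400)
print("possible sample ratio mismatch" if p < 0.001 else "split looks consistent")
```

A very small p-value means the observed split is unlikely under the intended allocation, which usually points to an assignment or exposure logging problem rather than a real effect.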

Implementation checklist

Before launch:

Write the hypothesis.
Choose the assignment unit.
Create the flag with owner, description, and cleanup date.
Define control and treatment variations.
Choose one primary metric.
Choose guardrails.
Confirm exposure logging timing.
Test internal targeting.
Test rollback.
Decide the stopping rule.

During the experiment:

Monitor assignment and sample ratio.
Watch guardrails.
Document incidents.
Avoid changing the decision rule.
Keep the rollout within the planned design.

After the result:

Decide ship, stop, iterate, or retest.
Ramp the winning variation deliberately.
Remove losing code paths.
Archive or convert the flag.
Record what the team learned.

What to do next

The cleanest workflow is simple: use feature flags to control exposure, and use A/B testing to measure impact.

Do not make every flag an experiment. Do not treat every rollout as evidence. Use the right tool for the job, then connect them when the team has a measurable hypothesis.

Start with one real product change. Put it behind a flag, assign users cleanly, define the metric before launch, and decide what will happen when the result arrives. That is the loop worth building.
