Experiments

Feature flagging for AI models: how to safely roll out changes

A graphic of a bar chart with an arrow pointing upward.

Swapping a model ID or tweaking a system prompt doesn't touch your codebase — which means your CI/CD pipeline has no record of it, no way to track it, and nothing to roll back when it degrades in production.

That's the gap that catches AI teams off guard. Feature flags for AI models aren't just a nice-to-have on top of your existing deployment process. They're the control plane that makes model configs, prompt templates, and hyperparameters as safe to change as any other piece of production code.

This article is for engineers, PMs, and data teams who are shipping AI features and want a practical system for doing it safely — without slowing down. Whether you're swapping foundation models, iterating on prompts, or trying to measure whether a new model actually performs better for your users, the same flag infrastructure handles all of it. Here's what you'll learn:

The article moves in that order — from the case for this infrastructure, to what you put behind flags, to how you roll out and measure, to how you govern and revert. Each section builds on the last, and by the end you'll have a complete picture of how feature flags become the operational backbone of a safe AI release process.

Why AI model rollouts demand a different kind of infrastructure

If your team already runs a mature CI/CD pipeline with staging environments, automated tests, and deployment gates, you might reasonably ask: why isn't that enough for AI? The answer cuts to something fundamental about how AI systems fail — and it's different enough from traditional software that it demands a different class of tooling.

AI failures are probabilistic, not deterministic

When a software bug ships to production, it's reproducible. Given the same input, you get the same broken output. You can replicate it in staging, trace it to a commit, and revert. The feedback loop is tight and the failure mode is legible.

AI failures don't work that way. A model or prompt change might succeed on 999 user inputs and fail catastrophically on the thousandth — not because of a logic error, but because of the way model parameters interact with an edge case that no one anticipated.

As Statsig frames it, AI model parameters are "often interdependent, and those relationships only become apparent when real users encounter edge cases." One tweak can shift many behaviors in surprising ways. That's not a bug in the traditional sense. It's a probabilistic failure that your test suite has no reliable mechanism to catch.

This distinction matters enormously for how you think about release infrastructure. Traditional QA gates — unit tests, integration tests, staging environments — are necessary but not sufficient. They test known inputs. Real users generate unknown ones.

Production is the only real test environment

The uncomfortable reality for any team shipping AI features is that offline evaluation and staging environments are approximations. They test the inputs you thought to include. Production tests everything else.

The team at Helix, an image recognition platform, discovered this directly. When adding new image classifications to their model, their core question was: "How do we know if they will work in the real world with real users and return the correct classification?" The answer wasn't a better staging environment — it was controlled exposure to a specific subset of real users before any broader rollout.

Production exposure wasn't optional; it was the only meaningful test. The only question was whether that exposure would be controlled or uncontrolled.

This is the defining characteristic of AI risk that makes it categorically different from shipping a new UI component or a refactored API endpoint. You cannot fully validate an AI system before real users touch it. You can only control how many real users touch it, and which ones, and what you're measuring while they do.

Why CI/CD alone can't solve this

Here's the specific gap that catches teams off guard: prompt changes, model version updates, and hyperparameter adjustments often don't involve any code change at all. You might update a system prompt in a config file, swap a model ID in an environment variable, or adjust temperature through an API parameter — none of which produces a git commit that your deployment pipeline can track or revert.

When that change degrades in production, there is no deployment to roll back. The rollback mechanism has to exist at the configuration layer, not the deployment layer. Feature flags provide exactly that: runtime control over application behavior without touching the deployment pipeline.

LaunchDarkly's framing captures the principle cleanly — flags make individual features, not deployments, the unit of control. GrowthBook's documentation puts the gap even more bluntly: "Don't Let Vibe Coding Become Vibe Shipping." The tools that help you build AI features don't automatically give you the infrastructure to ship them safely.

The business risk of skipping this infrastructure

The stakes of an uncontrolled AI rollout aren't just technical. A hallucinating documentation bot, a content moderation model that misclassifies at scale, or a recommendation system that degrades user trust — these carry reputational and sometimes ethical consequences that a slow API response or a broken form field typically don't.

At the same time, the velocity pressure is real. New foundation models ship constantly, and teams that can't safely test them in production are forced into a bad choice: move slowly and fall behind, or move fast and accept uncontrolled exposure.

Feature flags resolve that tradeoff. They're the infrastructure layer that makes it possible to move quickly and maintain a kill switch — not as competing priorities, but as the same mechanism. That's why treating them as optional for AI deployments is the wrong frame. For any team running AI features in production, they're the floor, not the ceiling.

Models, prompts, and hyperparameters all belong behind a flag

Once you accept that runtime configuration control is the gap CI/CD can't close, the next question is concrete: what exactly belongs behind a flag? When engineers first start thinking about feature flags for AI, they usually think about the model — which version of GPT-4 or Claude is active.

That's the right instinct, but it captures maybe a third of what actually needs to be controlled. Every configurable dimension of an LLM integration is a potential source of behavioral change in production, and every one of them belongs behind a flag. The practical question is: what exactly does that mean, and what does it look like in code?

Model identity and provider configuration

The modelId field is the most fundamental thing to put behind a flag. It determines which model handles a request, and swapping it — from Claude 3 Haiku to Claude 3 Sonnet, or from GPT-4 Turbo to GPT-4o — is a meaningful behavioral change that should be controlled, not deployed.

LaunchDarkly's AI model flag templates, released in GA in June 2024, are built around exactly this pattern: a JSON flag carries the modelId alongside inference parameters, and the application reads it at request time.

The DevCycle/Helix case study makes the operational benefit concrete. Helix built model selection directly into their image recognition microservice using a feature flag, and new model candidates were added as flag variations by the product manager — no engineering deployment required. That detail matters: when introducing a new model is a flag configuration change rather than a code change, the iteration cycle compresses dramatically.

Prompt templates as flaggable configuration

Prompt templates are not static strings. They are configuration that directly determines model behavior, and they should be versioned and controlled with the same rigor as model IDs.

LaunchDarkly treats prompts as a distinct flaggable category — separate from the model flag — which signals that the industry has recognized prompts as a first-class configuration concern rather than an implementation detail.

A prompt flag can carry the full system prompt, the user prompt structure, and any few-shot examples as a string or structured JSON field. This matters operationally because prompt changes are among the most frequent modifications AI teams make in production, and without a flag, every iteration requires a deployment cycle. It also matters for governance, which is addressed later in this article — but the short version is that a prompt change is effectively a code change, and it should be treated as one.

LLM hyperparameters — temperature, max_tokens, top_p, top_k

These are the inference-time knobs that control output behavior: temperature governs creativity versus determinism, max_tokens caps response length, top_p controls nucleus sampling, and top_k limits the candidate token pool. LaunchDarkly explicitly names all four as parameters that belong in model flags, with the note that parameter naming conventions vary by provider — Anthropic and OpenAI use slightly different field names for equivalent concepts.

Changing any of these without a flag creates an uncontrolled variable in production. Adjusting temperature in isolation, for example, can interact unexpectedly with a top_p value that was calibrated for a different temperature range — a combination that may have worked well in offline testing but behaves differently when real users generate the full distribution of inputs.

Bundling everything into a structured JSON payload

That interdependency is the strongest argument for encoding the entire LLM configuration as a single multivariate JSON flag rather than managing model ID, prompt, and hyperparameters as separate flags. When configuration is split across multiple flags, interaction effects become difficult to reason about — you can't be certain which combination of values is actually active for a given user at a given moment.

The industry pattern, confirmed across LaunchDarkly, ConfigCat, and GrowthBook's documentation, is a single JSON flag payload that bundles the full configuration. A minimal example looks something like this:

{
  "modelId": "claude-3-sonnet-20240229",
  "provider": "anthropic",
  "temperature": 0.7,
  "maxTokens": 1024,
  "topP": 0.9,
  "systemPrompt": "You are a helpful assistant specialized in..."
}

GrowthBook's JSON flag type is explicitly designed for this kind of complex, multi-value configuration. GrowthBook's JSON flag type supports JSON Schema validation on flag values, which means teams can enforce that every LLM config payload contains required fields — modelId is always present, temperature is always a float within a valid range — before the configuration reaches production. That's a meaningful safety layer for teams with compliance requirements or customer-facing AI features.

The flag value is evaluated at request time, so a prompt change, model swap, or hyperparameter adjustment takes effect immediately across all running instances. No deployment, no restart, no coordination across engineering teams. The configuration becomes a runtime variable, which is exactly the kind of control that AI deployments require.

Canary releases and targeted rollouts: controlling blast radius at every stage

Deploying a new AI model to production is not a binary event. The question isn't whether to flip a switch — it's how to design a progression that limits blast radius at every step while preserving the ability to reverse course instantly. Percentage-based traffic splitting and user attribute targeting are the two mechanisms that make this possible, and feature flags are the infrastructure layer that ties them together.

Why feature flag rollouts differ from infrastructure-level canaries

The term "canary release" gets applied to two fundamentally different patterns, and the distinction matters for AI. A traditional canary deployment routes traffic at the infrastructure layer — a load balancer sends a fraction of requests to machines running the new version. That works when the change is a new binary.

But when the change is a model ID, a prompt template, or an endpoint configuration, there's no new binary to deploy. The change is a config value, and the correct control mechanism is an application-layer feature flag, not a load balancer rule. Feature flags make individual configurations — not deployments — the unit of control, which means rollback is a flag toggle rather than a container redeployment.

Percentage-based canary mechanics

The mechanics of a percentage-based rollout are straightforward: a flag routes X% of traffic to the new model configuration and the remainder to the current one. What makes this safe is deterministic user assignment. GrowthBook uses MurmurHash3 to hash user identifiers into consistent bucket assignments — the same user always resolves to the same model variant, preventing the kind of mid-session inconsistency that would corrupt evaluation data and confuse users.

A staged ramp schedule — 10%, then 25%, 50%, 75%, and finally 100% — gives teams defined checkpoints to assess metrics before advancing. Adjusting the overall exposure percentage mid-rollout is safe; reshuffling split ratios between variants is not, because it risks users switching model assignments.

User attribute targeting strategies

Percentage-based rollouts control how many users see the new model. Attribute targeting controls which users see it first, and that distinction is what makes a canary genuinely risk-stratified rather than just small.

The logical progression sequences exposure by risk profile: internal employees carry zero customer risk and provide a first functional validation pass; opted-in beta users are tolerant of rough edges and generate useful early signal; premium or paid-tier users offer higher-value feedback before broader exposure; geographic targeting limits regional blast radius before a global rollout.

In B2B contexts, organization-level targeting ensures all users within a tenant get the same model — critical for consistency in enterprise accounts. GrowthBook supports AND/OR targeting logic across attributes like subscription tier, geography, company ID, and custom properties, with reusable Saved Groups for audience segments that appear across multiple flags. This lets teams define a "beta cohort" once and reference it across every model rollout without rebuilding the targeting rule each time.

The kill switch as a required component

At every stage of the ramp-up sequence, the flag must be instantly reversible. This is not an optional feature — it is the architectural requirement that justifies the entire staged approach. If a model starts producing degraded outputs at 25% exposure, the correct response is to turn the flag off, not to file a deployment ticket.

GrowthBook's Safe Rollouts capability takes this further by combining the ramp schedule with automatic guardrail monitoring: if a key metric regresses beyond a defined threshold, the system can trigger rollback without waiting for manual intervention. That matters specifically for AI, where a quality regression might surface gradually across thousands of requests before it's visible in aggregate dashboards.

What a responsible rollout sequence actually looks like

A concrete progression, grounded in the mechanics above, looks like this: start at 0% with the flag forced on only for internal employees via attribute targeting — validate that the model responds correctly and latency is acceptable. Expand to 5–10% targeting the beta cohort, monitoring error rates and response quality. At 25%, broaden to a specific subscription tier or geography and begin watching user-facing outcome metrics.

From there, advance through 50%, 75%, and 100% using a scheduled ramp with guardrail thresholds defined before each step. The critical discipline is specifying, before each advance, what metric threshold would trigger a hold or rollback — not after a problem surfaces, but as a precondition for moving forward. That pre-commitment is what separates a controlled rollout from an optimistic deployment.

From canary to controlled experiment: measuring which AI model actually performs better

A controlled experiment and a canary release are not the same thing, and conflating them is one of the more expensive mistakes a team can make when rolling out AI model changes. A canary answers one question: does this break anything? A controlled experiment answers a different question entirely: does this actually improve outcomes that matter?

For AI model selection, the distinction is critical. A new model can pass a canary without throwing errors, without degrading latency, without producing a single visible hallucination in the sample traffic — and still be the wrong model for your users. Safety is necessary, but it is not sufficient.

The conceptual pivot: from safety check to quality check

The DevCycle/Helix image recognition rollout described earlier in this article is a clean example of the canary pattern. Exposing a new model to internal users and a beta cohort via email-based targeting confirms the model works under real conditions. It does not tell you whether the model is statistically better than the incumbent at the outcomes your product depends on.

Once you have confirmed a model does not break things, you still need evidence it is better before promoting it to 100% of traffic. That evidence has to come from production, not from internal benchmarks. As Statsig has observed, real-time evaluation in production consistently reveals gaps that offline checks miss — which is exactly why the experiment has to happen in the wild, with real users, under real conditions.

Landon Smith, Head of Post-Training at Character.AI, describes this directly: "GrowthBook has been an invaluable tool for Character.AI, helping us develop our models into a great consumer experience. We can compare different modeling techniques from the perspective of our users — guiding our research in the direction that best serves our product." That framing — from the perspective of our users — is the defining characteristic of experiment-driven model selection. Internal evals measure what the model does. Experiments measure what the model does to users.

Instrumenting metrics at two layers

Running a rigorous model comparison experiment requires instrumentation at two distinct layers, and both matter.

The first layer is user-facing outcome metrics: engagement, task completion rate, session length, satisfaction signals, conversion. These are the metrics that justify model decisions to product leadership and business stakeholders. "Our internal eval preferred Model B" is not the same argument as "Model B produced a statistically significant improvement in task completion rate across a large, representative user sample." The experiment produces the latter.

The second layer is LLM-level operational metrics: latency per request, cost per inference, refusal rate, error rate. A model that improves engagement while tripling inference cost may not be the right choice at your traffic volume. These operational metrics function as guardrails — they prevent a model that wins on quality from being promoted despite being economically or operationally unsustainable. Treating them as guardrail metrics in your experiment framework means a degradation surfaces as a warning before it becomes a production incident.

Warehouse-native analysis for the cost-quality tradeoff

The specific challenge with AI model experiments is that the data you need lives in two places: inference cost data comes from your LLM provider, and user outcome data comes from your product. Joining and analyzing those together is a SQL problem, not a dashboard problem.

GrowthBook's warehouse-native experimentation connects directly to Snowflake, BigQuery, Redshift, ClickHouse, Databricks, and similar environments — the analysis runs as SQL queries against data where it already lives, without requiring a separate data pipeline or moving raw user-level data out of your environment. This matters for teams with data residency requirements and for teams that simply want their experiment results to reflect the same source of truth as their product analytics.

This architecture also matters for experiment runtime. Variance reduction techniques like CUPED can meaningfully shorten the time required to reach statistical significance, which matters when inference costs are accumulating throughout the experiment window. Shorter experiments mean lower cost exposure and faster decisions — both of which are meaningful when you are running model comparisons continuously as new model versions become available.

Connecting results to the promotion decision

A defensible model promotion decision has a specific shape: statistically significant improvement on at least one user-facing outcome metric, no degradation on guardrail metrics, and a documented record of the experiment that produced the decision. That record is what makes model selection auditable and repeatable over time — not just for the current decision, but for the institutional knowledge that accumulates across many model iterations.

The mechanical bridge back to flag infrastructure is direct. The same feature flag used to run the canary becomes the experiment assignment mechanism. When the experiment concludes in favor of the new model, the flag promotion to 100% traffic is the final step in a process that started with a targeted rollout and ended with statistical evidence. The flag is not just a deployment tool — it is the control plane for the entire model selection lifecycle.

Instant rollback, kill switches, and governance for AI changes in production

When a bad model config reaches production, the clock starts immediately. Every request served by a degraded prompt or misconfigured temperature setting is a user interaction that may produce a harmful, embarrassing, or simply wrong response.

The problem with deploy-based rollback in this scenario isn't just that it's slow — it's that there's nothing to redeploy. The code hasn't changed. The model config lives outside your codebase, and if it wasn't stored behind a feature flag, your only options are to push a hotfix, manually revert a config file, or call someone at 3am to log into a dashboard and change a value by hand. None of these are acceptable for customer-facing AI features.

This section covers two distinct problems that flag-based infrastructure solves: the technical problem of reverting a bad AI config instantly, and the organizational problem of ensuring that model and prompt changes go through appropriate review before they ever reach production.

Kill switch mechanics and revert speed

The technical case for flag-based rollback is straightforward once you understand what happens when a flag is disabled. With SDK-based flag evaluation, when your application calls something like growthbook.getFeatureValue('llm-config', defaultConfig), disabling the flag causes the platform to stop serving that flag's rules and return nothing — so the SDK falls back to the defaultConfig value you provided. All targeting rules are ignored. No redeploy. No pipeline. No waiting.

For an AI system where the model ID, prompt template, and temperature are all stored as a JSON payload in a feature flag, this means a single toggle in a dashboard immediately routes every subsequent request back to the previous known-good configuration. The contrast with deploy-based rollback is stark: a traditional rollback requires a code push, CI/CD pipeline execution, and deployment time — all while the degraded config continues serving users. For AI configurations, which live in the flag rather than in code, the flag is the only rollback mechanism that operates at the speed the problem demands.

Automated rollback with guardrail metrics

Manual kill switches require someone to notice the problem first. Automated rollback closes that gap by monitoring guardrail metrics during a staged rollout and triggering a revert when a regression is detected — without human intervention.

GrowthBook's Safe Rollouts include an Auto Rollback toggle that automatically disables the rollout rule if any guardrail metric fails significantly, using one-sided sequential testing to detect regressions while minimizing false positives. The ramp-up schedule — 10%, 25%, 50%, 75%, then 100% — is designed so that a regression surfaces early, when only a fraction of users are exposed. If guardrails fail at 10%, the rollout stops and reverts before the majority of traffic is ever affected.

The metric selection question matters especially for AI. A degraded model response may not spike your error rate at all — it might show up in engagement drop-off, task completion rate, or user satisfaction scores. That means your guardrail metrics need to include user-facing quality signals, not just infrastructure health indicators.

LaunchDarkly's Guarded Releases, which reached general availability in 2025, offers a comparable automated remediation capability with application performance thresholds and telemetry integrations. Notably, not every flag platform provides this natively — Flagsmith, for instance, requires an external Datadog Workflows integration to approximate automated rollback, which introduces meaningful operational complexity.

Approval workflows for model and prompt changes

A prompt change that alters how an AI responds to users is functionally a code change. It modifies production behavior, it can introduce regressions, and it can cause harm if it reaches users without review. Treating it as a casual config edit — something anyone on the team can push directly to production — is a governance gap that will eventually cause an incident.

The correct pattern is to route all model config changes through the same approval workflow that governs code changes. GrowthBook supports drafts, revisions, merge conflict handling, and requiring approvals as part of its Publishing and Approval Flows. Unleash offers change requests with four-eyes approvals at the enterprise tier. LaunchDarkly provides approval workflows with custom roles. The specific platform matters less than the principle: no prompt template, model ID swap, or hyperparameter change should reach production without a documented review step.

For teams with compliance requirements, this isn't optional. An audit trail that shows who approved a prompt change, when, and what the previous value was is the same artifact your security and compliance teams expect for code deployments. If your AI configuration lives in feature flags, it inherits that audit trail automatically — the same infrastructure that governs your code changes now governs your model configs. That's the architectural argument for why flags are the right home for AI configuration, not ad-hoc config files or direct API calls that leave no paper trail and offer no rollback path.

Implementing feature flags for AI models: a practical starting point

The through-line of this article is simple: AI configuration changes — model IDs, prompt templates, hyperparameters — are production changes, and they deserve the same controls as code. The gap isn't a tooling gap so much as a framing gap.

Once you treat your LLM config as a runtime variable rather than a deployment artifact, the right infrastructure becomes obvious. Feature flags aren't bolted on top of your AI release process. They are the release process.

JSON payloads vs. simple toggles: why the architecture decision matters first

Start with a single multivariate JSON flag that bundles your full LLM configuration — model ID, prompt template, and inference parameters together. Splitting these across separate flags creates interaction effects you can't reason about cleanly. If you're evaluating platforms, GrowthBook's JSON flag type with schema validation is worth examining specifically for this use case, since it lets you enforce required fields before a config reaches production.

The rollout sequence is only as safe as the discipline around it

The ramp sequence matters less than the discipline around it. What separates a controlled rollout from an optimistic deployment is the habit of defining your rollback threshold before you advance — not after a metric starts moving. Internal employees first, then a beta cohort, then a broader tier. The flag is your kill switch at every stage.

Governance feels like overhead until the first incident

Governance feels like overhead until the first incident. A prompt change that skips review is a code change that skipped review — same risk, same consequences. Build the approval workflow before your team is large enough to make skipping it tempting. The audit trail you get for free when config lives in flags is the same artifact your security team will ask for anyway.

There's a real tension worth naming: the faster your team iterates on models and prompts, the more you need this infrastructure — but building it takes time that feels like it's slowing you down. The honest answer is that the investment front-loads the friction. Teams that skip it don't move faster; they just absorb the cost later, usually during an incident.

This article was written to give you a complete picture of how this infrastructure fits together, not just the theory. If it helps you avoid one uncontrolled rollout or make one model promotion decision with actual evidence behind it, it's done its job.

What to do next: where you start depends on where you are

The right first move depends on your current state. No flag infrastructure at all means the highest-leverage action is putting your model ID and system prompt behind a single JSON feature flag — even before you build any rollout logic. That one change gives you runtime control and a rollback path, which is the most valuable property to have first.

Already have flags but they're simple booleans? Audit which AI configs are still living in environment variables or hardcoded strings and migrate the highest-risk ones to JSON flags. JSON flags already in place means the next investment is defining guardrail metrics before your next rollout — latency, error rate, and at least one user-facing quality signal — so that your next ramp has a defined stop condition rather than a hopeful one.

Related insights

Sign up for free

Take Growthbook for a spin, no credit card required.

Create my account

Table of Contents

Related Articles

See All Articles
Experiments

How much traffic do you need to test AI features reliably?

Jun 8, 2026
x
min read
Experiments

Why traditional A/B testing breaks down for AI products

Jun 8, 2026
x
min read
Experiments

How to measure "quality" in AI outputs (beyond accuracy)

Jun 7, 2026
x
min read

Ready to ship faster?

No credit card required. Start with feature flags, experimentation, and product analytics—free.

Simplified white illustration of a right angle ruler or carpenter's square tool.White checkmark symbol with a scattered pixelated effect around its edges on a transparent background.