The Uplift Blog

The Benchmarks Are Lying to You: Why You Should A/B Test Your AI
Experiments
AI

Sep 30, 2025

Quick Takeaways

  • Performance varies by domain: Models that ace benchmarks often fail on your specific use case
  • The trade-offs might not be real: Faster, cheaper models can outperform expensive ones for your needs
  • The best solution is rarely one model: Most successful deployments use model portfolios
  • A/B testing quantifies what matters: User completion rates, costs, and latency—not abstract scores

Introduction

OpenAI's GPT-5 (high) model scores 25% on the Frontier Math benchmark for expert-level mathematics. Claude Opus 4.1 only scores 7%. Based on these numbers alone, you might assume GPT-5 is clearly the superior choice for any application requiring mathematical reasoning.

FrontierMath Accuracy Benchmark
Confidence intervals across multiple A/B test variations showing variance in model performance estimates

But this assumption illustrates a fundamental problem in AI evaluation, one that we in the experimentation space know quite well as Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The AI industry has turned benchmarks into targets, and now those benchmarks are failing us.

When GPT-4 launched, it dominated every benchmark. Yet within weeks, engineering teams discovered that smaller, "inferior" models often outperformed it on specific production tasks—at a fraction of the cost.

Despite all the fanfare of the GPT-5 launch and its chart-topping scores on coding benchmarks, developers continued to prefer Anthropic's models and tooling for real-world usage. This disconnect between benchmark performance and production reality isn't an edge case. It's the norm.

The market for LLMs is expanding rapidly—OpenAI, Anthropic, Google, Mistral, Meta, xAI, and dozens of open-source options all compete for your attention. But the question isn't which model scores highest on benchmarks. It's which model actually works in your production environment, with your users, under your constraints.

Why Traditional Benchmarks Fail in Production

AI benchmarks are standardized tests designed to measure model performance—MMLU tests general knowledge, HumanEval measures coding ability, and FrontierMath evaluates mathematical reasoning. Every major model release leads with these scores.

But these benchmarks fail in three critical ways that make them unreliable for production decisions:

1. They Don't Measure What Actually Matters
Benchmarks test surrogate tasks—simplified proxies that are easier to measure than actual performance. A model might excel at multiple-choice medical questions while failing to parse your actual clinical notes. It might ace standardized coding challenges while struggling with your company's specific codebase patterns. The benchmarks measure something, just not real-world problem-solving ability.

2. They're Systematically Gamed
Data contamination lets models memorize benchmark datasets during training, achieving perfect scores on familiar questions while failing on slight variations. Worse, models are specifically optimized to excel at benchmark tasks—essentially teaching to the test. When your model has seen the answers beforehand, the test becomes meaningless.

3. They Ignore Production Reality
Benchmarks operate in a fantasy world without your constraints. Latency doesn't exist in benchmarks, but your multi-model chain takes 15+ seconds. Cost doesn't matter in benchmarks, but 10x price differences destroy unit economics. Your infrastructure has real memory limits. Your healthcare app can't hallucinate drug dosages.

Consider this sobering statistic: 79% of ML papers claiming breakthrough performance used weak baselines to make their results look better. When researchers reran these comparisons fairly, the advantages often disappeared.

The A/B Testing Advantage: Finding What Actually Works

So if benchmarks fail us, how do we actually select and optimize LLMs? Through the same methodology that transformed digital products: rigorous A/B testing with real users and real workloads.

The Portfolio Approach

The first insight from production A/B testing contradicts everything vendors tell you: the optimal solution is rarely a single model.

Successful deployments use a portfolio approach. Through testing, teams discover patterns like:

  • Simple queries handled by models that are fast, cheap, and good enough
  • Complex reasoning routed to thinking models
  • Domain-specific tasks sent to fine-tuned specialist models
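The routing patterns above can be sketched in a few lines of Python. The model names and the keyword classifier below are placeholder assumptions for illustration; in practice, the routing rules should come out of your own A/B tests, not hardcoded heuristics:

```python
# Illustrative portfolio router. Model names and the keyword-based
# classifier are placeholder assumptions, not recommendations.

ROUTES = {
    "simple": "small-fast-model",         # cheap, low latency, good enough
    "reasoning": "large-thinking-model",  # slower and pricier, more capable
    "domain": "fine-tuned-specialist",    # e.g. tuned on your clinical notes
}

def classify(query: str) -> str:
    """Crude stand-in for a real intent classifier."""
    q = query.lower()
    if any(kw in q for kw in ("prove", "derive", "step by step")):
        return "reasoning"
    if any(kw in q for kw in ("dosage", "diagnosis")):
        return "domain"
    return "simple"

def route(query: str) -> str:
    return ROUTES[classify(query)]

print(route("What's the capital of France?"))     # small-fast-model
print(route("Prove this identity step by step"))  # large-thinking-model
```

An A/B test then compares this router against a single-model baseline on completion rate, cost, and latency.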

Take v0, Vercel's AI app builder. It uses a composite model architecture: a state-of-the-art model for new generations, a Quick Edit model for small changes, and an AutoFix model that checks outputs for errors.

This dynamic selection approach can slash costs by 80% while maintaining or improving quality. But you'll only discover your optimal routing strategy through systematic testing.

Metrics That Actually Drive Business Value

Production A/B testing reveals the metrics that benchmarks completely miss:

Performance Metrics That Matter:

  • Task completion rate: Do users actually accomplish their goals?
  • Problem resolution rate: Are issues solved, or do users return?
  • Regeneration requests: How often is the first answer insufficient?
  • Session depth: Do simple tasks require multiple interactions?

Cost and Efficiency Reality:

  • Tokens per request: Your actual API costs, not theoretical pricing
  • P95 latency: How long your slowest users wait (the ones most likely to churn)
  • Throughput limits: Can you handle Black Friday or just Tuesday afternoon?

Counterintuitive insight: If an LLM solves a user's question on the first try, you may see fewer follow-up prompts. That drop in "requests per session" is actually positive—your model is more effective, not less engaging.

Making A/B Testing Work for LLMs

Testing LLMs requires adapting traditional experimental methods to handle their unique characteristics:

Handle the Randomness: Unlike deterministic code, LLMs produce different outputs for the same prompt. This variance means:

  • Run tests longer than typical UI experiments
  • Use larger sample sizes to achieve statistical significance
  • Consider lowering temperature settings if consistency matters more than creativity
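To make "larger sample sizes" concrete, here is a minimal two-proportion z-test on task completion rates using only Python's standard library. The numbers are invented, and this is only a sketch of the statistics involved; production experimentation platforms layer variance reduction and sequential testing on top:

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference in task completion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, p_value

# Hypothetical data: Model A completes 840/1000 tasks, Model B 870/1000.
lift, p = two_proportion_z(840, 1000, 870, 1000)
print(f"lift={lift:.3f}, p={p:.4f}")  # a 3-point lift, not yet significant at 0.05
```

Even a 3-percentage-point lift on 1,000 users per arm isn't conclusive, which is why LLM tests often need to run longer than typical UI experiments.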

Isolate Your Variables: Test one change at a time:

  • Model swap (GPT-5 → Claude Opus)
  • Prompt refinement (shorter, more specific instructions)
  • Parameter tuning (temperature, max tokens)
  • Routing logic (which queries go to which model)

Without this discipline, you can't attribute improvements to specific changes.

Set Smart Guardrails: Layer guardrail metrics alongside your primary success metrics. An improvement in task completion that doubles costs might not be worth deploying. Track:

  • Cost per successful interaction (not just cost per request)
  • Safety violations that could trigger PR nightmares
  • Latency thresholds that cause user abandonment
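A guardrail check like this can be a few lines of arithmetic. All numbers below are invented for illustration; the point is to gate on cost per successful interaction rather than cost per request:

```python
# Guardrail sketch: a completion-rate win that blows up cost per success
# should not ship. Numbers are hypothetical.

def cost_per_success(total_cost_usd: float, requests: int, completion_rate: float) -> float:
    return total_cost_usd / (requests * completion_rate)

baseline = cost_per_success(120.0, 10_000, 0.80)   # $0.015 per successful interaction
candidate = cost_per_success(300.0, 10_000, 0.86)  # better completion, 2.5x total cost

# Allow at most a 10% increase in cost per success.
guardrail_ok = candidate <= baseline * 1.10
print(f"baseline={baseline:.4f}, candidate={candidate:.4f}, ship={guardrail_ok}")
```

Here the candidate wins on the primary metric (0.80 to 0.86 completion) but fails the cost guardrail, so it doesn't ship.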

Build Once, Test Forever: Invest in infrastructure that makes testing sustainable:

  • Centralized proxy service for LLM communications
  • Automatic metric collection and monitoring
  • Prompt versioning and management
  • Response validation and safety checking
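As a sketch of the proxy idea, every call can flow through one wrapper that records latency, token counts, and prompt version. The call_model function below is a hypothetical stand-in for your actual SDK call, and the in-memory list stands in for a real metrics pipeline:

```python
import time

def call_model(model: str, prompt: str) -> dict:
    """Hypothetical stand-in for a real SDK call; returns a canned response."""
    return {"text": f"[{model}] answer", "tokens": len(prompt.split()) + 5}

METRICS: list[dict] = []  # in production, a metrics pipeline, not a list

def llm_proxy(model: str, prompt: str, prompt_version: str = "v1") -> str:
    """Single choke point: every LLM call is timed, counted, and tagged."""
    start = time.perf_counter()
    response = call_model(model, prompt)
    METRICS.append({
        "model": model,
        "prompt_version": prompt_version,
        "latency_s": time.perf_counter() - start,
        "tokens": response["tokens"],
    })
    return response["text"]

llm_proxy("small-fast-model", "Summarize this ticket", prompt_version="v2")
print(METRICS[0]["model"], METRICS[0]["tokens"])
```

Because every experiment variant goes through the same choke point, model swaps and prompt changes are automatically instrumented with comparable metrics.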

This investment pays off immediately—making tests easier to run and results more trustworthy.

Embrace Empiricism

Benchmarks aren't entirely useless—use them for initial screening, understanding capability boundaries, and meeting regulatory minimums. But they should never be your final decision criterion.

The AI industry's obsession with benchmarks has created a dangerous illusion. Models that dominate standardized tests struggle with real tasks. The metrics we celebrate have become divorced from the outcomes we need.

For teams building with LLMs, the path is clear:

  1. Start with hypotheses, not benchmarks: "We believe Model X will improve task completion," not "Model X scores higher"
  2. Test with real users and real data: Your production environment is the only benchmark that matters
  3. Measure what moves your business: User satisfaction, cost per outcome, and regulatory compliance
  4. Iterate based on evidence: Let data, not vendor claims, drive your model selection

Recall the GPT-5 launch: despite its dominance on coding benchmarks, developers continued to prefer Anthropic's models and tooling for real-world use. The benchmarks aren't exactly lying—they're just answering the wrong questions. A/B testing asks the right ones: Will this solve my users' problems? Can we afford it at scale? Does it meet our requirements?

In the end, the best benchmark for your AI isn't a standardized test. It's users voting with their actions, costs staying within budget, and your application delivering real value.

Everything else is just numbers on a leaderboard.

Further Reading

GrowthBook Version 4.1
Releases
4.1
Product Updates

Sep 9, 2025

This release continues the momentum of GrowthBook 4.0 by adding two of our most requested Enterprise features - Holdouts and Experiment Dashboards. Plus, we’ve made several significant enhancements to our integrations with AI coding tools and our MCP server capabilities. If you’re not yet using GrowthBook with your AI coding tools, we highly recommend it!

Read on to learn more about these features and everything else we’ve been working on these past 2 months.

Holdouts

GrowthBook Holdouts UI showing a control group maintained separately from users receiving new features

Holdout experiments measure the long-term impact of features by maintaining a control group that doesn't receive new functionality. While most users experience your latest features and improvements, a small percentage remain on the original version, providing a baseline for measuring cumulative effects over time. Read more about holdouts.

Experiment Dashboards

GrowthBook Experiment Dashboards showing key metrics, dimension breakdowns, and context in a single shareable view

Experiment Dashboards let you create tailored views of an experiment. Highlight key insights, add context, and share a clear story with your team. For example, highlight the key goal metric results, show an interesting breakdown by dimension, and link to supporting external documents, all in a single view. Dashboards are available for all Enterprise customers. We have a lot planned for this, so stay tuned!

AI Features

This release integrates AI to accelerate your workflows in GrowthBook. Auto-summarize experiment results, get help writing SQL, improve hypotheses, detect similar past experiments, and more. These features are available even if you’re self-hosted, just supply an OpenAI API key. See the AI features in action or read detailed information on how these features work.

MCP Updates

We've updated our MCP Server to allow you to create experiments directly from your AI coding tool of choice, without needing to context switch to GrowthBook. This change unlocks a bunch of new, exciting workflows, and we can't wait to see how you use it!

Vercel Native Integration

We're excited to announce that GrowthBook is now available as a native integration in the Experimentation category on the Vercel Marketplace! This integration makes it easier than ever to add feature flagging and A/B testing to your Vercel projects, with streamlined setup, unified billing, and ultra-low latency performance. Read more on our announcement post.

Pre-computed Dimensions

You can now pick a set of key experiment dimensions and pre-compute them along with the main experiment results. This allows for more efficient database queries and instant dimension breakdowns in the UI. Read more in our docs.

FerretDB Support

GrowthBook now supports FerretDB as a MongoDB-compatible open-source database backend

FerretDB is a MongoDB-compatible, open-source database that is free to use. It serves as a drop-in replacement for MongoDB, converting MongoDB wire protocol queries to SQL and using PostgreSQL as its backend storage engine. We're pleased to support FerretDB officially!

Sanity CMS Integration

GrowthBook feature flags integrated with Sanity CMS for testing content variations

Sanity is a real-time content backend for all your text and assets. You can now use GrowthBook feature flags to seamlessly test different content variations within Sanity. Check out our announcement video and tutorial or our docs.

The 4.1 release includes over 150 commits, way more than we can quickly summarize here. View the release details on GitHub for a more comprehensive list.  As always, we love feedback - good and bad. Let us know what you think of the new features and what you want to see as part of 4.2!

Feedback Loops Are the Next Breakthrough in Agentic Coding
Experiments
AI

Sep 8, 2025

At first glance, feature flag and experimentation platforms don’t seem closely tied to AI. But at GrowthBook, we see it differently. These platforms don’t just test whether a feature works technically—they test whether it delivers the business outcomes developers intended. That distinction is critical, and it’s exactly the kind of feedback loop AI coding platforms need to evolve.

Research shows that only about one-third of software features actually deliver the expected results. Another third make little difference. And the final third actively harm key metrics like conversion or engagement. Without structured feedback, teams repeat the same costly mistakes.

Now imagine an AI that could warn you before you invested weeks of engineering effort: “This feature is unlikely to move the needle.”  That’s the future we believe is coming.

The Next Frontier for LLMs

Most AI coding tools today help developers build features exactly as they always have. Which means they’re just as likely to produce underperforming features. The next breakthrough will be AI systems that understand what to build and how to build it—drawing on millions of past experiments.

OpenAI has already hinted at this direction. In its GPT-5 Prompting Cookbook, it recommends creating a rubric to evaluate a development plan, then iterating until the plan earns top marks. Now imagine if that rubric weren’t handcrafted, but instead learned automatically from thousands of feature tests. AI wouldn’t just critique plans. It would know what success looks like—and guide you there directly.

That’s a leap toward more intelligent, agentic AI—not only in coding, but also in fields like finance and healthcare, where feedback loops are abundant.

Bringing Agentic Coding Into Your Workflow Today

The good news: you don’t need to wait for the future. With GrowthBook’s MCP server, AI coding tools can already tap into your past experiments to build intelligent rubrics. They can:

  • Design and deploy experiments for the features they create
  • Measure results in real time against your KPIs
  • Iterate continuously until outcomes align with business goals

The scale of experimentation today is staggering. GrowthBook customers collectively run hundreds of thousands of experiments each month—and that number is growing. AI can now unlock insights from this volume of data in ways that were never possible before.

The Bigger Impact

Building a culture of experimentation does more than improve feature delivery. It accelerates innovation, drives better customer experiences, and creates measurable gains in usage, retention, and sales.

Feedback loops will make agentic AI smarter, faster, and more valuable to every software team. The future of coding isn’t just about writing code—it’s about learning from every outcome. And with the right experimentation infrastructure, that future is already here.

How GrowthBook Holdouts Work Under the Hood
Experiments
4.1
Analytics

Sep 3, 2025

Holdouts answer a deceptively simple question: “What did all of this shipping actually do?” In GrowthBook, a holdout keeps a small, durable control group away from new features, experiments, and bandits, then compares them to everyone else over time. That comparison is your long-run, cumulative impact—no guesswork, no complicated de-biasing algorithms.

You can read more about holdouts in this blog post, Holdouts in GrowthBook: The Gold Standard for Measuring Cumulative Impact and in our documentation. But in this post, I’m going to talk about some of the nitty-gritty choices we made and why we made them.

We measure everything that happened, not just shipped winners

There are two different approaches out there to measuring impact with holdouts:

  • “Measure everything” approach (used in GrowthBook). The holdout group stays off all new functionality; everyone else proceeds as normal—experimenting, shipping, backtracking, and iterating. We then compare a small, like-for-like measurement subset of the general population to the holdout. That design deliberately measures the full experience of what happened over the quarter, not just the curated list of winners. It’s a more faithful assessment of the world your users actually saw.
  • “Clean-room” approach. The holdout group still stays off all new functionality. However, you also withhold a holdout test group that only sees shipped features. This slice is used to compare against your holdout; meanwhile, the remaining traffic is where day-to-day experiments run.

Here’s another way to think about it. Imagine your traffic is split into 3 groups with a 5% holdout:

  • Holdout (5%): The same under both approaches. Never sees any new feature.
  • Measurement (5%): The key difference is here. In the “clean-room” approach, they are held out until a feature is shipped, and then get the winning variation. In the “measure everything” approach, they are identical to the General group, and are used to experiment and ship.
  • General (90%): The same under both approaches. Used to experiment and ship.
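Hash-based bucketing is one common way to implement a stable split like this, because the same user always lands in the same group without any stored state. This is an illustrative sketch of the technique, not GrowthBook's actual implementation:

```python
import hashlib
from collections import Counter

def bucket(user_id: str, salt: str = "h1-holdout") -> str:
    """Deterministically assign a user to holdout/measurement/general.

    (Illustrative hash-based bucketing; not GrowthBook's actual code.)
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    if point < 0.05:
        return "holdout"      # 5%: never sees new features
    if point < 0.10:
        return "measurement"  # 5%: like-for-like comparison group
    return "general"          # 90%: experiments and ships as usual

counts = Counter(bucket(f"user-{i}") for i in range(100_000))
print({group: round(n / 100_000, 3) for group, n in counts.items()})
```

Changing the salt at the end of a holdout period effectively reshuffles users, which is how a fresh holdout can start for the next cycle.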

How do they compare?

The “clean-room” approach provides you with the most accurate assessment of what you shipped. You get a sample that only sees the shipped features and does not have a history of seeing features you decided not to ship. This can really help you know if “what you shipped worked.”

However, it has 3 major downsides:

  1. It leaves you blind to what actually happened to the vast majority of users along the way (failed experiments, feature false starts, etc.). If you want to know if your overall program is headed in the right direction, you have to include the costs of running experiments, exposing users to losing variations, and more. While “measure everything” may be a worse estimate of simply the cumulative impact of winners, it more accurately represents the impact your team had. What’s more, not knowing what is going on with 90+% of your entire user base is quite a cost to pay.
  2. Furthermore, it may actually be a worse estimate of the impact going forward. If seeing past failed experiments better represents how future failed experiments may interact with your shipped features, then you actually want your holdout estimate to include these past failed experiments.
  3. You end up with lower power for your regular tests. By splitting another 5% off of the general population, all of your regular tests will have 5% less traffic to ship. This could slow down your overall experimentation program and lead to worse decisions.

For these reasons, at GrowthBook, we opted for the approach where you “measure everything.”

How feature evaluation works: prerequisites

Under the hood, Holdouts rely on prerequisites. Before any feature rule or experiment is evaluated, GrowthBook checks the holdout prerequisite and diverts holdout users to default values. This works just like a regular rule in your feature evaluation flow, making it easy to understand what's happening.
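The evaluation order can be sketched as: check the holdout prerequisite first, and only run the normal rules for users who are not held out. This is a simplified illustration, not GrowthBook's SDK code:

```python
def evaluate_feature(feature: dict, user: dict, holdout_members: set) -> str:
    """Prerequisite-first evaluation: holdout users get the default value
    before any feature rules or experiments run. (Simplified sketch; the
    real SDK also handles targeting, rollouts, and nested prerequisites.)"""
    if feature.get("holdout") and user["id"] in holdout_members:
        return feature["default"]          # holdout short-circuits everything
    for rule in feature.get("rules", []):  # normal flow for everyone else
        if rule["condition"](user):
            return rule["value"]
    return feature["default"]

new_checkout = {
    "holdout": "q3-holdout",
    "default": "classic",
    "rules": [{"condition": lambda u: True, "value": "one-click"}],
}

print(evaluate_feature(new_checkout, {"id": "u1"}, {"u1"}))  # classic
print(evaluate_feature(new_checkout, {"id": "u2"}, {"u1"}))  # one-click
```

Because the check runs before every rule, a held-out user stays on defaults no matter how many features or experiments you add later.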

GrowthBook experiment creation modal showing holdout prerequisite field selected by default

Everyone else flows through your rules as usual. Because that evaluation triggers on every included feature or experiment, holdout exposure can occur at different moments in a user’s journey.

That has two important implications for analysis:

  • Prefer metrics with lookback windows. Since users can encounter the holdout at varying times, fixed conversion windows anchored to a single “first exposure” are often ill-posed for long-running, multi-feature measurement. GrowthBook enforces this: you can’t add conversion-window metrics to a holdout; instead, use long-range metrics without windows or with lookback windows.
  • Use the built-in Analysis Period when you’re ready to read the holdout: freeze new additions, keep splitting traffic, and let GrowthBook apply dynamic lookback windows per experiment/metric so you measure exactly the period you care about.

Compliance by default: project-level enforcement

Holdouts are scoped to Projects—a core GrowthBook organizing unit for features, metrics, experiments, SDKs, and permissions. Assign a holdout to a project and, from that point on, new features, experiments, and bandits created in that project inherit the holdout by default (there’s an escape hatch if you truly need it). This keeps your baseline clean without relying on every engineer, product manager, or data analyst remembering to use the holdout.

Under the hood, each time your team creates an experiment or a feature in a Project, we check if that Project has any associated holdouts. If there is one, we pre-select it, and allow you to opt out with a warning. If there is more than one holdout, we select the first one by default, but experimenters can switch their selected holdout. We recommend you avoid this situation. If there are any holdouts without project scoping, they are available to all projects, and we recommend avoiding this unless you are running a global holdout.

This adds one more reason to use Projects.

TL;DR

Get started by reading our docs or by signing up.

Holdouts in GrowthBook: The Gold Standard for Measuring Cumulative Impact
Experiments
4.1
Analytics
Product Updates

Sep 3, 2025

Many successful product teams iterate quickly, running simultaneous experiments and launching new features weekly. Measuring the overall effect of these tests is critical to understanding the team’s impact and to help set product direction. However, actually measuring this cumulative impact can be quite difficult.

Holdouts in GrowthBook provide a simple way to keep a true control group across multiple features and measure long-run cumulative impact. It’s the gold standard way to answer the question: “What did all of this shipping actually do to my key metric?”

Why holdouts matter

Cumulative impact is important to measure.

Ensuring that your experimentation program helps you ship winning features and avoid losing features sets your product direction. Knowing which teams are driving the most impact can help you understand what’s working and what isn’t. Teams that are successfully moving the needle may deserve more investment to continue driving their goals upward. If a team struggles to have a significant impact, they may have hit diminishing returns, they may need a new direction, or the product may have reached a certain level of maturity, making gains more difficult to achieve.

Cumulative impact is hard to measure.

Looking at the overall trend in your goal metrics is not enough. Forces beyond your control or seasonality can dictate goal metric movements and can mislead you. With constant shipping across product teams, attributing lift to individual teams can be nearly impossible.

Other approaches try to sum the effects of individual experiments and apply some bias reduction, like the one in our own Insights section. Almost always, the summed impacts of individual experiments overstate the final effect, due to selection bias, diminishing returns over time, and cannibalizing interactions with other experiments. This isn't just theoretical: Airbnb documented how a naive sum overstates impact by 2x when compared with a holdout, and bias-corrected estimates still overstate impact by 1.3x.

Holdouts as the solution.

A well-run holdout exposes a stable baseline of users to none of your new features for a period of time, then compares them to the general population. Because a holdout can run for longer on a small percentage of traffic, you capture longer-run effects. Furthermore, it allows you to stack all of your features and experiments into one test, capturing cumulative and interactive effects. Finally, it uses reliable statistics and inference from experiments to make holdouts the gold standard for cumulative, long-run impact.

How Holdouts work in GrowthBook

At a high level:

  • Holdout group: A small percentage of traffic (usually users) is diverted away from new features, experiments, and bandits.
  • General population: Everyone else—experimenting and shipping as usual. We then select a small subset of the general population as a measurement group to compare against the holdout group.

As you launch new features and experiments, all new traffic checks whether they should be diverted to the holdout before seeing the new feature or experiment values.

When an experiment goes live, the holdout group is completely excluded while the general population gets randomized into one condition or another. Once an experiment is shipped, all users in the general population will receive the shipped variant.

This means that the holdout measures the cumulative impact of using your product, which includes all the false starts and the test period for the experiments that didn’t ship, because that is a true record of what actually happened in the past quarter.

Only once the holdout is ended will users in the holdout group receive any shipped features.

Using your Holdout

Facebook and X product teams ran 6-month holdouts for all their features, withholding 5% or less of traffic, and then used the cumulative impact in reporting and to understand if they had correctly set their product direction. They then released the holdout and started a new one for the next 6-month period.

Other teams at X were also using long-run, low-traffic holdouts on a bundle of critical features to ensure they were continuing to provide value.

  • Define the population size: Pick a sample large enough to measure your cumulative impact, but beware that larger population sizes mean you will end up with less traffic for your day-to-day experiments and fewer users with the latest set of features.
  • Define the active period length (half a month to a quarter): Pick a period long enough to accumulate some wins.
  • During the active period: Ship normally. Keep adding experiments and launching features. The holdout quietly accumulates evidence.
  • Analysis period (2–4 weeks): Freeze adding new changes, let effects settle, and compare cumulative impact with our automatic lookback windows applied to measure only the analysis period.

Product teams at X would run a holdout for a half a year, adding new features to the holdout over the course of 6 months. Then, they would use the following quarter to get a reliable, long-run measure of their cumulative impact.

So, a year would look like this:

Timeframe   Holdout Status
Q1          h1-holdout (active)
Q2          h1-holdout (active)
Q3          h2-holdout (active), h1-holdout (measurement only)
Q4          h2-holdout (active)

Tips & Trade-offs

  • Project-scope your Holdout: If you want to measure the impact of a given team’s set of features, have that team work within one or more GrowthBook Projects and have the Holdout automatically apply to their features and experiments.
  • Be wary of the user experience: A small group won’t see new features—keep the percentage small and the period finite.
  • Be ready to keep feature flags in code: Holdouts require feature flags to stick around through the analysis period, so prepare your workflows for longer-lasting features.
  • Metrics: Favor durable outcomes (revenue, retention, engagement) and use lookbacks for clean analysis windows so that you only measure the impact once all experiments have had a chance to bed-in.

Get started

  • Create your first holdout in the app (Experiments → Holdouts) and scope it to a project you want to measure impact within.
  • Pick 2–4 durable metrics that your team is hoping to improve over the long run.

Read more about holdouts in our Knowledge Base and see our documentation to help run your first holdout.

Building in the AI Era: Lessons from Past Technological Revolutions
AI

Jul 29, 2025

We are living through a generational technology shift—one that comes along only once or twice in a lifetime, reshaping how humans interact with the world. Just as electricity, automobiles, computers, the internet, and mobile computing were transformative, AI is doing the same today. However, history shows us that in the early days of a new technology, people often misunderstand the power that it unlocks. This article will examine some of the historical technology shifts and the lessons we can learn from them. 

Lessons from History

Practical applications of electricity began to take root in the 1880s and '90s, when Edison opened the first electrical power station in Manhattan. The uses were initially targeted at consumers, with wealthy New Yorkers able to electrify their homes and replace their gas lights with electric ones. Industry, on the other hand, was slow to adapt, despite the evident advantages. Most industries simply replaced steam-powered equipment with electric equivalents, or added electric lights, without considering how they could operate differently.

The engineering breakthrough came when Henry Ford reimagined the factory in the 1910s. He utilized electric motors' precise speed control and distributed power to create the moving assembly line in 1913—a feat impossible with centralized steam engines that required complex systems of belts and pulleys. These improvements cut the Model T build time from 12 hours to about 93 minutes—a systemic redesign that enabled scale, lowered costs, and transformed labor and manufacturing fundamentally.

A similar lesson comes from the introduction of the television. In the early days of television, content was heavily borrowed from radio—simply filmed broadcasts of radio shows without inventing for the new medium. The real shift came when creators embraced television's potential: drama anthologies, magazine-format shows like Today and The Tonight Show, recording and editing footage from multiple cameras, and new storytelling formats were designed for television. By the 1950s, TV overtook radio: between 1950 and 1960, U.S. household ownership jumped from about 9 percent to over 60 percent, nearing 90 percent in the early 1960s.

The lesson: Early adopters who treat a new medium like the old one often miss its full value. The true winners reimagine processes, experiences—and even entire business models—when they adopt these new technologies. 

Parallels with Today’s AI Adoption

It is evident from the above examples that there are parallels with the adoption of AI into our products and businesses. Pressure to add AI, or to be “the AI for X” in a given industry, results in many uninspired implementations. Many organizations today bolt on an AI assistant—like lighting a few bulbs in a steam-powered factory—but miss the opportunity to reimagine workflows end-to-end. The real transformation comes from rethinking the user experience around what AI makes possible.

The difference between the past technological shifts and the AI one we’re experiencing today is the incredible velocity of the change.

  • It took about 13 years for Ford to sell 1 million cars. 
  • It took Google 1 year to reach 1 million searches per day. 
  • Apple’s iPhone launched in 2007, heralding the smartphone revolution, and sold 1 million units in just 74 days. 
  • ChatGPT, on the other hand, reached 1 billion queries per day in under a year—a milestone that took Google over 10 years to achieve. 

Within just two months of its November 2022 launch, ChatGPT surpassed 100 million users—the fastest adoption rate ever recorded for a consumer software product. This rate of adoption suggests that companies that don't learn from history and adapt to the AI era face an existential threat, not just a competitive disadvantage.

GrowthBook’s Journey with AI

At GrowthBook, our initial step was adding the lightbulb: we launched an AI chatbot to help users navigate our documentation (a helpful concierge, if you will). 

Simultaneously, we conducted several brainstorming sessions to reevaluate our product and explore the potential impact of AI on our business. We ran 11-star brainstorming sessions and planned a roadmap to reimagine what AI will mean in the A/B testing and product analytics space. We built Weblens.ai as a demonstration of some of the features AI can unlock for A/B testing—and we have many more coming very soon. 

Conclusion

From electrification to television to AI, each technological shift has rewarded those who reimagined systems entirely. They didn’t just adopt new tools—they rewrote workflows, content, and the way they delivered value. 

Here are the lessons:

  • Treat AI as a new paradigm—not just as an add-on. Like Ford reengineering production or TV creators abandoning radio formats, design products from an AI-native perspective.
  • Focus on user journeys and tasks that AI can redefine—insights, decisions, personalization—rather than isolated features shoe‑horned onto existing interfaces.
  • If you don’t adapt now, someone else will. AI has experienced an explosive rate of growth, resulting in significant productivity gains and a reduction in the time it takes to bring products to market.
GrowthBook is Now Available on the Vercel Marketplace

Jul 24, 2025

We're excited to announce that GrowthBook is now available as a native integration in the Experimentation category on the Vercel Marketplace! This integration makes it easier than ever to add feature flagging and A/B testing to your Vercel projects, with streamlined setup, unified billing, and ultra-low latency performance.

What This Means for Developers

The Vercel Marketplace includes an Experimentation category specifically designed for developers who want to implement feature flags and run experiments without the complexity of managing separate platforms. As one of the first experimentation providers in this new category, GrowthBook brings enterprise-grade feature management and experimentation directly into your Vercel workflow.

With this native integration, you can:

  • Access GrowthBook from Vercel: View and manage flags and experiments without leaving the Vercel dashboard
  • Sync to Vercel Edge Config: Automatically sync your feature flags to Vercel Edge Config for near-zero latency flag evaluation
  • Unified billing: Manage GrowthBook billing through your existing Vercel account
  • Integrate GrowthBook and Vercel SDKs seamlessly: Use GrowthBook's SDKs or integrate with Vercel's Flags SDK for simplified setup

Built for Performance and Scale

Traditional feature-flagging solutions often introduce latency via API calls, leading to flickering web pages, missed analytics, poor UX, and skewed experimentation results. Our Vercel integration leverages Edge Config to eliminate this bottleneck entirely. When you enable Edge Config syncing, your feature flags are automatically distributed to Vercel's global edge network, allowing your applications to evaluate flags without making external API calls.

This means no flickering, no delays, and no compromises on performance—just real-time control over your application features with sub-millisecond flag evaluation times.
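To make the performance argument concrete, here is a toy sketch (not GrowthBook's actual SDK or bucketing algorithm) of edge-local flag evaluation: flags are synced ahead of time as a data payload, so each request is answered with a deterministic hash lookup instead of an external API call. The flag names, payload shape, and hashing scheme below are all illustrative assumptions.

```python
import hashlib

# Illustrative only: flags synced ahead of time (as Edge Config syncing
# provides), evaluated locally with a deterministic hash so no network
# call happens at request time. Payload shape and hash are assumptions.
SYNCED_FLAGS = {
    "new-checkout": {"enabled": True, "rollout": 0.5},  # 50% of users
    "dark-mode": {"enabled": False, "rollout": 0.0},
}

def bucket(user_id: str, flag_key: str) -> float:
    """Deterministically map a (flag, user) pair to [0, 1)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

def is_on(flag_key: str, user_id: str) -> bool:
    flag = SYNCED_FLAGS.get(flag_key)
    if not flag or not flag["enabled"]:
        return False
    return bucket(user_id, flag_key) < flag["rollout"]
```

Because the same user always hashes to the same bucket, evaluation is both instant and consistent across requests—which is why this pattern avoids the flicker that remote flag fetches can cause.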

How It Works

Getting started is incredibly straightforward:

  1. Install from the Marketplace: Navigate to the Vercel dashboard, select Integrations, then Browse Marketplace. Find GrowthBook in the Experimentation category.
  2. Choose Your Plan: Select between our free Starter or Pro plan. Pro plan billing is handled directly through Vercel for a unified experience.
  3. Connect Your Projects: The integration creates a new GrowthBook organization and automatically connects it to your selected Vercel projects.
  4. Start Building: Create feature flags, set up A/B tests, and manage rollouts in GrowthBook, with direct access from your Vercel dashboard.
  5. Dive Deeper: Use GrowthBook's full analytics suite to understand experiment results

Perfect for Next.js Applications

If you're building with Next.js, the integration works seamlessly with Vercel's Flags SDK. You can use the newly released @flags-sdk/growthbook provider to load experiments and flags with zero configuration. For other frameworks, GrowthBook's comprehensive SDK library supports every major language and platform.

Enterprise-Grade Features, Startup-Friendly Pricing

This integration brings all of GrowthBook's powerful features to Vercel users:

  • Advanced Targeting: Target users based on attributes, location, device type, and custom rules
  • Statistical Analysis: Built-in Bayesian and Frequentist statistics engines for reliable experiment results
  • Multi-armed Bandits: Automatically optimize traffic allocation based on performance
  • Comprehensive Analytics: Track any metric and understand the full impact of your experiments
  • Warehouse Native: Use your existing data stack (Snowflake, BigQuery, Databricks, ClickHouse, Postgres, etc.)

Our pricing remains developer-friendly, with a generous free tier that includes unlimited feature flags and experiments for up to 3 team members. The Pro plan scales with your needs and is now conveniently billed through Vercel.

The Next Step in Your Development Workflow

Modern web development requires the ability to test, iterate, and optimize continuously. With GrowthBook now available on the Vercel Marketplace, you can add sophisticated feature management and experimentation capabilities to your projects in minutes, not days.

Whether you're rolling out a new feature to a subset of users, running A/B tests to optimize conversion rates, or implementing progressive rollouts to minimize risk, GrowthBook provides the tools you need without slowing down your development workflow.

Get Started Today

Ready to start experimenting? Install the GrowthBook integration from the Vercel Marketplace today. It's available to users on all Vercel plans, and you can be up and running with your first feature flag in under 60 seconds.

Install GrowthBook on Vercel Marketplace →

For questions or support, join our Slack community or check out our documentation for Next.js integration details.

GrowthBook Version 4.0

Jul 9, 2025

We shipped so many new features in our June Launch Month that we decided it deserved a major version increase. Version 4.0 brings a huge array of new features. Here’s a quick summary of everything it includes.

GrowthBook MCP Server

AI tools like Cursor can now interact with GrowthBook via our new MCP server. Create feature flags, check the status of running experiments, clean up stale code, and more.

Safer Rollouts

Building upon our Safe Rollouts release from the last version, we added gradual traffic ramp-up, auto rollback, a smart update schedule, and a time series view of results.  All of these combine to add even more safety around your feature releases.

Decision Criteria

You can now customize the shipping recommendation logic for experiments.  Choose from a “Clear Signals” model, a “Do No Harm” model, or define your own from scratch.

Search Filters

We’ve revamped the search experience within GrowthBook to make it easier to find feature flags, metrics, and experiments.  Easily filter by project, owner, tag, type, and more.

Insights Section

We added a brand-new left nav section called “Insights” with a bunch of tools to help you learn from your past experiments.

  • The Dashboard shows velocity, win rate, and scaled metric impact by project.
  • Learnings is a searchable knowledge base of all of your completed experiments.
  • The Experiment Timeline shows when experiments were running and how they overlapped with each other.
  • Metric Effects lists the experiments that had the biggest impact on a specific metric.
  • Metric Correlations lets you see how two metrics move in relation to each other.

SQL Explorer 

We launched a lightweight SQL console and BI tool to explore and visualize your data directly within GrowthBook, without needing to switch to another platform like Looker.

Managed Warehouse

GrowthBook Cloud now offers a fully managed ClickHouse database that is deeply integrated with the product.  It’s the fastest way to start collecting data and running experiments on GrowthBook.  You still get raw SQL access and all the benefits of a warehouse-native product.

Feature Flag Usage

See analytics about how your feature flags are being evaluated in your app in real time.  This is built on top of the new Managed Warehouse on GrowthBook Cloud and is a game-changer for debugging and QA.

Vercel Flags SDK

GrowthBook now has an official provider for the Vercel Flags SDK.  This is now the easiest way to add server-side feature flags to any Next.js project. We have an even deeper Vercel integration coming soon to make this experience even more seamless.

Official Framer Plugin

You can now easily run GrowthBook experiments inside your Framer projects.  Assign visitors to different versions of your design (like layouts, headlines, or calls to action), track results, and confidently choose the best experience for your audience.

Personalized Landing Page

There’s a new landing page when you first log into GrowthBook.  Quickly see any features or experiments that need your attention, pick up where you left off, and learn about advanced GrowthBook functionality to get the most out of the platform.

New Experimentation Left Nav

There’s a new “Experimentation” section in the left nav. Experiments and Bandits now live within this section, along with our Power Calculator, Experiment Templates, and Namespaces.  We’ll be expanding this section soon with Holdouts and more, so stay tuned!

REST API Updates

  • Filter the listFeatures endpoint by clientKey
  • Support partial rule updates in the putFeature endpoint
  • New Queries endpoint to retrieve raw SQL queries and results from an experiment
  • Added Custom Field support to feature and experiment endpoints
  • New endpoints for getting feature code refs
  • New endpoint to revert a feature to a specific revision

Performance Improvements

We’ve significantly reduced CPU and memory usage when self-hosting GrowthBook at scale. On GrowthBook Cloud, we’ve seen a roughly 50% reduction during peak load, leading to lower latency and virtually eliminating container failures in production.

Flavors of Experimentation in GrowthBook

Jul 7, 2025

Digital experiments serve a large variety of purposes. You may want to learn whether you’re building the right thing, you might want to safely release changes without introducing regressions, or you might just want to pick a winner between some easy-to-build options.

But one tool won’t be best for all of them. A classic A/B test might struggle if you throw 10 options at it, or it might take too long to reach a clear result if your goal is just to do no harm.

That's why GrowthBook provides you with 3 different tools, all powered by our state-of-the-art statistics engine and performant SDKs.

Experiments for learning, Safe Rollouts for releasing safely, and Bandits for picking a winner among many.

When to use a classic Experiment

Use classic Experiments when you want to:

  • Build a better product or website
  • Learn about customer behavior as accurately as possible
  • Choose between only 2-3 options, or among a few options that demanded significant design, engineering, and product time

Classic experiments in GrowthBook are great at providing you the clearest answer to the difference in your key goal metrics between 2 or 3 variations. For instance, if you've spent weeks designing and building a new checkout flow, you need precise measurements of its impact on conversion rates compared to your current design.

A screenshot showing the trend in a goal metric in a GrowthBook Experiment
GrowthBook experiment results showing goal metric trend and 1.36% lift over time

You can reduce variance using tools like CUPED. Or you can use sequential testing and multiple-comparisons corrections to best balance false-positive rates and faster shipping. You can also add Dimensional analyses to slice-and-dice your results and learn more about how what you are building affects your users.

Furthermore, classic Experiments provide accurate experimental effects that form the basis for a historical library. This data becomes invaluable for driving Insights about the overall performance of your product development.

Histogram showing the spread of historical Experiment effects on a key metric
GrowthBook Experiment Insights histogram showing distribution of historical experiment lifts on a key metric

When to use a Safe Rollout

Use a Safe Rollout when you want to:

  • Release confidently by rolling back as soon as there is a clear regression
  • Ship automatically as long as you're doing no harm
  • Do lightweight experimentation with every release

Safe Rollouts are built right into GrowthBook Feature Flags and are fast and easy to set up. They use one-sided sequential tests and automatic traffic ramp-ups to ensure that when a guardrail fails, your feature rolls back without inflating false-positive rates. This way, you can make experimentation a part of every release.
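The core mechanic—checking a guardrail repeatedly without inflating false positives—can be sketched as follows. This toy version uses a conservative Hoeffding bound with the error budget split evenly across planned looks; GrowthBook's actual Safe Rollouts use one-sided sequential tests, so treat this purely as an illustration of why naive repeated significance checks need correcting.

```python
from math import log, sqrt

# Toy guardrail check with repeated looks. Splitting alpha across the
# planned looks (plus the conservative Hoeffding bound) ensures that
# peeking many times doesn't inflate the false-rollback rate.
def should_roll_back(errors_treat, n_treat, errors_ctrl, n_ctrl,
                     num_looks=10, alpha=0.05):
    p_treat = errors_treat / n_treat
    p_ctrl = errors_ctrl / n_ctrl
    a = alpha / num_looks  # per-look error budget
    # Hoeffding bound on each arm's mean, combined for the difference
    bound = sqrt(log(2 / a) / (2 * n_treat)) + sqrt(log(2 / a) / (2 * n_ctrl))
    return (p_treat - p_ctrl) > bound  # clear regression -> roll back
```

With 1,000 users per arm, a jump from a 5% to a 20% error rate triggers a rollback, while a 5% vs 5.5% difference—plausibly just noise—does not.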

GrowthBook Safe Rollout showing a rule that is safe to ship as no guardrails are failing
GrowthBook Safe Rollout showing no guardrail failures and ready to ship status

Imagine you've refactored an API endpoint for better performance. Your goal isn't to learn whether it's 5% or 8% faster. You just need confidence that it won't break anything. Safe Rollouts lets you release to 5% of users, automatically scale up if metrics look healthy, and instantly roll back if error rates spike.

While Safe Rollouts can flag early regressions more confidently than a classic Experiment, they aren’t as well suited to building a library of effects or producing highly precise estimates. They do use CUPED, but in the service of detecting regressions more quickly, not of getting the most precise overall lift. Safe Rollouts are also restricted to just 2 variations, since they’re designed to safely release a new feature rather than test between multiple arms.

When to use a Bandit

Use a Bandit when you want to:

  • Pick a winner between 4+ different variations that were easy to build
  • Reduce traffic going to variations that are struggling early in an experiment

Time series of the probability of a variation winning in a multi-armed bandit
GrowthBook multi-armed bandit showing Variation 4 winning probability increasing over time across five variations

Multi-armed Bandits optimize traffic in an experiment by directing more traffic to better-performing variations. For example, you're running a week-long sale and want to test different CTAs. By the time you completed a classic Experiment, the sale would be over, and you would've lost out on sales. With Bandits, traffic automatically shifts toward the winning CTA during the sale, maximizing conversions.

This provides dual benefits: better variations get more statistical power from increased traffic, while fewer users see worse-performing options, protecting your bottom line.
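The traffic-shifting behavior can be sketched with Thompson sampling, a common bandit algorithm: each arm keeps a Beta posterior over its conversion rate, and every visitor is served the arm whose sampled rate is highest, so traffic naturally concentrates on the best performer. The conversion rates are made up, and GrowthBook's actual weighting algorithm differs (e.g., its period-specific weighting for seasonality).

```python
import random

# Toy Thompson-sampling bandit over 4 CTA variations. Synthetic data;
# illustrative only, not GrowthBook's implementation.
random.seed(7)
true_rates = [0.04, 0.05, 0.12, 0.05]   # hidden conversion rates
wins = [0] * 4
losses = [0] * 4

for _ in range(20_000):                  # simulated visitors
    # Sample a plausible rate for each arm from its Beta posterior,
    # then serve the arm with the highest sample.
    samples = [random.betavariate(1 + wins[i], 1 + losses[i])
               for i in range(4)]
    arm = max(range(4), key=lambda i: samples[i])
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

traffic = [wins[i] + losses[i] for i in range(4)]
# Traffic concentrates on the best-performing arm over time.
```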

GrowthBook’s Bandits stand apart from the field by ensuring a consistent user experience during the bandit and by using period-specific weighting to deal with seasonality (e.g., day-of-the-week effects) in your experiment sample. However, Bandits in general are known to produce somewhat inaccurate top-level estimates of experiment lifts, so they are best suited to picking a winner among many, rather than learning precisely how much one variation outperformed another.

GrowthBook provides the tools you need

All 3 forms of experimentation (classic Experiments, Safe Rollouts, and multi-armed Bandits) use the power of randomization and GrowthBook’s state-of-the-art statistics engine to provide you with the right answers to the right questions.

Ready to choose the right experimentation approach for your next project? Get started with GrowthBook in under 5 minutes.

Ready to ship faster?

No credit card required. Start with feature flags, experimentation, and product analytics—free.
