The Uplift Blog

GrowthBook vs LaunchDarkly: Why Developers Choose GrowthBook for Feature Flagging
Feature Flags

Feb 22, 2026

A feature flag platform ends up in critical paths: request handlers, render paths, mobile startup flows, and incident response. Most development teams still evaluate tools by feature checklists and pricing pages. That misses what tends to matter after adoption: runtime behavior, failure modes, testability, and whether measurement becomes a second system of record.

This article compares GrowthBook and LaunchDarkly across three architectural planes:

  1. Runtime plane: How flag definitions propagate, how targeting decisions are evaluated, update models, hot-path dependencies, and outage behavior.
  2. Measurement plane: How rollout exposure connects to outcomes, and whether measurement becomes a second system of record.
  3. Control plane: Governance, approvals, environments, enterprise integrations, and deployment model.

Each section focuses on real-world behavior and operational tradeoffs, not feature checklists. Both platforms cover the control-plane basics and provide rollout safety features, but they optimize for different priorities.

LaunchDarkly tends to win when you want observability-connected safety automation and enterprise workflow/compliance plumbing (ServiceNow, Terraform, broader certification programs).

GrowthBook tends to win when you want deterministic local evaluation, SQL-native impact measurement aligned with your database or warehouse, self-hosting options, and seat-based pricing predictability.

Pick based on whether you prioritize managed safety and workflow automation or runtime predictability and measurement alignment with your existing data systems.

The three planes of feature flagging

Every feature flag platform operates across three planes:

Diagram 1: The three planes of feature flagging

Control-plane features are easy to compare; runtime and measurement are where long-term debt accumulates.

GrowthBook vs LaunchDarkly: Runtime, Measurement, and Control Planes Compared

| Plane | What matters | GrowthBook (typical) | LaunchDarkly (typical) |
|---|---|---|---|
| Runtime | Hot-path dependency, caching, failure modes, targeting flexibility | Local rule evaluation; attribute-driven targeting without schema changes | Relies on LD services for rule storage; multi-context targeting with explicit entity modeling |
| Measurement | Joining exposure to outcomes, avoiding "two truths" | SQL-native measurement against your database/warehouse | Event collection with export paths; warehouse-native options vary by offering |
| Control | Rules, environments, approvals, governance, deployment | Strong baseline; self-host option; single-stage approvals | Deeper enterprise workflow plumbing (ITSM/IaC); multi-stage approvals; broader certifications |

Runtime plane: updates, evaluation, and targeting

Feature flags live in hot paths and incident loops. When you evaluate feature flagging platforms, start with four runtime questions:

  1. How do updates propagate? (polling vs streaming; how fast can you change a rollout?)
  2. What's on the hot path? (local evaluation vs remote dependency; where does latency come from?)
  3. What happens in partial outages? (what's cached; what degrades; what falls back to defaults?)
  4. How flexible is targeting? (can you add new dimensions without refactoring your identity model?)

The first three are pure runtime concerns. Targeting spans both the control plane (where you define rules) and runtime (where those rules get evaluated). Below is how GrowthBook and LaunchDarkly answer these questions in practice.

Update propagation (polling vs streaming)

How do flag changes reach running apps? This matters when you're expanding a rollout or killing a flag during an incident.

GrowthBook: SDKs fetch and cache the rules payload locally at initialization and can refresh it periodically or on demand. If you need faster propagation, the GrowthBook Proxy or GrowthBook Cloud supports streaming updates via Server-Sent Events.

Diagram 2: GrowthBook SDKs Runtime Evaluation Flow

LaunchDarkly: Server-side SDKs commonly use streaming connections for updates. Client-side SDKs may poll or stream depending on platform and configuration.

What to take away: GrowthBook is “fetch-and-cache by default, stream when needed.” LaunchDarkly is “streaming-first” in many deployments.

Hot-path dependency (local rules vs remote evaluation)

GrowthBook: Uses a cached-rules model. SDKs fetch a rules payload, keep it locally, and evaluate in-process. If you need to keep targeting rules off the client, GrowthBook supports Remote Evaluation mode via the proxy/edge workers.

LaunchDarkly (client-side): Client-side SDKs rely on LaunchDarkly services to store flag rules and deliver flag values/updates for a specific context, reducing rule exposure but increasing network dependence during init/refresh.

What to take away: Rule secrecy and centralized evaluation usually imply more network reliance; local rules reduce the dependency surface area.

Partial outages and degradation behavior

Both platforms cache locally, but what’s cached determines the failure mode:

  • GrowthBook client-side (local rules): SDK caches the ruleset. During an outage, evaluation continues from cached rules; you mainly lose propagation of new changes.
  • LaunchDarkly client-side (local values): SDK caches evaluated values for a context. During an outage, cached values continue to serve; evaluations that require fresh context updates may fall back to defaults until connectivity is restored.
  • Server-side (both): SDKs typically cache rules locally and evaluate without network calls; outages mostly affect receiving updates.

What to take away: The practical difference is whether the client can keep evaluating against rules offline (rules cached) versus only serving previously-evaluated values (values cached). 
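The cached-rules failure mode described above can be sketched in a few lines. This is illustrative Python, not the actual GrowthBook SDK: a client fetches a (hypothetical) rules payload once, evaluates in-process, and keeps serving cached rules when a refresh fails.

```python
# Sketch of the cached-rules model (illustrative; not the real SDK):
# fetch rules once, evaluate locally, degrade gracefully on outage.

class FlagClient:
    def __init__(self, fetch_rules, defaults=None):
        self.fetch_rules = fetch_rules  # callable returning {flag: rule}
        self.defaults = defaults or {}
        self.rules = {}
        self._load(initial=True)

    def _load(self, initial=False):
        try:
            self.rules = self.fetch_rules()
        except Exception:
            # Outage: keep evaluating against whatever rules are cached.
            if initial:
                self.rules = {}  # nothing cached yet -> defaults only

    def refresh(self):
        self._load()

    def is_on(self, flag, attributes):
        rule = self.rules.get(flag)
        if rule is None:
            return self.defaults.get(flag, False)
        # Toy rule format for illustration: {"attr": ..., "in": [...]}
        return attributes.get(rule["attr"]) in rule["in"]

def fetch_ok():
    return {"new-checkout": {"attr": "plan", "in": ["pro", "enterprise"]}}

def fetch_down():
    raise ConnectionError("flag service unreachable")

client = FlagClient(fetch_ok, defaults={"new-checkout": False})
print(client.is_on("new-checkout", {"plan": "pro"}))  # True
client.fetch_rules = fetch_down
client.refresh()  # fetch fails, cached rules survive
print(client.is_on("new-checkout", {"plan": "pro"}))  # still True
```

The key property: because rules (not just pre-evaluated values) are cached, the client can still make fresh decisions for new attribute combinations during an outage; only propagation of new changes is lost.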

That covers how flags arrive and evaluate in production. The next runtime question is what you can express with those evaluations: targeting.

Targeting: who sees what, and how flexible is the model?

Targeting determines who sees what when a flag is evaluated: by user, tenant, region, device, plan, or any other attribute.

Targeting straddles both runtime and control planes. You author rules in the control plane, but they execute at runtime. We cover it here because the runtime model (how rules get evaluated, what you can express without SDK changes) is where GrowthBook and LaunchDarkly differ most. The control-plane authoring experience is comparable; the runtime flexibility is not.

Tenant-consistent rollouts for B2B

B2B SaaS teams need rollouts that are consistent per tenant. If you're testing billing changes on 10% of organizations, User A and User B from Acme Corp need to see the same thing.

GrowthBook: hash-based bucketing on any attribute. Set the hash attribute to company_id and all users from the same company land in the same bucket. No state synchronization required.
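A minimal sketch of this kind of deterministic bucketing (illustrative only; not GrowthBook's actual hashing algorithm):

```python
import hashlib

def in_rollout(seed: str, hash_value: str, rollout_pct: float) -> bool:
    """Deterministic bucketing: same (seed, hash attribute) -> same bucket."""
    digest = hashlib.sha256(f"{seed}:{hash_value}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return bucket < rollout_pct

# Users A and B from Acme Corp hash on company_id, not user_id,
# so they always get the same decision -- no shared state required:
a = in_rollout("billing-v2", "acme-corp", 0.10)
b = in_rollout("billing-v2", "acme-corp", 0.10)
print(a == b)  # True
```

Because the decision is a pure function of the seed and the hash attribute, no state synchronization is needed across servers or devices.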

“We were looking to customize attributes on which we could toggle a roll-out, instead of using a percentage roll-out. Having tags for the classroom or the district a student is in, and then actually rolling out based on those, gives us a lot more power.”
John Resig, Chief Software Architect, Khan Academy (customer story)

LaunchDarkly: multi-context targeting. Define explicit context kinds (user, organization, device) and build rules that compose them. More structured but requires more upfront modeling.

Composition vs. Structural Identity

LaunchDarkly: multi-contexts model User, Organization, and Device as distinct entities. That helps when those entities have separate lifecycles, metadata, and policy rules.

GrowthBook: keeps the runtime model simpler: evaluation is driven by the attributes you pass (for example company_id, plan, device, region), and you can compose reusable targeting logic with Saved Groups (including nested groups) instead of introducing entity schemas at the SDK level.

When to choose

Choose GrowthBook if: You expect targeting dimensions to change over time and you want to add new ones without introducing new context schemas or SDK-level entity modeling.

Choose LaunchDarkly if: You need explicit, first-class separation between entity types (user vs org vs device) and you want targeting/governance to reflect those boundaries directly.

Measurement plane: proving rollout impact

Measurement determines whether you can reliably connect who saw a change with what happened next. The integration model matters because it either reuses the metrics and data pipelines your team already trusts, or creates a second analytics system that can drift from your source of truth.

When toggles aren't enough

You ship a feature behind a flag to 20% of users. A week later, your PM asks: "Did it increase conversion?" Your VP asks: "Did it slow page loads?"

Now flags become an analytics problem. You need to join flag exposure (who saw what variant) with outcomes (revenue, latency, errors).

When flag platforms become a second source of truth

Many centralized experimentation and flag platforms collect events via SDKs, store them in their infrastructure, and provide dashboards for analysis. To join rollout data with your product analytics or warehouse, you use Data Export (sometimes an add-on depending on plan) and build pipelines.

This creates two problems:

  1. Duplicate instrumentation. Sending events to your analytics platform (Amplitude, Mixpanel, your warehouse) AND your flag vendor. Duplicate tracking code, duplicate schemas, potential drift.
  2. Metric drift. Vendor analytics calculates revenue one way. BI team calculates it another. Results don't match. Trust erodes.

If your product data already lives in a central database or warehouse, a second analytics system can introduce unnecessary drift and duplication.

How the two platforms approach this:

LaunchDarkly: Collects flag exposure events into its own system and provides analysis there. To analyze outcomes using your warehouse metrics, you typically export those events and join them downstream.

GrowthBook: Reads exposure and outcome data directly from your database or warehouse and computes results with SQL, so the experiment uses the same tables and metric definitions your BI and engineering teams already rely on.

SQL as a simplifier

GrowthBook's approach: compute outcomes using SQL against your existing database. Flag exposure and outcome analysis stay aligned with the metrics and joins your team already trusts.

How it works:

  1. Define metrics in SQL against tables that already exist in your database.
  2. GrowthBook runs those queries to calculate results.
  3. All data stays in your database. No export, no pipelines, no schema mapping.
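The exposure-to-outcome join at the heart of this model is plain SQL. Here is a sketch using SQLite stand-ins for the warehouse tables; all table and column names are hypothetical:

```python
import sqlite3

# SQLite stand-ins for warehouse tables (names are hypothetical).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE exposures (user_id TEXT, variant TEXT);
CREATE TABLE orders    (user_id TEXT, revenue REAL);
INSERT INTO exposures VALUES ('u1','control'),('u2','control'),
                             ('u3','treatment'),('u4','treatment');
INSERT INTO orders    VALUES ('u1', 10.0), ('u3', 25.0), ('u4', 15.0);
""")

# Join who-saw-what with what-happened-next, per variant.
# (Assumes at most one order per user; real metrics handle repeats.)
rows = con.execute("""
SELECT e.variant,
       COUNT(DISTINCT e.user_id)     AS users,
       COALESCE(SUM(o.revenue), 0)   AS revenue,
       COALESCE(SUM(o.revenue), 0) * 1.0
         / COUNT(DISTINCT e.user_id) AS revenue_per_user
FROM exposures e
LEFT JOIN orders o ON o.user_id = e.user_id
GROUP BY e.variant
ORDER BY e.variant
""").fetchall()

for variant, users, revenue, rpu in rows:
    print(variant, users, revenue, rpu)
# control 2 10.0 5.0
# treatment 2 40.0 20.0
```

The same query shape runs unchanged against Postgres, MySQL, or a warehouse, which is what makes the "no export, no pipelines" claim concrete.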

Postgres/MySQL as the practical on-ramp

"Warehouse-native" can sound like you need Snowflake or BigQuery to get started. You don't. If you're running Postgres or MySQL for your application, GrowthBook can use those directly as your measurement database. This lets engineering teams start with outcome measurement without waiting for data warehouse infrastructure or analytics team support.

Practical setup:

  1. Connect GrowthBook to a read replica (not your production primary). Cap time windows to avoid full table scans.
  2. Define a Fact Table with a single SQL query, then derive multiple metrics from it using the metric builder. For advanced cases, drop to raw SQL.

As query volume grows, the same SQL-defined metrics can move from Postgres to ClickHouse or your preferred warehouse without rewrites. GrowthBook supports multiple SQL data sources including Postgres, MySQL, ClickHouse, Snowflake, BigQuery, Redshift, and Databricks.

Practical benefits of warehouse-native measurement

Use metrics you trust. Your BI team has a revenue metric. Your data engineers validated it. Use that in rollout analysis instead of reimplementing it in a vendor dashboard.

Measure engineering outcomes without instrumentation. You have error logs in BigQuery. Write a metric that counts errors by flag variant. No SDK events, no custom tracking. Just SQL against existing tables.

Tail metrics with statistical validity. If you already store latency or error telemetry, GrowthBook can attribute changes to rollout exposure using the same analysis pipeline as your other metrics. p95/p99 and tenant-level effects get measured at the right unit, not eyeballed from a graph.

Tenant-correct measurement for B2B. GrowthBook can measure rollout impact in a way that respects tenant boundaries, so you don't accidentally treat thousands of users in one large customer as thousands of independent samples. LaunchDarkly can get you the exposure data, but tenant-correct impact measurement is something you typically implement yourself downstream.
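A toy example of why the unit of analysis matters. With hypothetical data (one large tenant, four small ones), a naive user-level average is dominated by the big customer, while aggregating to the tenant, the unit the rollout actually randomized, weights each tenant once:

```python
from statistics import mean

# (tenant_id, converted) -- hypothetical exposure data:
# one 1000-user tenant plus four single-user tenants.
users = [("acme", 1)] * 900 + [("acme", 0)] * 100 + \
        [("small1", 0), ("small2", 0), ("small3", 0), ("small4", 1)]

# Naive: treat every user as an independent sample.
user_level = mean(c for _, c in users)

# Tenant-correct: aggregate within each tenant first.
per_tenant = {}
for tenant, c in users:
    per_tenant.setdefault(tenant, []).append(c)
tenant_level = mean(mean(v) for v in per_tenant.values())

print(round(user_level, 3))    # 0.897 -- dominated by acme's users
print(round(tenant_level, 3))  # 0.38  -- each tenant counts once
```

Real cluster-aware analysis also corrects the variance estimate, not just the point estimate, but the sketch shows the basic distortion.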

LaunchDarkly's approach

LaunchDarkly's Data Export sends raw events to your warehouse. You can then write SQL to join them with your metrics. This approach works, but it introduces an additional pipeline to maintain.

LaunchDarkly has now started offering warehouse-native experimentation capabilities for Snowflake. Availability and feature scope may vary by plan and region.

Control plane: governance and deployment

Operational constraints determine how a flagging platform fits into your organization’s security model, change-management processes, and cost structure. These factors often matter less during early adoption, but become decisive as teams scale, enter regulated industries, or standardize release workflows across the company.

Governance and deployment

Standard controls

Both platforms support staged rollouts, approval workflows, and RBAC. GrowthBook has single-stage approvals. LaunchDarkly supports multi-stage approvals (up to five stages). Most teams find single-stage sufficient. Regulated industries or large organizations with complex change management may require multi-stage.

Environments work similarly: dev, staging, production. You can copy flags across environments and test changes before promoting to prod.

Self-hosting and data perimeter control

GrowthBook is open source and self-hostable. Run the entire platform in your VPC, keep all data in your infrastructure, never send PII to a vendor. Deployment via Docker, Kubernetes, Helm.

LaunchDarkly is a multi-tenant SaaS. Relay Proxy exists for edge caching, but the management and control plane remains SaaS. For air-gapped deployments or zero-data-egress requirements, GrowthBook is the only option of the two.

"With the kinds of experiments we run and the sensitive data we handle, data security is paramount. The fact that GrowthBook offered us the ability to keep that data in-house was a key reason why we chose to work with them."
Diego Accame, Director of Engineering, Growth, Upstart (customer story)

Where LaunchDarkly's enterprise integrations matter

LaunchDarkly has native ServiceNow integration, a mature Terraform provider, and compliance certifications (ISO 27001, ISO 27701, FedRAMP). For organizations that require ServiceNow change management, Terraform for infrastructure-as-code, or specific compliance attestations, these are table stakes.

GrowthBook has SOC 2 Type II. LaunchDarkly's additional certifications will matter to teams selling into highly regulated industries; for teams without those specific requirements, they won't be a deciding factor.

That covers the three architectural planes. But there's one more dimension that doesn't fit neatly into the model, and it often ends up mattering more than teams expect.

The hidden dimension: pricing (dis)incentives

The three-plane model covers the technical evaluation. But there's a fourth dimension that doesn't fit neatly into architecture diagrams: pricing. Most teams treat it as a procurement problem, separate from technical decisions. That's a mistake.

Pricing models shape how teams use platforms. They create incentives that ripple back into architecture.

LaunchDarkly prices its product based on monthly active users (MAU) for client-side SDKs and service connections for server-side SDKs (per LaunchDarkly's pricing page). As your user base grows, your bill grows. At scale, MAU-based pricing can push teams to architect around counting and routing rather than shipping: sampling, filtering, or proxying traffic to manage cost. The platform becomes a line item that scales with success, which can create tension between "flag everything" best practices and budget constraints.

GrowthBook prices per seat with unlimited flags, traffic, and experiments (per GrowthBook's pricing page). The bill is the same whether you have 100K users or 10M. This removes the "should we flag this?" calculation: teams use flags more liberally because there's no marginal cost per evaluation, which can lead to cleaner release processes, faster rollbacks, and far more cost-effective flagging at scale.

This isn't about which model is "better." It's about recognizing that pricing affects architecture. If your team is already optimizing flag usage to manage costs, that's a signal worth examining.

How to decide

When GrowthBook is the better fit for development teams

  • Flag evaluation off the network hot path. GrowthBook's local evaluation model keeps decisions in-process with cached rules, reducing dependency surface area and making failure modes easier to reason about.
  • Identity model changes frequently. Attribute-driven targeting lets you add new dimensions (tenant, plan, cohort, region) without introducing new context schemas or SDK-level entity modeling.
  • Rollout impact measured in SQL on data you already trust. GrowthBook computes metrics directly against your Postgres, MySQL, or warehouse tables, avoiding a second analytics system and metric drift.
  • Self-hosting or strict data-perimeter control. GrowthBook can run entirely inside your VPC with no PII leaving your infrastructure.
  • Predictable, seat-based pricing. Costs stay stable as traffic grows, which removes incentives to sample or proxy requests just to manage MAU.

When LaunchDarkly is the better fit for development teams

  • Auto-generated metrics from observability platforms. LaunchDarkly can auto-generate metrics from OTel traces and observability tools (Dynatrace, Honeycomb, New Relic, Splunk). This reduces setup time when you want rollout safety tied immediately to production telemetry. GrowthBook has Safe Rollouts with auto-rollback as well, but guardrail metrics are SQL-defined, which means routing observability data to your database or warehouse first.
  • ServiceNow/ITSM governance is mandatory. LaunchDarkly has native ServiceNow integration. GrowthBook uses webhooks and APIs.
  • ISO 27001, ISO 27701, or FedRAMP certifications are required. For teams selling to federal agencies or highly-regulated industries, these certifications are non-negotiable.
  • Terraform provider is mandatory. LaunchDarkly has a mature Terraform provider for infrastructure-as-code workflows. GrowthBook doesn't.
  • Niche or legacy SDK coverage. LaunchDarkly supports platforms like Haskell, Erlang, Roku, and Apex. GrowthBook's 24+ SDKs cover most platforms but not these.

Decision guide

| Priority | Recommended Feature Flagging Platform |
|---|---|
| Runtime dependency surface (hot path, outages, testing) | GrowthBook |
| Flexible targeting without refactors | GrowthBook |
| Self-hosting / data perimeter control | GrowthBook |
| SQL-native measurement (database/warehouse) | GrowthBook |
| Seat-based pricing predictability | GrowthBook |
| Experimentation rigor (quantile, CUPED, cluster) | GrowthBook |
| Release safety automation (guarded rollouts, auto-rollback) | Both |
| Auto-generated metrics from observability platforms | LaunchDarkly |
| Enterprise compliance (ISO, FedRAMP) | LaunchDarkly |
| ITSM workflows (ServiceNow) | LaunchDarkly |
| Infrastructure-as-code (Terraform) | LaunchDarkly |
| Niche SDK coverage (Haskell, Erlang, Apex) | LaunchDarkly |

Bottom line

Feature flagging looks simple on the surface. Development teams discover the real differences once flags sit in their hot paths, incident response, and product metrics.

At runtime, the question is how much of your flag evaluation depends on network and vendor infrastructure. GrowthBook leans toward local, deterministic evaluation. LaunchDarkly leans toward managed infrastructure with deeper built-in safety automation.

In measurement, the question is whether rollout impact lives inside the flag vendor or stays aligned with the database and metrics your team already trusts. GrowthBook centers measurement in SQL on your existing systems. LaunchDarkly centers it in a managed event and observability pipeline, with export paths when you need them.

In operational constraints, the question is whether you need enterprise workflow plumbing and compliance programs out of the box, or control over deployment, data location, and cost structure.

Both platforms cover the control-plane basics. They optimize for different failure modes and organizational priorities. The right choice isn't about feature checklists. It's about which architecture matches how your systems fail, how your data is measured, and how your organization ships software.

A/B Testing in the Age of AI
Experiments

Feb 10, 2026

Prologue: Is A/B Testing Here to Stay?

In the age of AI, there is a growing debate about how it will transform the professions and skills we rely on today. Some view AI as a game changer, capable of completely reshaping the workforce: certain professions and skill sets may disappear entirely, while new, as-yet-unknown roles will emerge. Others argue that AI’s impact will be more evolutionary than revolutionary: the same professions will remain, but AI will accelerate and enhance the work we already do, enabling people to accomplish more in less time.

This debate naturally extends to the realm of A/B testing: will experimentation remain necessary at all? Some suggest that experimentation could become fully automated, potentially making roles like product managers, analysts, developers, and designers redundant. Others contend that while AI will fundamentally reshape these roles, it will not eliminate them. From this perspective, AI’s most significant contribution lies in speed: it can increase the volume of ideas that require testing and accelerates the ability to analyze the results. In effect, AI has the potential to dramatically compress product development cycles, allowing teams to iterate faster and more efficiently.

From this vantage point, A/B testing is far from disappearing; it is evolving. In this blog, we explore how AI is reshaping A/B testing: highlighting the areas already transformed, those on the verge of change, and those likely to remain largely unchanged.

Already Here: How AI Powers the Building Blocks of A/B Testing

A/B testing is the standard approach for determining whether new product versions genuinely outperform existing features. A typical A/B test comprises four main stages: hypothesis generation, where the proposed change and its expected impact are defined; experiment planning, which includes setting up the test conditions and determining an appropriate sample size; data collection and analysis; and finally, drawing conclusions and sharing the results across the organization. AI usage can already be found across the different stages of this lifecycle. 

Hypothesis Generation: Defining What to Test

Keeping track of what has already been done is essential for generating strong hypotheses. Past experiments provide critical context, helping teams avoid redundant or low-value tests and focus on ideas with real potential. Yet systematically tracking prior experiments remains a major challenge for analysts. As experiment volume grows, it quickly exceeds human cognitive capacity, and documentation becomes harder to navigate, especially as teams scale and members frequently join or leave.

This is precisely the kind of problem where AI excels. Platforms like GrowthBook leverage AI to help teams build efficiently on prior experiments by surfacing what has worked, identifying opportunities for new features, and even creating new feature flags and experiments directly in the platform. Crucially, these insights are not based on generic ideas; they are grounded in the company’s own data and experimental history, producing tailored solutions for the specific user population of the product.

Activating this AI support is as natural as talking with a teammate. In GrowthBook, analysts can simply ask what has worked in previous experiments, what has failed, and what to do next. Beyond suggesting new test ideas, the platform evaluates hypothesis quality against organization-defined criteria and helps prevent duplicate experiments by surfacing similar tests related to the current hypothesis.

Planning: Setting Up the Test

Once you know what you are going to check, the next step is to design the test. The main goal at this stage is to determine the test duration, which is directly driven by the required sample size. Sample size calculation is essential to ensure the experiment has enough statistical power to detect an effect when one truly exists.

Importantly, sample size planning is tightly linked to the data and the required confidence levels. Sample size is driven by the expected improvement (the minimum detectable effect), the required significance level (often 0.05), and the required statistical power (often 80%). While this planning is largely data driven, AI can still add value by helping teams manage, standardize, and document metrics across the organization by generating clear, consistent definitions and descriptions.
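As a back-of-the-envelope illustration of how those three inputs determine sample size, here is a standard two-proportion calculation (a sketch with a hypothetical helper; z-scores are fixed at a two-sided alpha of 0.05 and 80% power):

```python
import math

def sample_size_per_variant(baseline, relative_mde):
    """Approximate users needed per variant for a two-proportion test.

    z-scores below correspond to alpha = 0.05 (two-sided) and 80% power.
    """
    z_alpha, z_beta = 1.96, 0.8416
    p1 = baseline                        # control conversion rate
    p2 = baseline * (1 + relative_mde)   # treatment rate at the MDE
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 5% baseline conversion rate
# requires roughly 31,000 users per variant:
print(sample_size_per_variant(0.05, 0.10))
```

Note how sharply the requirement falls as the detectable effect grows: halving the sample-size burden is mostly a matter of testing bigger changes, which is why duration planning dominates this stage.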

Analysis: Data Acquisition and Evaluation

Once data has been collected, AI can add value across the entire analysis pipeline, from data extraction to generating insights. For example, GrowthBook allows users to create SQL queries directly from plain-text descriptions, execute them, and visualize the results. But the impact of AI in this platform goes far beyond query generation. By leveraging information linked to the tests, such as the hypothesis and metrics, AI can produce a full analysis of the results. This includes generating a summary that can be attached to the experiment, with the content and style of the summary controlled through prompts that specify how the results should be described.

Beyond straightforward hypothesis testing, AI can also enable deeper exploration through segmentation analyses, helping teams understand where and for whom effects occur. AI-assisted exploration can also uncover unexpected patterns or secondary signals (such as mouse movement or click behavior) that might otherwise go unnoticed.

Documenting and Sharing: Turning Results into Knowledge

To derive impact from an A/B test, it is essential to clearly communicate the results. In the era of AI, analysts no longer need to struggle with interpreting findings or deciding how to present them to stakeholders. By leveraging language models, AI can simplify this task simply by being provided with information about the experiment and the actual data.

Within Reach: Accelerating A/B Testing Automation & Learning with AI

AI enhancements are already influencing various components of A/B testing. In the near future, we believe AI has the potential to go beyond individual components, integrating the entire A/B testing lifecycle into an end-to-end process and enabling a deeper, more causal understanding of why effects occur and where errors originate.

Unifying the experimentation lifecycle

AI already supports key parts of the experimentation lifecycle. The next step is to integrate these capabilities into a unified, automated workflow. In practice, this could range from describing analysis goals in natural language to AI proactively proposing experiment ideas. For example, GrowthBook’s Weblens allows teams to upload a website URL and receive data-driven experiment recommendations.

In the future, AI-driven systems could autonomously generate product variants and run experiments end to end, while analysts retain oversight to ensure product quality, correct user allocation, and sound interpretation of results. This shift has the potential to significantly reduce the friction that commonly exists between product, engineering, and analytics teams.

Today, analysts are rarely responsible for implementing product changes or launching experiments, which often leads to misalignment, such as missing tracking events or users being allocated but never exposed to a variant. These issues can require substantial rework or, in the worst case, invalidate the experiment entirely. By centralizing the experimentation workflow within an AI-driven system, many of these coordination failures can be prevented, resulting in faster execution, cleaner data, and more reliable insights.

Automatically diagnosing experimental issues

AI can improve experiment validity not only by reducing operational friction, but also by automating validity checks and helping debug experiments when issues arise. For example, today a common validity check is Sample Ratio Mismatch (SRM), which verifies that the actual allocation of users matches the planned allocation. Beyond SRM, it is advisable to periodically conduct A/A tests, which compare the control version against itself. Ideally, no significant differences should emerge; if they do, it may indicate that some aspect of the software or testing environment is unintentionally affecting outcomes. 

So, what can AI contribute to these validation checks? Quite a lot. While implementing SRM and A/A tests is relatively straightforward, diagnosing the source of a problem when one arises is far more complex. Tracing the root cause often requires detailed data exploration, which can be guided by AI tools. In more advanced settings, AI can even proactively detect potential issues by continuously monitoring differences in allocation or user characteristics (e.g., identifying a higher proportion of bots in one group). This capability allows teams to catch and resolve problems earlier, reducing wasted time and resources.
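The SRM check itself is simple enough to sketch. For two groups, a z-test on the observed allocation is equivalent to the usual chi-square test (illustrative code, not any platform's implementation):

```python
import math

def srm_z(n_control, n_treatment, expected_ratio=0.5):
    """z-score for observed vs. planned allocation (two groups)."""
    n = n_control + n_treatment
    expected = n * expected_ratio
    sd = math.sqrt(n * expected_ratio * (1 - expected_ratio))
    return (n_control - expected) / sd

# 10,100 vs 9,900 on a planned 50/50 split: within normal noise.
print(round(srm_z(10_100, 9_900), 2))   # 1.41

# 11,000 vs 9,000: far outside noise -> investigate the allocation.
print(abs(srm_z(11_000, 9_000)) > 3.0)  # True
```

Running the test is the easy part; as the section notes, the hard and AI-assistable work begins once the check fires and someone has to find out why.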

Learning about your product

Understanding why an effect occurred is not only important when errors arise; it becomes even more critical when significant results are observed. Beyond simply deciding which features to ship or retire, companies are deeply interested in understanding their users and their needs. By uncovering what drove the impact in a test, companies can make more informed product decisions, identify opportunities for improvement, and tailor experiences that truly resonate with their users.

Achieving this goal today often requires substantial manual effort, from building dashboards and running follow-up analyses to iteratively exploring data to uncover the drivers of observed changes. AI can dramatically accelerate this process by enabling learning across user segments and other potential explanatory variables. While tools such as automated segmentation analysis already address part of this need, AI’s potential extends much further. In the near future, it is expected to reveal complex segment interactions, detect seasonal patterns, and analyze historical user behavior, providing a deeper understanding of the people who use our product.

Beyond Our Grasp: AI as a Replacement for Humans in A/B Testing

The existing and emerging AI-based practices naturally raise a broader question: how far can automation go? In theory, AI could fully automate the experimentation process. In such a “human-free” scenario, humans would no longer be needed in two key roles. First, they might not be required as users, as their behavior could be accurately modeled and simulated. Second, they might no longer be necessary as decision-makers, with AI autonomously generating, running, and evaluating experiments. From our perspective, however, both assumptions remain far from reality, at least for the foreseeable future. Let’s explore why.

No Need for Humans as Users?

The case for automation.

Human behavior can be modeled computationally, which raises the possibility of using synthetic users to test and iterate on product changes. In principle, such agents could enable experiments to be run, evaluated, and refined without involving real users.

Why humans still matter.

Human behavior is deeply contextual, shaped by emotions, social norms, cultural influences, and continuously evolving motivations. These nuances are difficult to capture fully in any model, and existing datasets inevitably reflect only a partial view of human decision-making. Moreover, products are ultimately designed for, and evaluated by, humans, not abstract agents or simulations. As a result, even the most sophisticated models must ultimately be validated against real human responses.

No Need for Humans as Decision-Makers?

The case for automation.

If AI could autonomously generate hypotheses, run experiments, evaluate outcomes, and draw conclusions to guide subsequent tests, human intervention might become unnecessary. In such a scenario, each experiment would naturally flow into the next, creating a continuous, fully automated experimentation workflow.

Why humans still matter.

Although this vision is tempting, allowing AI algorithms to operate entirely without human oversight is unlikely in the near future; too much is at stake. While AI can assist in running experiments, organizations are unlikely to relinquish human judgment, which safeguards revenue growth, user experience, and alignment with broader business objectives.

This cautious approach is already evident in A/B testing. Fully automated methods, such as reinforcement learning and multi-armed bandits for user allocation, have existed for years. Despite their advantages, these methods are never allowed to run without human supervision. Instead, they typically complement rather than replace classical A/B testing.

This highlights a broader reality: even if AI eventually handles the entire product development lifecycle autonomously, human analysis and creativity will remain crucial for evaluating AI-generated ideas, monitoring product updates, and interpreting results and insights. Human involvement in A/B testing and product decision-making is therefore unlikely to disappear; instead, it will transform: analysts will spend less time on hands-on execution and more on supervising, guiding, ideating, and shaping AI-driven processes.

Bottom line

There is no doubt that AI is transforming A/B testing as we know it. What remains open to debate is the extent of that transformation. In this piece, we’ve shared our perspective on what has already changed and what is most likely to evolve in the near future.

Today, AI is already helping teams generate stronger hypotheses, monitor and interpret metrics, automate large parts of the analysis workflow, and communicate results more effectively. Looking ahead, AI is likely to further connect the different phases of the experimentation lifecycle, enhance debugging and validation capabilities, and strengthen segmentation analysis, unlocking deeper and more nuanced product insights.

By reducing friction, accelerating learning cycles, and lowering the cost of running and analyzing experiments, AI empowers analysts and product teams to learn faster and make better-informed decisions every day.

We invite you to join us at GrowthBook as we continue building the next generation of experimentation, where A/B testing meets the power of AI.

Announcing GrowthBook 4.3: Faster Experiments, Deeper Insights
Releases
Product Updates
4.3

Announcing GrowthBook 4.3: Faster Experiments, Deeper Insights

Feb 4, 2026
x
min read

At GrowthBook, we're focused on helping you learn faster and ship with confidence. GrowthBook 4.3 delivers on both fronts, with post-stratification to reach statistical significance sooner, metric drilldowns to understand results more deeply, and feature evaluation diagnostics to verify your flags are working correctly in production.

GrowthBook 4.3 is now available to all cloud and self-hosted users.

Experiment Analysis

Post-Stratification (Enterprise only)

Experiment analysis now supports post-stratification, a powerful variance reduction technique that produces more precise results.

Here's the idea: if you know revenue varies by country, post-stratification uses that information to isolate the treatment effect from between-group noise. The result is tighter confidence intervals from your existing traffic. In the right conditions, CUPED + post-stratification can be equivalent to running your experiment with 20%+ more traffic!
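As a rough illustration of the idea (a sketch of textbook post-stratification, not GrowthBook's actual estimator): compute the treatment effect within each stratum, then combine the per-stratum differences weighted by stratum size, so between-country variation never enters the comparison.

```python
from collections import defaultdict

def post_stratified_effect(rows):
    """rows: (stratum, variation, value) tuples, where variation is
    "control" or "treatment" and every stratum contains both arms.
    Returns the average of within-stratum mean differences, weighted
    by each stratum's share of all units."""
    sums = defaultdict(lambda: {"control": [0.0, 0], "treatment": [0.0, 0]})
    for stratum, variation, value in rows:
        cell = sums[stratum][variation]
        cell[0] += value
        cell[1] += 1
    total_n = len(rows)
    effect = 0.0
    for by_var in sums.values():
        (c_sum, c_n), (t_sum, t_n) = by_var["control"], by_var["treatment"]
        effect += ((c_n + t_n) / total_n) * (t_sum / t_n - c_sum / c_n)
    return effect

# Revenue differs wildly by country, but the within-country lift is clean:
rows = [("NL", "control", 10.0), ("NL", "treatment", 11.0),
        ("US", "control", 100.0), ("US", "treatment", 102.0)]
print(post_stratified_effect(rows))  # → 1.5
```

A naive pooled comparison would mix the NL/US revenue gap into the estimate; stratifying first removes that between-group noise, which is where the tighter confidence intervals come from.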

Configure post-stratification at the organization level under Settings → General, or override it at the metric or experiment level. To enable it, you'll need to have pre-computed dimensions configured in your experiment assignment query.

Post-stratification is available to Enterprise customers. CUPED (without post-stratification) is available to Pro and Enterprise customers.

Experiment Metric Drilldowns (All editions)

Experiment result metric drilldown showing goal metric timeseries

Understanding experiment results just got a lot easier. Click any metric row to open a Metric Drilldown, a focused view with everything you need to interpret that metric without jumping between pages:

  • The Overview tab shows metric details, time series, and a results table with analysis controls.
  • The Slices tab lets you see how your metric breaks down across different slice values, such as browser or country.
Metric slices showing average LCP broken out by browser and country
  • The Debug tab reveals how CUPED, post-stratification, capping, and priors are affecting your numbers.
Debug page showing experiment results metric with pre and post cuped and capping

Metric slices are an Enterprise feature. See Metric Slices for configuration details.

Experiment Result Filters (All editions)

Experiments with dozens or hundreds of metrics can be overwhelming to review. You can now filter results by tag, slice, or metric group to focus on what matters.

Once you find a view you like, use Add to Dashboard to save it for later and share with your team. We also cleaned up the results UI to reduce clutter and keep the focus on your data.

Daily Participation Metrics (All editions)

We added a brand new metric type: Daily Participation. For each user, this measures the fraction of days they were active while enrolled in the experiment (active days ÷ days exposed), then averages that value across users in each variation.

Think of it as DAU normalized per user and exposure window, but more stable for experiments than raw daily active user counts.
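The arithmetic is easy to sketch. This illustrative helper (the `(active_days, days_exposed)` data shape is an assumption for the example, not GrowthBook's internal representation) computes the metric exactly as described above:

```python
def daily_participation(users):
    """users: one (active_days, days_exposed) pair per enrolled user.
    Computes each user's active-day fraction, then averages across users."""
    fractions = [active / exposed for active, exposed in users if exposed > 0]
    return sum(fractions) / len(fractions)

# Three users exposed for 7 days, active on 3, 7, and 1 of them:
print(round(daily_participation([(3, 7), (7, 7), (1, 7)]), 3))  # → 0.524
```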

This is a really valuable metric for any website or app that is trying to grow daily usage.

Better Fact Table Filters (All editions)

Fact metric filter in action

Metrics are built on Fact Tables, and often you only need a subset of rows. This release adds a powerful filtering UI to define exactly which rows to include, without writing SQL.

Feature Flags

Feature Evaluation Diagnostics (All editions)

Feature evaluation table

When a flag isn't behaving as expected, debugging can be frustrating: you're left guessing whether the issue is in your targeting rules, SDK configuration, or something else entirely.

Feature evaluation diagnostics solves this by querying SDK evaluation events stored in your data warehouse. See exactly how each flag evaluated in production, not just what the rules say should happen. Troubleshoot targeting conditions, rollouts, and experiment rules with real data instead of guesswork.

Nested Saved Groups (All editions)

Saved Groups now support nesting, letting you define groups in terms of other groups. Build complex targeting logic while keeping base definitions centralized and reusable.

For example, combine "Beta Users" AND "Enterprise Plan" to create "Beta Enterprise Users." Update the base group, and nested groups update automatically.

This makes it faster and easier to create targeting rules for feature flags.

Case-Insensitive Regex Targeting (All editions)

New targeting options for case-insensitive regex and "in list" matches—useful for matching email addresses and other values where case shouldn't matter.

Available now in the latest JavaScript, React, Node, and Python SDKs. More SDKs coming soon.

Rust and Roku SDKs (All editions)

We're excited to announce two new official SDKs: Rust and Roku.

Rust is the language of choice for modern performance-critical applications. Special shout out to the community, who authored the initial version of this SDK.

GrowthBook, now on your TV? That’s right, the next time you watch your favorite show, GrowthBook might be working behind the scenes with the launch of our official SDK for Roku, a leading smart TV platform that powers millions of streaming devices and TVs worldwide.

With these additions, GrowthBook now offers 24+ SDKs spanning client-side, server-side, mobile, and edge.

Quality-of-Life Improvements

Big thanks to all of our users who reported bugs, shared feedback, and contributed ideas to this release on GitHub or Slack.

Many small improvements add up to a big boost in usability:

  • Improved query performance for fact metrics
  • Cleaner experiment results UI with fewer distractions
  • OR targeting conditions
  • Updated SDK support
  • New API endpoints to manage experiment dashboards and custom fields
  • New Project Admin role to make it easier to manage a large distributed team
  • New Custom Hook option to only validate incremental changes
  • Kerberos auth support for Trino/Presto
  • Option to auto-update metric slice values
  • Support for additional AI models from Anthropic, Mistral, xAI, and Gemini

Plus dozens of smaller fixes and performance improvements.

How The Social Hub Cut Experimentation Costs by 82%
Experiments

How The Social Hub Cut Experimentation Costs by 82%

Jan 24, 2026
x
min read

Rudger de Groot of Mintminds shared how The Social Hub slashed its experimentation costs with GrowthBook. By driving down the incremental cost per experiment as close as possible to zero, companies can run as many experiments on as much traffic as they want.

The best experimentation programs scale cost-efficiently, so they can run more experiments, learn faster, and ship smarter. But a hidden cost killer is BigQuery query inefficiency. The more you test, the more you pay. What if there were a way to test more and pay less?

In this case study, we’ll show you how Mintminds cut experimentation costs for The Social Hub using GrowthBook with BigQuery optimizations from GA4Dataform by Superform Labs. The setup slashed BigQuery costs by 81.8% while improving data refresh speeds and monitoring capabilities. Here's how they did it.

A Scaling Advantage Built into the Cost Structure

The mission at Mintminds is simple: build high-quality experiments with reliable data and analysis. GrowthBook’s pricing model allows for a setup where the more you test, the lower your per-experiment cost. But to optimize costs, you need to understand where money actually flows. Let’s break down the pricing:

Fixed Costs (pricing, as of Nov 2025)

  • $40/month per seat for GrowthBook Pro license
  • Typical team size: 5 seats = $200/month

Variable Costs (GrowthBook Cloud):

  • 2 million CDN requests included (≈ pageviews)
  • 20 GB CDN bandwidth included
  • Overage: $10 per million requests, $1 per GB bandwidth

Self-Hosting Alternative: You can eliminate CDN costs by self-hosting GrowthBook for $11-50/month (depending on your infrastructure choice).

How Experimentation Costs Compare

To understand how GrowthBook experimentation costs compare, Mintminds shares a real-world example from a client with 2.6 million unique users per month running 5–7 experiments a month. In this example, the GrowthBook JS SDK runs on Cloudflare Pages, which removes any limit on the number of tested visitors, for free. Yes, you read that right: for free!

The variable GrowthBook costs are:

  • 6.6 million CDN requests: 6.6M − 2M included = 4.6M overage × $10 = $46
  • 6 GB CDN bandwidth: $0 (first 20 GB is free)
  • BigQuery usage cost estimation with daily updates: $300

Fixed GrowthBook Pro costs for a team of 5 members: 5 * $40 = $200
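Those figures can be reproduced directly from the pricing above. A small sketch, assuming the Nov 2025 prices quoted in this post:

```python
def monthly_cost(cdn_requests_m: float, bandwidth_gb: float,
                 bigquery_usd: float, seats: int) -> float:
    """Monthly cost from the prices cited above: $40/seat (Pro),
    2M CDN requests and 20 GB bandwidth included, then
    $10 per extra million requests and $1 per extra GB."""
    request_overage = max(0.0, cdn_requests_m - 2) * 10
    bandwidth_overage = max(0.0, bandwidth_gb - 20) * 1
    return seats * 40 + request_overage + bandwidth_overage + bigquery_usd

# The unoptimized example: 6.6M requests, 6 GB, $300 BigQuery, 5 seats ≈ $546
print(round(monthly_cost(6.6, 6, 300, 5)))
```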

| Platform | Monthly Cost | Annual Cost | vs. GrowthBook Optimized |
| --- | --- | --- | --- |
| Convert.com Pro | $3,488 | $41,856 | 1,050% more expensive |
| VWO Pro | $4,308 | $51,696 | 1,320% more expensive |
| GrowthBook (Unoptimized) | $546 | $6,552 | 80% more expensive |
| GrowthBook (Optimized) | $303 | $3,640 | Baseline |

With BigQuery costs included, GrowthBook remains dramatically cheaper than traditional alternatives like Convert ($3,500/month) or VWO ($4,300/month) at comparable traffic levels. GrowthBook is already the smart financial choice. With optimization, it becomes unbeatable: the optimized setup cuts total experimentation costs by roughly 91% versus Convert.com Pro and 93% versus VWO Pro.

An 82% BigQuery reduction transforms GrowthBook from “very affordable” to an offer you simply can’t refuse.

GA4 Structure Wastes BigQuery Resources

Regardless of hosting choice, BigQuery becomes your primary variable cost when using GA4 as your data source. For companies running active experimentation programs with daily updates, Mintminds finds that unoptimized BigQuery costs can easily reach $200 to $400/month.

The default GrowthBook BigQuery integration queries GA4’s standard events_* and events_intraday_* tables. These tables store event parameters in nested structures, forcing BigQuery to process far more data than necessary.

For example, when you’re running experiments with:

  • 5 metrics (1 goal + 1 secondary + 3 guardrails)
  • 3 dimensions for segmentation
  • Daily (or more frequent) data refreshes

BigQuery has to scan through nested arrays and repeated fields to extract the specific event parameters you need. You’re paying to process gigabytes of data when you only need megabytes of relevant information.

GrowthBook allows custom fact tables and metrics to select only relevant events and parameters. This helps, but optimizations plateau quickly because you’re still querying nested GA4 tables.

Enterprise customers get access to:

  • Advanced fact table query optimization
  • Data pipelines (significantly improved in GrowthBook 4.2)

But Pro license users need a different approach.

How to Use GA4Dataform's Flattened Datasets to Reduce Query Costs

At #CH2024 (the conference formerly known as Conversion Hotel), Rudger connected with Jules Stuifbergen from Superform Labs about this exact challenge. Jules introduced him to GA4Dataform, which offered an elegant solution.

What GA4Dataform Does: The Core Version (free!) creates a customized, flattened dataset optimized for the type of queries that GrowthBook uses.

| Feature | Benefit |
| --- | --- |
| Fully flattened structure | No nested fields = dramatically faster queries |
| Smart partitioning and clustering | Restricting queries by date and event name decreases the number of rows scanned |
| Smaller data footprint | Less data processed = lower BigQuery costs |
| Daily automated updates | Fresh data from the GA4 events table is appended incrementally each day |

Key insight: Even though you’re creating a new dataset in BigQuery (which feeds from the generic GA4 table), the flattened structure makes it cheaper to generate AND cheaper to query than repeatedly querying GA4’s nested tables.

Bonus benefit: This same optimized dataset can be used for all your other BigQuery reports and dashboards, compounding the savings.

A Rigorous A/A Experiment to Test the Setup

Mintminds partnered with Laura Semeraro and the team at The Social Hub—a hybrid hospitality brand offering hotel rooms, co-living spaces, coworking facilities, and creative playgrounds across Europe—to validate this approach with real data.

"Using GA4Dataform's flattened datasets didn't just reduce GrowthBook costs—it optimized all our BigQuery reports and dashboards."
Laura Semeraro, Digital Analyst at The Social Hub    

Implementation Steps

1. GA4Dataform Setup – Laura installed GA4Dataform Core (free version). The custom event parameters from GrowthBook were added to the configuration (experiment ID and variation ID). With the daily schedule enabled, GA4Dataform automatically updates the flat events table incrementally.

2. GrowthBook Configuration – Mintminds created a new assignment query (for counting experiment visitors) and built fact tables for the key conversion events: add-to-cart and purchase.

3. A/A Test Design – They ran two identical experiments simultaneously:

Configuration:

  • Same targeting rules
  • Same 5 metrics (1 goal, 1 secondary, 3 guardrails)
  • Same 3 dimensions

The Only Difference:

Experiment A: Default GrowthBook queries (nested GA4 tables)
Experiment B: Optimized queries (flattened GA4Dataform dataset)

4. Measurement – GrowthBook usage is automatically labelled in BigQuery, allowing the team to track:

  • BigQuery costs from Experiment A (old approach)
  • BigQuery costs from Experiment B (new approach)
  • BigQuery costs for daily dataset updates

Test duration: 1 week

This gave them an objective, apples-to-apples comparison.

The Social Hub Reduced BigQuery Costs by 82%

When the results came in, Rudger and his team had to verify the numbers multiple times to ensure accuracy: a whopping 81.8% cost reduction and a massive query speed improvement, too.

By using the GA4Dataform flattened dataset instead of the default GA4 nested tables, they had reduced BigQuery data processing by more than four-fifths.

| Benefit | Impact |
| --- | --- |
| Update experiment results more frequently | Better SRM and MDE monitoring without budget concerns |
| Run updates faster | Flattened queries execute in a fraction of the time |
| Scale experiment volume | The "more you test, the less you pay" promise becomes reality |
| Optimize other analytics | Use the same flattened dataset for all BigQuery dashboards |

The compounding effect: lower per-experiment costs + faster refresh rates = a dramatically better experimentation program ROI.

Enterprise Experimentation at a Fraction of the Cost

This case study demonstrates how to achieve exceptional BigQuery efficiency with GrowthBook. By combining GrowthBook Pro, GA4Dataform Core, and strategic BigQuery optimization, you can build a cost-effective, high-performance experimentation stack that rivals Enterprise setups at a fraction of the price. The cost reduction Mintminds achieved with The Social Hub isn’t an outlier. It’s the new baseline for GrowthBook implementations.

About Our Partners

Mintminds is a Certified GrowthBook partner based in the Netherlands. Founded by Rudger de Groot, the team assists companies worldwide with hyper-scaling experimentation using GrowthBook.  

The Social Hub is a European hospitality brand that blends traditional hotel stays with a vibrant, community-focused experience. Its unique hybrid model combines premium design-led short and long-stay hotel rooms with student accommodation, coworking spaces, meeting and event facilities, restaurants and bars, 24-hour gyms, and open-to-the-public spaces like rooftops, parks, and cultural venues.

AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI
Experiments

AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI

Jan 20, 2026
x
min read

Most teams building with GenAI are flying blind. They've replaced unit tests with vibes and shipped prompts that "felt right" to three engineers on a Friday afternoon.

This isn't a criticism—it's a diagnosis. For decades, we operated under a deterministic paradigm. The contract between developer and machine was explicit: Input A + Code = Output B. Always, without fail. In this world, success was binary. A unit test passed or it failed.

Generative AI has shattered this contract. We have moved from deterministic engineering to probabilistic engineering. We are no longer building binaries; we are managing stochastic agents that produce a distribution of probable outputs. You cannot assert(x == y) when x and y can change every time.

Gian Segato (Anthropic) eloquently sums up this shift: “We are no longer guaranteed what x is going to be, and we're no longer certain about the output y either, because it's now drawn from a distribution…. Stop for a moment to realize what this means. When building on top of this technology, our products can now succeed in ways we’ve never even imagined, and fail in ways we never intended” (Building AI Products In The Probabilistic Era).

As seismic as this shift may be, we’re focusing on a single aspect of it here: the shift from the domain of verification (is it correct?) to the domain of validation (is it good?).

This shift has left teams scrambling to define quality. Many have fallen into the trap of thinking AI Evaluations (Evals) are a replacement for A/B testing. They aren't.

And, for those in a hurry, here’s the point:

  • AI Evals check for competence—can the model do the job?
  • A/B testing checks for value—do users care?

You cannot ship a good AI product without both AI Evals and A/B testing.

The Limits of Vibe Checking

In the early days of the LLM boom, “Prompt Engineering” was largely a feeling-based art. Devs would tweak a prompt, run it three times, read the output, and decide if it “felt” better.

This manual inspection—”vibe checking”—leverages human intuition, which is great for nuance but terrible for scale.

Vibe checking suffers from three critical flaws:

  1. Sample size: You might test 5 inputs. Production brings 50k edge cases.
  2. Regression invisibility: Making a prompt “polite” might accidentally break its ability to output valid JSON. You won’t feel that until the API breaks.
  3. Subjectivity: One engineer’s “concise” is another’s “curt.”

As ML systems researcher Shreya Shankar notes, “You can’t vibe check your way to understanding what’s going on.” Manual inspection is mathematically insufficient for understanding probabilistic systems at scale.

To solve this, the industry turned to AI Evals.

💡

For an excellent intro to AI Evals, check out Shreya Shankar and Hamel Husain on Lenny’s Podcast.

What Are AI Evals?

AI Evaluations are an attempt to systematize the vibe check—turning qualitative judgment into quantitative metrics. They're a way to programmatically test the probabilistic parts of your application: prompts, models, and parameters.

But the term "Eval" is overloaded. When someone says "we're running evals," they might mean any of three things.

3 Types of AI Evals and Why They Matter

1. Model Evals

Model evals are benchmarks like MMLU or HumanEval. They're useful for choosing a provider (GPT-5 vs. Claude Opus 4.5), but they tell you almost nothing about your specific application. A model might ace GSM8K (math reasoning) and still be a terrible customer service agent. Worse, these public benchmarks are increasingly contaminated—models have seen the test questions during training, inflating scores that don't transfer to novel problems. (We wrote a whole article about why “The Benchmarks Are Lying To You.”)

2. System Evals

System evals are what matter most. These test your end-to-end pipeline: prompt + RAG retrieval + model. The key metrics here are things like hallucination rate, faithfulness (does the answer stick to the retrieved context?), and relevance.

Many teams now use LLM-as-Judge—a strong model grading outputs on subjective criteria like tone, helpfulness, and coherence. It scales better than human review, but inherits the same limitation: it measures whether an answer seems good, not whether users act on it.
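What a system eval actually computes varies widely by stack. As a deliberately crude, judge-free illustration of a faithfulness-style metric (a real pipeline would use an LLM judge or entailment model, not keyword overlap; the function and thresholds here are assumptions for the example):

```python
def faithfulness_proxy(answer: str, context: str) -> float:
    """Fraction of the answer's content words that appear in the retrieved
    context. A real system eval would use an LLM judge or entailment model;
    this only illustrates the shape of the metric."""
    stopwords = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}
    answer_words = [w for w in answer.lower().split() if w not in stopwords]
    if not answer_words:
        return 1.0
    context_words = set(context.lower().split())
    return sum(w in context_words for w in answer_words) / len(answer_words)

ctx = "refunds take 5 business days"
print(faithfulness_proxy("refunds take 5 days", ctx))   # → 1.0 (fully grounded)
print(faithfulness_proxy("refunds take 30 days", ctx))  # → 0.75 ("30" is unsupported)
```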

3. Guardrails

Guardrails are real-time safety checks—toxicity filters, PII detection, jailbreak prevention. Important, but a different concern than quality.

All three share a critical constraint: they measure competence, not value. Whether you run evals offline in your CI/CD pipeline against a curated "Golden Dataset," or online against live traffic in shadow mode, you're still asking the same question: Can this model do the job?

Some evals do capture preference—human ratings, side-by-side comparisons, thumbs up/down. But these are still proxies. A user clicking "thumbs up" in a sandbox isn't the same as a user returning to your product tomorrow. Evals measure stated preference; A/B tests measure revealed preference through behavior.

What evals can't tell you is whether users will care enough to stick around.

Where Evals Fall Short

Even within the realm of evals, a model that looks good in controlled conditions can fall apart in production.

The DoorDash engineering team documented this problem in detail. They built a new ad-ranking model that performed well in testing—but when deployed to real users, its accuracy dropped by 4.3%. The culprit? Their test data was too clean. The model had been trained assuming it would always have fresh, up-to-date information about users. But in the real world, that data was often hours or days old due to system delays. The model had been optimized for conditions that didn't exist in production.

This principle applies even more to LLM applications. LLMs are sensitive to prompt phrasing, context length, and retrieval quality—all of which behave differently in production than in curated test sets.

Consider a concrete example: you optimize a customer service prompt for faithfulness—it sticks strictly to your knowledge base and never hallucinates. Evals look great. But in production, users find the responses robotic and impersonal. Satisfaction drops. You optimized for accuracy; they wanted empathy.

This is the core limitation of evals: they measure capability, not value. Even when you run evals against live traffic, you're testing whether the model can do something—not whether that something matters to users.

Why You Should Use A/B Testing with Your AI Evals

If evals are the unit test, A/B testing is the integration test with reality. It’s the only way to measure what actually matters: downstream business impact like retention, revenue, conversion, engagement, and user satisfaction.

But running A/B tests on LLMs introduces challenges that didn't exist in traditional web experimentation. (For an introduction to the topic, see our practical guide to A/B testing AI.)

Challenges of Running A/B Tests on AI

1. The Latency Confound

Intelligence usually costs speed. If you test a fast, simple model against a smart, slow one and the variant loses—why? Was the answer worse or did users just hate waiting three seconds?

Isolating "intelligence" as a variable often requires artificial latency injection: intentionally slowing the control to match the variant. Only then can you measure what you think you're measuring.
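A minimal sketch of that latency matching, assuming an async model call (`call` and `fast_model` are hypothetical stand-ins, not a specific SDK):

```python
import asyncio
import time

async def with_matched_latency(call, target_latency_s: float):
    """Run a model call, then pad with sleep so every variant responds in
    at least the same wall-clock time, isolating answer quality from speed."""
    start = time.monotonic()
    result = await call()
    elapsed = time.monotonic() - start
    if elapsed < target_latency_s:
        await asyncio.sleep(target_latency_s - elapsed)
    return result

async def fast_model():
    # Hypothetical stand-in for the quick control model's API call.
    return "ok"

# The fast control now takes (at least) as long as the slow variant would.
response = asyncio.run(with_matched_latency(fast_model, 0.05))
```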

2. High Variance

LLMs are non-deterministic. Two users in the same variant might see meaningfully different responses. This noise demands larger sample sizes and longer test durations to reach statistical significance.

A button-color test might reach significance in a few thousand sessions. An LLM prompt test—where output variance is high and effect sizes are often small—might need 10x that, or weeks of runtime, to detect a meaningful difference.
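To see where that 10x comes from, the standard per-arm sample-size approximation for a two-sample test of means (illustrative textbook formula; default z-values correspond to 5% two-sided alpha and 80% power) is quadratic in the outcome's variance:

```python
import math

def n_per_arm(sigma: float, mde: float,
              z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate per-arm sample size for a two-sample test of means.
    sigma: outcome standard deviation; mde: minimum detectable effect.
    Defaults give 5% two-sided alpha and 80% power."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

# Doubling the outcome's standard deviation quadruples the required sample,
# which is why high-variance LLM outputs stretch test durations.
print(n_per_arm(sigma=1.0, mde=0.1), n_per_arm(sigma=2.0, mde=0.1))
```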

3. Choosing the Right Metric

Choosing the right metric is harder for AI features than for traditional UI changes. A chatbot might increase engagement (users ask more questions) while decreasing efficiency (they take longer to get answers). Align your success metric with actual business value, not just surface activity.

These realities create a tension. A/B testing AI gives you certainty, but certainty takes time. If you have twenty prompts to evaluate, a traditional A/B test could take months. And during those months, a significant portion of your users are experiencing inferior variants.

Enter Multi-Armed Bandits

For prompt optimization—where iterations are cheap, and the cost of a suboptimal variant is low—multi-armed bandits offer a different trade-off. Instead of fixed traffic allocation, they dynamically shift users toward winning variants as data accumulates. You sacrifice some statistical rigor for speed and reduced regret.
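A minimal sketch of one common bandit strategy, Thompson sampling over Bernoulli conversion rates (illustrative only, not GrowthBook's implementation):

```python
import random

def thompson_pick(arms):
    """Thompson sampling over Bernoulli arms. `arms` maps arm name to
    [successes, failures]; draw from each Beta posterior and serve the
    arm with the highest draw, so traffic drifts toward the winner."""
    draws = {name: random.betavariate(s + 1, f + 1) for name, (s, f) in arms.items()}
    return max(draws, key=draws.get)

# Simulate 5,000 requests where variant "b" truly converts 2x better.
random.seed(0)
arms = {"a": [0, 0], "b": [0, 0]}
true_rate = {"a": 0.05, "b": 0.10}
for _ in range(5_000):
    arm = thompson_pick(arms)
    converted = random.random() < true_rate[arm]
    arms[arm][0 if converted else 1] += 1
served_b = sum(arms["b"]) / 5_000  # "b" ends up with most of the traffic
```

Because allocation adapts as evidence accumulates, most users end up on the better variant during the test itself, which is exactly the regret reduction described above.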

🎰

Check out our deep-dive on how they work in GrowthBook.

Comparing A/B Testing to Multi-Armed Bandits

| Feature | A/B Testing | Multi-Armed Bandits |
| --- | --- | --- |
| Primary Goal | Knowledge. Determine with statistical certainty if B is better than A. | Reward. Maximize total conversions during the experiment. |
| Traffic Allocation | Fixed for the duration. | Dynamic. Automatically shifts traffic to the winner. |
| Best Use Case | Major model launches, pricing, UI changes | Prompt optimization, headline testing |

Bandits aren't a replacement for A/B testing. They're a complement—best suited for rapid iteration loops where you're optimizing within a validated direction, not making major strategic bets.

How to Use AI Evals and A/B Testing Together

Infographic: The LLMOps Pipeline. Four stages filter risk in order: Offline Evals, Shadow Mode, Safe Rollout, and A/B Test, moving from fast, low-cost competence checks to slower, more accurate measurements of business value. Early stages catch technical errors like hallucinations; only the final A/B test stage proves business impact like retention and revenue.

At GrowthBook, we see the highest-performing teams treating evals and experimentation not as separate islands, but as a continuous pipeline—each stage filtering out risk with progressively more expensive (but more accurate) methods.

Using AI Evals and A/B Testing Together in Practice

Stage 1: The Offline Filter (CI/CD)

A developer creates a new prompt branch. The CI/CD pipeline automatically runs evals against the Golden Dataset. If faithfulness drops below 90% or latency exceeds the threshold, the build fails. Bad ideas die here, costing pennies in API credits rather than user trust.
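A gate like that might look like the following sketch (field names and thresholds are illustrative assumptions, not a GrowthBook API):

```python
def ci_gate(eval_results, min_faithfulness=0.90, max_p95_latency_s=2.0):
    """Offline eval gate for a prompt change. eval_results: one
    {"faithfulness": float, "latency_s": float} row per Golden Dataset
    case (field names are illustrative). Returns (passed, reasons)."""
    n = len(eval_results)
    mean_faith = sum(r["faithfulness"] for r in eval_results) / n
    latencies = sorted(r["latency_s"] for r in eval_results)
    p95 = latencies[min(n - 1, int(0.95 * n))]
    reasons = []
    if mean_faith < min_faithfulness:
        reasons.append(f"faithfulness {mean_faith:.2f} below {min_faithfulness}")
    if p95 > max_p95_latency_s:
        reasons.append(f"p95 latency {p95:.2f}s above {max_p95_latency_s}s")
    return (not reasons, reasons)  # fail the build when reasons is non-empty
```

In CI, a non-empty `reasons` list would fail the build before the prompt ever reaches users.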

Stage 2: Shadow Mode (Production, Silent)

The prompt passes offline evals and gets deployed—but users never see it. The new model processes live traffic silently, logging predictions without surfacing them.

This is online evaluation: you're still measuring competence (latency, accuracy, edge case handling), but now against real-world conditions. DoorDash's 4% accuracy gap between testing and production is exactly the kind of discrepancy shadow mode is designed to surface—before users experience the degraded results.

Stage 3: Safe Rollout

Shadow mode passes. Feature flags gradually release the new model to users. You're monitoring guardrail metrics: error rates, refusal spikes, support tickets. If something tanks, you flip the flag and revert instantly—no code rollback required.

🦺

Use GrowthBook's Safe Rollouts to monitor guardrail metrics and rollback automatically.

Stage 4: The A/B Test (Causal Proof)

The rollout survives. Now you run the real experiment: new model vs. baseline, measured on business metrics. Not "faithfulness" but retention. Not "relevance" but conversion. This is the only stage that proves value.

Conclusion: AI Evals plus A/B Testing for GenAI

You cannot A/B test a broken model. It’s reckless. And you cannot Eval your way to product-market fit. It’s guesswork.

To ship generative AI that's both safe and profitable, you need both: rigorous evals to ensure competence, and robust A/B testing to prove value. The pipeline between them—shadow mode, safe rollouts—is how you get from one to the other without breaking things.

As Segato warned, our products can now fail in ways we never intended. This pipeline is how we catch those failures before users do.

We've moved from is it correct? to is it good? Evals answer the first question. A/B tests answer the second. You need both.

Frequently Asked Questions

Can AI Evals replace A/B testing?
No. AI Evals and A/B testing serve different purposes in the development lifecycle. Evals measure competence—accuracy, safety, tone—whether run offline or online. A/B testing measures business value through revealed user behavior: retention, revenue, conversion. Evals tell you the model works; A/B tests tell you it's worth shipping.

What is the difference between Offline and Online Evaluation?
Offline evaluation happens pre-deployment using a static Golden Dataset to check for regressions and quality. Online evaluation happens in production using live traffic (e.g., shadow mode). Both measure competence, but online evaluation catches issues—like feature staleness or latency spikes—that don't appear in controlled conditions.

How do you handle latency when A/B testing LLMs?
Latency is a major confounding variable because "smarter" models are often slower. If a slower model performs worse, it's unclear if users disliked the answer or the wait time. To fix this, engineers use Artificial Latency Injection—intentionally slowing down the control group to match the variant's response time, isolating "intelligence" as the single variable.
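A minimal sketch of latency injection (the model calls and latency numbers are stand-ins; real code would hit your model endpoints): both arms are padded up to the slowest arm's response time, so the experiment isolates answer quality from wait time.

```python
import asyncio
import time

# Simulated latencies: the "smarter" new model is also the slower one.
LATENCY = {"control": 0.03, "new": 0.12}

async def call_model(prompt: str, variant: str) -> str:
    await asyncio.sleep(LATENCY[variant])  # stand-in for a real model call
    return f"{variant} answer to: {prompt}"

async def serve(prompt: str, variant: str, matched: float = 0.12) -> str:
    """Pad every response to the slowest arm's latency before returning it."""
    start = time.monotonic()
    answer = await call_model(prompt, variant)
    remaining = matched - (time.monotonic() - start)
    if remaining > 0:
        await asyncio.sleep(remaining)  # artificial latency for the fast arm
    return answer
```

With the padding in place, a lift (or drop) in the variant can be attributed to the answers themselves rather than to users reacting to the wait.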

What is "Vibe Checking" in AI development?
"Vibe checking" is the informal process of manually inspecting a few model outputs to see if they "feel" right. While useful for early exploration, it is unscalable and statistically flawed for production systems because it fails to account for edge cases, regressions, or large-scale user preferences.

When should I use a Multi-Armed Bandit instead of an A/B test?
Use a Multi-Armed Bandit when your goal is optimization (maximizing reward) rather than knowledge (statistical significance). MABs are ideal for testing prompt variations or content recommendations because they automatically route traffic to the winning variation, minimizing regret. Use A/B tests for major architectural changes or risky launches where you need certainty.
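An epsilon-greedy bandit is the simplest version of this traffic-routing idea. The toy simulation below (made-up conversion rates, not a production-grade bandit) shows how traffic concentrates on the better prompt while the test is still running:

```python
import random

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0

def epsilon_greedy_pick(rewards: dict[str, list[float]], epsilon: float = 0.1) -> str:
    """Exploit the best arm so far; explore a random arm epsilon of the time."""
    if random.random() < epsilon:
        return random.choice(list(rewards))
    return max(rewards, key=lambda arm: mean(rewards[arm]))

# Toy simulation: prompt B converts 10x better than prompt A.
random.seed(42)
true_rate = {"prompt-A": 0.02, "prompt-B": 0.20}
rewards: dict[str, list[float]] = {"prompt-A": [], "prompt-B": []}
for _ in range(2000):
    arm = epsilon_greedy_pick(rewards)
    rewards[arm].append(1.0 if random.random() < true_rate[arm] else 0.0)
# Most of the 2000 rounds end up on prompt-B, minimizing regret.
```

This is also why bandits are a poor fit for "need certainty" launches: traffic is deliberately unbalanced, which weakens the statistical comparison an A/B test is designed to give you.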

What is the best way to deploy AI models safely?
Use a staged pipeline. Start with offline evals in CI/CD to catch regressions. Then use shadow mode to test against live traffic silently. Next, use feature flags to release to a small percentage of users while monitoring guardrails. Finally, run a full A/B test to measure business impact. Each stage filters out risk before exposing users to problems.
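The shadow-mode stage can be sketched in a few lines (the model functions and log structure here are illustrative placeholders): mirror each request to the candidate in the background, log its answer for offline comparison, and always serve the baseline.

```python
from concurrent.futures import ThreadPoolExecutor

def baseline_model(prompt: str) -> str:
    return f"baseline answer: {prompt}"   # what users actually see

def candidate_model(prompt: str) -> str:
    return f"candidate answer: {prompt}"  # evaluated silently, never served

shadow_log: list[dict] = []               # offline comparison happens here

def handle_request(prompt: str, pool: ThreadPoolExecutor) -> str:
    # Fire the candidate in the background; log its output when it finishes.
    future = pool.submit(candidate_model, prompt)
    future.add_done_callback(
        lambda f: shadow_log.append({"prompt": prompt, "shadow": f.result()})
    )
    return baseline_model(prompt)         # users only ever get the baseline
```

Because the candidate runs off the request's critical path, its latency spikes and failures show up in the shadow log instead of in front of users.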

What is LLM-as-Judge?
LLM-as-Judge is an evaluation technique where a strong model (like GPT-4 or Claude) grades the outputs of your system on subjective criteria such as tone, helpfulness, and coherence. It scales better than human review but shares the same limitation as other evals: it measures whether an answer seems good, not whether users will act on it.

What is the difference between stated and revealed preference in AI evaluation?
Stated preference is what users say they like—thumbs up ratings, side-by-side comparisons in a sandbox. Revealed preference is what users actually do—returning to your product, completing tasks, converting. Evals capture stated preference; A/B tests capture revealed preference. The two often diverge.

Dark Patterns in A/B Testing: How Short-Term Optimization Leads to Product Enshittification
Experiments


Jan 12, 2026
x
min read

Why optimizing for short-term A/B test wins can degrade user trust and product quality. A look at common dark patterns in experimentation, why they “work,” and how better metrics can help teams build products that create real long-term value.

A post supposedly from a software engineer at a meal delivery company went viral recently. It accused the unnamed company of unscrupulously manipulating pricing, fees, and salaries to increase revenue. One of the things they did was to run an A/B test on a “Priority delivery” fee. According to the post, there were no product changes to make delivery faster, but instead, they delayed regular deliveries.

“We actually ran an A/B test last year where we didn't speed up the priority orders, we just purposefully delayed non-priority orders by 5 to 10 minutes to make the Priority ones "feel" faster by comparison. Management loved the results. We generated millions in pure profit just by making the standard service worse, not by making the premium service better.” (Source: Reddit)

While there are some questions about the veracity of this post, such dark patterns are absolutely being used in A/B testing and product development. And this raises an important question about the ethics of these techniques in experimentation.

What Are Dark Patterns?

Dark patterns are product design or implementation choices that deliberately nudge, coerce, or mislead users into behaviors that primarily benefit the company. They often come at the expense of the user’s understanding or long-term satisfaction. 

For a comprehensive taxonomy, see deceptive.design, which catalogs these patterns in detail. 

How Are Dark Patterns Used in A/B Testing?

In the context of A/B testing, dark patterns typically appear when experiments are optimized narrowly for short-term business metrics, such as a conversion rate, without regard for whether the underlying change actually improves the product. Often they are introduced as a response to an organization’s goal metric that fails to capture the complete picture (see Goodhart’s Law and the dangers of metric selection). 

Common Dark Patterns Used in Experiments

  • Artificial degradation: Making a baseline experience worse (for example, slowing delivery times as above or adding friction) so that a paid tier or alternative appears more attractive.
  • Obscured choice: Designing UI variants that make it harder to opt out, cancel, or choose a lower-cost option, then validating them via A/B tests that show higher revenue.
  • Price obfuscation: Experimenting with fees, surcharges, or defaults in ways that users only discover late in the funnel.
  • Emotional manipulation: Leveraging urgency, guilt, or fear (“Only 2 left!”, “People like you choose…”) to drive behavior, then justifying it with statistically significant lifts.

A/B testing itself is not the problem. The problem is using experimentation as a shield: “the data says it works” becomes a way to avoid asking whether the outcome is aligned with user value or long-term trust. It hides the real question of whether we should do this at all.

Short-Term Wins, Long-Term Costs of Unethical Experimentation

Dark patterns can look good in the short term. They are engineered to do so. Revenue goes up, conversion improves, and dashboards turn green. These tactics exploit the goodwill of your current user base and your long-term measurement blind spots, producing lifts that show up immediately. The costs, however, tend to be delayed and externalized.

Dark patterns in A/B testing introduce several long-term risks for organizations.

  1. Reputational Risk
    Users are not irrational. They may not always articulate why they are unhappy, but they notice when a product feels hostile, manipulative, or bent on nickel-and-diming them. Trust erodes quietly and then suddenly. When stories like the viral post above surface (whether accurate or not), they resonate precisely because users already suspect this behavior.
  2. Legislative and Regulatory Risk 
    Many dark patterns operate in gray areas that are increasingly of interest to regulators. Fee transparency, deceptive defaults, and coercive UX are now explicitly called out in regulations in multiple jurisdictions (see the EU’s Digital Services Act (DSA) and the California Privacy Rights Act (CPRA)). An A/B test that boosts revenue today can become legal exposure tomorrow, complete with internal documentation showing intent.
  3. Internal and Cultural Risk
    Engineers, designers, and PMs generally want to build products that help people. When teams are repeatedly asked to ship features that intentionally worsen user experience, morale suffers. The best people notice. Over time, this can lead to disengagement or attrition, especially among senior contributors who have other options.
  4. Risk from Competition
    Dark patterns that don't improve the product open the door, in the long term, for competitors to build a genuinely better product and take your users.

In other words, dark patterns trade long-term value for short-term gains. 

Practical Solutions to Avoid Dark Patterns in Experimentation

There are some practical ways to help reduce these risks and avoid the enshittification of products. Chief among these are adopting value principles and establishing ethics committees. 

Value principles, like Google’s “Don’t be evil”, are frequently treated as aspirational marketing artifacts rather than operational constraints. Many tend to be vague or non-actionable and open to interpretation, which provides no meaningful protection against dark patterns. Finally, even if they are actionable and adopted as policy, they can come into tension with other incentives at the company, such as bonuses or career progression. Google, after all, ditched “Don’t be evil” in 2018. 

Ethics committees are used at some larger companies to ensure consistent application of company values. However, they can face the same issues as the values above, particularly if the company is facing financial pressure; the ethics team can be high on the list of cuts.

The most practical way to avoid dark patterns is not an ethics committee or a vague principle statement; it is using the right metrics.

If you only measure immediate revenue or conversion, you will eventually design experiments that extract value rather than create it. To counteract this, teams need to deliberately include metrics that reflect longer-term outcomes.

Example experimentation metrics to use to avoid dark pattern behavior

  • Retention
  • Repeat usage
  • Complaint rates
  • Refunds
  • Customer support contacts
  • Brand sentiment
  • Qualitative feedback

Not all of these can be perfectly measured, or measured at all (like the likelihood or cost of losing key employees). In the real world, the data will never be perfect. Good product judgment will still be required, as there will always be uncertainty. An experiment that produces a short-term lift but could plausibly damage trust should be treated with skepticism, even if the lift is impressive.

When Experimentation Leads to a Better Product

Ultimately, the goal of experimentation is not to prove that you can move a number. It is to learn how to make something people genuinely want. A/B testing is a powerful tool in the service of that goal, but the further you drift from it, the more your “wins” become signals of underlying enshittification rather than progress. Make sure your metrics reflect your real goals as much as possible.  

In the long run, the most effective optimization strategy remains the simplest: make the product better.

7 Steps to Better Experiment Design
Experiments
Platform


Dec 22, 2025
x
min read

A practical checklist for running A/B tests you can trust

From predictive model accuracy at Facebook and experiment design at X (formerly Twitter), to building the best experimentation platform used by Dropbox, Sony and Upstart with GrowthBook, I've spent the last six years shaping how some of the largest tech companies measure success and ship features.

Across companies, industries, and scales, I’ve seen the same pattern repeat: experimentation rarely fails because teams don’t understand A/B testing mechanics. It fails because experiments are poorly designed—unclear goals, misaligned metrics, weak baselines, flawed randomization, or decisions made without a plan for ambiguous results.

The teams that get the most value from experimentation aren’t running more tests. They’re running better ones. They’re deliberate about what they’re trying to learn and disciplined about how results turn into decisions.

This article distills the most reliable experiment design practices I’ve learned from years of work in the field. If you already know how A/B testing works and want results you can trust—and act on—these seven steps are a strong place to start.

(For a deeper technical walkthrough, see GrowthBook’s Experimentation Best Practices)

1. Define the Goal Clearly

Every experiment should answer a specific question.

Start by writing down the problem you’re trying to solve in plain language. Is it activation? Retention? Conversion efficiency?

A good test of clarity is whether you can write a concrete hypothesis, such as:

“Users who complete the new onboarding flow will reach the activation milestone 10% more often than users in the existing flow.”

Clear goals prevent experiments from drifting into vague “did anything change?” territory.

In practice: Teams at Dropbox use tightly framed hypotheses to avoid shipping changes that move surface-level engagement but fail to improve long-term collaboration or retention.

2. Choose the Right Success Metrics

Once the goal is clear, metrics follow.

Every experiment should have:

  • One primary metric that defines success
  • A set of secondary metrics for context
  • Guardrail metrics to catch unintended harm

Focusing on too many metrics creates confusion. Tracking too few hides important tradeoffs—especially when multiple metrics are evaluated simultaneously (see GrowthBook’s guidance on multiple testing corrections).

Use your secondary metrics to improve your understanding of what drives your primary metric. They also help you periodically check in on your primary metric, ensuring it is well defined and driving you toward your business goals.

In practice: Teams at Khan Academy use experimentation to iterate on learning experiences while remaining deeply thoughtful about how success is measured in an educational context.

3. Know Your Baseline

You can’t interpret change without knowing where you started.

Before launching an experiment:

  • Understand current performance
  • Measure normal variance
  • Calibrate expectations for realistic lift

A change from 4% to 5% conversion is only meaningful if you know how stable 4% really is.

In practice: One GrowthBook customer—a large European marketplace—moved away from before-and-after analysis after realizing they couldn’t separate real lift from seasonality. Establishing proper baselines made results interpretable and decisions easier.

4. Understand Leading vs. Lagging Indicators

Not all metrics respond at the same speed.

  • Leading indicators provide fast feedback and are often better suited for short-term experiments.
  • Lagging indicators validate long-term impact and strategic alignment.

High-performing teams use both, but they’re intentional about which metric actually determines success.

Optimizing only for lagging indicators slows learning. Ignoring them risks local optimization.

5. Define the Experiment Population and Randomization Strategy

Decide who should be included in the experiment—and exclude everyone else.

Best practices include:

  • Randomizing users as close to the experience as possible
  • Ensuring assignment persists across sessions
  • Using a true control group
  • Keeping designs simple when traffic is limited

If you don’t have enough users, avoid multi-variant tests.

In practice: One GrowthBook customer, a major European retailer, was running underpowered tests. They moved from partial traffic to testing on 100% of visitors—dramatically reducing time to confidence and revealing insights that challenged long-held assumptions.

If you’re using feature flags to control exposure, GrowthBook’s approach to running experiments with feature flags is designed specifically for this kind of setup.

6. Validate Your Setup Before You Trust Results

You can’t analyze what you can’t connect.

Before launching real experiments, confirm that:

  • Exposure data joins cleanly with outcome data
  • Identifiers are consistent
  • Metrics are computed correctly

Then run an A/A test—two identical variants with no visible change.

In practice: Teams operating at scale use A/A tests to catch instrumentation and analysis issues early. If multiple uncorrelated metrics “win” in a no-change test, or multiple A/A tests fail with clear issues, something is broken. GrowthBook strongly recommends this as a validation step (A/A testing documentation).
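One concrete failure an A/A test (or any experiment) can surface is sample ratio mismatch (SRM): the observed split drifting from the configured one, which usually means broken assignment or logging. A minimal sketch of the standard z-test check, using only the standard library:

```python
import math

def srm_p_value(n_control: int, n_variant: int, expected: float = 0.5) -> float:
    """Two-sided p-value that the observed split matches the expected ratio."""
    n = n_control + n_variant
    z = (n_control - n * expected) / math.sqrt(n * expected * (1 - expected))
    # Normal-approximation two-sided p-value via the error function.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 5,050 vs 4,950 is normal noise; 5,200 vs 4,800 almost certainly is not.
srm_p_value(5_050, 4_950)   # ~0.32: fine
srm_p_value(5_200, 4_800)   # ~6e-5: investigate before trusting any results
```

A common convention is to flag SRM at p < 0.001 rather than 0.05, since this check runs automatically on every experiment and false alarms are costly.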

7. Decide How Long to Run the Experiment

Ending experiments early increases false positives. Letting them run forever slows learning.

Plan duration in advance based on:

  • Expected variance
  • Minimum detectable effect
  • Available traffic

If you need flexibility, approaches like sequential testing can help—but only if you understand the tradeoffs.
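A back-of-the-envelope version of that planning step, using the standard two-proportion sample size formula (alpha = 0.05, power = 0.80; the baseline and traffic numbers below are made up for illustration):

```python
import math

def required_n_per_variant(baseline_rate: float, relative_mde: float) -> int:
    """Per-variant sample size for a two-sided test, alpha=0.05, power=0.80."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha, z_beta = 1.96, 0.84  # standard normal quantiles for 0.05 / 0.80
    pooled = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * pooled * (1 - pooled))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# 4% baseline conversion, hoping to detect a 10% relative lift:
n = required_n_per_variant(0.04, 0.10)  # ~40k users per variant
days = math.ceil(2 * n / 20_000)        # total users needed / daily traffic
```

Running this before launch turns "how long should it run?" into arithmetic: if the required duration is unworkable, raise the minimum detectable effect or the traffic allocation rather than stopping early.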

Bonus: Plan for All Outcomes

Only 10–30% of experiments produce a clear winner. That’s normal.

High-performing teams plan for this reality before launching:

  • Low-cost features may ship on directional evidence
  • High-cost features require stronger confidence
  • Neutral results still generate valuable learning

Experiments aren’t always about maximizing win rates. In some cases, they prevent huge losses. In other cases, their primary value is learning about user behavior.

Final Thought

Experimentation isn’t about proving you’re right. It’s about discovering what’s true.

Every experiment—even a neutral one—teaches you something about your users and your assumptions. Teams that stay curious, document learnings, and iterate deliberately are the ones that compound results over time.

That’s what turns experimentation into a real competitive advantage.

FAQ: Experimentation & A/B Testing in Practice


How do you decide whether an A/B test result is actionable?
When the results all point to the same decision, even when accounting for uncertainty. If you would ship even if the results were at the bottom end of the confidence intervals and you've collected a reasonable amount of data, ship!

Why are so many A/B test results inconclusive?
Because most product changes simply don’t meaningfully change behavior. Neutral results often reveal what users don’t care about, guiding better future experiments.

How long should an experiment run?
Long enough to reach sufficient statistical power—not until a metric looks good.

When should you ship a result that isn’t statistically significant?
For low-risk, low-cost changes with stable guardrails. High-risk features need stronger confidence.

What’s the biggest mistake teams make with experimentation?
Treating experimentation as validation instead of learning.

Announcing GrowthBook 4.2: Product Analytics & Experimentation at Scale
Releases
4.2
Product Updates


Nov 11, 2025
x
min read

At GrowthBook, our mission is to provide the insights you need to build better products that grow your business faster. With GrowthBook 4.2, we’ve added a beta version of GrowthBook Product Analytics. Now our users will have a single integrated platform for feature management, experimentation, and product analytics.

In addition, we’ve continued to enhance the developer experience, making experimentation at scale and integration into any stack easier than ever. Finally, for companies seeking an alternative to Statsig, our Statsig to GrowthBook Migration Kit automates importing feature gates and dynamic configs while replacing Statsig SDKs with GrowthBook SDKs.

Release 4.2 is available immediately to both our cloud and self-hosted users. Visit our Pricing page for details about Starter, Pro, and Enterprise options. 

GrowthBook Product Analytics (Beta)

Adding Product Analytics to the GrowthBook platform closes the loop for development. Now, you can go from feature management to experimentation to product analytics in a single tool. While in beta, Product Analytics will be available to all users.

Turn your warehouse data and metrics into actionable product insights. Explore user behavior, share dashboards, and make smarter decisions about what to build next. With Product Analytics, you will be able to:

  • Build and share dashboards that combine graphs, pivot tables, and text
  • Create custom charts and tables from any data in your warehouse
  • Use GrowthBook SQL Explorer with our AI-powered text-to-SQL capabilities to query, aggregate, and group data
  • Access any metric defined in GrowthBook and track its performance over time
Build charts with any data in your warehouse using SQL Explorer
Analyze any metrics defined in GrowthBook
Slice and dice data with flexible pivot tables

This Product Analytics beta provides a glimpse of what’s to come as GrowthBook develops more self-service tools for building, analyzing, and exploring all of your product data. Let us know what you think in our Slack community!

Statsig to GrowthBook Migration Kit

With the OpenAI acquisition of Statsig, we saw a spike in interest in GrowthBook. Product teams looking for alternatives expressed concern about what would happen to their data. Others worried that the product might be discontinued or deprioritized. To make the transition from the acquired platform to an open-source alternative as effortless as possible, we created the Statsig to GrowthBook Migration Kit, free for all users.

  • Statsig Importer instantly copies over feature gates, dynamic configs, and segments.
  • Statsig Code Migration Tool (powered by Claude Code) automatically replaces Statsig SDKs with GrowthBook SDKs.

Enterprise Enhancements

The 4.2 features below continue our investment in the developer experience that makes GrowthBook a top choice for product development teams with high volume apps and advanced experimentation programs. 

Metric Slices: Simplify Experiment Design

When users create experiments, they often want to break metrics down across common dimensions like product categories or device types, which can balloon into dozens of near-duplicate metric definitions. Metric Slices solve this problem. Enable Auto Slices on a Fact Metric once, and GrowthBook automatically generates drill-down analyses for each dimension value across all experiments using that metric.

View revenue per user metric by product category

Instead of creating separate “Orders” metrics for each product category or device type, you can enable Auto Slices on those columns for a single metric, which means fewer redundant metrics, faster setup, and cleaner reporting.

Incremental Refresh

We revamped our Data Pipeline Mode to lower query costs and improve performance for long-running experiments and high-traffic apps. By storing intermediate results and incrementally refreshing them, we’ve seen users save up to 85% in query costs. This first version is available on BigQuery, Presto, and Trino. We’ll be adding support for more data warehouses based on customer demand.

Official Metrics

Many organizations rely on a trusted set of “official” metrics. GrowthBook now makes these easier to manage by letting admins mark and edit official metrics directly from the UI (previously API-only). This helps standardize measurement, reduce confusion, and promote consistency across teams.

New SQL Template Variables

You can now access custom field values and phase data directly in your metric and experiment SQL, unlocking several use cases:

  • Fine-tuned query optimization using non-date partition keys
  • Reuse of SQL definitions with minor tweaks per experiment
  • More accurate joins between experiment exposure and phase data

Custom Validation Hooks

GrowthBook has always been flexible — and now it’s even more so. Self-hosted enterprise users can write custom JavaScript validation hooks that run in secure V8 isolates. Use them to:

  • Require tags on feature flags
  • Prevent targeting rules containing PII
  • Enforce naming conventions or internal policies

These hooks let teams automate governance without slowing down development.

Edge Remote Eval

Edge Remote Eval lets client-side SDKs offload feature flag evaluation to a backend server, preventing targeting logic from leaking to users. Previously, this required managing your own GrowthBook proxy servers. Now, you can deploy a Cloudflare Workers–based Remote Eval server — a fast, low-cost, zero-maintenance alternative built on Cloudflare’s global infrastructure.

Quality-of-Life Improvements

Big thanks to all of our users who reported bugs, shared feedback, and contributed ideas to this release on GitHub or Slack.

Many small improvements add up to a big boost in usability:

  • Faster and more relevant search algorithm for features, metrics, and experiments 
  • Create feature rules in multiple environments at once
  • Better column-type detection for BigQuery Fact Tables
  • Add metric row filters based on Boolean columns
  • Reduced webhook noise (no more notifications for unpublished drafts)
  • Slack and Discord notifications now include more detailed change info
  • Custom pre-launch checklist items can be scoped to specific projects
  • Faster database schema browsing, even with hundreds of tables
  • New setting to disable legacy metrics for smoother transition to Fact Tables
  • Sortable experiment results tables — quickly see top or bottom performers

Plus dozens of smaller fixes and performance improvements.

2025: A Year of Rapid Innovation

The 4.2 release is GrowthBook’s sixth major update in 2025, capping off what has easily been the biggest year of innovation in our company’s history. GrowthBook launched over 45 new features across four major themes in 2025:

  • Experimentation at Scale: New metrics, templates, dashboards, and analytics
  • Feature Management: Safe rollouts and feature analytics
  • Artificial Intelligence: A new MCP server and embedded AI capabilities
  • Developer Experience: Managed data warehouse, native Vercel integration, 24+ updated SDKs, enhanced server-side rendering, and support for new CMSs and FerretDB

Whether you’re on the Starter plan ready for more advanced experimentation and analytics or a Pro user building a culture of experimentation, we’re ready to help you grow. We’re excited to see what you build — and how you use these new tools to learn faster.

7,000 GitHub Stars and Counting
News


Oct 30, 2025
x
min read

Thank you for making GrowthBook the world’s largest open-source experimentation platform

GrowthBook passed 7,000 stars on GitHub this month thanks to you. Your support confirms our commitment to experimentation-led development and open-source transparency. We see you testing every day in the 100 billion+ feature flag lookups we handle, and the thousands of organizations actively using GrowthBook each month. 

To celebrate this milestone, let’s look back on how we’ve grown and ahead to where we’re going. Our goal is to help you go faster at scale. Let’s see how we do it.


What’s New with GrowthBook in 2025?

GrowthBook released more than 45 new features in our cloud and self-hosted experimentation platform in 4 key areas: data exploration, developer experience, advanced experimentation, and improving the experiment lifecycle with AI. As an engineering-first company, we believe that experiments should be easy and cheap to run so you can learn constantly. 

Better Data Exploration

What good is an experiment if you can’t easily analyze the results? GrowthBook provides full transparency by exposing the underlying SQL for your experiments. But we know you wanted more ways to explore your data, debug issues, and create custom reports and visualizations without the context switching. Now you can explore your data and build custom dashboards. 

Complexity happens fast when it comes to data analysis across teams and departments. Metric slices give everyone flexibility without complexity. For example, instead of separate revenue metrics for each product type, you can use metric slices to automatically generate distinct revenue metrics for each product type (such as “apparel” or “equipment”). Teams benefit from more granular and relevant analysis without duplicating definitions. Everyone stays on the same page. 

Accelerating Experimentation Culture

Why do so many engineering teams build their own experimentation platforms? So they get exactly what they want. GrowthBook helps teams migrate from homegrown tools to an experimentation culture by giving developers that same level of control. Customizable dashboards and frameworks help more teams run more experiments faster and learn from the results.

That’s why we developed experiment dashboards. Developers, data teams, and product managers create their own custom view to go deep on individual experiments. They get exactly what they need to highlight interesting results, hide the noise, and begin to tell a story with the data that everyone in the organization can understand. 

The Experiment Decision Framework helps teams make systematic, consistent decisions about when and how to conclude experiments. GrowthBook’s default modes include “do no harm” and “clear signal” with the option to customize with your own rules so you can iterate quickly.

For developers who want to skip the setup of a data source for our warehouse-native solution, we launched a Managed Warehouse option. Now, your team can go straight to feature management, experimentation, and product analytics without the data connection, cost, and refresh hassles.

Advanced Experimentation

The more experiments you run, the more advanced your experimentation program becomes. We believe that so many of you support GrowthBook because of the high bar we set for statistical rigor. We continued that commitment with features for sophisticated metrics, automated decision-making, and comprehensive measurement capabilities for high-frequency testing programs. Measure the long-term impact of changes and control outcomes with holdouts, multi-armed bandits, and safe rollouts.

With Insights, GrowthBook’s executive dashboard offers a 10,000-foot view across all of your organization’s experiments to understand what you’ve done and what you’ve learned. Help your team go further, faster by learning from experiments, exploring experiment timelines, and analyzing metric effects and correlations. Filter by project and date range, and view by win rate, scaled impact, and velocity.

Improving the Experiment Lifecycle with AI

It’s time to talk to your experimentation platform. The MCP server streamlines workflows and enables AI-powered automation and insights within your development environment. Connect to your favorite LLMs to manage feature flags, experiments, and other tasks without switching contexts. The MCP server works with Cursor, Claude, VS Code, and it’s open source. 

We’ve also embedded AI into GrowthBook. You can use natural language questions to generate SQL. Your GrowthBook assistant helps you follow best practices by checking hypotheses, summarizing metric descriptions, generating experiment summaries, and comparing past experiments to avoid duplication. 

Looking Ahead: The Future of Experimentation at GrowthBook

We continue to be inspired by our GitHub stargazers, Slack community members, and all the experimenters out there, committed to making everything better. As we prepare for the year ahead, we’re looking at a few key themes.

  • In this time of consolidation and disruption, data security and governance matter more than ever. Our warehouse-native approach lets you keep your data in-house under your control.
  • As AI-generated code becomes more pervasive, experimentation provides an essential check on whether code works and benefits the business.
  • Fostering a culture of experimentation does more than draw the signal from the noise. It helps you fail sooner, in the smallest ways possible, so you can accelerate success. 

Here's to the next 7,000 stars and beyond! If you haven't already, check out GrowthBook on GitHub—we'd love to see what you experiment with next.

Ready to join the experimentation revolution? Star us on GitHub, join our Slack community, or dive into the code. The future of product development is open, transparent, and data-driven. Let's build it together.

Ready to ship faster?

No credit card required. Start with feature flags, experimentation, and product analytics—free.