The Uplift Blog

GrowthBook Version 3.4
Releases
Product Updates
3.4

Jan 15, 2025

It’s been 2 months since our last release, and we’re excited to bring you some highly requested features to kick off the new year.  This release includes over 150 changes, and we’ve highlighted some of the biggest ones below.

Custom Fields and Experiment Templates

Experiment Templates showing default field configuration including hypothesis, metrics, and targeting conditions


GrowthBook has always been flexible with custom tags and full markdown support, but now we’re making it even easier to standardize your workflows.

  • Custom Fields: Define your own fields for feature flags and experiments. Link to Jira tickets, tag impacted product surfaces, and enforce team-wide documentation standards.
  • Experiment Templates: Configure default values for all experiment fields, including hypothesis, tags, targeting conditions, metrics, and more. Create templates for different types of experiments your team runs. When starting a new experiment, your team can use a template, with the option to make this step mandatory for consistency.

These features help enforce consistency and structure across your entire organization, and we’re really excited to see all the ways they get used!

Available to all Enterprise customers. Read more about Custom Fields and Experiment Templates in our docs.

Shareable Experiment Reports

Public shareable experiment link showing results view for sharing with external stakeholders

Need to share experiment results with stakeholders outside GrowthBook? Now you can generate public shareable links for specific experiments.

  • Defaults to private, but you can opt in to share on an experiment-by-experiment basis.
  • Share insights in your company Slack, collaborate with external partners, update leadership, or even showcase big wins on LinkedIn.

Read more about Shareable Reports in our docs. This feature is available to all organizations, both free and paid.

New Metrics - Retention, Count Distinct, and Max

We’ve added new kinds of metrics you can define on top of Fact Tables.

  • Retention: Measure the percentage of users who return within a specific time window.  For example, a “Week 2 Retention” metric that tracks the percentage of users who engaged with your app 7-14 days after seeing your experiment.
  • Count Distinct: Aggregation option for mean, ratio, and quantile metrics.  For example, a “Unique Videos” metric that counts all of the distinct video ids a user watched, ignoring repeats.
  • Max: Aggregation option for mean, ratio, and quantile metrics.  For example, a “High Score” metric that measures the highest score each user obtained in your game, no matter how many attempts it took to get there.

Retention metrics are available to Pro and Enterprise customers, and the new aggregation options are available to all organizations, both free and paid.  Read more about these New Metrics in our docs.
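
To make the new aggregations concrete, here is a small TypeScript sketch of what Count Distinct and Max compute for each user. This is just an illustration with a made-up event shape; in practice, GrowthBook generates the equivalent SQL against your fact tables.

type VideoEvent = { userId: string; videoId: string; score: number };

const events: VideoEvent[] = [
  { userId: "u1", videoId: "a", score: 120 },
  { userId: "u1", videoId: "a", score: 95 }, // repeat view of the same video
  { userId: "u1", videoId: "b", score: 300 },
  { userId: "u2", videoId: "a", score: 40 },
];

// "Unique Videos": COUNT(DISTINCT videoId) per user, ignoring repeats.
const uniqueVideos = new Map<string, Set<string>>();
for (const e of events) {
  if (!uniqueVideos.has(e.userId)) uniqueVideos.set(e.userId, new Set());
  uniqueVideos.get(e.userId)!.add(e.videoId);
}
// => u1: 2, u2: 1

// "High Score": MAX(score) per user, no matter how many attempts it took.
const highScore = new Map<string, number>();
for (const e of events) {
  highScore.set(e.userId, Math.max(highScore.get(e.userId) ?? -Infinity, e.score));
}
// => u1: 300, u2: 40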

Environment Forking

When creating a new environment, you can now choose a parent environment to “fork” from - for example, creating a new Staging environment that is a fork of Production.  This will copy all feature flag rules so the new environment starts out as an exact clone of the parent.  After that point, the environments will be treated independently.

Environment forking becomes really powerful when combined with our REST API.  For example, your CI/CD pipeline could fork a new ephemeral environment for each PR and automatically clean it up when the PR is closed.

The UI for manually creating new environment forks is available to all organizations, but programmatic access via the API is only available to Enterprise customers.  Read more about environment forks in our docs.
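
As a rough sketch of that CI/CD idea, the snippet below forks an ephemeral environment for a PR via the REST API. The endpoint path and body fields here are assumptions for illustration only; check the REST API docs for the actual contract.

// Hypothetical sketch: endpoint and field names are illustrative.
const API_HOST = "https://api.growthbook.io";

async function forkEnvironmentForPR(prNumber: number): Promise<void> {
  const res = await fetch(`${API_HOST}/api/v1/environments`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROWTHBOOK_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      id: `pr-${prNumber}`, // ephemeral environment name
      parent: "production", // assumed field for the fork source
    }),
  });
  if (!res.ok) throw new Error(`Failed to fork environment: ${res.status}`);
}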

CMS Integrations - Contentful and Strapi

Contentful and Strapi integration setup for running A/B tests on headless CMS content

A/B testing inside your CMS just got easier. We now support Contentful and Strapi, two of the most popular headless CMS platforms.

Read our new guides for Contentful and Strapi to see how easy it is to experiment with your CMS content.  We’d love to add more integrations like this in the future, so let us know which ones you most want to see!

Updated Node.js and Edge SDKs

We’ve made some huge changes to our JavaScript SDK to better support server-side applications. The new `GrowthBookClient` class is up to 3x faster and more memory efficient for Node.js applications. Read the new Node.js SDK docs or check out specific tutorials for Express.js or Deno/Hono.
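
The new pattern looks roughly like this (a sketch; see the Node.js SDK docs for exact signatures): one shared client per process, with a lightweight user context passed to each evaluation instead of a new GrowthBook instance per request.

import { GrowthBookClient } from "@growthbook/growthbook";

// One shared client for the whole process.
const client = new GrowthBookClient({
  apiHost: "https://cdn.growthbook.io",
  clientKey: "sdk-abc123", // placeholder key
});
await client.init({ timeout: 1000 });

// Per-request evaluation: pass a user context instead of constructing
// (and destroying) a GrowthBook instance on every request.
function greet(userId: string): string {
  const userContext = { attributes: { id: userId } };
  return client.isOn("my-boolean-feature", userContext)
    ? "Hello, boolean-feature!"
    : "Hello, fallback!";
}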

We also revamped our Edge SDKs to be more flexible. New Lifecycle Hooks let you perform custom logic at various stages. This allows for custom routing, user attribute mutation, header and body (DOM) mutation, and custom feature flag and experiment implementations – while still preserving the ability to automatically run Visual and URL Redirect experiments and SDK hydration.

Simulate Features

Simulate Features page showing all feature flags evaluated simultaneously for a given set of user attributes

Way back in GrowthBook 2.5, we launched Archetypes to help you simulate how a specific feature flag behaves for a set of user attributes. This release adds a dedicated landing page for managing Archetypes, along with a new “Simulate” section that evaluates all of your feature flags at once, so you can instantly verify how features will behave for any user.

The ability to simulate features is available to all Pro and Enterprise customers, but saving user attributes as a reusable “Archetype” is available only to Enterprise customers.

The Best A/B Testing Platforms of 2025: Features, Comparisons, and Expert Recommendations
Experiments

Jan 4, 2025

Imagine making every product decision with data, powered by the best A/B testing platforms of 2025. These tools have become essential for businesses hungry to innovate faster and build with confidence. This new generation of tools prioritizes performance, flexibility, and organization-wide applicability, ushering in a paradigm known as experimentation-driven development. No longer confined to marketing departments, A/B testing is now a cornerstone for entire product teams. In this guide, we’ll explore key platforms, focusing on their strengths, limitations, and suitability for different needs. By the end, you’ll have a clearer understanding of which platform is right for your organization.

Modern Platforms: Innovating for Today’s Needs

GrowthBook: Experimentation-Driven Development at its best

We built GrowthBook to be the tool we always wished we had—one that balances developer-friendly workflows with robust experimentation capabilities. We excel in flexibility, scalability, and developer-friendly features, and we seamlessly integrate feature flagging and A/B testing to deliver unmatched usability and precision. Picture your team quickly toggling features while running precise experiments—all within a platform that feels like it was designed just for developers. Here's where GrowthBook is unique:

  • Warehouse-Native Integration: GrowthBook integrates directly with your existing data warehouse, ensuring low-latency experiment evaluations and reliable metrics.
  • Open-Source and Self-Hosting Options: Ideal for industries with stringent compliance and data sovereignty requirements, GrowthBook’s open-source nature empowers teams to self-host if needed.
  • Developer-First Approach: Robust SDKs, CI/CD compatibility, and transparent SQL insights allow engineering and data teams to fine-tune experiments with precision.
  • Seamless Integration: GrowthBook easily fits into existing workflows, leveraging tools you already use to maximize ROI.

With a focus on unlocking experimentation-driven product development, we're the trusted choice for teams aiming to scale innovation while maintaining performance.

Statsig: All-in-One Simplicity with Limits

Statsig provides A/B testing, feature flagging, and session recording in a unified platform. It’s sufficient for teams seeking a simple, all-in-one solution, but its statistical methods, though they include features like CUPED, may not be as robust as those offered by platforms designed specifically for data scientists. For example, it lacks flexibility in supporting different types of experiments, such as those with very large or very small user bases. The platform’s rising costs, especially when scaling beyond the initial 5 million events included in the Pro plan, make it less appealing. Additionally, its limitations in integrating with data warehouses may not meet the needs of organizations with sophisticated data practices.

Eppo: Statistical Precision for Data-Driven Teams

Eppo shines in its statistical depth, offering precise and actionable experiment results. Its warehouse-native architecture aligns with organizations that prioritize high-quality experimentation. However, Eppo’s feature flagging functionality, while supporting core features such as feature gates and rollouts, may not be as comprehensive as some competitors'. For example, it may lack advanced features like user segmentation or real-time monitoring found in more mature platforms. For data-driven organizations primarily focused on experimentation, Eppo provides an excellent foundation. However, teams seeking a broader feature flagging toolset with more advanced capabilities may need to consider alternatives.

Legacy Platforms: Struggling to Keep Up

LaunchDarkly: Feature Management First, Experimentation Second

LaunchDarkly excels in feature flagging but treats A/B testing as an afterthought. This lack of integration can lead to a clunky user experience, making it less suitable for teams aiming for seamless experimentation workflows. 

Optimizely: High Costs, Fragmented Experience

Once a market leader, Optimizely now faces challenges with its high pricing and fragmented user experience. While its A/B testing capabilities remain robust, the platform’s cost makes it viable only for large enterprises. Compared to Optimizely, GrowthBook reduces both cost and complexity.

Adobe Target: Limited Flexibility in a Closed Ecosystem

Adobe Target is tightly integrated into Adobe’s ecosystem, making it a logical choice for existing Adobe customers. However, its high costs and lack of flexibility make it less appealing for agile teams seeking modern experimentation workflows. GrowthBook is a clear alternative to Adobe Target, with warehouse-native analytics.

Other Platforms: Niche Capabilities

PostHog: Lightweight Analytics with Basic Experimentation

PostHog focuses on product analytics and offers basic experimentation features. While its open-core model appeals to startups, the platform’s limited self-hosting capabilities and lightweight experimentation tools make it less suitable for teams with advanced needs. For advanced experimentation and robust feature flags, consider GrowthBook compared to PostHog.

Choosing the Right Platform

When choosing an A/B testing platform, think about what matters most to your organization. Are you looking for scalability, compliance, or seamless integration with your current tools? Matching these priorities with the right platform can make all the difference.

  • For advanced experimentation workflows: GrowthBook delivers unmatched flexibility, scalability, and developer-first features.
  • For data teams: Eppo offers statistical rigor and warehouse-native integration but falls short with limited feature flagging capabilities, making it less ideal for teams needing a comprehensive solution.
  • For all-in-one solutions: Statsig provides simplicity but may not scale with advanced needs.
  • For feature flagging-first teams: LaunchDarkly suffices but lacks depth in experimentation.
  • For legacy ecosystem users: Optimizely and Adobe Target remain options, albeit costly and limited in flexibility.
  • For lightweight needs: PostHog provides budget-friendly analytics-driven tools for smaller teams.

Conclusion

Innovation thrives on experimentation—it’s how teams transform ideas into measurable success. Choosing the right A/B testing platform can accelerate your ability to iterate, scale, and innovate. GrowthBook’s modular design makes it perfect for organizations aiming to scale. Imagine starting with basic experiments and effortlessly expanding into enterprise-grade workflows—it’s a platform built to evolve with your team’s needs. Whether you’re prioritizing compliance, advanced experimentation, or seamless developer integration, GrowthBook’s strengths make it the clear leader in the 2025 A/B testing platform landscape.

Ready to scale your experiments?

Get started with GrowthBook for free today.

Move Fast Without Breaking Things: How GrowthBook and Rollbar Empower Product and Ops Teams
Feature Flags
Platform

Dec 2, 2024

As technology evolves at breakneck speed, staying ahead means constantly innovating. Product teams are always experimenting with new features to improve user experiences, while operations teams focus on keeping systems reliable and performant. But what happens when experimentation introduces risks—like unexpected errors or performance issues?

That’s where the integration of GrowthBook’s feature flagging and Rollbar’s error monitoring comes in. Together, they empower product and ops teams to collaborate effectively, innovate confidently, and safeguard user experiences.

Innovate Confidently With Real-Time Error Monitoring

GrowthBook’s feature flags make it easy to roll out features or experiments to specific user segments. Rollbar’s real-time error monitoring ensures you’re alerted immediately if something goes wrong, allowing your team to make fast, informed decisions.

Example: Imagine you’re rolling out a new recommendation algorithm to 10% of users. Rollbar detects an issue affecting a specific browser. With GrowthBook, you can disable the feature for those users instantly, without impacting the rest of the rollout. It’s experimentation without unnecessary risk.

Experiment Without Disruptions

Ops teams often face a tough trade-off between moving fast and keeping systems stable. By integrating Rollbar with GrowthBook, you create a safety net for experiments. With feature flags, your team can quickly address issues without waiting on a developer to ship a fix.

Clear Insights for Continuous Improvement

Both product and ops teams rely on data to refine their strategies. The GrowthBook and Rollbar integration gives you visibility into how new features and experiments impact your application’s performance. With these insights, you can iterate faster and more effectively, making adjustments with precision.

Key Benefit: Use real-time data to confidently refine your experiments and features.

Building Confidence in Experimentation

Innovation doesn’t happen without experimentation, but for experimentation to thrive, it needs buy-in across the organization. GrowthBook and Rollbar make this easier by combining robust monitoring with the flexibility of feature flags. Teams can move quickly while still prioritizing reliability, making experimentation a safe, scalable part of your growth strategy.

Faster, Safer Innovation Starts Here

Product and ops teams need tools that let them move quickly without sacrificing quality. The GrowthBook + Rollbar integration is built for exactly that. It’s the ultimate way to deliver user experiences that delight while keeping your systems stable.

Want to see it in action? Learn more about the integration and discover how GrowthBook and Rollbar can help your team innovate smarter and safer.

Introducing Multi-Armed Bandits in GrowthBook
Analytics
Experiments
Platform
Product Updates
3.3

Nov 15, 2024

Bandits have been released in beta as part of GrowthBook 3.3 for Pro and Enterprise customers. See our documentation on getting started with bandits.

Multi-armed bandits allow you to test many variations against one another, automatically driving more traffic towards better arms, and potentially discovering the best variation more quickly.

Bandits are built on the idea that we can simultaneously…

  • … explore different variations by randomly allocating traffic to different arms; and
  • … exploit the best-performing variations by sending more and more traffic to winning arms.

Graph showing the exploration and exploitation stages of bandits where traffic is allocated to better performing variations
Bandits start with equal allocations across variations and then allocate more traffic to the variations that perform better.

In online experimentation, bandits can be particularly useful if you have more than 4 variations you want to test against one another. Scenarios where bandits can be helpful include:

  • You are running a short-term promotional campaign, and want to begin driving traffic to better variations as soon as there is any signal about which variation is best.
  • You have many potential designs for a call-to-action button for user sign-up, and you care more about picking the design that drives the most sign-ups than about learning why it wins.

Furthermore, bandits work best when:

  • Reducing the cost of experimentation is paramount. This is true in cases like week-long promotions, where you don’t have time to test 8 versions of a promotion and then launch it, so you want to test 8 versions and begin optimizing after just a few hours or on the first day.
  • You have a clear decision metric that is relatively stable. Ideally, your decision metric should not be an extreme conversion rate metric (e.g. < 5% or > 95%), or if it’s a revenue metric, you should apply capping to prevent outliers from adding too much variance.
  • You have many arms you want to test against one another, and care less about learning about user behavior on several metrics than about finding a winner.
Bandit leaderboard from GrowthBook, showing 3 variations' performance over time. Variation 2 is the clear winner!
Bandits help identify the best-performing arm among many variations.

Read more about when and how to use bandits in our documentation.

The following table provides a summary of the differences between bandits and standard experiments.

| Characteristic | Standard Experiments | Bandits |
| --- | --- | --- |
| Goal | Obtaining accurate effects and learning about customer behavior | Reducing the cost of experimentation when learning is less important than shipping the best version |
| Number of variations | Best for 2-4 | Best for 5+ |
| Multiple goal metrics | Yes | No |
| Changing variation weights | No | Yes |
| Consistent user assignment | Yes | Yes (with Sticky Bucketing) |

What makes GrowthBook’s bandits special?

GrowthBook's bandits rely on Thompson Sampling, a widely used algorithm to balance exploring the value of all variations while driving most traffic to the better performers. However, our bandits differ in a few ways that ensure they work well in the context of digital experimentation.
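
To build intuition for the mechanic, here is a toy Thompson Sampling loop for conversion-rate arms. This is purely illustrative, not GrowthBook's production implementation, which adds the adjustments described below.

type Arm = { successes: number; failures: number };

// Draw from Beta(s + 1, f + 1) via the order-statistic trick: the
// (s + 1)-th smallest of (s + f + 1) uniform draws is Beta-distributed.
function sampleBeta(successes: number, failures: number): number {
  const u = Array.from({ length: successes + failures + 1 }, Math.random);
  return u.sort((a, b) => a - b)[successes];
}

// An arm's share of traffic is (approximately) the posterior probability
// that it is the best arm, estimated here by Monte Carlo.
function nextWeights(arms: Arm[], draws = 10_000): number[] {
  const wins = arms.map(() => 0);
  for (let i = 0; i < draws; i++) {
    const samples = arms.map((a) => sampleBeta(a.successes, a.failures));
    wins[samples.indexOf(Math.max(...samples))]++;
  }
  return wins.map((w) => w / draws);
}

// Arm B is slightly ahead, so it gets most (but not all) of the traffic.
console.log(nextWeights([
  { successes: 45, failures: 55 },
  { successes: 50, failures: 50 },
]));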

Consistent user experience

Some bandit implementations do not preserve the user experience across sessions, making them tricky to use when a stable experience is important. Because bandits dynamically update the percentage of traffic going to each variation, returning users whose assignments are not preserved may be exposed to multiple variations over the course of a bandit.

This can lead to:

  • Bad user experiences where your product frequently changes for an individual customer.
  • Biased results.

GrowthBook uses Sticky Bucketing, which stores each user’s assigned variation in a persistence layer of your choice, such as a cookie, so that returning users always get the same experience, even after the bandit has updated variation weights.

Setting up Sticky Bucketing in our HTML SDK is as easy as adding a parameter to our script tag.

<script
  async
  data-client-key="CLIENT_KEY"
  src="https://cdn.jsdelivr.net/npm/@growthbook/growthbook/dist/bundles/auto.min.js"
  data-use-sticky-bucket-service="cookie"
></script>
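
If you use the JavaScript SDK directly instead of the HTML snippet, the equivalent is passing a sticky bucket service when constructing the instance. A minimal sketch (confirm the exact export names in the SDK docs):

import {
  GrowthBook,
  LocalStorageStickyBucketService,
} from "@growthbook/growthbook";

// Persist assignments in localStorage so returning users keep their
// original variation even after the bandit updates its weights.
const gb = new GrowthBook({
  apiHost: "https://cdn.growthbook.io",
  clientKey: "sdk-abc123", // placeholder key
  stickyBucketService: new LocalStorageStickyBucketService(),
});
await gb.init();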


Accommodates changing user behavior

GrowthBook’s bandits use a weighting method to prevent changing user behavior over time from biasing your results.

What is the issue? As your bandit runs, two things are changing: your bandit updates traffic allocations to your variations, and the kind of user entering your experiment changes (e.g., due to day-of-the-week effects). The correlation between these two can cause biased results if not addressed.

Imagine the following scenario:

You run a bandit that updates daily. You start your bandit on Friday and you have two variations that have a 50/50 split (100 observations per arm). You observe a 45% conversion rate in Variation A, and a 50% conversion rate in Variation B. After the first bandit update, the weights become 10/90 (just as an illustration, the actual values would be different). The total traffic on Saturday is also 200 users, but this time Variation B gets 90% of the traffic. Conversion rates on weekdays tend to be higher than on weekends, regardless of variation. On Saturday, you observe a 10% conversion rate in Variation A and a 15% conversion rate in Variation B. On both Friday and Saturday, Variation B has larger conversion rates, but if you naively combine the data across both days, Variation A looks like the winner:

| | Friday | Saturday | Combined |
| --- | --- | --- | --- |
| Variation A | 45/100 = 45% | 2/20 = 10% | 47/120 ≈ 39% |
| Variation B | 50/100 = 50% | 27/180 = 15% | 77/280 = 27.5% |

The combined conversion rate for Variation B is 27.5%, while for Variation A it is 39%, despite Variation B outperforming Variation A on both days of the experiment. Clearly, something is wrong here. In fact, sharp-eyed readers might notice this is a case of Simpson’s Paradox.

How did we solve it? We use weights to estimate the mean conversion rate for each bandit arm under the scenario that equal experimental traffic was assigned to each arm throughout the experiment. In this scenario, we can recompute the observed conversion rates as if the arms received equal traffic (e.g., on Saturday we had 10/100 conversions instead of 2/20). Using these adjusted conversion rates, the combined conversion rates now make sense:

| | Friday | Saturday | Combined |
| --- | --- | --- | --- |
| Variation A | 45/100 = 45% | 10/100 = 10% | 55/200 = 27.5% |
| Variation B | 50/100 = 50% | 15/100 = 15% | 65/200 = 32.5% |

Variation B is now the winner. By accounting for changes in overall traffic each day, the combined conversion rate now appropriately reflects differences in conversion rates and traffic variation over time. Our bandit applies a similar logic to ensure that day-of-the-week effects and other temporal differences do not bias your bandit results.
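
Here is the same arithmetic as a small TypeScript sketch, reproducing the adjusted rates above (a simplified version of the idea; the production estimator also accounts for variance):

type Day = { exposed: number; converted: number };

// Re-weight each day's observed rate as if the arm had received a fixed,
// equal share of traffic (here, 100 users per day), then combine.
function adjustedRate(days: Day[], usersPerDay: number): number {
  let conversions = 0;
  let users = 0;
  for (const d of days) {
    conversions += (d.converted / d.exposed) * usersPerDay;
    users += usersPerDay;
  }
  return conversions / users;
}

const variationA: Day[] = [
  { exposed: 100, converted: 45 }, // Friday: 50% of traffic
  { exposed: 20, converted: 2 },   // Saturday: 10% of traffic
];
const variationB: Day[] = [
  { exposed: 100, converted: 50 }, // Friday: 50% of traffic
  { exposed: 180, converted: 27 }, // Saturday: 90% of traffic
];

console.log(adjustedRate(variationA, 100)); // 0.275 → 27.5%
console.log(adjustedRate(variationB, 100)); // 0.325 → 32.5%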

Built on a leading warehouse-native experimentation platform

GrowthBook is the leading open-source experimentation platform. It is designed to live on top of your existing data infrastructure, adding value to your tech stack without complicating it.

GrowthBook integrates with BigQuery (GA4), Snowflake, Databricks, and many more data warehouse solutions. If your events land in your data warehouse within minutes, then GrowthBook bandits can adaptively allocate traffic within hours or less.

GrowthBook is warehouse native, easily connecting to BigQuery, Databricks, Snowflake, ClickHouse, PostgreSQL, and more.
GrowthBook integrates with your existing data infrastructure (e.g. BigQuery, Databricks, Snowflake, Postgres, ClickHouse, and more)

Get started

To get started with Bandits in GrowthBook, check out our bandits set-up guide, or if you’re new to GrowthBook, get started for free.

GrowthBook Version 3.3
Releases
Product Updates
3.3

Nov 12, 2024

We’ve been hard at work on GrowthBook version 3.3, which includes Multi-Armed Bandits, powerful new metric capabilities, a new design system, and an exciting announcement! Here’s everything you need to know:

Multi-Armed Bandits

Multi-Armed Bandit experiment showing dynamic traffic allocation shifting toward top-performing variations in real time

Multi-Armed Bandits are experiments that dynamically allocate traffic to maximize efficiency, sending more traffic to top-performing variations as tests run. This minimizes losses and reduces risks when testing multiple variants.

Multi-Armed Bandits were months in the making, and we're excited to share them with everyone. Bandits are currently in beta and available to all Pro and Enterprise customers.

Metric Improvements

In this release, we've made huge improvements to every aspect of metrics and fact tables. We don't have room to list them all, but here are a few highlights.

Metric Groups

Save and organize groups of metrics so you can streamline experiment setup. Groups are passed by reference so updates will apply to all experiments using the group. You can also adjust the order in which metrics are shown:

Metric Groups UI showing saved metric collections with drag-to-reorder and pass-by-reference updates across experiments

Live SQL Preview

Now available when creating or editing fact table metrics, so you can see how metrics behave under the hood in real time.

Live SQL Preview in the fact table metric editor showing real-time query output as metric settings are adjusted

Inline and User Filters

Inline and user filters simplify the UI and enable a new class of fact table metrics that were previously impossible. Learn more in our docs.

Inline and User Filters for fact table metrics, enabling a new class of event-type and user-scoped metric definitions

These upgrades empower you to create, organize, and analyze metrics more efficiently than ever before.

New Design System

We've started migrating GrowthBook to a new design system built on Radix UI. While this work is ongoing, you’ll already notice slicker visuals, better keyboard navigation, and improved Dark Mode support!

🚀 Built-in Managed Warehouse (coming soon)

Not every team has the resources to set up and maintain its own data warehouse, which is why we’re launching a fully managed ClickHouse option within GrowthBook Cloud!

This Built-in Warehouse sits on top of our existing SQL and stats engine, so it’s fully compatible with all of our advanced experimentation settings, and you’ll automatically benefit from ongoing improvements to our Warehouse-Native offering.

We have a limited number of spots in our private beta, so reach out if you’re interested in being among the first to try it out!

Announcing GrowthBook on JSR
Feature Flags
Platform
Product Updates

Nov 4, 2024

GrowthBook is committed to supporting modern platforms, bringing advanced feature flagging and experimentation to where you are. We’re excited to announce the availability of our JavaScript SDK on JSR, the modern open-source JavaScript registry. This integration empowers JavaScript and TypeScript developers with a seamless experience for implementing and managing feature flags in their applications.

JSR simplifies the process of publishing and importing JavaScript modules, offering robust features like TypeScript support, auto-generated documentation, and enhanced security through provenance attestation. Publishing our SDK on JSR brings these benefits to GrowthBook users, streamlining the integration and use of feature flagging in their development workflows.

Using the GrowthBook JS SDK via JSR offers an excellent developer experience, with first-class TypeScript support, auto-generated documentation accessible in your code editor, and more.

How to Install GrowthBook from JSR

Get started with GrowthBook using the deno add command:

deno add jsr:@growthbook/growthbook

Or using npm:

npx jsr add @growthbook/growthbook

The above commands will generate a deno.json file, listing all project dependencies.

{
  "imports": {
    "@growthbook/growthbook": "jsr:@growthbook/growthbook@0.1.2"
  }
}


deno.json

Use GrowthBook with Express

Let’s use GrowthBook with an Express server. In our main.ts file, we can write:

import express from "express";
import { GrowthBook } from "@growthbook/growthbook";

const app = express();

// Create a fresh GrowthBook instance per request and clean it up when the
// response closes. (In TypeScript, you'd augment Express's Request type to
// add the `growthbook` and `user` properties.)
app.use(function (req, res, next) {
  req.growthbook = new GrowthBook({
    apiHost: "https://cdn.growthbook.io",
    clientKey: "sdk-qtIKLlwNVKxdMIA5",
  });

  req.growthbook.setAttributes({
    id: req.user?.id,
  });

  res.on("close", () => req.growthbook.destroy());

  req.growthbook.init({ timeout: 1000 }).then(() => next());
});

app.get("/", (req, res) => {
  const gb = req.growthbook;

  if (gb.isOn("my-boolean-feature")) {
    // Return here so we don't attempt to send a second response below.
    return res.send("Hello, boolean-feature!");
  }

  const value = gb.getFeatureValue("my-string-feature", "fallback");

  res.send(`Hello, ${value}!`);
});

app.listen(8000, () => console.log("Listening on port 8000"));


Finally, run the following command to start the server:

deno -A main.ts

Depending on how you’ve set up your feature flags in GrowthBook (sign up for free), the response will be different:

Web browser showing default response, "Hello, fallback!"
Express server response showing "Hello, fallback!" when no feature flag value is configured in GrowthBook

Check out our official docs to learn more about feature flags, creating and running experiments, and analyzing experiments.

What’s next?

With GrowthBook's JS SDK now on JSR, it’s even easier to bring the power of feature flags and A/B testing to any JavaScript environment.

5 Ways to Use Feature Flags for Smarter Releases
Feature Flags

Oct 25, 2024

At their most basic, feature flags are like light switches: they let you turn features on and off. But behind this simple function lies a world of possibilities. With feature flags, you can ship faster, reduce risk, and deliver personalized experiences without constant code changes or complicated deployments. In this guide, we’ll explore 5 essential ways to use feature flags in GrowthBook to optimize your software development process and ship smarter.

Morpheus meme: What if I told you... You can ship faster and more safely

1. A/B Testing

One of the most common use cases for feature flags is to power A/B tests. A/B testing lets you compare a control (your current version) with one or more variants to see which performs better. Whether you're optimizing call-to-action copy, testing pricing models, or tweaking product recommendations, A/B testing helps you make data-driven decisions.

Mean Girls meme: Get in loser... You're in this variant

Example: Testing Calls to Action

Pretend you have a coffee company called Split Bean, founded by former nuclear physicists. Its current homepage headline, "Engineered for Perfection: Coffee Crafted by Scientists," communicates its unique story. But what if a punchier headline would sell more beans? Time to set up a test to find out.

Split Bean coffee company homepage showing the control headline

  1. Create a Feature Flag: In GrowthBook, navigate to the Features section and create a new feature called headline. Set the default value to your current headline.

Creating a feature called headline in GrowthBook

  2. Add an Experiment: After saving the flag, add an Experiment Rule to split traffic between your original headline and the new variant. For a punchier approach, we’ll try “Aromas That’ll Slap You Awake Faster Than a Cold Shower.”

Creating an experiment in GrowthBook
Two modals showing 1. Adding an A/B experiment and 2. Adding two headline variations for those modals
Split Bean coffee company homepage with the variation headline active

  3. Analyze the Experiment: We'll need to wait for some customers to visit, but which version do you think will win? (See the code sketch below for how the page reads this flag.)
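
On the code side, reading the flag is a one-liner once the SDK is set up. A minimal browser sketch (placeholder client key; derive the visitor ID however you identify users):

import { GrowthBook } from "@growthbook/growthbook";

const visitorId = "visitor-123"; // however you identify visitors

const gb = new GrowthBook({
  apiHost: "https://cdn.growthbook.io",
  clientKey: "sdk-abc123", // placeholder key
  attributes: { id: visitorId }, // used for consistent variation assignment
});
await gb.init();

// Falls back to the control headline until the experiment assigns a variation.
const headline = gb.getFeatureValue(
  "headline",
  "Engineered for Perfection: Coffee Crafted by Scientists"
);
document.querySelector("h1")!.textContent = headline;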

A/B testing is a powerful way to ensure you’re making decisions based on real data, which is key to driving engagement and loyalty and, in turn, building a better product for your customers. 💪

2. Personalization with Feature Flags

Personalization is the key to driving engagement and loyalty, and feature flags make it easy to deliver tailored experiences to different segments of users.

Example: Rolling Out a Custom Blend Feature

Our favorite coffee company, Split Bean, has launched a custom-blend feature that lets users choose beans, roast levels, and artwork.

But this new feature isn’t available to just anyone—initially, it was only available to US customers on an annual subscription. Feature flags offer a streamlined way to restrict custom blends to only these customers.

  1. Create a Feature Flag: Set up a custom-blend flag that defaults to false.
  2. Add Targeting Rules: Use GrowthBook’s Forced Rules to define who can see the feature—users in the US with an annual subscription.
GrowthBook override value dialog
Wondering where the location and subscription values come from? In GrowthBook, define these attributes under SDK Configuration → Attributes. Then, in your app, pass the same attributes to your GrowthBook instance, as in the sketch below. See the docs for details.
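
A minimal sketch of that runtime side (attribute names are illustrative and must match what you defined in GrowthBook):

import type { GrowthBook } from "@growthbook/growthbook";

type User = { id: string; country: string; plan: "monthly" | "annual" };

// Pass the same attributes you defined under SDK Configuration → Attributes
// so targeting rules like "US + annual subscription" can evaluate.
function identifyUser(gb: GrowthBook, user: User): void {
  gb.setAttributes({
    id: user.id,
    country: user.country,
    subscriptionPlan: user.plan,
  });
}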

By using feature flags, Split Bean can easily manage who gets access to their custom blend feature and refine their marketing efforts with precision.

Meme showing a cat dressed to the nines with coffee. The headline says "This custom blend is purrrfect"

This level of personalization can be applied to various scenarios, from displaying cookie banners based on location to offering exclusive content for premium subscribers.

3. Targeted Releases

When rolling out a new feature, testing with a small, trusted group of users first is a smart way to catch potential issues. This is where targeted releases come in handy.

Example: Split Bean Coffee Review

Split Bean is launching a new reviews section to showcase all the glowing feedback from their customers. While adding the section seems relatively straightforward, it requires several moving parts: database calls, responsive styling, and the inclusion of user-generated content—all of which could break unexpectedly!

New custom review section

By using feature flags, Split Bean can release the feature to internal beta users.

  1. Like before, create a feature flag. Call it show-reviews with a default value of false, so the reviews will be off for all users.
  2. Add a Forced Rule, set the value to true, and define rules so that reviews show when a user’s company is Split Bean and their beta attribute is true.
Forced value dialog with beta and company attributes set to true and Split Bean, respectively

Save these changes and publish the updated feature flag. When internal beta users visit the site, they'll see the new review section and—importantly—see if anything has gone disastrously wrong 🙃

Once you're confident the feature works as expected, you can gradually expand access to a larger audience, mitigating risks along the way.

4. Canary Releases

Anakin Padme meme: Panel 1: Shipped the update. Panel 2: With a canary release? Panel 3: Panel 4: With a canary release, right?

Even with thorough testing, releasing a major feature to all users at once can be risky. Engineers know this truth all too well. A canary release helps reduce this risk by rolling out the feature to a small percentage of users first.

Until the mid-1980s (!), miners used actual caged canaries to detect colorless, odorless poisonous gases that would kill the small birds before the miners. This grim history gave rise to the saying "canary in the coal mine" and, by extension, the "canary release," which is likewise used as an early-detection method for bugs.

Example: Percentage Rollout

Split Bean is launching a new multi-page checkout flow, designed to reduce cart abandonment. The new flow passed A/B tests and beta testing, but there’s still a chance things could go sideways. With a canary release, Split Bean enables the new checkout for just 5% of users to monitor for any unexpected issues.

  1. Create a Feature Flag: Set up a new-checkout-flow flag in GrowthBook.
  2. Add a Rollout Rule: Assign the new flow to 5% of users. As you gain confidence in the feature’s stability, gradually increase the rollout.
Percentage rollout GrowthBook dialogue

When the rollout hits 100%, remove the flag and make the change permanent in the codebase. This approach lets you catch potential issues early, making for smoother, low-stress releases.

5. Operational Flags

Finally, feature flags aren’t just for new features—they also give you powerful operational control over your app. One key use case is the kill switch ☠️ , which allows you to quickly disable problematic features in case of emergencies.

Example: Payment Integration Kill Switch

Split Bean’s CEO just made a deal with a new payment processor to save a few points per transaction. While the engineering team seems to have integrated it without issue, what happens if it fails during a high-volume period? That’s a lot of beans to lose.

With GrowthBook, every feature flag comes with a built-in kill switch. If something goes wrong, you can toggle the new-payment-processor flag off instantly to prevent further issues and revenue loss.
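
In code, the kill switch is just the same flag check that gated the rollout; flipping the toggle in GrowthBook reroutes traffic on the next SDK update. A sketch with illustrative processor stubs:

import type { GrowthBook } from "@growthbook/growthbook";

type Order = { id: string; amountCents: number };
type Processor = { charge(order: Order): Promise<void> };

// Hypothetical stand-ins for the real payment integrations.
const newProcessor: Processor = { charge: async () => { /* new provider */ } };
const legacyProcessor: Processor = { charge: async () => { /* old provider */ } };

async function chargeCustomer(gb: GrowthBook, order: Order): Promise<void> {
  // Toggling new-payment-processor off in GrowthBook routes all traffic
  // back to the legacy processor.
  if (gb.isOn("new-payment-processor")) {
    await newProcessor.charge(order);
  } else {
    await legacyProcessor.charge(order);
  }
}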

Feature flag page in GrowthBook, with the kill switch interface highlighted

In addition to kill switches, operational feature flags can be used for things like maintenance mode, where you temporarily disable parts of your app during an update. Just like Apple’s “Be right back” message during product launches, Split Bean can set a flag to pause sales while it announces new blends.

Apple's famous Be right back screen
Making it possible for anyone to categorically change your app at the literal flick of a switch might seem risky in itself. GrowthBook provides the ability to add a confirmation pop-up to the switch to prevent accidental clicks. Go to Settings → General → Feature Settings → Require confirmation when changing an environment kill switch.

Conclusion: Ship Faster and Smarter with Feature Flags

Feature flags offer more than just an on/off switch—they provide a flexible and powerful way to manage your app’s features and optimize user experiences. From A/B testing to canary releases and operational controls, feature flags in GrowthBook let you ship faster, reduce risk, and stay ahead of the competition.

Get Started Today: Create a free account, and see how easy it can be to build and deploy smarter!

Convincing Leadership to Adopt A/B Testing
Experiments

Oct 11, 2024

Many of today's leading companies rely on A/B testing to measure the impact of product changes and remove guesswork from decision-making, letting teams act on data rather than intuition. Shifting from an intuition-based process to an experimentation-driven one can be difficult, and getting buy-in from leadership is the single largest determinant of a successful experimentation program. This post outlines a strategic approach to gaining that buy-in: start small and demonstrate measurable impact.

Highlight the Value of A/B Testing

The first step in convincing leadership is to present A/B testing as a tool for continuous improvement rather than an extra burden. Here are a few key points to emphasize:

  1. Data-Driven Decision Making: A/B testing replaces guesswork with data, allowing the company to make informed decisions based on user behavior and preferences. It’s an objective way to measure the impact of product changes.
  2. Reduced Risk of Launching Ineffective Features: By testing new features on a small subset of users before rolling them out widely, you minimize the risk of launching something that doesn't resonate with users or hurts performance.
  3. Impact on Metrics that Matter: Connect A/B testing to the company’s core KPIs (e.g., conversion rates, user retention, revenue growth). Demonstrate how it can directly impact the metrics that leadership cares about most.

Start Small and Show Results

It’s often easier to get buy-in for something new when the initial investment is low. Start by running a few small-scale tests that are low-risk but have the potential for noticeable results. This strategy builds momentum and shows leadership the tangible benefits of A/B testing without requiring a major overhaul of the existing product development process.

Example: Improving Conversions with a Website Change
In one of our recent experiments, we tested a reordering of elements on our Getting Started page. The goal was to see if changing the layout could improve engagement, specifically whether more new accounts went on to create an organization after sign-up.

We ran the A/B test for three weeks, comparing the old design to the new one. The result? A 25% increase in the number of accounts that created an organization. Even more exciting, those accounts were 200% more likely to convert into paying customers than those with the older layout. This experiment was part of a series of iterative tests, demonstrating how small changes, backed by data, can have a large impact on core business metrics.

Steps

Step 1: Identify a Feature or Project

Select a feature or project where the impact is uncertain and where developers are open to testing. Ideally, this is a feature that has not yet been built, or one where there is some concern about its impact. Choose a feature that's tied to an important KPI for the business and gets enough traffic to generate results within 2 weeks.

Step 2: Integrate Testing into the Product Development Workflow

Float the idea of running an A/B test for this project and have the test integrated into the launch plan. Pick the goal metrics and estimate the experiment duration, given the power needed to detect the expected effect. Let the team know that should the experiment fail, you will likely roll back the feature and try again (or move on). The goal is to demonstrate how seamlessly A/B testing can be integrated into the product development workflow, enabling smarter decisions without slowing the process.
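
For the duration estimate, a standard two-proportion power calculation is enough for a back-of-the-envelope check (a textbook approximation, not a GrowthBook API):

// Users needed per variation at alpha = 0.05 (two-sided, z = 1.96) and
// 80% power (z = 0.84), using the normal approximation.
function usersPerVariation(baseline: number, expected: number): number {
  const zAlpha = 1.96;
  const zBeta = 0.84;
  const variance = baseline * (1 - baseline) + expected * (1 - expected);
  const effect = expected - baseline;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / effect ** 2);
}

// Detecting a lift from 10% to 12% conversion needs roughly 3,800 users
// per variation; compare that to the traffic you expect in 2 weeks.
console.log(usersPerVariation(0.1, 0.12)); // ≈ 3834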

Step 3: Share Results

Once a few small tests have been completed and shipping decisions have been made based on the results, it's time to communicate this to leadership. Experiment review meetings can help demonstrate that the results, while sometimes counter-intuitive, are valuable. Be sure to communicate:

  • The hypothesis behind the test and the variants (with screenshots if applicable).
  • The results, including the impact on key metrics.
  • The decisions you made based on the data (shipped, rolled back, reworked).
  • How these insights can inform future product decisions.

Make sure to highlight both wins and losses. Use the language of 'saves' for features that turned out to have a negative impact on metrics: these projects were prioritized, and without testing you might never have caught their negative effects.

Address Leadership Concerns

Leadership may have concerns about adopting A/B testing. Here are some concerns and how to address them:

  1. Time and Resources: Leaders might worry that A/B testing will slow down product development or require too many resources. Reassure them that, when implemented strategically, A/B testing can streamline decision-making and lead to more effective use of resources by focusing on rapid iterations and MVPs that either verify or contradict a hypothesis.
  2. Concern with Iterative Development: Some leaders may believe that truly innovative products are not created with iterative processes like experimentation-driven development. Counter this by emphasizing that A/B testing allows the company to take calculated risks and learn quickly before investing heavily. There are no projects that cannot be tested in some way to measure interest, even if they are wildly innovative.
  3. Cultural Resistance: Sometimes, leadership may resist shifting from intuition-driven decisions to a data-driven culture. In these situations, positioning A/B testing as a tool to enhance rather than replace intuition can help. A/B testing provides a feedback loop that sharpens decision-making.

Build a Culture of Experimentation

Once you’ve successfully demonstrated the value of A/B testing on a small scale, the next step is to foster a culture of experimentation. Encourage leadership to see A/B testing as an ongoing process that fuels innovation and continuous improvement. Over time, teams will become more comfortable using A/B testing to validate decisions and optimize the user experience.

  • Make Testing Routine: Incorporate A/B testing into every product development cycle as a natural part of the process.
  • Encourage Cross-Department Collaboration: A/B testing should be embraced by product teams and marketing, design, and engineering. When different departments are aligned around experimentation, the results are more impactful.
  • Celebrate Wins and Learn from Losses: Recognize successful tests and the insights gained from failed ones. A/B testing is about learning, not just about winning.

Quantifying the ROI of A/B Testing

Leadership often wants a clear financial justification for investing in an experimentation program. Unfortunately, the ROI of experimentation is a complicated number. Some tests, like straightforward optimizations, have a very clear impact, and you should definitely communicate those wins. But it is hard to put a number on spending 3 months building something that, through an iterative testing program, is more successful than what you would have built on intuition alone, or on the time saved by not building a doomed feature at all. Even failed tests provide value by preventing the launch of potentially harmful features and teaching your team what your users like. Remind leadership that every test provides insights that can drive smarter decisions in the future. Experimentation programs done well maximize learnings, and the effects of those learnings can be hard to quantify.

Conclusion

Convincing leadership to adopt A/B testing requires a thoughtful, measured approach. By starting small, demonstrating clear impact, and integrating testing into the product development process, you can build trust in the methodology. Over time, A/B testing can become an essential part of decision-making, leading to better products and stronger results.

If you’re looking to introduce A/B testing into your company, remember: start simple, stay aligned with business goals, and showcase results, no matter how small. In doing so, you’ll not only convince leadership but also set the foundation for a culture of data-driven innovation.

GrowthBook Version 3.2
Releases
Product Updates
3.2

Sep 26, 2024

We’re proud to announce the release of GrowthBook 3.2. This release includes many requested improvements to Saved Groups, experiment alerting, metrics, and the visual editor, among others. 

Check out the full details of this release below—and stay tuned for some major features we’re working on that will be coming very soon!

Big Saved Groups

Saved Groups UI showing CSV upload, ID search with pagination, and project-scoped group management

Saved Groups in GrowthBook are a great way to target users by ID or email, but the UI and SDK implementations made it difficult to scale beyond a few dozen values. In this release, we’ve made several changes to better support this use case.

  1. There’s a brand new UI for managing Saved Groups with CSV upload support, searching and browsing IDs with pagination, and other quality of life improvements.
  2. A new setting for SDK Connections lets you pass Saved Group values by reference.  This optimization can drastically reduce the payload size sent to SDKs when you frequently reuse groups across multiple features or experiments. Passing values by reference is currently supported only in JavaScript and React SDKs, but the rest will be supported soon.
  3. You can now restrict Saved Groups to specific projects. This is especially useful for larger organizations with many teams that want to better organize their Saved Groups.

Experiment Significance Alerts

Webhook notification showing experiment goal metric significance alert sent to Slack

We completed a big overhaul of our webhook notification system, which will allow us to rapidly add support for new events and filtering capabilities going forward. First up is one of our most requested features—the ability to alert when a goal metric in an experiment reaches significance. You can now configure alerts for this event and send them to Slack, Discord, or a custom endpoint.

Keep an eye out for new events as we add them, and let us know what you’d most like to see next! We’re super excited about all of the new use cases this will unlock.

Metric Insights

Fact metric graphs showing daily average, daily sum, and histogram alongside a Recent Experiments list with Lift column

We made two big changes to metrics in this release. First, we added graphs to fact metrics.  After creating a fact metric, you can now run an analysis that will look at recent data and display several helpful graphs depending on the metric type. For mean metrics, for example, we show an overall count, daily average, daily sum, and a histogram of metric values. This can help you verify that the metric is set up correctly and reporting the values you expect before adding it to an experiment.

Second, we revamped the Recent Experiments list on metric pages. You can now sort by different columns, and, most importantly, there is a new column for Lift, which shows how much the metric changed in each experiment. This lets you answer a critical question—which experiments had the biggest impact on my metric?

Visual Editor Improvements

Visual editor showing inline text editing with direct double-click interaction on page elements

We’ve been making a number of UX improvements to the visual editor over the past month to make it easier to use and less error-prone. The most noticeable change is the ability to edit element text directly on the page. Just double-click on any text element and start typing!

We have a lot more exciting changes planned here, so stay tuned!

Vercel Integration

Vercel Edge Config sync showing automatic feature flag updates within seconds of changes in GrowthBook

GrowthBook has always had first-class support for Vercel and the Next.js ecosystem (it’s what GrowthBook itself is built with), and we’re proud to now support even more use cases on this platform.

You can now configure any SDK Connection to sync data to Vercel Edge Config. Any time a feature flag or experiment changes in GrowthBook, we automatically update Edge Config with the latest value within seconds. This is especially powerful for server-side rendering, where latency is critical—reading from Edge Config is much faster than making a network request to GrowthBook’s servers.

We’ve also added a guide to our docs detailing how to integrate GrowthBook with the @vercel/flags library and the Vercel Toolbar for an even more seamless experience.

Azure SCIM Support

This latest release supports SCIM user provisioning for enterprises using Azure AD (also known as Microsoft Entra). You can now fully configure your users and teams within Azure, and they will be synced to GrowthBook. SSO and SCIM require a GrowthBook Enterprise license.
