The Uplift Blog

Introducing the First MCP Server for Experimentation and Feature Management

May 19, 2025

AI coding tools make it easier than ever to write code and ship features quickly. But as the volume of code we produce grows, we still need to ensure those features actually work.

Don't let vibe coding become vibe shipping.

With the official GrowthBook MCP Server, you can add feature flags, safe rollouts, A/B tests, and more to your app without ever leaving your code editor. VS Code, Cursor, Windsurf, Claude—whatever you use, now it can talk directly to GrowthBook.

When adding flags is this easy, there’s no excuse for risky deployments.

What’s an MCP Server?

MCP stands for Model Context Protocol, a standard that lets AI tools (like LLM-powered code editors) integrate with platforms like GrowthBook.

Most modern AI tools already support “tool calling,” the ability for an LLM to trigger specific developer-defined actions. MCP formalizes this into a plug-and-play protocol. So instead of writing custom glue code for every integration, AI tools and platforms can just connect over MCP and instantly work together.

GrowthBook’s MCP Server is the first complete example of this in the experimentation space. It’s open source, fully featured, and designed to streamline the way you work with feature flags and experiments.

For example, here we use the GrowthBook MCP to add a feature flag to our React app. The MCP Server creates the flag in GrowthBook, setting the flag type, default value, name, and more. It returns a direct link to the flag on GrowthBook, so you can quickly configure an experiment or safe rollout. Finally, the component code is updated to include the flag using the GrowthBook React SDK.
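To make that concrete, here's a sketch of the kind of gating code that ends up in the component. The flag key and the minimal `isOn` helper below are illustrative stand-ins; in a real app, the check comes from the GrowthBook React SDK.

```typescript
// Hypothetical sketch of the gating code the MCP server inserts.
// "new-checkout" and this tiny `isOn` helper are illustrative only.
type FeatureDefs = Record<string, { defaultValue: boolean }>;

function isOn(features: FeatureDefs, key: string): boolean {
  return features[key]?.defaultValue ?? false;
}

const features: FeatureDefs = {
  "new-checkout": { defaultValue: false }, // created by the MCP server
};

function renderCheckout(): string {
  // The old path stays in place until the flag is enabled in GrowthBook
  return isOn(features, "new-checkout") ? "new checkout" : "legacy checkout";
}
```

Because the old code path stays behind the flag, flipping it off in GrowthBook instantly reverts the feature without a deploy.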

What You Can Do

The GrowthBook MCP Server currently supports 14 tools. Here are some highlights:

  • Create a new feature flag and insert it into your code
  • Set up a force rule for a flag, e.g., to restrict a feature to beta testers
  • Detect stale safe rollouts and remove them from your codebase
  • Generate type definitions for flags
  • Create and insert a new SDK Connection
  • Search GrowthBook docs from your editor

See the full list of Tools in the docs.

Setup

Add the GrowthBook MCP Server to most AI tools with the following JSON:

{
  "mcpServers": {
    "growthbook": {
      "command": "npx",
      "args": ["-y", "@growthbook/mcp"],
      "env": {
        "GB_API_KEY": "YOUR_API_KEY",
        "GB_USER": "YOUR_NAME"
      }
    }
  }
}


For detailed installation instructions for specific AI tools, see our docs.

What’s Next

It’s early days for MCP. As support grows for features like Resources and Prompting, we’ll be updating the GrowthBook MCP Server to take full advantage.

GrowthBook’s MCP Server is already super useful, but there is always more we can do. Like MCP, GrowthBook is open source. Try it, improve it—we’d love to see what you build.

Get Started

The GrowthBook MCP Server is live and ready. Use it in Cursor, Claude, and VS Code, or anywhere that supports MCP.

Have ideas, bugs, or weird edge cases? Open an issue or find us on Bluesky or LinkedIn—we’d love to hear what you build.

GrowthBook Version 3.6

May 1, 2025

This release includes an exciting new feature flag rule, a long-awaited addition to experimental results, an integration sure to make PMs happy, and much more! Keep reading for details.

Safe Rollouts

Safe Rollout modal showing Ship Now and Revert Now states with guardrail metric monitoring

Introducing our newest type of feature flag rule—Safe Rollouts! 

Safe Rollouts let you gradually release a feature while monitoring guardrail metrics for regressions. It’s designed to be simple to use, so there’s no reason not to wrap it around every bit of code you release. In fact, we used Safe Rollouts to release Safe Rollouts in GrowthBook Cloud just this week. So meta!

Under the hood, Safe Rollouts uses sequential testing and one-sided confidence intervals to continuously monitor guardrail metrics and quickly detect harmful changes without inflating the false-positive rate.

We couldn’t wait to get this into your hands and hear your feedback, so this initial release is intentionally minimal. Don’t worry, though, we have a lot planned for the near future: automated ramp-ups (10% → 25% → 50% → 100%), time series view of results, deep dives, and more. Stay tuned!

Time Series

Time Series view in experiment results showing how a metric has changed throughout the lifetime of an experiment

We’ve added one of our most requested features to experiment results: a Time Series view for metrics! Now you can expand any metric and see how it has changed throughout the lifetime of the experiment. The best part? We were able to add this without any additional (and expensive) SQL queries against your data warehouse, so it all comes at no extra cost.

Official Jira Integration

Official Jira integration showing a linked GrowthBook feature with status visible directly in a Jira issue

We just launched our first official Jira integration! You can now install the GrowthBook app from the Atlassian Marketplace and easily link a feature or experiment to any Jira issue. See key details and up-to-date statuses directly in Jira without switching contexts.

Check out the Jira Integration docs to learn more and get started.

Decision Framework Events and API

Decision Framework showing a Ship Now recommendation with REST API and webhook support for custom workflows

In the previous release, we launched the Experiment Decision Framework to provide UI recommendations on when an experiment is ready to stop and which decision to make (ship or roll back). In this release, we extended this info to both the REST API and Webhooks. That means you can get an alert in Slack whenever an experiment is ready to call, or build your own custom workflows around these events.

Dev Tools for SSR

Dev Tools browser extension showing SSR support with feature flag overrides persisted via cookie to the backend

Our GrowthBook Dev Tools browser extension has always been a great way to debug and QA feature flags, but it was limited to client-side applications. We’re excited to finally bring the same developer experience to the back end, starting with Server-Side Rendered (SSR) JavaScript applications. All it requires is a few small changes to your back-end GrowthBook implementation.

So how does it work? When you override a feature flag, experiment, or attribute in DevTools, we persist the override in a cookie. This is sent to the back end and applied locally before the request is processed. Then, debug logs from the back-end SDK are injected into the rendered HTML and captured by DevTools.
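Conceptually, the back-end half of that flow looks something like the sketch below. The cookie name and payload format here are hypothetical, not the extension's actual wire format; the point is just parse-then-merge before flag evaluation.

```typescript
// Sketch of the override mechanism described above: DevTools stores
// overrides in a cookie, and the back end applies them before evaluating
// flags. Cookie name and JSON payload shape are assumptions.
function parseCookie(header: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const part of header.split(";")) {
    const idx = part.indexOf("=");
    if (idx > 0) out[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return out;
}

function applyOverrides(
  flags: Record<string, boolean>,
  cookieHeader: string,
  cookieName = "gb_overrides" // hypothetical name
): Record<string, boolean> {
  const raw = parseCookie(cookieHeader)[cookieName];
  if (!raw) return flags;
  const overrides = JSON.parse(decodeURIComponent(raw)) as Record<string, boolean>;
  return { ...flags, ...overrides }; // overrides win over real flag values
}
```

The merged flag map is then used for the rest of the request, so the rendered HTML reflects the overridden state.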

Fact Table JSON Columns

It’s common to have a single shared “Events” table in your data warehouse where everything from page views to purchases is logged together. While a JSON blob column is a convenient way to store meta info about each event, it also makes it harder to query and use in metric definitions.

Fact Tables in GrowthBook now have native support for JSON columns. When creating metrics, you can easily reference nested fields in row filters and even use them as metric values. No more writing custom SQL filters and trying to remember the syntax.

Fun fact: Every single database engine decided to come up with its own syntax for querying JSON data, so we had to reimplement this feature 10+ different times.

👉 Find the full release notes on GitHub.

Feature Flagging at Scale: 5 Power Tools You Shouldn’t Skip

Apr 14, 2025

You've got 99 feature flags, and guess what? That is the problem.

And it's only the beginning.

Now your team has hundreds of developers, designers, and PMs. Your infrastructure is distributed across microservices, edge networks, and experimental AI side quests. Your users? They're accessing your app from a shiny MacBook Pro, or a beat-up 5-year-old Android, or—somehow—the seat-back screen of a Boeing 777 to Frankfurt.

How do you keep it all from breaking? And when it does break—because, let's be real—it will... how do you figure out why?

In this post, we'll walk through 5 can't-miss tools to help you scale your feature flagging operation without losing your sanity. If you're serious about progressive delivery, safe rollouts, and not making your team overly sweaty on each release, this one's for you.

1. Prerequisites: Flags on Flags on Flags

Most modern feature flagging platforms (GrowthBook included) let you target features by audience segments. For example:

  • Internal QA testers on desktop in Canada
  • Pro users on mobile who've made 5+ orders in the last 3 months
  • Beta testers who DM'd your CEO on X

Cool, right?

But what about when one feature depends on another?

Let's say you're prepping your long-awaited 3.0.0 release. It includes:

  • New settings UI
  • Fresh checkout flow
  • Product carousels that aren't garbage
  • Dark mode (#1 requested feature)

You want to flip the whole release on with a single flag—but keep the ability to turn off any part if it breaks.

That's where prerequisite flags come in. You create a release-3-0-0 flag, and make each individual feature (settings menu, checkout, etc.) depend on that release flag. This way, no part of the release goes live unless release-3-0-0 is true. If something starts throwing errors? Toggle that one feature off without touching the rest.

How to do it in GrowthBook:

  1. Create the top-level flag (release-3-0-0).
  2. Set it to false in your production environment but true in dev.
  3. Create a dependent flag like new-settings-menu.
  4. Add release-3-0-0 as a prerequisite. Done.
Setting a prerequisite feature flag in GrowthBook to control a bundled release
release-3-0-0 flag set as a prerequisite for the settings-menu flag
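The prerequisite logic itself boils down to something like this simplified evaluator (a sketch, not GrowthBook's actual implementation): a flag is only on when every flag it depends on is also on.

```typescript
// Simplified sketch of prerequisite evaluation: a feature is true only
// when it is enabled AND all of its prerequisite flags evaluate to true.
interface Flag {
  enabled: boolean;
  prerequisites?: string[];
}

function evalFlag(flags: Record<string, Flag>, key: string): boolean {
  const flag = flags[key];
  if (!flag || !flag.enabled) return false;
  return (flag.prerequisites ?? []).every((p) => evalFlag(flags, p));
}

const flags: Record<string, Flag> = {
  "release-3-0-0": { enabled: false },
  "new-settings-menu": { enabled: true, prerequisites: ["release-3-0-0"] },
};
```

Flipping `release-3-0-0` on releases every dependent feature at once, while each dependent flag can still be toggled off individually.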

Still curious?

Check out the docs or watch an explainer video with Clippy 📎

2. Simulation & Archetypes: Know What Your Flags Will Do Before They Do It

Here's a real flag rule we've seen (truth be told, it's actually simplified here):

  • Kill switch is off
  • Account age > 30 days
  • Region isn't restricted
  • User isn't in an override list
  • Country is GB
  • A/B test returns true

Yikes. This is already complex—and you know Joey from Marketing has some other targeting they're itching to add.

How do you understand at a glance how this flag will evaluate without feeling like Charlie:

Meme showing the complexity of debugging feature flag rules without simulation tools

In GrowthBook, Simulation lets you plug in attributes—like country, browser, etc.—and instantly see the flag's result. You'll know before you ship whether a rule will work or explode.

See how features evaluate with simulation

But adding in those attributes time and again becomes tedious fast. Archetypes are a solution to that pain, letting you save common user types (like "internal tester" or "new mobile pro user") and then easily see how any flag is evaluated for them.

GrowthBook Archetypes overview showing flag evaluations across multiple saved user types

💡

Head to SDK Configuration → Archetypes → Simulate to get a bird’s-eye view of all flag evaluations for any given user type.
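Under the hood, simulation amounts to evaluating targeting rules against a supplied attribute set. Here's a simplified sketch; the rules and the `internalTester` archetype below are made up for illustration, not GrowthBook's rule engine.

```typescript
// Illustrative sketch of simulation: run targeting rules against a set of
// user attributes and report the resulting flag value.
type Attributes = Record<string, string | number | boolean>;

interface Rule {
  when: (attrs: Attributes) => boolean;
  value: boolean;
}

function simulate(rules: Rule[], defaultValue: boolean, attrs: Attributes): boolean {
  for (const rule of rules) if (rule.when(attrs)) return rule.value;
  return defaultValue;
}

// An archetype is just a saved, reusable attribute set
const internalTester: Attributes = { country: "GB", accountAgeDays: 45, internal: true };

const rules: Rule[] = [
  { when: (a) => a.internal === true, value: true },
  { when: (a) => a.country === "GB" && (a.accountAgeDays as number) > 30, value: true },
];
```

Saving attribute sets as archetypes means you can re-run this evaluation across every flag with one click instead of retyping attributes each time.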

3. Dev Tools: See What Your Flags Are Up To in the Wild

You've tested internally, you've simulated, and your flag has gone live. But wait—your teammate says they can't see it. There's the ping from QA. And support says users can't find the promised dark mode (and if they can't dark mode, what's the point?).

Time to start the arduous debugging process...

Meme about discovering a major bug and being glad you have GrowthBook Dev Tools

GrowthBook Dev Tools, a browser extension for Chrome and Firefox, is here to help 💁

With Dev Tools, you can:

  • Inspect live flag values in your app
  • See experiment variations and why they were assigned
  • View current user attributes
  • Override values to test different states
  • Lots more!

Dev Tools works with most frontend SDKs (just make sure enableDev: true is set in your config). Support for some backend SDKs coming soon.

See the full walkthrough 📼 to learn more:

4. Staleness Detection: Clean Up Your Crusty Old Flags

You launched a feature 2 months ago. The flag's still in production, but no one knows why. And now it's silently directing 100% of traffic to Variation A ... and nothing else 😐

That's a stale flag, and it's building up tech debt that's sure to cause confusion, bugs, and more work down the line.

GrowthBook detects staleness automatically. If a flag hasn’t been touched in a while and is sending all users to the same variant, you’ll see a clock icon in the Flags view. (Hover over the icon to learn why it’s stale.)

GrowthBook feature flags list with stale flag detection column highlighted

By alerting you to stale flags, GrowthBook helps you keep your features current and your codebase clean.
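A staleness check like the one described boils down to two signals: how long since the flag changed, and whether everyone is getting the same value. A rough sketch (the 60-day threshold and field names are assumptions, not GrowthBook internals):

```typescript
// Sketch of a staleness heuristic: a flag is stale when it hasn't been
// modified recently AND it has served only one distinct value.
interface FlagStats {
  lastModified: Date;
  servedValues: Set<string>; // distinct values served recently
}

function isStale(stats: FlagStats, now: Date, maxAgeDays = 60): boolean {
  const ageDays = (now.getTime() - stats.lastModified.getTime()) / 86_400_000;
  return ageDays > maxAgeDays && stats.servedValues.size <= 1;
}
```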

🧠 More on how staleness detection works

5. Code Refs: Find Your Flags in the Actual Codebase

Your PM wants to delete the new-headline-cta flag. Is it still used in the frontend? Backend? Nowhere? 🤷

Instead of grepping through every repo, use Code References in GrowthBook.

GrowthBook Code References showing where a feature flag is used across repos with file names and line numbers

It shows you:

  • Repos and file names
  • Line numbers and code snippets
  • Links to the exact spot in your code

To enable it:

  1. Add the GitHub Action (GitLab and other platforms supported, too!)
  2. Go to Settings → Feature Settings → Code References and enable.
  3. Make cleaning up flags suck a little bit less.
GrowthBook settings page for enabling Code References via GitHub Action

Feature flags are pretty much a necessity for modern software development—but you gotta keep them under control. As your app (and your team) grows, so does the complexity. The tools above will help you stay ahead of it all.

👉 Want to try them out for yourself? Start for free or say hi on Slack.

Why Fintechs Should Shrink Their Attack Surface—Not Just Get Certified

Mar 28, 2025

TL;DR

  • Security certifications aren’t enough to stop breaches.
  • The most effective way to secure customer data is to reduce your attack surface.
  • Self-hosting tools can significantly reduce your risk of a breach.

When I got an email from my bank saying “nothing to worry about,” my gut told me otherwise.

The message referenced a data breach at Evolve Bank & Trust—one of the infrastructure providers for fintech platforms such as Mercury, Affirm, and Wise. The attackers reportedly accessed 33 terabytes of data—a staggering amount, possibly encompassing most of Evolve’s Azure Cloud storage.

What’s unsettling is that Evolve wasn’t negligent by traditional standards. They held all the right security certifications: SOC 2 Type II, HIPAA, HITRUST CSF, PCI DSS. And yet, their defenses were breached.

We don’t yet know exactly how—but this much is clear:
Compliance alone doesn’t keep customer data safe.

What Keeps Data Safe? A Smaller Attack Surface.

Security professionals often talk about “attack surface”—the number of ways a system can be accessed or exploited. The more entry points, the greater the risk.

In fintech, where trust and regulation are paramount, minimizing your attack surface is non-negotiable.

In 2022, the financial sector suffered 566 data breaches, exposing over 254 million records.

SaaS tools that run in the public cloud often expand your attack surface—regardless of their certifications. This is especially dangerous in highly regulated industries like banking, healthcare, and insurance.

Why Self-Hosting Is the Best Way to Reduce Risk

The safest data is the data that’s never exposed to the internet. When you self-host, you keep tools and infrastructure inside your private network or behind your firewall, significantly reducing risk.

Self-hosting doesn’t have to slow you down. Most modern platforms, including GrowthBook, offer full-featured self-hostable versions of their services. You get the innovation you need without opening new doors for attackers.

GrowthBook provides:

  • Self-hosted feature flagging with complete control over deployment
  • Secure A/B testing powered by your own data warehouse
  • Open-source transparency with auditable code and customizable infrastructure

The Bottom Line

If you work in fintech, healthtech, or any industry handling sensitive data, it’s time to move beyond compliance checkboxes.

Self-hosting your experimentation stack is one of the most effective ways to keep your customers safe while still shipping fast.

Learn more about how GrowthBook supports self-hosting for enterprise-grade security.

How to A/B Test AI: A Practical Guide

Mar 17, 2025

Large Language Models (LLMs) are evolving at breakneck speed. Every month, new models claim to be faster, cheaper, and smarter than the last. But do those claims actually hold up in real-world applications?

Why Traditional Benchmarks Fall Short

Benchmark scores might give you a starting point, but they don’t tell you how well a model performs for your specific use case. A model that ranks highly on a standardized test might generate irrelevant responses, introduce latency, or incur significantly higher production costs.

As Goodhart’s Law states: "When a measure becomes a target, it ceases to be a good measure." LLM providers are incentivized to optimize for benchmark scores—even if that means fine-tuning models in ways that improve test results but degrade real-world performance. A model might ace summarization tasks on a leaderboard but struggle with accuracy, latency, or user engagement in actual applications.

Implementing A/B Testing for AI Models

A/B testing offers a structured approach to compare two or more versions of an AI model in a live environment. By deploying different models or configurations to subsets of users, organizations can measure performance against key business and user metrics, moving beyond theoretical benchmarks.

Parallel Model Deployment
Deploying multiple AI models in parallel has become increasingly feasible. This approach allows for real-world testing by assigning users to different model versions and tracking metrics such as accuracy, latency, and cost. Tools like GrowthBook facilitate this process, enabling quick deactivation of underperforming models without disruptions. Additionally, platforms like LangChain, Ollama, and Dagger streamline model deployment, making experimentation more seamless.

Optimizing AI Prompts
When switching models isn't practical, optimizing performance through prompt variations is a viable alternative. For example, testing different prompt structures—such as "Summarize this article in three bullet points" versus "Provide a one-sentence summary followed by three key insights"—can reveal optimal configurations. This method is particularly useful when model switching is either too costly or unnecessary.

Best Practices for A/B Testing AI Models

To conduct effective A/B tests for LLMs, follow these best practices:

  • Randomized User Allocation: Divide users into distinct groups, each experiencing only one variant. Tools like GrowthBook enable persistent traffic allocation, ensuring consistency.
  • Single Variable Isolation: Modify only one variable per experiment—be it model version, prompt wording, or temperature setting—to clearly attribute outcome differences.
  • Incremental Rollouts: Start experiments with limited traffic (e.g., 5%) and gradually scale up as results confirm improvements. Feature flagging tools like GrowthBook enable the immediate deactivation of models or prompts that introduce errors or negatively affect metrics.

Choosing the Right Metrics

The success of an A/B test depends on tracking meaningful metrics.

  • Latency & Throughput (time to first token, completion time): users abandon slow services
  • User Engagement (conversation length, session duration): indicates valuable user experiences
  • Response Quality (human ratings like "Helpful?", regenerate requests): directly reflects user satisfaction
  • Cost Efficiency (tokens per request, GPU usage): balances performance with budget

Overlaying business Key Performance Indicators (KPIs), such as retention or revenue, with model-specific guardrails like response latency ensures that improvements in one area do not negatively impact another.

Design a Sound Experiment

To ensure valid conclusions, follow these best practices:

  • Hypothesis Definition: Clearly state measurable hypotheses (e.g., "Adding an example to prompts will increase accuracy by 5%").
  • Sample Size Estimation: Because LLM outputs are stochastic, use power analysis to determine the required sample size.
  • Consistent Randomization: Assign users consistently; avoid mid-experiment switching.
  • Logging & Data Collection: Capture detailed user interactions and direct/indirect signals of response quality.
  • Statistical Analysis:
    • Continuous metrics: Use t-tests or non-parametric tests.
    • Categorical outcomes: chi-square or two-proportion z-tests.
    • Interpret significance carefully—statistical significance alone isn't enough; consider practical relevance and cost-benefit trade-offs.
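For the sample-size step, the standard two-proportion formula gives a quick per-group estimate. A sketch (the 10% → 12% numbers below are illustrative, not from the post):

```typescript
// Per-group sample size for a two-proportion test:
// n = (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
function sampleSizePerGroup(
  p1: number,
  p2: number,
  zAlpha = 1.96,   // two-sided alpha = 0.05
  zBeta = 0.8416   // power = 0.8
): number {
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta));
}

// e.g., detecting a lift from a 10% to a 12% success rate
const n = sampleSizePerGroup(0.10, 0.12);
```

Smaller expected lifts need dramatically more samples, which is why starting from a measurable hypothesis matters.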

Real-world Case Studies

Case Study 1: Optimizing Chatbot Engagement
A team tested a reward-model-driven chatbot against their baseline. The reward-model variant resulted in a 70% increase in conversation length and a 30% boost in retention, validating theoretical gains through real-world experimentation.

Case Study 2: AI-Generated Email Subject Lines
Nextdoor compared AI-generated subject lines against their existing rule-based approach. Initial tests showed minimal benefit; after refining their reward function based on user feedback, subsequent tests delivered a meaningful +1% lift in click-through rates and a 0.4% increase in weekly active users.

Key Takeaways: How to Implement A/B Testing for AI

  • Think beyond benchmarks. Real-world user impact matters more.
  • Test models, prompts, and configurations—small tweaks can drive big changes.
  • Use feature flags to enable safe, controlled rollouts.
  • Measure both performance and cost—faster isn’t always better if it’s too expensive.
  • Continuously iterate. AI models change fast—so should your testing strategy.

Conclusion

With the rapid evolution of AI, blindly trusting benchmark scores can lead to costly mistakes. A/B testing provides a structured, data-driven way to evaluate models, optimize prompts, and improve business outcomes. By following a rigorous experimentation process—with well-defined hypotheses, meaningful metrics, and iterative improvements—you can make informed decisions that balance accuracy, efficiency, and cost.

🚀 Ready to implement AI experimentation at scale? Start A/B testing today.

GrowthBook Version 3.5

Mar 3, 2025

With this release, we focused on user experience and productivity improvements. There’s a completely revamped Dev Tools browser extension, a Power Calculator, an experiment Decision Framework, and tons of UI and SDK improvements. Read on for more details.

Dev Tools Browser Extension

Dev Tools Browser Extension: Experience a complete design overhaul with new features, including Firefox support and dark mode.

Our Chrome Dev Tools extension has been an essential tool for debugging feature flags and experiments since 2022, but we knew it could do even more. That’s why we completely rebuilt it from the ground up, making it faster, more powerful, and a lot more pleasant to use. There are too many improvements to list them all, but here are some highlights:

  • Firefox support!
  • Complete design overhaul (including dark mode!)
  • Event logs and SDK health checks
  • Sync Archetypes from GrowthBook to quickly simulate different users
  • New popup entrypoint (just click the GrowthBook extension icon)

Download for Chrome or Download for Firefox to get started, or check out the docs to learn more. 

Power Calculator

Power Calculator: Accurately estimate experiment run times and minimum detectable effects using your historical data.

Many existing A/B test calculators estimate required sample sizes for statistical power but lack access to your historical data and specific metric definitions, limiting their accuracy. Recognizing this gap, we've developed our new built-in Power Calculator to provide more precise and tailored estimates.

This tool operates in two straightforward steps:

  1. Define Your Audience: Specify which users will be exposed to your experiment by selecting a past similar experiment, choosing a segment, or building an audience from a fact table.
  2. Set Your Goal Metrics: Identify the metrics that matter most to your experiment's success.

Once these steps are completed, the Power Calculator performs quick queries against your data warehouse, providing you with the estimated run time and minimum detectable effect (MDE) based on your historical data.

You can access the Power Calculator from the top of the Experiments page. Leveraging historical data requires a Pro or Enterprise license; however, a manual version is available for free accounts. For detailed instructions and a deeper statistical understanding, please refer to our documentation.

Experiment Decision Framework

Experiment Decision Framework showing Ship Now, Roll Back, and Ready for Review statuses on running experiments

It can be hard to know when to stop an experiment and make a decision. To help with this, we’re launching the first version of our Experiment Decision Framework. When enabled, running experiments will now show some additional info:

  • Unhealthy - if there are data quality issues or the test is too low-powered
  • No data - if the test has been running for 24 hours and there’s no data yet
  • ~X days left - how long until the test reaches the desired statistical power
  • Ship now - if all goal metrics are positive and statistically significant
  • Roll back now - if all goal metrics are negative and statistically significant
  • Ready for review - if the test has run for long enough, but there is no clear winner

For now, you must manually enable the Decision Framework under General Settings to get these new statuses. This feature is available to all Pro and Enterprise customers. We are continuing to improve this feature and are looking for feedback, so let us know your thoughts!

Read the docs for more info.

SDK Updates

SDK Updates: Our SDKs now support pre-requisite features and sticky bucketing across multiple platforms.

Some of our SDKs have fallen a little behind, so we’ve been working hard to bring them all up to spec.

Pre-requisite Features and Sticky Bucketing are now supported in the latest versions of our SDKs.

OpenFeature providers are now available for Web (JavaScript, React), Node.js, Java, and Python.


We’ve also released dozens of bug fixes, performance improvements, enhanced thread safety, and more across many of our SDKs. Check out the individual release notes for more info.

We’re continuing to invest more time in all our SDKs to ensure every language and framework offers a top-notch developer experience. If you find something that can be improved, please let us know!

Design system improvements showing dark mode updates and UI polish across feature and experiment pages

Last but not least, our design system migration is moving along quickly. You should notice big improvements to dark mode, feature and experiment pages, and more consistency and UI polish in general across the entire app. Let us know if you have any feedback!

To explore all the changes in this release, please visit our release notes.

The Hidden Complexities of Building Your Own A/B Testing Platform

Feb 18, 2025

Years ago, at Education.com, we decided to build our own A/B testing platform. We had plenty of traffic, a data warehouse already tracking events, and enough talented engineers to try something “simple.” After all, how hard could it be? But as with most engineering projects, what seemed straightforward quickly morphed into a complex, high-stakes system, one where a single bug can invalidate critical business decisions.

In this post, we’ll break down the hidden complexities, costs, and risks of building your own A/B testing platform that you may not anticipate when starting out.

Experiment Description

On the technical side of running an experiment, you need a way to tell your systems how to assign users to a particular variant or treatment group. Most teams begin with the basics, like deciding how many variations to run (A/B, A/B/C, or more) and how to split traffic (e.g., 50/50 vs. 90/10). Initially, it might look like you only need to encode a handful of properties, but as usage of your platform grows, you’ll discover you need more parameters:

  • Number of Variations: The scope can expand from simple A/B to multi-variate tests with multiple variations.
  • Split Percentages: Instead of fixed splits, you may require partial rollouts (10% one day, 50% the next) or dynamically adjusted traffic (Bandits).
  • Randomization Seeds: Deterministic assignment so every data query lines up with the same user-variant grouping.
  • Remote Configurations: Ways to pass different values to the experiment from your experimentation platform.
  • Targeting Rules: Fine-grained controls, such as showing a test only to premium subscribers in California, can quickly add complexity.

Hard-coding these elements can be tempting, but it rarely scales. As your needs evolve, a rigid approach may lock you into time-intensive updates—especially when product managers want new ways to target or measure experiments.
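One way to stay flexible is to treat the experiment definition as data rather than hard-coded logic. A sketch of what that schema might look like, covering the parameters listed above (field names are illustrative):

```typescript
// Hypothetical experiment-definition schema covering the parameters above.
interface ExperimentDef {
  key: string;
  variations: string[];             // number of variations (A/B, A/B/C, ...)
  weights: number[];                // split percentages, must sum to 1
  seed: string;                     // randomization seed for deterministic hashing
  config?: Record<string, unknown>; // remote configuration values
  targeting?: { attribute: string; op: "eq" | "gt" | "in"; value: unknown }[];
}

function validWeights(def: ExperimentDef): boolean {
  const sum = def.weights.reduce((a, b) => a + b, 0);
  return def.weights.length === def.variations.length && Math.abs(sum - 1) < 1e-9;
}
```

Because the definition is data, product managers can change splits or targeting without a code deploy, which is exactly where hard-coding breaks down.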

User Assignment

Ensuring that each user sees the same variant across multiple visits or devices may sound easy. But in practice, deterministic assignment can trip you up if you don’t handle user IDs, session IDs, and hashing logic carefully.

  • Stable Identifiers: Teachers logging in from shared computers, students on tablets, and parents switching between mobile and desktop all make it hard to pick one stable ID.
  • Hashing & Randomization: You want an algorithm that’s fast and produces an even distribution.
  • Server vs. Client-Side: Server-side experimentation is great for removing some of the problems with flickering, but may lack certain user attributes at assignment time. Client-side is more flexible but can cause quick visual shifts as JavaScript loads.
  • Timing Issues: Caching layers or missing user identifiers can lead to partial or double exposures, invalidating experiment data.
  • Cookie Consent: Determining when you are allowed to set an assignment cookie, and what counts as an essential cookie versus a tracking one.

Mistakes here—such as users seeing multiple variations—can invalidate your results and lead to user frustration.
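A common way to get deterministic assignment is to hash the user ID together with a per-experiment seed and map the result onto the split weights. This is a simplified sketch of that idea (GrowthBook's SDKs use their own hashing scheme; the function below is purely illustrative):

```python
import hashlib


def assign_variation(user_id: str, seed: str, weights: list) -> int:
    """Deterministically map a user to a variation index.

    Hashing user_id with a per-experiment seed gives every experiment an
    independent, repeatable split: the same user always lands in the same
    bucket, and different experiments don't correlate.
    """
    digest = hashlib.sha256(f"{seed}:{user_id}".encode()).hexdigest()
    # Interpret the first 8 hex chars as a uniform number in [0, 1]
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if bucket < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point rounding
```

Because the function is pure, the analysis pipeline can re-derive every user's bucket from logs instead of trusting a separate lookup table to stay in sync.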

Targeting Rules

Precise targeting often starts simply—“Show the new treatment to first-time users”—but quickly grows. Soon, you’re juggling rules like “Display Variation A only to mobile users in the U.S., except iOS < 14.0, and exclude anyone already in a payment test.”

To avoid chaos, focus on these key areas:

  • Defining Attributes: Collect and securely store user data (location, subscription status, device type).
  • Overlaps & Exclusions: Prevent one user from landing in conflicting experiments.
  • Evolving Segmentation: Plan for marketing and product teams to constantly discover new slices of your user base.
  • Sticky Bucketing: Once a user sees a variant, they should keep seeing it even if other settings change, so their data isn’t invalidated. Deciding when to keep a user bucketed and when to reassign them quickly gets tricky.

Without a thoughtful targeting system, the tangle of conditions becomes unmanageable, undermining both performance and trust in your experimentation platform.
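At its core, targeting is predicate evaluation over user attributes. Here is a toy version; the dict-based rule format is invented for illustration and is not GrowthBook's actual condition syntax:

```python
def matches_targeting(attributes: dict, condition: dict) -> bool:
    """Return True if the user's attributes satisfy every clause.

    Each clause maps an attribute name to the set of allowed values;
    a user must match all clauses (logical AND) to qualify.
    """
    for attr, allowed in condition.items():
        if attributes.get(attr) not in allowed:
            return False
    return True


# "Show Variation A only to premium users in the US or Canada"
rule = {"country": ["US", "CA"], "plan": ["premium"]}
user = {"country": "US", "plan": "premium", "device": "mobile"}
```

Real systems need far richer operators (version comparisons, regexes, negation, nested OR), which is exactly how the complexity creeps in.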

Data Collection

Every time a user is exposed to a test variant, you must log it accurately. If you already have a reliable event tracker or data warehouse, you’re in good shape, but that alone doesn’t eliminate every problem:

  • Data Volume: Logging millions of events for high-traffic applications can overwhelm poorly designed systems.
  • Pipeline Reliability: Data loss or delays can lead to inaccurate analyses.
  • Separation of Concerns: The last thing you want is for your site’s main functionality to slow down because your experiment-logging system chokes under load.
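One standard way to keep exposure logging off the critical path is a bounded buffer drained by a background worker: under load, events are dropped and counted rather than blocking requests. A hedged sketch of that pattern (the class and its fields are hypothetical, not a real GrowthBook API):

```python
import queue
import threading


class ExposureLogger:
    """Illustrative non-blocking exposure logger.

    Events are buffered in a bounded queue and flushed by a daemon
    thread, so a slow logging pipeline never blocks the request path.
    When the buffer is full, events are dropped (and counted) instead
    of back-pressuring the application.
    """

    def __init__(self, sink, max_buffer: int = 10_000):
        self.sink = sink  # callable that ships one event downstream
        self.buffer = queue.Queue(maxsize=max_buffer)
        self.dropped = 0
        threading.Thread(target=self._drain, daemon=True).start()

    def log_exposure(self, user_id: str, experiment: str, variation: int):
        try:
            self.buffer.put_nowait(
                {"user_id": user_id, "experiment": experiment, "variation": variation}
            )
        except queue.Full:
            self.dropped += 1  # never block or raise in the hot path

    def _drain(self):
        while True:
            event = self.buffer.get()
            try:
                self.sink(event)
            except Exception:
                pass  # a logging failure must not crash the worker
```

Tracking `dropped` matters: silently losing exposures is one of the data-loss failure modes that invalidates analyses.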

Performance Issues

Your experimentation system must never degrade user experience. That means your platform must be built with the following in mind:

  • Latency: Assignment and targeting logic should run in milliseconds to avoid flickering.
  • Fault Tolerance: If the platform goes down, the product should revert to a default or safe state, not crash outright.
  • Decoupling: Keep experiment code out of critical paths to prevent a single failure from taking out your entire product.

Metrics Definition

Metrics are the backbone of A/B testing. However, there are aspects of metrics used for experimentation that are not obvious at first.

  • Customization: Each metric might require customization from the default. Does it need different conversion windows, minimum sample sizes, or specialized success criteria (e.g., “logged in at least 5 times within 7 days”)?
  • Metadata Management: Who owns each metric? How is it defined? Are you duplicating metrics under different names?
  • Flexibility: Hard-coding a handful of metrics quickly becomes a bottleneck when new use cases emerge.

We discovered product managers, data scientists, and marketing teams each had unique definitions of “success.” We needed a system to capture these definitions and keep them consistent across the organization.
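One way to keep definitions consistent is a central metric registry that records ownership and parameters alongside the definition itself. A hypothetical entry might look like this (the field names are invented for illustration, not GrowthBook's metric schema):

```python
# Hypothetical metric registry entry: one shared definition of "success"
# that product, data science, and marketing all reference by name.
login_frequency = {
    "name": "weekly_active_logins",
    "owner": "data-team",
    "description": "Users who logged in at least 5 times within 7 days",
    "type": "binomial",                  # converted / not converted
    "conversion_window_days": 7,         # count events up to 7 days post-exposure
    "min_sample_size": 1000,             # don't report below this threshold
    "success_criteria": {"event": "login", "min_count": 5},
}
```

With metrics stored as data, adding a new definition is a registry entry rather than a code change, and duplicate metrics under different names become easy to audit.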

Experiment Results

Analyzing results might be the most critical step:

  • Conversion Windows: Ensure that only events occurring after a user’s exposure, and while the experiment was running, are counted.
  • Data Joins: Merging experiment exposure logs with event data often requires complex queries that can tax your data warehouse.
  • Periodic Updates: Experiment results change over time, so you’ll want to have a way to update results periodically.

Any mismatch between exposure events and downstream metrics can lead to spurious conclusions—sometimes reversing what you thought was a clear “win.”
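The conversion-window logic reduces to filtering each user's events against their own exposure timestamp. A simplified in-memory sketch (real pipelines do this with SQL joins at warehouse scale):

```python
from datetime import datetime, timedelta


def events_in_window(exposure_time, events, experiment_end, window_days=7):
    """Keep only events after the user's exposure, inside the conversion
    window, and before the experiment ended (illustrative sketch).

    `events` is a list of dicts with a datetime under "timestamp".
    """
    window_end = min(exposure_time + timedelta(days=window_days), experiment_end)
    return [e for e in events if exposure_time <= e["timestamp"] < window_end]
```

Note the window is clipped to `experiment_end`: late-enrolling users get a shorter window, which is itself a source of bias that periodic result updates help smooth out.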

Data Quality Checks

There are innumerable ways your data can be messed up, leading to unreliable results and a lack of trust in your platform. Here are some of the most common ones:

  • SRM (Sample Ratio Mismatch): A study from Microsoft found that ~10% of experiments failed due to assignment errors. Regularly test that actual split percentages match your intentions.
  • Double Exposure: If a user is unintentionally counted in two variations, their data should be excluded. If the percentage of users getting multiple exposures is high, you probably have a bug in your implementation.
  • Outlier Handling: A handful of power users can skew averages. Techniques like winsorization help maintain balanced metrics.

At Education.com, we regularly scanned for sample ratio mismatches—often uncovering assignment bugs we didn’t even know existed.
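An SRM check is a chi-square goodness-of-fit test on the observed variant counts. Here is a self-contained two-variant sketch using only the standard library (production systems typically use a stats package and support any number of variants):

```python
import math


def srm_check(counts, weights, threshold=0.001):
    """Sample ratio mismatch check for a two-variant experiment.

    Computes a chi-square goodness-of-fit statistic (1 degree of freedom)
    and its p-value via the exact identity P(chi2_1 > x) = erfc(sqrt(x/2)).
    A p-value below the threshold flags a likely assignment bug.
    """
    assert len(counts) == len(weights) == 2
    total = sum(counts)
    chi2 = sum(
        (observed - total * w) ** 2 / (total * w)
        for observed, w in zip(counts, weights)
    )
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value, p_value < threshold
```

For example, a 5200/4800 split where you intended 50/50 yields a p-value around 10^-5: a split that lopsided almost never happens by chance at that sample size, so it points to a bug rather than noise.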

Statistical Analysis

Interpreting results is where the real magic, and the real risk, happens. When you're making a ship/no-ship decision, you need to be sure your analysis is as accurate and trustworthy as possible. A wrong decision caused by an issue with your statistics can be very costly, but finding a systemic issue with your statistics after months or years can be catastrophic.

Relying on a single t-test can be misleading in real-world testing, especially if you’re “peeking” mid-experiment. Frequentist, Bayesian, and sequential methods each have trade-offs: how you handle multiple comparisons, ratio metrics, or quantile analyses can drastically impact conclusions. Underestimating these nuances may lead to false positives, costly reversals, or overlooked wins. If you don’t have deep in-house expertise, consider leveraging open-source statistical packages, like GrowthBook’s, to maintain rigor and reduce the chance of bad data driving bad decisions.
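For concreteness, here is the "single test" baseline discussed above: a classic two-proportion z-test. It assumes a fixed, pre-committed sample size, which is exactly why repeatedly re-running it mid-experiment (peeking) inflates the false-positive rate:

```python
import math


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided two-proportion z-test on conversion rates.

    Returns (z, p_value). Valid only for a single look at a fixed
    sample size; sequential methods exist precisely because peeking
    at this p-value repeatedly breaks its guarantees.
    """
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

A team that checks this p-value every day and ships the first time it dips below 0.05 will see far more than 5% false positives, which is the peeking problem in a nutshell.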

User Interface

Even the most advanced experimentation platform falls flat if only a handful of engineers can operate it. A well-designed UI empowers:

  • Non-Technical Teams: A user-friendly dashboard lets product, marketing, and data teams set up and monitor experiments without engineering support.
  • Collaboration & Documentation: Capture hypotheses, share outcomes, and maintain a history of past tests—so insights don’t disappear when someone leaves.
  • Real-Time Visibility: Spot anomalies (like misallocated traffic splits) early and fix them before they skew results.

Neglecting the UI may save development time initially, but it can stifle adoption and limit the overall effectiveness of your experimentation program.

Conclusion: The Realities of Building In-House

Building your own A/B testing platform is much more than a quick project—it’s effectively a second product that ties together data pipelines, statistical models, front-end performance, and organizational workflows. Even small errors can invalidate entire experiments and erode trust in data-driven decisions. Ongoing maintenance, ever-changing requirements, and new feature requests often overwhelm the initial appeal of a DIY approach.

Unless it’s truly core to your value proposition, consider a proven, open-source solution (like GrowthBook). You’ll gain robust targeting, advanced metrics, and deterministic assignment—without shouldering the full cost and complexity. This way, your team can focus on what really matters: shipping features that users love.

Experiment Metrics Simplified: Retention, Count Distinct, Max
Analytics
Product Updates
4.0
Experiments

Jan 18, 2025

New Metrics to Answer Key Experiment Questions

Data teams face a common challenge: extracting actionable insights from experiments without adding complexity. GrowthBook’s latest metrics — Retention, Count Distinct, and Max — are designed to simplify this process, helping you measure long-term impact, unique user interactions, and peak performance without needing to dive into SQL. Here’s what you need to know.

Retention Metrics: Simplify Long-Term Impact Analysis

Retention metrics measure how many users return or engage with your product within a defined time frame after being exposed to an experiment. Traditionally, this involves juggling SQL queries and timestamp logic, but GrowthBook removes the friction with an easy-to-use interface.

How It Works:

  1. Select your core metric (e.g., user logins).
  2. Specify the time window you want to measure (e.g., 7-14 days post-exposure).

GrowthBook automatically calculates user retention across your experiment variants, helping you understand long-term engagement and whether changes resonate with users.

Example Use Case: Track Week 2 retention rates to determine if a new feature encourages users to return between 7-14 days after release.
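The windowed logic GrowthBook automates here can be sketched as a simple filter; the `start_day`/`end_day` defaults of 7 and 14 mirror the Week 2 example above (this is an illustration of the concept, not GrowthBook's implementation):

```python
from datetime import datetime, timedelta


def retained(exposure_time, return_events, start_day=7, end_day=14):
    """Week 2 retention: did the user come back between 7 and 14 days
    after first being exposed to the experiment?

    `return_events` is a list of datetimes for the core metric events
    (e.g., user logins).
    """
    window_start = exposure_time + timedelta(days=start_day)
    window_end = exposure_time + timedelta(days=end_day)
    return any(window_start <= t < window_end for t in return_events)
```

Note the window is anchored to each user's own exposure time, not a calendar date, so users exposed on different days are compared fairly.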

Count Distinct Metrics: Track Unique Interactions Without SQL

Count Distinct Metrics let you measure the number of unique entities (like users, products, or transactions) influenced by an experiment. Tracking unique entities is crucial because it uncovers patterns of diverse engagement and highlights how specific features meaningfully drive user actions. This eliminates manual deduplication and gives you precise data.

Use Cases:

  • Unique Videos Watched: Measure the number of distinct videos viewed by each user.
  • Unique Products Purchased: Count how many different products a user buys during an experiment.
  • Distinct Checkout Sessions: Track diverse payment methods or transaction types.

Why It Matters:
Product teams often focus on driving meaningful interactions, not just volume. Count Distinct provides a deeper understanding of engagement diversity, allowing teams to build features that encourage richer user experiences.

Max Metrics: Identify Peak Performance Effortlessly

Max metrics capture the highest value achieved by users during an experiment, providing insights into peak-performance behaviors.

Use Cases:

  • High Score: Track the top game scores users achieve, regardless of attempts.
  • Peak Spending: Identify the highest transaction value for each user.
  • Fastest Time: Measure the best completion times for key workflows.

Why It Matters:

Peak metrics highlight the outliers and top-performing scenarios that often drive key business outcomes. They’re invaluable for understanding the upper limits of user behavior.

Designed for Every Team

Retention Metrics are available to Pro and Enterprise customers and are perfect for analyzing long-term engagement trends.

Count Distinct and Max Metrics are available across all GrowthBook organizations, making them accessible whether you’re self-hosted or on a free plan.

Why Use These Metrics?

By integrating Retention, Count Distinct, and Max metrics into your workflows, you can:

  • Measure long-term user engagement without manual SQL.
  • Understand unique and diverse user interactions.
  • Pinpoint peak performance to identify standout successes.

Ready to Level Up Your Experiments?

Explore these new metrics in GrowthBook today and equip your team to make faster, data-driven decisions. It’s about simplifying the complex while getting results that matter.

Customize Your Experimentation Workflow with Custom Fields and Shareable Experiments
Experiments
3.4
Product Updates

Jan 16, 2025

Running a successful experimentation program isn’t just about analyzing results—it’s about seamlessly integrating experimentation into your team’s workflow, ensuring consistency, and making insights easy to share. GrowthBook’s latest features, Custom Fields and Shareable Experiments, are designed to help teams streamline their processes and scale experimentation more effectively.

Build Your Ideal Experiment Setup with Pre-Launch Checklists and Custom Fields

Enterprise teams can now add structured metadata to experiments and feature flags, enabling clear ownership, compliance tracking, and alignment with engineering workflows. By incorporating custom fields into your pre-launch checklist, you can ensure experiments are properly structured and ready for success from the start.

With Custom Fields, teams can:

  • Link experiments directly to project management tools (e.g., Jira, Linear)
  • Assign ownership and approvals for streamlined accountability
  • Track technical dependencies and platform constraints
  • Standardize resource impact assessments
  • Manage regulatory, privacy, and regional requirements
  • Connect experiments to OKRs and broader business goals

Custom Fields are configurable at the organization level and can be marked as required or optional. This ensures that teams capture all necessary context while avoiding unnecessary complexity.

Learn more about Custom Fields and Pre-Launch Checklists to build the perfect experimentation workflow.
Custom Fields are available exclusively for Enterprise customers.

Share Experiments, Your Way

Experiment results need to be accessible and actionable. That’s why GrowthBook now offers two flexible ways to share experiments—internally with your team or externally with public stakeholders.

GrowthBook experiment sharing menu showing options for live experiment sharing and custom reports
  1. Live Experiment Sharing
    Share a real-time, always-updating view of experiment results with stakeholders who need the latest data. Share live updates during sprint reviews to keep teams aligned and make decisions more quickly. No manual updates are required, ensuring everyone has access to the most current insights.

    See it in action: Explore this live experiment example to see how effortlessly you can share data in real-time.
  2. Custom Reports
    Save snapshots of experiment data for deeper analysis or documentation. Analysts and engineers can:
    1. Select specific date ranges or user segments
    2. Adjust statistical parameters
    3. Add or remove metrics for targeted analysis
    4. Apply custom SQL filters to address outliers or unique use cases
    5. Create stakeholder-specific views tailored to their needs

Custom reports are saved for future reference, helping teams build institutional knowledge and better understand what worked, what didn’t, and why.

Learn more about sharing experiments to improve collaboration and streamline your team’s experimentation efforts.

By making experimentation workflows customizable and insights more accessible, GrowthBook empowers teams to share findings effortlessly, accelerate learning from experiments, and drive smarter decisions.

Ready to ship faster?

No credit card required. Start with feature flags, experimentation, and product analytics—free.
