
Python A/B Testing for Backend Engineers


Most experimentation tooling was built for the browser.

That's fine if you're testing button colors — but if your team's most important product decisions live in a Flask API, a recommendation engine, or an ML inference pipeline, a JavaScript SDK and a visual editor aren't the right tools. They're a workaround. Server-side Python experiments are the right tool, and this guide covers how to build them correctly from the ground up.

This article is written for backend engineers, ML engineers, and product managers working on Python-first teams — people who need to test logic that never touches the browser and want results they can actually trust. Here's what you'll learn:

  • How the GrowthBook Python SDK works — the Experiment class, traffic controls, targeting conditions, and how to read the Result object
  • When to run experiments server-side versus client-side, and how to make that call based on what you're actually testing
  • The design decisions that determine whether your experiment produces real signal or just noise — traffic weights, sample size, metric selection, and user exposure timing
  • How to move from one-off Python scripts to a repeatable experimentation program that scales beyond what any single script can manage

The article moves in that order: SDK mechanics first, then architecture decisions, then experiment design, then program maturity. Each section builds on the last. If you're starting from zero, read straight through. If you're already running experiments and hitting a specific wall, jump to the section that matches where you're stuck.

The two definitions of Python experiments (and why conflating them breaks your testing practice)

The phrase "Python experiments" means two different things depending on who's using it, and conflating them leads to real confusion when teams try to build a serious testing practice. The first meaning is Python as a scripting language for exploratory work — running a notebook, testing a hypothesis, prototyping logic.

The second is Python as the SDK layer for running controlled, statistically valid A/B tests and feature flags inside production services. This article is about the second definition. Understanding the distinction is the prerequisite for everything that follows.

Two definitions, one article

When a data scientist says they're "running a Python experiment," they might mean a Jupyter notebook comparing two model architectures. When a backend engineer says the same thing, they might mean a live traffic split between two pricing algorithms serving real users. Both involve Python.

Only one is a controlled experiment in the sense that matters for product decisions — with deterministic user assignment, defined traffic weights, and a result you can act on with statistical confidence.

The structured version of a Python experiment is what an SDK layer enables: an Experiment class with a defined key, a set of variations, and a run() method that returns a Result object telling you exactly which variation a user received and whether they were included in the experiment at all. That's the mechanism this article covers.

Why Python is where the high-impact experiments live

Most experimentation tooling was built front-end-first. Visual editors, URL redirect tests, JavaScript SDKs — these tools were designed for marketing teams optimizing landing pages and conversion flows in the browser. That's a legitimate use case, but it's not where the highest-stakes product decisions execute for most engineering organizations.

Python powers recommendation engines, pricing logic, search ranking, ML model selection, and the API response logic that determines what users see before the front end ever renders anything. The decisions that move retention and revenue often happen in Python services, not in the browser.

Character.AI's Landon Smith, Head of Post-Training, described the value directly: "We can compare model variants against real user outcomes — guiding our research in the direction that best serves our product." That's a Python-layer experiment. The front end had nothing to do with it.

What the SDK layer actually adds

The gap between "ad hoc Python branching" and "a controlled experiment" is larger than it looks. A raw if/else block based on a user ID modulo check has no statistical guarantees, no consistent assignment across services, no targeting conditions, and no connection to your analytics pipeline. It's a coin flip with extra steps.

A Python experimentation SDK closes that gap. GrowthBook's Python SDK handles deterministic user assignment through consistent hashing, supports traffic weighting and partial rollouts, accepts MongoDB-style targeting conditions to restrict exposure to specific user segments, and returns a structured Result object with properties like inExperiment, value, and hashAttribute. That's the infrastructure that turns a branching statement into an experiment you can analyze and trust.
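
To make the contrast concrete, here's a minimal sketch of both approaches. The show_new_checkout and show_old_checkout helpers are hypothetical stand-ins for your own code paths:

```python
from growthbook import GrowthBook, Experiment

# The ad hoc version: no targeting, no analytics connection, and if
# user_id is a string, Python's built-in hash() is salted per process,
# so this assignment isn't even stable across restarts.
if hash(user_id) % 2 == 0:
    show_new_checkout()
else:
    show_old_checkout()

# The SDK version of the same decision: deterministic hashing,
# traffic controls, and a structured Result object.
gb = GrowthBook(attributes={"id": user_id})
result = gb.run(Experiment(key="new-checkout", variations=[False, True]))
if result.value:
    show_new_checkout()
else:
    show_old_checkout()
```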

The community has noticed this gap too — the tea-tasting Python package, which gained significant traction in practitioner circles, exists specifically to bring statistical rigor to A/B test analysis in Python environments. It handles the mechanics of determining whether a result is real or random: significance tests, techniques for reducing the variance that makes small effects hard to detect, and tools for calculating how much traffic you need before a result is trustworthy. The demand for structured tooling is real.

Backend engineers and ML teams are the real audience for Python experiments

The audience for Python experiments is not the same as the audience for visual A/B testing tools. It's backend engineers running Flask, Django, or FastAPI services who need to test logic that never touches the browser. It's ML engineers who want to compare model variants against real user outcomes rather than offline benchmarks.

It's product managers embedded with Python-first teams who need a testing framework their engineers will actually use.

If your team's most important product decisions execute in Python — and for most engineering organizations, they do — then server-side Python experimentation isn't a workaround for the lack of a JavaScript SDK. It's the right tool for the job.

Running inline Python experiments: assignment logic, traffic controls, and the result object

Inline experiments in the GrowthBook Python SDK are self-contained by design. All experiment logic — variation assignment, traffic splitting, targeting evaluation — runs locally within the SDK without making third-party requests. That means no latency penalty from a remote configuration call on every request, and no single point of failure in your experiment infrastructure.

For backend teams running Python services at scale, this is a meaningful architectural advantage over approaches that require a round-trip to an external service before serving a response.

The core abstraction is the Experiment class. Understanding its parameters, and knowing how to read the Result object it returns, is everything you need to go from zero to a running experiment.

The minimum viable experiment: key and variations

Every experiment requires exactly two parameters: a key and a variations array. The key is a globally unique string that identifies the experiment for tracking purposes. The variations array holds the values users will be assigned to — and those values can be any Python data type. Strings, integers, floats, tuples — the SDK doesn't constrain you to boolean flags or string labels.

That flexibility matters more than it might seem. An experiment controlling pricing tiers can use variations=[9.99, 14.99, 19.99]. An experiment testing a composite UI configuration can use variations=[("blue", "large"), ("green", "small")] and then access result.value[0] and result.value[1] independently.

This makes the Experiment class usable for ML model parameter testing, recommendation algorithm variants, and backend logic branches — not just copy changes.
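
A minimal sketch of that shape, assuming the growthbook package is installed; the key and prices are illustrative:

```python
from growthbook import GrowthBook, Experiment

gb = GrowthBook(attributes={"id": "user-123"})

# The minimum viable experiment: a globally unique key plus a
# variations array of any Python type.
exp = Experiment(
    key="pricing-tier-test",
    variations=[9.99, 14.99, 19.99],
)

result = gb.run(exp)
print(result.value)  # e.g. 14.99

gb.destroy()  # cleanup for the legacy synchronous client
```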

Controlling traffic: weights, coverage, and hash attributes

Once you have a key and variations, the next set of decisions is about who gets into the experiment and how they're assigned.

The weights parameter accepts a list of floats that must sum to 1.0. A three-way experiment with an uneven split looks like weights=[0.5, 0.25, 0.25], routing half of included traffic to the control and splitting the remainder equally between two treatment arms. If you omit weights, traffic is distributed evenly across variations.

The coverage parameter controls what fraction of eligible users are included in the experiment at all, expressed as a float between 0 and 1. Setting coverage=0.1 runs the experiment on 10% of users — useful for slow rollouts where you want to limit exposure before you're confident in a change.

The hashAttribute parameter determines which user attribute drives variation assignment. It defaults to "id", but for B2B products where every user at a company must see the same variation, you'd set hashAttribute="company". This ensures deterministic, company-level assignment without any server-side state storage. It's also worth setting hashVersion=2 to use the current hashing algorithm.
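
Put together, the traffic controls look like this sketch, where the experiment key and the company attribute stand in for your own user model:

```python
from growthbook import Experiment

exp = Experiment(
    key="checkout-flow-test",
    variations=["control", "variant-a", "variant-b"],
    weights=[0.5, 0.25, 0.25],  # must sum to 1.0
    coverage=0.1,               # include only 10% of eligible users
    hashAttribute="company",    # company-level assignment for B2B
    hashVersion=2,              # current hashing algorithm
)
```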

Targeting conditions and namespace isolation

The condition parameter accepts a dictionary using MongoDB-style query syntax. A condition like {'beta': True} restricts the experiment to users where that attribute is set. More complex conditions — {'country': 'US', 'browser': {'$in': ['chrome', 'firefox']}} — work the same way, composing attribute checks without requiring custom filtering logic in your application code.

For teams running multiple concurrent experiments, the namespace parameter handles mutual exclusivity. It takes a tuple of (name, start, end) where start and end are floats between 0 and 1 representing a slice of a named namespace. Two experiments using namespace=("pricing", 0, 0.5) and namespace=("pricing", 0.5, 1) occupy non-overlapping ranges of the same namespace, guaranteeing that no user is enrolled in both simultaneously.

This is one of those details that's easy to skip in early experimentation and painful to retrofit later when you're running dozens of tests.
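
Here's a hedged sketch of both parameters; the attribute names and the pricing namespace are illustrative:

```python
from growthbook import Experiment

# MongoDB-style targeting: only US users on Chrome or Firefox.
targeted = Experiment(
    key="us-browser-test",
    variations=["control", "treatment"],
    condition={"country": "US", "browser": {"$in": ["chrome", "firefox"]}},
)

# Two experiments sharing the "pricing" namespace but occupying
# non-overlapping ranges, so no user can be enrolled in both.
pricing_model = Experiment(
    key="pricing-model-test",
    variations=["old", "new"],
    namespace=("pricing", 0, 0.5),
)
pricing_copy = Experiment(
    key="pricing-copy-test",
    variations=["short", "long"],
    namespace=("pricing", 0.5, 1),
)
```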

Reading the result object

Calling gb.run(experiment) returns a Result object with six properties worth knowing. result.inExperiment is True if the user was assigned to a variation and False if they were excluded for any reason — failed targeting condition, outside coverage threshold, or otherwise. result.value holds the actual variation value, while result.key gives you the string index of the assigned variation ("0", "1", etc.).

Whether the user was randomly assigned through hashing is captured in result.hashUsed — this is False when a variation was forced, which is useful for distinguishing QA overrides from real traffic in your analytics. Finally, result.hashAttribute and result.hashValue tell you which attribute was used and what its value was at assignment time.

The standard pattern is a conditional check on inExperiment before acting on the result: if the user wasn't included, you fall through to default behavior without any special handling.
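
In code, the standard pattern looks like this sketch, continuing the earlier pricing example; DEFAULT_PRICE and log_exposure are hypothetical placeholders for your own defaults and analytics:

```python
result = gb.run(exp)

if result.inExperiment:
    price = result.value  # the assigned variation value
    # result.hashUsed is False when a variation was forced,
    # e.g. a QA override rather than real hashed traffic.
    log_exposure(result.key, result.hashAttribute, result.hashValue)
else:
    price = DEFAULT_PRICE  # excluded users fall through to defaults
```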

Async vs. legacy client

The SDK ships with two client patterns. The legacy synchronous client (GrowthBook) is straightforward for Flask applications and standalone scripts — instantiate it, call gb.run(), call gb.destroy() for cleanup. The modern async client (GrowthBookClient) is designed for FastAPI and async Django, requires await client.initialize() before running experiments, and passes the user context as a second argument to await client.run(experiment, user).

The async client also uses snake_case property names on the result object — result.in_experiment and result.hash_used — rather than the camelCase used by the legacy client. Mixing the two naming conventions is a runtime error, so it's worth being deliberate about which client pattern your codebase standardizes on.
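
A sketch of the async pattern, assuming a recent SDK version. The UserContext class follows the SDK documentation, but verify the exact names and constructor options against the version you're running:

```python
from growthbook import GrowthBookClient, Experiment, UserContext

async def get_variation(user_id: str):
    # Constructor options (API host, client key) omitted for brevity;
    # they depend on your setup.
    client = GrowthBookClient()
    await client.initialize()  # required before running experiments

    exp = Experiment(key="my-test", variations=["A", "B"])
    user = UserContext(attributes={"id": user_id})
    result = await client.run(exp, user)

    # Note the snake_case property names on the async client.
    return result.value if result.in_experiment else "A"
```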

Server-side vs. client-side Python experiments: why the default answer is already in your architecture

If you're running a Python backend, the question of where to run your experiments isn't really a coin flip. The architecture you already have — API services, ML inference pipelines, Django or FastAPI applications — is exactly where server-side experimentation is most powerful. Client-side testing has a legitimate role, but it's a narrower one than the tooling ecosystem might lead you to believe.

Why server-side is the right default for Python backend teams

The core advantage of server-side experimentation is that the assignment decision happens before anything reaches the user. As GrowthBook's documentation puts it, server-side testing "allows you to run very complex tests that may involve a lot of different parts of the code, and span multiple parts of your application." For a Python team, that means you can experiment on pricing logic, recommendation algorithms, API response structures, or multi-step checkout flows — not just surface-level UI elements.

There's no flickering. Because the variation is determined on the server before the response is sent, users never see a flash of the wrong content while the experiment loads. This isn't a minor aesthetic concern — it's a measurement concern. Flickering introduces noise into behavioral data and erodes the trust of anyone reviewing results.

For teams running ML models, there's a more specific advantage worth calling out. Deterministic hashing for assignment means a given user or entity will always receive the same variation across API calls without requiring you to store state from the experimentation platform. If you're A/B testing two versions of a ranking model or a recommendation engine, that consistency is essential — and it's something client-side tools weren't designed to provide.

Performance is also cleaner on the server side. The GrowthBook Python SDK evaluates experiments locally with zero network calls per flag check, using a cached JSON payload that delivers sub-millisecond evaluation. For Python services where latency budgets are tight, that matters.

When client-side testing is still the right call

Client-side testing earns its place in one well-defined scenario: visual changes that live entirely in the browser. Button colors, headline copy, layout variations, call-to-action placement — these are changes where the experiment is fundamentally about what the user sees, not about what your backend computes. A visual editor handles exactly this use case, letting non-engineers run experiments on UI elements without touching source code.

URL redirect experiments are another legitimate client-side use case, particularly for marketing teams testing landing page variants across different acquisition channels. These don't require engineering involvement and shouldn't — they belong in a tool designed for that workflow.

The honest tradeoff is this: client-side testing is faster to set up for teams without embedded technical resources, and it's a reasonable entry point for organizations just beginning to build an experimentation culture. But it comes with the flickering problem, and mitigations — loading the SDK earlier in the page, using inline experiments — add complexity that partially offsets the simplicity advantage.

The decision is simpler than the "it depends" framing suggests

The decision framework is simpler than most "it depends" discussions make it sound. If the change you're testing lives in backend code, API logic, a database query, or an ML model, run it server-side. The Python SDK is built for exactly this, and your existing infrastructure is the right place for the assignment logic to live.

If the change is a visual element that a designer or marketer needs to control without a deployment cycle, client-side tooling — specifically a visual editor or URL redirect experiment — is the right tool. That's not a concession; it's a clean division of responsibility.

If your experiment spans multiple application layers — say, a new checkout flow that changes both backend pricing logic and frontend presentation — server-side is still the anchor. The backend assignment is the source of truth, and the frontend renders accordingly.

For Python teams, the default should be server-side. Use client-side where it genuinely fits the use case, not because it was the first option available.

Key experiment design decisions every Python developer should get right

Once you've established that your experiment belongs in your Python backend — not in the browser — the next layer of decisions is about design. The architectural choice is settled; what remains is the configuration that determines whether your experiment produces a result you can trust or noise that wastes weeks of traffic.

The quality of a Python experiment is determined before the first line of test logic runs. By the time you're watching traffic flow into your variations, the decisions that will make or break your results have already been made — or missed. Hash attributes, traffic weights, targeting conditions, namespace isolation, and metric definitions are pre-launch concerns. Get them wrong and you're not running an experiment; you're generating noise that looks like signal.

GrowthBook's Lead Data Scientist Luke Sonnet, PhD, frames this directly: poorly planned experiments waste time and lead to bad decisions. That framing should sit at the front of every developer's mind before they flip the switch on a new test.

Validate your setup before you run a real test

The cheapest insurance policy in experimentation is an A/A test — two identical variations with no actual difference between them. Before committing to a real experiment, run an A/A test to confirm that your traffic splitting and statistical machinery are working correctly. If you see a statistically significant result in an A/A test, you have an implementation bug, not a product insight. Catching that before a real experiment runs saves weeks of wasted data collection.
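
An A/A test needs nothing beyond the SDK mechanics already covered. A sketch, with an illustrative key:

```python
from growthbook import Experiment

# Two identical variations: any statistically significant difference
# between the groups points to an implementation bug, not an insight.
aa_test = Experiment(
    key="aa-validation",
    variations=["control", "control"],
)
```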

Set traffic weights and coverage intentionally

A common mistake is reaching for unequal splits — say, 80% control and 20% variant — when the real goal is de-risking a new change. The recommended approach is to keep variation splits equal and adjust overall exposure instead. Rather than an 80/20 split, run 20% overall exposure at 50/50: each variation receives 10% of total traffic. This lets you ramp up gradually without users switching between variations as you increase coverage, which would corrupt your assignment consistency.
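
In SDK terms, the ramp looks like this sketch (the key is illustrative):

```python
from growthbook import Experiment

# 20% overall exposure at 50/50: each arm receives 10% of traffic.
# Ramp by raising coverage toward 1.0; unlike changing weights, this
# never switches an already-assigned user to a different variation.
exp = Experiment(
    key="new-ranker-test",
    variations=["current", "candidate"],
    coverage=0.2,  # weights omitted, so traffic splits evenly
)
```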

On sample size: use 200 conversion events per variation as your conservative minimum threshold. For a feature with a 10% conversion rate running a standard two-way test, that means exposing at least 4,000 total users before you can trust the results. Run the experiment for at least one to two weeks to capture natural traffic variability — weekday versus weekend patterns matter, and a test that starts Friday and ends Monday is measuring something much narrower than you think.
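
The arithmetic behind that 4,000-user figure, written out:

```python
conversions_per_variation = 200  # conservative minimum threshold
baseline_conversion_rate = 0.10  # the feature's conversion rate
num_variations = 2               # standard two-way test

users_per_variation = conversions_per_variation / baseline_conversion_rate
total_users = users_per_variation * num_variations
print(users_per_variation, total_users)  # 2000.0 4000.0
```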

Define metrics before you launch, not after

Metric shopping — selecting the metric that confirms your hypothesis after results come in — is one of the most common ways experiments produce false confidence. The primary metric, or Overall Evaluation Criterion, must be locked before the test launches. A well-structured metric framework distinguishes between goal metrics (what you're trying to move), secondary metrics (learning signals that don't drive shipping decisions), and guardrail metrics (things you must not hurt).

Adding too many primary metrics compounds the multiple testing problem and inflates your false positive rate. Secondary metrics can be added retroactively for learning purposes, but the metric that determines whether you ship should never be chosen after you've seen the data.

Namespaces prevent concurrent tests from contaminating each other

When multiple experiments run simultaneously, there's a risk that users end up enrolled in tests that interact with each other in ways that contaminate both results. The namespace feature in GrowthBook's SDK handles this by enabling mutually exclusive experiment assignment — a user who falls into one experiment's slice of a namespace cannot be enrolled in another experiment occupying a different slice of that same namespace.

That said, meaningful interaction effects between concurrent tests are actually quite rare. Running experiments serially to avoid any possibility of interaction is a real velocity cost that's usually not worth paying. Namespaces are the right tool when mutual exclusion is genuinely required; they shouldn't become a reflex that serializes your entire testing program.

Expose users at the right moment

Assignment should happen as close to the actual treatment as possible. If you expose all users to a signup flow experiment at page load — rather than only those who open the signup modal — you're including users who never experienced the variation in your analysis. That inflates noise and reduces your ability to detect real effects.

When assignment must precede exposure for technical reasons, use an activation metric to filter out users who were assigned but never actually saw the treatment. The goal is a clean comparison between users who experienced the change and users who didn't — not a comparison between everyone who could have.
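
As a sketch, the difference is simply where in the request path the experiment runs; signup_experiment and build_modal are hypothetical:

```python
def render_signup_modal(gb):
    # Assignment happens here, only for users who actually open
    # the modal, not for everyone who loads the page.
    result = gb.run(signup_experiment)
    layout = result.value if result.inExperiment else "default"
    return build_modal(layout)
```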

From one-off Python scripts to a scalable experimentation program

There's a predictable arc to how engineering teams start with Python experiments. Someone writes a clean script that assigns users to variants, logs the results, and produces a readable analysis. It works. The team learns something useful. Then they write another one. And another.

Somewhere around the fifth or sixth experiment, the cracks appear: nobody knows what's currently running, two tests are accidentally targeting the same users, a product manager wants to check results but can't read the code, and the learnings from six months ago exist only in a Slack thread nobody can find. The SDK was never the bottleneck. The program was.

The limits of one-off Python experiments

The GrowthBook Python SDK is deliberately lightweight — it handles assignment logic locally, requires no network call to run an experiment, and integrates cleanly into any Python service. That's a genuine strength. But the SDK handles exactly one thing: deciding which variant a given user sees.

Everything else that makes experimentation valuable at scale — visibility into what's running, cross-functional access, cumulative learning, conflict detection between concurrent tests — lives outside the SDK in isolation.

GrowthBook's documentation frames this ceiling directly: at the early stages of experimentation maturity, "A/B tests may be run manually, limiting the number of possible experiments that can be run." Manual coordination is the bottleneck, not Python's capabilities. A well-written script can run one experiment well. It cannot run fifty simultaneously without organizational scaffolding that the script itself cannot provide.

The experimentation maturity model

GrowthBook's documentation draws on the maturity framework from Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang, and Ya Xu, organizing experimentation programs into four stages: Crawl, Walk, Run, and Fly.

The Crawl stage is where teams have basic event tracking but no experiments running — data informs intuition, not decisions. Walk is where A/B tests begin but are run manually, typically producing one to four tests per month depending on traffic; this is where most Python-first teams live, capable of running experiments but constrained by the overhead of setting each one up by hand.

The Run stage is where experimentation maturity becomes a company practice rather than an engineering exercise — teams adopt or build a platform, larger product changes are tested by default, and volume reaches five to one hundred tests per month. Reaching the Fly stage — one hundred to ten thousand or more tests per month — means A/B testing has become the default for every feature, and any member of the product and engineering organization can run tests without filing a ticket.

The progression from Walk to Run is less about writing better Python and more about removing the friction that makes each experiment feel like a project.

What a platform layer adds beyond the SDK

GrowthBook is a unified platform — the Python SDK is one of its core capabilities, not a standalone product. Teams that start with the SDK in isolation are using one piece of a larger system that also includes centralized experiment management, analytics, no-code experiment types, and role-based access. Understanding what the full platform adds is what makes the difference between a team running one experiment at a time and one running fifty.

A platform adds centralized lifecycle management — moving experiments from idea through implementation, measurement, and shared results in a single place rather than across scattered scripts and spreadsheets. It adds no-code experiment types (visual editor, URL redirects) that allow product and marketing teams to run tests without engineering involvement, which is the only realistic path to high experiment velocity.

It adds retroactive metric addition, so a question that emerges mid-experiment doesn't require a restart. It adds a cumulative impact dashboard that shows the aggregate effect of all experiments on a given metric — because, as GrowthBook's product documentation notes, "each experiment may not have a large effect on your metrics, but many experiments might." And it adds a searchable archive of past experiments, capturing institutional knowledge that otherwise evaporates when engineers move to new teams.

Role-based access is underrated here. The reason most Python-first experimentation programs stay at the Walk stage is that every test requires an engineer. A platform that gives product managers direct access to experiment configuration and results removes that dependency — and that removal is what makes the jump to Run possible.

Diego Accame, Director of Engineering & Growth at Upstart, put the organizational shift plainly: "GrowthBook has changed the way we think about experiments. It allowed us to uplevel our code, speed up decision-making." That's the transition this section is describing — from Python experiments as an engineering artifact to experimentation as a shared organizational capability. Starting with the Python SDK is the right first move; GrowthBook's full platform capabilities are already there when your program is ready for them.

Start running Python experiments that actually drive decisions

The through-line of this article is simple: the gap between a branching statement and a real experiment is larger than it looks, and closing that gap requires getting the fundamentals right before you worry about scale. Deterministic assignment, intentional traffic controls, pre-defined metrics, and exposure timing that matches your actual treatment — these aren't advanced topics. They're the foundation. Everything else builds on them.

Teams that get this right consistently share one trait: they treat experiment design as a pre-launch discipline, not a post-hoc cleanup. Floward's data science team, for example, cut experiment setup time from three days to under thirty minutes after adopting a warehouse-native experimentation platform — not because the experiments became simpler, but because the scaffolding stopped being a bottleneck. That's the operational shift this article is trying to help you reach.

Choosing the right approach: inline SDK experiments vs. feature flag rules

The inline SDK approach covered in this article gives you full control in code — it's the right starting point when your experiment logic lives inside a Python service and your team is comfortable owning the configuration. Feature flag rules managed through GrowthBook's platform are the better fit once you need non-engineers to adjust targeting or traffic without touching a deployment. Neither is universally superior; the right choice depends on who needs to control the experiment and how often that configuration changes.

A practical checklist before you launch your first Python experiment

Run an A/A test first (covered in the experiment design section above) to confirm your assignment logic and analytics pipeline are working before real traffic enters a live experiment. Lock your primary metric before launch, set coverage intentionally rather than defaulting to 100%, and make sure your exposure event fires as close to the actual treatment as possible. These aren't bureaucratic steps; they're what separates a result you can act on from one you'll spend weeks second-guessing.

The signal that you've outgrown one-off Python scripts is organizational, not technical

The honest signal that you've outgrown one-off Python scripts isn't technical — it's organizational. When a product manager can't check results without asking an engineer, when nobody knows what's currently running, when learnings from last quarter live only in someone's memory, the bottleneck is the program, not the code. That's the moment to think seriously about GrowthBook's full platform capabilities.

Role-based access and centralized experiment management exist specifically to remove the engineering dependency that keeps most teams stuck at the Walk stage.

The goal this article is trying to help you reach is experiments that produce decisions, not just data. That's achievable with the tools and patterns covered here — and this article is meant to give you a clear enough picture of the full landscape that you can start without second-guessing every choice.

Where to start depending on where you are

If you haven't run a Python experiment yet: Install the GrowthBook Python SDK, write a minimum viable experiment — a key, two variations, and a hashAttribute that matches your user model — and run an A/A test against your staging environment before anything touches production. The GrowthBook Python SDK documentation at docs.growthbook.io/lib/python is the right starting point.

If you're running experiments but want to improve design quality: Return to the experiment design section of this article and audit your current setup against the five decisions covered there: A/A validation, traffic weight intentionality, pre-defined metrics, namespace isolation, and exposure timing. Fix whichever of those is weakest before running your next test.

If you're hitting the coordination problems described in the scaling section — scattered results, no cross-functional visibility, every test requiring an engineer — evaluate GrowthBook's full platform capabilities. Role-based access and centralized experiment management are the specific features that remove the engineering dependency keeping most teams at the Walk stage.
