Objective A/B Test Prioritization

There are many prioritization frameworks out there for A/B tests (PIE, ICE, PXL, etc.), but they all suffer from one critical problem — subjectivity. The reason A/B testing is so powerful in the first place is that people are really bad at guessing user behavior and the impact of changes. Why are we then using those same bad guesses to prioritize? In addition, many of these frameworks place far too much emphasis on effort — how long a test takes to implement. Except in the rarest of cases, the time it takes to implement a test is far shorter than the time the test needs to run. In most cases, effort should be more of a tie breaker, not a core part of prioritization.

At GrowthBook, we developed a new prioritization framework based on hard data, not subjective guesses. It all centers around an Impact Score, which aims to answer the question — how much can you move the needle on your metric? The Impact Score has 3 components — metric coverage, experiment length, and metric importance.

Metric Coverage

Metric Coverage indicates what percentage of the metric conversions your experiment touches. If you have a Sign Up button in your site-wide top navigation, it has 100% metric coverage for signups because 100% of members will see that button before signing up. Your experiment may not change their behavior, but it at least has the potential to. On the other hand, if you have a Sign Up button on your homepage, you may only see 20% of potential new members. Even if you do an amazing job optimizing the button, it will have no effect on the 80% of people who come in through different landing pages.

Metric Coverage doesn’t just take into account the URLs an experiment runs on; it also needs to factor in targeting rules. If your experiment runs only for users in the US, that reduces coverage. If the test only affects the UI on mobile devices, that lowers it as well.

Calculating Metric Coverage is actually fairly simple. You take the number of conversions your test could possibly influence and divide by the total number of conversions across the entire site. Getting these numbers usually requires checking Google Analytics or using SQL and can be tricky for non-technical users. At GrowthBook, we solve this by generating SQL and automatically querying your database, given a few simple prompts (experiment URLs, user segments being tested, etc.).
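The calculation itself is just a ratio. A minimal sketch in Python (the function name and the example counts are hypothetical; in practice the two numbers would come from your analytics queries):

```python
# Hypothetical example: estimating Metric Coverage from two conversion
# counts pulled from your analytics data.

def metric_coverage(influenced_conversions: int, total_conversions: int) -> float:
    """Fraction of all conversions the experiment could possibly affect."""
    if total_conversions == 0:
        return 0.0
    return influenced_conversions / total_conversions

# e.g. 4,000 signups came through the tested homepage (for the targeted
# segment) out of 20,000 signups site-wide
print(metric_coverage(4000, 20000))  # 0.2
```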

Experiment Length

Experiment Length is an estimate of how long the experiment must run to reach significance. In essence, you do a sample size calculation and then divide by the daily traffic each variation will receive.

There are many sample size calculators out there (I recommend this one), and the statistics to implement your own are not too hard, so I won’t cover that here. I will note that the sample size calculation does require a bit of subjectivity — namely, choosing a Minimum Detectable Effect (MDE). If you are making a tiny change that most people probably won’t notice, you will need a lower MDE to pick up the change. Conversely, if you are making a major change, a higher MDE will suffice.

Let’s say you do that and come back with a sample size of 2000 per variation. If your experiment receives 500 visitors per day (for the selected URLs and user segments) and you are running a simple 2-way A/B test, it will take 8 days to finish (2000 / (500 / 2)). A 3-way test with the same traffic would take 12 days.

Because it’s best practice to run an experiment for at least a week, we set a minimum length of 7 days, even when traffic is very high.
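The length estimate from the example above can be sketched as follows (a hypothetical helper, assuming you already have a sample size from a calculator):

```python
import math

def experiment_length_days(sample_size_per_variation: int,
                           daily_traffic: int,
                           num_variations: int = 2,
                           min_days: int = 7) -> int:
    """Days needed for every variation to reach the required sample size,
    clamped to a 7-day minimum (best practice: run at least one week)."""
    # Each variation receives daily_traffic / num_variations visitors per day,
    # so total days = sample_size * num_variations / daily_traffic.
    days = math.ceil(sample_size_per_variation * num_variations / daily_traffic)
    return max(days, min_days)

print(experiment_length_days(2000, 500, num_variations=2))  # 8
print(experiment_length_days(2000, 500, num_variations=3))  # 12
```

Note the clamp: even a test that would reach its sample size in a day still reports a length of 7.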

Metric Importance

Not all metrics are created equal. A “revenue” metric is more valuable than an “enter checkout” metric, which is more valuable than a “sign up for newsletter” metric.

This part of the equation simply assigns a number to each metric, ranging from 0 to 1 on a linear scale. For example, “revenue” might get a 1, “enter checkout” might get a 0.7, and “sign up for newsletter” might get a 0.2.

Developing this scale can be either entirely subjective or backed by data science and modeling. Companies usually have a relatively small set of metrics that remain fairly stable over time, so this scale can be established once at the organizational level rather than for each experiment.

Putting it all Together

Now we come to the actual Impact Score calculation (on a 0–100 scale):

metricCoverage * (7 / experimentLength) * metricImportance * 100

This optimizes for A/B tests that finish quickly and have significant potential impact on the most important metrics. It does not try to guess how likely the test is to succeed (that’s why we’re testing in the first place).
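Putting the three components into code is a one-liner (a hypothetical helper; the inputs for coverage and importance are the 0–1 values described above):

```python
def impact_score(metric_coverage: float,
                 experiment_length_days: float,
                 metric_importance: float) -> float:
    """Impact Score on a 0-100 scale. Coverage and importance are in [0, 1];
    length is clamped to the 7-day minimum, so 7/length is at most 1."""
    length = max(experiment_length_days, 7)
    return metric_coverage * (7 / length) * metric_importance * 100

# Hypothetical example: 20% coverage, an 8-day run, optimizing revenue
# (importance 1.0)
print(round(impact_score(0.2, 8, 1.0), 1))  # 17.5
```

Because each factor tops out at 1, the score is naturally bounded at 100: a site-wide test on your most important metric that finishes in the minimum 7 days.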

The Impact Score removes subjectivity from the equation and lets PMs focus on what they are really good at during prioritization — planning around limited engineering/design resources, conflicting experiments, marketing promotions, and other external factors.
