The Edge Podcast

Twitch: False Negatives Are Killing Your Best Product Ideas

Running an experiment is the easy part. The hard part is making sure a wrong answer doesn't quietly bury a good idea for years.

That is the uncomfortable truth Arun Bodapati, director of data science at Twitch, kept returning to on this episode of The Experimentation Edge. Most teams obsess over false positives — the experiment that says "yes" when the real answer is "no." But after building and leading experimentation work at Schwab, Uber, and now Twitch, Bodapati has come to fear a more expensive mistake: the false negative.

The result that costs the most is the one nobody questions

A false positive, at least, tends to get caught. The effect looks too large, someone digs in, and the team applies more scrutiny. A false negative is quieter and more corrosive.

"False negatives are the killer," Bodapati said. Here is why. A product manager proposes an idea. The team runs an experiment. The result comes back negative — but it is a false negative, an artifact of a weak trigger or an underpowered test rather than a real read on user behavior. The statistical nuance is visible to only a few people. Everyone else, including the executives who allocate resources, hears one thing: "You tried it, it didn't work. Let's move on."

Then the damage compounds. "The worst thing is it gets institutionalized," Bodapati explained — the organization quietly files away that "we did try that intervention, and it did not work." A genuinely good idea can sit on the shelf for years because of one test that was never trustworthy in the first place.

For experimentation leaders, the lesson reframes where rigor matters most. Guardrail metrics and clean analysis protect you from acting on a bad win. But avoiding the false negative protects every future idea that resembles the one you wrongly killed.

Most of the work happens before you push play

If false negatives are the disease, Bodapati's prevention is almost entirely upstream — and it is unglamorous.

His first rule is to spend more time before pushing the play button. That means being clear on the enrollment logic and writing the hypothesis in plain English. "Have a hypothesis and an intervention," he tells product managers, engineers, and data scientists — and then resist the temptation to optimize before the experiment has even run. "You can always optimize later."

The plain-English test doubles as a filter. "If the intervention in plain English is very weak, just don't do the experiment," he said. "You're just wasting time." It is a deceptively strict standard. A surprising number of experiments are launched not because the underlying idea is strong, but because running a test feels like progress. Bodapati's bar is higher: if you cannot articulate a strong intervention in a sentence, the experiment is unlikely to teach you anything, and a null result will only add noise to the record.

Then comes the part he admits is hardest to systematize: experiment hygiene. "What is the actual trigger? Are the events that underlie the trigger reliable?" The honest answer, especially on mobile, is often no. "The client-side events, especially on mobile devices, are notoriously unreliable." An enrollment trigger that fires inconsistently is one of the most common sources of the false negatives he warns against — the experiment never cleanly measured the thing it claimed to.
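One common way to check whether an enrollment trigger is firing reliably is a sample ratio mismatch (SRM) test, which compares the number of users actually enrolled in each arm against the intended split. Bodapati does not describe this check in the episode. The sketch below is a minimal illustration that assumes a 50/50 split; the function name and the counts are hypothetical.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) that observed enrollment
    matches the intended split. Hypothetical helper for illustration."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square with 1 df is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Made-up enrollment counts, not Twitch data.
balanced = srm_p_value(50_000, 49_900)
skewed = srm_p_value(50_000, 48_500)
print(f"balanced arms: p = {balanced:.3f}")
print(f"skewed arms:   p = {skewed:.2g}")
```

A very small p-value suggests the trigger or its logging dropped users unevenly between arms, in which case the effect estimate is suspect until the enrollment problem is understood.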

There is a second, subtler trap in enrollment: over-narrowing. Teams often restrict an experiment to the exact users they believe will respond, because they already have a specific mental model. Bodapati pushes the other way. A small population has little statistical power, which raises the odds of a false negative. His guidance is to run a broad "explore" experiment first, particularly the first time you try a given intervention, and then do the segment analysis after the fact. Heterogeneous treatment effect models let you find the subpopulation that actually responded without sacrificing power up front. "You can always do the analysis ex post to figure out if your hypothesis was correct."
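To see why a narrow population raises the odds of a false negative, it helps to put rough numbers on power. The sketch below uses a standard normal approximation for a two-sided test on two conversion rates. The baseline rate, lift, and sample sizes are invented for illustration, and `approx_power` is a hypothetical helper, not something from the episode.

```python
from statistics import NormalDist


def approx_power(baseline_rate: float, treatment_rate: float,
                 users_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.
    Normal approximation; ignores the negligible opposite tail."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in rates with equal-sized arms.
    se = ((baseline_rate * (1 - baseline_rate)
           + treatment_rate * (1 - treatment_rate)) / users_per_arm) ** 0.5
    effect = abs(treatment_rate - baseline_rate)
    return std_normal.cdf(effect / se - z_crit)


# Same true effect (5.0% -> 5.5%), two hypothetical enrollment sizes.
narrow = approx_power(0.05, 0.055, users_per_arm=2_000)
broad = approx_power(0.05, 0.055, users_per_arm=50_000)
print(f"narrow segment power: {narrow:.0%}")
print(f"broad population power: {broad:.0%}")
```

With these made-up inputs, the narrow test has well under a 20% chance of detecting a lift that really exists, while the broad test detects it more than 90% of the time. The effect is identical in both cases; only the enrollment size changes.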

And when a result does come back positive, Bodapati does not simply celebrate. He asks the team for two things: a mechanistic understanding of why the change worked, and which segments are contributing most to the lift. "The numerical result is less interesting than actually describing in plain English what the user behavior was." If no one can explain the behavior, the win may itself be a statistical mirage.

When "we're worried" became something measurable

The clearest demonstration of this discipline at Twitch was pricing — a decision the company had treated as nearly untouchable.

Twitch had not raised subscription prices in a long time and had stayed resistant even as post-COVID inflation reshaped costs across the economy. The hesitation was not really about Twitch's own revenue. "We only make money if our creators make money," Bodapati said, and the company feared a price increase would cost creators their income. The instinct was protective — but it was an instinct, not a measurement.

The complication is that Twitch could not borrow an answer from anyone else. "Unlike Uber, which has real-time elasticities being computed, we don't have that." Twitch is closer to Netflix: prices do not move on the fly. The models the team had were built in a pre-inflation world and no longer described the present. The only honest path was to generate new data.

So the team designed geo-fenced experiments. They raised prices in carefully matched markets and used causal inference on the back end to estimate the true elasticity. The matching was deliberate: the UK and Ireland served as a read on the US East Coast, because viewer composition and, crucially for live streaming, time zones lined up, with large audiences tuning in after 5 or 6 PM. The German-speaking markets of Germany, Austria, and Switzerland formed another homogeneous block that watched similar creators.
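The episode does not say which causal inference method the team used. One simple approach to a geo experiment like this is difference-in-differences: compare the change in subscription units in a market where the price rose against the change in a comparison market where it did not. The sketch below assumes that setup; the markets, unit counts, and function name are all hypothetical.

```python
import math


def did_elasticity(treated_pre: float, treated_post: float,
                   control_pre: float, control_post: float,
                   price_increase: float) -> float:
    """Difference-in-differences on log units, divided by the log price
    change, as a rough elasticity estimate. Illustrative only."""
    treated_change = math.log(treated_post / treated_pre)
    control_change = math.log(control_post / control_pre)
    # Subtracting the control trend removes changes that would have
    # happened anyway (seasonality, platform-wide growth).
    did = treated_change - control_change
    return did / math.log(1 + price_increase)


# Made-up weekly subscription units before and after a 10% price increase.
elasticity = did_elasticity(
    treated_pre=1_000, treated_post=950,
    control_pre=2_000, control_post=2_040,
    price_increase=0.10,
)
print(f"estimated elasticity: {elasticity:.2f}")
```

The estimate is only as good as the match between markets, which is why the pairing of audiences and time zones matters: the comparison market has to stand in for what the treated market would have done without the price change.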

The findings were, in Bodapati's words, "no surprises, just econ 101." Raise prices and you lose some units, but the higher price per unit can more than make up for the loss, leaving the net effect accretive. The one genuine surprise came from gifted subscriptions, a behavior unique to Twitch's sense of community, where viewers buy subs in batches of 15 or 20 for others. Elasticity models suggested a basic price increase could cost the platform some of its biggest gifters, which led the team to experiment with promotions, despite early hesitation, and to demonstrate that promotions could drive incremental revenue.
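The "econ 101" trade-off can be sketched with a few lines of arithmetic. The example below assumes a constant elasticity applied linearly to a small price change, which is a simplification, and the elasticity values are invented rather than Twitch's estimates.

```python
def revenue_change(price_increase: float, elasticity: float) -> float:
    """Fractional change in revenue after a price increase, assuming
    units change by elasticity * price_increase (linear approximation)."""
    units_change = elasticity * price_increase
    return (1 + price_increase) * (1 + units_change) - 1


# Hypothetical elasticities for a 10% price increase.
inelastic = revenue_change(0.10, elasticity=-0.5)
elastic = revenue_change(0.10, elasticity=-1.5)
print(f"elasticity -0.5: revenue change {inelastic:+.1%}")
print(f"elasticity -1.5: revenue change {elastic:+.1%}")
```

Under this simplification, a 10% increase with an elasticity of -0.5 loses 5% of units and raises revenue by about 4.5%, while an elasticity of -1.5 loses 15% of units and lowers revenue by about 6.5%. Whether a price increase is accretive therefore depends on measuring elasticity rather than assuming it.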

The deeper shift was cultural. Pricing stopped being a one-time experiment the team ran and walked away from, and became a permanent lever — "a measurement technique to figure out the efficacy of that lever," built to be tuned again and again. Watching elasticity estimates translate into the company's actual financial reports gave executives, engineers, and product managers confidence that experiments could inform one of the most consequential decisions Twitch makes. An area once governed by "thou shall not touch" is now mature enough that the team is exploring bandits.

The takeaway for experimentation leaders

The thread connecting all of it is that an experimentation program is only as trustworthy as the results it produces — and a false negative erodes that trust silently, one shelved idea at a time. The defense is not more sophisticated math after the fact. It is discipline before the test: a strong hypothesis in plain English, reliable triggers, enough power to detect a real effect, and a refusal to run experiments that cannot teach you anything.

Ready to build an experimentation program your whole team can trust? Explore how GrowthBook helps you ship winning experiments with a transparent, open-source statistical engine at growthbook.io.
