What Makes Experimentation Unique at Chess.com

Chess.com has users who cannot move a pawn and users who play at FIDE competitive ratings. Both groups open the same app. For anyone running an experimentation program, that kind of skill variance changes almost every decision you make.

On the latest episode of The Experimentation Edge, I sat down with Nafis Shaikh, Director of Product Management at Chess.com, to talk about how his team designs experiments for a 10 million daily active user base that spans absolute beginners and rated competitors. Chess.com ran 400 experiments in 2024 and has set a goal of 1,000 in 2025, already 195 deep in Q1. The scale is impressive, but what makes their program genuinely different isn't the volume. It's what they've had to learn about designing for users whose needs pull in opposite directions, and their willingness to push past surface-level test results into something more useful.

Here's what stood out.

One Product, Wildly Different Users

Chess.com turned 20 last year. For most of its history, the product was built by chess players for chess players, which worked because the user base was relatively homogeneous. Starting in 2020, the population exploded. Today the app serves roughly 10 million daily active users, and the demographic spread inside that number is extreme.

Some users have never played chess in their lives. They don't know how the pieces move. They don't know what a pin is. At the other end of the spectrum, Chess.com has FIDE-rated competitive players, people with tournament histories and formal ratings who are using the product to prepare for real games.

This is where one-size-fits-all quietly falls apart. Nafis gave a specific example: the app has an AI coach that talks to users during play, explaining what's happening and offering tips. The way the coach speaks to a rated player is, and has to be, completely different from how it speaks to someone who just learned how the knight moves. Throw advanced concepts at a beginner and you confuse them and make the experience worse. Dumb down the feedback for an expert and it's worthless.

For an experimentation program, this has a concrete implication: every test needs to think about skill segments, not just aggregate results. A feature that lifts engagement overall might be destroying the experience for 20% of users while thrilling another 20%. If you're only looking at the average, you miss it. You also pick the wrong winner.
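To make that concrete, here is a minimal sketch of reading one experiment twice, once in aggregate and once per skill band. The column names (skill_segment, variant, engaged), the variant labels, and the CSV export are hypothetical assumptions, not Chess.com's actual pipeline; the point is only the shape of the two reads.

```python
# Minimal sketch: aggregate vs. per-segment readout of one experiment.
# All column names and variant labels below are hypothetical; adapt them
# to whatever your experiment export actually contains.
import pandas as pd

results = pd.read_csv("experiment_results.csv")

# The aggregate read: one number per variant, and exactly the trap
# described above.
print(results.groupby("variant")["engaged"].mean())

# The segment read: the same comparison, broken out by skill band.
by_segment = (
    results.groupby(["skill_segment", "variant"])["engaged"]
    .mean()
    .unstack("variant")
)
by_segment["lift"] = by_segment["treatment"] - by_segment["control"]
print(by_segment)

# A flat or positive overall number can hide rows like:
#   beginners +0.06, club players +0.01, rated players -0.05
```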

A Four-Dimension Framework for Deciding What to Measure

Nafis organizes every metric he cares about across four dimensions, and the order matters:

  1. Inflows. How effectively does the product bring new users in?
  2. Engagement. Once they're here, do they do the core thing? For Chess.com that means playing games, doing puzzles, using the coach.
  3. Retention. Do they come back? Measured at D1, D7, and D30, with weekly active users segmented into new, current, and returning cohorts (a sketch of the DN computation follows this list).
  4. Monetization. Do they start a trial, and do they end up paying for the subscription?
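For anyone setting these measurements up from scratch, here is a minimal sketch of the D1/D7/D30 computation from item 3, assuming a raw activity log and a signup table. Every name in it (the CSV files, the columns) is a hypothetical stand-in, not Chess.com's actual schema.

```python
# Minimal sketch: DN retention from raw logs with pandas.
# Assumes hypothetical exports: signups.csv (user_id, signup_date)
# and activity.csv (user_id, activity_date).
import pandas as pd

signups = pd.read_csv("signups.csv", parse_dates=["signup_date"])
activity = pd.read_csv("activity.csv", parse_dates=["activity_date"])

merged = activity.merge(signups, on="user_id")
merged["day_n"] = (merged["activity_date"] - merged["signup_date"]).dt.days

def dn_retention(n: int) -> float:
    # Share of all signed-up users who were active exactly N days
    # after signup.
    retained = merged.loc[merged["day_n"] == n, "user_id"].nunique()
    return retained / signups["user_id"].nunique()

for n in (1, 7, 30):
    print(f"D{n} retention: {dn_retention(n):.1%}")
```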

The order is deliberate. Nafis is explicit that most products cannot short-circuit to revenue. "You actually have to give people a really solid product that they find value in. They'll come back and use the product more often. And when that tipping point hits, they're more likely to pay for your product because they've found the value in it."

Inside Chess.com this shows up as a deliberate division of labor. The monetization team optimizes monetization metrics. The gameplay team optimizes the core experience. These groups do not get confused about who owns what, and, crucially, the gameplay team isn't pressured to justify every experiment through a revenue lens. The bet is that a better core experience eventually lifts everything downstream.

If you're setting up an experimentation program, this is worth copying. Deciding which metrics each team owns, and which ones they explicitly do not, removes a huge source of noise from experiment results.

Going Beyond "Did the KPI Move?"

The part of our conversation that stuck with me most was Nafis's push to evolve how Chess.com treats experiment results.

A lot of experimentation programs live at the level of "we ran test X, metric Y moved Z%." That's fine. It's necessary. But it's not enough. Nafis calls the next question "so what?" What does this result actually say about how users behave? What were they doing in the control condition that makes them respond this way to the treatment? What does the side effect in that other metric tell you about the kind of user this feature attracts?

He also has strong feelings about write-ups. A result that ships as "yeah, this improves retention" is not worth much. The pattern Chess.com is moving toward is narrative: we launched this specific feature on this date, we saw this lift at this step of the funnel, it carried through to this downstream behavior, and here is what we now believe about our users that we did not believe before.

That discipline is what turns a test count into organizational learning. A thousand experiments a year is a meaningless number if the team cannot tell you what it learned from them. The writing is where the learning gets captured.

Key Learning: Chess.com Users Prefer to Celebrate Their Wins

Now for the specific experiment that made me laugh out loud on the recording.

Chess.com has a feature called Game Review. After a game ends, the coach walks you through each of your moves and explains where you played well, where you blundered, and where you could have done something different. Game Review is Chess.com's freemium hook: everyone gets one free per day, and if you want more, you need a subscription. It's a huge driver of paid conversions.

The original design assumed something that felt obvious. When a player loses, they want to understand what went wrong. So the entry point to Game Review led with the things they had done poorly: here are your blunders, here are your misses, let's figure out what to fix.

Then the team looked at the data: 80% of Game Reviews were happening on wins.

Think about that for a second. Four out of five times a user reached for Game Review, they weren't trying to debug a loss. They were savoring a victory. The feature was introducing itself with a list of their mistakes, and people were opening it anyway, because what they actually wanted to see was the game where they won.

So they ran a test. Same feature. Same analysis engine. Same subscription gate. The only thing that changed was the entry point: instead of leading with "here are your blunders and misses," they led with "here are the good moves you made."
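The mechanics of a test like that are tiny. Here is a hedged sketch of what the variant split could look like; the assignment helper, experiment name, and entry-point copy are all illustrative assumptions rather than Chess.com's implementation, and a real experimentation platform would handle assignment and exposure logging for you.

```python
import hashlib

# Hypothetical deterministic bucketing: hash user + experiment into two
# arms. An experimentation SDK would normally do this part for you.
def assign_variant(user_id: str, experiment: str) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "positive_framing" if int(digest, 16) % 2 == 0 else "control"

def game_review_intro(user_id: str, good_moves: list[str],
                      blunders: list[str]) -> str:
    # Same feature, same analysis engine, same gate; only the opener
    # changes between arms.
    if assign_variant(user_id, "game_review_framing") == "positive_framing":
        return f"Your best moves this game: {', '.join(good_moves)}"
    return f"Let's look at what went wrong: {', '.join(blunders)}"

print(game_review_intro("user-123", ["Nf3!", "Qxd5"], ["Bc4??"]))
```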

Game Review starts jumped 25%. Subscription conversions went up meaningfully. Same product, completely different framing, significant lift on the metric that actually pays the bills.

Nafis said he was "somewhat dumbfounded" by the magnitude, but the lesson lines up with something he has seen across every game he has worked on at Zynga, Prodigy, and now Chess.com: "People just want to feel good. Focus on the things that make people feel better about themselves. The world's a hard place and people have difficult lives, and when you come to play a game that's supposed to be enjoyable, focus on the things that are enjoyable."

If you are running any consumer-facing product, this is a test worth trying yourself. Look at every surface where you currently lead with user failure: error states, empty states, review flows, retry prompts, churn emails. Ask whether the default framing could be flipped to celebrate what the user got right instead. Then put the reframe in a test. You will probably be surprised.

Listen to the Full Episode

Chess.com's program isn't unique because of its size or tooling. It's unique because of two things: a user base whose skill range forces segment-level thinking on every test, and a team that refuses to stop at "the metric moved." That combination is what turns a test count into genuine understanding.

You can hear my full conversation with Nafis Shaikh, including why experimentation velocity is itself a productivity metric and the strange challenge of measuring whether users are actually listening to an audio coach, on this episode of The Experimentation Edge.

Listen to the full episode: The Experimentation Edge with Nafis Shaikh, Chess.com
