True Positive: Definition and Examples in Testing

Most teams tracking model performance or running A/B experiments focus on whether their system is catching real positives — but that's only one piece of a four-part picture.

A true positive only means something useful when you understand it alongside false positives, false negatives, and true negatives. Miss that context, and you end up optimizing the wrong thing, sometimes badly.

This article is for engineers, product managers, and data teams who work with classifiers, ML models, or experimentation platforms and want a clear, practical grip on how true positives fit into real evaluation work. Here's what you'll learn:

  • What a true positive is and how it fits into the four-outcome binary classification framework
  • How to read a confusion matrix and why accuracy alone will mislead you on imbalanced data
  • How to calculate the true positive rate (TPR), what it actually measures, and where it goes by other names like sensitivity and recall
  • Why maximizing true positives isn't always the right goal — and how to think about the sensitivity-specificity trade-off
  • How true positives work specifically in A/B testing, including the common practices that inflate false discoveries and suppress real ones

Each section builds on the last. Start with the definition, work through the math, and finish with the practical implications for experimentation — including how GrowthBook handles peeking, multiple testing, and statistical power in ways that directly affect whether your test results are real.

A true positive has a two-part definition — and getting either part wrong breaks your analysis

Precision matters when you're evaluating whether a test, model, or experiment is actually working. The term "true positive" gets used loosely in practice — often as a shorthand for "a correct result" — but that framing is incomplete in a way that causes real problems when you're trying to reason about model performance or test quality.

A true positive has a specific, two-part definition, and understanding it exactly is the foundation for everything else in this space.

Why "a correct result" is not specific enough

A true positive occurs when a test predicts a positive outcome and the underlying condition is genuinely present. Both conditions must hold simultaneously: the test says positive, and the ground truth is positive.

As Wikipedia frames it in the context of diagnostic testing, sensitivity (the true positive rate) is "the probability of a positive test result, conditioned on the individual truly being positive". That word conditioned is doing important work — it means the ground truth is already established as positive, and the question is whether the test correctly reflects that reality.

This distinction matters because "a correct result" is not specific enough. A test that correctly identifies a negative case — someone who doesn't have a disease, a transaction that isn't fraudulent — is also producing a correct result. That outcome is a true negative, which is a different category entirely. A true positive is specifically a correct positive identification.

The four-outcome binary classification framework

No test outcome exists in isolation. A true positive only has meaning when you understand it as one of four possible outcomes in any binary classification system. When a test is applied to a case where the condition either exists or doesn't, and the test either flags it or doesn't, you get exactly four combinations:

|               | Condition Present              | Condition Absent              |
| ------------- | ------------------------------ | ----------------------------- |
| Test Positive | True Positive                  | False Positive (Type I Error) |
| Test Negative | False Negative (Type II Error) | True Negative                 |

The four cells of this table represent the complete universe of outcomes for any binary classifier. A test that labels everything as positive would accumulate a high count of true positives, but it would also generate false positives on every negative case. That's not a useful test. The true positive count only becomes meaningful when you can see it in context alongside the other three outcomes.
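To make the four-outcome breakdown concrete, here is a minimal Python sketch (the `confusion_counts` helper is illustrative, not from any particular library) that tallies the four cells and shows why a classifier that flags everything as positive isn't useful:

```python
# Tally the four outcomes from paired (actual, predicted) labels.
# True = positive condition / positive prediction.
def confusion_counts(actual, predicted):
    tp = fp = fn = tn = 0
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            tp += 1      # test positive, condition present
        elif guess and not truth:
            fp += 1      # test positive, condition absent (Type I error)
        elif not guess and truth:
            fn += 1      # test negative, condition present (Type II error)
        else:
            tn += 1      # test negative, condition absent
    return tp, fp, fn, tn

# A classifier that labels everything positive racks up true positives,
# but it also produces a false positive for every negative case.
actual   = [True, True, True, False, False, False]
flag_all = [True] * len(actual)
print(confusion_counts(actual, flag_all))  # (3, 3, 0, 0)
```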

The definition holds across domains; what changes is the cost of getting it wrong

The definition is domain-agnostic. The underlying logic — test says positive, condition is real — applies identically whether you're working in healthcare, financial services, or software systems.

In medical screening, a cancer test that correctly flags a patient who actually has cancer is producing a true positive. The clinical stakes here are high: a test with a high true positive rate catches real cases early, enabling timely treatment. Missing those cases — producing false negatives instead — carries serious consequences for patient outcomes.

In fraud detection, a true positive occurs when a model flags a transaction as fraudulent and that transaction is, in fact, fraudulent. Compliance teams rely on these correct identifications to prevent financial losses and protect customer accounts. The system is doing exactly what it's supposed to do.

Software and model evaluation follow the same logic: a classification model produces a true positive when it correctly identifies a positive instance — a defective component on a production line, a spam email in a filtering system, a bug flagged by a static analysis tool that is a genuine defect. The condition exists, and the model found it.

Across all three domains, the definition holds. What changes is the cost structure around errors — how damaging it is to miss a real positive versus how damaging it is to flag a false one. But that's a question of trade-offs, not of the definition itself.

True positives only have meaning inside the full four-outcome confusion matrix

A true positive doesn't exist in isolation. Its meaning only becomes clear when you place it alongside the three other outcomes that any classifier or test can produce: true negatives, false positives, and false negatives.

The confusion matrix is the minimum unit of analysis for evaluating test quality — and if you're only tracking how often your model catches real positives, you're missing most of the picture.

How the four cells map every possible prediction against reality

A confusion matrix is a 2×2 table that maps every prediction a model makes against what actually occurred. One axis represents the actual class; the other represents the predicted class. The diagonal — top-left to bottom-right — captures correct predictions. The off-diagonal cells capture errors.

|                   | Predicted Positive                 | Predicted Negative                  |
| ----------------- | ---------------------------------- | ----------------------------------- |
| Actually Positive | True Positive (TP) ✓               | False Negative (FN), Type II Error  |
| Actually Negative | False Positive (FP), Type I Error  | True Negative (TN) ✓                |

A concrete example makes this tangible. Consider a cancer screening classifier evaluated on 12 individuals — 8 who actually have cancer and 4 who don't. If the model makes 9 correct predictions but misclassifies 3 — calling 2 cancer patients cancer-free and flagging 1 healthy person as having cancer — you have 2 false negatives and 1 false positive.

The model looks reasonably accurate at first glance, but the error breakdown reveals two very different failure modes with very different consequences.

One note on convention: some sources place actual classes on rows and predicted classes on columns; others reverse this. Both are valid. The table above follows the row-as-actual convention used by Wikipedia and most ML tooling, but you'll encounter both in practice.
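If you want to check that arithmetic, the same 12-person example can be run through scikit-learn, which follows the row-as-actual convention (this snippet assumes scikit-learn is installed and is purely a sketch of the example above):

```python
from sklearn.metrics import confusion_matrix

# 1 = has cancer, 0 = cancer-free
y_true = [1] * 8 + [0] * 4                      # 8 actual positives, 4 actual negatives
y_pred = [1] * 6 + [0] * 2 + [1] * 1 + [0] * 3  # 2 cancer patients missed, 1 healthy person flagged

# Rows are actual classes, columns are predicted classes, labels in sorted
# order, so the layout is [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [2 6]]
```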

False positives (Type I errors): the cost of crying wolf

A false positive occurs when a model predicts positive but the actual outcome is negative. In machine learning, this is formally called a Type I error. In spam detection, it's flagging a legitimate email as spam. In fraud detection, it's blocking a valid transaction.

In A/B testing, the framing is slightly different but structurally identical. GrowthBook's documentation defines a Type I error as a situation where "your metrics all appear to be winners, but in reality the experiment has no effect." The test fires a signal; the signal is wrong. Acting on that signal — shipping a feature, changing a product flow — means making a real decision based on noise.

False positives carry costs that depend entirely on context. In some domains, they're annoying but recoverable. In others, they trigger irreversible actions: unnecessary medical treatment, a blocked customer, a shipped feature that degrades the product.

False negatives (Type II errors): the cost of missing what's real

A false negative is the mirror image: the model predicts negative, but the actual outcome was positive. This is a Type II error. In medical screening, it's a missed diagnosis. In fraud detection, it's a fraudulent transaction that goes through undetected.

In A/B testing, a Type II error occurs when "the data aren't showing a clear winner or loser when actually a variation is much better or worse." The consequence is that teams either collect more data — extending an experiment that already has an answer — or make a blind decision without the signal they needed. Real improvements go undetected; real regressions go unaddressed.

The cost asymmetry between Type I and Type II errors is domain-specific and worth making explicit in any system you're evaluating. Missing a cancer diagnosis is not the same kind of failure as flagging a healthy patient. Missing a winning A/B test variant is not the same kind of failure as shipping a losing one.

Why no single cell in the confusion matrix tells you whether your model is good

No single cell in the confusion matrix tells you whether your model is good. The matrix has to be read as a whole, and the derived metrics — accuracy, precision, recall — each weight the four cells differently.

Accuracy is the most intuitive: (TP + TN) / all predictions. But it's also the most misleading on imbalanced datasets. A model that predicts "negative" for every single input on a dataset where positives appear only 1% of the time achieves 99% accuracy while being completely useless. It has a perfect true negative rate and a true positive rate of zero. As one practitioner put it, such a classifier "could be losslessly replaced by a rock."
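A quick numeric check (a toy dataset, invented purely for illustration) makes the problem visible:

```python
import numpy as np

# 1,000 cases, only 1% of them actual positives.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)  # the "rock": predict negative for everything

accuracy = (y_pred == y_true).mean()
tpr = (y_pred[y_true == 1] == 1).mean()  # sensitivity / recall

print(f"accuracy: {accuracy:.0%}")  # 99%
print(f"TPR:      {tpr:.0%}")       # 0%
```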

Precision (TP / (TP + FP)) tells you how often a positive prediction is correct — critical when false positives are expensive. Recall, also called sensitivity (TP / (TP + FN)), tells you how much of the actual positive class you're capturing — critical when false negatives are expensive. These metrics pull in opposite directions, and optimizing one typically costs you the other.

All of these metrics are also threshold-dependent. Change the classification threshold and every cell in the matrix shifts. The confusion matrix you compute at one threshold is not the confusion matrix you'd compute at another, which is why evaluating a classifier at a single operating point is rarely sufficient for understanding its real-world behavior.
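A short sketch with made-up scores and labels shows how moving the threshold trades precision against recall:

```python
import numpy as np

scores = np.array([0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.30, 0.10])
labels = np.array([1,    1,    1,    0,    1,    0,    0,    0])

def precision_recall(threshold):
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.70, 0.40):
    p, r = precision_recall(t)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
# threshold 0.70: precision 1.00, recall 0.75
# threshold 0.40: precision 0.67, recall 1.00
```

Lowering the threshold from 0.70 to 0.40 captures every real positive, but one in three positive predictions is now wrong.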

True positive rate: the metric that tells you how much of reality your system is actually catching

Understanding what a true positive is gets you halfway there. The more useful question for practitioners is: how do you measure how many real positives your system is actually catching? That's what True Positive Rate (TPR) answers — and it's a metric precise enough to calculate, compare, and optimize.

The TPR formula and what it actually measures

TPR is calculated as:

TPR = TP / (TP + FN)

The numerator is the count of true positives — cases where your model or test correctly identified a real positive. The denominator is the total universe of actual positive cases: every real positive that existed, whether your system caught it (TP) or missed it (FN). The result is a value between 0 and 1, where 1.0 means every real positive was detected and nothing slipped through.

You'll encounter this metric under different names depending on the field. In medicine and statistics, it's called sensitivity. In machine learning, it's called recall. Wikipedia's formal definition captures it cleanly: sensitivity is "the probability of a positive test result, conditioned on the individual truly being positive." All three terms refer to the same calculation.

One practical note: a high TPR tells you the test is sensitive — it catches most real positives. But it doesn't tell you that any specific positive result is correct. Here's why that matters: imagine a disease that affects 1 in 10,000 people. Even a test with 99% TPR will produce more false positives than true positives in that population, simply because there are so few actual cases to find.

The probability that a positive result is real depends on how common the condition is — a separate calculation called positive predictive value (PPV). Conflating TPR with PPV is a common mistake with real consequences.
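A worked example makes the gap concrete. The 1% false positive rate below is an assumption for illustration; only the 99% TPR and the 1-in-10,000 prevalence come from the example above:

```python
prevalence = 1 / 10_000
tpr = 0.99   # sensitivity: P(test positive | condition present)
fpr = 0.01   # assumed:     P(test positive | condition absent)

true_positive_share  = tpr * prevalence
false_positive_share = fpr * (1 - prevalence)
ppv = true_positive_share / (true_positive_share + false_positive_share)

print(f"PPV: {ppv:.1%}")  # ~1.0%: roughly 99 of every 100 positive results are false alarms
```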

What high and low TPR signals about your system

A high TPR means your model is catching most of the real positive cases and producing few false negatives. A low TPR means real positives are routinely slipping through — your system is missing what it's supposed to find.

The stakes of a low TPR vary dramatically by domain. In healthcare, a missed diagnosis is a false negative with potentially irreversible consequences. In fraud detection, a missed fraudulent transaction carries direct financial and reputational costs. These are the domains where TPR is typically the primary metric to optimize, because the cost of a false negative far outweighs the cost of a false positive.

In A/B testing, statistical power functions as the domain-specific analog to TPR — it represents the probability that a real effect will be detected, which is precisely what TPR measures in classification contexts.

That said, a high TPR is not universally the right target. Pushing TPR toward 1.0 typically requires lowering the classification threshold, which increases false positives. Whether that trade-off is acceptable depends entirely on the cost structure of your specific problem — a point the next section addresses directly.

TPR and the ROC curve

The ROC (Receiver Operating Characteristic) curve is the standard tool for visualizing how TPR behaves across different classification thresholds. It plots TPR on the y-axis against the false positive rate (FPR) on the x-axis. Each point on the curve represents a different threshold setting.

As you lower the classification threshold, more cases get flagged as positive. TPR rises — you catch more real positives — but FPR rises too, because more negatives get incorrectly flagged. Raise the threshold and the reverse happens: fewer false positives, but more real positives missed. The ROC curve makes this trade-off visible across the full range of possible thresholds rather than at a single fixed point.

A classifier with a curve that hugs the top-left corner of the plot is performing well: it achieves high TPR at low FPR. A curve that runs diagonally from bottom-left to top-right is no better than random guessing. The shape of the curve tells you how much flexibility you have in setting a threshold before sensitivity degrades meaningfully.
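A threshold sweep over synthetic scores (made up here just to show the mechanics) traces out exactly this behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(500), np.zeros(500)])
# Positives score higher on average, but the two distributions overlap.
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500)])

for threshold in (1.5, 1.0, 0.5, 0.0, -0.5):
    pred = scores >= threshold
    tpr = np.sum(pred & (labels == 1)) / np.sum(labels == 1)
    fpr = np.sum(pred & (labels == 0)) / np.sum(labels == 0)
    print(f"threshold {threshold:+.1f}: TPR {tpr:.2f}, FPR {fpr:.2f}")
# Lower thresholds raise TPR and FPR together; plotting every (FPR, TPR)
# pair across the full sweep is the ROC curve.
```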

The sensitivity-specificity trade-off: why maximizing true positives isn't always the goal

There's an intuitive appeal to the idea that a good classifier should catch as many true positives as possible. In practice, that instinct leads teams astray. Maximizing sensitivity — your true positive rate — always comes at a cost, and understanding that cost is what separates a well-calibrated system from one that creates as many problems as it solves.

The fundamental trade-off: why you can't maximize both

Sensitivity and specificity move in opposite directions. As Wikipedia's treatment of the topic states directly: "higher sensitivities will mean lower specificities and vice versa".

The mechanism is straightforward. When you lower a classification threshold to catch more true positives, you inevitably sweep in more negatives along with them. More true positives means more false positives — which means lower specificity. There's no configuration that escapes this relationship. The question is never whether to accept this trade-off, but where to set it.

When high sensitivity is the right call

The case for prioritizing sensitivity is strongest when a missed positive carries severe consequences. Wikipedia's criterion is precise: prioritize sensitivity "when the consequence of failing to treat the condition is serious and/or the treatment is very effective and has minimal side effects."

Cancer screening is the canonical example. A false negative — a test that misses a malignancy — means a patient goes untreated while the disease progresses. The downstream cost of that miss dwarfs the cost of a false positive, which typically means an additional confirmatory test. The asymmetry is clear: one error is inconvenient, the other can be fatal.

Fraud detection follows similar logic. Missing a fraudulent transaction carries real financial and reputational consequences for a business, while a false positive — a legitimate transaction flagged incorrectly — creates customer friction but is recoverable. In domains like these, false negatives are the more expensive error, and systems should be tuned accordingly.

When specificity deserves priority

The calculus flips when false positives carry their own serious costs. Wikipedia identifies the relevant condition: specificity matters most "when people who are identified as having a condition may be subjected to more testing, expense, stigma, anxiety, etc."

Confirmatory diagnostic testing is the clearest case. A false positive diagnosis can trigger unnecessary treatment, psychological harm, or lasting stigma — costs that are real and sometimes irreversible. The initial screening test can afford to be sensitive; the confirmatory test needs to be specific.

The same principle applies in A/B testing. A false positive in an experiment means declaring a winning variant when no real effect exists, then shipping a change that doesn't actually improve the product. The cost here isn't just a wasted engineering cycle — it's the compounding effect of making product decisions on noise.

Threshold selection is a business decision, not a statistical one

The classification threshold determines where on the sensitivity-specificity trade-off curve your system operates. There is no universally correct setting. The right threshold depends entirely on the relative cost of each error type in your specific context, and that cost calculation belongs to the domain, not the algorithm.

This becomes especially concrete in A/B testing when multiple metrics are evaluated simultaneously. GrowthBook's documentation on experimentation pitfalls notes that testing the same hypothesis across 20 metrics at a 5% significance level produces roughly a 64% probability of finding at least one statistically significant result by chance alone.

That's what happens when sensitivity is implicitly maximized without any mechanism to control specificity. Correction methods like Benjamini-Hochberg and Bonferroni deliberately sacrifice some sensitivity — accepting that a few real effects may go undetected — in order to reduce the rate of false discoveries.
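The arithmetic behind that 64% figure is short enough to verify directly, and the same few lines can sketch how the two corrections set their thresholds (the p-values below are invented for illustration):

```python
alpha, m = 0.05, 20

# Probability of at least one false positive across 20 independent metrics.
print(f"{1 - (1 - alpha) ** m:.0%}")        # ~64%

# Bonferroni: test each metric at alpha / m.
print(f"{alpha / m:.4f}")                   # 0.0025

# Benjamini-Hochberg: find the largest rank i with p_(i) <= (i / m) * alpha,
# then call everything up to that rank a discovery.
p_values = sorted([0.001, 0.004, 0.012, 0.030, 0.045] + [0.2] * 15)
cutoff = 0
for i, p in enumerate(p_values, start=1):
    if p <= (i / m) * alpha:
        cutoff = i
print(p_values[:cutoff])                    # [0.001, 0.004]
```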

That trade-off isn't a flaw in the methodology. It's the methodology working as intended, calibrated to the cost structure of the problem. The teams that get this right aren't the ones chasing the highest possible true positive rate — they're the ones who have thought clearly about what a false positive actually costs them.

True positives in A/B testing: correctly detecting real effects and avoiding false discoveries

In A/B testing, a true positive has a specific meaning: your experiment correctly identifies a variation that genuinely improves a metric, and you ship it. The test said it won. It actually won. That alignment between the statistical signal and the underlying reality is exactly what experimentation is designed to produce — and it's rarer than most teams assume.

According to GrowthBook's documentation, only about one-third of experiments successfully improve the metrics they're designed to move. Another third have no effect, and the remaining third actually hurt performance. That distribution means the majority of experiments you run will not produce true positives. In that environment, correctly identifying the real winners matters enormously — and so does avoiding the false ones.

Where true positives sit in the A/B testing decision space

The full A/B testing decision space maps directly onto the confusion matrix framework. When you decide to ship a variation and it actually won, that's a correct inference — a true positive. Every other combination where "ship" and "actual outcome" don't align is an error: shipping a variation that had no real effect is a Type I error (a false positive), and shutting down one that actually would have won is a Type II error (a false negative). Both are failures of the classification system.

The default significance threshold in most experimentation platforms is 95% confidence. That means even in a perfectly designed experiment with no methodological problems, you'll incorrectly flag a result as significant 5% of the time. As GrowthBook's documentation puts it plainly: "5% of the time, it isn't actually better." That baseline false positive risk is unavoidable — but several common practices make it dramatically worse.

The peeking problem and multiple testing

The peeking problem occurs when teams monitor experiment results continuously and stop a test the moment results look promising. This practice inflates the false positive rate substantially. Because statistical significance fluctuates throughout a test's runtime, repeatedly checking results and stopping early when p < 0.05 appears means you're much more likely to catch a random fluctuation than a real effect.

Sequential testing addresses this by allowing continuous monitoring and early stopping without inflating false positive rates — the statistical method adjusts for the repeated looks so the error rate stays controlled.
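A small simulation (synthetic data and a plain t-test, not GrowthBook's sequential method) shows how much damage peeking does on an experiment with no real effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_batches, batch_size = 2_000, 20, 100

peeking_hits = final_hits = 0
for _ in range(n_experiments):
    # Both variations draw from the same distribution: any "win" is a false positive.
    a = rng.normal(0, 1, n_batches * batch_size)
    b = rng.normal(0, 1, n_batches * batch_size)

    # Peeking: check after every batch, stop as soon as p < 0.05.
    for k in range(1, n_batches + 1):
        n = k * batch_size
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            peeking_hits += 1
            break

    # Discipline: a single look at the end.
    final_hits += stats.ttest_ind(a, b).pvalue < 0.05

print(f"peek every batch:  {peeking_hits / n_experiments:.1%}")  # well above 5%
print(f"single final look: {final_hits / n_experiments:.1%}")    # close to 5%
```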

Multiple testing compounds the problem in a different way. When you test many metrics simultaneously, the probability that at least one will appear significant by chance increases with each metric added. Google's famous "41 shades of blue" experiment illustrates how cascade testing — running A vs. B, then B vs. C, and so on — can produce mathematically invalid conclusions when not handled correctly.

GrowthBook's documentation notes that adding many metrics to any test increases false positive risk, even in a correctly configured system. The solution is pre-registering your primary metric before the experiment runs, not selecting the most flattering result afterward.

P-hacking is the logical extension of this: continuing to slice data, add metrics, or adjust segments until statistical significance appears, then reporting only the significant finding. It's not always intentional, but the effect is the same — the result looks like a true positive and isn't.

Underpowered tests don't inflate false positives — they suppress true positives

Underpowered tests create the opposite problem: instead of inflating false positives, they suppress true positives. If your experiment doesn't have enough users to detect the effect size you're looking for, real improvements will fail to reach significance. GrowthBook's documentation defines this through the concept of Minimal Detectable Effect (MDE): if the actual effect is smaller than the MDE given your sample size, the test cannot detect it even when it genuinely exists.

The result is a false negative — a real improvement that never gets identified as a winner.
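A back-of-the-envelope power calculation makes the MDE concept tangible. The baseline conversion rate and sample size below are assumptions chosen for illustration, using the standard normal approximation for comparing two proportions:

```python
from scipy.stats import norm

baseline = 0.10           # 10% baseline conversion rate (assumed)
n_per_variation = 20_000  # users per variation (assumed)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_power = norm.ppf(power)
se = (2 * baseline * (1 - baseline) / n_per_variation) ** 0.5

mde = (z_alpha + z_power) * se
print(f"MDE: {mde:.4f} absolute, {mde / baseline:.1%} relative")
# Roughly a 0.84 percentage-point (8.4% relative) lift; anything smaller than
# this will usually fail to reach significance at this sample size.
```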

Variance reduction techniques directly address this. CUPED reduces variance by accounting for pre-experiment user behavior, which means experiments can reach statistical significance with fewer users. That speed matters because it reduces the temptation to peek early.

A/A testing eliminates infrastructure failures before they corrupt your true positive rate

Before you can trust that your A/B test results represent true positives, you need to confirm that your experimentation infrastructure isn't generating spurious signals on its own. A/A testing — running an experiment where both variations are identical — is the standard method for this validation. If an A/A test returns statistically significant results across multiple metrics, the system itself may be broken: traffic isn't splitting correctly, metrics are misconfigured, or the SDK integration has an error.

GrowthBook recommends running A/A tests after setting up a new SDK connection and after any significant changes to your integration, data warehouse, or tracking libraries. One important calibration note: even a correctly configured A/A test may show one or two marginally significant metrics due to random chance at the 5% threshold. That's expected. What's alarming is three or four metrics all showing significance above 99% — that's a signal the system is broken, not just unlucky.
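The "expected vs. alarming" line can be checked with a quick binomial calculation; the 20-metric count here is an assumption for illustration:

```python
from scipy.stats import binom

m = 20  # metrics tracked in the A/A test (assumed)

# At the 5% threshold, one or two flagged metrics are routine...
print(f"{1 - binom.cdf(0, m, 0.05):.0%}")  # P(at least 1 significant) ~64%
print(f"{1 - binom.cdf(1, m, 0.05):.0%}")  # P(at least 2 significant) ~26%

# ...but three or more metrics clearing 99% confidence by pure chance is rare.
print(f"{1 - binom.cdf(2, m, 0.01):.2%}")  # P(at least 3 at p < 0.01) ~0.1%
```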

A/A testing doesn't guarantee that your subsequent A/B results are true positives. But it eliminates a category of infrastructure-level failures that would make true positive detection impossible from the start.

Putting it together: calibrating your system to the actual cost of each error

The core insight running through this entire article is simple but easy to miss: a true positive only tells you something useful when you understand it in context. The count of things your system correctly identified means nothing without knowing how many real positives it missed, how many false alarms it generated, and what each of those errors actually costs you.

That four-part frame — not just the single number — is what separates a well-calibrated system from one that looks good on the surface and fails in practice.

Match your sensitivity-specificity balance to the cost of each error type

Before you tune a threshold or evaluate a metric, make the cost asymmetry explicit. In fraud detection, a missed fraudulent transaction is typically more expensive than a blocked legitimate one. In A/B testing, shipping a feature that had no real effect compounds quietly over time in ways a missed winner usually doesn't.

Neither of those cost structures is universal — they're specific to your domain, your users, and your business. The threshold decision follows from that analysis, not the other way around.

The confusion matrix and TPR reveal what accuracy alone conceals

If you're currently evaluating model performance with accuracy alone, the confusion matrix is the right place to start. Pull the full 2×2 breakdown, compute TPR and precision separately, and look at where your errors are concentrated. A model with 95% accuracy on an imbalanced dataset may have a TPR near zero — catching nothing that actually matters. The matrix makes that visible in a way a single aggregate metric never will.

Protect true positive rates in experimentation with rigorous statistical practices

In A/B testing, the practices that inflate false positives — peeking, running too many metrics, post-hoc segmentation — are the same ones that make your true positives harder to trust. Pre-register your primary metric, run an A/A test to validate your infrastructure, and size your experiments to detect the effect you actually care about. Sequential testing and variance reduction techniques are specifically designed to address these failure modes without forcing you to choose between speed and statistical integrity.

What to do next:

  • If you're evaluating a classifier: pull the full confusion matrix, compute TPR and precision separately, and compare performance at multiple thresholds before picking an operating point.
  • If you're running A/B tests: pre-register your primary metric, run an A/A test to validate your infrastructure, and calculate your MDE before launching.
  • If you're setting a classification threshold: write down the cost of a false positive and the cost of a false negative in your specific context before touching the threshold value.

One tension worth keeping in mind: when you add controls to reduce false positives — stricter significance thresholds, multiple testing corrections, higher sample size requirements — you will sometimes fail to detect real effects. That's not a flaw in the methodology. It's the trade-off working as intended. The goal isn't to catch every true positive at any cost. It's to build a system where the errors you make are the ones you've consciously decided are cheaper than the alternative.
