Type I vs Type II error: key differences with examples

Type I and Type II errors are easy to memorize and easy to misuse. The useful question is not which definition is which. It is which mistake your team is more willing to risk.
In product experimentation, every test can be wrong in two directions. You can ship a change that does not actually work. Or you can miss a change that would have helped. Both are statistical errors, but they create different business costs.
This guide explains the difference between Type I and Type II errors, how alpha, beta, and power connect, and how product teams should choose the right tradeoff for A/B testing.
Type I vs Type II error at a glance
The Penn State STAT 500 guide defines Type I error as rejecting the null hypothesis when the null is true and Type II error as failing to reject the null when the alternative is true. Penn State's basic statistical concepts guide gives the same short version: Type I rejects a true null; Type II does not reject a false null.
For product teams, that language becomes clearer when translated into experiments.
- Type I error: the dashboard says "ship it," but the product change did not really help.
- Type II error: the dashboard says "inconclusive," but the product change really did help.
Type I error: the false positive
A Type I error happens when the test finds an effect that is not real.
A/B testing example
Imagine a SaaS team testing a new activation checklist. The null hypothesis says the new checklist has no effect on seven-day activation. The test reports a statistically significant 3% relative lift, and the team ships the variant.
If the true effect is zero and the observed lift was random noise, the team made a Type I error.
The visible mistake is shipping a neutral change. The deeper mistake is learning the wrong lesson. The team may now believe that checklists like this one improve activation and carry that belief into future roadmap decisions.
What alpha means
Alpha, written α, is the Type I error rate chosen before the test. If a fixed-horizon test uses α = 0.05, then under the null hypothesis and the test assumptions, the procedure rejects the null about 5% of the time over repeated use.
That does not mean the specific winning result has exactly a 5% chance of being false. It means the rule has a long-run false positive rate under the null.
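One way to see the long-run reading of alpha is to simulate many A/A tests, where both arms share the same true conversion rate, and count how often a fixed-horizon test rejects the null. The sketch below is illustrative only: the 20% conversion rate, the 500 users per arm, and the pooled two-proportion z-test are assumptions chosen for the example, not settings taken from any particular tool.

```python
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no variation at all, so nothing to reject
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positive_rate(rate=0.2, n_per_arm=500, runs=2000,
                                 alpha=0.05, seed=7):
    """Share of A/A tests (true effect is zero) that still reject the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        # Both arms draw from the same conversion rate: the null is true.
        conv_a = sum(rng.random() < rate for _ in range(n_per_arm))
        conv_b = sum(rng.random() < rate for _ in range(n_per_arm))
        if two_proportion_p_value(conv_a, n_per_arm, conv_b, n_per_arm) < alpha:
            rejections += 1
    return rejections / runs


fp_rate = simulate_false_positive_rate()
print(f"Share of A/A tests flagged as significant: {fp_rate:.3f}")
```

The printed share should land near 0.05. That is a property of the decision rule over many tests, not a statement about any single winning result.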
The American Statistical Association statement on p-values is useful because it warns that statistical significance does not measure effect size, practical importance, or the full strength of evidence. A significant result can still be too small to ship, badly instrumented, or one of many lucky findings.
What increases Type I error risk
Type I errors are more likely when teams change decision rules after looking at the data.
Common causes:
- Peeking at a fixed-horizon test and stopping early when it looks good.
- Testing many metrics and celebrating the one that moved.
- Slicing by segment after launch until one group looks significant.
- Changing eligibility or traffic allocation without accounting for it.
- Treating exploratory analysis as confirmatory proof.
GrowthBook's documentation on experimentation problems calls out peeking as a false-positive risk. GrowthBook's A/A testing docs show the same issue from another angle: when variants are identical, significant differences are false positives, and more unrelated metrics increase the chance of seeing at least one.
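The multiple-metrics point can be made concrete with a short calculation. The sketch below assumes the metrics are independent and that every null hypothesis is true, which is a simplification: real product metrics are usually correlated, so treat the numbers as a rough guide rather than an exact rate.

```python
def chance_of_any_false_positive(num_metrics, alpha=0.05):
    """Probability that at least one of several independent null metrics
    crosses the significance threshold by chance."""
    return 1 - (1 - alpha) ** num_metrics


for k in (1, 5, 10, 20):
    print(f"{k:>2} metrics: {chance_of_any_false_positive(k):.3f}")
# Under these assumptions: 1 -> 0.050, 5 -> 0.226, 10 -> 0.401, 20 -> 0.642
```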
Type II error: the false negative
A Type II error happens when the test fails to detect an effect that is real.
A/B testing example
Now imagine the same SaaS team tests a subtle onboarding improvement. The real treatment effect is a 1.5% activation lift. That lift would be valuable because many new accounts go through onboarding.
But the experiment was designed to detect only a 5% lift. It runs for two weeks, ends inconclusive, and the team discards the change.
That is a Type II error. The test missed a real improvement because the design lacked enough power to detect the effect the team cared about.
What beta and power mean
Beta, written β, is the probability of a Type II error for a specific alternative effect size. Statistical power is 1 - β: the probability that the test detects an effect when that effect is real.
Power depends on several inputs:
- Sample size.
- Baseline conversion or metric variance.
- Minimum detectable effect.
- Significance threshold.
- Test design and allocation.
The key point is that power is effect-size-specific. A test can have high power to detect a 10% lift and low power to detect a 1% lift. That distinction matters in mature products, where many valuable improvements are small.
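A rough power calculation shows how much power depends on effect size. This is a minimal sketch using the normal approximation for a two-sided two-proportion z-test. The 20% baseline and the 20,000 users per arm are hypothetical values chosen for illustration.

```python
from statistics import NormalDist


def power_two_proportions(baseline, relative_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    se = (p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se  # true effect measured in standard errors
    # Chance the test statistic lands beyond either critical value.
    return (1 - NormalDist().cdf(z_crit - shift)) + NormalDist().cdf(-z_crit - shift)


high_power = power_two_proportions(baseline=0.20, relative_lift=0.10, n_per_arm=20_000)
low_power = power_two_proportions(baseline=0.20, relative_lift=0.01, n_per_arm=20_000)
print(f"Power for a 10% relative lift: {high_power:.2f}")
print(f"Power for a 1% relative lift:  {low_power:.2f}")
```

With the same traffic, the test is very likely to detect the large lift and very unlikely to detect the small one.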
What increases Type II error risk
Type II errors are common when teams run underpowered tests.
Common causes:
- Too little traffic.
- Short runtime.
- Noisy metrics.
- Small expected effect size.
- Poor experiment targeting.
- Overly strict significance threshold for a low-risk decision.
- Using a metric too far away from the product change.
If a confidence interval is wide and crosses zero, the test is not proving that there is no effect. It may only be saying that the design cannot distinguish a meaningful win from a meaningful loss.
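A small example makes the wide-interval case concrete. The sketch below computes a simple Wald interval for the absolute difference in conversion rates. The counts and the 3-point "meaningful effect" are hypothetical, and production tools may use a different interval method.

```python
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


meaningful_effect = 0.03  # hypothetical: 3 percentage points would matter
low, high = diff_confidence_interval(conv_a=200, n_a=1000, conv_b=215, n_b=1000)
print(f"95% interval for the lift: [{low:+.3f}, {high:+.3f}]")
print("Includes zero:", low < 0 < high)
print("Includes a meaningful win:", high >= meaningful_effect)
```

Here the interval includes zero and also includes a meaningful win, so the data are consistent with both "no effect" and "an effect worth shipping." That is an underpowered test, not evidence of no effect.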
The tradeoff between alpha and beta
You usually cannot minimize Type I and Type II errors at the same time without adding more information.
Penn State's STAT 500 guide states the tradeoff directly: as alpha decreases, beta increases. That holds when the sample size and the true effect stay fixed. In product terms, a stricter false-positive policy makes it harder to detect real effects unless you increase sample size, reduce variance, or accept a larger minimum detectable effect.
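The tradeoff can be sketched numerically by holding the design fixed and changing only alpha. The function below uses the normal approximation for a two-sided z-test. The `standardized_effect` value of 2.5 (the true effect divided by its standard error) is a hypothetical stand-in for a fixed sample size and effect.

```python
from statistics import NormalDist


def beta_for_alpha(alpha, standardized_effect):
    """Approximate Type II error rate of a two-sided z-test.

    standardized_effect is the true effect divided by its standard error,
    so it stays constant when sample size and effect size are fixed.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    power = (1 - NormalDist().cdf(z_crit - standardized_effect)
             + NormalDist().cdf(-z_crit - standardized_effect))
    return 1 - power


betas = {alpha: beta_for_alpha(alpha, standardized_effect=2.5)
         for alpha in (0.10, 0.05, 0.01)}
for alpha, beta in betas.items():
    print(f"alpha = {alpha:.2f} -> beta = {beta:.2f}")
```

Each step toward a stricter alpha raises beta for the same design, which is why tightening the threshold usually has to be paid for with more data or less noise.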
Lower alpha when false wins are costly
Reduce Type I error risk when a false positive would be expensive or hard to reverse.
Examples:
- Pricing and packaging changes.
- Checkout, billing, or account deletion flows.
- Trust-sensitive AI outputs.
- Security or privacy workflows.
- Infrastructure changes with user-facing impact.
In those cases, a false win can hurt users, revenue, trust, or operations. Require stronger evidence.
Increase power when false negatives are costly
Reduce Type II error risk when missing a real effect would slow learning or leave value on the table.
Examples:
- Low-risk activation copy.
- Onboarding improvements.
- Retention nudges.
- Search ranking refinements.
- Product education changes.
In these cases, the cost of shipping a false win may be low, while the cost of missing a real improvement may be high. You still need discipline, but you may decide to run longer, use variance reduction, target a more relevant population, or lower the minimum detectable effect.
Add information when both errors matter
When both error types are costly, the answer is not just "pick a threshold." Add better information.
That can mean:
- Run the experiment longer.
- Increase traffic allocation.
- Choose a lower-variance metric.
- Use pre-experiment covariates where appropriate.
- Narrow the eligible population.
- Use stronger guardrails.
- Run a follow-up confirmation test.
This is where experiment design beats dashboard interpretation. You reduce error risk before launch by designing a test that can answer the decision question.
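To see what "adding information" costs in traffic, a standard approximate sample-size formula for comparing two proportions can be sketched as follows. The 30% baseline, the effect sizes, and the default 80% power target are assumptions for illustration. A team's own planning tool may use a different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, absolute_mde, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + absolute_mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / absolute_mde ** 2)


n_large_effect = sample_size_per_arm(baseline=0.30, absolute_mde=0.03)
n_small_effect = sample_size_per_arm(baseline=0.30, absolute_mde=0.01)
print(f"Users per arm to detect a 3-point lift: {n_large_effect}")
print(f"Users per arm to detect a 1-point lift: {n_small_effect}")
```

Required sample size grows with the inverse square of the effect, so detecting an effect one third the size takes roughly nine times the traffic. Lower-variance metrics and pre-experiment covariates help because they shrink the variance term instead.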
How to choose the right error tradeoff
A useful experiment brief should include an error tradeoff, even if the team does not use statistical jargon.
Start with decision reversibility
Ask how easy the change is to reverse.
If the change is easy to turn off with a feature flag, low-risk, and limited in scope, the team can tolerate more uncertainty. If the change rewrites billing logic, affects every user, or creates migration work, the team should require stronger evidence.
Feature flags help because they make reversibility real. A flag-controlled rollout can limit blast radius while the experiment runs and make rollback faster if guardrails move.
Define the minimum practical effect
Statistical significance does not tell you whether an effect is worth shipping. Define that before launch.
Example: "We will ship if the treatment improves activation by at least 1 percentage point, does not harm paid conversion, and does not increase support tickets."
That decision rule protects against both error types. It prevents shipping tiny false wins and prevents dismissing results solely because they did not produce a dramatic lift.
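A pre-registered decision rule like the example above can be written down as code so the readout cannot quietly change it. This is a hypothetical sketch: the `Readout` fields and the `ship_decision` function are invented for illustration, and how a team decides that a guardrail was "harmed" is left to its own analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Readout:
    activation_lift_pp: float       # observed lift, in percentage points
    paid_conversion_harmed: bool    # guardrail verdict from the analysis
    support_tickets_increased: bool # guardrail verdict from the analysis


def ship_decision(readout, min_lift_pp=1.0):
    """Apply the rule agreed before launch. Returns (ship, blocking reasons)."""
    reasons = []
    if readout.activation_lift_pp < min_lift_pp:
        reasons.append("activation lift below the minimum practical effect")
    if readout.paid_conversion_harmed:
        reasons.append("paid conversion guardrail failed")
    if readout.support_tickets_increased:
        reasons.append("support ticket guardrail failed")
    return (not reasons, reasons)


ship, reasons = ship_decision(Readout(1.3, False, False))
print("Ship:", ship, "| Blocking reasons:", reasons)
```

In practice a team might compare the minimum practical effect against the interval around the lift rather than the point estimate alone. The point of the sketch is only that the threshold and guardrails are fixed before the data arrive.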
Keep exploratory findings in their lane
Exploratory analysis is useful for finding new ideas. It is risky when treated as proof.
If a segment looks promising after the test, write a follow-up hypothesis. Do not pretend the segment finding had the same error rate as the planned primary analysis.
How the errors show up in product experiments
The easiest way to keep the two errors straight is to map each one to the product decision that follows. A Type I error makes the team too confident in a change. A Type II error makes the team too quick to discard a change. Both can look reasonable in the moment because both come from incomplete evidence.
Example 1: false positive activation win
A team tests a new onboarding checklist. The result is statistically significant, but the team peeked daily and stopped as soon as the p-value dropped below 0.05. The result later fails to hold in production.
Likely error risk: Type I.
Better design: choose the stopping rule before launch, use a sequential method if monitoring continuously, and define the primary metric up front.
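The effect of daily peeking can be approximated with a small simulation. The sketch below assumes no true effect, equal-sized daily batches, and a test statistic that behaves like a running sum of standard normal increments. The 14 looks are a hypothetical two-week test. It compares "stop at the first significant look" with "check once at the end."

```python
import random
from statistics import NormalDist


def false_positive_rates(looks=14, runs=5000, alpha=0.05, seed=11):
    """Return (rate with daily peeking, rate with one final look) under the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(runs):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for day in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one day of null data
            z = running_sum / day ** 0.5        # cumulative z-statistic so far
            if abs(z) > z_crit:
                crossed_at_any_look = True      # a peeker would have stopped here
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z) > z_crit           # only the final look counts
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = false_positive_rates()
print(f"False positive rate with daily peeking: {peeking_rate:.3f}")
print(f"False positive rate with one final look: {fixed_rate:.3f}")
```

Under these assumptions the peeking rate comes out several times higher than the nominal 5%, while the single final look stays close to it. Sequential methods exist to keep the error rate controlled when continuous monitoring is required.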
Example 2: false negative retention change
A team tests a subtle lifecycle email change. Retention improves by 0.8 percentage points, but the test was powered only for a 3-point effect. The dashboard calls the result inconclusive, and the team drops the idea.
Likely error risk: Type II.
Better design: estimate the smallest practical effect before launch, run longer, or target the population where the effect should be strongest.
Example 3: good tradeoff with guardrails
A team tests a new account setup flow. It defines activation as the primary metric, paid conversion and support tickets as guardrails, and a minimum practical effect of 1 percentage point. The result shows a 1.3-point activation lift with guardrails stable.
The team still might be wrong, but the decision is stronger because the test matched the decision rule.
A/B testing checklist
Before launch, write:
- The null hypothesis.
- The expected direction of the effect.
- The primary metric.
- The minimum practical effect.
- The alpha or decision threshold.
- The target power or sample-size logic.
- Guardrail metrics.
- The stopping rule.
- Planned segments.
- What exploratory analysis can and cannot decide.
After launch, report:
- Effect size.
- Uncertainty range.
- Primary metric result.
- Guardrails.
- Sample ratio or assignment checks.
- Any instrumentation incidents.
- Exploratory findings clearly labeled as exploratory.
This checklist keeps the Type I and Type II tradeoff visible. It also makes experiment readouts easier to trust because the team can see which decisions were made before the data arrived.
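One item from the post-launch list, the sample ratio check, is easy to sketch. The function below runs a chi-square goodness-of-fit test on a two-arm split, using the fact that a chi-square statistic with one degree of freedom is a squared standard normal. The user counts and the 0.001 alert threshold are assumptions for illustration, not values from the source.

```python
from statistics import NormalDist


def srm_p_value(users_a, users_b, expected_share_a=0.5):
    """P-value for a sample ratio mismatch check on a two-arm experiment."""
    total = users_a + users_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((users_a - expected_a) ** 2 / expected_a
            + (users_b - expected_b) ** 2 / expected_b)
    # One degree of freedom: chi-square is the square of a standard normal.
    return 2 * (1 - NormalDist().cdf(chi2 ** 0.5))


SRM_THRESHOLD = 0.001  # assumed alert level; pick one before launch

for users_a, users_b in [(10_000, 10_100), (10_000, 10_600)]:
    p = srm_p_value(users_a, users_b)
    status = "investigate assignment" if p < SRM_THRESHOLD else "split looks fine"
    print(f"{users_a} vs {users_b}: p = {p:.5f} -> {status}")
```

A failed ratio check means the assignment or logging is suspect, so the effect estimate should not be trusted in either direction until the cause is found.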
What to do next
Do not ask whether Type I or Type II errors are "worse" in the abstract. Ask which mistake is more expensive for the decision in front of you.
If shipping a false win would create real harm, reduce Type I error risk. If missing a real improvement would slow learning, increase power and reduce Type II error risk. If both matter, improve the design before launch.
GrowthBook helps by connecting feature flags, experiments, guardrails, and metrics in one workflow. The product can support the process, but the tradeoff still belongs to the team. Decide the risk you are willing to take before the experiment starts.

