
Statistical Validity: What It Means in Research

A test hits statistical significance, the team ships the change, and then nothing moves.

No revenue lift. No engagement bump. Just a clean-looking result that turned out to mean nothing. That gap — between a result that looks valid and one that actually is — is what this article is about.

Statistical validity is not the same as getting a significant p-value. It's not the same as having consistent results, either. A measure can be perfectly reproducible and still be wrong in the same direction every time. Validity is about whether your conclusions accurately reflect what's actually happening in the world — and that depends on decisions made long before you run any analysis.

This guide is for engineers, product managers, and data teams who run experiments or work with research findings and want to understand why valid-looking results sometimes fail. Here's what you'll learn:

  • What statistical validity actually means and why it's different from reliability
  • The six types of validity — construct, internal, external, statistical conclusion, face, and criterion — and how each one can fail independently
  • The most common threats to validity, including the multiple testing problem, p-hacking, peeking, and confounding variables
  • How sample size, randomization, and measurement choices determine validity before data collection even begins
  • Why winning A/B test results don't always hold up in production, and what structurally sound experimentation looks like

Each section builds on the last, moving from the core definition through the framework, the failure modes, the design decisions that prevent them, and finally the practical implications for running experiments that produce conclusions you can actually trust.

What statistical validity actually means (and why it's not the same as reliability)

Statistical validity is one of those terms that gets used frequently and understood imprecisely. Before examining how it breaks down into types, or how it gets threatened by bad study design, it's worth establishing exactly what the word means — because the most common misconception about validity is that it's just another word for consistency.

It isn't.

The core definition: accuracy of conclusions, not just reproducibility

Statistical validity is the extent to which the conclusions drawn from a statistical test are accurate and reflective of the true effect found in nature. More precisely, it concerns whether a relationship between variables actually exists and whether the analyses conducted can accurately detect it.

Wikipedia frames it this way: validity is "the main extent to which a concept, conclusion, or measurement is well-founded and likely corresponds accurately to the real world". Notice the word likely. Statistical validity is an inductive, probabilistic claim — it can be stronger or weaker, but it is never certain. This distinguishes it from logical validity, where a valid argument is necessarily truth-preserving. In statistics, you're always making a claim about correspondence to reality, and that claim is always qualified.

Two practical prerequisites follow from this definition. First, you need sufficient data — enough observations to detect the effect you're looking for without being overwhelmed by noise. Second, you need to choose the right statistical approach for the question you're asking. Neither condition alone is sufficient. A massive dataset analyzed with the wrong method, or the right method applied to a sample too small to be informative, both produce conclusions that fail the validity test.

Why consistent results can still be wrong

Here's where the reliability distinction becomes critical. Imagine a scale that consistently reads five pounds heavier than the actual weight of whatever you place on it. Every measurement is perfectly reproducible. The scale is, in that narrow sense, reliable. But every conclusion you draw from it — about whether a package meets shipping weight limits, about whether a patient's treatment is working — is wrong. Reliable, but not valid.

This isn't just a thought experiment. The FORRT Glossary explicitly lists "reliability of measures" as a threat to statistical validity, not a proxy for it. A reliable but invalid measure consistently produces the wrong answer, and consistency just means you're wrong in the same direction every time.

The implicit question many researchers and product teams carry into this topic is: if my results replicate, isn't that enough? The answer is no. Replication confirms that your measurement process is consistent. It says nothing about whether you're measuring the right thing, whether your method is appropriate for your question, or whether your conclusions correspond to anything real. Validity requires all three.

Invalid conclusions cost more than the study you didn't run correctly

The stakes are direct. Invalid conclusions lead to wrong decisions, wasted resources, and — in fields like healthcare or financial modeling — potentially harmful actions, even when the underlying data appears clean and the analysis looks professional.

More specifically, establishing statistical validity gives you four practical things: confidence that your results can be accepted rather than second-guessed, a higher probability that your findings will hold up when others try to reproduce them, assurance that your analytical method is actually suited to its intended purpose, and the ability to optimize your study design before you collect data rather than scrambling to salvage it afterward.

Across product development, clinical research, and any data-driven field, the cost of acting on invalid conclusions is asymmetric — the damage often exceeds what would have been lost by running a better-designed study in the first place.

Validity is a collection of evidence, not a single test

Validity is not a single test you run at the end of an analysis. It's a collection of decisions — which methods you chose, how you built your sample, what you measured, and how you ran the analysis — that either earn or erode confidence that your conclusions reflect something real. Get one of those decisions wrong, and the rest of the analysis can be technically correct and still produce a wrong answer.

That evidence spans multiple dimensions. Construct validity asks whether you're measuring what you think you're measuring. Internal validity asks whether your study design supports causal inference. External validity asks whether your findings generalize beyond your sample. Statistical conclusion validity asks whether your methods were appropriate and your inferences sound. Each dimension is a separate line of evidence, and weakness in any one of them can invalidate conclusions that look strong everywhere else. The following section examines each of these types in detail.

The six types of statistical validity researchers need to know

Statistical validity is not a single dial you turn up or down. It's a multidimensional framework, and a study can score well on one dimension while failing completely on another. Treating validity as a binary pass/fail property is one of the most common mistakes in applied research — and it's exactly why studies that look rigorous on the surface produce conclusions that don't hold up. Understanding the distinct types of validity means understanding the distinct ways a study can go wrong.

Construct validity: are you measuring what you think you're measuring?

Construct validity is the most foundational type, and if it fails, nothing else matters. It asks whether your measurement instrument actually captures the theoretical construct it's supposed to represent. A survey designed to measure "customer satisfaction" may in practice be measuring "ease of checkout" — a related but distinct concept. A conversion metric in an A/B test may be measuring short-term clicks rather than the long-term engagement the team actually cares about.

The failure mode here is subtle: the data can be clean, the statistics can be correct, and the conclusions can follow logically from the numbers — and yet the entire study is answering the wrong question. Every subsequent type of validity rests on construct validity being intact first.

Internal validity: did your intervention actually cause the outcome?

Internal validity concerns whether the changes you observed were actually caused by the variables you were testing, or whether something else is responsible. High internal validity means the study is well-controlled, free from confounding factors, and designed to isolate cause and effect. The primary mechanism for achieving this is random assignment of treatments — which is why randomized controlled experiments are the gold standard for causal inference.

When internal validity fails, observed effects may be artifacts of study design rather than real relationships. A metric that improves during an experiment might be improving because of a seasonal trend, a simultaneous product change, or a biased assignment process — not because of the intervention itself.

External validity: do your findings generalize beyond the study?

A study can have strong internal validity and still be useless if its findings don't transfer to the real world. External validity asks whether causal relationships found in a study hold across different populations, settings, time periods, and measurement conditions. A highly controlled lab experiment may isolate causation perfectly while producing results that never replicate in production environments where conditions are messier and users are more varied.

This is a particularly common failure mode in product experimentation, where tests run on early adopters or power users produce results that don't generalize to the broader user base.

Statistical conclusion validity: did you use the right methods and reach the right inference?

Statistical conclusion validity asks whether the statistical methods chosen were appropriate and whether the conclusions drawn about the relationship between variables are actually correct. This type is specifically concerned with two kinds of mistakes: concluding that something worked when it didn't (a false positive, or Type I error), and concluding that something had no effect when it actually did (a false negative, or Type II error). Both are costly — false positives lead to shipping changes that don't actually help users; false negatives mean abandoning ideas that would have.

Power analysis is the primary tool for protecting statistical conclusion validity — it ensures your sample size is adequate to detect a meaningful effect if one exists. Without sufficient power, a null result proves nothing.

The multiple testing problem is a direct and quantifiable failure of statistical conclusion validity. If you test the same hypothesis at a 5% significance level across 20 independent metrics, the probability of finding at least one statistically significant result by chance alone rises to approximately 64%. That's not a finding — it's noise. Platforms like GrowthBook address this directly by providing multiple comparison corrections (such as the Bonferroni correction and False Discovery Rate adjustment via the Benjamini-Hochberg procedure) and enforcing minimum data thresholds before conclusions can be drawn, which operationalizes statistical conclusion validity as a system-level safeguard rather than a researcher's afterthought.

Face validity: does the measure appear credible on its surface?

Face validity is the most informal type — it asks whether a measurement instrument appears, on its surface, to measure what it claims to measure. It's evaluated through expert review or stakeholder judgment rather than statistical testing. While it's the weakest form of validity evidence on its own, it matters in practice because measures that lack face validity are often rejected by the people whose behavior the study is trying to understand, which introduces its own distortions.

Criterion validity: does your measure correlate with an established standard?

Criterion validity asks whether your measure correlates appropriately with an established gold-standard measure of the same construct. It comes in two forms: concurrent validity, where the measure and the criterion are assessed at the same time, and predictive validity, where the measure is evaluated on how well it forecasts a future outcome. A new engagement metric, for example, has criterion validity if it correlates with established measures of retention or revenue in the expected direction.

Together, these six types form a complete diagnostic framework. A study that passes all six has earned its conclusions. A study that passes only one or two has a narrower claim to make than its authors may realize.

The mechanisms that turn statistically significant results into wrong answers

Statistical significance is not the same as statistical validity. A result can clear the p < 0.05 threshold and still be completely wrong — not because the math was done incorrectly, but because the conditions that make that math meaningful were violated before the analysis even began. The threats described below don't just weaken conclusions at the margins. They can manufacture false confidence, invert findings entirely, or produce results that replicate nowhere outside the original study. Knowing the mechanism behind each threat is the first step toward recognizing one when it appears in your own work.

The multiple testing problem

When you test a single hypothesis at a 5% significance level, you accept a 5% chance of a false positive. That's the deal. But when you test 20 independent metrics at the same threshold, the probability of finding at least one statistically significant result by chance alone rises to approximately 64%. You haven't learned anything — you've just run enough tests to win the lottery.
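That 64% figure is just arithmetic: the chance of at least one false positive across m independent tests at significance level alpha is 1 - (1 - alpha)^m. A minimal sketch in plain Python:

```python
# Probability of at least one false positive across m independent tests,
# each run at a 5% significance level: 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** m
    print(f"{m:>2} tests -> {p_any_false_positive:.0%} chance of at least one spurious 'win'")
# 20 tests -> roughly a 64% chance of at least one spurious 'win'
```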

This is the multiple testing problem, and it's endemic in digital experimentation, where analysts routinely track dozens of metrics per experiment. The situation is made worse by the fact that digital metrics are rarely independent: page views correlate with funnel starts, registrations correlate with purchase events. When metrics move together, the effective number of independent tests is lower than the raw count — but the direction of the bias still runs toward false positives.

Correction methods exist — the Bonferroni correction and False Discovery Rate procedures such as Benjamini-Hochberg — and some experimentation platforms apply them automatically. But the correction only works if you apply it. Analysts who report the one significant metric out of twenty without adjusting the threshold are presenting a false positive as a finding.
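As a rough illustration of what applying a correction looks like in practice, here is a sketch using statsmodels; the twenty p-values are made up for the example:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 20 metrics tracked in a single experiment.
p_values = [0.003, 0.04, 0.21, 0.049, 0.33, 0.61, 0.02, 0.74, 0.11, 0.95,
            0.18, 0.55, 0.07, 0.44, 0.86, 0.29, 0.66, 0.013, 0.38, 0.50]

print("uncorrected:", sum(p < 0.05 for p in p_values), "metrics look significant")

for method in ("bonferroni", "fdr_bh"):  # Bonferroni and Benjamini-Hochberg FDR
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: {sum(reject)} metrics remain significant after correction")
```

Bonferroni controls the chance of any false positive across the whole family of tests and is the more conservative option; Benjamini-Hochberg controls the expected proportion of false discoveries and typically leaves more results standing.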

P-hacking and the Texas Sharpshooter Fallacy

P-hacking is what happens when an analyst iterates — across metrics, time windows, or user subgroups — until a statistically significant result appears, then reports that result as if it were the original hypothesis. The name for this in informal logic is the Texas Sharpshooter Fallacy: you fire at the barn wall, then draw the target around the bullet holes.

The mechanism is the same as the multiple testing problem, but the tests are implicit rather than explicit. The analyst isn't running a declared battery of 20 tests — they're making a series of exploratory choices that collectively function as one. Each choice to slice the data differently, extend the date range, or exclude an outlier segment is an additional implicit test, and the significance threshold is never adjusted to account for them.

What makes p-hacking particularly difficult to address is that it happens unconsciously as often as deliberately. An analyst who genuinely believes they're exploring the data in good faith can still p-hack their way to a false conclusion.

Peeking — why stopping at the moment of significance invalidates results

Peeking is related to p-hacking but distinct from it. Where p-hacking involves manipulating what you measure, peeking involves manipulating when you stop. An analyst runs an experiment, checks results daily, and stops the test the moment it crosses the significance threshold — rather than running to a pre-specified sample size.

The problem is that p-values fluctuate over the course of an experiment. A test that will ultimately fail to reach significance may cross the threshold briefly in the middle of its run, then drift back. Stopping at that local minimum exploits natural variance and produces a result that looks valid but isn't.

Sequential testing methods exist specifically to address this — they allow valid early stopping by adjusting the significance threshold dynamically — but standard fixed-horizon tests are not designed to be checked repeatedly, and treating them as if they are inflates the false positive rate in ways that aren't visible in the final output.
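A quick way to see the size of the distortion is to simulate A/A tests, where there is no real effect by construction, and compare a single fixed-horizon analysis against stopping at the first significant interim look. This sketch uses NumPy and SciPy; the number of looks and the sample sizes are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations, n_per_arm, checks = 1000, 2000, 10
peeking_fp, fixed_fp = 0, 0

for _ in range(n_simulations):
    # A/A test: both arms come from the same distribution, so any "win" is a false positive.
    a = rng.normal(0, 1, n_per_arm)
    b = rng.normal(0, 1, n_per_arm)
    looks = np.linspace(n_per_arm // checks, n_per_arm, checks, dtype=int)
    p_values = [stats.ttest_ind(a[:n], b[:n]).pvalue for n in looks]
    peeking_fp += any(p < 0.05 for p in p_values)   # stop at the first significant look
    fixed_fp += p_values[-1] < 0.05                  # analyze once, at the full sample size

print(f"fixed-horizon false positive rate: {fixed_fp / n_simulations:.1%}")   # ~5%
print(f"peeking false positive rate:       {peeking_fp / n_simulations:.1%}") # noticeably inflated
```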

Confounding variables and Simpson's Paradox

Confounding occurs when a third variable influences both the thing you're testing and the outcome you're measuring, making it look like there's a relationship between them when the real driver is something else entirely. It's the central challenge of observational research, and it doesn't disappear in experiments unless randomization is done correctly. Non-random assignment creates groups that differ on unmeasured dimensions before any treatment is applied — which means any observed difference in outcomes is contaminated by pre-existing group differences.

The most vivid illustration of what confounding can do is the 1973 Berkeley graduate admissions case, which produced what is now the canonical example of Simpson's Paradox. Overall admission rates appeared to favor men (44%) over women (35%), suggesting potential discrimination. But when researchers broke the data down by department, women had higher admission rates in many individual departments — 77% versus 62% in one department, for instance.

The aggregate finding reversed at the departmental level because of a confounding variable: women disproportionately applied to more competitive departments with lower admission rates across the board. Accounting for that variable showed women actually had a slightly higher overall admission rate than men. Simpson's Paradox is the extreme case, but the underlying mechanism — a confounding variable that distorts the apparent relationship between two others — is present in far more mundane analyses. Any time groups are compared without accounting for how they differ on other relevant dimensions, the risk is real.
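The reversal is easy to reproduce with a few lines of pandas on synthetic data; the counts below are constructed for illustration and are not the actual Berkeley figures:

```python
import pandas as pd

# Synthetic admissions data built so the department-level and aggregate
# comparisons point in opposite directions.
df = pd.DataFrame({
    "department": ["Easy", "Easy", "Hard", "Hard"],
    "gender":     ["men", "women", "men", "women"],
    "applied":    [80, 20, 20, 80],
    "admitted":   [60, 18, 2, 16],
})

by_dept = df.assign(rate=df.admitted / df.applied)
print(by_dept)  # women's admission rate is higher in BOTH departments

overall = df.groupby("gender")[["applied", "admitted"]].sum()
overall["rate"] = overall.admitted / overall.applied
print(overall)  # ...yet men's aggregate rate is higher, because women mostly applied to "Hard"
```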

Inadequate sample size and regression to the mean

Small samples produce noisy estimates. That's not a design flaw — it's a mathematical property of sampling. But the practical consequence is that results from underpowered studies are more likely to reflect random variation than genuine effects, and they're more likely to produce extreme values that won't hold up on replication.

This connects directly to a well-documented statistical phenomenon called regression to the mean: if a small sample produces an unusually large effect, the next measurement of the same thing will almost always show a smaller one. The first result wasn't a discovery — it was a lucky draw from a noisy distribution.
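A small simulation makes the selection effect visible. Here every "experiment" measures a true effect of zero with noisy estimates; the ones that looked most impressive the first time collapse on re-measurement:

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, noise, n_experiments = 0.0, 1.0, 500

# Each experiment estimates the same true effect (zero) with sampling noise.
first_run = rng.normal(true_effect, noise, n_experiments)
second_run = rng.normal(true_effect, noise, n_experiments)

# Select the experiments whose first result looked most impressive...
top = first_run > np.quantile(first_run, 0.95)
print("mean of top 5% on first run :", round(first_run[top].mean(), 2))   # large, purely by luck
print("mean of same ones on re-run :", round(second_run[top].mean(), 2))  # regresses toward the true value (0)
```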

A useful heuristic from experimentation practice is Twyman's Law: any result that looks surprisingly large or interesting is more likely to reflect a data or implementation error than a genuine effect. Unusually dramatic findings deserve more scrutiny, not less, precisely because the prior probability of a genuine effect of that magnitude is low. An underpowered study that produces a striking result is not a discovery — it's a hypothesis that needs a properly sized test.

Validity is decided at the design stage, not the analysis stage

Most validity problems in research and experimentation are not analysis problems — they are design problems. By the time data collection is complete, the most consequential decisions affecting statistical validity have already been made. Sample size, randomization strategy, and measurement instrument choices either build validity in from the start or lock in failure modes that no amount of clever analysis can undo.

Sample size, statistical power, and margin of error

Sample size is not just a precision consideration — it is a validity consideration. A study that is underpowered cannot reliably detect real effects, which means its conclusions are unreliable regardless of how sophisticated the analysis is. The relationship is mathematical: larger samples reduce the margin of error and increase statistical power, the probability of detecting a true effect when one exists.

The industry standard for adequate power is 80%, which means that even a well-designed study has a one-in-five chance of missing a real effect of the size it was designed to detect. The practical implication is that sample size must be calculated before data collection begins, not adjusted after results come in. Research in clinical methodology makes this explicit — sample size calculation is part of the early stages of conducting a study, not a post-hoc correction. Two studies using identical methodology but different sample sizes can point researchers toward opposite clinical decisions, which illustrates that sample size is not a minor technical detail.
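For a concrete sense of what the calculation involves, here is a sketch using statsmodels; the 5% baseline conversion rate and the 5% to 5.5% minimum detectable lift are hypothetical inputs, not recommendations:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: 5% baseline conversion, and a lift to 5.5% is the
# smallest effect worth acting on (the minimum detectable effect).
effect = proportion_effectsize(0.055, 0.05)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
# Small lifts on small baselines need surprisingly large samples.
print(f"required sample size per arm: {n_per_arm:,.0f}")
```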

The time-dependence of power is worth understanding concretely. In an experiment accumulating roughly 2,195 users per week, power at Week 1 might be only 41%, meaning only effects of roughly 34.5% or larger are reliably detectable. By Week 3, the same experiment reaches 80% power for the target effect size. GrowthBook's power analysis tool surfaces exactly this kind of "power over time" projection before an experiment launches, allowing teams to commit to a runtime that gives the study a legitimate chance of producing valid conclusions.
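A rough version of that projection can be sketched by recomputing power as the sample accumulates. The weekly traffic, baseline rate, and target lift below are illustrative assumptions chosen to echo the shape of the example above, not GrowthBook's actual inputs:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative assumptions: ~2,200 users per week split evenly across two arms,
# a 5% baseline conversion rate, and a lift to 6.6% as the target effect.
users_per_week = 2200
effect = proportion_effectsize(0.066, 0.05)
analysis = NormalIndPower()

for week in range(1, 7):
    n_per_arm = users_per_week * week / 2
    power = analysis.power(effect_size=effect, nobs1=n_per_arm, alpha=0.05)
    # Under these assumptions, power climbs toward 80% around week 3 or 4.
    print(f"week {week}: power = {power:.0%}")
```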

Randomization requirements — why large samples can still fail

Sample size alone does not guarantee validity. The 1936 Literary Digest presidential poll collected responses from approximately 2.3 million people and still produced the wrong result — because the sample was not representative of the voting population. The large size did not guarantee correctness; the non-representative sampling method invalidated the conclusions entirely.

Proper randomization — ensuring every member of the target population has an equal probability of being selected — is what makes a sample representative and makes conclusions generalizable. Without it, even massive datasets produce biased estimates. This is why randomization is a prerequisite for validity, not a methodological nicety.
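In experimentation systems, unbiased assignment is commonly implemented with deterministic hashing, which gives every user an approximately equal chance of landing in each variant while keeping assignment stable across sessions. A simplified sketch of the general technique, not any particular platform's implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant with roughly equal probability."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # hash output is ~uniform, so buckets stay balanced
    return variants[bucket]

# The same user always lands in the same bucket for a given experiment,
# and assignment is independent of any user attribute.
print(assign_variant("user-123", "checkout-redesign"))
```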

Measurement accuracy and instrument choices

What you measure, how you define it, and over what time window you observe it are all part of the measurement instrument — and all affect validity. The standards for what constitutes a meaningful difference are highly contextual: a 10% difference between groups might be negligible for a breakfast cereal marketing campaign and clinically decisive for a breast cancer treatment. Choosing a metric that does not align with the actual research question produces statistically significant results that answer the wrong question.

Metric definitions also feed directly into power calculations. A conversion window of 72 hours versus 7 days, for example, changes the variance of the metric and therefore the sample size required to achieve adequate power. Treating metric definitions as interchangeable or adjustable after data collection introduces the same validity risks as any other post-hoc decision.

The dangers of post-hoc analysis

Post-hoc decisions — changing the primary metric after peeking at results, extending a study's runtime because the numbers are close, or selecting the analysis window based on what looks significant — are forms of p-hacking that inflate false positive rates even when the underlying data is clean. The mechanism is straightforward: if you stop a test at the moment it crosses a significance threshold rather than at a pre-calculated sample size, you are effectively selecting for a streak of positive results in one branch, not detecting a real effect.

As one practitioner who analyzed a real-world case of A/B results failing to replicate in production put it: "Precalculate a sample size based on the statistical power you need... then run the test to completion and crunch the numbers afterward." The discipline is simple to describe and genuinely difficult to maintain under deadline pressure. Decide the primary metric, the minimum detectable effect, the required sample size, and the stopping rule before data collection begins — and treat any deviation from that plan as a validity risk, not a methodological convenience.

Statistical validity in A/B testing: why winning results don't always mean what they seem

A/B testing is where statistical validity failures are most consequential and most common. The combination of time pressure, multiple metrics, and stakeholder expectations creates exactly the conditions where the validity threats described above — peeking, multiple testing, and post-hoc metric selection — are most likely to occur. A winning test result is not a valid result by default; it is a result that requires the same validity scrutiny as any other study.

The peeking problem — why stopping at significance invalidates your test

In practice, most A/B tests are not run to a pre-specified sample size. They are checked daily, sometimes hourly, and stopped when results look good — or extended when they don't. This is peeking, and it systematically inflates false positive rates in ways that are invisible in the final output.

When you check results repeatedly as data accumulates, you're giving yourself multiple chances to observe a random fluctuation that crosses the significance threshold. Given enough interim checks, even an experiment with no real effect is likely to see its p-value dip below 0.05 by chance at some point during the run. Stopping there doesn't capture a real effect — it captures noise at a convenient moment.

The mechanism is the same as the multiple testing problem described above — each additional check is an implicit test, and the cumulative false positive rate compounds accordingly. This is why documentation on experimentation failure modes treats peeking as a named, first-class problem, not a minor procedural footnote. Sequential testing is the structural solution: it adjusts the significance threshold dynamically to account for repeated looks, allowing valid early stopping without inflating the false positive rate.

Real-world consequences — when winning tests don't win in production

The downstream consequence of peeking — and of validity failures more broadly — is a pattern that many experimentation teams eventually encounter: tests that show strong positive results in the experiment but produce no measurable lift after the change ships to production. The result is shipped, the metric doesn't move, and the team is left trying to reconcile a clean-looking experiment with a flat outcome.

This creates a specific kind of organizational damage. When results don't make sense, people stop trusting the data and start trusting their gut instead. The experimentation program loses credibility not because the platform failed, but because the validity conditions that make results trustworthy were never enforced. Documentation on experimentation programs identifies this cognitive dissonance explicitly — and it's one of the harder problems to fix, because the solution is cultural and structural, not technical.

The practical implication is that every winning result deserves a validity audit before it drives a shipping decision. That audit should ask: Was the sample size pre-specified? Was the primary metric declared before data collection? Was the test run to completion rather than stopped at the moment of significance? A result that can't answer yes to all three is a hypothesis, not a finding.

Platform-level safeguards that remove validity from the discipline column

The most durable solution to validity failures in A/B testing is not better individual discipline — it's building validity requirements into the platform so they can't be bypassed under deadline pressure.

Some experimentation platforms support sequential testing as a statistical framework that allows teams to check results continuously without inflating false positive rates. This removes the peeking problem at the infrastructure level rather than relying on analysts to resist the temptation to stop early. Similarly, automated sample ratio mismatch (SRM) detection flags experiments where the traffic split doesn't match the intended allocation — a common sign of implementation errors that would otherwise produce invalid results.
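The SRM check itself is straightforward: compare the observed traffic split against the intended allocation with a chi-squared goodness-of-fit test. A sketch with hypothetical counts from an experiment intended to split traffic 50/50:

```python
from scipy.stats import chisquare

# Hypothetical user counts per arm from an experiment intended to split traffic 50/50.
observed = [50_640, 49_360]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(observed, f_exp=expected)
# A very small p-value means the observed split is unlikely under the intended
# allocation: a sample ratio mismatch, usually an implementation bug rather than a real effect.
if p_value < 0.001:
    print(f"possible SRM (p = {p_value:.2g}): investigate before trusting results")
else:
    print(f"no SRM detected (p = {p_value:.2g})")
```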

Some platforms also offer pre-experiment planning guides designed to help teams build validity into study design before data collection begins. The framing is a pre-flight checklist: validity is something you build into an experiment's design, not something you verify after the results are in.

Pre-registration and fixed-horizon testing as the structural solution

The most reliable structural protection against peeking and post-hoc analysis is pre-registration: committing to the primary metric, the stopping rule, the significance threshold, and the minimum detectable effect before the experiment launches. Pre-registration doesn't prevent exploratory analysis — it just distinguishes confirmatory findings from exploratory ones, which is the distinction that matters for decision-making.
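The pre-registration artifact does not need to be elaborate. Something as simple as a frozen record, written down and version-controlled before launch, is enough to make deviations visible. The field names and numbers below are hypothetical:

```python
from dataclasses import dataclass

# A minimal pre-registration record: the point is that these fields are fixed
# (and ideally version-controlled) before the experiment launches.
@dataclass(frozen=True)
class PreRegistration:
    experiment: str
    primary_metric: str
    minimum_detectable_effect: float  # relative lift the test is powered to detect
    alpha: float                      # significance threshold
    power: float
    required_sample_size_per_arm: int
    stopping_rule: str

plan = PreRegistration(
    experiment="checkout-redesign",
    primary_metric="completed_purchases_7d",
    minimum_detectable_effect=0.03,
    alpha=0.05,
    power=0.80,
    required_sample_size_per_arm=48_000,
    stopping_rule="analyze once, after both arms reach the required sample size",
)
print(plan)
```

Making the record immutable is part of the point: any change after data starts flowing has to happen as a visible, dated amendment rather than a silent edit.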

Fixed-horizon testing — running an experiment to a pre-calculated sample size and analyzing results exactly once — is the simplest implementation of this principle. It's also the most commonly violated one. The temptation to check early is real, and the organizational pressure to ship is real. Documentation on experimentation best practices recommends drawing conclusions thoughtfully from multi-metric tests and treating a single standout result as a hypothesis to confirm, not a finding to act on. That recommendation is easy to agree with in the abstract and genuinely difficult to follow when a metric is up 12% and the product manager is asking when the feature ships.

The answer is: after the pre-specified sample size is reached, not before.

Statistical validity as a pre-commitment, not a post-hoc check

Statistical validity is not something you verify after the results come in. By the time you're looking at a p-value, the decisions that determine whether that p-value means anything have already been made. Sample size, randomization, metric definition, stopping rule — these are design decisions, and they either build validity in from the start or they don't.

Three questions that determine whether a study's validity is already at risk

Before any experiment launches, three questions determine whether its conclusions will be trustworthy:

  • Was the primary metric defined before data collection began, or selected after results were visible?
  • Was the sample size calculated to achieve at least 80% power for the minimum effect size that would justify a decision?
  • Is there a pre-specified stopping rule that doesn't depend on whether results look significant at the time of checking?

A study that can answer yes to all three has the structural conditions for valid conclusions. A study that can't answer yes to even one of them has a validity problem that no amount of sophisticated analysis will fix.

Using the six-type framework as a diagnostic lens on studies already in progress

For experiments already running, the six-type validity framework functions as a diagnostic tool rather than a design checklist. Work through each type in sequence:

  • Construct validity: Is the metric actually measuring the outcome the team cares about, or a proxy that may not correlate with the real goal?
  • Internal validity: Is random assignment working correctly? Is there any evidence of a sample ratio mismatch or multiple exposure contamination?
  • External validity: Is the test population representative of the full user base, or concentrated in a segment whose behavior may not generalize?
  • Statistical conclusion validity: Is the test adequately powered? Are multiple metrics being tracked without correction?
  • Face validity: Would a domain expert look at the metric definition and immediately recognize it as measuring what it claims to measure?
  • Criterion validity: Does the metric correlate with established measures of the outcome in the expected direction?

Any type that produces a "no" or "uncertain" answer is a validity risk. The appropriate response is not to discount the result — it's to identify which type of validity is threatened and what additional evidence would resolve the uncertainty.

When validity norms depend on individual discipline, they fail under deadline pressure

The pattern that produces most validity failures in practice is not malice or incompetence — it's deadline pressure applied to norms that depend entirely on individual discipline to enforce. An analyst who knows they shouldn't peek will still peek when the product review is tomorrow and the results are almost significant. A team that knows they should pre-register their metric will still change it after seeing the data when the original metric is flat and a secondary metric is up.

The solution is to move validity requirements from the discipline column to the infrastructure column. Statistical guardrails built into the experimentation platform exist precisely to make these norms easier to enforce at the system level, so they don't depend on individual discipline under deadline pressure. Sequential testing, automated SRM detection, minimum data thresholds, and pre-experiment planning workflows are all mechanisms for making the valid path the default path — not the path that requires extra effort to follow.

Statistical validity is ultimately a commitment made before data collection begins. The six types, the threats, the design requirements — all of it points to the same conclusion: the question "is this result valid?" has to be answered by the study design, not by the analysis. If the design doesn't support valid conclusions, the analysis can't rescue them.

What to do next:

  1. Before your next experiment, run through the six-type validity checklist: construct, internal, external, statistical conclusion, face, and criterion. Identify which type is most at risk given your study design.
  2. Calculate your required sample size before data collection begins. If you cannot reach 80% power within a realistic timeline, reduce scope or increase the minimum detectable effect — do not extend the study after the fact.
  3. Pre-register your primary metric, your stopping rule, and your significance threshold. Write them down before the experiment launches.
  4. If your platform supports sequential testing, enable it. It allows valid early stopping without inflating your false positive rate.
  5. Treat any single-metric significant result in a multi-metric test as a hypothesis, not a finding. Confirm it in a dedicated follow-up experiment.
