What are experimental units in research?

Pick the wrong entity to randomize in your experiment, and your p-values, sample size, and conclusions all break — even if the rest of your analysis is flawless.

This is the core problem with misidentifying experimental units, and it happens constantly: in academic labs counting cells instead of mice, in product teams counting pageviews instead of users, in any experiment where the thing being measured gets confused with the thing that actually received the treatment.

This article is for engineers, PMs, and data practitioners who run experiments — whether that means lab studies, field tests, or product A/B tests — and want to understand how experimental units work well enough to get them right. Here's what you'll learn:

  • What experimental units are, including their dual role as both the recipient of a treatment and the basis for statistical inference
  • How to correctly identify the experimental unit in your own study using a practical diagnostic checklist
  • Why experimental units and sampling units are different things, and how confusing them produces pseudo-replication
  • How experimental units determine your true sample size and what under-replication actually costs you
  • How these principles apply directly to A/B testing and product experimentation, including how platforms like GrowthBook make the randomization unit an explicit design decision

The article moves from the foundational concept through identification, common errors, sample size implications, and finally to digital experimentation — so whether you're designing a study from scratch or auditing one that's already running, you'll find the relevant section without having to read linearly.

The entity that receives the treatment: defining the experimental unit

Before any statistical analysis can be trusted, one question has to be answered correctly: what, exactly, is the experimental unit? It sounds like a technical formality, but getting this wrong doesn't just introduce noise into your results — it can invalidate them entirely.

Whether you're running a clinical trial, a behavioral study, or a product A/B test, the experimental unit is the foundation everything else is built on.

Independence is the criterion, not physical form

An experimental unit is the entity that receives a treatment independently of all other units in a study. The NC3Rs Experimental Design Assistant (EDA) defines it this way: "The experimental unit is the entity subjected to an intervention independently of all other units. It must be possible to assign any two experimental units to different treatment groups."

That independence criterion is doing a lot of work in the definition. It's not enough that something receives a treatment — it has to receive it in a way that is logically separable from every other entity in the study. If one unit's treatment assignment or response influences another's, they aren't truly independent experimental units.

The ARRIVE Guidelines, which govern reporting standards for animal research, add a further condition: experimental units "should not influence each other on the outcomes that are measured."

The experimental unit can take many forms. In the most straightforward case, it's an individual person or animal, each independently allocated to a treatment group. But it can also be a cage of animals receiving a shared diet, a litter of pups whose dam was treated, a specific body region of a single animal receiving a topical drug, or even the same animal across distinct time periods in a crossover design. What matters is not the physical form of the entity — it's whether the treatment was applied independently.

You'll sometimes see the experimental unit referred to as the "unit of randomisation", particularly in clinical and biomedical research contexts. The terms are functionally synonymous and worth knowing, since practitioners in different fields tend to favor one over the other.

The dual role: treatment receipt and the basis for inference

Here's where many researchers — even experienced ones — miss something important. The experimental unit doesn't just receive a treatment. It also serves as the entity about which you draw population-level conclusions.

NC3Rs states both roles explicitly: the experimental unit is "the entity you want to make inferences about (in the population) based on the sample (in your experiment)" and simultaneously "the entity subjected to an intervention independently of all other units." These two roles are inseparable, and the second one — the inference role — is what makes correct identification so consequential.

Sample size, as defined by both NC3Rs and the ARRIVE Guidelines, is the number of experimental units per group. That's not a technical footnote; it's a direct statement that your power calculations, your degrees of freedom, and your ability to generalize findings all hinge on how you've counted experimental units. If you're counting the wrong thing, your sample size estimate is wrong, and so is everything downstream that depends on it.

Misidentifying the unit doesn't just add noise — it invalidates the analysis

The consequences of misidentifying the experimental unit aren't subtle. NC3Rs is direct about it: "If you do not correctly identify the experimental unit, there is a risk you overestimate your sample size which could invalidate the results of your statistical analysis and conclusions."

A concrete example from the ARRIVE Guidelines makes this tangible. Suppose a researcher takes 50 cell measurements from a single mouse. If the mouse is the experimental unit — because the treatment was applied to the mouse, not to individual cells — then those 50 measurements represent a sample size of one, not fifty. Treating them as 50 independent observations inflates the apparent sample size, distorts the statistical analysis, and produces conclusions that can't be trusted.

This is why identifying the experimental unit isn't a step you return to after designing your study. It's the first design decision, and every subsequent choice — how many subjects to recruit, how to structure your analysis, what inferences you're entitled to draw — depends on getting it right.

Treatment assignment, not measurement, determines the experimental unit

Knowing the definition of an experimental unit is one thing. Knowing which entity in your study actually qualifies is another. The distinction that unlocks correct identification is this: the experimental unit is defined by what receives the treatment, not by what gets measured.

These are often different entities, and conflating them is one of the most common errors in study design — committed by experienced researchers as often as by newcomers.

The guiding question: what received the treatment independently?

The most reliable diagnostic question you can ask is: which entity in this study received the treatment independently of all other entities?

NC3Rs EDA offers a sharper version of this test: "It must be possible to assign any two experimental units to different treatment groups." If two entities in your study always receive the same treatment because they are physically or logically grouped together, neither one is the experimental unit — the group is. This single test resolves most identification problems before they become analytical errors.

The key move is separating "what did I measure?" from "what received the treatment?" These questions have different answers more often than researchers expect, and the experimental unit is always the answer to the second question.

The aquarium example: when the container is the unit, not its contents

Consider a study testing the effect of a water additive on fish. The researcher applies the additive to the aquarium, not to individual fish. Every fish in a given aquarium experiences identical water conditions — they are not independent recipients of the treatment. The aquarium is what received the treatment independently. Therefore, the aquarium is the experimental unit, and individual fish are the measurement unit.

This distinction matters enormously for analysis. If the study contains ten fish across two aquariums, the researcher does not have ten experimental units — they have two. The fish provide multiple measurements, but those measurements are not statistically independent of one another. NC3Rs EDA makes this principle explicit: taking multiple measurements from the same entity does not multiply the number of experimental units.
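The count in the aquarium example can be checked mechanically. The sketch below uses made-up records and hypothetical field names (aquarium id, treatment, fish length); the only point it illustrates is that the unit count comes from the distinct entities that were treated, not from the number of rows.

```python
# Hypothetical records: (aquarium_id, treatment, fish_length_cm)
measurements = [
    ("tank_A", "additive", 4.1), ("tank_A", "additive", 4.4),
    ("tank_A", "additive", 3.9), ("tank_A", "additive", 4.2),
    ("tank_A", "additive", 4.0),
    ("tank_B", "control", 3.8), ("tank_B", "control", 4.0),
    ("tank_B", "control", 3.7), ("tank_B", "control", 3.9),
    ("tank_B", "control", 4.1),
]

n_measurements = len(measurements)

# The additive was applied per aquarium, so distinct aquariums are the units.
units_per_treatment = {}
for aquarium_id, treatment, _ in measurements:
    units_per_treatment.setdefault(treatment, set()).add(aquarium_id)

n_units = sum(len(ids) for ids in units_per_treatment.values())

print(n_measurements)  # 10 fish measured
print(n_units)         # 2 aquariums treated
```

Ten rows of data, two experimental units, and only one unit per treatment.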

The aquarium example also illustrates a pattern worth internalizing: the experimental unit is frequently larger than the entity being measured. NC3Rs EDA notes that the unit "may be bigger than the animal (e.g. a litter or a cage)" — the cage of animals, not the individual animal, is the experimental unit when animals within a cage cannot be assigned to different treatments independently.

The restaurant example: scaling the principle to real-world experiments

The same logic applies outside the lab. Imagine a restaurant chain testing a new menu layout at a subset of its locations. Customers are measured — order size, satisfaction scores, return visits — but the treatment was assigned at the restaurant level. Every customer who walks into a treated location receives the same menu. Those customers are not independent recipients of the treatment; the restaurant is. The restaurant is the experimental unit. Individual customers are the measurement unit.

This scales directly to digital product contexts. If a feature is rolled out to all users in a geographic region, the region is the experimental unit, not the individual user. Misidentifying your experimental units can lead to overestimating your sample size, skewing your statistical analysis, and producing invalid conclusions. The error is not abstract — it produces wrong numbers and wrong decisions.

Four diagnostic questions that resolve most identification problems

When you are unsure which entity in your study is the experimental unit, work through these diagnostic questions in order. Start by asking what entity actually received the treatment — not what you measured, but what the intervention was applied to. From there, apply the NC3Rs EDA independence test: could you have assigned this entity to a different treatment group without affecting any other entity?

Next, look for nesting: are there smaller entities grouped inside your candidate unit that all received the same treatment? If so, the candidate unit — not the nested entities — is the experimental unit. Finally, check whether your unit is larger than you initially assumed, since the experimental unit is often the cage, the location, the region, or the account rather than the individual subject inside it.

In practice, this identification step happens at the design stage — and in modern experimentation platforms, it is made explicit. GrowthBook, for instance, requires teams to configure a hashAttribute when setting up an experiment, which is the attribute used to assign entities to treatment groups. Selecting that attribute is the act of identifying the experimental unit. The platform supports user, location, postal code, URL path, and other randomization units precisely because the right choice depends on how the treatment is actually assigned — which is exactly the question this checklist is designed to answer.
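To make the idea of hashing an attribute concrete, here is a minimal sketch of deterministic assignment. It is not GrowthBook's actual hashing algorithm or SDK API: the function `assign_variant`, the use of SHA-256, and the attribute names `user_id` and `account_id` are all assumptions chosen for illustration. What it shows is that the attribute you hash is the experimental unit, because every entity sharing that attribute value lands in the same variant.

```python
import hashlib

def assign_variant(attributes, hash_attribute, experiment_key,
                   variants=("control", "treatment")):
    """Deterministically map the chosen unit to a variant (illustrative only)."""
    unit_id = str(attributes[hash_attribute])
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return variants[int(bucket * len(variants))]

alice = {"user_id": "u-101", "account_id": "acct-7"}
bob = {"user_id": "u-102", "account_id": "acct-7"}

# Hashing on account_id: both users in acct-7 always share a variant,
# so the account (not the user) is the experimental unit.
by_account = (assign_variant(alice, "account_id", "new-menu"),
              assign_variant(bob, "account_id", "new-menu"))

# Hashing on user_id: each user is assigned on their own.
by_user = (assign_variant(alice, "user_id", "new-menu"),
           assign_variant(bob, "user_id", "new-menu"))

print(by_account)
print(by_user)
```

Because the mapping depends only on the experiment key and the attribute value, calling it again for the same entity returns the same variant.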

Experimental units vs. sampling units: the structural difference that pseudo-replication exploits

Most statistical errors don't announce themselves. Pseudo-replication — the mistake of treating measurements taken on sampling units as if they were independent experimental units — is particularly insidious because the data looks fine, the model runs without errors, and the p-value comes back significant. The problem is that the significance isn't real.

Understanding the difference between experimental units and sampling units is what separates a valid analysis from a convincing-looking one.

Sampling units live inside experimental units — and cannot stand in for them

A sampling unit is a fraction of an experimental unit — not a separate entity that independently receives a treatment, but a piece of something that does. The distinction is structural: experimental units exist at the level where treatment is assigned; sampling units exist within experimental units and are measured to characterize them.

The pairing shows up across research contexts in a consistent pattern. A fish tank is an experimental unit; an individual fish pulled from that tank is a sampling unit. A cage holding five birds is an experimental unit; one bird from that cage is a sampling unit. A field plot is an experimental unit; a quadrat within that plot is a sampling unit. In every case, the sampling unit shares its treatment assignment with the experimental unit it belongs to. It didn't receive the treatment independently, so it cannot be treated as an independent observation for statistical purposes.

This matters because, as agricultural statisticians put it directly: variation of observations within an experimental unit will not give you treatment differences. Only variation between experimental units provides the basis for testing whether a treatment had an effect. When you collapse that distinction, you're no longer measuring what you think you're measuring.

What pseudo-replication actually does to your statistics

Here's the error in concrete terms. Suppose you have two fish tanks. Tank A receives Treatment 1; Tank B receives Treatment 2. Each tank holds 10 fish, and you record a measurement on every fish — giving you 20 data points. If you enter those 20 observations into a statistical model and treat each fish as an independent experimental unit, you've committed pseudo-replication.

The immediate consequence is that your statistical test counts the wrong number of independent data points. With 20 fish observations and 2 treatment groups, a two-sample test works with 20 − 2 = 18 error degrees of freedom, as though you had 18 pieces of independent evidence about variability. You don't. You have 2 experimental units, one per treatment, which leaves 2 − 2 = 0 degrees of freedom for error. The test is dividing by a number that doesn't reflect reality, which makes the result look more statistically significant than it actually is.

The p-value comes back small, but the small p-value is an artifact of miscounting, not evidence of a real treatment effect.

The mechanism is exactly this — counting sampling units as experimental units inflates n, which inflates the apparent degrees of freedom, which makes effects appear more statistically significant than they are.
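A small simulation can make the inflation visible. The sketch below is an illustration under stated assumptions, not a reproduction of any cited analysis: it uses two tanks per treatment (so that a tank-level test is possible at all), ten fish per tank, no true treatment effect, and a shared per-tank effect with the same spread as the fish-to-fish noise. The critical values are the standard two-sided 5% Student's t values for 38 and 2 degrees of freedom.

```python
import random
import statistics

TANKS_PER_ARM = 2
FISH_PER_TANK = 10
CRIT_FISH = 2.024  # two-sided 5% t critical value, df = 40 - 2 = 38
CRIT_TANK = 4.303  # two-sided 5% t critical value, df = 4 - 2 = 2

def pooled_t(group_a, group_b):
    """Two-sample t statistic with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    se = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

def simulate_tank(rng, tank_sd=1.0, fish_sd=1.0):
    """One tank under the null: a shared tank effect plus per-fish noise."""
    tank_effect = rng.gauss(0.0, tank_sd)
    return [tank_effect + rng.gauss(0.0, fish_sd) for _ in range(FISH_PER_TANK)]

def false_positive_rates(n_sims=2000, seed=0):
    rng = random.Random(seed)
    fish_hits = tank_hits = 0
    for _ in range(n_sims):
        arm_a = [simulate_tank(rng) for _ in range(TANKS_PER_ARM)]
        arm_b = [simulate_tank(rng) for _ in range(TANKS_PER_ARM)]
        # Pseudo-replicated analysis: every fish counted as a unit.
        fish_a = [x for tank in arm_a for x in tank]
        fish_b = [x for tank in arm_b for x in tank]
        fish_hits += abs(pooled_t(fish_a, fish_b)) > CRIT_FISH
        # Unit-level analysis: one mean per tank.
        means_a = [statistics.mean(tank) for tank in arm_a]
        means_b = [statistics.mean(tank) for tank in arm_b]
        tank_hits += abs(pooled_t(means_a, means_b)) > CRIT_TANK
    return fish_hits / n_sims, tank_hits / n_sims

fish_rate, tank_rate = false_positive_rates()
print(f"fish-level false positive rate: {fish_rate:.2f}")
print(f"tank-level false positive rate: {tank_rate:.2f}")
```

Under these assumptions the fish-level analysis should flag a "significant" difference far more often than the nominal 5%, even though no treatment effect exists, while the tank-level analysis should stay close to 5%.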

Recognizing the error before it propagates

The fish tank example is easy to diagnose in retrospect, but the same structural mistake appears in less obvious forms. Multiple measurements taken on the same patient over time, repeated observations from the same store location, or several page views from the same user session — all of these are sampling units nested within experimental units, not independent experimental units in their own right.

The diagnostic question is the same one that identifies the experimental unit in the first place: what entity received the treatment independently? Everything measured within that entity is a sampling unit. Counting those measurements as separate experimental units doesn't increase your statistical power — it creates the illusion of power that isn't there.

Modern experimentation platforms have formalized this distinction at the infrastructure level. Tools like GrowthBook expose the randomization unit — the entity that gets hashed to determine treatment assignment — as an explicit configuration parameter. That design choice reflects a real methodological constraint: if a team assigns treatments at the user level but analyzes outcomes at the session or pageview level, treating each session as an independent observation, they're committing the digital equivalent of the fish tank error. The platform makes the experimental unit explicit precisely because the consequences of getting it wrong propagate through every analysis that follows.

How experimental units determine sample size and replication

Sample size is not the number of measurements in your study. It is the number of experimental units per group. That distinction sounds simple, but getting it wrong is one of the most common ways researchers and data practitioners end up with studies that are either underpowered or deceptively over-counted — and often both at once.

Replication means independent units, not more measurements

The ARRIVE Guidelines define the experimental unit as "the biological entity subjected to an intervention independently of all other units, such that it is possible to assign any two experimental units to different treatment groups." From that definition follows a direct corollary: the sample size is the count of those independent units per group, not the total number of observations collected.

Replication, in the statistically meaningful sense, requires independent experimental units receiving each treatment. It is not satisfied by taking more measurements from the same unit. As the JABSTB textbook frames it, a statistically valid sample is composed of independent replicates of the experimental unit, generated through some random process. Both conditions, independence and randomness, are required. Repeated measurements on a single unit satisfy neither.

This matters for repeated-measures designs specifically. A before-and-after experiment that records two scores from the same subject produces more data points than experimental units. Those two scores are intrinsically linked; they do not represent two independent replications of the treatment. The experimental unit count stays at one per subject.
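One common way to respect this in a before-and-after design is to reduce each subject to a single number, the change score, and analyze those. The sketch below uses invented scores for four hypothetical subjects; it is one reasonable approach, not the only valid analysis for repeated measures.

```python
import statistics

# Hypothetical (before, after) scores for four subjects.
scores = {
    "subject_1": (62.0, 68.0),
    "subject_2": (55.0, 57.0),
    "subject_3": (71.0, 70.0),
    "subject_4": (48.0, 55.0),
}

n_data_points = sum(len(pair) for pair in scores.values())

# One difference per subject: the subject, not the score, is the unit.
differences = [after - before for before, after in scores.values()]
n_units = len(differences)

mean_change = statistics.mean(differences)
# Standard error uses the number of subjects, not the number of scores.
se_change = statistics.stdev(differences) / n_units ** 0.5

print(n_data_points)  # 8 scores recorded
print(n_units)        # 4 experimental units
print(mean_change)    # 3.5
```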

How misidentification inflates your n

The inflation error has a concrete form. Suppose you measure 50 individual cells taken from a single mouse. If the mouse is the experimental unit, because the mouse rather than the cell received the treatment independently, then you have n = 1, not n = 50. The 50 cell measurements are subsamples. They estimate variability within that one unit (cell-to-cell differences plus measurement error); they say nothing about how different mice would respond to the treatment.

Treating those 50 measurements as 50 independent replicates makes the study appear far better-powered than it actually is. The degrees of freedom are artificially inflated, confidence intervals are falsely narrow, and any resulting p-values are not trustworthy. This is the mechanism behind pseudo-replication: the study looks adequately sized on paper while being, in practice, an experiment of one.

Hierarchical designs and the question of which n to count

Real experiments are rarely flat. Biological and technical factors are typically organized in hierarchies — cells within animals, animals within cages, cages within rooms. Each level of that hierarchy raises the same question: at which level does independent treatment assignment actually occur?

The ARRIVE Guidelines are direct about this challenge: "Such hierarchies can make determining the sample size difficult (is it the number of animals, cells, or mitochondria?)". The answer is always the level at which treatment is assigned independently. If the cage receives the treatment and animals within the cage are measured, the cage is the experimental unit regardless of how many animals it contains.

Hierarchical designs can also have multiple experimental units within a single experiment. A pregnant dam receiving one treatment and her weaned pups subsequently allocated to different diets creates two distinct experimental units operating at different levels. Each has its own relevant n for its respective treatment comparison. Collapsing them into a single count produces an analysis that is wrong at both levels.
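The dam-and-pup design can be sketched as a table with one row per pup, where the n for each comparison is counted at the level that treatment was assigned. The data, the column layout, and the helper `units_per_group` are hypothetical and exist only to show the two counts side by side.

```python
# Hypothetical pups: (pup_id, dam_id, dam_treatment, pup_diet)
pups = [
    ("p1", "dam_1", "drug", "high_fat"), ("p2", "dam_1", "drug", "standard"),
    ("p3", "dam_2", "drug", "high_fat"), ("p4", "dam_2", "drug", "standard"),
    ("p5", "dam_3", "vehicle", "high_fat"), ("p6", "dam_3", "vehicle", "standard"),
    ("p7", "dam_4", "vehicle", "high_fat"), ("p8", "dam_4", "vehicle", "standard"),
]

def units_per_group(rows, unit_index, group_index):
    """Count distinct units in each treatment group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_index], set()).add(row[unit_index])
    return {group: len(units) for group, units in groups.items()}

# The dam's treatment was assigned per dam, so dams are the units.
n_for_dam_treatment = units_per_group(pups, unit_index=1, group_index=2)
# Diet was assigned per pup after weaning, so pups are the units.
n_for_pup_diet = units_per_group(pups, unit_index=0, group_index=3)

print(n_for_dam_treatment)  # {'drug': 2, 'vehicle': 2}
print(n_for_pup_diet)       # {'high_fat': 4, 'standard': 4}
```

The same eight rows give n = 2 per group for the dam-level comparison and n = 4 per group for the diet comparison. A real analysis would usually also model litter as a grouping factor for the pup-level comparison; that is outside this sketch.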

The consequences of under-replication

An experiment with only one experimental unit per treatment group cannot estimate within-treatment variability. There are no degrees of freedom available for error, which means no valid statistical test can be performed. Results from such a design cannot be generalized beyond the specific units tested, because there is no empirical basis for estimating how well those units represent a broader population.
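The degrees-of-freedom arithmetic is simple enough to write down. For a one-way comparison of independent groups, the error degrees of freedom equal the total number of experimental units minus the number of groups; the helper name below is made up for this sketch.

```python
def error_degrees_of_freedom(units_per_group):
    """Residual df for a one-way comparison: total units minus number of groups."""
    return sum(units_per_group) - len(units_per_group)

print(error_degrees_of_freedom([1, 1]))    # 0: one tank per treatment, nothing left for error
print(error_degrees_of_freedom([5, 5]))    # 8: five tanks per treatment
print(error_degrees_of_freedom([10, 10]))  # 18: what a test assumes if 20 fish are miscounted as units
```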

Under-replication is not just a power problem — it is a validity problem. No amount of additional measurements per unit compensates for the absence of independent replication across units. This is why identifying the experimental unit correctly before designing a study is not a formality. It determines whether your sample size calculation reflects reality or a number that will mislead you from the start.

Platforms like GrowthBook operationalize this principle in digital experimentation by enforcing minimum metric thresholds before surfacing results — a practical guard against drawing conclusions from experiments that have not yet accumulated enough experimental units to support valid inference.

Experimental units in A/B testing and product experimentation

The same principle that determines the experimental unit in a laboratory study applies without modification to a digital A/B test: the experimental unit is whatever entity independently receives the treatment. The terminology shifts — "randomization unit" instead of "experimental unit," "traffic split" instead of "treatment assignment" — but the underlying logic is identical. Getting it wrong produces the same class of errors in a product dashboard that it produces in a published paper.

Users, sessions, regions: which entity actually received the treatment?

In most product experiments, the user is the experimental unit. A feature flag is toggled on or off for a given user, that user consistently sees one variant throughout the experiment, and the analysis counts users — not events, not pageviews — as the unit of observation. But the user is not the only valid choice.

Depending on what receives the treatment independently, the experimental unit might be a session, a device, an account, a geographic region, a store location, or a server-side entity like an API endpoint. The NC3Rs diagnostic question applies cleanly here: can any two of these entities be assigned to different treatment groups independently of each other? If the answer is yes, you have a candidate experimental unit. GrowthBook's platform reflects this variety explicitly, supporting randomization by user, location, postal code, URL path, and other attributes — a recognition that the right unit depends on the experiment, not on a platform default.

When the analysis unit doesn't match the assignment unit, independence breaks down

The choice of experimental unit determines whether the independence assumption underlying your statistical test actually holds. If users are the true experimental unit but your analysis treats sessions as independent observations, you are counting the same user's repeated sessions as if they were separate, unrelated data points. They are not. The same user's sessions are correlated by definition — same preferences, same context, same exposure history.

This violates the core NC3Rs requirement that it must be possible to assign any two experimental units to different treatment groups. A session cannot be independently assigned when the user behind it has already been assigned. Misidentifying the unit in this way leads directly to overestimated sample sizes and invalid statistical conclusions: the confidence intervals look tighter than they are because the apparent sample size is inflated.

Consistent assignment is the practical safeguard against this error. Experimentation platforms address it by ensuring the same entity always receives the same variant across the experiment's duration, rather than being re-randomized on each visit. The experimental unit must receive a consistent treatment throughout — not a randomly re-assigned one on each interaction.

The pseudo-replication risk in product experimentation

Counting pageviews or events as independent experimental units when the actual unit is the user is the digital equivalent of pseudo-replication. The apparent sample size grows quickly — millions of events can accumulate in days — but the number of independent experimental units grows much more slowly. Statistical tests that treat event counts as the sample size are operating on inflated degrees of freedom, producing p-values and confidence intervals that cannot be trusted.
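In practice, avoiding this usually means collapsing the event log to one row per user before computing any rates. The sketch below assumes a hypothetical pageview log with a user id, the user's assigned variant, and a per-pageview conversion flag; the names and data are invented for illustration.

```python
# Hypothetical pageview log: (user_id, variant, converted_on_this_pageview)
pageviews = [
    ("u1", "control", False), ("u1", "control", False), ("u1", "control", True),
    ("u2", "control", False),
    ("u3", "treatment", True), ("u3", "treatment", True),
    ("u4", "treatment", False), ("u4", "treatment", False), ("u4", "treatment", False),
]

# Collapse to one record per user: did this user ever convert?
users = {}
for user_id, variant, converted in pageviews:
    record = users.setdefault(user_id, {"variant": variant, "converted": False})
    record["converted"] = record["converted"] or converted

def conversion_by_variant(user_records):
    """User-level conversion rate per variant; the denominator is users."""
    totals = {}
    for record in user_records.values():
        n, k = totals.get(record["variant"], (0, 0))
        totals[record["variant"]] = (n + 1, k + int(record["converted"]))
    return {v: {"users": n, "rate": k / n} for v, (n, k) in totals.items()}

summary = conversion_by_variant(users)

print(len(pageviews))  # 9 events
print(len(users))      # 4 experimental units
print(summary)
```

Nine pageviews collapse to four users, two per variant, and the conversion rate's denominator is the user count rather than the event count.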

GrowthBook's own guidance on exposure timing addresses a related mismatch: record a user's exposure as close as possible to the moment they actually encounter the treatment, and avoid including users who never encountered it. Including unexposed users dilutes the measured effect and reduces the ability to detect real differences. The failure differs from pseudo-replication, but the root cause is the same: the units in the analysis are not the units that received the treatment. For cases where assignment and exposure are unavoidably separated, some platforms support an activation metric to filter the analysis down to users who actually received the treatment.

Making the experimental unit an explicit configuration decision, not a hidden assumption

The industry has responded to this problem by making experimental unit selection an explicit, configurable platform decision rather than a hidden assumption. GrowthBook exposes the randomization unit as a named configuration parameter within a unified platform that connects feature flags, experiment configuration, and metrics analysis — ensuring the entity used for assignment is the same entity reflected in downstream statistical reporting.

The platform also offers sticky bucketing for experiments where consistent assignment must be maintained even if experiment settings change. Statsig similarly surfaces flexible targeting and randomization units as a deliberate capability.

This is not a convenience feature. It is a recognition that the choice of experimental unit is a design decision with direct consequences for statistical validity, and that practitioners need to make it consciously rather than inherit a default that may not match their experiment's structure.

One question determines everything: what entity received the treatment independently?

The through-line of everything in this article is a single question: what entity received the treatment independently? That question determines your experimental unit, which determines your true sample size, which determines whether your statistical conclusions are valid or just plausible-looking. The cell isn't the unit if the mouse got the treatment. The session isn't the unit if the user was assigned. The customer isn't the unit if the restaurant was randomized. Everything else follows from getting that one thing right.

Three questions that surface the right unit before the study runs

Before you finalize any study design — or before you audit one that's already running — ask three things in order. First: what entity actually received the treatment, not what you measured? Second: could any two of those entities have been assigned to different groups without affecting each other? Third: is there a level of nesting above your candidate unit where treatment was really applied?

If you work through those questions honestly, you'll land on the right unit most of the time. The mistake almost always comes from skipping straight to "what did I measure?" and working backward.

Pseudo-replication doesn't look like an error until you check the denominator

The hardest part about pseudo-replication is that it doesn't look like an error. The data is real, the model runs, and the p-value comes back significant. What's broken is the denominator — and catching it requires checking at the design stage, not after the results are in. If you've worked through the three questions above honestly, you've already done the work. The analysis just has to reflect the same unit you identified.

The terminology changes by field; the logic doesn't

The framing changes by field — "unit of randomisation" in clinical research, "randomization unit" or hashAttribute in product experimentation, "experimental unit" in biology — but the underlying logic doesn't. Whatever entity was independently assigned to a treatment group is your experimental unit, and your analysis has to be built around that entity, not around the measurements you took from inside it. This is one of those rare methodological principles that transfers cleanly across contexts, which means getting it right once pays dividends everywhere you run experiments.

What to do next:

  • If you're designing a new study, answer the three questions in this article before writing your analysis plan — not after.
  • If you're in a product experimentation context, check whether your platform's randomization unit matches the entity your treatment was actually applied to.
  • If you're auditing an existing study, locate the denominator in your statistical test and verify it reflects independent experimental units, not subsamples.
  • If you're working in a hierarchical design (cells within animals, users within accounts, sessions within users), identify every level where treatment assignment occurs and ensure your analysis reflects the correct level.
