
T test vs chi square: key differences explained


Picking the wrong statistical test doesn't give you a slightly off answer — it gives you a meaningless one.

The math behind a t-test assumes your data is continuous and measurable. The math behind a chi-square test assumes your data is categorical and countable. Apply either test to the wrong data type, and the output isn't imprecise — it's built on a fiction.

This article is for engineers, PMs, and data analysts who run experiments or analyze product metrics and want to stop guessing which test to use. Whether you're measuring session duration in an A/B test or tracking how users distribute across pricing tiers, the choice between a t-test and chi-square follows directly from the structure of your data — not intuition.

Here's what you'll learn:

  • What each test actually measures and why applying the wrong one produces invalid conclusions
  • The specific variants of each test (three for t-tests, two for chi-square) and when each applies
  • The data assumptions both tests require — and what to do when your data violates them
  • How test statistics, p-values, and critical values work in both tests
  • A repeatable decision framework with a quick-reference table you can use on your next analysis

The article builds from fundamentals to application. It starts with the core conceptual difference between the two tests, walks through each variant and its assumptions, explains how the underlying math evaluates a null hypothesis, and ends with a practical decision process you can apply immediately.

The fundamental difference: what t tests and chi square tests actually measure

Choosing between a t-test and a chi-square test is not a matter of preference or convention — it's determined by the structure of your data and the question you're trying to answer. Get this wrong, and you don't just get a less precise result.

You get a result that is structurally meaningless, because the mathematical logic underlying each test simply doesn't apply when the data type doesn't match.

This is the distinction that matters most, and it comes before any consideration of sample size, significance thresholds, or test variants.

T-tests measure mean differences in continuous data — nothing else

A t-test operates on continuous numerical data and answers a specific kind of question: is the mean of this variable meaningfully different — across two groups, from a known value, or across two time points for the same subjects?

Think of the data types where this applies: average revenue per user across two quarters, page load times before and after a deployment, session durations for users on two different onboarding flows. These are all measurements that exist on a numerical scale, where it makes sense to compute an average and ask whether that average has shifted.

The t-test quantifies whether the observed difference in means is large enough relative to the variability in the data to be considered non-random. It's fundamentally a signal-to-noise calculation: how big is the difference compared to how noisy the data is?

Chi-square tests measure category frequency discrepancies, not magnitudes

A chi-square test operates on categorical data and answers a different kind of question entirely: does the distribution of categories match what we'd expect, or are two categorical variables associated with each other?

The data types here look different: user plan tier (free, pro, enterprise), button label chosen in an A/B test, device category (mobile, desktop, tablet), or whether a user converted. These are labels, not measurements. There is no meaningful average of "free, pro, enterprise" — the relevant question is how many users fall into each category and whether that distribution is what you'd expect.

A chi-square test doesn't compare magnitudes. It compares observed frequencies against expected frequencies and asks whether the discrepancy is larger than chance would produce. When used as a test of independence, it asks whether knowing a user's value on one categorical variable tells you anything about their value on another — for example, whether product plan choice is independent of acquisition channel.

Why misapplying these tests produces invalid conclusions

The reason this distinction matters to practitioners — not just statisticians — is that applying the wrong test doesn't produce a degraded answer. It produces a nonsensical one.

Consider what happens if you try to run a t-test on a categorical variable like button label. You might encode the labels as numbers (1 for "Sign Up," 2 for "Get Started") and compute a mean. But that mean has no interpretable meaning — the numerical encoding is arbitrary, and the t-statistic you'd calculate would be built on a fiction.

The test's logic assumes you're measuring something on a continuous scale where differences in magnitude are meaningful. When that assumption is violated at the data structure level, the output isn't just imprecise — it's answering a question that was never coherent to begin with.
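To make the encoding problem concrete, here is a minimal sketch with three hypothetical button labels and two made-up groups of users. Both numeric encodings are equally arbitrary, yet they disagree about which group has the higher "mean". The labels, data, and names are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical button labels chosen by users in two groups.
control = ["Sign Up", "Sign Up", "Get Started", "Try Free"]
treatment = ["Get Started", "Get Started", "Try Free", "Try Free"]

# Two equally arbitrary numeric encodings of the same labels.
encoding_a = {"Sign Up": 1, "Get Started": 2, "Try Free": 3}
encoding_b = {"Sign Up": 3, "Get Started": 1, "Try Free": 2}

def mean_difference(encoding):
    """Treatment mean minus control mean under a given label encoding."""
    control_mean = mean(encoding[label] for label in control)
    treatment_mean = mean(encoding[label] for label in treatment)
    return treatment_mean - control_mean

diff_a = mean_difference(encoding_a)
diff_b = mean_difference(encoding_b)
print(diff_a, diff_b)  # the sign of the "effect" flips with the encoding
```

Nothing about the users changed between the two calls. Only the arbitrary numbers assigned to the labels did, which is why any t-statistic built on those means is uninterpretable.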

The same logic applies in reverse. Running a chi-square test on continuous revenue figures requires binning them into categories first, which discards information and introduces arbitrary decisions about bin boundaries. You've transformed your data to fit the wrong tool rather than selecting the right tool for your data.

"Picking the right test isn't just a formality — it can make or break your results. Use the wrong one, and you might end up with misleading conclusions." The reason it can break results is precisely this structural mismatch — not a subtle calibration issue, but a fundamental incompatibility between the test's assumptions and the data it's being asked to evaluate.

For teams configuring metrics in an experimentation platform, this distinction is what determines which statistical test gets applied under the hood. A continuous metric like revenue per user calls for one test family; a binary or categorical outcome like plan conversion calls for another. The conceptual line between them is the same regardless of what tool is doing the computation.

Three t-test variants and two chi-square variants: matching each to its research scenario

Knowing that a t-test applies to your problem is only half the answer. There are three distinct t-test variants and two chi-square variants, each designed for a specific research scenario — and choosing the wrong one within a family can produce results that are just as structurally flawed as choosing the wrong test entirely.

The consequences aren't subtle: using an independent t-test when a paired design is appropriate, for example, inflates variance and drains statistical power, making real effects harder to detect. Here's how to match each variant to the situation it was built for.

One-sample t-test: testing against a known benchmark

The one-sample t-test answers a narrow but useful question: does your group's mean differ from some fixed reference value? The reference might be an industry benchmark, a regulatory threshold, or a historical baseline — the key is that it's a single pre-established number, not a second group of observations.

A practical example: your team wants to know whether your application's average API response time differs from the 200ms service-level target you've committed to. You have one sample of response times and one benchmark. That's a one-sample t-test.
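As a rough sketch of that calculation, the one-sample t-statistic is the gap between the sample mean and the benchmark, divided by the standard error of the mean. The response times below are invented for illustration; only the 200ms target comes from the example above.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical API response times in milliseconds (illustrative values only).
response_ms = [212, 198, 205, 190, 221, 208, 199, 215, 203, 209]
target_ms = 200  # the service-level target from the example

n = len(response_ms)
sample_mean = mean(response_ms)
standard_error = stdev(response_ms) / sqrt(n)  # stdev uses the n - 1 denominator

t_stat = (sample_mean - target_ms) / standard_error
degrees_of_freedom = n - 1
print(f"mean={sample_mean:.1f} ms, t={t_stat:.2f}, df={degrees_of_freedom}")
```

With 9 degrees of freedom the two-tailed 5% critical value is about 2.262, so a t-statistic near 2.10 would not be enough to conclude that the mean differs from the target in this made-up sample.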

Independent two-sample t-test: comparing two separate groups

This is the most commonly used t-test variant in product and engineering contexts, and it's the structural foundation of most A/B testing frameworks. It compares the means of two distinct, unrelated groups — where each observation belongs to exactly one group and there's no natural pairing between them.

The scenario: did users randomly assigned to the control experience spend more per session on average than users assigned to the treatment? Two groups, no overlap, compare their means. GrowthBook's experimentation analysis uses a two-tailed t-distribution calculation with the Welch-Satterthwaite approximation for degrees of freedom — which is precisely this variant.

This approximation adjusts for situations where the two groups being compared have different amounts of variability in their data — a common real-world condition that the simpler version of the t-test does not handle correctly. If you're running A/B tests and interpreting p-values from an experimentation platform, you're almost certainly working with an independent two-sample t-test under the hood.
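The sketch below shows one way to compute Welch's t-statistic and the Welch-Satterthwaite degrees of freedom by hand. It is an illustration of the textbook formulas, not GrowthBook's implementation; the function name `welch_t` and the spend values are assumptions.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic (a minus b) and Welch-Satterthwaite df."""
    va_n = variance(a) / len(a)  # squared standard error of the mean, group a
    vb_n = variance(b) / len(b)  # squared standard error of the mean, group b
    t_stat = (mean(a) - mean(b)) / sqrt(va_n + vb_n)
    df = (va_n + vb_n) ** 2 / (
        va_n ** 2 / (len(a) - 1) + vb_n ** 2 / (len(b) - 1)
    )
    return t_stat, df

# Hypothetical spend per session; treatment is far more variable than control.
control = [10, 12, 11, 13, 9, 11]
treatment = [14, 20, 9, 25, 12, 16]

t_stat, df = welch_t(treatment, control)
print(f"t={t_stat:.3f}, df={df:.2f}")
```

Because the two groups have very different variances, the degrees of freedom land well below the 10 that the equal-variance formula would assume. In practice a library routine such as `scipy.stats.ttest_ind(treatment, control, equal_var=False)` performs the same test and also returns the p-value.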

Paired t-test: before/after or matched measurements

The paired t-test applies when the same subjects are measured twice — before and after an intervention — or when observations are matched in meaningful pairs. Because the two measurements come from the same subjects, the groups are dependent, not independent, and that dependency is information you should use rather than discard.

The scenario: you want to know whether users spent more time in your app after you redesigned the onboarding flow. You have pre-launch and post-launch measurements for the same user cohort. Using an independent t-test here would ignore the correlation between paired observations, artificially inflating variance and reducing your ability to detect a real effect. The paired t-test accounts for this structure directly.
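The power loss is easy to see in a small sketch. The numbers below are invented: five users whose time in app varies a lot from person to person, but who each improve by a few minutes. The paired statistic works on per-user differences; the independent one (computed here in Welch form) treats the two lists as unrelated groups.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical minutes in app for the same five users, before and after.
before = [10, 20, 30, 40, 50]
after = [12, 23, 31, 43, 52]

# Paired: analyze the per-user differences directly.
diffs = [a - b for a, b in zip(after, before)]
paired_t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Independent (Welch form): ignores which before value belongs to which after.
se = sqrt(variance(before) / len(before) + variance(after) / len(after))
independent_t = (mean(after) - mean(before)) / se

print(f"paired t={paired_t:.2f}, independent t={independent_t:.2f}")
```

The mean difference is identical in both calculations. What changes is the denominator: the paired version removes between-user variability, so the same shift produces a far larger t-statistic.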

Chi-square goodness-of-fit: does your distribution match expectations?

The goodness-of-fit test addresses a different kind of question entirely: does the observed distribution of a single categorical variable match a theoretical or expected distribution? You're not comparing groups — you're comparing a pattern.

The scenario: you want to verify whether users are distributing themselves across your four pricing tiers in the proportions your pricing model assumed (say, 40/30/20/10). You observe the actual counts and test whether they deviate significantly from the expected proportions. If the deviation is large enough to be unlikely by chance, the test rejects the hypothesis that your observed distribution matches the theoretical one.
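A minimal sketch of that goodness-of-fit calculation follows. The 40/30/20/10 split comes from the scenario above; the observed counts are hypothetical, and the critical value is the standard table entry for 3 degrees of freedom at the 5% level.

```python
# Hypothetical observed user counts across four pricing tiers.
observed = [370, 320, 190, 120]
expected_pct = [40, 30, 20, 10]  # proportions the pricing model assumed

total = sum(observed)
expected = [pct * total / 100 for pct in expected_pct]

# Sum of (observed - expected)^2 / expected over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CRITICAL_05_DF3 = 7.815  # chi-square critical value, alpha = 0.05, df = 3
reject_null = chi_square > CRITICAL_05_DF3
print(f"chi-square={chi_square:.3f}, df={df}, reject={reject_null}")
```

In this made-up sample the smallest tier contributes the most to the statistic: 120 observed against 100 expected adds 4.0 on its own.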

Chi-square test of independence: are two categorical variables related?

The test of independence examines whether two categorical variables are associated or whether they vary independently of each other. You have two categorical variables measured on the same subjects, and you want to know if knowing someone's value on one variable tells you anything about their value on the other.

The scenario: is product tier selection independent of customer industry segment? You build a contingency table of tier choices by industry, compare observed cell counts to what you'd expect if the variables were unrelated, and the chi-square statistic tells you how far the observed pattern departs from independence. A significant result means the variables are associated — though it says nothing about which one drives the other.
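Here is a sketch of the same steps in code: derive expected counts from the row and column totals, then sum the squared discrepancies. The contingency table is hypothetical (two industry segments by three tiers), and the function names are illustrative.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return the chi-square statistic and its degrees of freedom."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts. Rows: two industry segments. Columns: Basic, Pro, Enterprise.
observed = [
    [30, 50, 20],
    [20, 30, 50],
]
stat, df = chi_square_independence(observed)
print(f"chi-square={stat:.3f}, df={df}")
```

With 2 degrees of freedom the 5% critical value is about 5.991, so a statistic near 19.86 would indicate an association between segment and tier in this invented table.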

Variant selection within a test family carries the same stakes as the initial choice between t test vs chi square. Getting the variant right is where statistical rigor meets practical research design.

Statistical assumptions: what your data must satisfy before running a t test or chi square

Choosing the right test is only half the battle. The more commonly skipped step — and the one with the most consequences — is verifying that your data actually satisfies the assumptions that make the test valid. Running a t-test on data that violates its assumptions doesn't just reduce precision; it produces conclusions that are structurally invalid. The same is true for chi-square. Before you run either test, your data needs to pass a set of verifiable conditions.

T-test assumptions: four conditions your continuous data must clear

T-tests are built for continuous dependent variables — measurements that can take any value along a scale, like revenue per user, page load time, or session duration. If your outcome variable is a label or a category, a t-test is the wrong tool entirely.

Beyond data type, t-tests require that observations be independent of one another. One subject's measurement should have no influence on another's. This rules out scenarios like repeated measurements on the same users without using a paired t-test design.

The normality assumption is real but often overstated. Your data should be approximately normally distributed, but with larger sample sizes (generally n > 30) the Central Limit Theorem means the sampling distribution of the mean will be approximately normal even if the underlying data isn't. In other words, if you repeatedly drew samples and calculated the mean each time, those means would form a roughly bell-shaped distribution, even if the original data is skewed.

For small samples, normality matters more, and you should check it explicitly.
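The Central Limit Theorem claim can be checked with a small simulation. The sketch below draws strongly right-skewed values (an exponential distribution, chosen here only as a stand-in for something like session duration) and compares the skewness of the raw data with the skewness of means from repeated samples of size 40. The seed, sample sizes, and distribution are all assumptions.

```python
import random
from statistics import mean

def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Right-skewed raw data: exponential with mean 1.
raw = [rng.expovariate(1.0) for _ in range(5000)]

# Means of 2,000 repeated samples of size 40 from the same distribution.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(40)) for _ in range(2000)
]

raw_skew = skewness(raw)
means_skew = skewness(sample_means)
print(f"skew of raw data:     {raw_skew:.2f}")
print(f"skew of sample means: {means_skew:.2f}")
```

The raw values stay heavily skewed, while the distribution of sample means is much closer to symmetric. That is the property the t-test relies on at larger sample sizes.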

For two-sample t-tests specifically, there's a fourth assumption: homogeneity of variance, meaning the variance in each group should be roughly equal. When this assumption is violated, Welch's t-test — which adjusts the degrees of freedom to account for unequal variances — is the standard correction and is widely available in statistical software.

Chi-square assumptions: three conditions that practitioners most often skip

Chi-square tests operate on categorical variables — data that divides observations into discrete groups or labels, like browser type, pricing tier, or geographic region. The data type requirement here is non-negotiable: chi-square tests are not appropriate for continuous outcomes.

Like t-tests, chi-square tests require independence of observations. Each subject should contribute to exactly one cell in the contingency table. If the same user can appear in multiple categories, the independence assumption is violated.

The assumption that practitioners most frequently overlook is the minimum expected cell count. Each cell in the contingency table must have an expected frequency of at least 5 — and this applies to expected frequencies, not observed ones. A cell can have an observed count of 8 but an expected count of 3, which still violates the assumption. When expected counts fall below this threshold, the chi-square approximation becomes unreliable and the resulting p-value cannot be trusted.
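A quick way to audit this is to compute the expected counts yourself and flag any cell below the threshold. The sketch below uses a hypothetical 2×2 table that reproduces the situation just described: an observed count of 8 sitting on an expected count of 3. The function name is illustrative.

```python
def low_expected_cells(table, minimum=5):
    """List (row, col, observed, expected) for cells whose expected count is too low."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / grand_total  # expected count under independence
            if expected < minimum:
                flagged.append((i, j, table[i][j], expected))
    return flagged

# Hypothetical 2x2 table: a small segment (row 0) against a large one (row 1).
observed = [
    [8, 2],
    [22, 68],
]
flagged = low_expected_cells(observed)
print(flagged)
```

The top-left cell is flagged even though its observed count looks healthy, because the row it sits in contains only 10 of the 100 observations.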

Assumption violations redirect you to a different test, not a dead end

Assumption violations don't mean you're stuck. They mean you need a different test — one designed for your actual data conditions.

If your t-test normality assumption is violated, particularly with small samples, the Mann-Whitney U test is the appropriate non-parametric alternative. Rather than comparing means, it compares rank distributions, making it robust to non-normal data. Research published in BioData Mining (Chicco et al., 2025) explicitly frames Mann-Whitney U as the direct alternative to Student's t-test in these conditions.
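To show what comparing ranks rather than means looks like, here is a minimal sketch of the Mann-Whitney U statistic using its pairwise-comparison definition. The data are made up, and the helper name `mann_whitney_u` is an assumption; a library routine such as `scipy.stats.mannwhitneyu` would also supply the p-value.

```python
def mann_whitney_u(a, b):
    """U for sample a: number of (x, y) pairs with x > y, counting ties as half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical small samples; group_b contains one extreme outlier.
group_a = [1.2, 3.4, 2.2, 5.1]
group_b = [0.8, 1.9, 1.1, 40.0]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)  # the two U values always sum to len(a) * len(b)
```

The outlier of 40.0 would dominate a comparison of means, but here it only wins its four pairwise comparisons like any other largest value would. That is the sense in which the test is robust to non-normal data.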

If your chi-square expected cell counts fall below 5, Fisher's exact test is the standard alternative for 2×2 tables. Unlike chi-square, Fisher's exact test calculates exact probabilities rather than relying on an approximation, making it reliable even with sparse data.
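For a sense of what calculating exact probabilities means, the sketch below computes a two-sided Fisher p-value for a 2×2 table from the hypergeometric distribution. It uses one common two-sided convention (sum the probabilities of every table no more likely than the observed one); the function name and the example counts are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def table_probability(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every possible table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Hypothetical sparse table: 8 observations in total.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p={p_value:.4f}")
```

Every expected count in this table is 2, far below the threshold of 5, yet the p-value is still exact because no large-sample approximation is involved.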

Pre-test data audit checklist

Before running either test, work through these conditions against your actual dataset:

  • Outcome variable type: Is your outcome continuous or categorical? If continuous, you're in t-test territory. If categorical, chi-square applies.
  • Observation independence: Are your observations independent? No unmodeled repeated measures, no clustering, no users appearing in multiple groups. (Deliberate before/after measurements on the same subjects call for the paired t-test.)
  • Normality (t-tests): Is your data approximately normally distributed, or is your sample large enough (n > 30) for the Central Limit Theorem to cover you?
  • Variance equality (two-sample t-tests): Are the variances across groups roughly equal? If not, use Welch's t-test.
  • Expected cell counts (chi-square): Do all cells in your contingency table have expected frequencies of at least 5? If not, use Fisher's exact test.

If any of these conditions fails, identify the appropriate alternative before proceeding. Tools like GrowthBook automate several of these checks in an A/B testing context — flagging independence violations through multiple exposure detection and catching inadequate sample sizes before conclusions are drawn — but the underlying logic applies regardless of what software you're using. The assumptions don't change because the tool is convenient.

How t tests and chi square tests evaluate a null hypothesis: test statistics, p-values, and critical values

Most practitioners who run experiments interact with p-values as an output — a number that either clears the 0.05 threshold or doesn't. But understanding what the test statistic actually measures before the p-value is derived is what separates someone who can correctly interpret results from someone who can only report them.

Both t-tests and chi-square tests operate within the same logical framework, but they generate their test statistics through fundamentally different mechanisms — and conflating those mechanisms leads to misread results.

T-tests and chi-square tests share the same null hypothesis logic, despite different mechanics

Regardless of which test you're running, the null hypothesis testing workflow is the same. You start by stating a null hypothesis — the proposition that no significant difference or association exists in the data. You collect data, calculate a test statistic, then compare that statistic to a critical value derived from your chosen significance level (α). If the test statistic exceeds the critical value, you reject the null. If it doesn't, you fail to reject it.

For a two-tailed test at 95% confidence, the benchmark critical value is 1.96: the point beyond which, in absolute value, only 5% of the area under a standard normal distribution falls. That specific number applies to z-statistics and, approximately, to t-statistics from large samples. The exact critical value shifts with degrees of freedom and with the distribution being used; a chi-square test with one degree of freedom, for example, has a 5% critical value of about 3.84. The decision logic, however, stays constant: a test statistic that clears the critical value is grounds for rejection; one that doesn't leaves the null standing.
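The 1.96 figure can be reproduced from the standard normal distribution with Python's standard library. The sketch also uses a known relationship: a chi-square variable with one degree of freedom is the square of a standard normal variable, so its 5% critical value is the square of the two-tailed normal one.

```python
from statistics import NormalDist

alpha = 0.05

# Two-tailed: put alpha / 2 in each tail of the standard normal distribution.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Chi-square with 1 degree of freedom is a squared standard normal variable.
chi2_critical_df1 = z_critical ** 2

print(f"z critical:            {z_critical:.2f}")
print(f"chi-square critical (df=1): {chi2_critical_df1:.2f}")
```

Critical values for the t-distribution and for chi-square with more degrees of freedom need their own distributions, which statistical libraries or printed tables provide.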

The t-statistic: measuring signal against noise

The t-statistic is a ratio. In its simplest form, it's the observed difference between group means divided by the standard error of that difference, which reflects both how variable the data is and how many observations you have. Intuitively, a large t-statistic means the difference between groups is large relative to how spread out the underlying data is. It's a signal-to-noise ratio: the signal is the mean difference you observed; the noise is the natural variability in your sample.

If you're comparing average session durations between two product variants and your t-statistic is 2.5, that result clears the 1.96 critical value for a two-tailed test at α=0.05. The interpretation is that the observed difference is unlikely to have occurred by chance given the variability in the data. The exact critical value threshold shifts slightly based on degrees of freedom — which are tied to sample size — which is why the t-distribution is used rather than the standard normal distribution, particularly in smaller samples.

The chi-square statistic: measuring observed vs. expected discrepancy

The chi-square statistic works differently. Rather than measuring a mean difference, it measures how much the observed distribution of categorical frequencies deviates from what you would expect if the null hypothesis were true. The formula sums the squared difference between observed and expected counts in each category, divided by the expected count: χ² = Σ[(O-E)²/E].

In plain terms: for each category, take the difference between what you actually observed and what you expected, square it so that positive and negative gaps don't cancel out, divide by the expected count to normalize for scale, then add all those values together. A larger total means a bigger departure from what the null hypothesis predicts.

A large chi-square statistic means the actual pattern of category counts looks very different from the theoretical distribution you'd expect under the null. If you're testing whether users who saw a new onboarding flow are distributed differently across subscription tiers compared to the control group, a large chi-square value tells you that the category frequencies don't match what random chance would predict.

Like the t-statistic, the chi-square statistic is compared against a critical value that also depends on degrees of freedom. For a goodness-of-fit test, degrees of freedom equal the number of categories minus one; for a test of independence, degrees of freedom equal (number of row categories minus one) multiplied by (number of column categories minus one).

p-values signal rarity under the null — not importance, not magnitude

The p-value is the probability of observing a test statistic at least as extreme as the one you calculated, assuming the null hypothesis is true. A p-value below 0.05 doesn't mean your result is important or large; it means a result this extreme would be unlikely if the null were true. That distinction matters enormously in practice.
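As a sketch of how a test statistic becomes a p-value, the two helpers below compute tail probabilities using only the standard library. The first treats a t-statistic as approximately standard normal, which is reasonable only for large samples; the second uses the closed form for a chi-square statistic with one degree of freedom. Both function names are illustrative.

```python
from math import erfc, sqrt
from statistics import NormalDist

def two_sided_p_large_sample(t_stat):
    """Two-sided p-value, treating the statistic as standard normal (large n only)."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

def chi_square_p_df1(chi_square):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(chi_square / 2))

p_t = two_sided_p_large_sample(2.5)
p_chi = chi_square_p_df1(2.5 ** 2)
print(f"{p_t:.4f} {p_chi:.4f}")
```

The two values agree because a chi-square statistic with one degree of freedom is a squared standard normal statistic. Neither number says anything about how large or important the underlying effect is.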

One of the most consequential misuses of p-values is running multiple tests and treating any significant result as a finding. If you test 20 different metrics at a 5% significance level, and the tests are independent with no real effect behind any of them, the probability of finding at least one statistically significant result by chance is approximately 64%, not 5%. This is the multiple testing problem, and it inflates Type I error rates (false positives) dramatically. The inverse failure mode, a Type II error, occurs when a real effect goes undetected because the test lacked sufficient power.
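The 64% figure follows from a one-line calculation, sketched below under the assumption that the tests are independent and every null hypothesis is true. The function name is illustrative.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests

for num_tests in (1, 5, 20):
    rate = family_wise_error_rate(num_tests)
    print(f"{num_tests:>2} tests -> {rate:.1%} chance of at least one false positive")
```

Each individual test still has a 5% false positive rate. It is the chance that at least one of them fires that grows with the number of tests.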

Experimentation platforms like GrowthBook address this directly by applying multiple comparison corrections — including Bonferroni correction and the Benjamini-Hochberg procedure — automatically across experiment metrics, preventing the false positive inflation that comes from treating each p-value in isolation. Whether you're running a t-test on continuous engagement data or a chi-square test on categorical conversion outcomes, the p-value only means what it's supposed to mean when the testing framework around it is correctly structured.
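For intuition about what those two corrections do, here is a minimal sketch of the textbook Bonferroni and Benjamini-Hochberg procedures. It is not GrowthBook's implementation; the function names and p-values are assumptions for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    whose p-value is at or below (rank / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for five metrics in one experiment.
p_values = [0.045, 0.001, 0.20, 0.025, 0.012]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and keeps only the strongest result here. Benjamini-Hochberg controls the expected share of false positives among the rejections, so it is less conservative and keeps three.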

T test vs chi square: three questions that resolve the choice before you open any software

The prior sections of this article have established what each test measures, how its variants work, and what assumptions your data must satisfy. This section converts all of that into a repeatable decision process you can apply to any data problem in front of you right now.

Three questions that determine your test choice

The choice between a t-test and a chi-square test isn't a judgment call — it follows directly from three objective characteristics of your data and your research question. Work through these in order.

First: Is your outcome variable continuous or categorical? This is the primary filter. If your outcome is a number that can take a range of values — revenue, session duration, load time, engagement score — you're in t-test territory. If your outcome is a label or category — product tier selected, user segment, whether someone clicked or didn't — chi-square is the candidate. This single question resolves the majority of t test vs chi square confusion.

Second: How many groups are you comparing? T-tests are designed for one or two groups. If you're comparing a sample mean against a known value, that's a one-sample t-test. If you're comparing two groups against each other, that's an independent two-sample or paired t-test depending on your study design. If you have more than two groups and a continuous outcome, you've moved into ANOVA territory, which is covered only briefly at the end of this article but is worth knowing as the natural escalation. Chi-square tests, by contrast, work on one categorical variable (goodness-of-fit) or two (test of independence), regardless of how many categories each contains.

Third: Is your research question about comparing magnitudes or testing for association? A magnitude question sounds like: "Did average revenue increase?" An association question sounds like: "Is the product tier a customer chooses related to their industry?" These are structurally different questions, and they require structurally different tests.

T-test scenarios: continuous outcomes and mean comparisons

Consider a product team that wants to know whether average sales in Q2 were significantly higher than in Q1. The outcome, sales revenue, is continuous. The question is whether the mean changed between two time periods. This is a canonical t-test use case: "Maybe you're checking if the average sales have changed between two quarters." If each quarter's observations are separate (different orders or different customers), an independent two-sample t-test answers this directly; if the same stores or accounts are measured in both quarters, the paired variant applies. Either way, the test statistic reflects how large the mean difference is relative to the variability in the data.

A second common scenario: an engineering team ships a new feature and wants to measure whether average session duration increased in the treatment group relative to control. Session duration is continuous; the comparison is between two groups. Depending on whether the same users appear in both conditions or different users are assigned to each, this calls for a paired t-test or an independent two-sample t-test respectively. In either case, the t-test is answering a magnitude question: by how much did the mean shift, and is that shift statistically distinguishable from noise?

Chi-square scenarios: categorical outcomes and association testing

Now consider a product manager who wants to know whether the pricing tier customers select — Basic, Pro, or Enterprise — is related to their industry segment. Both variables are categorical. There's no mean to compare; the question is whether the distribution of tier selections looks different across industry segments, or whether the two variables are effectively independent. This is a chi-square test of independence — used to test if two categorical variables are associated.

A second chi-square scenario involves distribution checking rather than association. Suppose a data analyst wants to verify whether the current distribution of users across pricing tiers matches the distribution from the prior year. The analyst has observed counts and expected counts for each category. A chi-square goodness-of-fit test quantifies whether the discrepancy between observed and expected frequencies is larger than chance would predict.

The key distinction from t-test scenarios: chi-square never asks "by how much did the mean change?" It asks "does the pattern of counts across categories match what we'd expect, or is there a relationship between these categorical variables that the data reveals?"

A quick-reference decision table

| Outcome Variable | Research Question | Appropriate Test |
| --- | --- | --- |
| Continuous | Compare means between two groups | Independent two-sample t-test |
| Continuous | Compare before/after on the same subjects | Paired t-test |
| Continuous | Compare one group mean to a known value | One-sample t-test |
| Categorical | Are two categorical variables associated? | Chi-square test of independence |
| Categorical | Does observed distribution match expected? | Chi-square goodness-of-fit |
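The same lookup can be written as a small function, which is sometimes handy in analysis templates or code review checklists. This is only a sketch: the function name `choose_test` and the string keys are hypothetical labels for the five scenarios listed above.

```python
def choose_test(outcome_type, scenario):
    """Map an outcome type and research scenario to the matching test."""
    decision_table = {
        ("continuous", "two_independent_groups"): "Independent two-sample t-test",
        ("continuous", "same_subjects_before_after"): "Paired t-test",
        ("continuous", "one_group_vs_known_value"): "One-sample t-test",
        ("categorical", "two_variables_associated"): "Chi-square test of independence",
        ("categorical", "distribution_vs_expected"): "Chi-square goodness-of-fit",
    }
    try:
        return decision_table[(outcome_type, scenario)]
    except KeyError:
        # Anything outside the five scenarios needs a different tool
        # (for example, more than two groups with a continuous outcome).
        raise ValueError(
            f"No t-test or chi-square variant for {outcome_type!r} / {scenario!r}"
        ) from None

print(choose_test("continuous", "same_subjects_before_after"))
print(choose_test("categorical", "two_variables_associated"))
```

Raising an error for unlisted combinations is a deliberate choice in this sketch: it is better for the lookup to fail loudly than to fall back to a default test that does not fit the data.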

When you're setting up an experiment in an experimentation platform, this classification happens before you ever run the analysis. The data type of your outcome metric — continuous engagement score versus categorical plan selection — determines which statistical approach is valid. Getting that classification right is what makes the downstream results interpretable.

T test vs chi square: the structural constraint that makes the choice unambiguous

The core argument of this article reduces to one principle: test selection isn't a preference — it's a structural constraint imposed by your data. A t-test and a chi-square test aren't two ways to answer the same question. They answer fundamentally different questions, and the question you can legitimately ask is determined before you open any software, by whether your outcome variable is a measurement or a label.

The core distinction restated: measurements vs. labels determine your test family

If your outcome is continuous — revenue, session duration, load time — you're comparing means, and a t-test is the right family. If your outcome is categorical — plan tier, device type, whether someone converted — you're comparing distributions or testing for association, and chi-square is the right family. The decision table in the prior section captures every common scenario; if you're unsure which row you're in, the answer is almost always resolved by that first question about data type.

Selecting the correct test: a linear path from outcome variable to valid analysis

The path from data to valid test is linear, not ambiguous. Start with your outcome variable. If it's a measurement on a continuous scale, you need a t-test — then determine which variant based on your study design: one sample against a benchmark, two independent groups, or two measurements on the same subjects. If your outcome is a label or category, you need a chi-square test — then determine which variant based on your question: testing whether a distribution matches expectations (goodness-of-fit) or testing whether two categorical variables are related (test of independence).

Before running either test, verify your assumptions. Continuous data requires independence of observations, approximate normality (or sufficient sample size), and — for two-sample tests — roughly equal variances. Categorical data requires independence of observations and expected cell counts of at least 5 in every cell. If any assumption fails, the path redirects: Mann-Whitney U replaces the t-test when normality is violated in small samples; Fisher's exact test replaces chi-square when expected cell counts are too low.

This linear path applies whether you're configuring metrics in GrowthBook, writing your own analysis in Python, or reviewing someone else's work. The data type of your outcome variable is the constraint that makes the choice unambiguous — not convention, not habit, not what the person before you used.

When your data outgrows t-tests and chi-square: ANOVA and non-parametric alternatives

T-tests and chi-square tests cover a wide range of common analytical scenarios, but they have boundaries. When your data pushes past those boundaries, the appropriate response is to escalate to a test designed for the more complex structure — not to force the data into a test it doesn't fit.

The most common escalation from t-tests is ANOVA (Analysis of Variance). When you have a continuous outcome and more than two groups to compare, a t-test is no longer appropriate — running multiple pairwise t-tests inflates your Type I error rate in exactly the way described in the p-values section above. ANOVA tests whether any group means differ across three or more groups simultaneously, controlling the error rate across the full comparison. If ANOVA returns a significant result, post-hoc tests identify which specific pairs differ.

For non-parametric situations — where your continuous data violates normality assumptions and your sample is too small for the Central Limit Theorem to compensate — the Kruskal-Wallis test is the multi-group equivalent of Mann-Whitney U. It compares rank distributions across three or more groups without assuming normality.

For categorical data with more complex structures — ordered categories, repeated measures on categorical outcomes, or sparse contingency tables larger than 2×2 — logistic regression and log-linear models provide more flexible frameworks than chi-square. These methods handle the complexity that chi-square's simpler structure cannot accommodate.

Knowing when to escalate is as important as knowing which test to use in the first place. The goal is always the same: match the test to the actual structure of your data and your research question, not the other way around.

What to do next: Identify whether your outcome variable is continuous or categorical. If continuous, determine whether you are comparing one group to a benchmark, two independent groups, or two measurements on the same subjects — then select the corresponding t-test variant. If categorical, determine whether you are testing for association between two variables or checking whether a distribution matches expectations — then select chi-square goodness-of-fit or test of independence accordingly. Verify your assumptions before running the test. If any assumption fails, use the alternative identified in the assumptions section above.
