How many observations do you need? Power analysis finds the Goldilocks number—large enough to detect real effects, small enough not to waste resources.

Why Sample Size Matters

In any study—clinical trial, A/B test, or marketing experiment—you estimate population parameters using a sample. The core question: how many observations do you need?

Too small: You lack statistical power to detect real effects. Time, money, and ethical capital are wasted on inconclusive results.

Too large: You waste resources and may detect statistically significant but practically meaningless differences.

The Four Levers

Power analysis balances four interconnected parameters. Adjusting one changes the required sample size:

α (Alpha)

Type I error risk: probability of a false positive.

Standard: 0.05 (5% risk). Lower α → larger sample needed.

1−β (Power)

Probability of detecting a real effect when one exists.

Standard: 0.80 or 0.90. Higher power → larger sample needed.

δ (Raw Difference)

MCID (minimum clinically important difference): the smallest difference that matters in practice.

Smaller difference to detect → much larger sample needed.

σ (Standard Deviation)

Noise level: natural data fluctuation.

More noise → larger sample needed to separate signal.

The Formula

For a two-sample t-test with equal groups, the sample size per group is:

Iterative T-Based Formula

n = 2 × (tα/2,df + tβ,df)² / d²

where d = δ/σ (Cohen’s d) and df = 2(n − 1)

Why Iterative?

The degrees of freedom (df) depend on n, and the critical t-value depends on df, so the formula is circular. We solve it iteratively: start with a Z-approximation, compute df, look up the t-critical value, recompute n, and repeat until it converges. The Calculator tab does this automatically.

Since Cohen’s d = δ / σ, a smaller raw difference (δ) or a larger standard deviation (σ) both shrink d, which increases the required sample size. This is why noisy data and small effects are so expensive to study.
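The iterative scheme can be sketched in a few lines. This is a minimal illustration, assuming SciPy is available; the function name sample_size_per_group is ours, not from any library:

```python
import math
from scipy.stats import norm, t

def sample_size_per_group(d, alpha=0.05, power=0.80, tol=1e-6):
    """Iteratively solve n = 2 * (t_{alpha/2,df} + t_{beta,df})^2 / d^2."""
    # Step 1: start from the Z-approximation (normal critical values)
    z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n = 2 * (z_sum / d) ** 2
    for _ in range(100):
        df = 2 * (math.ceil(n) - 1)            # df depends on the current n
        t_sum = t.ppf(1 - alpha / 2, df) + t.ppf(power, df)
        n_new = 2 * (t_sum / d) ** 2           # recompute n with t criticals
        if abs(n_new - n) < tol:               # stop once n stabilizes
            break
        n = n_new
    return math.ceil(n)                        # round up to whole observations

print(sample_size_per_group(0.5))  # medium effect, alpha=0.05, power=0.80 -> 64
```

Note how the Z-approximation starts slightly low (about 63 here) and the t-based iterations push it up to 64, exactly the gap between Z and t discussed below.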

Why We Use the T-Distribution

Many introductory texts present the sample size formula using Z-scores from the normal distribution. That is an approximation. Here is why the t-distribution is the correct choice—and what changes.

The Problem with Z

The normal (Z) distribution assumes you know the true population standard deviation σ. In practice, you never do—you estimate it from the sample. That extra uncertainty means the test statistic follows a t-distribution, not a normal.

The t-distribution is wider than the normal, especially for small samples. This means t-critical values are larger than Z-critical values, which means you need more observations to achieve the same power.

Degrees of Freedom

The t-distribution’s shape depends on degrees of freedom (df). For a two-sample t-test with n observations per group:

df = 2(n − 1) = 2n − 2

As df grows (larger samples), the t-distribution converges to the normal. At df ≥ 120, the difference is negligible. At df = 10, it matters a lot.

How Much Does It Matter?

Critical values at α = 0.05 (two-sided):

  Z-critical                 1.960
  t-critical (df=18, n=10)   2.101
  t-critical (df=38, n=20)   2.024
  t-critical (df=98, n=50)   1.984

At n=10 per group, using Z instead of t underestimates the required sample size by ~7%. At n=50, the error drops to ~1%. Our calculator uses the exact t-distribution with iterative convergence, so it is accurate at all sample sizes.
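The comparison above can be reproduced directly from the distributions, a quick sketch assuming SciPy:

```python
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # two-sided Z-critical, about 1.960
for n in (10, 20, 50):
    df = 2 * (n - 1)               # two-sample t-test with n per group
    t_crit = t.ppf(1 - alpha / 2, df)
    # The gap t_crit - z_crit shrinks as df grows
    print(f"n={n:2d}, df={df:3d}: t-critical={t_crit:.3f}  Z-critical={z_crit:.3f}")
```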

Connecting the Dots
This is the same t-distribution used in our Reading Regressions page (for t-stats and p-values) and in the P-Hacking Simulator (for Welch’s t-test). The math is identical across all three pages.

Reference Tables

Need to look up a critical value by hand? Use our reference tables:

T-Distribution Table →   |   F-Distribution Table →

Two-Sample T-Test Calculator

Set your parameters below. The sample size updates automatically using the exact t-distribution with iterative convergence.

[Interactive calculator: with α = 0.05, power = 0.80, and Cohen’s d = 0.50 (δ = 0.5, σ = 1.0), the result is n = 64 per group, 128 total; the display also reports the degrees of freedom and the t-critical value.]

Sample Size vs. Cohen’s d

This chart shows how the required sample size changes across a range of effect sizes, holding your current α and power constant. The red dot marks your current d.

Reading the Curve
The curve is steep for small effect sizes (d < 0.3): tiny effects require enormous samples. As d grows, sample requirements drop rapidly. This is why defining a realistic MCID (minimum clinically important difference) is so critical—it determines where you land on this curve.

What Power Looks Like

Power is the probability of rejecting the null hypothesis when the alternative is true. The visualization below shows two t-distributions (not normal—notice the heavier tails):

Blue curve: the null t-distribution (H0: no effect).
Green curve: the alternative t-distribution (HA: true effect = δ).
Red dashed line: the t-critical value. Anything to the right is rejected.
Green shaded area: this is power—the probability of correctly rejecting H0.

[Interactive visualization: with effect size 1.50, sample size 30, and α = 0.05, the panel reports power (1−β), β (Type II error), the t-critical value, and the degrees of freedom.]
Try It
Drag Sample Size up and watch the green area grow—that is power increasing. With small n, notice the t-distributions have heavier tails (more uncertainty). As n grows, they sharpen toward the normal shape. Drag Effect Size toward zero to see the curves overlap, making signal indistinguishable from noise.
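The green shaded area can also be computed numerically: under the alternative, the test statistic follows a noncentral t-distribution, and power is the mass of that distribution beyond the critical value. A sketch, assuming SciPy; power_two_sample is our own helper name:

```python
import math
from scipy.stats import t, nct

def power_two_sample(d, n, alpha=0.05):
    """Power of a two-sided, two-sample t-test with n observations per group."""
    df = 2 * (n - 1)
    t_crit = t.ppf(1 - alpha / 2, df)   # the red dashed line
    ncp = d * math.sqrt(n / 2)          # noncentrality parameter of the alternative
    # Green shaded area: alternative-distribution mass beyond the critical value
    # (the lower-tail contribution is negligible for positive effects and omitted)
    return 1 - nct.cdf(t_crit, df, ncp)
```

For example, power_two_sample(0.5, 64) comes out just above 0.80, matching the sample size the calculator reports for a medium effect.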

Common Mistakes

  1. Using “small/medium/large” effect sizes without domain context. Define the MCID specific to your problem—what is the smallest difference that would change a decision?
  2. Underestimating variance (σ) from pilot data. Be conservative—if uncertain, overestimate noise slightly. Underestimation guarantees an underpowered study.
  3. Ignoring budget constraints then pretending a reduced sample still has 80% power. Acknowledge when your study is underpowered.
  4. Using one-sided tests to shrink the sample size. Default to two-sided unless you can prove the intervention cannot cause harm.
  5. Computing “observed power” after a study using the observed effect size. Retrospective power analysis is statistically circular and provides no new information. Power is strictly a pre-study planning tool.
  6. Using the Z-approximation for small samples. The normal distribution underestimates the required sample size when n is small. Always use the t-distribution (as our calculator does).

Workflow

  1. Define your hypothesis type: equality (superiority), non-inferiority, or equivalence.
  2. Set α and power: typically α = 0.05, power = 0.80–0.90. Justify deviations based on error consequences.
  3. Define MCID (δ): the smallest raw effect that would change practice or justify costs. Consult stakeholders.
  4. Estimate variance (σ): use pilot data, published literature, or conservative estimates.
  5. Calculate sample size: use our Calculator tab, or dedicated tools (G*Power, PASS) for more complex designs.
  6. Run sensitivity analysis: show how sample size changes with reasonable variations in δ and σ. This demonstrates robustness.
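Step 6 is easy to automate: rerun the sample size calculation over a grid of plausible δ and σ values and see how much n moves. A self-contained sketch assuming SciPy, with illustrative grid values:

```python
import math
from scipy.stats import norm, t

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Iterative t-based sample size for a two-sided, two-sample test."""
    d = delta / sigma                   # Cohen's d
    n = 2 * ((norm.ppf(1 - alpha / 2) + norm.ppf(power)) / d) ** 2
    for _ in range(100):                # iterate until df stabilizes
        df = 2 * (math.ceil(n) - 1)
        n_new = 2 * ((t.ppf(1 - alpha / 2, df) + t.ppf(power, df)) / d) ** 2
        if abs(n_new - n) < 1e-6:
            break
        n = n_new
    return math.ceil(n)

# Sensitivity grid: vary the MCID and the noise estimate around your best guess
for delta in (0.4, 0.5, 0.6):
    for sigma in (0.9, 1.0, 1.1):
        print(f"delta={delta}, sigma={sigma}: n={n_per_group(delta, sigma)} per group")
```

If the grid spans, say, 45 to 110 per group, budget for the pessimistic corner rather than the optimistic one; an overestimated σ costs money, an underestimated one costs the study.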
The Biggest Risk
The greatest risk in research is not failing to find an effect—it is designing a study incapable of answering your question. Power analysis is not a formality; it is an ethical obligation to use resources responsibly.