The Setup

We're testing whether a new supplement increases cognitive performance.
We recruit participants and randomly assign them to two groups:

Group A: Takes the supplement
Group B: Takes a placebo (sugar pill)
The secret you know (but a researcher wouldn't): In this simulation,
both groups are pulling scores from the exact same random distribution.
The supplement does absolutely nothing. Any difference you see is pure noise.
Your task: Keep running experiments until you find a “statistically significant” result (p < 0.05).
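If you'd rather script the experiment than click the button below, here's a minimal sketch of that loop in Python. The distribution parameters (Normal with mean 50, SD 10) and group size (20) are assumptions; the page doesn't pin them down.

```python
# A minimal sketch of the simulation, not the page's actual source code.
# Assumptions: scores ~ Normal(50, 10), 20 people per group. Requires
# NumPy and SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
TRUE_MEAN, SD, N = 50, 10, 20

experiments = 0
while True:
    experiments += 1
    group_a = rng.normal(TRUE_MEAN, SD, N)  # "supplement" (does nothing)
    group_b = rng.normal(TRUE_MEAN, SD, N)  # "placebo"
    t, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05 and group_a.mean() > group_b.mean():
        break

print(f"'Significant' after {experiments} runs: "
      f"A={group_a.mean():.1f}, B={group_b.mean():.1f}, p={p:.3f}")
```

Since the t-test here is two-sided and we also insist that Group A comes out on top, only about 2.5% of runs qualify, so the loop typically takes a few dozen tries.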
[Interactive simulation: dot plots of scores for Group A (Supplement) and Group B (Placebo), with the true mean (50) marked and each group's mean score displayed. A "Run Experiment" button draws a new sample; a counter tracks experiments run.]
You Found Something!

Group A scored significantly higher than Group B (p < 0.05). The supplement appears to work!
Look at the plot above. The purple dots (Group A) cluster toward the right.
It looks like the supplement worked.
But wait. A skeptic might say: “Maybe you just got lucky and randomly
assigned the naturally high-performers to Group A. Maybe those specific people are just… better.”
The Question
Is Group A genuinely superior? Or was it just a lucky draw?
There's only one way to find out: Test the exact same people again.
The Re-Test Results
We tested the exact same participants one week later.
If Group A was truly special, they should score high again.
[Results panel: "Original Test" and "Re-Test (Same People)" side by side, each showing Group A mean, Group B mean, and the difference.]
The Illusion Evaporated
Group A's “superiority” vanished. On re-test, their scores regressed toward the average.
They weren't special—they just had a lucky day.
This is regression to the mean. When you select a group because it scored
high once, its scores will, on average, fall back toward the mean on the
next test, not because anything changed, but because extreme results are rarely repeated.
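Here's a sketch of why the re-test deflates, keeping the simulation's premise that every score is an independent draw from the same distribution (Normal(50, 10) assumed again): select the top scorers from one test, then test them a second time.

```python
# Under the pure-noise premise, a high first score predicts nothing
# about the second: the "lucky" group's advantage disappears on re-test.
import numpy as np

rng = np.random.default_rng(seed=2)
test1 = rng.normal(50, 10, 10_000)

lucky = test1 > np.percentile(test1, 90)  # everyone in the top 10% once
test2 = rng.normal(50, 10, 10_000)        # the same people, one week later

print(f"Lucky group, test 1: {test1[lucky].mean():.1f}")  # ~67.5
print(f"Lucky group, test 2: {test2[lucky].mean():.1f}")  # ~50: fully regressed
```

In real data, where stable skill exists alongside the noise, the regression is partial rather than total, but the direction is the same.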
The Lesson
A p-value < 0.05 doesn't mean you found truth. It means you found something unusual enough
to be worth investigating. Test 20 true-null hypotheses at that threshold and, on average,
one will come up "significant" by pure chance. This is p-hacking: running enough tests
until the noise aligns in your favor, then publishing only that result.
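The "1 in 20" figure is simple arithmetic, and the odds of getting at least one false positive are worse than they sound:

```python
# Back-of-the-envelope check: with 20 independent true-null tests at
# alpha = 0.05, about one false positive is expected, and the chance
# of at least one is roughly 64%.
alpha, n_tests = 0.05, 20
print(alpha * n_tests)            # expected false positives: 1.0
print(1 - (1 - alpha)**n_tests)   # P(at least one): ~0.64
```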