MSE 125 — Slides – Lecture 9: Permutation Tests

the bootstrap says the drug works

but a skeptic asks: how surprised should we be if the drug did nothing at all?

logistics

HW 2 review sessions: this week
project meeting with CA: before May 1; sign up! https://stanford-mse-125.github.io/web/project
quiz 4: Wed Apr 29 — bootstrap + permutation tests (Lec 8-9)
project proposal: due Fri May 1
HW 3: due Fri May 8

today

permutation tests: shuffle labels to simulate “no effect”
the null distribution: 10,000 permutations map out what chance looks like
the p-value: how surprised should we be?
second example: do NBA refs favor the home team? — association vs causation
one-sided vs two-sided tests — and why the default is two-sided

The Lady Tasting Tea — your turn

Rothamsted, 1920s. Muriel Bristol (algae researcher) claims she can taste whether milk went into the cup before or after the tea.

R.A. Fisher’s test: 8 cups, 4 of each kind, in random order — Bristol picks out the 4 “milk first” cups.

your turn. on scrap paper, write M or T for each of positions 1–8. Pick 4 of each — make your best guess at the sequence.

…and the actual sequence:

M · T · M · M · T · T · T · M

show of hands: who got all 4 milk-firsts right?

\(\binom{8}{4} = 70\) arrangements → pure guessing hits the right answer 1 in 70 (~1.4%)

with ~120 of us guessing, we’d expect 1 or 2 winners by luck alone

Bristol got all eight right: a 1-in-70 event under pure guessing. that is hard to blame on luck, and the reasoning is a permutation test in miniature.

Fisher (1935) The Design of Experiments; Salsburg (2001) The Lady Tasting Tea.

The origin story for permutation tests — and a live demonstration. Bristol was a serious scientist at Rothamsted (algae researcher) who claimed she could taste the order of milk/tea pouring. Fisher took her seriously and designed the test on the next slide.

Facilitation for the interactive part:
1. On slide 1, read the setup and prompt. Give students 15 seconds to write M/T/M/…, picking 4 of each. If they pick 3 or 5 milks, that’s still guessing; don’t police the constraint.
2. Flip to the reveal slide. The sequence M T M M T T T M is just one of the 70 arrangements: the “actual” for today’s demo. You can pick a different sequence if you prefer; change the slide body to match.
3. Ask for a show of hands: who got all 4 milk-firsts correct? There are 70 arrangements; under pure guessing, each student has a 1/70 ≈ 1.4% chance. In a ~120-person class, expect 1-2 hands up; occasionally zero.
4. Land the lesson: Fisher asked “what would guessing look like?” and answered it combinatorially. Bristol got all 8 right, a 1-in-70 event under guessing. That either means she can really tell, or we just witnessed a 1-in-70 coincidence. This is exactly the shape of the reasoning we’re about to apply to a roughly 2,100-patient HIV trial.
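The counting behind the 1-in-70 figure can be checked directly. This is a minimal sketch, assuming each guesser picks 4 of the 8 cups uniformly at random and that students guess independently; the variable names are illustrative, and the class size of 120 is the approximate figure from the slide.

```python
from math import comb

n_cups, n_milk = 8, 4
n_arrangements = comb(n_cups, n_milk)   # ways to choose which 4 cups are "milk first"

# Arrangements in which a guesser gets exactly k of the 4 milk-first cups right:
# choose k of the 4 true milk cups and the other 4 - k picks from the 4 tea cups.
ways = {k: comb(4, k) * comb(4, 4 - k) for k in range(5)}

for k, w in ways.items():
    print(f"{k} correct: {w:2d} / {n_arrangements} = {w / n_arrangements:.3f}")

p_perfect = ways[4] / n_arrangements          # chance of a perfect guess
class_size = 120                              # approximate class size from the slide
expected_winners = class_size * p_perfect
p_no_winner = (1 - p_perfect) ** class_size   # assumes independent guessers
print(f"expected perfect guessers: {expected_winners:.2f}")
print(f"chance of zero hands up:   {p_no_winner:.2f}")
```

The table of counts is the full null distribution for this experiment: only 1 of the 70 arrangements is a perfect score, which is why a perfect score is the only outcome that counts as strong evidence.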

if the drug does nothing, labels don’t matter

ACTG 175 — where we left off

treatment_mean = 33.3     # CD4 change, combination therapy
control_mean   = -17.1    # CD4 change, AZT monotherapy

observed_effect = 33.3 - (-17.1)   # = 50.4 CD4 cells

bootstrap 95% CI: [39.6, 61.3]; the CI excludes 0, so the data are inconsistent with “no effect” at the 5% level

so why do a hypothesis test?

history. p-values (Fisher, 1925) came first; CIs (Neyman, 1937) are the inversion of a test
generality. tests extend where CIs don’t — counts, orderings, whole distributions (the Lady Tasting Tea)
language. reject, α, p-value — how science reports evidence

Quick recap of the data — not a recap of concepts, just the numbers we’ll use. Treatment group gained 33 cells on average, control lost 17 (the “control” is AZT monotherapy, not placebo — ethics of 1991 HIV trials). Observed effect 50.4 CD4 cells. Bootstrap CI excludes zero, so the data are already inconsistent with “no effect” at the 5% level — Ch 10 formalizes the link to tests.

So why spend a chapter on hypothesis tests when the CI does the same work? Three honest reasons: (1) History — Fisher’s p-value came a decade before Neyman’s CI. Neyman invented the CI by inverting a test, not the other way around. (2) Generality — the permutation test extends cleanly to settings where a CI isn’t the right shape (the Lady Tasting Tea we just saw is about ordering, not a mean). (3) Language — p-value, reject, α, Type I error are the lingua franca of scientific and industry reporting. You’ll see them in every paper, clinical trial, and A/B test; you need to speak the language. The two frameworks agree on the same evidence — Ch 10 makes the link precise — but the test vocabulary is what you’ll encounter in the wild.

the key insight

if the drug has no effect, each patient’s CD4 change is determined by the patient — not by the treatment label

treatment labels are meaningless

so shuffling them should produce results that look like the real data

but if our observed effect is way larger than anything shuffling produces — the drug works

observed effect: +50.4 CD4 cells

shuffle treatment/control labels under the null. what difference do you expect?

A. still about +50
B. close to 0
C. could be anything
D. exactly 0

DISCUSSION: Predict-then-reveal (3 min). Commit to A, B, C, or D; defend to a neighbor; debrief. Process goal: establish the intuition that shuffled effects should be near zero before formalizing the null distribution. Correct answer: B — close to 0 but not exactly 0. The pool of CD4 changes has a grand mean somewhere between the two group means. Randomly splitting into two groups of similar size gives means that are close to the grand mean, so their difference is close to zero. Not exactly zero — sampling variation means each shuffle gives a slightly different split. And not “could be anything” — the law of large numbers keeps shuffled means close to the grand mean. A would only happen if the drug actually works, which is what the null denies. If stuck: “If labels don’t matter, what determines the two group means? Just luck of the draw.” Key insight: Under the null, the expected difference is exactly 0, but any one shuffle gives a small random deviation around 0.
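The claim that the expected shuffled difference is exactly 0 follows from a one-line calculation. A sketch, writing \(\bar{x}\) for the grand mean of the pooled CD4 changes and \(\bar{X}^*_T\), \(\bar{X}^*_C\) for the means of the two shuffled groups (notation introduced here, not on the slides):

\[
E\big[\bar{X}^*_T\big] = E\big[\bar{X}^*_C\big] = \bar{x}
\quad\Longrightarrow\quad
E\big[\bar{X}^*_T - \bar{X}^*_C\big] = \bar{x} - \bar{x} = 0
\]

Each shuffled group is a uniformly random subset of the pool, so every value is equally likely to land in either group; that is why both expectations equal the grand mean. Any single shuffle still deviates from 0 by chance.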

five shuffles

import numpy as np

np.random.seed(42)   # so everyone sees the same shuffles

def permutation_diff_of_means(values, n_first):
    # shuffle the pooled values, then treat the first n_first as one fake group
    shuffled = np.random.permutation(values)
    return shuffled[:n_first].mean() - shuffled[n_first:].mean()

Permutation 1: fake effect = +3.2 CD4 cells
Permutation 2: fake effect = -1.8 CD4 cells
Permutation 3: fake effect = +0.5 CD4 cells
Permutation 4: fake effect = -4.1 CD4 cells
Permutation 5: fake effect = +2.7 CD4 cells

fake effects bounce around zero — observed effect was +50.4

permutation tests: the recipe

combine all CD4 changes into one pool (ignore labels)
randomly assign \(n_C\) to “control,” the remaining \(n_T\) to “treatment”
compute the test statistic on the fake groups
repeat many times
compare the observed effect to the distribution of fake effects

steps 1-4 build the null distribution

step 5 asks: how extreme is our result?
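The five steps can be sketched end to end. This is a minimal illustration on synthetic data, not the ACTG 175 dataset: the group sizes, means, and spread are made-up values, and `permutation_test` is a hypothetical helper name.

```python
import numpy as np

def permutation_test(treatment, control, n_perms=10_000, seed=0):
    """Two-sided permutation test for a difference in means (treatment - control)."""
    rng = np.random.default_rng(seed)
    observed = treatment.mean() - control.mean()

    # step 1: pool all outcomes, ignoring labels
    pooled = np.concatenate([treatment, control])
    n_t = len(treatment)

    # steps 2-4: shuffle, split into fake groups, recompute the statistic, repeat
    fake_effects = np.empty(n_perms)
    for i in range(n_perms):
        shuffled = rng.permutation(pooled)
        fake_effects[i] = shuffled[:n_t].mean() - shuffled[n_t:].mean()

    # step 5: how often is a fake effect at least as extreme as the observed one?
    n_extreme = np.sum(np.abs(fake_effects) >= np.abs(observed))
    p_value = (n_extreme + 1) / (n_perms + 1)
    return observed, fake_effects, p_value

# synthetic stand-in data (assumed values, for illustration only)
rng = np.random.default_rng(1)
fake_treatment = rng.normal(loc=30, scale=100, size=400)
fake_control = rng.normal(loc=-15, scale=100, size=400)

observed, fake_effects, p_value = permutation_test(fake_treatment, fake_control)
print(f"observed effect: {observed:+.1f}")
print(f"null mean: {fake_effects.mean():+.2f}, null SD: {fake_effects.std():.2f}")
print(f"two-sided p-value: {p_value:.4f}")
```

Returning the full array of fake effects, not just the p-value, makes it easy to plot the null distribution or to recompute a one-sided p-value later from the same shuffles.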

now we need vocabulary

null hypothesis

the default claim a test tries to disprove

here: “the drug has no effect on CD4 count”

test statistic

the number we compute from the data to measure the effect

here: difference in group means

the permutation test — definition

permutation test

a hypothesis test that builds the null distribution by shuffling group labels and recomputing the test statistic

null distribution

the distribution of the test statistic when the null hypothesis is true

key assumption: labels are exchangeable under the null

from the null hypothesis to the null distribution

null hypothesis — a claim about the world: “the drug has no effect”
null distribution — the distribution of the test statistic if that claim were true

the permutation test builds the second from the first — by shuffling

why does shuffling simulate from the null distribution?

ACTG 175 patients were randomly assigned to treatment groups

exchangeability

swapping labels does not change the joint distribution of the data under the null

random assignment guarantees exchangeability:

under the null, the drug has no effect on outcomes
outcomes don’t depend on labels
shuffled data is just as plausible as the original

building the null distribution

n_perms = 10_000
perm_effects = np.array([
    permutation_diff_of_means(all_cd4, n_T)
    for _ in range(n_perms)
])

Null distribution summary:
  Mean: +0.01
  SD:   5.94
  Min:  -22.40
  Max:  +21.70

tightly centered around zero — SD is about 6 cells

our observed effect of 50 cells is roughly \(50 / 6 \approx 8\) SD away

the null distribution — visualized

where do you expect the observed +50.4 to land?

not even close to what random shuffling produces

the p-value

p-value

the probability of observing a result at least as extreme as what you got, assuming the null hypothesis is true

the p-value answers: if there were truly no effect, how often would we see something this extreme?

computing the p-value

two-sided: count permutations where \(|\text{fake effect}| \geq |\text{observed effect}|\) — extreme in either direction

n_extreme = np.sum(np.abs(perm_effects) >= np.abs(observed_effect))
p_value = (n_extreme + 1) / (n_perms + 1)

Permutations with |effect| >= |50.4|: 0
p-value: 0.0001

Two-sided counting rule: permutations whose absolute effect is at least as large as 50.4, in either direction — a drug that dramatically helped or dramatically hurt would both count. Zero of 10,000 permutations reached it. Report p ≈ 0.0001.

Why \(+1/+1\)? The identity permutation is also a permutation, and under the null it’s equidistributed with every other. Counting the observed data as one of the \(m+1\) permutations prevents \(p = 0\) exactly, which no finite simulation can earn. So \(\frac{0 + 1}{10{,}000 + 1} \approx 0.0001\) is the simulation floor. The estimator is conservative — its expected value under the null is at least the true tail probability, so false positives are still controlled at the nominal level (Phipson & Smyth 2010). Say one line if asked (“we count the observed data as one extra permutation to avoid reporting p = 0 from a finite simulation”); don’t read the whole justification unless asked.
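The simulation floor is easy to see numerically. A minimal sketch, where `permutation_p_value` is a hypothetical helper that applies the +1/+1 rule to a count of extreme shuffles:

```python
def permutation_p_value(n_extreme, n_perms):
    """+1/+1 estimate: count the observed labeling as one extra permutation."""
    return (n_extreme + 1) / (n_perms + 1)

# With zero extreme shuffles, the smallest reportable p-value depends only on
# how many shuffles were run.
for n_perms in (100, 1_000, 10_000, 100_000):
    floor = permutation_p_value(0, n_perms)
    print(f"{n_perms:>7,} shuffles -> smallest reportable p = {floor:.6f}")
```

Read this way, a reported p ≈ 0.0001 from 10,000 shuffles means “at or below what this simulation can resolve”; running more shuffles lowers the floor but does not change the data.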

overwhelming evidence

ACTG 175: \(p \approx 10^{-4}\) — the observed effect lives in a place the null never visits

hold onto this number — we’ll revisit what “overwhelming” feels like once we see data where the evidence is much thinner

what the p-value is

a probability statement about the data, given the null

conditions on the null — if no effect, the data would be unlikely
continuous evidence, not a verdict — \(p \approx 10^{-4}\) far stronger than \(p \approx 0.04\)
observed data only — replication, mechanism, priors not in the formula

a skeptic says:

“a p-value of 0.0001 means there’s only a 0.01% chance the drug doesn’t work”

is the skeptic right?

DISCUSSION: Think-pair-share (4 min total). Think and jot; pair and compare; debrief. Hint to deliver verbally if students get stuck: “what does the p-value condition on?” Process goal: this is the single most important misconception in the entire course. Students must articulate the error in the skeptic’s reasoning before we reveal the correction. Correct answer: the skeptic is wrong. The p-value is the probability of data this extreme given the null is true — not the probability the null is true given the data. The direction of the conditional matters enormously. P(extreme data | null true) is what we computed. P(null true | extreme data) is what the skeptic claims, which would require Bayes’ theorem and a prior probability on the null hypothesis. If stuck: “Analogy: P(it’s raining | I have an umbrella) is not the same as P(I have an umbrella | it’s raining). The order of the conditional matters.” Key insight: The p-value lives in a world where the null IS true and asks how surprising the data are. It does not tell you the probability that the null is true.
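A small calculation shows why the flipped conditional needs a prior. This is a sketch with hypothetical numbers: the value 0.5 for the chance of such data when the drug works, and the three priors, are assumptions chosen for illustration. It also simplifies by treating the p-value as if it were P(data | null), whereas a p-value is a tail probability.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """Bayes' theorem for two hypotheses: P(null | data)."""
    numerator = p_data_given_null * prior_null
    denominator = numerator + p_data_given_alt * (1 - prior_null)
    return numerator / denominator

p_data_given_null = 0.0001   # stand-in for "data this extreme if the drug does nothing"
p_data_given_alt = 0.5       # hypothetical: chance of such data if the drug works

for prior_null in (0.5, 0.99, 0.9999):
    post = posterior_null(prior_null, p_data_given_null, p_data_given_alt)
    print(f"prior P(null) = {prior_null:<6} -> P(null | data) = {post:.4f}")
```

The same P(data | null) yields very different values of P(null | data) depending on the prior, which is why the skeptic’s 0.01% cannot be read off the p-value alone.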

three traps

NOT the probability \(H_0\) is true — the skeptic’s flipped conditional

NOT the probability the result will replicate

p = 0.049 and p = 0.051 are not meaningfully different — the 0.05 threshold is a convention

a p-value is a continuous measure of evidence, not a binary verdict

do refs favor the home team?

folk hypothesis: home teams get called for fewer fouls

same machinery, different interpretation

NBA personal fouls — the data

# player-level logs -> team-game foul totals
team_game = (logs.groupby(['GAME_ID', 'TEAM_ABBREVIATION', 'MATCHUP'],
                          as_index=False)['PF'].sum())
team_game['home'] = team_game['MATCHUP'].str.contains('vs.', regex=False)  # match the literal text 'vs.', not a regex

home_pf = team_game[team_game['home']]['PF'].values
away_pf = team_game[~team_game['home']]['PF'].values
obs_diff_nba = home_pf.mean() - away_pf.mean()

home:  19.36 fouls per game   (n = 3,690 team-games)
away:  19.54 fouls per game   (n = 3,690)
observed gap:   -0.18 fouls per game

same null, different interpretation

key difference from ACTG 175: no random assignment here

teams play each opponent home and away — schedule is fixed, not randomized

ACTG 175: “the drug has no effect” — labels exchangeable by random assignment

NBA: “home and away foul counts come from the same distribution”

same null, same permutation mechanics (shuffle labels, recompute)

but rejecting the null only tells us the groups differ — not why

suppose the permutation test finds: home teams get fewer foul calls

can we conclude referee bias?
if not — name three alternative mechanisms confounded with home/away

DISCUSSION: Think-pair-share (4 min). Think and jot; pair and compare; debrief by collecting explanations on the board. Prompt: Can we conclude referee bias from the NBA permutation test? Process goal: reinforce association vs. causation; students should generate confounders themselves before we rule out “ref bias” as the only story. Correct answer: No; no random assignment means no causal claim. Referee bias is one possibility but not the only one. Alternative mechanisms students should reach:
- fatigue and travel: the visiting team has almost always traveled more recently; tired defenders reach in instead of sliding their feet, and reaching fouls get called
- scheduling: back-to-backs (two games on consecutive nights) hit the visiting side of a given date more often over a season; compounds fatigue
- defensive style: a team may play more (or less) aggressively at home; crowd pressure pushes both ways
- familiarity: home team knows its own sightlines, rim stiffness, sideline location; visitors misposition
- referee bias: the hypothesis we started with

Closest thing to a natural experiment: the 2020 “bubble” playoffs in Orlando, where every game was on a neutral court with no fans. The permutation test detects the gap but cannot decompose which mechanism produces it; that requires different comparisons. If stuck: “If two teams played on Mars with no fans, would home teams still get called for fewer fouls? Which mechanisms would survive?”

NBA permutation — shuffle labels, recompute

all_pf = np.concatenate([home_pf, away_pf])
n_home = len(home_pf)

perm_diffs = np.array([
    permutation_diff_of_means(all_pf, n_home)
    for _ in range(10_000)
])

n_extreme = np.sum(np.abs(perm_diffs) >= np.abs(obs_diff_nba))
p_value = (n_extreme + 1) / (len(perm_diffs) + 1)

print(f"permutation p-value (two-sided): {p_value:.4f}")

permutation p-value (two-sided): 0.0539

NBA fouls — null distribution

observed gap sits at the edge of the null — not outside it

NBA:  ~540 / 10,000 permutations at least as extreme
ACTG:     0 / 10,000

marginal evidence — contrast with ACTG’s \(p \approx 10^{-4}\)

what if the question were narrower?

“home teams get called for fewer fouls”

not “they get called for different fouls”

one-sided vs two-sided

two-sided vs one-sided tests

two-sided — count fake effects at least as extreme as \(|\text{obs}|\) in either direction

one-sided — count only one tail (direction chosen in advance)

Bridge to the one-sided idea. The NBA question we just ran was two-sided — “do the groups differ in either direction” — because we counted fake gaps extreme on both sides. But the folk hypothesis that motivated the analysis was directional: home teams are called for FEWER fouls. That’s one-sided. We only count fake gaps at least as negative as the observed one.

The picture pairs the two definitions visually: same null distribution on both panels, same observed statistic. Left panel shades both tails beyond ±observed — two-sided, p ≈ 0.054. Right panel shades only the lower tail (the pre-registered direction: home − away < 0) — one-sided, p ≈ 0.029. Same null, different counting rule, different verdict at α=0.05. “Extreme” has a precise meaning that depends on the question.

Under a symmetric null, the one-sided p-value is roughly half the two-sided. “Roughly,” because a finite set of shuffles is not perfectly symmetric about zero, so the two tails at ±|obs| hold slightly different counts, and the +1/+1 adjustment adds one to each count rather than scaling with it. In our actual NBA run, 2 × 0.0288 = 0.0576, not 0.0539 exactly. The approximate 2× relationship is the point, and the next slide makes it concrete.

the verdict flips

same data. same null distribution. same observed statistic.

# two-sided: |fake| >= |obs|  — either direction
two_sided = (np.sum(np.abs(perm_diffs) >= np.abs(obs_diff_nba)) + 1) / (len(perm_diffs) + 1)

# one-sided lower: fake <= obs  — pre-registered: home − away < 0
one_sided = (np.sum(perm_diffs <= obs_diff_nba) + 1) / (len(perm_diffs) + 1)

Two-sided p-value:   0.0539    →  FAIL to reject at α = 0.05
One-sided p-value:   0.0288    →  REJECT at α = 0.05

under a symmetric null: two-sided p ≈ 2 × one-sided p — half the tail, half the p-value

the analytic choice — not the data — drives the conclusion — direction must be chosen before seeing the data

This is THE punchline of the one-sided section, and the big upgrade from the old scoring example (where both p-values were at the simulation floor, so the factor-of-two never did anything visible). Here, on real data, the two-sided test fails to reject and the one-sided test rejects. The analyst’s choice of test, not the numbers in the spreadsheet, determines the verdict. That’s why pre-registration matters: if a researcher looks at the data, sees home foul counts were lower, and then declares “one-sided,” they’ve effectively run two tests and doubled their false-positive rate. Post-hoc one-sided = inflated Type I error. Emphasize this slide heavily — students usually don’t feel the factor of two until they see it move the verdict.

The 2× relation is approximate, not exact. If a student does the arithmetic and asks “but 2 × 0.0288 = 0.0576, not 0.0539”, the answer is Monte Carlo noise: a finite set of shuffles is not perfectly symmetric about zero, so the lower and upper tails hold slightly different counts, and the +1/+1 adjustment adds one to each count rather than scaling with it. The approximate 2× relationship is the point; the flip is real.
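The approximate factor of two can be reproduced on a simulated null. This is a minimal sketch: the null below is synthetic (normal draws with an assumed spread of 0.1, not the actual NBA shuffles), so the p-values will not match the slide; only the observed gap of -0.18 comes from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic, roughly symmetric null distribution (assumed spread, for illustration only)
null_diffs = rng.normal(loc=0.0, scale=0.1, size=10_000)
observed = -0.18   # observed home - away gap from the slide

n = len(null_diffs)
n_lower = np.sum(null_diffs <= observed)     # at least as negative as observed
n_upper = np.sum(null_diffs >= -observed)    # mirror-image upper tail
two_sided = (n_lower + n_upper + 1) / (n + 1)
one_sided = (n_lower + 1) / (n + 1)

print(f"lower-tail count: {n_lower}, upper-tail count: {n_upper}")
print(f"one-sided p: {one_sided:.4f}")
print(f"two-sided p: {two_sided:.4f}  (2 x one-sided = {2 * one_sided:.4f})")
```

The two tail counts are close but rarely equal, which is the whole reason the doubled one-sided p-value lands near, not on, the two-sided one.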

prefer two-sided by default

if you’re not sure, use two-sided — the honest default

a one-sided test is justified only when:

effects in the other direction are logically impossible or irrelevant
the direction was chosen before seeing the data

picking the tail post-hoc = effectively running two one-sided tests, each at \(\alpha = 0.05\)

the rate of falsely rejecting a true null jumps from \(\alpha = 0.05\) to \(\approx 0.10\)

Ch 10 formalizes this as the Type I error rate.
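The jump from 0.05 to about 0.10 can be checked by simulation. A minimal sketch, assuming that under a true null the test statistic is standard normal (a stand-in for a symmetric permutation null); the critical values 1.645 and 1.960 are the usual 5% one-sided and two-sided cutoffs for that distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_studies = 200_000

# Assumption: every study has a true null, and its test statistic is standard normal.
z = rng.standard_normal(n_studies)

Z_ONE_SIDED = 1.645   # 5% one-sided critical value
Z_TWO_SIDED = 1.960   # 5% two-sided critical value

pre_registered_lower = np.mean(z <= -Z_ONE_SIDED)   # tail fixed before seeing data
two_sided = np.mean(np.abs(z) >= Z_TWO_SIDED)       # both tails, 2.5% each
post_hoc = np.mean(np.abs(z) >= Z_ONE_SIDED)        # pick whichever tail the data favor

print(f"pre-registered one-sided: {pre_registered_lower:.3f}")
print(f"two-sided:                {two_sided:.3f}")
print(f"post-hoc one-sided:       {post_hoc:.3f}")
```

Choosing the tail after looking is the same as rejecting whenever either one-sided test rejects, so the two 5% tails add up.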

a high-profile example: Deflategate

Deflategate — a one-sided story

halftime, 2015 AFC Championship. officials measure the game balls.

Patriots’ balls: below the legal minimum. Colts’ balls: fine.

cooling explains some drop — same field, same weather, same physics.

so why did the Pats drop more?

the accusation was directional: Pats dropped more than Colts.

a Pats ball that dropped less → exoneration, not cheating.

only one direction counts as damaging → one-sided test

state \(H_0\) in plain language → pick one-sided or two-sided

factory — contract sets a minimum widget weight; weekly compliance check
school district — math curriculum pilot; keep, roll back, or revise?
biotech — new cholesterol drug; team hopes it lowers LDL

DISCUSSION: Think-pair-share (5-6 min). Think and jot — null first, then one-sided vs two-sided; pair and compare; debrief by sharing answers and spotting the trap. Prompt: For each scenario, write the null precisely in plain language, then pick one-sided or two-sided. Process goal: force students to commit to a null phrasing before picking the test. The phrasing matters: “doesn’t increase scores” is a different null from “doesn’t change scores,” and they pick out different tests. Debrief in order — widget first (warm-up, contract clarifies the direction), school next (the null-phrasing point), biotech LAST (the safety trap).

Answers:
(1) One-sided. \(H_0\): \(\mu \ge w_{\min}\) (shipment meets the minimum); \(H_1\): \(\mu < w_{\min}\). The contract penalizes only under-weight shipments; over-weight is logically exculpatory, not noteworthy. Same structure as Deflategate.
(2) Two-sided. \(H_0\): \(\mu_{\text{new}} = \mu_{\text{old}}\) (no change in scores). The right framing is “no change,” not “no increase”: scores that drop drive the rollback decision, so the test has to be able to detect them. A one-sided “doesn’t increase” null would silently treat a score drop as just another way to fail to find an improvement, discarding evidence the district would act on. THIS is the null-phrasing point.
(3) Two-sided, and a trap. \(H_0\): \(\mu_{\text{new}} = \mu_{\text{control}}\) (drug doesn’t change LDL). Tempting to go one-sided because the team hopes for a decrease, but a drug that raises LDL is a safety signal and would absolutely change the launch decision. Hoping for a direction ≠ only one direction matters. (Aside if asked about FDA practice: real Phase III efficacy trials sometimes use one-sided tests at α=0.025, which is decision-theoretically equivalent to two-sided at α=0.05; this is regulatory convention paired with separate safety monitoring. Not a free halving.)

Meta-lesson to surface in debrief: writing the null precisely tells you which test to use. One-sided is genuinely rare outside contract-compliance or tamper-detection contexts. Reserve it for cases where the other direction is logically irrelevant (not merely undesired), and pre-register the choice before looking at the data.

bootstrap vs permutation — when to use which

	bootstrap	permutation test
question	how precise is my estimate?	is the effect real?
produces	confidence interval	p-value
null hypothesis	not needed	required
key assumption	i.i.d. samples in each group	exchangeability under null
best for	any statistic	comparing groups
resampling	with replacement, within each group	without replacement, shuffling across groups

i.i.d. = independent and identically distributed

both are simulation-based inference: the computer builds the reference distribution; no normality or closed-form formula needed

bootstrap = precision \(\quad\) permutation = significance \(\quad\) use both
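The “resampling” row of the table is the mechanical difference between the two methods. A minimal sketch on tiny made-up groups (the values are invented so the resamples are easy to read by eye):

```python
import numpy as np

rng = np.random.default_rng(0)

# tiny made-up groups, for illustration only
group_a = np.array([12.0, 15.0, 9.0, 20.0, 14.0])
group_b = np.array([8.0, 11.0, 7.0, 13.0, 10.0])

# bootstrap: resample WITH replacement, separately within each group
boot_a = rng.choice(group_a, size=len(group_a), replace=True)
boot_b = rng.choice(group_b, size=len(group_b), replace=True)
boot_diff = boot_a.mean() - boot_b.mean()   # varies around the observed difference

# permutation: shuffle WITHOUT replacement across the pooled groups
pooled = np.concatenate([group_a, group_b])
shuffled = rng.permutation(pooled)
perm_a, perm_b = shuffled[:len(group_a)], shuffled[len(group_a):]
perm_diff = perm_a.mean() - perm_b.mean()   # varies around zero (the null)

print("bootstrap A:  ", boot_a, " (values can repeat, all from group A)")
print("permutation A:", perm_a, " (each pooled value used once, labels mixed)")
print(f"bootstrap diff: {boot_diff:+.2f}   permutation diff: {perm_diff:+.2f}")
```

Repeating the bootstrap draw many times gives a distribution centered near the observed difference (hence a confidence interval); repeating the permutation draw gives a distribution centered near zero (hence a p-value).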

A/B testing — same idea, different name

in data analytics, a randomized comparison between two groups is called an A/B test; a permutation test is one way to analyze it

A = control \(\quad\) B = treatment
used at every tech company: feature rollouts, pricing, ad copy, UX tweaks

the machinery is exactly what we just ran — shuffle labels, recompute, compare

summary

shuffle → null distribution → p-value — the recipe
small p-value \(\neq\) small probability the null is true — the #1 misconception
report the number — \(p \approx 10^{-4}\) (ACTG) and \(p \approx 0.05\) (NBA) are different verdicts, not the same “significant”

one recipe: 8 cups of tea, 2,139 clinical patients, 7,380 NBA team-games

next time

we have the p-value — but what threshold should we use?

Ch 10: formal hypothesis-testing framework — \(H_0\), \(H_1\), \(\alpha\), Type I/II errors, power
Ch 11: what happens when you run many tests?
Ch 18: when a test can license causal claims — designing for causation

one-minute feedback

what was the most useful thing you learned today?
what was the most confusing?

give feedback