MSE 125 — Slides – Lecture 11: Multiple Testing

logistics

quiz 5: Wed May 6
HW 3: due Fri May 8

the brief

you’re an analyst for the NBA Director of Player Personnel.

last season’s shooting numbers are on your desk — every player’s makes and attempts. your boss asks one question:

“which of our players really shoot better than the league average?”

how would you tackle this?

Frame the lecture as a decision problem, not a topic to cover. We are not here to “demonstrate multiple testing” — we are here to answer a real question that any sports analyst would face. Which of last season’s rotation players really shoot above league average?

Two ways the obvious approach fails: - Block 1 (multiple testing): run a z-test per player at p < 0.05. We’ll see far more rejections than the math allows under the null — but the rejections are mostly noise, and the analyst who reports them ships a list full of false positives. - Block 3 (confounding): the corrections give us a tight list of “real” departures. But the most extreme high outlier turns out to be a worse shooter than the most extreme low outlier — once we look at where each one shoots from. Right answer, wrong question.

Don’t tease the numbers (141, 47, 117) — those land in the next two blocks. Today’s hook is the question, not the punchline.

today

the multiple testing problem
two corrections: Bonferroni and Benjamini-Hochberg
right answer, wrong question: confounding and Simpson’s paradox

the multiple testing problem

the data: shots by zone

shot_zones_2023-24.csv — for each NBA player, field goals made (FGM) and attempted (FGA) in each of six shot zones. FG% = FGM / FGA.

RA — restricted area (under the basket)
PAINT — rest of the painted lane
MID — mid-range (inside the arc)
LC3, RC3 — corner threes
ATB3 — above-the-break threes

source: NBA.com Stats, 2023–24 regular season

who shoots above league average?

shots = pd.read_csv('data/nba/shot_zones_2023-24.csv')
qual = shots[shots['FGA'] >= 200].copy()
qual['FG_PCT'] = qual['FGM'] / qual['FGA']
p0 = qual['FGM'].sum() / qual['FGA'].sum()

Players (rotation, FGA >= 200):  317
League FG% (qualified players):   0.4770

extremes:

Jarrett Allen: 63%
Rudy Gobert: 66%
Jevon Carter: 38%

are these gaps real, or could they happen by chance over a season’s worth of shots?

bernoulli’s free lunch

Lec 10: two samples (treatment vs control), continuous outcome. SE pooled from both arms.

now: one sample (one player), binary outcome (made or missed). no control.

how do we get a standard error with no control?

bernoulli: variance per shot = p(1-p) — mean determines variance.

under H_0: p = p_0, the SE is pinned at \sqrt{p_0(1-p_0)/n}.

no second group. no estimation from data. null carries its own SE.

one test per player

one-sample z-test for a proportion — large-FGA normal approximation.

for each player i, test H_0: p_i = p_0 where p_0 = 0.477 (league):

z_i = \frac{\hat p_i - p_0}{\sqrt{p_0(1-p_0) / \text{FGA}_i}} \sim \mathcal{N}(0, 1)

run it 317 times.

Tests run:                317
Significant at p < 0.05:  141

z-test or t-test?

both standardize the gap: \dfrac{\text{statistic} - \text{null}}{\text{SE under null}}

t-test (continuous, e.g. CD4 counts):

SE estimated from data: \hat\sigma / \sqrt{n}
reference: t_{n-1} (extra uncertainty from \hat\sigma)

z-test for a proportion (binary, e.g. made/missed):

SE pinned by H_0: \sqrt{p_0(1-p_0)/n} — nothing to estimate
reference: \mathcal{N}(0, 1)

both = large-n limit of the bootstrap (Lec 8). bootstrap works for any n; z and t are the CLT shortcuts.

The bridge for students who know t-tests but haven’t seen z. Two ideas to land:

Same recipe: standardized distance from the null, then look up tail probability. The two tests differ only in (a) how you get the SE and (b) which reference distribution you use.
Why z here: the binomial SE is determined by p_0 alone — no estimation needed because H_0 fixes the variance. For a continuous outcome the variance is a separate unknown, so you have to estimate it from the data and pay the t-distribution penalty for the extra uncertainty.
Bootstrap connection: Lec 8 introduced the bootstrap as resampling the data and looking at the empirical distribution of the statistic. Both t and z are CLT-driven approximations to that. For a mean: bootstrap → normal centered at true mean, sd = s/√n; the t-test corrects for the small-sample sd estimate. For a proportion: bootstrap → normal centered at true p, sd = √(p(1−p)/n); the z-test plugs in the null p_0 since H_0 specifies it. The bootstrap works for both; z and t are the closed-form CLT shortcuts.
When to use each: continuous and want to compare to a fixed value → t. Binary and want to compare to a fixed proportion → z (or equivalent: a binomial test, exact for any n). Continuous with huge n and known sd → z works too, but t is safer.

This slide is dense. Consider stepping through the fragments slowly. If running long, skip the bootstrap line — it’s a recall, not new content.

if each player’s true FG% equaled the league baseline, how many of 317 tests would you expect to appear significant at \alpha = 0.05?

give a number.

expected by chance

if H_0 holds for every player:

\mathbb{E}[\text{false positives}] = m \cdot \alpha

Tests:                317
Expected false +ves:  15.9
SD under null:         3.9     # = sqrt(m · α · (1−α))
Observed significant: 141
Excess over null:     125  (32.3 SDs)

125 over the null floor — real signal. but mixed with ~16 fakes.

the multiple testing problem

multiple testing

when many hypothesis tests are run, a predictable fraction will clear \alpha = 0.05 by chance alone.

with m tests at level \alpha: expect m \cdot \alpha false positives.

it doesn’t matter who runs the tests:

one analyst runs 317 tests → expect 16 false positives
1,000 labs each run one test → expect 50 false positives

the p-value histogram: a diagnostic

under H_0: p-values are Uniform(0, 1) — flat.

predict: what does our 317-player histogram look like?

this isn’t just NBA

The bridge from the NBA arc into science. The cartoon does dual duty in one panel:

Multiple testing arithmetic. 20 colors of jellybean, 20 independent tests at \alpha = 0.05. Under the null you expect ~1 false positive. Green pops up.
Publication bias / file drawer. Only the “significant” finding makes the news. The 19 negative results are invisible. By the time the public reads the headline, the multiplicity that produced it has been erased.

This is the same arithmetic we just ran on 317 NBA players — just spread across studies, labs, and the press instead of one analyst’s notebook. ~30 seconds for the room to read and laugh.

Verbal bridge before clicking forward: “the NBA had 317 tests on one analyst’s screen. Science has the same problem distributed across labs and journals — and the stakes are quite a bit higher than getting Aaron Gordon’s FG% wrong.”

what’s at stake

a published false positive doesn’t sit on a shelf — someone acts on it.

medicine: Amgen tried to replicate 53 landmark cancer studies — confirmed only 6 (Begley & Ellis, Nature 2012). Failed preclinical work feeds drug trials that cost ~$1B each.
psychology / policy:
- “power posing” (Cuddy 2010) → 50M+ TED views, corporate trainings → didn’t replicate.
- “Growth mindset” interventions → district-wide rollouts, near-zero average effect.
biomedical waste: Freedman et al. (2015) estimate $28 B/year spent in the US on preclinical research that doesn’t reproduce.

cost isn’t just dollars: patients enrolled in trials, students taught by failed interventions, policy built on phantom effects.

The “why care” slide. The arithmetic on the previous slide is mechanical; this slide is the consequence in fields where stakes are high.

Begley & Ellis (2012, Nature 483:531): Amgen’s hematology/oncology team tried to replicate 53 “landmark” preclinical cancer findings before building drug programs on them. They confirmed 6 (11%). Bayer ran a parallel exercise (Prinz, Schlange, Asadullah 2011) on 67 target-ID papers and reproduced fewer than 25%. This isn’t psychology — it’s the foundation of cancer drug development, where each failed clinical trial costs hundreds of millions and exposes patients to real risk.

Power posing (Carney, Cuddy, Yap 2010): claimed 1-min “high-power” poses raised testosterone, lowered cortisol, increased risk-taking. Cuddy’s 2012 TED talk has 70M+ views and is one of the most-watched ever. Corporate training programs adopted it. By 2016 the original co-author Dana Carney had publicly disavowed it as not replicable.

Growth mindset: Carol Dweck’s work led to interventions deployed in tens of thousands of K-12 classrooms. Large-scale RCTs (Sisk et al. 2018 meta-analysis; Yeager & Dweck 2019 NSGM trial) found average effects near zero or very small, with most studies underpowered. School districts spent millions on programs whose population-level benefit is now in serious doubt.

$28B figure: Freedman, Cockburn, Simcoe (2015, PLOS Biology) — based on irreproducibility rates from preclinical biology, life-sciences R&D spending, and downstream costs.

The point of this slide is that “replication” is not academic hygiene. It determines what diseases get treated, what students get taught, what policies get adopted.

the replication crisis: numbers

2011 — Bem (precognition): standard methods, top journal, conclusion: people sense the future. published, then unreplicable.
2015 — Reproducibility Project: 100 psychology studies retried. ~36% replicated with significance. effect sizes roughly half the originals.
2026 — SCORE: 274 claims across 164 papers, 54 social/behavioral journals. 55% of claims replicated; given the same data, only 34% of independent analysts using defensible alternative specs reached the original conclusion.

hidden multiplicity: p-hacking

even one analyst running “one test” is hiding multiplicity:

exclude an outlier? that’s a choice
which outcome variable? that’s a choice
collect more data until significant? that’s a choice
which subset to analyze? that’s a choice

each choice is an implicit hypothesis test. the final reported p is the survivor of many.

your boss: “find me a clutch shooter.” you pick Player X and test:

restriction	p-value
4th quarter	0.18
last 5 min	0.12
last 2 min	0.09
last 2 min, score within 3	0.04 ✓
…same, home games only	0.02 ✓

write-up: “Player X shoots above league average in clutch home situations, p = 0.02.”

what’s the chance you’d find something significant under the null?

DISCUSSION: spot-the-forks (5 min). 1 min think; 2 min pair; debrief.

This is Andrew Gelman’s garden of forking paths (Gelman & Loken, 2014) made concrete. The reported p = 0.02 is not the probability the analyst would have seen this result by chance — it is the probability conditional on this exact sequence of restrictions. The relevant probability is “would I have found some sub-restriction with p < 0.05 under the null?” — and that is much larger.

Forks visible in the table (5):

Quarter window: 4th, last 5, last 2, last 1, last 30 sec, last 10 sec…
Score margin: within 1, 2, 3, 5, 10, “any close game”
Home / away / both
Stopping rule: “I’ll stop when I find p < 0.05” (optional stopping)
Sign of the effect: above league, below league, “different from league”

Forks invisible in the table (just as many):

Choice of player (Player X vs Y vs Z — 50+ candidates)
Definition of “shot” (FGA only? include free throws? assisted vs unassisted?)
Outcome metric (FG%? eFG%? TS%? points-per-shot?)
Reference comparison (league average? his own season average? historical clutch baseline?)
Time period (this season? career? last 3 seasons?)
How to handle missing/garbage-time minutes

Effective \alpha. Even taking only the 5 visible forks at face value, 1 - 0.95^5 \approx 23\% — and if you keep going through the invisible forks (30+ plausible analyses), the chance of finding something at p < 0.05 under the null is essentially 1. The reported p = 0.02 massively understates the actual evidence.

The takeaway to land before the next slide: the analyst here didn’t lie. Each test was honest. Each p-value was correctly computed. The “garden” is not in any one test — it’s in the choice of which tests to look at and which to report. Bonferroni and BH can’t help here because they only correct for the tests you tell them about. → next slide: structural pressure makes the file drawer worse, then xkcd, then the actual fix (pre-registration) on the slide after.

structural pressure

two features of science amplify the problem:

file drawer: null results don’t get published
incentives: careers are built on novel significant findings

if 20 labs test the same false hypothesis, one will publish a significant result by chance — and that’s the only one you read

the fix: pre-registration

pre-registration

publicly committing to your research question, data plan, and analysis plan before looking at the data.

a pre-registered analysis is a single pre-specified test — not the survivor of many.

doesn’t ban exploration — requires labeling it as exploratory, not confirmatory

which is more credible?

study A

pre-registered hypothesis: drug X reduces blood pressure by ≥ 5 mmHg.

result: 4.2 mmHg, p = 0.08

study B

exploratory analysis of 50 outcomes.

result: drug X reduces ankle swelling by 18%, p = 0.02

pick one and defend it

two corrections

correction 1: Bonferroni

start with a false-positive budget of \alpha = 0.05 — total chance of any wrong rejection.

spread it evenly across m tests: each gets \alpha/m.

union bound:

\Pr[A_1 \cup \cdots \cup A_m] \;\le\; \sum_{i=1}^{m} \Pr[A_i]

so

\Pr[\text{any false positive}] \;\le\; m \cdot \frac{\alpha}{m} \;=\; \alpha

works regardless of correlation between tests

Bonferroni: the formal version

Bonferroni correction

test each hypothesis at the stricter level

\alpha_{\text{Bonf}} = \frac{\alpha}{m}

controls family-wise error rate (FWER): probability of any false positive across all m tests.

conservative, but the guarantee is bulletproof.

Bonferroni applied

alpha = 0.05
bonf_threshold = alpha / 317   # = 0.000158  (317x stricter)
n_bonf = (qual['p_value'] < bonf_threshold).sum()

threshold dropped 317× — what’s your guess for the new count, out of 141?

Significant after Bonferroni:  47   (down from 141)

survivors — the obvious extremes:

rim-feasting big men: Allen, Gobert, Lively, Gafford, Giannis (60–75%)
worst shooters at the other tail

the cost of conservatism

Damian Lillard:   42.6% on 1,270 attempts
                  p-value ≈ 3 × 10⁻⁴
                  Bonferroni threshold: 1.6 × 10⁻⁴
                  → fails to reject

a real moderate effect that Bonferroni throws away

correction 2: BH (rising bar)

Bonferroni asks every p-value to clear \alpha/m — flat bar

BH lets the bar rise with rank:

\alpha/m, \; 2\alpha/m, \; 3\alpha/m, \; \ldots, \; \alpha

if you call k results discoveries, expect k\alpha to be null → expected false-discovery fraction ≈ \alpha

Benjamini-Hochberg procedure

Benjamini-Hochberg (BH)

sort p-values: p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)} — so p_{(k)} is the k-th smallest.
find the largest k such that p_{(k)} \le \frac{k}{m} \alpha
reject hypotheses 1, \ldots, k

controls false discovery rate (FDR): expected fraction of false positives among rejections.

BH on a toy example

m = 5 tests, \alpha = 0.05. sort p-values, compare to k\alpha/m = 0.01k:

k	p_{(k)}	k\alpha/m	\le ?
1	0.001	0.01	✓
2	0.008	0.02	✓
3	0.030	0.03	✓
4	0.050	0.04	✗
5	0.400	0.05	✗

walk from the bottom; largest k where \le holds is k = 3 → reject 1, 2, 3.

compare: Bonferroni (\alpha/m = 0.01) rejects only k=1; uncorrected rejects 1–4.

Bonferroni vs BH on shot data

Uncorrected (p < 0.05):  141
Bonferroni (FWER):        47
BH (FDR <= 0.05):        117

the 70-player gap

players caught by BH but not Bonferroni:

Damian Lillard (42.6%)
Fred VanVleet
Kevin Durant
Chet Holmgren
… and ~66 others

moderate effects — “two or three points off league average” — that BH flags and Bonferroni rejects on principle

when Bonferroni gives up

Hedenfalk et al. (2001) — breast cancer microarrays

m = 3{,}170 genes, t-test per gene (BRCA1 vs BRCA2 tumors)
only 7 vs 8 samples per gene → low power per test
Bonferroni rejects roughly 1 gene
BH at q = 0.05 rejects roughly 94 genes (Storey & Tibshirani, 2003)

many moderate effects, no extreme ones → exactly where FDR was invented to work

which correction — and does it matter?

for each: is your budget K fixed (top-K) or data-driven (threshold)? what guarantee do you want about the list?

screening 20,000 genes; wet lab can run 100 follow-ups
evaluating a single drug for FDA safety approval
ranking 500 students for 50 scholarships
testing 100 churn features; ship the top 5
30,000 SNPs in a GWAS; report all that pass

DISCUSSION: analyze-this (5 min). 30 sec per scenario; 2 min pair; debrief.

Goal: push students past “Bonferroni vs BH” into the deeper question: does the correction even change who I act on? Both corrections preserve the p-value ranking — they only differ in where the threshold falls. So when K is fixed by budget, you take the top-K regardless of correction. The correction is doing characterization (how noisy is my reported set?), not selection.

Selection vs characterization — the framing for each:

20K genes, 100 follow-ups (K fixed): Take the top 100 by p-value. Correction doesn’t change selection. BH-style FDR is useful for characterization — “of these 100, expect ~5 to be noise” — but you didn’t need BH to pick them.
FDA approval (K data-driven, single decision): This is the only scenario where the threshold actually drives the decision. Bonferroni-flavor — controlling FWER for one go/no-go is appropriate; the cost of crossing the line wrongly is severe and asymmetric.
500 students, 50 scholarships (K fixed): Pure ranking problem. Take the top 50. No correction changes the selection. The interesting question is “of the borderline 10, what fraction don’t deserve it?” — an FDR-style characterization, not a threshold.
100 features, top 5 (K fixed): Same as scholarships and genes. Take the top 5; characterize how many might be noise.
30K SNPs, threshold-based (K data-driven): This is the textbook GWAS case where corrections matter for selection. Bonferroni at 5 \times 10^{-8} is the field’s convention — single false positive can spawn years of follow-up work. BH is gaining traction for discovery work where K is genuinely data-driven.

Key insight: “which correction” is the wrong first question when budgets are fixed. Right first question: does the threshold drive my action set, or am I taking top-K either way?

If the latter, the correction adds nothing for selection but can still characterize your batch’s expected noise — Storey’s q-value framework formalizes this (chapter callout).

stakes determine the correction

	controls	use when
Bonferroni	family-wise error rate	any false positive is costly
BH	false discovery rate	tolerable fraction of FPs is OK

corrections fix the known multiplicity — the tests you ran on purpose.

pre-registration fixes the hidden multiplicity — the choices you made before the test.

right answer, wrong question

remember the survivors?

Bonferroni rejected 47 players — the most extreme cases.

among them:

Aaron Gordon: 55.7% — confidently above league
Klay Thompson: 43.3% — confidently below league

both rejections survive any correction. so Gordon is the better shooter. right?

inside the Gordon/Thompson gap

aggregate: Gordon +12.4 over Klay.

predict: in how many of the 6 zones does Gordon outshoot Klay?

zone	Klay Thompson	Aaron Gordon	Klay − Gordon
RA	75.3	70.8	+4.5
PAINT	41.2	24.4	+16.8
MID	44.8	29.7	+15.1
LC3	30.4	24.1	+6.3
RC3	50.0	48.1	+1.9
ATB3	38.7	24.7	+14.0
TOTAL	43.3	55.7	−12.4

Klay wins every zone. Gordon wins the aggregate.

Simpson’s paradox

Simpson’s paradox (Simpson, 1951)

an association reverses when you look inside subgroups.

aggregate trend ≠ within-subgroup trend.

a strong form of confounding — the aggregate doesn’t just mislead, it points the wrong direction

a lurking variable: shot location

rim_share = RA_FGA / FGA — fraction of shots from the restricted area.

why the scatter is so tight

zone	league FG%
restricted area	67%
above-the-break 3	36%

whoever takes the easier shots posts the better aggregate number — mechanical, not skillful.

ecological correlation — a warning

ecological correlation

correlations on group averages (one point per state, per team, per player) can be much stronger than correlations on individual observations.

averaging hides within-group variation.

always check the unit of analysis.

why Gordon’s aggregate looks better

Share of shots taken in each zone (%):
                Klay Thompson  Aaron Gordon
RA                        8.2          64.8
PAINT                    10.5          10.9
MID                      20.3           5.2
LC3                       4.1           4.1
RC3                       4.1           3.8
ATB3                     52.8          11.3

Gordon: 65% of shots from the rim (league: 67%) Klay: 53% above the break (league: 36%)

decomposing FG%

shot_mix_FG_PCT: what each player would shoot if they hit league rate in every zone skill_above_league: residual — actual FG% minus shot-mix expectation

                 FG_PCT  shot_mix_FG_PCT  skill_above_league
Klay Thompson    0.433        0.411              +0.022
Aaron Gordon     0.557        0.573              −0.016

Klay: +2 points above his shot-mix expectation.

Gordon: −2 points below his.

the reversal in one picture

confounding

confounder (informal)

a variable tied to both the input we care about and the outcome — creating an aggregate association that doesn’t reflect the mechanism we want to measure.

here:

input: which player took the shot
outcome: did it go in
confounder: which zone

“tied to” is doing a lot of work — formal causal definitions and DAGs in Ch 18.

Bonferroni rejected H_0 for both Gordon and Thompson. Both rejections are statistically real.

what kind of question would each rejection answer correctly?

what kind would it answer wrongly?

the limits of corrections

multiple testing corrections (Bonferroni, BH) protect against finding effects that aren’t there.

they do nothing to protect against finding effects that are there but reflect the wrong mechanism.

next chapter: regression lets us adjust for confounders by including them in the model.

“if you torture the data long enough, it will confess to anything.”

— Ronald Coase

today: right answer, wrong question

multiple testing — fix the threshold

m tests at \alpha → expect m \cdot \alpha false positives
Bonferroni controls FWER; BH controls FDR
replication crisis = same arithmetic across labs

confounding — fix the question

correlation can reflect a third variable
Simpson’s paradox: aggregate trend reverses inside subgroups
significant ≠ the right question

correlation is not causation.

next: regression

Bonferroni and BH protect us from false discoveries.

they don’t protect us from wrong-mechanism discoveries.

Ch 12: include the confounder in the model — adjust the question, not just the threshold.

one-minute feedback

what was the most useful thing you learned today?
what was the most confusing?

give feedback