Lecture 11: Multiple Testing

MSE 125 — Applied Statistics

Madeleine Udell

Monday, May 4, 2026

logistics

  • quiz 5: Wed May 6
  • HW 3: due Fri May 8

the brief

you’re an analyst for the NBA Director of Player Personnel.

last season’s shooting numbers are on your desk — every player’s makes and attempts. your boss asks one question:

“which of our players really shoot better than the league average?”

how would you tackle this?

today

  • the multiple testing problem
  • two corrections: Bonferroni and Benjamini-Hochberg
  • right answer, wrong question: confounding and Simpson’s paradox

the multiple testing problem

the data: shots by zone

shot_zones_2023-24.csv — for each NBA player, field goals made (FGM) and attempted (FGA) in each of six shot zones. FG% = FGM / FGA.

  • RA — restricted area (under the basket)
  • PAINT — rest of the painted lane
  • MID — mid-range (inside the arc)
  • LC3, RC3 — corner threes
  • ATB3 — above-the-break threes

source: NBA.com Stats, 2023–24 regular season

who shoots above league average?

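import pandas as pd

shots = pd.read_csv('data/nba/shot_zones_2023-24.csv')
qual = shots[shots['FGA'] >= 200].copy()      # rotation players: 200+ attempts
qual['FG_PCT'] = qual['FGM'] / qual['FGA']    # per-player field-goal percentage
p0 = qual['FGM'].sum() / qual['FGA'].sum()    # pooled baseline over qualified players

Players (rotation, FGA >= 200):  317
League FG% (qualified players):   0.4770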

extremes:

  • Jarrett Allen: 63%
  • Rudy Gobert: 66%
  • Jevon Carter: 38%

are these gaps real, or could they happen by chance over a season’s worth of shots?

bernoulli’s free lunch

Lec 10: two samples (treatment vs control), continuous outcome. SE pooled from both arms.

now: one sample (one player), binary outcome (made or missed). no control.

how do we get a standard error with no control?

bernoulli: variance per shot = p(1-p). the mean determines the variance.

under H_0: p = p_0, the SE is pinned at \sqrt{p_0(1-p_0)/n}.

no second group. no estimation from data. null carries its own SE.

one test per player

one-sample z-test for a proportion — large-FGA normal approximation.

for each player i, test H_0: p_i = p_0 where p_0 = 0.477 (league):

z_i = \frac{\hat p_i - p_0}{\sqrt{p_0(1-p_0) / \text{FGA}_i}} \sim \mathcal{N}(0, 1)

run it 317 times.

Tests run:                317
Significant at p < 0.05:  141
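a minimal sketch of the per-player test above, using only the standard library. the function name and the example player are hypothetical, and the two-sided p-value is an assumption (it matches the fact that both high and low shooters get flagged).

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(makes, attempts, p0):
    """Two-sided one-sample z-test of H0: p = p0, with the SE fixed by the null."""
    p_hat = makes / attempts
    se_null = sqrt(p0 * (1 - p0) / attempts)   # no estimation: the null pins the SE
    z = (p_hat - p0) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical player (not from the dataset): 230 makes on 400 attempts.
z, p_value = proportion_z_test(makes=230, attempts=400, p0=0.477)
print(f"z = {z:.2f}, two-sided p = {p_value:.5f}")
```

applying the same function to every row of the qualified-player table would give the 317 p-values used in the rest of the lecture.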

z-test or t-test?

both standardize the gap: \dfrac{\text{statistic} - \text{null}}{\text{SE under null}}

t-test (continuous, e.g. CD4 counts):

  • SE estimated from data: \hat\sigma / \sqrt{n}
  • reference: t_{n-1} (extra uncertainty from \hat\sigma)

z-test for a proportion (binary, e.g. made/missed):

  • SE pinned by H_0: \sqrt{p_0(1-p_0)/n} — nothing to estimate
  • reference: \mathcal{N}(0, 1)

both = large-n limit of the bootstrap (Lec 8). bootstrap works for any n; z and t are the CLT shortcuts.

if each player’s true FG% equaled the league baseline, how many of 317 tests would you expect to appear significant at \alpha = 0.05?

give a number.

expected by chance

if H_0 holds for every player:

\mathbb{E}[\text{false positives}] = m \cdot \alpha

Tests:                317
Expected false +ves:  15.9
SD under null:         3.9     # = sqrt(m · α · (1−α))
Observed significant: 141
Excess over null:     125  (32.3 SDs)

125 over the null floor — real signal. but mixed with ~16 fakes.
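the arithmetic behind that table, as a short sketch. the SD treats the 317 tests as independent Binomial(m, α) trials, which is an assumption rather than something the data guarantees.

```python
from math import sqrt

m, alpha, observed = 317, 0.05, 141

expected_fp = m * alpha                    # mean of Binomial(m, alpha)
sd_null = sqrt(m * alpha * (1 - alpha))    # SD of Binomial(m, alpha)
excess = observed - expected_fp            # rejections beyond the null floor
excess_in_sds = excess / sd_null

print(f"expected false positives: {expected_fp:.2f}")
print(f"SD under the null:        {sd_null:.2f}")
print(f"excess over the null:     {excess:.1f} ({excess_in_sds:.1f} SDs)")
```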

the multiple testing problem

multiple testing

when many hypothesis tests are run, a predictable fraction will clear \alpha = 0.05 by chance alone.

with m tests at level \alpha: expect m \cdot \alpha false positives.

it doesn’t matter who runs the tests:

  • one analyst runs 317 tests → expect 16 false positives
  • 1,000 labs each run one test → expect 50 false positives

the p-value histogram: a diagnostic

under H_0: p-values are Uniform(0, 1) — flat.

predict: what does our 317-player histogram look like?
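one way to see the flat null shape before looking at the real histogram: simulate 317 players who all truly shoot the league rate, test each one, and bin the p-values. a sketch under stated assumptions: the attempt counts are made up (uniform between 200 and 1,500), and the seed is arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)                   # arbitrary seed so the sketch is repeatable
P0, N_PLAYERS = 0.477, 317


def null_p_value(p0):
    """Simulate one player who truly shoots p0, then z-test them against p0."""
    attempts = random.randint(200, 1500)    # hypothetical attempt count
    makes = sum(random.random() < p0 for _ in range(attempts))
    z = (makes / attempts - p0) / sqrt(p0 * (1 - p0) / attempts)
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_values = [null_p_value(P0) for _ in range(N_PLAYERS)]

# Ten equal-width bins on [0, 1]; a flat histogram puts about 32 players in each.
counts = [0] * 10
for p in p_values:
    counts[min(int(p * 10), 9)] += 1

for i, c in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} | {'#' * c}")
```

the bars should come out roughly level, with some lumpiness because makes are whole numbers. a spike in the leftmost bin is what real signal adds on top of this.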

this isn’t just NBA

what’s at stake

a published false positive doesn’t sit on a shelf — someone acts on it.

  • medicine: Amgen tried to replicate 53 landmark cancer studies — confirmed only 6 (Begley & Ellis, Nature 2012). Failed preclinical work feeds drug trials that cost ~$1B each.
  • psychology / policy:
    • “power posing” (Cuddy 2010) → 50M+ TED views, corporate trainings → didn’t replicate.
    • “Growth mindset” interventions → district-wide rollouts, near-zero average effect.
  • biomedical waste: Freedman et al. (2015) estimate $28 B/year spent in the US on preclinical research that doesn’t reproduce.

cost isn’t just dollars: patients enrolled in trials, students taught by failed interventions, policy built on phantom effects.

the replication crisis: numbers

  • 2011 — Bem (precognition): standard methods, top journal, conclusion: people sense the future. published, then unreplicable.
  • 2015 — Reproducibility Project: 100 psychology studies retried. ~36% replicated with significance. effect sizes roughly half the originals.
  • 2026 — SCORE: 274 claims across 164 papers, 54 social/behavioral journals. 55% of claims replicated; given the same data, only 34% of independent analysts using defensible alternative specs reached the original conclusion.

hidden multiplicity: p-hacking

even one analyst running “one test” is hiding multiplicity:

  • exclude an outlier? that’s a choice
  • which outcome variable? that’s a choice
  • collect more data until significant? that’s a choice
  • which subset to analyze? that’s a choice

each choice is an implicit hypothesis test. the final reported p is the survivor of many.

your boss: “find me a clutch shooter.” you pick Player X and test:

restriction                      p-value
4th quarter                      0.18
last 5 min                       0.12
last 2 min                       0.09
last 2 min, score within 3       0.04
…same, home games only           0.02

write-up: “Player X shoots above league average in clutch home situations, p = 0.02.”

what’s the chance you’d find something significant under the null?
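a rough way to put numbers on that question. the sketch treats the five looks as independent, which nested subsets are not, so it also prints bounds that hold under any dependence. the variable names are illustrative.

```python
alpha = 0.05
looks = 5   # the five restrictions tried on Player X

# If the looks were independent, each has a (1 - alpha) chance of staying quiet.
p_any_independent = 1 - (1 - alpha) ** looks

# Bounds that hold whatever the dependence between looks.
lower_bound = alpha                     # at least the chance that the first look fires
upper_bound = min(1.0, looks * alpha)   # union bound

print(f"independent looks: {p_any_independent:.3f}")
print(f"any dependence:    between {lower_bound:.2f} and {upper_bound:.2f}")
```

either way the reported p = 0.02 overstates the evidence: the chance of some significant look under the null is well above 2%.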

structural pressure

two features of science amplify the problem:

  • file drawer: null results don’t get published
  • incentives: careers are built on novel significant findings

if 20 labs test the same false hypothesis at \alpha = 0.05, expect about one significant result by chance (20 \cdot 0.05 = 1), and that’s the only one you read

the fix: pre-registration

pre-registration

publicly committing to your research question, data plan, and analysis plan before looking at the data.

a pre-registered analysis is a single pre-specified test — not the survivor of many.

doesn’t ban exploration — requires labeling it as exploratory, not confirmatory

which is more credible?

study A

pre-registered hypothesis: drug X reduces blood pressure by ≥ 5 mmHg.

result: 4.2 mmHg, p = 0.08

study B

exploratory analysis of 50 outcomes.

result: drug X reduces ankle swelling by 18%, p = 0.02

pick one and defend it

two corrections

correction 1: Bonferroni

start with a false-positive budget of \alpha = 0.05 — total chance of any wrong rejection.

spread it evenly across m tests: each gets \alpha/m.

union bound:

\Pr[A_1 \cup \cdots \cup A_m] \;\le\; \sum_{i=1}^{m} \Pr[A_i]

so

\Pr[\text{any false positive}] \;\le\; m \cdot \frac{\alpha}{m} \;=\; \alpha

works regardless of correlation between tests

Bonferroni: the formal version

Bonferroni correction

test each hypothesis at the stricter level

\alpha_{\text{Bonf}} = \frac{\alpha}{m}

controls family-wise error rate (FWER): probability of any false positive across all m tests.

conservative, but the guarantee is bulletproof.
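a small simulation of what FWER control buys, assuming every null is true and (for simplicity) that the tests are independent. the guarantee itself does not need independence. the seed and number of simulated seasons are arbitrary.

```python
import random

random.seed(1)                       # arbitrary seed so the sketch is repeatable
m, alpha, n_sims = 317, 0.05, 2000


def any_rejection(threshold):
    """One simulated season with all nulls true: p-values are Uniform(0, 1)."""
    return any(random.random() < threshold for _ in range(m))


# Fraction of simulated seasons with at least one false positive.
fwer_uncorrected = sum(any_rejection(alpha) for _ in range(n_sims)) / n_sims
fwer_bonferroni = sum(any_rejection(alpha / m) for _ in range(n_sims)) / n_sims

print(f"FWER at alpha:     {fwer_uncorrected:.3f}")
print(f"FWER at alpha / m: {fwer_bonferroni:.3f}")
```

uncorrected, nearly every season contains at least one false positive. at \alpha/m the rate should land near or below 0.05.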

Bonferroni applied

alpha = 0.05
bonf_threshold = alpha / 317   # = 0.000158  (317x stricter)
n_bonf = (qual['p_value'] < bonf_threshold).sum()

threshold dropped 317× — what’s your guess for the new count, out of 141?

Significant after Bonferroni:  47   (down from 141)

survivors — the obvious extremes:

  • rim-feasting big men: Allen, Gobert, Lively, Gafford, Giannis (60–75%)
  • worst shooters at the other tail

the cost of conservatism

Damian Lillard:   42.6% on 1,270 attempts
                  p-value ≈ 3 × 10⁻⁴
                  Bonferroni threshold: 1.6 × 10⁻⁴
                  → fails to reject

a real moderate effect that Bonferroni throws away
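a quick check of the Lillard numbers, assuming the same two-sided z-test against p_0 = 0.477 and using the rounded 42.6% from the slide, so the p-value is approximate.

```python
from math import sqrt
from statistics import NormalDist

p0, m, alpha = 0.477, 317, 0.05
fg_pct, attempts = 0.426, 1270      # Lillard's line as quoted on the slide

z = (fg_pct - p0) / sqrt(p0 * (1 - p0) / attempts)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
bonf_threshold = alpha / m

print(f"z = {z:.2f}, p = {p_value:.1e}, Bonferroni threshold = {bonf_threshold:.1e}")
print("significant uncorrected:  ", p_value < alpha)
print("significant at Bonferroni:", p_value < bonf_threshold)
```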

correction 2: BH (rising bar)

Bonferroni asks every p-value to clear \alpha/m: a flat bar.

BH lets the bar rise with rank:

\alpha/m, \; 2\alpha/m, \; 3\alpha/m, \; \ldots, \; \alpha

if you call k results discoveries, the bar sits at k\alpha/m; of m true nulls, at most m \cdot k\alpha/m = k\alpha are expected to fall under it → expected false-discovery fraction ≈ k\alpha / k = \alpha

Benjamini-Hochberg procedure

Benjamini-Hochberg (BH)

  1. sort p-values: p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)} — so p_{(k)} is the k-th smallest.
  2. find the largest k such that p_{(k)} \le \frac{k}{m} \alpha
  3. reject hypotheses 1, \ldots, k

controls false discovery rate (FDR): expected fraction of false positives among rejections.

BH on a toy example

m = 5 tests, \alpha = 0.05. sort p-values, compare to k\alpha/m = 0.01k:

k   p_{(k)}   k\alpha/m   p_{(k)} \le k\alpha/m ?
1   0.001     0.01        yes
2   0.008     0.02        yes
3   0.030     0.03        yes
4   0.050     0.04        no
5   0.400     0.05        no

walk from the bottom; largest k where \le holds is k = 3 → reject 1, 2, 3.

compare: Bonferroni (\alpha/m = 0.01) rejects k = 1 and 2, since 0.008 \le 0.01; uncorrected (p \le 0.05) rejects 1–4.
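the three BH steps, written out as a minimal sketch and run on the toy p-values. the function name is illustrative. it returns positions in the original, unsorted list.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices (into p_values) of hypotheses rejected by BH at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_star = rank          # remember the largest rank that clears its bar
    return sorted(order[:k_star])  # reject everything up to that rank


toy = [0.400, 0.030, 0.001, 0.050, 0.008]   # the toy p-values, deliberately unsorted
rejected = benjamini_hochberg(toy, alpha=0.05)
print(rejected)   # positions of 0.030, 0.001 and 0.008 in `toy`
```

note the step-up behavior: every hypothesis ranked at or below the largest passing k is rejected, even if one of them missed its own bar. in practice a library routine such as `multipletests(..., method="fdr_bh")` from statsmodels does the same job.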

Bonferroni vs BH on shot data

Uncorrected (p < 0.05):  141
Bonferroni (FWER):        47
BH (FDR <= 0.05):        117

the 70-player gap

players caught by BH but not Bonferroni:

  • Damian Lillard (42.6%)
  • Fred VanVleet
  • Kevin Durant
  • Chet Holmgren
  • … and ~66 others

moderate effects (“two or three points off league average”) that BH flags and Bonferroni, by design, leaves unflagged

when Bonferroni gives up

Hedenfalk et al. (2001) — breast cancer microarrays

  • m = 3{,}170 genes, t-test per gene (BRCA1 vs BRCA2 tumors)
  • only 7 vs 8 samples per gene → low power per test
  • Bonferroni rejects roughly 1 gene
  • BH at q = 0.05 rejects roughly 94 genes (Storey & Tibshirani, 2003)

many moderate effects, no extreme ones → exactly where FDR was invented to work

which correction — and does it matter?

for each: is your budget K fixed (top-K) or data-driven (threshold)? what guarantee do you want about the list?

  • screening 20,000 genes; wet lab can run 100 follow-ups
  • evaluating a single drug for FDA safety approval
  • ranking 500 students for 50 scholarships
  • testing 100 churn features; ship the top 5
  • 30,000 SNPs in a GWAS; report all that pass

stakes determine the correction

method       controls                  use when
Bonferroni   family-wise error rate    any false positive is costly
BH           false discovery rate      tolerable fraction of FPs is OK

corrections fix the known multiplicity — the tests you ran on purpose.

pre-registration fixes the hidden multiplicity — the choices you made before the test.

right answer, wrong question

remember the survivors?

Bonferroni rejected H_0 for 47 players: the most extreme cases.

among them:

  • Aaron Gordon: 55.7% — confidently above league
  • Klay Thompson: 43.3% — confidently below league

both rejections survive any correction. so Gordon is the better shooter. right?

inside the Gordon/Thompson gap

aggregate: Gordon +12.4 over Klay.

predict: in how many of the 6 zones does Gordon outshoot Klay?

FG% by zone:
zone     Klay Thompson   Aaron Gordon   Klay − Gordon
RA       75.3            70.8           +4.5
PAINT    41.2            24.4           +16.8
MID      44.8            29.7           +15.1
LC3      30.4            24.1           +6.3
RC3      50.0            48.1           +1.9
ATB3     38.7            24.7           +14.0
TOTAL    43.3            55.7           −12.4

Klay wins every zone. Gordon wins the aggregate.

Simpson’s paradox

Simpson’s paradox (Simpson, 1951)

an association reverses when you look inside subgroups.

aggregate trend ≠ within-subgroup trend.

a strong form of confounding — the aggregate doesn’t just mislead, it points the wrong direction

a lurking variable: shot location

rim_share = RA_FGA / FGA — fraction of shots from the restricted area.

why the scatter is so tight

zone                 league FG%
restricted area      67%
above-the-break 3    36%

whoever takes the easier shots posts the better aggregate number — mechanical, not skillful.

ecological correlation — a warning

ecological correlation

correlations on group averages (one point per state, per team, per player) can be much stronger than correlations on individual observations.

averaging hides within-group variation.

always check the unit of analysis.

why Gordon’s aggregate looks better

Share of shots taken in each zone (%):
                Klay Thompson  Aaron Gordon
RA                        8.2          64.8
PAINT                    10.5          10.9
MID                      20.3           5.2
LC3                       4.1           4.1
RC3                       4.1           3.8
ATB3                     52.8          11.3

Gordon: 65% of his shots come at the rim, where the league shoots 67%.

Klay: 53% of his shots come from above the break, where the league shoots 36%.
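the reversal can be checked by hand: overall FG% is the share-weighted average of the zone FG%s. a sketch using the two tables as printed. the shares are rounded, so the totals match only to about a tenth of a point.

```python
zones = ["RA", "PAINT", "MID", "LC3", "RC3", "ATB3"]

# FG% by zone and share of shots by zone (both in %), copied from the slides.
fg = {
    "Klay Thompson": [75.3, 41.2, 44.8, 30.4, 50.0, 38.7],
    "Aaron Gordon":  [70.8, 24.4, 29.7, 24.1, 48.1, 24.7],
}
share = {
    "Klay Thompson": [8.2, 10.5, 20.3, 4.1, 4.1, 52.8],
    "Aaron Gordon":  [64.8, 10.9, 5.2, 4.1, 3.8, 11.3],
}


def aggregate_fg(player):
    """Overall FG% as the share-weighted average of the zone FG%s."""
    weights = share[player]
    return sum(w * f for w, f in zip(weights, fg[player])) / sum(weights)


klay_total = aggregate_fg("Klay Thompson")
gordon_total = aggregate_fg("Aaron Gordon")
klay_zone_wins = [z for z, k, g in zip(zones, fg["Klay Thompson"], fg["Aaron Gordon"]) if k > g]

print(f"Klay wins {len(klay_zone_wins)} of {len(zones)} zones")
print(f"aggregate: Klay {klay_total:.1f}, Gordon {gordon_total:.1f}")
```

same zone percentages, different weights: Gordon's weight sits on the easiest zone, Klay's on one of the hardest.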

decomposing FG%

shot_mix_FG_PCT: what each player would shoot if they hit the league rate in every zone

skill_above_league: the residual, actual FG% minus the shot-mix expectation

                 FG_PCT  shot_mix_FG_PCT  skill_above_league
Klay Thompson    0.433        0.411              +0.022
Aaron Gordon     0.557        0.573              −0.016

Klay: about 2 points above his shot-mix expectation.

Gordon: about 2 points below his.
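how the decomposition works, as a toy sketch. only the two league rates quoted in the slides (67% at the rim, 36% above the break) are real. the two players, their shot mixes, and their FG% are hypothetical, and the function name is illustrative.

```python
# League FG% by zone: the two figures quoted on the slides (other zones omitted).
league_fg = {"RA": 0.67, "ATB3": 0.36}


def shot_mix_fg(shot_share, league_rates):
    """FG% a player would post by hitting the league rate in every zone they shoot from."""
    return sum(shot_share[z] * league_rates[z] for z in shot_share)


# Hypothetical players (not from the dataset) who shoot only from these two zones.
players = {
    "rim-heavy big":  {"share": {"RA": 0.8, "ATB3": 0.2}, "fg_pct": 0.600},
    "perimeter wing": {"share": {"RA": 0.2, "ATB3": 0.8}, "fg_pct": 0.440},
}

skill = {}
for name, row in players.items():
    expected = shot_mix_fg(row["share"], league_fg)
    skill[name] = row["fg_pct"] - expected     # residual: actual minus shot-mix expectation
    print(f"{name:15s} actual={row['fg_pct']:.3f}  shot_mix={expected:.3f}  skill={skill[name]:+.3f}")
```

the big posts the higher raw FG% but a negative residual, and the wing the reverse: the same pattern as Gordon and Klay.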

the reversal in one picture

confounding

confounder (informal)

a variable tied to both the input we care about and the outcome — creating an aggregate association that doesn’t reflect the mechanism we want to measure.

here:

  • input: which player took the shot
  • outcome: did it go in
  • confounder: which zone

“tied to” is doing a lot of work — formal causal definitions and DAGs in Ch 18.

Bonferroni rejected H_0 for both Gordon and Thompson. Both rejections are statistically real.

what kind of question would each rejection answer correctly?

what kind would it answer wrongly?

the limits of corrections

multiple testing corrections (Bonferroni, BH) protect against finding effects that aren’t there.

they do nothing to protect against finding effects that are there but reflect the wrong mechanism.

next chapter: regression lets us adjust for confounders by including them in the model.

“if you torture the data long enough, it will confess to anything.”

— Ronald Coase

today: right answer, wrong question

multiple testing — fix the threshold

  • m tests at \alpha → expect m \cdot \alpha false positives
  • Bonferroni controls FWER; BH controls FDR
  • replication crisis = same arithmetic across labs

confounding — fix the question

  • correlation can reflect a third variable
  • Simpson’s paradox: aggregate trend reverses inside subgroups
  • significant ≠ the right question

correlation is not causation.

next: regression

Bonferroni and BH protect us from false discoveries.

they don’t protect us from wrong-mechanism discoveries.

Ch 12: include the confounder in the model — adjust the question, not just the threshold.

one-minute feedback

  1. what was the most useful thing you learned today?
  2. what was the most confusing?
