MSE 125 — Slides – Lecture 10: Hypothesis Testing

in chapter 9 we shuffled labels and got p ≈ 10⁻⁴

the drug works

but in 1991 the trial hadn’t been run yet

logistics

project: proposal due this Friday
quiz 5: Wed May 6; Lec 10-11 (hypothesis testing + multiple testing)
HW 3: due Fri May 8

the design questions Ch 9 left on the table

before any data exists:

how many patients do we enroll?
what p-value would convince us?
what errors are we willing to make, and at what rate?

permutation tests need data to shuffle. these questions need a formula

today

the framework: \(H_0\), \(H_1\), the 4-step recipe, formalized
two errors, not one: Type I, Type II, power
from simulation to formula: Welch’s t-test
power: how many patients? requires guessing the effect size
significance ≠ importance

the framework

the recipe

every hypothesis test, same four steps:

state \(H_0\) and \(H_1\)
choose a test statistic
determine the null distribution
compute the p-value, reject if \(p < \alpha\)

Ch 9 walked through all four with a permutation test. today we name the parts and swap step 3 for a formula
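A minimal sketch of the four steps as a permutation test, for reference. The data here are synthetic (two arms of 50 with a true shift of 1.0), and the variable names are illustrative rather than taken from Ch 9.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two arms of 50 with a true shift of 1.0 (illustrative only).
treatment = rng.normal(loc=1.0, scale=1.0, size=50)
control = rng.normal(loc=0.0, scale=1.0, size=50)
n_t = treatment.size

# Step 1: H0: mu_T - mu_C = 0   vs   H1: mu_T - mu_C != 0
# Step 2: test statistic = difference of sample means
observed = treatment.mean() - control.mean()

# Step 3: null distribution by shuffling the arm labels
pooled = np.concatenate([treatment, control])
n_shuffles = 10_000
null_stats = np.empty(n_shuffles)
for i in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    null_stats[i] = shuffled[:n_t].mean() - shuffled[n_t:].mean()

# Step 4: two-sided p-value, then compare with alpha
p_value = np.mean(np.abs(null_stats) >= abs(observed))
alpha = 0.05
reject = p_value < alpha
print(f"observed diff = {observed:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```

Only step 3 changes when the shuffle loop is replaced by a formula; steps 1, 2, and 4 stay as they are.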

step 1: competing claims

for the clinical trial:

\(H_0\): \(\mu_T - \mu_C = 0\) (no effect)
\(H_1\): \(\mu_T - \mu_C \neq 0\) (some effect)

note: \(H_1\) is a family of effects, not a single value

a 5-cell effect, a 50-cell effect, a 500-cell effect: all live inside \(H_1\)

state \(H_0\) and \(H_1\) for each scenario

a new website layout might increase sign-ups
a coin might be unfair
a pollution standard might not be met

\(H_0\) and \(H_1\): the formal version

null and alternative hypotheses

\(H_0\) = a single, specific claim about the world (typically “no effect”)

\(H_1\) = the complement of \(H_0\): a family of states ruled out by the null

a hypothesis test asks whether the data are surprising enough under \(H_0\) to reject it

the test does not require us to pick a specific effect inside \(H_1\); only to decide whether \(H_0\) is ruled out

two errors, not one

you’re the FDA reviewing a new drug. which mistake is worse?

A. approve a useless drug: side effects, no benefit
B. reject a drug that saves lives: patients die who could have been saved

DISCUSSION: Poll + debrief (3 min). Hands up for A or B (no fence-sitting); debrief by asking why students chose what they chose.
Prompt: Which mistake is worse?
Format: Quick hand raise (A or B), then ask 1-2 students from each side to justify.
Process goal: force students to feel the asymmetry before we give it names.
There’s no single right answer: it depends on disease severity, the drug’s side effects, and what alternatives exist. A useless cancer drug with brutal side effects is terrible to approve; rejecting a cure for a fatal disease with no alternatives is terrible too. The point is that the two errors have different costs, and those costs should drive the choice of α.
If stuck: “What if the disease is fatal and there are no other treatments? What if the drug has serious side effects?”
Key insight: the two errors are not symmetric; their costs depend on context.

the courtroom analogy

think of \(H_0\) as “innocent until proven guilty”

Type I error = convicting an innocent person
- rejected \(H_0\) when it was actually true

Type II error = letting a guilty person go free
- failed to reject \(H_0\) when \(H_1\) was actually true

the α-β tradeoff

\(\alpha\) = controlled directly: it’s where we draw the threshold

\(\beta\) = controlled indirectly: depends on the true effect size, which we don’t know

there is no free lunch. moving the threshold trades \(\alpha\) against \(\beta\)

Predict-first prompt before advancing: ask students what happens to the false-positive region (under H₀) and to the false-negative region (under H₁) as the rejection threshold moves right (stricter α). Hold for 30 seconds, no hands. Then click forward — bullets reveal one at a time.

Two bell curves: null centered at zero, alternative centered at the assumed effect. Threshold is the vertical line. Blue (null past threshold) = α, the false positive rate, set directly by where we draw the threshold. Red (alternative inside the threshold) = β, the false negative rate — NOT directly controlled because it depends on how far the alternative is shifted (the true effect size, unknown). Slide the threshold right: blue shrinks, red grows. Power analysis (Block 4) computes β GIVEN an assumed effect size — that’s how we reason about β even though we can’t pin it down without knowing the truth.
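The two-curve picture can be reproduced numerically. This is a sketch under assumed values: the test statistic is taken to be N(0, 1) under the null and N(2, 1) under one specific alternative, with a one-sided rejection region to the right of the threshold. The shift of 2 and the function name `error_rates` are illustrative choices.

```python
from scipy import stats

# Assumed setup (illustrative): statistic ~ N(0, 1) under H0, ~ N(shift, 1) under
# one specific alternative. One-sided test: reject when the statistic > threshold.
shift = 2.0

def error_rates(threshold, shift=shift):
    alpha = stats.norm.sf(threshold)             # null mass past the threshold
    beta = stats.norm.cdf(threshold, loc=shift)  # alternative mass short of the threshold
    return alpha, beta

for threshold in (1.0, 1.645, 2.326):
    alpha, beta = error_rates(threshold)
    print(f"threshold={threshold:.3f}  alpha={alpha:.3f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

Moving the threshold right lowers alpha and raises beta in every row; changing `shift` moves beta without touching alpha.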

the error table

	\(H_0\) true (no effect)	\(H_1\) true (real effect)
reject \(H_0\)	Type I error (\(\alpha\))	correct
fail to reject \(H_0\)	correct	Type II error (\(\beta\))

power = \(1 - \beta\) = probability of correctly detecting a real effect

definitions: Type I, Type II, power

significance level, Type I error, Type II error, power

significance level \(\alpha\): threshold for rejecting \(H_0\); equals the Type I error rate \(P(\text{reject } H_0 \mid H_0 \text{ true})\)
Type II error rate \(\beta\): \(P(\text{fail to reject } H_0 \mid H_1 \text{ true})\)
power \(= 1 - \beta\): probability of correctly detecting a real effect

“fail to reject”, not “accept”

we say “fail to reject \(H_0\)”, never “accept \(H_0\)”

a non-significant result means the data are compatible with \(H_0\)

but they might also be compatible with many other hypotheses

absence of evidence is not evidence of absence

p-values are uniform under \(H_0\)

10,000 t-tests on splits of the control arm (no effect possible) → 4.6% reject at \(\alpha = 0.05\)

so \(\alpha\) does what it advertises: uniform p-values mean \(P(p < \alpha \mid H_0) = \alpha\), and 4.6% is close to the nominal 5%
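A sketch of the same check on synthetic data. The control-arm data are not reproduced here, so two samples from one normal distribution stand in for "splits with no effect possible"; that substitution is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Assumption: synthetic normal data stand in for splits of the control arm.
n_tests, n_per_group = 10_000, 100
a = rng.normal(size=(n_tests, n_per_group))
b = rng.normal(size=(n_tests, n_per_group))  # same distribution, so H0 is true by construction

# One Welch t-test per row
p_values = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue

# Under H0, the fraction of p-values below alpha should be close to alpha, for every alpha.
for alpha in (0.01, 0.05, 0.10, 0.50):
    print(f"alpha={alpha:.2f}  fraction rejected={np.mean(p_values < alpha):.3f}")
```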

from simulation to formula

why a formula?

Ch 9: null distribution by shuffling: 10,000 fake datasets, each producing one fake statistic

design question: “with 200 patients per arm and a 30-cell effect, how often will I reject?”

no data to shuffle. we need a closed-form null distribution that depends only on \(n_T\) and \(n_C\)

the brewer who solved this in 1908

William Sealy Gosset at the Guinness brewery, Dublin
compared barley varieties with tiny samples (5–10 batches)
normal approximation unreliable at small \(n\)
couldn’t bootstrap or simulate: no computers
so he derived the exact small-sample distribution analytically

published in 1908 as “Student”; Guinness banned real names

that’s why it’s called Student’s t-distribution

Salsburg, The Lady Tasting Tea, Ch. 2

what does t look like?

fat tails at small \(n\): normal cutoff over-rejects under \(H_0\)
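To put numbers on the fat tails, the sketch below compares the two-sided 5% cutoff of the t-distribution with the normal cutoff, and computes the false-positive rate you would actually get by applying the normal cutoff to a t-distributed statistic. The degrees of freedom listed are arbitrary illustrative values.

```python
from scipy import stats

z_cut = stats.norm.ppf(0.975)  # two-sided 5% cutoff under the normal, about 1.96

for df in (4, 9, 29, 199):
    t_cut = stats.t.ppf(0.975, df)
    # False-positive rate if the normal cutoff is applied to a t-distributed statistic
    actual_alpha = 2 * stats.t.sf(z_cut, df)
    print(f"df={df:>3}  t cutoff={t_cut:.3f}  rate with normal cutoff={actual_alpha:.3f}")
```

The gap is large at small degrees of freedom and nearly gone by a few hundred.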

Welch’s t-statistic

a refinement of Gosset’s test that drops the equal-variance assumption. using Ch 8 notation (sample means \(\bar X_T, \bar X_C\), sample variances \(s_T^2, s_C^2\), sample sizes \(n_T, n_C\)):

\[t = \frac{\bar{X}_T - \bar{X}_C}{\left(s_T^2/n_T + s_C^2/n_C\right)^{1/2}}\]

numerator: the observed effect (Ch 8 difference of means)
denominator: the standard error \(\widehat{\text{SE}}\) from Ch 8

gloss: how many standard errors is the effect from zero?
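A sketch that computes the statistic by hand and checks it against `scipy.stats.ttest_ind`. The data are synthetic. The degrees-of-freedom line uses the Welch-Satterthwaite approximation, which is not stated on the slide but is the reference distribution behind `equal_var=False`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical data with unequal variances and unequal sample sizes
x_t = rng.normal(loc=50, scale=120, size=80)
x_c = rng.normal(loc=0, scale=90, size=120)

# Squared standard error: s_T^2/n_T + s_C^2/n_C (sample variances use ddof=1)
v_t = x_t.var(ddof=1) / x_t.size
v_c = x_c.var(ddof=1) / x_c.size
t_manual = (x_t.mean() - x_c.mean()) / np.sqrt(v_t + v_c)

# Welch-Satterthwaite degrees of freedom for the reference t-distribution
df = (v_t + v_c) ** 2 / (v_t**2 / (x_t.size - 1) + v_c**2 / (x_c.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(x_t, x_c, equal_var=False)
print(f"t by hand = {t_manual:.4f}, t from scipy = {t_scipy:.4f}")
print(f"p by hand = {p_manual:.4g}, p from scipy = {p_scipy:.4g}")
```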

why Welch’s specifically?

two flavors of two-sample t-test:

Student’s: assumes equal variance in both groups
Welch’s: doesn’t

real data: variances rarely equal

equal_var=False is the safe choice: use it unless you have a positive reason to assume equal variance (scipy’s own default is equal_var=True, i.e. Student’s)
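One way to see what is at stake is a small simulation in which the null is true but the smaller group is much noisier than the larger one. The group sizes and standard deviations below are made up to make the contrast visible; they are not from the course data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 5_000

# H0 is true (equal means), but the small group has 5x the standard deviation.
small = rng.normal(loc=0, scale=5, size=(n_sims, 10))
large = rng.normal(loc=0, scale=1, size=(n_sims, 50))

p_student = stats.ttest_ind(small, large, axis=1, equal_var=True).pvalue
p_welch = stats.ttest_ind(small, large, axis=1, equal_var=False).pvalue

student_rate = np.mean(p_student < 0.05)
welch_rate = np.mean(p_welch < 0.05)
print(f"false-positive rate  Student: {student_rate:.3f}   Welch: {welch_rate:.3f}")
```

In this setup the pooled variance understates the noise in the difference of means, so Student's version rejects far more often than 5%, while Welch's stays near the nominal rate.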

Welch’s on ACTG 175

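from scipy import stats

# treatment, control: per-patient outcome arrays for the two arms
# (the same data used for the Ch 9 permutation test)
t_stat, p_value = stats.ttest_ind(
    treatment, control, equal_var=False  # equal_var=False selects Welch's test
)
print(f"t-statistic: {t_stat:.2f}")   # t-statistic: 9.46
print(f"p-value:     {p_value:.1e}")  # p-value:     2.8e-19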

permutation (Ch 9): \(p \approx 10^{-4}\), a floor set by the 10,000 shuffles; the shuffle-based estimate cannot resolve p-values much below \(1/10{,}000\)

same conclusion. but the formula will also tell us what would have happened with 100 patients per arm, or 30, or 1000

A/B testing: same recipe, different stakes

a growth team tests a new homepage layout:

variant B: 4.2% sign-ups, \(n = 12{,}400\)
variant A: 4.0% sign-ups, \(n = 12{,}200\)

Welch’s logic, identical machinery → \(z \approx 0.8\), \(p \approx 0.43\): a 0.2-point lift is within noise at these sample sizes

Type I error: ship a feature that doesn’t help (dev cost, maintenance)

Type II error: kill a feature that would have moved the needle

most companies use a less strict \(\alpha\) than the FDA: wrong shipping decisions are reversible

trial result: p = 0.04

colleague A

“just barely significant; shouldn’t be trusted”

colleague B

“p < 0.05; we reject”

who is right? what does this reveal about the 0.05 threshold?

DISCUSSION: Think-pair-share (4 min). Think individually, then discuss with a neighbor; be ready to defend either side.
Prompt: p = 0.04: who is right?
Process goal: surface the arbitrariness of α = 0.05 and the fact that p-values are continuous evidence, not binary verdicts.
Both colleagues are partly right. B is correct procedurally: if you committed to α = 0.05 before seeing the data, p = 0.04 crosses the threshold. A raises the deeper point: p = 0.04 and p = 0.06 carry nearly identical evidence, yet one “rejects” and the other doesn’t. The 0.05 threshold is a convention, not a law of nature.
If stuck: “Is p = 0.049 fundamentally different from p = 0.051?”
Key insight: α is a decision tool, not a truth detector. Reasonable people can disagree where to set it. The right α depends on the stakes; Block 4 makes this explicit.

what should \(\alpha\) be?

field	typical \(\alpha\)	why
social science	0.05	convention (Fisher, 1925)
clinical trials	two trials, each 0.05	combined: ~0.0025
particle physics	~0.0000003	“5-sigma”

the choice depends on the cost of each error type, not on convention
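The arithmetic behind the last two rows of the table, as a short sketch. It assumes the two trials are independent and reads "5-sigma" as a one-sided normal tail.

```python
from scipy import stats

# Two independent trials, each significant at 0.05:
# chance that both are false positives when H0 is true
two_trials = 0.05 ** 2

# "5-sigma": one-sided normal tail beyond 5 standard deviations
five_sigma = stats.norm.sf(5)

print(f"two trials: {two_trials:.4f}   5-sigma: {five_sigma:.1e}")
```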

power: how many patients?

power = the inverse question

so far: given data, what’s the p-value?

now: given a hoped-for effect, what’s the chance of a small p-value?

what study designers compute before the trial, to decide \(n\)

the catch: power needs a specific \(H_1\)

remember: \(H_1\) is a family of effects

power asks: “what’s the chance we reject if the true effect is \(\Delta\)?”

different \(\Delta\) → different power

so: we have to commit to a specific effect size to compute one number

what does picking \(\Delta\) mean?

each \(\Delta\) = a different world we might be in

bigger \(\Delta\) → populations farther apart → easier to detect

you must guess the effect size

no effect size, no power analysis

power = \(P(\text{reject} \mid \text{true effect} = \Delta)\); needs a value of \(\Delta\)

where the guess comes from:

prior studies of similar interventions (most defensible)
the smallest effect that matters clinically or commercially
a pilot study designed to estimate plausible effects

doubling the assumed effect cuts the required \(n\) to roughly a quarter, since \(n \propto 1/\Delta^2\)
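The inverse-square relationship can be seen from the textbook normal-approximation formula for a two-sided two-sample test, \(n \approx 2\,(z_{1-\alpha/2} + z_{\text{power}})^2 / d^2\) per group, with \(d\) the effect in standard-deviation units. This is an approximation swapped in for illustration; t-based solvers return slightly larger values. The helper name `n_per_group` is hypothetical.

```python
from scipy import stats

def n_per_group(d, power=0.8, alpha=0.05):
    """Approximate per-group n for a two-sided two-sample test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

for d in (0.29, 0.58):
    print(f"d={d:.2f}  n per group ~ {n_per_group(d):.0f}")
```

Because \(d\) enters only through \(1/d^2\), the ratio between the two rows is exactly 4 under this approximation.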

predict: how much power?

true effect = 30 CD4 cells, 100 patients per arm

what fraction of trials correctly reject \(H_0\)?

90%? 75%? 50%? 25%?

power ≈ 0.5: a coin flip

that’s an underpowered study
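A simulation sketch of this prediction. It assumes normally distributed outcomes with a common standard deviation of 105 (the value implied by d = 30 / 105 on the sample-size slide); real CD4 data need not look like that.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_trials, n_per_arm = 5_000, 100
effect, sd = 30.0, 105.0  # sd = 105 as implied by Cohen's d = 30 / 105

# Assumption: outcomes are normal with a common sd in both arms.
treatment = rng.normal(loc=effect, scale=sd, size=(n_trials, n_per_arm))
control = rng.normal(loc=0.0, scale=sd, size=(n_trials, n_per_arm))

# One Welch t-test per simulated trial; power = fraction that reject at 0.05
p_values = stats.ttest_ind(treatment, control, axis=1, equal_var=False).pvalue
power = np.mean(p_values < 0.05)
print(f"estimated power: {power:.2f}")
```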

power curves: effect size vs sample size

large effect (50 CD4): detectable with small samples
small effect (10 CD4): needs hundreds per group
gray line = 80% power target

sample size planning: the formula

from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()

n_needed = power_analysis.solve_power(
    effect_size=0.29,   # Cohen's d = 30 / 105
    power=0.8,
    alpha=0.05
)
# → ~190 per group

Cohen’s \(d = \Delta / \sigma\): effect in standard-deviation units

(small \(\approx\) 0.2, medium \(\approx\) 0.5, large \(\approx\) 0.8)

CI/test duality: one curve, two readings

CI = where the curve sits above \(\alpha\)
reject = where the curve dips below \(\alpha\)

The duality made visual. Fix the observed data; plot the p-value as a function of the candidate null θ₀. You get one curve, peaked at the observed mean (where p = 1 by definition — the data are perfectly consistent with the null θ₀ = \(\bar x_{\rm obs}\)). Now draw a horizontal line at α = 0.05.

Read the picture two ways. The shaded region — the part of the curve sitting above α — is exactly the 95% CI: [39.6, 61.2]. Any θ₀ where the curve dips below α is a value the test would reject. Two illustrative points: θ₀ = 55 sits inside the CI (p = 0.40, fail to reject); θ₀ = 38 sits just outside (p = 0.024, reject). The CI endpoints are literally where the curve crosses α — that’s the duality.

The chapter’s actual null is θ₀ = 0, far off-scale to the left, with p ≈ 10⁻¹⁹ — way below α, way outside CI. Both readings agree.

Practical takeaway: once you have a CI from Ch 8, you have a “did-this-θ₀-test-reject” oracle for every θ₀ at once. The CI is strictly more informative than any single test. Bootstrap (Ch 8) and permutation (Ch 9) gave the same conclusion not by coincidence — they’re two paths to the same horizontal-line test against this curve.
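A sketch of the curve described above. The estimate and standard error are back-calculated from the quoted interval [39.6, 61.2], assuming it is a symmetric normal-approximation interval; the function name `p_value` is illustrative.

```python
from scipy import stats

# Back out the estimate and standard error from the quoted 95% CI.
# Assumption: a symmetric interval of the form estimate +/- 1.96 * SE.
ci_low, ci_high = 39.6, 61.2
estimate = (ci_low + ci_high) / 2
se = (ci_high - ci_low) / (2 * stats.norm.ppf(0.975))

def p_value(theta0):
    """Two-sided p-value for H0: theta = theta0, with the observed estimate held fixed."""
    z = (estimate - theta0) / se
    return 2 * stats.norm.sf(abs(z))

for theta0 in (38, 45, 55, 65):
    inside = ci_low <= theta0 <= ci_high
    print(f"theta0={theta0}  p={p_value(theta0):.3f}  inside CI: {inside}")

# At the CI endpoints the curve crosses alpha
print(f"p at endpoints: {p_value(ci_low):.3f}, {p_value(ci_high):.3f}")
```

Under that assumption the two illustrative points in the notes come out close to the quoted p-values.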

use tests only when you need a binary decision

binary decision (ship / don’t-ship, approve / reject)

→ hypothesis test is the right tool

every other question: how big is the effect? how uncertain?

→ reach for a CI

a common statistical mistake in industry: using a test when estimation would have answered the actual question

significance ≠ importance

with enough data, anything is “significant”

the standard error shrinks like \(1/\sqrt{n}\), so for any fixed nonzero effect the p-value shrinks as \(n\) grows

the effect itself doesn’t have to be large, only nonzero
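A sketch with made-up numbers: a fixed difference of 1 unit against a standard deviation of 20 (d = 0.05), evaluated at growing sample sizes. For simplicity the estimate is assumed to land exactly on the true effect.

```python
import numpy as np
from scipy import stats

# Hypothetical numbers: a fixed true difference of 1 unit, sd of 20 in each group.
effect, sd = 1.0, 20.0

p_by_n = {}
for n_per_group in (100, 1_000, 10_000, 100_000):
    se = sd * np.sqrt(2 / n_per_group)  # standard error of a difference of means
    z = effect / se                     # z-score if the estimate equals the true effect
    p_by_n[n_per_group] = 2 * stats.norm.sf(z)
    print(f"n={n_per_group:>7}  SE={se:.3f}  z={z:.2f}  p={p_by_n[n_per_group]:.2g}")
```

The effect is the same in every row; only the sample size changes.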

the blood pressure drug

a clinical trial with 10,000 participants finds:

p = 0.013 (statistically significant at \(\alpha = 0.05\))
actual effect: 2 mmHg decrease in systolic BP

for context: a cup of coffee temporarily raises BP by about 5 mmHg

the drug’s effect is real, it’s not zero, but it’s smaller than your morning coffee

Mesas et al., Journal of Hypertension, 2011

blood pressure drug: p = 0.013, effect = 2 mmHg, n = 10,000

would you recommend it?

consider:

costs and side effects
alternative treatments
what “clinically meaningful” means

DISCUSSION: Design challenge (5 min). Take a position and defend it; what additional information would change your mind?
Prompt: Would you recommend this drug?
Process goal: force the distinction between statistical significance and practical importance into a concrete decision.
Defensible answers:
- “No: 2 mmHg is clinically meaningless. Lifestyle changes (exercise, diet) produce 5-10 mmHg reductions without side effects.”
- “It depends: if this is the only option for patients who can’t exercise and are already on other medications, even 2 mmHg might matter at the population level.”
- “Need more info: what are the side effects? What’s the baseline blood pressure? What does the drug cost?”
If stuck: “Would you take a daily pill to get an effect smaller than your morning coffee?”
Key insight: always report effect sizes and CIs alongside p-values.

always report effect sizes

a small p-value tells you: data this extreme would be unlikely if the effect were zero

it does not tell you: the effect is large or important

always report:

effect size
confidence interval
p-value

all three together; never p-value alone
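For the blood pressure example, the slide gives the effect and the p-value but no interval. The sketch below back-calculates a rough 95% CI from those two numbers, assuming the p-value came from a two-sided z-test; the interval is therefore an illustration, not a reported result.

```python
from scipy import stats

# Numbers from the blood pressure slide; the CI is back-calculated, not reported.
effect_mmhg, p_value = 2.0, 0.013

# Assumption: the p-value came from a two-sided z-test (normal approximation).
z = stats.norm.isf(p_value / 2)
se = effect_mmhg / z
half_width = stats.norm.ppf(0.975) * se
ci = (effect_mmhg - half_width, effect_mmhg + half_width)

print(f"effect = {effect_mmhg:.1f} mmHg, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}], p = {p_value}")
```

Under that assumption the whole interval excludes zero yet stays below the 5 mmHg coffee figure, which is the point the p-value alone cannot convey.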

summary

Ch 10 = study design: two errors, formula-based test, power analysis
\(H_1\) is a family: power analysis forces us to pick a specific effect inside it
Welch’s t-test is the formula-based analog of permutation: solvable for studies you haven’t run yet
use estimation by default; reach for tests when the decision is binary
always report effect sizes alongside p-values

next time

we can test one hypothesis carefully

what happens when you test 20 at once?

Ch 11: multiple testing, Bonferroni correction, false discovery rate, p-hacking

one-minute feedback

what was the most useful thing you learned today?
what was the most confusing?
