Hypothesis Testing Framework

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12

# Load data
DATA_DIR = 'data'
np.random.seed(42)

The average cost to bring a drug to market exceeds $1.3 billion. A clinical trial just produced a p-value of 0.00000000000000000028. Time to approve the drug? Not so fast.

In Chapter 9, we used permutation to simulate what the treatment effect would look like if the drug did nothing — that gave us a p-value. Today we formalize that machinery and discover its limitations.

Setup: Recap the Clinical Trial

The original ACTG 175 trial had four arms; here we compare control vs. all alternative treatments combined. Let’s reload the data and the observed treatment effect.

Code
df = pd.read_csv(f'{DATA_DIR}/clinical-trial/ACTG175.csv')
df['cd4_change'] = df['cd420'] - df['cd40']

control = df[df['treat'] == 0]['cd4_change'].dropna()
treatment = df[df['treat'] == 1]['cd4_change'].dropna()
observed_effect = treatment.mean() - control.mean()

print(f"Observed treatment effect: {observed_effect:.1f} CD4 cells")
print(f"Control n = {len(control)}, Treatment n = {len(treatment)}")
Observed treatment effect: 50.4 CD4 cells
Control n = 532, Treatment n = 1607

Before formalizing anything, let’s look at what we’re working with. Do the two groups actually look different?

Code
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(control, bins=40, alpha=0.5, label='Control', density=True)
ax.hist(treatment, bins=40, alpha=0.5, label='Treatment', density=True)
ax.axvline(control.mean(), color='C0', ls='--', lw=2)
ax.axvline(treatment.mean(), color='C1', ls='--', lw=2)
ax.set_xlabel('CD4 Change')
ax.set_ylabel('Density')
ax.set_title('Control vs Treatment: CD4 Cell Changes')
ax.legend()
plt.show()

The distributions overlap a lot, but the treatment group is shifted to the right. Is that shift real, or could it be random noise? How different is “different enough” to conclude the drug works?

That’s exactly what hypothesis testing answers. Let’s be precise.

The Hypothesis Testing Framework

Here’s the recipe for any hypothesis test — a checklist you can follow for any problem:

  1. State \(H_0\) and \(H_1\) — what are the competing claims?
  2. Choose a test statistic — what number summarizes the evidence?
  3. Determine the null distribution — what would the test statistic look like if \(H_0\) were true?
  4. Compute the p-value — how extreme is your observed test statistic under \(H_0\)?
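
The recipe can be sketched as one generic function. Here it is as a permutation test on synthetic data (the arrays and the `permutation_test` helper are illustrative, not part of the trial analysis; Step 1 is implicit: \(H_0\) says the group labels are exchangeable):

```python
import numpy as np

def permutation_test(a, b, n_perms=10_000, seed=0):
    """Steps 2-4 of the recipe, with 'difference in means' as the test statistic."""
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()            # Step 2: summarize the evidence
    combined = np.concatenate([a, b])
    null_stats = np.empty(n_perms)
    for i in range(n_perms):                  # Step 3: simulate under H0 by shuffling labels
        shuffled = rng.permutation(combined)
        null_stats[i] = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
    return np.mean(np.abs(null_stats) >= abs(observed))  # Step 4: two-sided p-value

rng = np.random.default_rng(1)
fake_control = rng.normal(0, 1, 200)
fake_treatment = rng.normal(0.8, 1, 200)      # a real shift of 0.8
p = permutation_test(fake_treatment, fake_control)
print(f"permutation p-value: {p:.4f}")
```

With a true shift of 0.8 standard deviations and 200 points per group, the p-value comes out at or near zero; with no shift it could land anywhere in [0, 1].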

For our clinical trial:

  • Null hypothesis \(H_0\): the treatment has no effect (mean difference = 0)
  • Alternative hypothesis \(H_1\): the treatment has an effect (mean difference \(\neq\) 0)
  • Test statistic: difference in sample means (or the t-statistic)
  • p-value: probability of seeing a result at least this extreme, if \(H_0\) were true

Definition: Null and Alternative Hypotheses

The null hypothesis (\(H_0\)) is the default claim — typically “no effect” or “no difference.” The alternative hypothesis (\(H_1\)) is the competing claim you’re testing for. A hypothesis test asks whether the data provide enough evidence to reject \(H_0\) in favor of \(H_1\).

Definition: P-value

The p-value is the probability of observing a test statistic at least as extreme as the one computed from the data, assuming \(H_0\) is true. It is not the probability that \(H_0\) is true.

Warning: The #1 p-value misconception

The p-value is NOT the probability that \(H_0\) is true. It’s the probability of getting data at least as extreme as ours if \(H_0\) were true. Big difference.

If you test “is this coin fair” and get p = 0.03, it does NOT mean there’s a 3% chance the coin is fair. It means: IF the coin were fair, there’d be only a 3% chance of seeing data this extreme.

Significance Level \(\alpha\)

We need a threshold: how extreme is “extreme enough” to reject \(H_0\)? This threshold is called the significance level \(\alpha\).

Definition: Significance Level

The significance level \(\alpha\) is the threshold for rejecting \(H_0\): if the p-value is less than \(\alpha\), we reject \(H_0\). It also equals the Type I error rate — the probability of rejecting \(H_0\) when it is actually true.

Convention: \(\alpha = 0.05\). But why 0.05? R.A. Fisher suggested 0.05 in 1925 as a “convenient” threshold. It stuck, but there’s nothing magical about it. Some fields use 0.01; particle physics uses roughly 0.0000003. We’ll see in a moment why the choice should depend on the stakes.
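These thresholds are often quoted as z-scores ("sigmas"). A quick sketch converting between the two with scipy's normal distribution:

```python
from scipy.stats import norm

# Two-sided alpha -> z-score cutoff
for alpha in [0.05, 0.01]:
    print(f"alpha = {alpha}: reject if |z| > {norm.ppf(1 - alpha / 2):.2f}")

# Particle physics' "5-sigma" rule as a one-sided tail probability
print(f"5-sigma tail probability: {norm.sf(5):.1e}")  # roughly 0.0000003
```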

Note: The brewer who invented small-sample testing

The t-test we’re about to use has a surprising origin. In the early 1900s, William Sealy Gosset worked as a chemist at the Guinness brewery in Dublin. He needed to compare barley varieties and yeast strains — but with only a handful of samples from each batch, the normal approximation was unreliable. Gosset worked out the exact distribution of the sample mean divided by the sample standard deviation for small samples and published it in 1908 under the pseudonym “Student” (Guinness didn’t allow employees to publish under their own names, fearing competitors would learn they were using statistics). His “Student’s t-distribution” is the foundation of the t-test we use today — and the reason it’s still called “Student’s t.” (See Salsburg, The Lady Tasting Tea, Ch. 2.)

We use Welch’s t-test (equal_var=False), which does not assume the two groups have equal variance. This choice is almost always the right default.

\[t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}\]
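To connect the formula to the library call we use next, here it is computed by hand on synthetic samples and checked against scipy (the sample data are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(10, 3, 80)   # two groups with unequal variances and sizes
x2 = rng.normal(8, 5, 60)

# Welch t-statistic straight from the formula
se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_manual = (x1.mean() - x2.mean()) / se

t_scipy, _ = stats.ttest_ind(x1, x2, equal_var=False)
print(f"manual: {t_manual:.4f}, scipy: {t_scipy:.4f}")
```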

Code
# Welch's t-test
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.2e}")
print(f"\nAt alpha = 0.05: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'}")
print(f"At alpha = 0.01: {'Reject H0' if p_value < 0.01 else 'Fail to reject H0'}")
print(f"At alpha = 0.001: {'Reject H0' if p_value < 0.001 else 'Fail to reject H0'}")
t-statistic: 9.15
p-value: 2.79e-19

At alpha = 0.05: Reject H0
At alpha = 0.01: Reject H0
At alpha = 0.001: Reject H0

The p-value here is so small that the choice of \(\alpha\) doesn’t matter. But for borderline cases, it matters a lot.

Tip: Think About It

Try changing the alpha values above. What’s the smallest alpha at which you’d still reject \(H_0\) for this trial?

Two Types of Errors

Tip: Think About It

Which mistake is worse — approving a useless drug, or rejecting a drug that actually saves lives?

Think of it like a courtroom trial. The null hypothesis is “innocent until proven guilty.”

  • Type I error = convicting an innocent person. We rejected \(H_0\) when it was actually true. In our clinical trial: we recommend a useless drug to patients.
  • Type II error = letting a guilty person go free. We failed to reject \(H_0\) when \(H_1\) was actually true. In our trial: we reject a drug that actually saves lives.

Now here’s the formal summary:

  • Reject \(H_0\) when \(H_0\) is true: Type I error (false positive)
  • Reject \(H_0\) when \(H_1\) is true: correct!
  • Fail to reject \(H_0\) when \(H_0\) is true: correct!
  • Fail to reject \(H_0\) when \(H_1\) is true: Type II error (false negative)

Definition: Type I Error, Type II Error, and Power
  • Type I error: Rejecting \(H_0\) when it is true (false positive). Its rate equals \(\alpha\).
  • Type II error: Failing to reject \(H_0\) when \(H_1\) is true (false negative). Its rate is \(\beta\).
  • Power: \(1 - \beta\) = the probability of correctly detecting a real effect.
Important: “Fail to reject” — not “accept”

Notice we say “fail to reject \(H_0\),” not “accept \(H_0\).” A non-significant result means the data are compatible with \(H_0\) — but they might also be compatible with many other hypotheses. Absence of evidence is not evidence of absence.

As Einstein (reportedly) put it: “No amount of experimentation can ever prove me right; a single experiment can prove me wrong.”

Let’s visualize the tradeoff. Under the null, the test statistic is centered at zero. Under the alternative, it’s shifted. The overlap is where errors happen. (The plot shows only the right tail of a two-sided test; the left tail is symmetric.)

Code
sigma = control.std()
n_per = len(control)
se = sigma * np.sqrt(2 / n_per)

x = np.linspace(-4*se, observed_effect + 4*se, 500)
null_dist = norm.pdf(x, 0, se)
alt_dist = norm.pdf(x, observed_effect, se)
cutoff = norm.ppf(1 - 0.025, 0, se)  # two-sided alpha=0.05

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, null_dist, 'b-', lw=2, label='Null ($H_0$: no effect)')
ax.plot(x, alt_dist, 'r-', lw=2, label=f'Alternative ($H_1$: effect = {observed_effect:.0f})')
ax.fill_between(x[x >= cutoff], null_dist[x >= cutoff], alpha=0.3, color='blue', label='Type I error ($\\alpha/2$, right tail)')
ax.fill_between(x[x <= cutoff], alt_dist[x <= cutoff], alpha=0.3, color='red', label='Type II error ($\\beta$)')
ax.axvline(cutoff, color='gray', ls='--', lw=1.5, label='Rejection threshold')
ax.set_xlabel('Test Statistic (mean difference)')
ax.set_ylabel('Density')
ax.set_title('Type I and Type II Errors: The Tradeoff')
ax.legend(fontsize=9, loc='upper right')
plt.show()

Moving the threshold right (stricter \(\alpha\)) shrinks the blue region (fewer false positives) but grows the red region (more false negatives). There is no free lunch.
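The tradeoff can be made numeric with the same normal approximation the plot uses. For each choice of \(\alpha\), compute the implied Type II error rate \(\beta\) (the effect of 30 and standard error of 10 below are illustrative values, not from the trial):

```python
from scipy.stats import norm

effect, se = 30.0, 10.0   # hypothetical effect size and standard error

betas = {}
for alpha in [0.10, 0.05, 0.01, 0.001]:
    cutoff = norm.ppf(1 - alpha / 2) * se                  # right-tail rejection threshold
    betas[alpha] = norm.cdf(cutoff, loc=effect, scale=se)  # alternative mass left of the cutoff
    print(f"alpha = {alpha:<6} beta = {betas[alpha]:.2f}  power = {1 - betas[alpha]:.2f}")
```

Here \(\beta\) climbs from roughly 0.09 at \(\alpha = 0.10\) to roughly 0.61 at \(\alpha = 0.001\): the red region grows as the blue one shrinks.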

Type I Error: Seeing Effects That Aren’t There

If there is truly no effect, how often will we mistakenly “find” one? By definition, at rate \(\alpha\). Let’s verify with simulation.

We’ll use only the control group (where everyone got the same drug), randomly split them into two fake groups, and test for a difference. There IS no difference — any “significant” result is a false positive.

Code
# Goal: split control into two fake groups and test.
# Any "significant" result is a false positive.
control_vals = control.values
n_sims = 10_000
p_values_null = []

for _ in range(n_sims):
    shuffled = np.random.permutation(control_vals)
    half = len(shuffled) // 2
    fake_a = shuffled[:half]
    fake_b = shuffled[half:]
    _, p = stats.ttest_ind(fake_a, fake_b, equal_var=False)
    p_values_null.append(p)

p_values_null = np.array(p_values_null)
Code
print(f"Fraction with p < 0.05: {np.mean(p_values_null < 0.05):.3f}  (should be ~0.05)")
print(f"Fraction with p < 0.01: {np.mean(p_values_null < 0.01):.3f}  (should be ~0.01)")
print(f"Fraction with p < 0.10: {np.mean(p_values_null < 0.10):.3f}  (should be ~0.10)")
Fraction with p < 0.05: 0.046  (should be ~0.05)
Fraction with p < 0.01: 0.009  (should be ~0.01)
Fraction with p < 0.10: 0.099  (should be ~0.10)
Code
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(p_values_null, bins=50, density=True, color='lightcoral', alpha=0.7, edgecolor='white')
ax.axhline(1.0, color='black', ls='--', lw=1, label='Uniform(0,1)')
ax.axvline(0.05, color='red', ls='--', lw=1.5, label='$\\alpha$ = 0.05')
ax.axvspan(0, 0.05, alpha=0.15, color='red', label='Rejection region')
ax.set_xlabel('p-value')
ax.set_ylabel('Density')
ax.set_title('Distribution of p-values Under the Null (no real effect)')
ax.legend()
plt.show()

Under the null, p-values from a correctly specified test are uniformly distributed on [0, 1] — recall the continuous Uniform distribution from your probability course, a flat density on [0, 1]. This uniformity is exact when the null fully specifies the distribution, and a good approximation otherwise (e.g., for large samples via the CLT). Choosing \(\alpha = 0.05\) therefore gives a 5% false positive rate: exactly 5% of uniform draws fall below 0.05.

The p-value is a random variable too — a key insight we return to throughout the course.
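One way to see both facts at once: simulate test statistics under \(H_0\), convert each to a p-value, and test the resulting sample for uniformity. A sketch using a z-test, where the null distribution is exactly standard normal:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)   # test statistics drawn under H0
p = 2 * norm.sf(np.abs(z))         # two-sided p-values

ks = kstest(p, 'uniform')          # compare against Uniform(0, 1)
print(f"KS statistic: {ks.statistic:.4f}")
print(f"Fraction below 0.05: {np.mean(p < 0.05):.4f}")
```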

Power: Can We Detect a Real Effect?

Our trial found an effect of ~50 CD4 cells — very large, very detectable. But what if the true effect were smaller? Would we still catch it?

Power is the probability of correctly rejecting \(H_0\) when \(H_1\) is true. It depends on:

  1. The true effect size
  2. The sample size
  3. The significance level \(\alpha\)

Let’s build intuition with a single example first. If the true effect is 30 CD4 cells and we have 100 patients per group, what’s the power? We simulate from a normal distribution here — the actual CD4 count distribution is somewhat skewed, but the approximation is useful. The simulation also draws both groups with the same \(\sigma\), a simplification; Welch’s t-test works correctly either way.

Code
def simulate_power(true_effect, n_per_group, alpha=0.05, n_sims=2000):
    """Simulate the power of a two-sample t-test."""
    sigma = control.std()
    rejections = 0
    for _ in range(n_sims):
        group_a = np.random.normal(0, sigma, n_per_group)
        group_b = np.random.normal(true_effect, sigma, n_per_group)
        _, p = stats.ttest_ind(group_a, group_b, equal_var=False)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

# One specific example
power_example = simulate_power(30, 100)
print(f"True effect = 30 CD4 cells, n = 100 per group")
print(f"Power = {power_example:.2f}")
print(f"{'Good enough (>= 0.80)' if power_example >= 0.8 else 'Not enough power (< 0.80) — need more patients!'}")
True effect = 30 CD4 cells, n = 100 per group
Power = 0.53
Not enough power (< 0.80) — need more patients!

Now let’s see how power varies across different effect sizes and sample sizes.

Code
# Power vs sample size for different effect sizes
# This cell takes ~15 seconds to run
sample_sizes = [25, 50, 100, 200, 500, 1000]
effect_sizes = [10, 30, 50]

fig, ax = plt.subplots(figsize=(10, 6))
for effect in effect_sizes:
    print(f"Computing power for effect = {effect} CD4 cells...")
    powers = [simulate_power(effect, n) for n in sample_sizes]
    ax.plot(sample_sizes, powers, 'o-', lw=2, label=f'Effect = {effect} CD4 cells')

ax.axhline(0.8, color='gray', ls='--', lw=1, label='Power = 0.80 (conventional target)')
ax.set_xlabel('Sample Size (per group)')
ax.set_ylabel('Power')
ax.set_title('Power: Probability of Detecting a Real Effect')
ax.legend()
ax.set_ylim(0, 1.05)
plt.show()
Computing power for effect = 10 CD4 cells...
Computing power for effect = 30 CD4 cells...
Computing power for effect = 50 CD4 cells...

Key observations:

  • A large effect (50 CD4 cells) is detectable even with small samples.
  • A small effect (10 CD4 cells) needs hundreds of patients per group.
  • The conventional target is 80% power. Below that, you’re more likely to miss a real effect than to find it.
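
These simulated curves also match a closed-form normal approximation, \(\text{power} \approx \Phi(|\Delta|/\text{SE} - z_{1-\alpha/2})\), which ignores the far rejection tail. A sketch (the value \(\sigma = 105\) is a rough stand-in for the CD4-change standard deviation, not the trial's exact value):

```python
import numpy as np
from scipy.stats import norm

def analytic_power(effect, n_per_group, sigma, alpha=0.05):
    """Normal-approximation power for a two-sample test (far tail ignored)."""
    se = sigma * np.sqrt(2 / n_per_group)
    return norm.cdf(abs(effect) / se - norm.ppf(1 - alpha / 2))

# Should land near the simulated ~0.53 for effect = 30, n = 100
print(f"analytic power: {analytic_power(30, 100, sigma=105):.2f}")
```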

Power Analysis: Planning a Study

Tip: Think About It

You’re designing the next clinical trial. Before running a $2M study, you should know if you have enough patients. How do you decide how many to enroll?

In practice, you’d use a power analysis tool rather than writing simulations from scratch. Here’s the standard tool:

Code
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# How many patients per group to detect a 30 CD4 cell effect?
effect_size_cohen = 30 / control.std()  # Cohen's d
n_needed = power_analysis.solve_power(
    effect_size=effect_size_cohen, power=0.8, alpha=0.05
)
print(f"To detect 30 CD4 cells with 80% power at alpha=0.05:")
print(f"  Need {n_needed:.0f} patients per group")
print(f"  That's {2*n_needed:.0f} patients total")
To detect 30 CD4 cells with 80% power at alpha=0.05:
  Need 192 patients per group
  That's 384 patients total
Code
# What power did our actual trial have?
# (solve_power assumes equal group sizes; with nobs1 = 532 this is conservative,
# since the treatment arm is actually larger)
actual_effect_cohen = observed_effect / control.std()
actual_power = power_analysis.solve_power(
    effect_size=actual_effect_cohen, nobs1=len(control), alpha=0.05
)
print(f"ACTG 175 trial: effect = {observed_effect:.1f} CD4 cells")
print(f"  control n = {len(control)}, power = {actual_power:.4f}")
print(f"  We had more than enough patients for this effect.")
ACTG 175 trial: effect = 50.4 CD4 cells
  control n = 532, power = 1.0000
  We had more than enough patients for this effect.

Power tells us whether our study is big enough to find an effect. But the significance level \(\alpha\) tells us how much evidence we require before acting. And that depends on the stakes.

The Stakes: What Should \(\alpha\) Be?

You’re the FDA. Recall that the average cost to bring a drug to market exceeds $1.3 billion. You have to decide:

  • Approve a useless drug = waste billions in healthcare costs AND expose patients to side effects with no benefit (Type I error).
  • Reject an effective drug = patients die who could have been saved (Type II error).

What should \(\alpha\) be? 0.05? 0.01? 0.001?

Tip: Think About It

A drug might save 1,000 lives per year but costs $2B to manufacture. What false positive rate would you accept?

Code
# Null distribution (permutation-based), preserving the actual group sizes
null_mean_diffs = []
combined = np.concatenate([control.values, treatment.values])
n_ctrl = len(control)
for _ in range(50_000):
    shuffled = np.random.permutation(combined)
    null_mean_diffs.append(shuffled[n_ctrl:].mean() - shuffled[:n_ctrl].mean())
null_mean_diffs = np.array(null_mean_diffs)
null_se = null_mean_diffs.std()
Code
x = np.linspace(-4*null_se, 4*null_se, 1000)
y = norm.pdf(x, 0, null_se)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, 'k-', lw=2, label='Null distribution')
ax.fill_between(x, y, alpha=0.1, color='gray')

# Different alpha levels with accessible colors
alphas_plot = [0.05, 0.01, 0.001]
colors_plot = ['orange', 'steelblue', 'darkred']
styles = ['-', '--', ':']
for a_val, color, ls in zip(alphas_plot, colors_plot, styles):
    cutoff = norm.ppf(1 - a_val/2, 0, null_se)
    ax.axvline(cutoff, color=color, lw=2, ls=ls,
               label=f'$\\alpha$={a_val}: reject if |effect| > {cutoff:.1f}')
    ax.axvline(-cutoff, color=color, lw=2, ls=ls)

ax.set_xlabel('Treatment Effect Under Null')
ax.set_ylabel('Density')
ax.set_title('How $\\alpha$ Sets the Rejection Threshold')
ax.legend(fontsize=10)
plt.show()

Smaller \(\alpha\) means a stricter threshold — harder to reject \(H_0\). That reduces false positives but increases false negatives.

Again, there is no free lunch: the choice of \(\alpha\) depends on the relative costs of the two types of errors. In practice, the FDA often requires two independent trials, each at \(\alpha = 0.05\). If the trials are truly independent, the chance of two false positives is roughly \(0.05^2 = 0.0025\) — much more stringent than a single test.
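The two-trial rule is easy to verify by simulation. A sketch with synthetic null data (no real effect), counting how often both of two independent trials come back significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 5_000, 50
both_significant = 0
for _ in range(n_sims):
    # Two independent trials of a drug with no effect at all
    hits = [stats.ttest_ind(rng.standard_normal(n), rng.standard_normal(n),
                            equal_var=False).pvalue < 0.05
            for _ in range(2)]
    if all(hits):
        both_significant += 1

rate = both_significant / n_sims
print(f"Both trials falsely significant: {rate:.4f}  (theory: ~0.0025)")
```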

One-Sided vs Two-Sided Tests

Our test was two-sided: we rejected \(H_0\) for large effects in either direction. Sometimes we know in advance that the effect can only go one way.

  • Two-sided \(H_1\): treatment effect \(\neq 0\) (could be better OR worse)
  • One-sided \(H_1\): treatment effect \(> 0\) (can only be better)

Suppose a new drug might have side effects. You want to test whether it increases adverse events (one-sided) vs. whether it changes adverse events in either direction (two-sided). If you use one-sided and the drug actually decreases adverse events, your test won’t detect that.

The FDA almost always requires two-sided tests: a drug must demonstrate efficacy without assuming the direction of the effect in advance.

Code
# Two-sided vs one-sided p-values
_, p_two = stats.ttest_ind(treatment, control, equal_var=False, alternative='two-sided')
_, p_one = stats.ttest_ind(treatment, control, equal_var=False, alternative='greater')

print(f"Two-sided p-value: {p_two:.2e}")
print(f"One-sided p-value: {p_one:.2e}")
print(f"\nWhen the effect is in the hypothesized direction,")
print(f"the one-sided p-value is exactly half the two-sided one (for symmetric tests).")
print(f"\nFor our data, both are tiny — the choice doesn't matter.")
print(f"But for borderline cases (p ~ 0.05), it could flip the decision.")
print(f"Rule of thumb: use two-sided unless you have a strong prior reason.")
Two-sided p-value: 2.79e-19
One-sided p-value: 1.39e-19

When the effect is in the hypothesized direction,
the one-sided p-value is exactly half the two-sided one (for symmetric tests).

For our data, both are tiny — the choice doesn't matter.
But for borderline cases (p ~ 0.05), it could flip the decision.
Rule of thumb: use two-sided unless you have a strong prior reason.

A One-Sample Test with Real Stakes: Swain v. Alabama

Not every hypothesis test compares two groups. Sometimes the question is whether a single sample matches a known population value. One landmark example comes from the U.S. Supreme Court.

In Swain v. Alabama (1965), Robert Swain, a Black man, was sentenced to death by an all-white jury in Talladega County, Alabama. His attorneys challenged the jury selection process: 26% of eligible jurors in the county were Black, yet only 8 out of 100 grand jury panelists were Black. Was this disparity evidence of racial discrimination, or could it have arisen by chance?

This is a one-sample proportion test. Under the null hypothesis of fair (race-blind) selection, each panelist is drawn independently with a 26% probability of being Black. We can test this directly:

Code
# Swain v. Alabama: one-sample proportion test
n_panelists = 100
observed_black = 8
expected_proportion = 0.26

# Exact binomial test: P(X <= 8) under Binomial(100, 0.26)
from scipy.stats import binomtest
result = binomtest(observed_black, n_panelists, expected_proportion, alternative='less')
print(f"Observed: {observed_black} Black panelists out of {n_panelists}")
print(f"Expected under fair selection: {expected_proportion * n_panelists:.0f}")
print(f"p-value (one-sided): {result.pvalue:.2e}")
Observed: 8 Black panelists out of 100
Expected under fair selection: 26
p-value (one-sided): 4.73e-06

The p-value is vanishingly small: under fair selection, seeing 8 or fewer Black panelists out of 100 is astronomically unlikely. Yet the Supreme Court ruled that “the overall percentage disparity has been small” and upheld the conviction. The Court reached this conclusion without performing a statistical test — a cautionary tale about relying on intuition instead of quantitative reasoning.
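As a sanity check on the exact binomial computation, we can also simulate fair selection directly: draw millions of 100-person panels with a 26% chance that each panelist is Black, and count how often 8 or fewer appear.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10 million simulated panels under race-blind selection
counts = rng.binomial(n=100, p=0.26, size=10_000_000)
p_sim = np.mean(counts <= 8)
print(f"Simulated p-value: {p_sim:.1e}")  # should land close to the exact 4.7e-06
```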

Tip: Think About It

The expected count under fair selection is 26. The observed count is 8. Would you call that disparity “small”? What would a p-value of this magnitude mean in a clinical trial?

Statistical Significance Is Not the Same as Importance

With ACTG 175’s large sample size, even tiny, clinically meaningless effects would be “statistically significant.” Let’s see this in action.

Tip: Think About It

If we subsample to n = 15 per group, will the treatment still be significant? What about n = 25 or n = 50? Predict what happens before running the cell.

Code
# How often does a t-test detect the real effect at each sample size?
# For each n, we draw 500 random subsamples and count the fraction
# of t-tests that come back significant — this is the empirical power.
np.random.seed(42)
subsample_sizes = [15, 25, 50, 100, 200, 500]
n_simulations = 500

print(f"{'n per group':>12}  {'Fraction significant':>22}  {'Detected?':>16}")
print("-" * 56)

for n in subsample_sizes:
    sig_count = 0
    for _ in range(n_simulations):
        sub_ctrl = control.sample(n, replace=False)
        sub_trt = treatment.sample(n, replace=False)
        _, p = stats.ttest_ind(sub_trt, sub_ctrl, equal_var=False)
        if p < 0.05:
            sig_count += 1
    power = sig_count / n_simulations
    detected = "Almost always" if power > 0.8 else "Sometimes" if power > 0.5 else "Rarely"
    print(f"{n:>12}  {power:>21.0%}  {detected:>16}")
 n per group    Fraction significant         Detected?
--------------------------------------------------------
          15                    20%            Rarely
          25                    32%            Rarely
          50                    61%         Sometimes
         100                    90%     Almost always
         200                    99%     Almost always
         500                   100%     Almost always

The real effect is always there, but small samples often miss it. With n = 15 per group, a t-test detects the effect only about 20% of the time. This is power again, seen from the other side: for a fixed effect size, shrinking the sample shrinks the probability of detection.

Code
# p-value as a function of sample size (fixed effect)
ns = np.arange(20, 2001, 20)
p_vals_vs_n = []
for n in ns:
    se = sigma * np.sqrt(2/n)
    t = observed_effect / se
    p_vals_vs_n.append(2 * (1 - norm.cdf(abs(t))))

fig, ax = plt.subplots(figsize=(8, 5))
ax.semilogy(ns, p_vals_vs_n, color='steelblue', lw=2)
ax.axhline(0.05, color='red', ls='--', label='$\\alpha$ = 0.05')
ax.set_xlabel('Sample Size (per group)')
ax.set_ylabel('p-value (log scale)')
ax.set_title(f'p-value vs Sample Size (true effect = {observed_effect:.0f} CD4 cells)')
ax.legend()
plt.show()

Tip: Think About It

If a 5 CD4 cell change is “significant” with n = 10,000, should we change treatment guidelines?

The plot confirms: for a fixed real effect, the p-value shrinks as we add data. Eventually any nonzero effect becomes significant.

Warning: Always report effect sizes

The lesson: always report effect sizes and confidence intervals alongside p-values. A small p-value tells you the effect is unlikely to be zero — it does not tell you the effect is large or important.
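A sketch of what a fuller report looks like for a two-sample comparison, on synthetic data with a tiny but real effect (the sample values and the pooled-SD form of Cohen's d are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.2, 1, 5_000)   # tiny real effect...
b = rng.normal(0.0, 1, 5_000)   # ...but a big sample

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
ci = (diff - 1.96 * se, diff + 1.96 * se)            # approximate 95% CI

pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd                          # standardized effect size

_, p = stats.ttest_ind(a, b, equal_var=False)
print(f"diff = {diff:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f}), "
      f"Cohen's d = {cohens_d:.2f}, p = {p:.1e}")
```

The p-value alone would scream "significant"; the effect size shows the difference is about a fifth of a standard deviation, and the reader can judge whether that matters.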

Tip: Think About It (the blood pressure drug)

A clinical trial with 10,000 participants tests a new blood pressure drug and finds a statistically significant reduction (p = 0.013). Impressive? The actual effect is a 2 mmHg decrease in systolic blood pressure. For context, drinking a cup of coffee temporarily raises blood pressure by about 5 mmHg (Mesas et al., Journal of Hypertension, 2011). The drug’s effect is real — it’s not zero — but it is smaller than your morning coffee. Would you recommend this drug to patients, given the costs and potential side effects?

Statistical significance told us the effect is not zero. It said nothing about whether the effect is worth acting on. That judgment requires domain knowledge: what reduction is clinically meaningful, what are the side effects, and what alternatives exist.

Key Takeaways

  • A p-value answers: “how surprising is this result if the null hypothesis were true?” It does NOT tell you the probability that the null is true, or that the result is important.
  • Type I error (false positive) rate = \(\alpha\). Type II error (false negative) rate depends on effect size, sample size, and \(\alpha\).
  • Power = probability of detecting a real effect. Use power analysis and sample size planning to design studies with at least 80% power.
  • Statistical significance is not the same as practical importance. With enough data, even tiny effects are “significant.” Always report effect size, confidence interval, AND p-value.

We’ll revisit these ideas in Lecture 11 when we ask: what happens when you run 20 tests at once? In Lecture 12, we’ll apply hypothesis tests to regression coefficients.

Study guide

Key ideas

  • Null hypothesis (\(H_0\)): The default claim (e.g., no effect, no difference). Alternative hypothesis (\(H_1\)): The competing claim you’re testing for.
  • p-value: Probability of observing data at least this extreme, assuming \(H_0\) is true. It is not the probability that \(H_0\) is true.
  • Significance level (\(\alpha\)): The false positive rate you’re willing to accept; the threshold for rejecting \(H_0\).
  • Type I error: Rejecting \(H_0\) when it’s true (false positive; rate = \(\alpha\)). Type II error: Failing to reject \(H_0\) when \(H_1\) is true (false negative; rate = \(\beta\)). Power = \(1 - \beta\).
  • p-values are uniform under \(H_0\): Exact when the null fully specifies the distribution; a good approximation otherwise.
  • Welch’s t-test: Two-sample t-test that does not assume equal variances.
  • You reject or fail to reject \(H_0\) — you never “accept” \(H_0\) or “prove” it true.
  • Power depends on effect size, sample size, and \(\alpha\) — compute it before running an experiment.
  • Statistical significance does not imply practical importance; always report effect sizes alongside p-values.

Computational tools

  • stats.ttest_ind(a, b, equal_var=False) — Welch’s t-test for two independent samples
  • stats.ttest_ind(..., alternative='greater') — one-sided t-test
  • TTestIndPower().solve_power(effect_size, power, alpha) — compute required sample size for a t-test
  • TTestIndPower().solve_power(effect_size, nobs1, alpha) — compute power given sample size

For the quiz

  • Be able to state \(H_0\) and \(H_1\) for a given scenario and identify the appropriate test.
  • Know the difference between Type I and Type II errors, and how \(\alpha\) controls the tradeoff.
  • Understand why “fail to reject” is not the same as “accept.”
  • Given a p-value and \(\alpha\), state the conclusion.
  • Explain why a large sample can make a tiny, meaningless effect “significant.”