Multiple Testing and Correlation

The intern who “discovered” load management

An NBA analytics intern tests whether each player performs better after rest. They find a couple dozen “significant” results and write a report: “These players benefit from load management.”

Their boss isn’t convinced. What did the intern miss?

“If you torture the data long enough, it will confess to anything.” — Ronald Coase

NoteWhy most published research findings are false

In a landmark 2005 paper, the epidemiologist John Ioannidis argued that most published research findings are false. His central insight: when researchers test many hypotheses and studies are underpowered, Bayes’ theorem implies that a majority of “significant” results can be false positives, even without any p-hacking. The paper has been cited over 10,000 times and reshaped how scientists think about statistical significance. The intern’s hundreds of tests are a textbook case.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 12

DATA_DIR = 'data'

The setup: hundreds of hypothesis tests

Recall from Chapter 10 that we test at significance level \(\alpha = 0.05\), accepting a 5% false positive rate for a single test. But what happens when we run hundreds of tests?
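A quick back-of-the-envelope calculation shows how fast the problem grows. If the tests were independent, the probability of at least one false positive would be \(1 - (1-\alpha)^m\) — a sketch with illustrative values of \(m\), not yet tied to our data:

```python
alpha = 0.05
for m in [1, 10, 100, 300]:
    # Probability of AT LEAST ONE false positive across m independent null tests
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:3d} tests -> P(at least one false positive) = {fwer:.3f}")
```

By \(m = 100\), at least one false positive is a near-certainty.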

We have three seasons of NBA game logs (2021–2024). For each player, we can compute how many days of rest they had before each game. The question: does rest help?

Let’s load the data and take a look.

Code
# Load NBA load management data
nba = pd.read_csv(f'{DATA_DIR}/nba/nba_load_management.csv')
nba['GAME_DATE'] = pd.to_datetime(nba['GAME_DATE'])
print(f"Shape: {nba.shape}")
print(f"Unique players: {nba['PLAYER_NAME'].nunique()}")
print(f"Games with rest info: {nba['REST_DAYS'].notna().sum():,}")
nba[['PLAYER_NAME', 'GAME_DATE', 'PTS', 'REST_DAYS', 'REST_CATEGORY']].head(8)
Shape: (78335, 35)
Unique players: 818
Games with rest info: 77,517
PLAYER_NAME GAME_DATE PTS REST_DAYS REST_CATEGORY
0 Joe Johnson 2021-12-22 2 NaN NaN
1 LeBron James 2021-10-19 34 NaN NaN
2 LeBron James 2021-10-22 25 3.0 Short rest (3d)
3 LeBron James 2021-10-24 19 2.0 Normal (2d)
4 LeBron James 2021-10-29 26 5.0 Extended (4+d)
5 LeBron James 2021-10-31 15 2.0 Normal (2d)
6 LeBron James 2021-11-02 30 2.0 Normal (2d)
7 LeBron James 2021-11-19 23 17.0 Extended (4+d)

Each row is one player-game. The REST_DAYS column records how many days elapsed since that player’s previous game. Let’s see how rest days are distributed using a histogram.

Code
# How are rest days distributed?
df = nba.dropna(subset=['REST_DAYS']).copy()

fig, ax = plt.subplots(figsize=(8, 4))
rest_capped = df['REST_DAYS'].clip(upper=10)
ax.hist(rest_capped, bins=range(1, 12), edgecolor='black', alpha=0.7, color='steelblue')
ax.set_xlabel('Days Since Previous Game')
ax.set_ylabel('Count')
ax.set_title('Distribution of Rest Days')
ax.set_xticks(range(1, 11))
plt.tight_layout()
plt.show()

print(f"Most games follow a normal 2-day schedule ({(df['REST_DAYS']==2).mean():.0%} of games)")

Most games follow a normal 2-day schedule (55% of games)

The intern’s analysis: one t-test per player

The intern’s approach is straightforward:

  1. For each player with enough games, split their games into “back-to-back” (1 day of rest) vs. “rested” (2+ days)
  2. Run a two-sample t-test on points scored
  3. If \(p < 0.05\), declare the player “benefits from rest”

This scenario illustrates the multiple testing problem: running many hypothesis tests on the same dataset. Let’s do exactly what the intern did.

Code
# Run a Welch's t-test for each player with >= 20 games in each category
min_games = 20

results = []
for player_id, group in df.groupby('PLAYER_ID'):
    btb = group[group['REST_DAYS'] == 1]['PTS']
    rested = group[group['REST_DAYS'] >= 2]['PTS']

    if len(btb) >= min_games and len(rested) >= min_games:
        t_stat, p_val = stats.ttest_ind(rested, btb, equal_var=False)  # Welch's t-test (no equal-variance assumption)
        results.append({
            'PLAYER_NAME': group['PLAYER_NAME'].iloc[0],
            'n_btb': len(btb), 'n_rested': len(rested),
            'mean_btb': btb.mean(), 'mean_rested': rested.mean(),
            'diff': rested.mean() - btb.mean(),
            't_stat': t_stat, 'p_value': p_val
        })

results_df = pd.DataFrame(results).sort_values('p_value')

How many players show a “significant” rest effect at the conventional \(\alpha = 0.05\) threshold?

Code
n_players = len(results_df)
n_sig = (results_df['p_value'] < 0.05).sum()

print(f"Players tested: {n_players}")
print(f"'Significant' at p < 0.05: {n_sig} ({n_sig/n_players*100:.1f}%)")
print(f"\nThe intern's report: '{n_sig} players benefit from load management!'")
Players tested: 284
'Significant' at p < 0.05: 23 (8.1%)

The intern's report: '23 players benefit from load management!'

Wait — how many should we expect by chance?

TipThink About It

If rest has zero effect on every player, how many of our tests would you expect to come back “significant” at \(\alpha = 0.05\)?

Here’s the key insight. If rest has no effect on any player, the expected number of false positives is:

\[\text{Expected false positives} = m \times \alpha\]

where \(m\) is the number of tests. (Linearity of expectation holds even for dependent random variables — a useful fact from your probability course — so this expected count is valid regardless of dependence between tests. The standard deviation computed below, by contrast, treats the count of false positives as Binomial(\(m\), \(\alpha\)), which assumes the \(m\) tests are independent: an approximation here, since player performances share common opponents and schedules.)

Code
expected_false_positives = n_players * 0.05
sd_under_null = np.sqrt(n_players * 0.05 * 0.95)

print(f"Number of tests: {n_players}")
print(f"Expected false positives under the null: {expected_false_positives:.1f}")
print(f"Std dev under null (Binomial, assuming independence): {sd_under_null:.1f}")
print(f"Actual 'significant' results: {n_sig}")
print(f"\nGetting {n_sig} is ~{(n_sig - expected_false_positives)/sd_under_null:.1f} SDs above {expected_false_positives:.0f}, and even that assumes independent tests.")
print(f"The intern's 'discovery' is probably noise.")
Number of tests: 284
Expected false positives under the null: 14.2
Std dev under null (Binomial, assuming independence): 3.7
Actual 'significant' results: 23

Getting 23 is ~2.4 SDs above 14, and even that assumes independent tests.
The intern's 'discovery' is probably noise.

The p-value histogram: a powerful diagnostic

TipThink About It

If the null is true for every player, what should the histogram of p-values look like?

Under the null hypothesis (no effect for any player), p-values should follow a Uniform(0, 1) distribution. That means the p-value histogram should be flat.
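You can check this uniformity claim with a quick simulation, independent of the NBA data: draw two samples from the same distribution (so the null is true by construction), run Welch’s t-test, and repeat. The group sizes and distribution here are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reps, n = 2000, 30

# Two groups drawn from the SAME distribution -> every null hypothesis is true
a = rng.normal(loc=10, scale=5, size=(reps, n))
b = rng.normal(loc=10, scale=5, size=(reps, n))
_, pvals = stats.ttest_ind(a, b, axis=1, equal_var=False)  # one test per row

print(f"Mean p-value:        {pvals.mean():.3f}  (uniform => ~0.5)")
print(f"Fraction below 0.05: {(pvals < 0.05).mean():.3f}  (uniform => ~0.05)")
```

Plot a histogram of `pvals` and you’ll see it: flat, with about 5% of tests landing below 0.05 purely by chance.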

If there are real effects, we’ll see a spike near 0 — those are the genuine discoveries hiding among the noise. Let’s look.

Code
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(results_df['p_value'], bins=20, edgecolor='black', alpha=0.7, color='steelblue')
ax.axhline(y=n_players/20, color='red', linestyle='--', linewidth=2,
           label=f'Expected under null ({n_players/20:.0f} per bin)')
ax.axvspan(0, 0.05, alpha=0.15, color='red', label='Rejection region (p < 0.05)')
ax.set_xlabel('p-value')
ax.set_ylabel('Count')
ax.set_title(f'Distribution of {n_players} p-values\n'
             f'(Flat = no real effect; spike near 0 = real effects)')
ax.legend()
plt.tight_layout()
plt.show()

print("The histogram is approximately uniform — consistent with NO widespread rest effect.")
print("The slight bump near 0 could be a few real effects, or just sampling variability.")

The histogram is approximately uniform — consistent with NO widespread rest effect.
The slight bump near 0 could be a few real effects, or just sampling variability.

Correction 1: Bonferroni

The simplest fix: if you run \(m\) tests, divide your significance level by \(m\).

ImportantDefinition: Bonferroni Correction and FWER

The Bonferroni correction sets

\[\alpha_{\text{Bonferroni}} = \frac{\alpha}{m}\]

It controls the family-wise error rate (FWER) — the probability of any false positive across all \(m\) tests. It’s conservative — meaning it may miss real effects — but it guarantees the probability of any false positive stays below \(\alpha\), no matter how the tests are correlated.
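A small simulation shows the guarantee in action. Under the null, p-values are Uniform(0, 1), so we can draw them directly (synthetic p-values, not the NBA data) and compare error rates with and without the correction:

```python
import numpy as np

rng = np.random.default_rng(0)
reps, m, alpha = 4000, 50, 0.05

# Each row is one "study" running m null tests; uniform draws stand in for null p-values
pvals = rng.uniform(size=(reps, m))

fwer_raw = (pvals < alpha).any(axis=1).mean()       # any false positive, uncorrected
fwer_bonf = (pvals < alpha / m).any(axis=1).mean()  # any false positive, Bonferroni

print(f"Uncorrected FWER over {m} tests: {fwer_raw:.3f}")   # near 1 - 0.95**50, i.e. ~0.92
print(f"Bonferroni FWER:                 {fwer_bonf:.3f}")  # close to alpha = 0.05
```

Without correction, almost every “study” contains at least one false positive; with Bonferroni, the family-wise rate drops back to roughly \(\alpha\).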

Code
alpha = 0.05
bonferroni_threshold = alpha / n_players

print(f"Original threshold: {alpha}")
print(f"Bonferroni threshold: {bonferroni_threshold:.6f}")
print(f"That's {1/bonferroni_threshold:.0f}x more stringent!\n")

n_bonf = (results_df['p_value'] < bonferroni_threshold).sum()
print(f"Significant after Bonferroni: {n_bonf} (down from {n_sig})")

if n_bonf > 0:
    survivors = results_df[results_df['p_value'] < bonferroni_threshold]
    print("\nSurvivors:")
    print(survivors[['PLAYER_NAME', 'mean_btb', 'mean_rested', 'diff', 'p_value']].to_string(index=False))
else:
    print("\nNone survive. The intern's entire report just evaporated.")
Original threshold: 0.05
Bonferroni threshold: 0.000176
That's 5680x more stringent!

Significant after Bonferroni: 0 (down from 23)

None survive. The intern's entire report just evaporated.

Correction 2: Benjamini-Hochberg (FDR control)

Bonferroni killed every result. But is that too aggressive? Maybe some players really do benefit from rest and we simply can’t tell, because Bonferroni demands an extremely low p-value from every single test. Can we find a middle ground that allows a controlled fraction of false discoveries?

ImportantDefinition: Benjamini-Hochberg Procedure and FDR

The Benjamini-Hochberg (BH) procedure controls the false discovery rate (FDR) — the expected proportion of false positives among rejected hypotheses.

The BH procedure:

  1. Sort p-values: \(p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}\)
  2. Find the largest \(k\) such that \(p_{(k)} \le \frac{k}{m} \alpha\)
  3. Reject all hypotheses whose p-values are among the \(k\) smallest (i.e., hypotheses \(1, 2, \ldots, k\) in the sorted order)

In expectation, at most a fraction \(\alpha\) of your “discoveries” are false.

Code
# Step-by-step Benjamini-Hochberg
sorted_pvals = results_df.sort_values('p_value').copy()
sorted_pvals['rank'] = range(1, len(sorted_pvals) + 1)
sorted_pvals['BH_threshold'] = sorted_pvals['rank'] / n_players * alpha

# Step-up rule: find the largest k where p_(k) <= k/m * alpha,
# then reject ALL hypotheses 1 through k
reject_mask = sorted_pvals['p_value'] <= sorted_pvals['BH_threshold']
if reject_mask.any():
    max_k = sorted_pvals.loc[reject_mask, 'rank'].max()
    sorted_pvals['reject_BH'] = sorted_pvals['rank'] <= max_k
else:
    sorted_pvals['reject_BH'] = False

Let’s display the sorted p-values alongside their BH thresholds to see the step-up rule in action.

Code
# Show the procedure for the top 15 p-values
display_cols = ['PLAYER_NAME', 'p_value', 'rank', 'BH_threshold', 'reject_BH']
print("Benjamini-Hochberg procedure (first 15 sorted p-values):")
print(sorted_pvals.head(15)[display_cols].to_string(index=False))

n_bh = sorted_pvals['reject_BH'].sum()
print(f"\nSignificant after BH correction: {n_bh}")
print(f"  Uncorrected:  {n_sig}")
print(f"  Bonferroni:   {n_bonf}")
print(f"  BH (FDR):     {n_bh}")
Benjamini-Hochberg procedure (first 15 sorted p-values):
       PLAYER_NAME  p_value  rank  BH_threshold  reject_BH
      Shake Milton 0.000233     1      0.000176      False
  Payton Pritchard 0.000546     2      0.000352      False
Kristaps Porziņģis 0.001634     3      0.000528      False
       Amir Coffey 0.003384     4      0.000704      False
       Luka Dončić 0.003494     5      0.000880      False
   Duncan Robinson 0.004999     6      0.001056      False
     Herbert Jones 0.006371     7      0.001232      False
     Aaron Holiday 0.007158     8      0.001408      False
     Nick Richards 0.011588     9      0.001585      False
      Jock Landale 0.011602    10      0.001761      False
    Grant Williams 0.012670    11      0.001937      False
    Day'Ron Sharpe 0.015103    12      0.002113      False
  Jonathan Kuminga 0.019106    13      0.002289      False
   Trey Murphy III 0.019993    14      0.002465      False
      Kevon Looney 0.022410    15      0.002641      False

Significant after BH correction: 0
  Uncorrected:  23
  Bonferroni:   0
  BH (FDR):     0
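The step-up rule is easy to get wrong — a common mistake is rejecting only the p-values below their own thresholds, rather than all \(k\) smallest. One way to package it as a reusable function (the function name here is ours, not a library API):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule: boolean mask, True = reject."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passes = p[order] <= np.arange(1, m + 1) / m * alpha  # p_(k) <= (k/m) * alpha
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.flatnonzero(passes).max()  # largest k passing its threshold
        reject[order[:k + 1]] = True      # reject ALL of the k smallest p-values
    return reject

# p_(4) = 0.03 <= (4/5) * 0.05 = 0.04, so the first FOUR are rejected,
# even though p_(3) = 0.02 is above its own threshold of 0.03? No — check: 0.02 <= 0.03.
mask = bh_reject([0.001, 0.01, 0.02, 0.03, 0.5])
print(mask)  # [ True  True  True  True False]
```

In practice you can also use `statsmodels.stats.multitest.multipletests(pvals, method='fdr_bh')`, which implements the same procedure.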

The two corrections answer different questions. FWER asks: did I make any mistake at all? FDR asks: among the things I called discoveries, what fraction are wrong? Use Bonferroni when any false positive is costly (e.g., drug safety). Use BH when you can tolerate some false discoveries (e.g., screening genes for further study). A sportsbook evaluating hundreds of player props per night faces this same multiple testing problem. The stakes determine the correction — a connection to the consequential decisions from Chapter 10.

Code
# Visualize the BH procedure
fig, ax = plt.subplots(figsize=(8, 5))
top_n = 50  # show first 50 sorted p-values
subset = sorted_pvals.head(top_n)

ax.scatter(subset['rank'], subset['p_value'], color='steelblue', s=30, label='Observed p-values')
ax.plot(subset['rank'], subset['BH_threshold'], 'r--', linewidth=2, label='BH threshold (i/m × α)')
ax.axhline(y=bonferroni_threshold, color='orange', linestyle=':', linewidth=2, label='Bonferroni threshold')
ax.annotate(f'Bonferroni = {bonferroni_threshold:.5f}', xy=(top_n*0.6, bonferroni_threshold),
            xytext=(top_n*0.6, bonferroni_threshold + 0.01),
            arrowprops=dict(arrowstyle='->', color='orange'), fontsize=9, color='orange')
ax.set_xlabel('Rank (sorted by p-value)')
ax.set_ylabel('p-value')
ax.set_title('Benjamini-Hochberg Procedure')
ax.legend()
plt.tight_layout()
plt.show()

This same problem arises whenever you run many tests: a genomics lab testing 20,000 genes for association with a disease, a tech company running hundreds of A/B tests per week, or a pharmaceutical trial measuring many endpoints. In every case, a multiple testing correction is essential.

The reproducibility crisis

Multiple testing corrections fix the problem when you know how many tests you ran. But what happens when the multiplicity is hidden — spread across months of exploratory analysis, dozens of researcher choices, and the publication process itself?

NoteThe replication crisis in psychology

In 2011, the psychologist Daryl Bem published a paper in a top journal claiming to find evidence for precognition — that people could sense future events before they happened. The paper used standard methods and passed peer review. The embarrassment lay not in the result being wrong (it almost certainly was), but in the fact that standard methods allowed it.

This embarrassment helped spark the Reproducibility Project (Open Science Collaboration, 2015), which attempted to replicate 100 published psychology studies. The results were sobering: only about 36% of the studies replicated with a statistically significant result, and effect sizes were roughly half the originals. Social psychology was hit hardest. The crisis wasn’t limited to psychology — similar problems surfaced in cancer biology, economics, and other fields — but psychology became the most visible case.

How did so many false findings get published? The answer connects directly to what we just learned about multiple testing.

WarningP-hacking and researcher degrees of freedom

P-hacking (also called data dredging) means trying many analyses and reporting the one that produces \(p < 0.05\). This practice includes: excluding “outlier” participants until significance appears, testing many outcome variables and reporting only the significant one, collecting data until \(p\) dips below 0.05, or splitting groups in various ways until a comparison “works.”

Simmons, Nelson, and Simonsohn call these researcher degrees of freedom; the statistician Andrew Gelman describes the same structure as the garden of forking paths. At each fork (which variable? which subset? which test?), the researcher makes a choice, and each choice is an implicit hypothesis test. The final reported p-value looks like it came from a single test, but it is really the survivor of many. This structure is exactly the multiple testing problem, except the multiplicity is invisible: it doesn’t appear in the paper.

Simmons, Nelson, and Simonsohn (2011) demonstrated that exploiting just a few common degrees of freedom — optional stopping, controlling for gender, dropping a condition — could produce \(p < 0.05\) for a completely nonexistent effect over 60% of the time.

Two structural features of science make the problem worse:

  • The file drawer problem. Studies with \(p > 0.05\) are hard to publish, so they go in the “file drawer.” If 20 labs independently test the same false hypothesis, one will get \(p < 0.05\) by chance — and that’s the only one you’ll read about. The published literature is a biased sample of all the research ever conducted.
  • Incentive misalignment. Careers are built on novel, statistically significant findings. Null results don’t get published, don’t earn grants, and don’t lead to promotions. This misalignment creates strong pressure (often unconscious) to find significance.
ImportantDefinition: Pre-registration

Pre-registration means publicly committing to your research question, data collection plan, and analysis plan before looking at the data. This commitment eliminates researcher degrees of freedom by fixing the garden of forking paths in advance. A pre-registered analysis is a single, pre-specified test — not the survivor of many. Pre-registration doesn’t prevent exploratory analysis; it simply requires labeling it honestly as exploratory (hypothesis-generating) rather than confirmatory (hypothesis-testing).

The lesson for this course: multiple testing corrections (Bonferroni, BH) handle the known multiplicity — the hundreds of tests the intern explicitly ran. Pre-registration and transparent reporting handle the hidden multiplicity — the choices a researcher makes before the final analysis ever appears. Both are manifestations of the same core problem: if you test many things and cherry-pick, you will find “significance” whether or not anything is real.

From testing to correlation

So far we’ve been testing each player separately — a binary question (does rest help this player?). But maybe we’re asking the wrong question. Instead of hundreds of yes/no answers, can we measure one continuous relationship: how does rest relate to performance across all players?

“Correlation is not causation.”

Let’s see this in action. First, the picture.

Code
# Keep games with <= 7 days' rest (drop long injury absences)
games_corr = df[df['REST_DAYS'] <= 7].copy()

# Violin plot by rest day (avoids overplotting vs raw scatter)
fig, ax = plt.subplots(figsize=(8, 5))
rest_vals = sorted(games_corr['REST_DAYS'].unique())
violin_data = [games_corr[games_corr['REST_DAYS'] == d]['PTS'].values for d in rest_vals]
parts = ax.violinplot(violin_data, positions=rest_vals, showmedians=True, widths=0.7)
for pc in parts['bodies']:
    pc.set_facecolor('steelblue')
    pc.set_alpha(0.5)

# Overlay mean points by rest day
means = games_corr.groupby('REST_DAYS')['PTS'].mean()
ax.plot(means.index, means.values, 'ro-', markersize=8, linewidth=2, label='Mean PTS')
ax.set_xlabel('Rest Days')
ax.set_ylabel('Points Scored')
ax.set_title('Rest Days vs Points Scored')
ax.legend()
plt.tight_layout()
plt.show()

print("Notice: the means trend DOWNWARD. More rest → fewer points?!")
print("That can't be right... or can it?")

Notice: the means trend DOWNWARD. More rest → fewer points?!
That can't be right... or can it?
ImportantDefinition: Pearson Correlation Coefficient

The Pearson correlation \(r\) measures the linear association between two variables — you derived this in your probability course when studying covariance:

\[r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\]

\(r\) ranges from \(-1\) to \(+1\). It is unitless and symmetric: \(r(x, y) = r(y, x)\). Important: \(r = 0\) means no linear relationship, not no relationship at all. Two variables can be strongly related (e.g., quadratic) and still have \(r\) near 0. Outliers can also strongly distort \(r\).
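The “no linear relationship” caveat is easy to demonstrate with a synthetic sketch (invented data, not from our dataset): a perfect quadratic relationship has \(r \approx 0\).

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 201)
y = x ** 2                      # y is completely determined by x...
r, _ = stats.pearsonr(x, y)
print(f"r = {r:.4f}")           # ...yet the LINEAR correlation is ~0 by symmetry
```

The positive and negative halves of the parabola cancel exactly, so the sum \(\sum (x_i - \bar{x})(y_i - \bar{y})\) in the numerator vanishes.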

Code
r, p = stats.pearsonr(games_corr['REST_DAYS'], games_corr['PTS'])
print(f"Correlation (rest days vs points): r = {r:.4f}")
print(f"p-value: {p:.6f}")
print(f"\nWith n = {len(games_corr):,} observations, even tiny correlations are")
print(f"*statistically* significant — but that doesn't mean the correlation is meaningful.")
Correlation (rest days vs points): r = -0.1020
p-value: 0.000000

With n = 72,156 observations, even tiny correlations are
*statistically* significant — but that doesn't mean the correlation is meaningful.
WarningEcological Correlation Fallacy

Correlations computed on group averages (e.g., one point per state or per team) can be much stronger than correlations computed on individual observations. These ecological correlations are misleading — averaging hides within-group variation. Always check whether a correlation is based on individual data or aggregated summaries.
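A synthetic illustration (invented groups and noise levels, not our NBA data) makes the gap concrete: individuals within groups are noisy, but group averages line up far more tightly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_groups, n_per = 10, 50

# Group-level signal: group g has true mean g on both variables
g = np.repeat(np.arange(n_groups), n_per).astype(float)
x = g + rng.normal(0, 3, g.size)   # individual-level noise partly swamps the signal
y = g + rng.normal(0, 3, g.size)

r_individual, _ = stats.pearsonr(x, y)

# Aggregate to one (mean_x, mean_y) point per group: averaging cancels the noise
mean_x = np.array([x[g == k].mean() for k in range(n_groups)])
mean_y = np.array([y[g == k].mean() for k in range(n_groups)])
r_ecological, _ = stats.pearsonr(mean_x, mean_y)

print(f"Individual-level r: {r_individual:.2f}")
print(f"Group-average r:    {r_ecological:.2f}")
```

Same underlying data, very different correlations — which is why the unit of analysis matters.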

Correlation is not causation: the confounding problem

That negative correlation seems backward. Rest hurts performance?

TipThink About It

Before reading on — why might more rest be associated with fewer points? What’s different about players who get extended rest? Discuss with your neighbor.

Coaches don’t randomly assign rest. Consider:

  • Star players play almost every game (short rest, high points)
  • Bench players are more likely to miss games (extended rest, low points)
  • Players returning from minor injuries get extended rest
  • Back-to-backs cluster during road trips (schedule-driven)

The reason for rest is confounded with the outcome of rest. We’ll formalize this idea in Chapter 18 using causal diagrams (DAGs), which let us say precisely which variables are confounders.

Code
# Who gets extended rest? Compare player quality across rest categories.
player_stats = df.groupby('PLAYER_NAME').agg(
    ppg=('PTS', 'mean'),
    games=('PTS', 'count'),
    mean_rest=('REST_DAYS', 'mean')
).query('games >= 50')

ext_rest = df[df['REST_DAYS'] >= 4].groupby('PLAYER_NAME').size().reset_index(name='ext_rest_games')
player_stats = player_stats.reset_index().merge(ext_rest, on='PLAYER_NAME', how='left')
player_stats['ext_rest_games'] = player_stats['ext_rest_games'].fillna(0)
player_stats['ext_rest_frac'] = player_stats['ext_rest_games'] / player_stats['games']

Now we can plot each player’s scoring average against the fraction of their games with extended rest (4+ days). The .merge() call joins the extended-rest counts back to each player’s summary row.

Code
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(player_stats['ppg'], player_stats['ext_rest_frac'],
           alpha=0.4, s=player_stats['games']/5, color='steelblue')
ax.set_xlabel('Points Per Game (player quality)')
ax.set_ylabel('Fraction of Games with Extended Rest (4+ days)')
ax.set_title('Who Gets Extended Rest?\n(size = number of games)')
plt.tight_layout()
plt.show()

r_qual, p_qual = stats.pearsonr(player_stats['ppg'], player_stats['ext_rest_frac'])
print(f"Correlation between PPG and extended rest fraction: r = {r_qual:.3f}")
print("Low-scoring players have MORE extended rest — they're the ones sitting out!")

Correlation between PPG and extended rest fraction: r = -0.517
Low-scoring players have MORE extended rest — they're the ones sitting out!

Simpson’s paradox in rest data

This reversal is Simpson’s paradox (named after Edward Simpson, who described it in 1951): the aggregate comparison shows rest hurting by nearly four points, but within individual players, the effect is less than a point. Most of the aggregate pattern was driven by who gets rest, not by rest itself.
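A toy example (invented numbers, two hypothetical players) shows how the reversal arises even when rest has exactly zero effect within each player:

```python
import pandas as pd

# The star plays 10 games and rarely rests; the bench player rests often.
# Within each player, rest changes NOTHING: the star always scores 25, the bench player 8.
toy = pd.DataFrame({
    'player': ['star'] * 10 + ['bench'] * 10,
    'rested': [1] + [0] * 9 + [1] * 5 + [0] * 5,
    'pts':    [25] * 10 + [8] * 10,
})

agg = toy.groupby('rested')['pts'].mean()
within = toy.groupby(['player', 'rested'])['pts'].mean()

print(f"Aggregate: rested {agg.loc[1]:.1f} PPG vs not rested {agg.loc[0]:.1f} PPG -> rest 'hurts'")
print(within)  # within each player, rested and not-rested means are identical
```

The rested pool is dominated by bench games, so the aggregate comparison confuses “who rests” with “what rest does.”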

Let’s see this directly.

Code
# Aggregate comparison (MISLEADING)
print("=== Aggregate Comparison (MISLEADING) ===")
agg_means = df.groupby('REST_CATEGORY')['PTS'].mean()
print(agg_means.round(2))
print(f"\nExtended rest: {agg_means.loc['Extended (4+d)']:.1f} PPG  vs  Back-to-back: {agg_means.loc['Back-to-back (1d)']:.1f} PPG")
print(f"Looks like rest HURTS by {agg_means.loc['Back-to-back (1d)'] - agg_means.loc['Extended (4+d)']:.1f} points!")
=== Aggregate Comparison (MISLEADING) ===
REST_CATEGORY
Back-to-back (1d)    11.32
Extended (4+d)        7.53
Normal (2d)          11.35
Short rest (3d)      11.07
Name: PTS, dtype: float64

Extended rest: 7.5 PPG  vs  Back-to-back: 11.3 PPG
Looks like rest HURTS by 3.8 points!

The fix is to compare each player to themselves. For each player-season, we subtract their average points. Now a value of +2 means “this player scored 2 points above their own average.” This transformation removes player quality from the comparison.

Code
# Compute each player-season's average points
player_season_avg = df.groupby(['PLAYER_ID', 'SEASON'])['PTS'].transform('mean')
# Subtract to get within-player deviation
df['PTS_DEMEANED'] = df['PTS'] - player_season_avg

print("=== Within-Player Comparison (CORRECT) ===")
within_means = df.dropna(subset=['REST_CATEGORY']).groupby('REST_CATEGORY')['PTS_DEMEANED'].mean()
print(within_means.round(3))
print(f"\nWithin-player, extended rest effect: {within_means.loc['Extended (4+d)']:+.3f} points")
print("Less than a point! Most of the aggregate effect was driven by WHO gets rest.")
=== Within-Player Comparison (CORRECT) ===
REST_CATEGORY
Back-to-back (1d)    0.336
Extended (4+d)      -0.597
Normal (2d)          0.113
Short rest (3d)     -0.058
Name: PTS_DEMEANED, dtype: float64

Within-player, extended rest effect: -0.597 points
Less than a point! Most of the aggregate effect was driven by WHO gets rest.

The side-by-side visualization below makes the paradox vivid: the left panel shows the misleading aggregate comparison, and the right panel shows the within-player story with 95% confidence intervals.

Code
# Visualize the paradox
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ['#e74c3c', '#3498db', '#2ecc71', '#f39c12']

# Left: aggregate (misleading)
axes[0].bar(range(len(agg_means)), agg_means.values, color=colors, edgecolor='black')
axes[0].set_xticks(range(len(agg_means)))
axes[0].set_xticklabels(agg_means.index, rotation=15, fontsize=9)
axes[0].set_ylabel('Mean Points')
axes[0].set_ylim(bottom=0)
axes[0].set_title('Aggregate: Rest Looks Harmful\n(Simpson\'s paradox!)')

# Right: within-player (correct)
sems = df.dropna(subset=['REST_CATEGORY']).groupby('REST_CATEGORY')['PTS_DEMEANED'].sem()
axes[1].bar(range(len(within_means)), within_means.values,
            yerr=sems.values * 1.96, color=colors, edgecolor='black', capsize=5)
axes[1].set_xticks(range(len(within_means)))
axes[1].set_xticklabels(within_means.index, rotation=15, fontsize=9)
axes[1].axhline(y=0, color='black', linewidth=0.5)
axes[1].set_ylabel('Mean Points (relative to player avg)')
axes[1].set_title('Within-Player: Almost No Effect\n(error bars = 95% CI)')
plt.tight_layout()
plt.show()

print("Left: misleading. Right: the real story.")
print("Always ask: WHY were these groups different to begin with?")

Left: misleading. Right: the real story.
Always ask: WHY were these groups different to begin with?

Key takeaways

Both halves of this chapter share a theme: statistical results can be deeply misleading when you don’t account for the structure of your analysis.

Multiple testing:

  1. Multiple testing inflates false positives. If you test \(m\) hypotheses at level \(\alpha\), you expect \(m \alpha\) false positives. Testing about 280 players yields ~14 “significant” results in expectation even when rest has zero effect.

  2. Always correct for multiple comparisons. Bonferroni (\(\alpha/m\)) controls the family-wise error rate; Benjamini-Hochberg controls the false discovery rate. Both dramatically reduce spurious findings.

  3. The p-value histogram is your friend. Under the null, p-values are uniform. A spike near 0 suggests real effects; a flat histogram suggests noise.

Correlation and confounding:

  1. Correlation is not causation. The negative correlation between rest and points reflects who gets rest, not the effect of rest itself. Confounding variables drive the pattern.

  2. Simpson’s paradox is real and common. Aggregate comparisons can reverse the individual-level story when groups are heterogeneous. Always ask what’s driving the comparison.

In Chapter 12, we’ll use regression to estimate the effect of rest while controlling for player quality, schedule difficulty, and fatigue — and we’ll test whether that coefficient is statistically significant.

Study guide

Key ideas

  • Multiple testing (multiple comparisons): Running many hypothesis tests on the same data, inflating the chance of false positives.
  • Family-wise error rate (FWER): The probability of making any false positive across all tests. Bonferroni controls this.
  • False discovery rate (FDR): The expected proportion of rejected hypotheses that are false. BH controls this.
  • Bonferroni correction: Test each hypothesis at \(\alpha / m\) instead of \(\alpha\).
  • Benjamini-Hochberg (BH) procedure: Sort p-values; find the largest \(k\) where \(p_{(k)} \le (k/m) \alpha\); reject hypotheses \(1, \ldots, k\).
  • p-value histogram: Diagnostic tool; uniform under the null, spike near 0 if signal exists.
  • Pearson correlation (\(r\)): Measures linear association between two variables. Ranges from \(-1\) to \(+1\). Unitless and symmetric.
  • Confounding: When a third variable influences both the treatment and the outcome, creating a spurious association.
  • Simpson’s paradox: An aggregate trend reverses (or disappears) when you look within subgroups.
  • Ecological correlation: Correlations computed on group averages can be much stronger than individual-level correlations; always check the unit of analysis.
  • Reproducibility crisis: Bem’s precognition paper (2011) exposed how standard methods could validate implausible claims. The Reproducibility Project (2015) found only ~36% of 100 psychology studies replicated.
  • P-hacking (data dredging): Trying many analyses and reporting only the one that produces \(p < 0.05\). Researcher degrees of freedom (Simmons et al., 2011) create invisible multiplicity.
  • Pre-registration: Publicly committing to research question, data collection, and analysis plan before seeing data — eliminating researcher degrees of freedom.
  • Running \(m\) tests at \(\alpha = 0.05\) produces \(\sim m \times 0.05\) false positives in expectation even when nothing is real.
  • Bonferroni is safe but conservative; BH is more powerful but allows a controlled fraction of false discoveries.
  • A flat p-value histogram means no widespread effect; a spike near 0 means some discoveries are real.
  • Correlation between two variables can be driven entirely by a confounder — always ask who is in each group.
  • Within-group analysis (demeaning) removes confounders that differ between groups.

Computational tools

  • scipy.stats.ttest_ind(a, b) — Welch’s two-sample t-test
  • scipy.stats.pearsonr(x, y) — Pearson correlation and p-value
  • BH procedure: sort p-values, compute thresholds \(k/m \times \alpha\), apply step-up rule
  • .groupby().transform('mean') — compute group means aligned to original rows (for demeaning)

For the quiz

  • Be able to compute the expected number of false positives given \(m\) and \(\alpha\).
  • Know when to use Bonferroni vs. BH (stakes determine the correction).
  • Be able to interpret a p-value histogram.
  • Explain why a correlation can be misleading due to confounding.
  • Recognize Simpson’s paradox: aggregate trend differs from within-group trend.