Causal Inference I — DAGs and Confounding

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12

# Load data
DATA_DIR = 'data'
np.random.seed(42)

You’re a high school senior choosing between Stanford and your state school. Stanford costs $200K more over four years. Stanford graduates earn a median of $86K ten years after enrollment — but is that caused by Stanford, or would those students earn a lot anyway? That’s a $200K question.

The central question of causal inference is: when can we interpret a statistical association as a cause-and-effect relationship?

Definition: Causal Effect

X causes Y if intervening to change X — while holding everything else fixed — would change Y. The causal effect of X on Y is the difference in Y that results from this intervention.

If we could magically send every student to Stanford without changing anything else about them, would their earnings go up? The interventionist definition makes the question precise: a cause is something whose manipulation changes the outcome.

“Correlation is not causation.” — a foundational principle of statistics

Today we formalize why — and develop the tools to tell the difference.

The Stanford premium

Let’s start with the data. The College Scorecard tracks outcomes for students at every US college.

Code

scorecard = pd.read_csv(f'{DATA_DIR}/college-scorecard/scorecard.csv')

# Clean: keep schools with SAT and earnings data
scorecard['SAT_AVG'] = pd.to_numeric(scorecard['SAT_AVG'], errors='coerce')
scorecard['MD_EARN_WNE_P10'] = pd.to_numeric(scorecard['MD_EARN_WNE_P10'], errors='coerce')
colleges = scorecard.dropna(subset=['SAT_AVG', 'MD_EARN_WNE_P10']).copy()

# Find Stanford
stanford = colleges[colleges['INSTNM'].str.contains('Stanford University', na=False)]
print(f"Stanford SAT average: {stanford['SAT_AVG'].values[0]:.0f}")
print(f"Stanford median earnings (10 yrs): ${stanford['MD_EARN_WNE_P10'].values[0]:,.0f}")
print(f"\nSchools with both SAT and earnings data: {len(colleges):,}")

Stanford SAT average: 1465
Stanford median earnings (10 yrs): $86,000

Schools with both SAT and earnings data: 1,272

We use pd.to_numeric() with errors='coerce' to convert non-numeric entries to NaN, then dropna() to keep only complete rows. Now let’s plot SAT scores against earnings with stats.linregress() to fit a regression line.

Code

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(colleges['SAT_AVG'], colleges['MD_EARN_WNE_P10'], alpha=0.3, s=15, color='steelblue')

# Highlight Stanford
ax.scatter(stanford['SAT_AVG'], stanford['MD_EARN_WNE_P10'],
           color='red', s=100, zorder=5, label='Stanford')
ax.annotate('Stanford', xy=(stanford['SAT_AVG'].values[0], stanford['MD_EARN_WNE_P10'].values[0]),
            xytext=(10, 10), textcoords='offset points', fontsize=11, color='red')

# Regression line
slope, intercept, r, p, se = stats.linregress(colleges['SAT_AVG'], colleges['MD_EARN_WNE_P10'])
x_line = np.linspace(colleges['SAT_AVG'].min(), colleges['SAT_AVG'].max(), 100)
ax.plot(x_line, intercept + slope * x_line, 'r-', lw=2, alpha=0.7)

ax.set_xlabel('Average SAT Score')
ax.set_ylabel('Median Earnings (10 years after enrollment)')
ax.set_title(f'SAT Score vs. Earnings (r = {r:.2f})')
ax.legend()
plt.show()

print(f"Correlation: r = {r:.2f}")
print(f"Each 100-point SAT increase is associated with ${slope*100:,.0f} higher median earnings")

Correlation: r = 0.65
Each 100-point SAT increase is associated with $5,148 higher median earnings

Does attending a selective college cause higher earnings?

The correlation is striking: $r \approx 0.65$. Schools with higher average SATs have higher-earning graduates. This is the Stanford-vs-state-school question with a summary statistic attached — a naive interpretation would be: “attending a more selective school raises your earnings.”

Each row in this dataset is a school, not a student. SAT_AVG is the average SAT of the school’s admits, not any individual student’s score — so treat it as a stand-in for the bundle that makes a selective school selective: peer group, resources, alumni network, brand. The causal question is: does attending a school with that bundle cause a student to earn more?

But wait. What if something else drives both who attends a selective school and who earns more?

Think About It

What factors might affect both a student’s ability to attend a selective school and their future earnings?

Confounders: the hidden third variable

Definition: Confounder

A confounder is a common cause of both the treatment and the outcome — it must cause both, not just be correlated with both.

Here, family wealth is a classic confounder:

Wealthy families invest in test prep, top-tier high schools, application coaches $\rightarrow$ more likely to attend a selective college
Wealthy families provide professional networks, connections, inheritance $\rightarrow$ higher earnings

If family wealth drives both, then the selective-school-to-earnings correlation could be entirely spurious — a spurious association that exists only from a shared cause. Attending Stanford might not cause anything; the SAT_AVG column might just be a proxy for which schools admit wealthy students. (Recall from Chapter 11: “correlation is not causation.” Now let’s formalize why.)

Causal diagrams (DAGs)

Definition: DAG (Directed Acyclic Graph)

A Directed Acyclic Graph (DAG) is a visual tool for representing causal relationships. “Directed” means the arrows have direction (A causes B, not the reverse). “Acyclic” means there are no loops — you can never follow the arrows in a circle back to where you started. Each arrow means “causes” or “directly influences.” Another formal framework, the potential outcomes model (also called the Rubin causal model), defines causal effects algebraically rather than graphically.

Preview: the potential outcomes framework

DAGs represent causal structure visually. The potential outcomes model defines causal effects algebraically. For each unit, imagine two potential outcomes: $Y(1)$ under treatment and $Y(0)$ under control. The causal effect is $Y(1) - Y(0)$, but we only observe one of these for any individual (the fundamental problem of causal inference). Randomization ensures that the average causal effect can be estimated, even though individual effects cannot.

Write the treatment as “attend selective college” (proxied in the data by a school’s SAT_AVG). There are two paths from the treatment to earnings:

Causal path (directed path): Attend selective college $\to$ Earnings (maybe real — arrows point forward from treatment to outcome)
Backdoor path: Attend selective college $\leftarrow$ Wealth $\to$ Earnings (spurious — goes “behind” the treatment through a common cause)

The standard causal inference term for path 2 is a backdoor path: it connects treatment to outcome by going through a common cause rather than through a direct causal chain. A naive regression mixes both paths together. We can’t tell which is which without more information. Let’s draw it.

Code

def draw_dag(nodes, edges, annotations=None, title='', figsize=(8, 6)):
    """Draw a simple DAG. nodes: dict {name: (x,y)}, edges: list [(from, to)],
    annotations: list [(text, x, y, color)]."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_xlim(0, 10); ax.set_ylim(0, 8); ax.axis('off')
    for name, (x, y) in nodes.items():
        fc = 'lightyellow' if '(unobserved)' in name else 'lightblue'
        ls = '--' if '(unobserved)' in name else '-'
        ax.add_patch(plt.Rectangle((x-1.2, y-0.4), 2.4, 0.8,
            facecolor=fc, edgecolor='navy', lw=2, ls=ls, zorder=3))
        ax.text(x, y, name, ha='center', va='center', fontsize=11, fontweight='bold', zorder=4)
    for (sn, dn) in edges:
        sx, sy = nodes[sn]; dx, dy = nodes[dn]
        ax.annotate('', xy=(dx, dy), xytext=(sx, sy),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='navy'))
    for text, x, y, color in (annotations or []):
        ax.annotate(text, xy=(x, y), fontsize=10, color=color, fontstyle='italic', ha='center')
    if title:
        ax.set_title(title, fontsize=14, fontweight='bold')
    plt.tight_layout(); plt.show()

The draw_dag() helper above draws nodes as rectangles and connects them with arrows. Let’s use it to visualize the selectivity–earnings DAG.

Code

draw_dag(
    nodes={'Family Wealth\n(unobserved)': (5, 7), 'Attend selective\ncollege': (2, 4),
           'Earnings': (8, 4)},
    edges=[('Family Wealth\n(unobserved)', 'Attend selective\ncollege'),
           ('Family Wealth\n(unobserved)', 'Earnings'),
           ('Attend selective\ncollege', 'Earnings')],
    annotations=[('backdoor\npath', 3.2, 5.8, 'red'), ('causal\npath?', 5.0, 3.6, 'green')],
    title='Causal diagram (DAG): selective-college attendance and earnings'
)

The dashed yellow node marks a variable we cannot cleanly observe — family wealth is not in the Scorecard dataset. The chapter’s central difficulty: we can see the correlation along the red path, but we can’t separate the two paths without either measuring wealth or intervening.

When does regression estimate a causal effect?

In Chapter 12, we interpreted regression coefficients as “holding everything else constant.” Now we can see when that interpretation is causal and when it’s just descriptive.

Regression estimates a causal effect only when you block all backdoor paths between treatment and outcome — by controlling for at least one non-collider variable on each backdoor path — without blocking any causal paths.

In practice, this is nearly impossible with observational data (data collected without an experiment):

We can’t measure “family wealth” perfectly
There may be confounders we haven’t even thought of (motivation, neighborhood, genetics…)
Controlling for the wrong variable can actually create bias (we’ll see this with colliders soon)

The gap between correlation and causation motivates the distinction between observational data (collected without intervention) and experimental data (from a randomized experiment). In a randomized experiment, the treatment is assigned randomly, so there are no backdoor paths to worry about.

The Fundamental Problem

With observational data, you can never be sure you’ve blocked every backdoor path.

Simpson’s paradox: a confounding problem

In Chapter 11, you saw Simpson’s Paradox in NBA rest data: extended-rest players scored fewer points overall, but more points within each position group. The aggregate lied. We first saw hospital readmission data in Chapter 2, where the “worst” hospitals turned out to be small rural ones with noisy estimates. Now let’s see another way this data can mislead — and explain why using DAGs.

The question: are high-volume hospitals better or worse at preventing readmissions?

Code

hospitals = pd.read_csv(f'{DATA_DIR}/hospital-readmissions/hospital_level_summary.csv')
hospitals['star_rating_num'] = pd.to_numeric(hospitals['star_rating_num'], errors='coerce')

hospitals['is_high_volume'] = hospitals['total_discharges'] > hospitals['total_discharges'].median()

# Overall comparison
print("Overall readmission rates:")
print("-" * 50)
for group_name, group in hospitals.groupby('is_high_volume'):
    label = 'High-volume' if group_name else 'Low-volume'
    rate = group['overall_readmission_rate'].mean()
    n = len(group)
    print(f"  {label}: {rate:.2f}%  (n={n})")

print("\nHigh-volume hospitals appear BETTER overall.")

Overall readmission rates:
--------------------------------------------------
  Low-volume: 18.25%  (n=1529)
  High-volume: 16.90%  (n=1526)

High-volume hospitals appear BETTER overall.

We use pd.qcut() to split hospitals into three severity groups of roughly equal size, then compare readmission rates within each group.

Code

# Create severity categories for within-group comparison
hospitals['severity_cat'] = pd.qcut(hospitals['severity_proxy'], q=3,
                                     labels=['Low severity', 'Medium severity', 'High severity'])

But when we break the data down by patient severity, the story reverses:

Code

fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)

# Left: overall
overall = hospitals.groupby('is_high_volume')['overall_readmission_rate'].mean()
colors_bar = ['#4878CF', '#E24A33']
axes[0].bar(['Low-volume', 'High-volume'],
            overall.values, color=colors_bar, edgecolor='white', width=0.5)
axes[0].set_ylabel('Mean Readmission Rate (%)')
axes[0].set_title('Overall: High-Volume Looks BETTER')

# Right: within severity categories
severity_cats = ['Low severity', 'Medium severity', 'High severity']
x = np.arange(len(severity_cats))
width = 0.35

rates_low = []
rates_high = []
for sev in severity_cats:
    subset = hospitals[hospitals['severity_cat'] == sev]
    rates_low.append(subset[~subset['is_high_volume']]['overall_readmission_rate'].mean())
    rates_high.append(subset[subset['is_high_volume']]['overall_readmission_rate'].mean())

axes[1].bar(x - width/2, rates_low, width, label='Low-volume', color='#4878CF', edgecolor='white')
axes[1].bar(x + width/2, rates_high, width, label='High-volume', color='#E24A33', edgecolor='white')
axes[1].set_xticks(x)
axes[1].set_xticklabels(severity_cats)
axes[1].set_ylabel('Mean Readmission Rate (%)')
axes[1].set_title('Within Severity: High-Volume is WORSE')
axes[1].legend()

plt.suptitle("Simpson's Paradox in Hospital Readmissions", fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Why does this happen?

The DAG tells the story:

Code

draw_dag(
    nodes={'Patient severity': (2.2, 5.5), 'Hospital choice\n(high vs low volume)': (5, 2),
           'Readmission rate': (8, 5.5)},
    edges=[('Patient severity', 'Hospital choice\n(high vs low volume)'),
           ('Patient severity', 'Readmission rate'),
           ('Hospital choice\n(high vs low volume)', 'Readmission rate')],
    annotations=[('backdoor\npath', 5, 6.2, 'red')],
    title="Simpson's paradox in hospitals: patient severity is the confounder",
    figsize=(8, 5)
)

Patient severity is a confounder: it causes both Hospital Choice (sicker patients tend to go to smaller, specialized hospitals) and Readmission Rate (sicker patients get readmitted more). High-volume hospitals see a less severe case mix, so their aggregate readmission rate looks lower. But within each severity level — blocking the backdoor path — high-volume hospitals actually perform worse.

Simpson’s Paradox is a confounding problem. The aggregate data tells the opposite story from the disaggregated data: high-volume hospitals look better overall but worse within each severity group. Which one is “right”? The one that blocks the backdoor path by controlling for the confounder.

Simpson’s paradox in policing: NYPD stop-and-frisk

Simpson’s Paradox doesn’t just show up in hospitals. It appears whenever a confounding variable has different distributions across groups — and the policy consequences can be enormous.

From 2008 to 2012, the NYPD conducted roughly 760,000 stops under its Concealed Weapons Possession (CPW) program. For each stop, officers recorded whether a weapon was found — the hit rate. A higher hit rate means officers are stopping people on stronger evidence; a lower hit rate suggests weaker justification for the stop.

Code

# NYPD stop-and-frisk hit rates (Goel, Rao, Shroff 2016; Grossman 2023 MSE 125 Lecture 1)
print("NYPD Stop-and-Frisk Hit Rates (2008-2012)")
print("=" * 55)
print(f"  City-wide hit rate, Black individuals:  2.5%")
print(f"  City-wide hit rate, White individuals:  13%")
print(f"  City-wide ratio: 13 / 2.5 = {13/2.5:.1f}x")
print(f"\n  Within-precinct ratio (controlling for location): 2.2x")
print(f"\n  The ratio drops from {13/2.5:.1f}x to 2.2x after controlling for precinct.")

NYPD Stop-and-Frisk Hit Rates (2008-2012)
=======================================================
  City-wide hit rate, Black individuals:  2.5%
  City-wide hit rate, White individuals:  13%
  City-wide ratio: 13 / 2.5 = 5.2x

  Within-precinct ratio (controlling for location): 2.2x

  The ratio drops from 5.2x to 2.2x after controlling for precinct.

Code

# Visualize the city-wide vs within-precinct ratio — same Simpson's reversal as hospitals
fig, ax = plt.subplots(figsize=(6, 4.5))
comparisons = ['City-wide\n(confounded by precinct)', 'Within-precinct\n(confounder blocked)']
ratios = [13 / 2.5, 2.2]
colors_hr = ['#d62728', '#4878CF']
bars = ax.bar(comparisons, ratios, color=colors_hr, edgecolor='white', width=0.55)
ax.axhline(y=1.0, color='black', linewidth=0.8, linestyle='--', alpha=0.6)
ax.text(1.45, 1.08, 'equal hit rates', color='black', fontsize=9, alpha=0.7, ha='right')
ax.set_ylabel('Hit-rate ratio (white ÷ Black)')
ax.set_title('NYPD stop-and-frisk: disparity shrinks after blocking the backdoor')
for bar, r in zip(bars, ratios):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
            f'{r:.1f}x', ha='center', va='bottom', fontsize=13, fontweight='bold')
ax.set_ylim(0, 6)
plt.tight_layout()
plt.show()

The city-wide numbers paint a stark picture: the hit rate for white individuals is 5.2x higher than for Black individuals, suggesting officers stop Black individuals on far weaker evidence. But the city-wide comparison is confounded by precinct. Different precincts have different stop rates and different racial compositions.

The DAG:

Code

draw_dag(
    nodes={'Precinct': (2, 5.5), 'Race of\nperson stopped': (5, 2), 'Hit rate': (8, 5.5)},
    edges=[('Precinct', 'Race of\nperson stopped'),
           ('Precinct', 'Hit rate'),
           ('Race of\nperson stopped', 'Hit rate')],
    annotations=[('backdoor\npath', 5, 6.2, 'red')],
    title="Stop-and-frisk: precinct is the confounder",
    figsize=(8, 5)
)

Precinct is a confounder: it influences both who gets stopped (racial composition varies by neighborhood) and the hit rate (policing intensity and weapon prevalence vary by location). When we compare hit rates within precincts — blocking the backdoor path — the ratio drops from 5.2x to 2.2x. (The within-precinct number isn’t computed from our summary data; it’s the result of Sharad Goel, Justin Rao, and Ravi Shroff’s stop-level analysis in the Annals of Applied Statistics, 2016.)

The within-precinct ratio is still evidence of disparate treatment: officers appear to stop Black individuals on weaker evidence even within the same precinct. But the magnitude of the disparity changes dramatically once we control for location. Getting the confounding structure right is essential for accurate policy analysis.

After the Stanford Open Policing Project published similar within-group analyses for the LAPD, the department changed its Metro Division operations (Los Angeles Times, October 2019). The statistical question — aggregate or disaggregated? — had real consequences for how a major city policed its residents.

John Snow and the birth of epidemiology

Historical Context: John Snow’s Cholera Investigation

London, 1854. Cholera is killing thousands. The prevailing theory: “bad air” (miasma) causes disease. A physician named John Snow had a different idea: contaminated water.

Let’s look at his data.

Code

deaths = pd.read_csv(f'{DATA_DIR}/john-snow/snow_deaths.csv')
pumps = pd.read_csv(f'{DATA_DIR}/john-snow/snow_pumps.csv')
water = pd.read_csv(f'{DATA_DIR}/john-snow/snow_water_companies.csv')

print("Each row is a cholera death with x,y coordinates on Snow's map")
print("\nDeath locations (first 5):")
print(deaths.head())
print(f"\nTotal deaths recorded: {len(deaths)}")
print(f"\nWater pumps:")
print(pumps[['pump', 'label']].to_string(index=False))

Each row is a cholera death with x,y coordinates on Snow's map

Death locations (first 5):
   rownames  case          x          y
0         1     1  13.588010  11.095600
1         2     2   9.878124  12.559180
2         3     3  14.653980  10.180440
3         4     4  15.220570   9.993003
4         5     5  13.162650  12.963190

Total deaths recorded: 578

Water pumps:
 pump          label
    1  Oxford Market
    2    Castle St E
    3   Oxford St #1
    4   Oxford St #2
    5 Gt Marlborough
    6   Crown Chapel
    7       Broad St
    8        Warwick
    9     Briddle St
   10        So Soho
   11        Dean St
   12    Coventry St
   13        Vigo St

Snow’s famous map plots each death as a point and each pump as a triangle. Let’s recreate it.

Code

fig, ax = plt.subplots(figsize=(10, 8))

# Plot deaths
ax.scatter(deaths['x'], deaths['y'], c='red', s=8, alpha=0.6, label='Deaths')

# Plot pumps
ax.scatter(pumps['x'], pumps['y'], c='blue', s=100, marker='^', zorder=5, label='Water pumps')
for _, pump in pumps.iterrows():
    ax.annotate(pump['label'], xy=(pump['x'], pump['y']),
                xytext=(5, 5), textcoords='offset points', fontsize=7, color='blue')

ax.set_xlabel('x coordinate')
ax.set_ylabel('y coordinate')
ax.set_title("Snow's Cholera Map: Deaths Cluster Near Broad Street Pump")
ax.legend()
plt.tight_layout()
plt.show()

print("Deaths cluster around the Broad Street pump -- the famous map.")
print("But is this CAUSAL evidence? Or just a spatial correlation?")

Deaths cluster around the Broad Street pump -- the famous map.
But is this CAUSAL evidence? Or just a spatial correlation?

Snow’s real evidence: the natural experiment

The famous cholera map is compelling, but it’s just a correlation — deaths cluster near a pump. Maybe those neighborhoods were just poorer or more crowded.

Snow’s true breakthrough was something different. Two water companies served overlapping neighborhoods in South London:

Southwark & Vauxhall: drew water from the Thames downstream of sewage outflows
Lambeth: had recently moved its intake upstream of the sewage

Crucially, houses served by different companies were intermingled on the same streets. Same neighborhoods, same air, same poverty — but different water.

Code

print("Deaths by Water Company (1854 cholera outbreak):")
print("=" * 55)
print(water.to_string(index=False))

sv_rate = water[water['company'] == 'Southwark & Vauxhall']['death_rate_per_10000'].values[0]
l_rate = water[water['company'] == 'Lambeth']['death_rate_per_10000'].values[0]

print(f"\nSouthwark & Vauxhall death rate: {sv_rate} per 10,000")
print(f"Lambeth death rate: {l_rate} per 10,000")
print(f"Ratio: {sv_rate / l_rate:.1f}x higher death rate for contaminated water")

Deaths by Water Company (1854 cholera outbreak):
=======================================================
             company  houses  deaths  death_rate_per_10000
Southwark & Vauxhall   40046    1263                   315
             Lambeth   26107      98                    37
      Rest of London  256423    1422                    59

Southwark & Vauxhall death rate: 315 per 10,000
Lambeth death rate: 37 per 10,000
Ratio: 8.5x higher death rate for contaminated water

Code

fig, ax = plt.subplots(figsize=(8, 5))

companies = water['company'].values
rates = water['death_rate_per_10000'].values
colors_wc = ['#d62728', '#2ca02c', '#7f7f7f']

bars = ax.bar(companies, rates, color=colors_wc, edgecolor='white', width=0.6)
ax.set_ylabel('Deaths per 10,000 houses')
ax.set_title('Cholera Death Rates by Water Company (1854)')

# Annotate bars
for bar, rate in zip(bars, rates):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
            f'{rate}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Ratio annotation
ax.annotate(f'{sv_rate/l_rate:.1f}x higher\ndeath rate',
            xy=(0, sv_rate), xytext=(0.5, sv_rate - 50),
            fontsize=12, color='red', fontweight='bold',
            arrowprops=dict(arrowstyle='->', color='red'))

plt.tight_layout()
plt.show()

Why this is a natural experiment

Definition: Natural Experiment

A natural experiment occurs when some external circumstance assigns treatment in a way that is as-if random — creating comparable groups without deliberate randomization.

Here, the “treatment” is contaminated water. The key features:

Feature	Status
Treatment assigned randomly?	No — but as-if random within neighborhoods
Groups comparable?	Yes — same streets, same poverty, same “bad air”
Confounders controlled?	Yes — the only difference was the water source
Temporal precedent?	Yes — Lambeth moved its intake before the outbreak

The intermingling of houses is what makes Snow’s evidence causal, not just correlational. He couldn’t run an experiment (you can’t randomly give people contaminated water!), but nature effectively ran one for him.

Code

draw_dag(
    nodes={'Water company\n(as-if random)': (2, 4.5), 'Water quality': (5, 4.5),
           'Cholera death': (8, 4.5)},
    edges=[('Water company\n(as-if random)', 'Water quality'),
           ('Water quality', 'Cholera death')],
    annotations=[('no backdoor path — neighborhoods are comparable', 5, 2.5, '#2ca02c')],
    title="Snow's natural experiment: no unblocked backdoor path",
    figsize=(9, 4)
)

The map is not the experiment

Here’s what most people get wrong about John Snow. The famous cholera map — deaths clustered around the Broad Street pump — is the image everyone remembers. It’s in every textbook, every documentary, every data science course.

But the map is not the causal evidence.

The map shows a spatial correlation: deaths near the pump. But maybe people near that pump were also poorer, more crowded, less healthy. The map can’t rule that out.

Snow’s actual causal argument came from the water company comparison — the natural experiment. Same neighborhoods, different water, 8.5x difference in death rates. That’s the evidence that contaminated water causes cholera.

The map is memorable. The natural experiment is convincing. They’re not the same thing.

Code

# Let's quantify the map's limitation
from scipy.spatial.distance import cdist  # computes distances between sets of points

death_coords = deaths[['x', 'y']].values
pump_coords = pumps[['x', 'y']].values

distances = cdist(death_coords, pump_coords)
nearest_pump_idx = distances.argmin(axis=1)

deaths_with_pumps = deaths.copy()
deaths_with_pumps['nearest_pump'] = pumps.iloc[nearest_pump_idx]['label'].values
deaths_with_pumps['distance'] = distances.min(axis=1)

# Deaths by nearest pump
pump_deaths = deaths_with_pumps.groupby('nearest_pump').size().sort_values(ascending=False)
print("Deaths by nearest pump:")
print(pump_deaths.head(8).to_string())
print(f"\nBroad Street pump accounts for {pump_deaths.get('Broad St', 0)} of {len(deaths)} deaths")
print("\nBut proximity is not causation. Snow knew he needed stronger evidence.")

Deaths by nearest pump:
nearest_pump
Broad St          359
So Soho            64
Crown Chapel       61
Briddle St         27
Oxford St #2       24
Warwick            16
Oxford St #1       12
Gt Marlborough      6

Broad Street pump accounts for 359 of 578 deaths

But proximity is not causation. Snow knew he needed stronger evidence.

A framework: levels of causal evidence

Snow’s investigation illustrates a hierarchy of evidence that we’ll use throughout causal inference. Not all evidence is created equal:

Level	Method	Example	Strength
1	Randomized experiment	Clinical trial (ACTG 175 from Chapter 10)	Gold standard
2	Natural experiment	Snow’s water companies	Very strong
3	Regression with confounders	School selectivity $\to$ earnings, controlling for wealth	Depends on what you control for
4	Simple correlation	School SAT_AVG and earnings ($r = 0.65$)	Weakest — could be confounded

We’ll see more of levels 1 and 2 in Chapter 19.

Can we fix regression? Controlling for confounders

If we can’t run an experiment, can we at least control for confounders in a regression?

Let’s try. In the college data, CONTROL indicates public (1), private nonprofit (2), or private for-profit (3). This variable is a rough proxy for resources and wealth. What happens when we add it?

Code

from sklearn.linear_model import LinearRegression

# Simple regression: SAT -> earnings
X_simple = colleges[['SAT_AVG']].values
y = colleges['MD_EARN_WNE_P10'].values

model_simple = LinearRegression().fit(X_simple, y)
print("Simple regression: SAT -> Earnings")
print(f"  Coefficient on SAT: ${model_simple.coef_[0]:.0f} per point")
print(f"  R-squared = {model_simple.score(X_simple, y):.3f}")

# Multiple regression: SAT + school type -> earnings
colleges['is_private_nonprofit'] = (colleges['CONTROL'] == 2).astype(int)
colleges['is_for_profit'] = (colleges['CONTROL'] == 3).astype(int)
X_multi = colleges[['SAT_AVG', 'is_private_nonprofit', 'is_for_profit']].values

model_multi = LinearRegression().fit(X_multi, y)
print(f"\nMultiple regression: SAT + school type -> Earnings")
print(f"  Coefficient on SAT: ${model_multi.coef_[0]:.0f} per point")
print(f"  Coefficient on private nonprofit: ${model_multi.coef_[1]:,.0f}")
print(f"  R-squared = {model_multi.score(X_multi, y):.3f}")

print(f"\nThe SAT coefficient barely changed (${model_simple.coef_[0]:.0f} → ${model_multi.coef_[0]:.0f}) when we controlled for school type.")
print("School type captures only a coarse difference in resources -- within each type,")
print("family wealth still varies enormously, so it can still confound the relationship")
print("even when public/private flags don't. Not every suspected confounder actually")
print("confounds at the level you measure it -- but the unmeasured confounders remain.")

Simple regression: SAT -> Earnings
  Coefficient on SAT: $51 per point
  R-squared = 0.417

Multiple regression: SAT + school type -> Earnings
  Coefficient on SAT: $52 per point
  Coefficient on private nonprofit: $31
  R-squared = 0.420

The SAT coefficient barely changed ($51 → $52) when we controlled for school type.
School type captures only a coarse difference in resources -- within each type,
family wealth still varies enormously, so it can still confound the relationship
even when public/private flags don't. Not every suspected confounder actually
confounds at the level you measure it -- but the unmeasured confounders remain.

Feedback loops as causal structure

Before leaving confounding, there’s a subtler cousin worth naming. Chapter 16 introduced feedback loops: cases where the model’s predictions change the outcome it’s trying to measure (predictive policing, credit scoring, sepsis alerts). The DAG view makes precise why those loops break standard analysis.

In a feedback-loop DAG, the model’s prediction is an intervention on the outcome, not just an observation of it. The arrow from prediction to outcome is a causal edge — not a passive correlation — and any path that goes through the model’s output cannot be blocked by conditioning on the model’s inputs.

Code

draw_dag(
    nodes={'Training\ndata': (1.5, 5), 'Model': (4, 5), 'Prediction': (6.5, 5),
           'Action': (6.5, 2.5), 'Outcome': (4, 2.5), 'New training\ndata (next round)': (1.5, 2.5)},
    edges=[('Training\ndata', 'Model'), ('Model', 'Prediction'),
           ('Prediction', 'Action'), ('Action', 'Outcome'),
           ('Outcome', 'New training\ndata (next round)'),
           ('New training\ndata (next round)', 'Training\ndata')],
    annotations=[('model changes what it measures', 4, 6.3, '#d62728')],
    title='Feedback-loop DAG: prediction is an intervention, not an observation',
    figsize=(9, 5.5)
)

The fix isn’t better covariates. Conditioning on model inputs cannot block a path that runs through the model’s output. The fix is structural: either sever the loop (for example, randomize some decisions as a holdout so that outcomes for those cases are generated without the model’s input) or analyze the deployed system as a causal system with its own treatment-and-outcome structure. Chapter 19 develops both approaches.

The limits of “controlling for”

Adding control variables to a regression can help — but only if:

You include all confounders. Miss one, and the estimate is biased.
You measure them well. A rough proxy for “family wealth” isn’t the same as actual wealth.
You don’t control for the wrong things. Controlling for a mediator (something on the causal path) or a collider (something caused by both treatment and outcome) can make bias worse.
The arrow structure must be passive. If the “treatment” is a deployed model’s prediction, no amount of covariate adjustment recovers a causal effect — see the feedback-loop DAG above.

The difficulty of controlling for all confounders is the core challenge of observational studies. You’re always one unmeasured confounder away from a wrong answer.

Think About It

What would you need to measure to estimate the causal effect of attending a selective college on earnings?

A trap: collider bias

Think About It

We said you should control for confounders. Should you always add more control variables to a regression? What could go wrong?

Here’s a puzzle. Imagine you’re at a prestigious consulting firm. You notice that the most talented analysts seem to have the weakest professional networks, and the best-connected people tend to be mediocre analysts. Does getting hired destroy your social skills? No — this is a statistical illusion called collider bias.

Definition: Collider

A variable is a collider on a particular path if both arrows on that path point into it. (Note: being a collider is a property of a variable on a specific path, not a property of the variable itself — the same variable might be a collider on one path but not another.)

  Talent ──→ Hired at Top Firm ←── Connections

In the simulation below, talent and connections are independent by construction. But among people hired at top firms (conditioning on the collider), they’re negatively correlated — you got in either through talent OR connections.

The resulting artifact is called collider bias or selection bias, and it’s a common trap in regression analysis.

Code

# Simulate collider bias
n = 5000
talent = np.random.normal(0, 1, n)
connections = np.random.normal(0, 1, n)

# Hiring depends on both (collider)
hire_score = talent + connections + np.random.normal(0, 0.5, n)
hired = hire_score > np.percentile(hire_score, 70)
print(f"Hired {hired.sum()} of {n} people (top 30% of hire score)")

Hired 1500 of 5000 people (top 30% of hire score)

Code

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Left: full population
axes[0].scatter(talent, connections, alpha=0.1, s=5, color='gray')
r_all = np.corrcoef(talent, connections)[0, 1]
axes[0].set_xlabel('Talent'); axes[0].set_ylabel('Connections')
axes[0].set_title(f'Full Population: r = {r_all:.3f} (no relationship)')

# Right: among hired only — negative correlation!
axes[1].scatter(talent[hired], connections[hired], alpha=0.3, s=10, color='steelblue')
r_hired = np.corrcoef(talent[hired], connections[hired])[0, 1]
axes[1].set_xlabel('Talent'); axes[1].set_ylabel('Connections')
axes[1].set_title(f'Among Hired Only: r = {r_hired:.3f} (collider bias!)')

plt.suptitle('Collider Bias: Controlling for "Hired" Creates a Spurious Correlation',
             fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print(f"In the full population, talent and connections are unrelated (r = {r_all:.3f}).")
print(f"Among hired people, they're negatively correlated (r = {r_hired:.3f}).")
print("Controlling for a collider creates bias that wasn't there before!")
print("(The negative correlation holds regardless of the hiring threshold — try changing 70 to 50 or 90.)")

In the full population, talent and connections are unrelated (r = -0.002).
Among hired people, they're negatively correlated (r = -0.495).
Controlling for a collider creates bias that wasn't there before!
(The negative correlation holds regardless of the hiring threshold — try changing 70 to 50 or 90.)

Collider Bias

Conditioning on a common effect of two variables creates a fake relationship between them. That’s collider bias. Always check your DAG before adding control variables to a regression.

Confounder vs. Collider

A confounder is a common cause of treatment and outcome — controlling for it removes bias. A collider is a common effect of treatment and outcome — controlling for it creates bias. The two require opposite strategies: always control for confounders, never control for colliders.

DAG rules: what to control for

Before we summarize the rules, let’s work through one more example. Consider this DAG for the effect of Education on Health:

Code

draw_dag(
    nodes={'Socioeconomic\nbackground': (2, 6), 'Education': (1.5, 3),
           'Income': (5, 3), 'Health': (8.5, 3)},
    edges=[('Socioeconomic\nbackground', 'Education'),
           ('Socioeconomic\nbackground', 'Health'),
           ('Education', 'Income'), ('Income', 'Health')],
    annotations=[('backdoor path (confounder)', 5.2, 6.3, 'red'),
                 ('causal path through mediator', 5, 1.9, 'green')],
    title="Education → Health: socioeconomic background confounds, income mediates",
    figsize=(9, 6)
)

What’s the confounder? Socioeconomic Background — it’s a common cause of Education and Health. Should we control for it? Yes — it’s on a backdoor path. What’s the mediator? Income — it’s on the causal path from Education to Health. Should we control for Income? No — it would block the very effect we’re trying to measure.

Definition: Mediator

A mediator is a variable on the causal path between treatment and outcome (X $\rightarrow$ Z $\rightarrow$ Y). Controlling for a mediator removes part of the causal effect you are trying to measure.

Here are the key rules for reading causal diagrams:

Structure	Name	Example	Control for it?
X $\leftarrow$ Z $\rightarrow$ Y	Confounder	Wealth $\to$ Selective college, Wealth $\to$ Earnings	Yes! Blocks the backdoor path
X $\rightarrow$ Z $\rightarrow$ Y	Mediator	Selective college $\to$ Alumni network $\to$ Earnings	No! Blocks the causal path
X $\rightarrow$ Z $\leftarrow$ Y	Collider	Talent $\to$ Hired $\leftarrow$ Connections	No! Opens a spurious path

Definition: Back-Door Criterion

The back-door criterion says: to estimate the causal effect of X on Y, you must block all backdoor paths (paths into X through common causes) without blocking any causal paths (paths from X to Y) or opening any collider paths. An additional requirement: you cannot control for any descendant of the treatment (a variable caused by X), since doing so partly blocks the causal effect or introduces bias through collider structures downstream of X.

These rules follow from whether “controlling for” a variable opens or closes the path between treatment and outcome. Getting this wrong — controlling for the wrong variable — is one of the most common mistakes in applied statistics.

Think About It

In the DAG below, identify the confounder, mediator, and collider. Which should you control for?

X $\to$ M $\to$ Y, X $\leftarrow$ C $\to$ Y, X $\to$ Z $\leftarrow$ Y

Key takeaways

“Correlation is not causation.” Now you know why — and you have the tools (DAGs) to tell the difference.

Confounders create spurious associations. A confounder is a common cause of treatment and outcome, creating a backdoor path that makes correlation look like causation.
Causal diagrams (DAGs) make the reasoning visual. Draw the arrows, trace the paths, and you’ll see whether regression can identify a causal effect.
Simpson’s Paradox is a confounding problem: aggregate data can show the opposite of disaggregated data. The DAG tells you which is right.
Natural experiments (like Snow’s water companies — not his famous map!) provide causal evidence without randomization, by exploiting situations where treatment is as-if random.
Block all backdoor paths, but don’t block causal paths. Control for confounders (good), but never control for colliders (opens spurious paths) or mediators (removes the effect you’re trying to measure).

Back to the opening question: with observational data alone, can we estimate Stanford’s causal effect on earnings? Honestly — no. We can’t measure family wealth perfectly, can’t enumerate every confounder, and the only “treatment” we observe is the bundle of selection + admission + culture that makes Stanford Stanford. Next time: difference-in-differences, A/B tests, and counterfactual reasoning — the tools for designing studies that can answer questions like this.

Study guide

Key ideas

Causal inference: determining when a statistical association reflects a cause-and-effect relationship
Confounder: a common cause of treatment and outcome that creates a spurious association (a backdoor path)
DAG (Directed Acyclic Graph): a diagram of causal relationships; “directed” = arrows have direction, “acyclic” = no loops
Backdoor path: a non-causal path connecting treatment and outcome through a common cause
Causal path (directed path): a path where all arrows point forward from treatment to outcome
Back-door criterion: to estimate a causal effect, block all backdoor paths without blocking causal paths or opening collider paths
Collider: a variable on a path where both arrows point into it; conditioning on a collider opens a spurious path
Mediator: a variable on the causal path between treatment and outcome
Spurious association: a correlation arising entirely from a common cause, not from a causal relationship
Natural experiment: a situation where an external circumstance assigns treatment as-if randomly
Observational data vs. experimental data: data collected without intervention vs. data from a randomized experiment
Simpson’s Paradox: when a trend reverses after disaggregating by a confounding variable
Confounders create backdoor paths that make correlation look like causation; DAGs let you see and block them.
Regression estimates a causal effect only when all backdoor paths are blocked without blocking causal paths.
Controlling for a collider creates bias; controlling for a mediator removes the effect you want to measure.
Natural experiments exploit as-if-random assignment to provide causal evidence without a randomized trial.
Simpson’s Paradox is a confounding problem that DAGs explain: the aggregate hides a backdoor path.

Computational tools

No new computational tools are critical for this lecture — the key skills are conceptual (drawing and reading DAGs, identifying confounders/colliders/mediators)
pd.qcut() for creating severity categories
scipy.spatial.distance.cdist() for computing distances between points
The draw_dag() helper function pattern for visualizing causal diagrams

For the quiz

You should be able to:

Given a scenario, identify confounders, mediators, and colliders
Given a DAG, determine which variables to control for and which to leave alone
Explain why controlling for a collider creates bias (with an example)
Distinguish observational evidence (correlation) from experimental evidence (causation)
Explain Simpson’s Paradox using a DAG

How was this chapter? Help us improve these notes. Share feedback