Causal Inference I — DAGs and Confounding

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12

# Load data
DATA_DIR = 'data'
np.random.seed(42)

You’re a high school senior choosing between Stanford and your state school. Stanford costs $200K more over four years. Stanford graduates earn a median of $86K ten years after enrollment — but is that caused by Stanford, or would those students earn a lot anyway? That’s a $200K question.

The central question of causal inference is: when can we interpret a statistical association as a cause-and-effect relationship?

ImportantDefinition: Causal Effect

X causes Y if intervening to change X — while holding everything else fixed — would change Y. The causal effect of X on Y is the difference in Y that results from this intervention.

If we could magically send every student to Stanford without changing anything else about them, would their earnings go up? The interventionist definition makes the question precise: a cause is something whose manipulation changes the outcome.

“Correlation is not causation.” — a foundational principle of statistics

Today we formalize why — and develop the tools to tell the difference.

The Stanford Premium

Let’s start with the data. The College Scorecard tracks outcomes for students at every US college.

Code
scorecard = pd.read_csv(f'{DATA_DIR}/college-scorecard/scorecard.csv')

# Clean: keep schools with SAT and earnings data
scorecard['SAT_AVG'] = pd.to_numeric(scorecard['SAT_AVG'], errors='coerce')
scorecard['MD_EARN_WNE_P10'] = pd.to_numeric(scorecard['MD_EARN_WNE_P10'], errors='coerce')
colleges = scorecard.dropna(subset=['SAT_AVG', 'MD_EARN_WNE_P10']).copy()

# Find Stanford
stanford = colleges[colleges['INSTNM'].str.contains('Stanford University', na=False)]
print(f"Stanford SAT average: {stanford['SAT_AVG'].values[0]:.0f}")
print(f"Stanford median earnings (10 yrs): ${stanford['MD_EARN_WNE_P10'].values[0]:,.0f}")
print(f"\nSchools with both SAT and earnings data: {len(colleges):,}")
Stanford SAT average: 1465
Stanford median earnings (10 yrs): $86,000

Schools with both SAT and earnings data: 1,272

We use pd.to_numeric() with errors='coerce' to convert non-numeric entries to NaN, then dropna() to keep only complete rows. Now let’s plot SAT scores against earnings with stats.linregress() to fit a regression line.

Code
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(colleges['SAT_AVG'], colleges['MD_EARN_WNE_P10'], alpha=0.3, s=15, color='steelblue')

# Highlight Stanford
ax.scatter(stanford['SAT_AVG'], stanford['MD_EARN_WNE_P10'],
           color='red', s=100, zorder=5, label='Stanford')
ax.annotate('Stanford', xy=(stanford['SAT_AVG'].values[0], stanford['MD_EARN_WNE_P10'].values[0]),
            xytext=(10, 10), textcoords='offset points', fontsize=11, color='red')

# Regression line
slope, intercept, r, p, se = stats.linregress(colleges['SAT_AVG'], colleges['MD_EARN_WNE_P10'])
x_line = np.linspace(colleges['SAT_AVG'].min(), colleges['SAT_AVG'].max(), 100)
ax.plot(x_line, intercept + slope * x_line, 'r-', lw=2, alpha=0.7)

ax.set_xlabel('Average SAT Score')
ax.set_ylabel('Median Earnings (10 years after enrollment)')
ax.set_title(f'SAT Score vs. Earnings (r = {r:.2f})')
ax.legend()
plt.show()

print(f"Correlation: r = {r:.2f}")
print(f"Each 100-point SAT increase is associated with ${slope*100:,.0f} higher median earnings")

Correlation: r = 0.65
Each 100-point SAT increase is associated with $5,148 higher median earnings

Does SAT Score Cause Higher Earnings?

The correlation is striking: \(r \approx 0.65\). Schools with higher SAT averages have higher-earning graduates. A naive interpretation: “raising SAT scores would raise earnings.”

But wait. What if something else drives both SAT scores and earnings?

TipThink About It

What factors might affect both a student’s SAT score and their future earnings?

Confounders: The Hidden Third Variable

ImportantDefinition: Confounder

A confounder is a common cause of both the treatment and the outcome — it must cause both, not just be correlated with both.

Here, family wealth is a classic confounder:

  • Wealthy families invest in SAT prep, better schools, tutors \(\rightarrow\) higher SAT scores
  • Wealthy families provide professional networks, connections, inheritance \(\rightarrow\) higher earnings

If family wealth drives both, then the SAT–earnings correlation could be entirely spurious — a spurious association that exists only from a shared cause. The SAT doesn’t cause anything; it’s just a proxy for wealth. (Recall from Chapter 11: “correlation is not causation.” Now let’s formalize why.)

Causal Diagrams (DAGs)

ImportantDefinition: DAG (Directed Acyclic Graph)

A Directed Acyclic Graph (DAG) is a visual tool for representing causal relationships. “Directed” means the arrows have direction (A causes B, not the reverse). “Acyclic” means there are no loops — you can never follow the arrows in a circle back to where you started. Each arrow means “causes” or “directly influences.” Another formal framework, the potential outcomes model (also called the Rubin causal model), defines causal effects algebraically rather than graphically.

NotePreview: the potential outcomes framework

DAGs represent causal structure visually. The potential outcomes model defines causal effects algebraically. For each unit, imagine two potential outcomes: \(Y(1)\) under treatment and \(Y(0)\) under control. The causal effect is \(Y(1) - Y(0)\), but we only observe one of these for any individual (the fundamental problem of causal inference). Randomization ensures that the average causal effect can be estimated, even though individual effects cannot.

There are two paths from SAT to Earnings:

  1. Causal path (directed path): SAT \(\to\) Admission \(\to\) Earnings (maybe real — all arrows point forward from treatment to outcome)
  2. Backdoor path: SAT \(\leftarrow\) Wealth \(\to\) Earnings (spurious — goes “behind” the treatment through a common cause)

The standard causal inference term for path 2 is a backdoor path: it connects treatment to outcome by going through a common cause rather than through a direct causal chain. A naive regression mixes both paths together. We can’t tell which is which without more information. Let’s draw it.

Code
def draw_dag(nodes, edges, annotations=None, title='', figsize=(8, 6)):
    """Draw a simple DAG. nodes: dict {name: (x,y)}, edges: list [(from, to)],
    annotations: list [(text, x, y, color)]."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_xlim(0, 10); ax.set_ylim(0, 8); ax.axis('off')
    for name, (x, y) in nodes.items():
        fc = 'lightyellow' if '(unobserved)' in name else 'lightblue'
        ls = '--' if '(unobserved)' in name else '-'
        ax.add_patch(plt.Rectangle((x-1.2, y-0.4), 2.4, 0.8,
            facecolor=fc, edgecolor='navy', lw=2, ls=ls, zorder=3))
        ax.text(x, y, name, ha='center', va='center', fontsize=11, fontweight='bold', zorder=4)
    for (sn, dn) in edges:
        sx, sy = nodes[sn]; dx, dy = nodes[dn]
        ax.annotate('', xy=(dx, dy), xytext=(sx, sy),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='navy'))
    for text, x, y, color in (annotations or []):
        ax.annotate(text, xy=(x, y), fontsize=10, color=color, fontstyle='italic', ha='center')
    if title:
        ax.set_title(title, fontsize=14, fontweight='bold')
    plt.tight_layout(); plt.show()

The draw_dag() helper above draws nodes as rectangles and connects them with arrows. Let’s use it to visualize the SAT–earnings DAG.

Code
draw_dag(
    nodes={'Family Wealth': (5, 7), 'SAT Score': (2, 4),
           'Admission': (5, 2.5), 'Earnings': (8, 4)},
    edges=[('Family Wealth', 'SAT Score'), ('Family Wealth', 'Earnings'),
           ('SAT Score', 'Admission'), ('Admission', 'Earnings')],
    annotations=[('backdoor\npath', 3.2, 5.8, 'red'), ('causal\npath?', 4.8, 3.2, 'green')],
    title='Causal Diagram (DAG): SAT Scores and Earnings'
)

When Does Regression Estimate a Causal Effect?

In Chapter 12, we interpreted regression coefficients as “holding everything else constant.” Now we can see when that interpretation is causal and when it’s just descriptive.

Regression estimates a causal effect only when you block all backdoor paths between treatment and outcome — by controlling for at least one non-collider variable on each backdoor path — without blocking any causal paths.

In practice, this is nearly impossible with observational data (data collected without an experiment):

  • We can’t measure “family wealth” perfectly
  • There may be confounders we haven’t even thought of (motivation, neighborhood, genetics…)
  • Controlling for the wrong variable can actually create bias (we’ll see this with colliders soon)

The gap between correlation and causation motivates the distinction between observational data (collected without intervention) and experimental data (from a randomized experiment). In a randomized experiment, the treatment is assigned randomly, so there are no backdoor paths to worry about.

WarningThe Fundamental Problem

With observational data, you can never be sure you’ve blocked every backdoor path.

Simpson’s Paradox: A Confounding Problem

In Chapter 11, you saw Simpson’s Paradox in NBA rest data: extended-rest players scored fewer points overall, but more points within each position group. The aggregate lied. We first saw hospital readmission data in Chapter 2, where the “worst” hospitals turned out to be small rural ones with noisy estimates. Now let’s see another way this data can mislead — and explain why using DAGs.

The question: are high-volume hospitals better or worse at preventing readmissions?

Code
hospitals = pd.read_csv(f'{DATA_DIR}/hospital-readmissions/hospital_level_summary.csv')
hospitals['star_rating_num'] = pd.to_numeric(hospitals['star_rating_num'], errors='coerce')

hospitals['is_high_volume'] = hospitals['total_discharges'] > hospitals['total_discharges'].median()

# Overall comparison
print("Overall readmission rates:")
print("-" * 50)
for group_name, group in hospitals.groupby('is_high_volume'):
    label = 'High-volume' if group_name else 'Low-volume'
    rate = group['overall_readmission_rate'].mean()
    n = len(group)
    print(f"  {label}: {rate:.2f}%  (n={n})")

print("\nHigh-volume hospitals appear BETTER overall.")
Overall readmission rates:
--------------------------------------------------
  Low-volume: 18.25%  (n=1529)
  High-volume: 16.90%  (n=1526)

High-volume hospitals appear BETTER overall.

We use pd.qcut() to split hospitals into three severity groups of roughly equal size, then compare readmission rates within each group.

Code
# Create severity categories for within-group comparison
hospitals['severity_cat'] = pd.qcut(hospitals['severity_proxy'], q=3,
                                     labels=['Low severity', 'Medium severity', 'High severity'])

But when we break the data down by patient severity, the story reverses:

Code
fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)

# Left: overall
overall = hospitals.groupby('is_high_volume')['overall_readmission_rate'].mean()
colors_bar = ['#4878CF', '#E24A33']
axes[0].bar(['Low-volume', 'High-volume'],
            overall.values, color=colors_bar, edgecolor='white', width=0.5)
axes[0].set_ylabel('Mean Readmission Rate (%)')
axes[0].set_title('Overall: High-Volume Looks BETTER')

# Right: within severity categories
severity_cats = ['Low severity', 'Medium severity', 'High severity']
x = np.arange(len(severity_cats))
width = 0.35

rates_low = []
rates_high = []
for sev in severity_cats:
    subset = hospitals[hospitals['severity_cat'] == sev]
    rates_low.append(subset[~subset['is_high_volume']]['overall_readmission_rate'].mean())
    rates_high.append(subset[subset['is_high_volume']]['overall_readmission_rate'].mean())

axes[1].bar(x - width/2, rates_low, width, label='Low-volume', color='#4878CF', edgecolor='white')
axes[1].bar(x + width/2, rates_high, width, label='High-volume', color='#E24A33', edgecolor='white')
axes[1].set_xticks(x)
axes[1].set_xticklabels(severity_cats)
axes[1].set_ylabel('Mean Readmission Rate (%)')
axes[1].set_title('Within Severity: High-Volume is WORSE')
axes[1].legend()

plt.suptitle("Simpson's Paradox in Hospital Readmissions", fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Why Does This Happen?

The DAG tells the story:

  Patient Severity ──→ Hospital Choice ──→ Readmission Rate
         │                                       ↑
         └───────────────────────────────────────┘

Patient severity is a confounder: it causes both Hospital Choice (sicker patients tend to go to smaller, specialized hospitals) and Readmission Rate (sicker patients get readmitted more). High-volume hospitals see a less severe case mix, so their aggregate readmission rate looks lower. But within each severity level — blocking the backdoor path — high-volume hospitals actually perform worse.

Simpson’s Paradox is a confounding problem. The aggregate data tells the opposite story from the disaggregated data: high-volume hospitals look better overall but worse within each severity group. Which one is “right”? The one that blocks the backdoor path by controlling for the confounder.

Simpson’s Paradox in Policing: NYPD Stop-and-Frisk

Simpson’s Paradox doesn’t just show up in hospitals. It appears whenever a confounding variable has different distributions across groups — and the policy consequences can be enormous.

From 2008 to 2012, the NYPD conducted roughly 760,000 stops under its Concealed Weapons Possession (CPW) program. For each stop, officers recorded whether a weapon was found — the hit rate. A higher hit rate means officers are stopping people on stronger evidence; a lower hit rate suggests weaker justification for the stop.

Code
# NYPD stop-and-frisk hit rates (Goel, Rao, Shroff 2016; Grossman 2023 MSE 125 Lecture 1)
print("NYPD Stop-and-Frisk Hit Rates (2008-2012)")
print("=" * 55)
print(f"  City-wide hit rate, Black individuals:  2.5%")
print(f"  City-wide hit rate, White individuals:  13%")
print(f"  City-wide ratio: 13 / 2.5 = {13/2.5:.1f}x")
print(f"\n  Within-precinct ratio (controlling for location): 2.2x")
print(f"\n  The ratio drops from {13/2.5:.1f}x to 2.2x after controlling for precinct.")
NYPD Stop-and-Frisk Hit Rates (2008-2012)
=======================================================
  City-wide hit rate, Black individuals:  2.5%
  City-wide hit rate, White individuals:  13%
  City-wide ratio: 13 / 2.5 = 5.2x

  Within-precinct ratio (controlling for location): 2.2x

  The ratio drops from 5.2x to 2.2x after controlling for precinct.

The city-wide numbers paint a stark picture: the hit rate for white individuals is 5.2x higher than for Black individuals, suggesting officers stop Black individuals on far weaker evidence. But the city-wide comparison is confounded by precinct. Different precincts have different stop rates and different racial compositions.

The DAG:

  Precinct ──→ Race of person stopped
      │
      └──────→ Hit rate

Precinct is a confounder: it influences both who gets stopped (racial composition varies by neighborhood) and the hit rate (policing intensity and weapon prevalence vary by location). When we compare hit rates within precincts — blocking the backdoor path — the ratio drops from 5.1x to 2.2x.

The within-precinct ratio is still evidence of disparate treatment: officers appear to stop Black individuals on weaker evidence even within the same precinct. But the magnitude of the disparity changes dramatically once we control for location. Getting the confounding structure right is essential for accurate policy analysis.

After the Stanford Open Policing Project published similar within-group analyses for the LAPD, the department changed its Metro Division operations (Los Angeles Times, October 2019). The statistical question — aggregate or disaggregated? — had real consequences for how a major city policed its residents.

John Snow and the Birth of Epidemiology

NoteHistorical Context: John Snow’s Cholera Investigation

London, 1854. Cholera is killing thousands. The prevailing theory: “bad air” (miasma) causes disease. A physician named John Snow had a different idea: contaminated water.

Let’s look at his data.

Code
deaths = pd.read_csv(f'{DATA_DIR}/john-snow/snow_deaths.csv')
pumps = pd.read_csv(f'{DATA_DIR}/john-snow/snow_pumps.csv')
water = pd.read_csv(f'{DATA_DIR}/john-snow/snow_water_companies.csv')

print("Each row is a cholera death with x,y coordinates on Snow's map")
print("\nDeath locations (first 5):")
print(deaths.head())
print(f"\nTotal deaths recorded: {len(deaths)}")
print(f"\nWater pumps:")
print(pumps[['pump', 'label']].to_string(index=False))
Each row is a cholera death with x,y coordinates on Snow's map

Death locations (first 5):
   rownames  case          x          y
0         1     1  13.588010  11.095600
1         2     2   9.878124  12.559180
2         3     3  14.653980  10.180440
3         4     4  15.220570   9.993003
4         5     5  13.162650  12.963190

Total deaths recorded: 578

Water pumps:
 pump          label
    1  Oxford Market
    2    Castle St E
    3   Oxford St #1
    4   Oxford St #2
    5 Gt Marlborough
    6   Crown Chapel
    7       Broad St
    8        Warwick
    9     Briddle St
   10        So Soho
   11        Dean St
   12    Coventry St
   13        Vigo St

Snow’s famous map plots each death as a point and each pump as a triangle. Let’s recreate it.

Code
fig, ax = plt.subplots(figsize=(10, 8))

# Plot deaths
ax.scatter(deaths['x'], deaths['y'], c='red', s=8, alpha=0.6, label='Deaths')

# Plot pumps
ax.scatter(pumps['x'], pumps['y'], c='blue', s=100, marker='^', zorder=5, label='Water pumps')
for _, pump in pumps.iterrows():
    ax.annotate(pump['label'], xy=(pump['x'], pump['y']),
                xytext=(5, 5), textcoords='offset points', fontsize=7, color='blue')

ax.set_xlabel('x coordinate')
ax.set_ylabel('y coordinate')
ax.set_title("Snow's Cholera Map: Deaths Cluster Near Broad Street Pump")
ax.legend()
plt.tight_layout()
plt.show()

print("Deaths cluster around the Broad Street pump -- the famous map.")
print("But is this CAUSAL evidence? Or just a spatial correlation?")

Deaths cluster around the Broad Street pump -- the famous map.
But is this CAUSAL evidence? Or just a spatial correlation?

Snow’s Real Evidence: The Natural Experiment

The famous cholera map is compelling, but it’s just a correlation — deaths cluster near a pump. Maybe those neighborhoods were just poorer or more crowded.

Snow’s true breakthrough was something different. Two water companies served overlapping neighborhoods in South London:

  • Southwark & Vauxhall: drew water from the Thames downstream of sewage outflows
  • Lambeth: had recently moved its intake upstream of the sewage

Crucially, houses served by different companies were intermingled on the same streets. Same neighborhoods, same air, same poverty — but different water.

Code
print("Deaths by Water Company (1854 cholera outbreak):")
print("=" * 55)
print(water.to_string(index=False))

sv_rate = water[water['company'] == 'Southwark & Vauxhall']['death_rate_per_10000'].values[0]
l_rate = water[water['company'] == 'Lambeth']['death_rate_per_10000'].values[0]

print(f"\nSouthwark & Vauxhall death rate: {sv_rate} per 10,000")
print(f"Lambeth death rate: {l_rate} per 10,000")
print(f"Ratio: {sv_rate / l_rate:.1f}x higher death rate for contaminated water")
Deaths by Water Company (1854 cholera outbreak):
=======================================================
             company  houses  deaths  death_rate_per_10000
Southwark & Vauxhall   40046    1263                   315
             Lambeth   26107      98                    37
      Rest of London  256423    1422                    59

Southwark & Vauxhall death rate: 315 per 10,000
Lambeth death rate: 37 per 10,000
Ratio: 8.5x higher death rate for contaminated water
Code
fig, ax = plt.subplots(figsize=(8, 5))

companies = water['company'].values
rates = water['death_rate_per_10000'].values
colors_wc = ['#d62728', '#2ca02c', '#7f7f7f']

bars = ax.bar(companies, rates, color=colors_wc, edgecolor='white', width=0.6)
ax.set_ylabel('Deaths per 10,000 houses')
ax.set_title('Cholera Death Rates by Water Company (1854)')

# Annotate bars
for bar, rate in zip(bars, rates):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
            f'{rate}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Ratio annotation
ax.annotate(f'{sv_rate/l_rate:.1f}x higher\ndeath rate',
            xy=(0, sv_rate), xytext=(0.5, sv_rate - 50),
            fontsize=12, color='red', fontweight='bold',
            arrowprops=dict(arrowstyle='->', color='red'))

plt.tight_layout()
plt.show()

Why This Is a Natural Experiment

ImportantDefinition: Natural Experiment

A natural experiment occurs when some external circumstance assigns treatment in a way that is as-if random — creating comparable groups without deliberate randomization.

Here, the “treatment” is contaminated water. The key features:

Feature Status
Treatment assigned randomly? No — but as-if random within neighborhoods
Groups comparable? Yes — same streets, same poverty, same “bad air”
Confounders controlled? Yes — the only difference was the water source
Temporal precedent? Yes — Lambeth moved its intake before the outbreak

The intermingling of houses is what makes Snow’s evidence causal, not just correlational. He couldn’t run an experiment (you can’t randomly give people contaminated water!), but nature effectively ran one for him.

The DAG:

Water Source ──→ Water Quality ──→ Cholera Death

No confounding path! The neighborhoods are the same.

The Map Is Not the Experiment

Here’s what most people get wrong about John Snow. The famous cholera map — deaths clustered around the Broad Street pump — is the image everyone remembers. It’s in every textbook, every documentary, every data science course.

But the map is not the causal evidence.

The map shows a spatial correlation: deaths near the pump. But maybe people near that pump were also poorer, more crowded, less healthy. The map can’t rule that out.

Snow’s actual causal argument came from the water company comparison — the natural experiment. Same neighborhoods, different water, 8.5x difference in death rates. That’s the evidence that contaminated water causes cholera.

The map is memorable. The natural experiment is convincing. They’re not the same thing.

Code
# Let's quantify the map's limitation
from scipy.spatial.distance import cdist  # computes distances between sets of points

death_coords = deaths[['x', 'y']].values
pump_coords = pumps[['x', 'y']].values

distances = cdist(death_coords, pump_coords)
nearest_pump_idx = distances.argmin(axis=1)

deaths_with_pumps = deaths.copy()
deaths_with_pumps['nearest_pump'] = pumps.iloc[nearest_pump_idx]['label'].values
deaths_with_pumps['distance'] = distances.min(axis=1)

# Deaths by nearest pump
pump_deaths = deaths_with_pumps.groupby('nearest_pump').size().sort_values(ascending=False)
print("Deaths by nearest pump:")
print(pump_deaths.head(8).to_string())
print(f"\nBroad Street pump accounts for {pump_deaths.get('Broad St', 0)} of {len(deaths)} deaths")
print("\nBut proximity is not causation. Snow knew he needed stronger evidence.")
Deaths by nearest pump:
nearest_pump
Broad St          359
So Soho            64
Crown Chapel       61
Briddle St         27
Oxford St #2       24
Warwick            16
Oxford St #1       12
Gt Marlborough      6

Broad Street pump accounts for 359 of 578 deaths

But proximity is not causation. Snow knew he needed stronger evidence.

A Framework: Levels of Causal Evidence

Snow’s investigation illustrates a hierarchy of evidence that we’ll use throughout causal inference. Not all evidence is created equal:

Level Method Example Strength
1 Randomized experiment Clinical trial (ACTG 175 from Chapter 10) Gold standard
2 Natural experiment Snow’s water companies Very strong
3 Regression with confounders SAT \(\to\) earnings controlling for wealth Depends on what you control for
4 Simple correlation SAT and earnings (\(r = 0.65\)) Weakest — could be confounded

We’ll see more of levels 1 and 2 in Chapter 19.

Can We Fix Regression? Controlling for Confounders

If we can’t run an experiment, can we at least control for confounders in a regression?

Let’s try. In the college data, CONTROL indicates public (1), private nonprofit (2), or private for-profit (3). This variable is a rough proxy for resources and wealth. What happens when we add it?

Code
from sklearn.linear_model import LinearRegression

# Simple regression: SAT -> earnings
X_simple = colleges[['SAT_AVG']].values
y = colleges['MD_EARN_WNE_P10'].values

model_simple = LinearRegression().fit(X_simple, y)
print("Simple regression: SAT -> Earnings")
print(f"  Coefficient on SAT: ${model_simple.coef_[0]:.0f} per point")
print(f"  R-squared = {model_simple.score(X_simple, y):.3f}")

# Multiple regression: SAT + school type -> earnings
colleges['is_private_nonprofit'] = (colleges['CONTROL'] == 2).astype(int)
colleges['is_for_profit'] = (colleges['CONTROL'] == 3).astype(int)
X_multi = colleges[['SAT_AVG', 'is_private_nonprofit', 'is_for_profit']].values

model_multi = LinearRegression().fit(X_multi, y)
print(f"\nMultiple regression: SAT + school type -> Earnings")
print(f"  Coefficient on SAT: ${model_multi.coef_[0]:.0f} per point")
print(f"  Coefficient on private nonprofit: ${model_multi.coef_[1]:,.0f}")
print(f"  R-squared = {model_multi.score(X_multi, y):.3f}")

print(f"\nThe SAT coefficient barely changed (${model_simple.coef_[0]:.0f} → ${model_multi.coef_[0]:.0f}) when we controlled for school type.")
print("School type doesn't confound the SAT-earnings relationship much in this data.")
print("Not every suspected confounder actually confounds — but we still can't rule out other unmeasured confounders like family wealth.")
Simple regression: SAT -> Earnings
  Coefficient on SAT: $51 per point
  R-squared = 0.417

Multiple regression: SAT + school type -> Earnings
  Coefficient on SAT: $52 per point
  Coefficient on private nonprofit: $31
  R-squared = 0.420

The SAT coefficient barely changed ($51 → $52) when we controlled for school type.
School type doesn't confound the SAT-earnings relationship much in this data.
Not every suspected confounder actually confounds — but we still can't rule out other unmeasured confounders like family wealth.

The Limits of “Controlling For”

Adding control variables to a regression can help — but only if:

  1. You include all confounders. Miss one, and the estimate is biased.
  2. You measure them well. A rough proxy for “family wealth” isn’t the same as actual wealth.
  3. You don’t control for the wrong things. Controlling for a mediator (something on the causal path) or a collider (something caused by both treatment and outcome) can make bias worse.

The difficulty of controlling for all confounders is the core challenge of observational studies. You’re always one unmeasured confounder away from a wrong answer.

TipThink About It

What would you need to measure to estimate the causal effect of SAT scores on earnings?

A Trap: Collider Bias

TipThink About It

We said you should control for confounders. Should you always add more control variables to a regression? What could go wrong?

Here’s a puzzle. Imagine you’re at a prestigious consulting firm. You notice that the most talented analysts seem to have the weakest professional networks, and the best-connected people tend to be mediocre analysts. Does getting hired destroy your social skills? No — this is a statistical illusion called collider bias.

ImportantDefinition: Collider

A variable is a collider on a particular path if both arrows on that path point into it. (Note: being a collider is a property of a variable on a specific path, not a property of the variable itself — the same variable might be a collider on one path but not another.)

  Talent ──→ Hired at Top Firm ←── Connections

Among all people, talent and connections may be unrelated. But among people hired at top firms (conditioning on the collider), they’re negatively correlated — you got in either through talent OR connections.

The resulting artifact is called collider bias or selection bias, and it’s a common trap in regression analysis.

Code
# Simulate collider bias
n = 5000
talent = np.random.normal(0, 1, n)
connections = np.random.normal(0, 1, n)

# Hiring depends on both (collider)
hire_score = talent + connections + np.random.normal(0, 0.5, n)
hired = hire_score > np.percentile(hire_score, 70)
print(f"Hired {hired.sum()} of {n} people (top 30% of hire score)")
Hired 1500 of 5000 people (top 30% of hire score)
Code
fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Left: full population
axes[0].scatter(talent, connections, alpha=0.1, s=5, color='gray')
r_all = np.corrcoef(talent, connections)[0, 1]
axes[0].set_xlabel('Talent'); axes[0].set_ylabel('Connections')
axes[0].set_title(f'Full Population: r = {r_all:.3f} (no relationship)')

# Right: among hired only — negative correlation!
axes[1].scatter(talent[hired], connections[hired], alpha=0.3, s=10, color='steelblue')
r_hired = np.corrcoef(talent[hired], connections[hired])[0, 1]
axes[1].set_xlabel('Talent'); axes[1].set_ylabel('Connections')
axes[1].set_title(f'Among Hired Only: r = {r_hired:.3f} (collider bias!)')

plt.suptitle('Collider Bias: Controlling for "Hired" Creates a Spurious Correlation',
             fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print(f"In the full population, talent and connections are unrelated (r = {r_all:.3f}).")
print(f"Among hired people, they're negatively correlated (r = {r_hired:.3f}).")
print("Controlling for a collider creates bias that wasn't there before!")
print("(The negative correlation holds regardless of the hiring threshold — try changing 70 to 50 or 90.)")

In the full population, talent and connections are unrelated (r = -0.002).
Among hired people, they're negatively correlated (r = -0.495).
Controlling for a collider creates bias that wasn't there before!
(The negative correlation holds regardless of the hiring threshold — try changing 70 to 50 or 90.)
WarningCollider Bias

Conditioning on a common effect of two variables creates a fake relationship between them. That’s collider bias. Always check your DAG before adding control variables to a regression.

ImportantConfounder vs. Collider

A confounder is a common cause of treatment and outcome — controlling for it removes bias. A collider is a common effect of treatment and outcome — controlling for it creates bias. The two require opposite strategies: always control for confounders, never control for colliders.

DAG Rules: What to Control For

Before we summarize the rules, let’s work through one more example. Consider this DAG for the effect of Education on Health:

  Socioeconomic Background ──→ Education ──→ Income ──→ Health
              │                                           ↑
              └──────────────────────────────────────────┘

What’s the confounder? Socioeconomic Background — it’s a common cause of Education and Health. Should we control for it? Yes — it’s on a backdoor path. What’s the mediator? Income — it’s on the causal path from Education to Health. Should we control for Income? No — it would block the very effect we’re trying to measure.

ImportantDefinition: Mediator

A mediator is a variable on the causal path between treatment and outcome (X \(\rightarrow\) Z \(\rightarrow\) Y). Controlling for a mediator removes part of the causal effect you are trying to measure.

Here are the key rules for reading causal diagrams:

Structure Name Example Control for it?
X \(\leftarrow\) Z \(\rightarrow\) Y Confounder Wealth \(\to\) SAT, Wealth \(\to\) Earnings Yes! Blocks the backdoor path
X \(\rightarrow\) Z \(\rightarrow\) Y Mediator SAT \(\to\) Admission \(\to\) Earnings No! Blocks the causal path
X \(\rightarrow\) Z \(\leftarrow\) Y Collider Talent \(\to\) Hired \(\leftarrow\) Connections No! Opens a spurious path
ImportantDefinition: Back-Door Criterion

The back-door criterion says: to estimate the causal effect of X on Y, you must block all backdoor paths (paths into X through common causes) without blocking any causal paths (paths from X to Y) or opening any collider paths. An additional requirement: you cannot control for any descendant of the treatment (a variable caused by X), since doing so partly blocks the causal effect or introduces bias through collider structures downstream of X.

These rules follow from whether “controlling for” a variable opens or closes the path between treatment and outcome. Getting this wrong — controlling for the wrong variable — is one of the most common mistakes in applied statistics.

TipThink About It

In the DAG below, identify the confounder, mediator, and collider. Which should you control for?

X \(\to\) M \(\to\) Y,   X \(\leftarrow\) C \(\to\) Y,   X \(\to\) Z \(\leftarrow\) Y

Key Takeaways

“Correlation is not causation.” Now you know why — and you have the tools (DAGs) to tell the difference.

  • Confounders create spurious associations. A confounder is a common cause of treatment and outcome, creating a backdoor path that makes correlation look like causation.

  • Causal diagrams (DAGs) make the reasoning visual. Draw the arrows, trace the paths, and you’ll see whether regression can identify a causal effect.

  • Simpson’s Paradox is a confounding problem: aggregate data can show the opposite of disaggregated data. The DAG tells you which is right.

  • Natural experiments (like Snow’s water companies — not his famous map!) provide causal evidence without randomization, by exploiting situations where treatment is as-if random.

  • Block all backdoor paths, but don’t block causal paths. Control for confounders (good), but never control for colliders (opens spurious paths) or mediators (removes the effect you’re trying to measure).

Next time: difference-in-differences, A/B tests, and counterfactual reasoning — the tools for designing studies that establish causation.

Study guide

Key ideas

  • Causal inference: determining when a statistical association reflects a cause-and-effect relationship
  • Confounder: a common cause of treatment and outcome that creates a spurious association (a backdoor path)
  • DAG (Directed Acyclic Graph): a diagram of causal relationships; “directed” = arrows have direction, “acyclic” = no loops
  • Backdoor path: a non-causal path connecting treatment and outcome through a common cause
  • Causal path (directed path): a path where all arrows point forward from treatment to outcome
  • Back-door criterion: to estimate a causal effect, block all backdoor paths without blocking causal paths or opening collider paths
  • Collider: a variable on a path where both arrows point into it; conditioning on a collider opens a spurious path
  • Mediator: a variable on the causal path between treatment and outcome
  • Spurious association: a correlation arising entirely from a common cause, not from a causal relationship
  • Natural experiment: a situation where an external circumstance assigns treatment as-if randomly
  • Observational data vs. experimental data: data collected without intervention vs. data from a randomized experiment
  • Simpson’s Paradox: when a trend reverses after disaggregating by a confounding variable
  • Confounders create backdoor paths that make correlation look like causation; DAGs let you see and block them.
  • Regression estimates a causal effect only when all backdoor paths are blocked without blocking causal paths.
  • Controlling for a collider creates bias; controlling for a mediator removes the effect you want to measure.
  • Natural experiments exploit as-if-random assignment to provide causal evidence without a randomized trial.
  • Simpson’s Paradox is a confounding problem that DAGs explain: the aggregate hides a backdoor path.

Computational tools

  • No new computational tools are critical for this lecture — the key skills are conceptual (drawing and reading DAGs, identifying confounders/colliders/mediators)
  • pd.qcut() for creating severity categories
  • scipy.spatial.distance.cdist() for computing distances between points
  • The draw_dag() helper function pattern for visualizing causal diagrams

For the quiz

You should be able to:

  • Given a scenario, identify confounders, mediators, and colliders
  • Given a DAG, determine which variables to control for and which to leave alone
  • Explain why controlling for a collider creates bias (with an example)
  • Distinguish observational evidence (correlation) from experimental evidence (causation)
  • Explain Simpson’s Paradox using a DAG