---
title: "Validation beyond the random split"
execute:
enabled: true
jupyter: python3
---
You build a model to predict tomorrow's air quality. It scores R² = 0.66 on your test set. You deploy it. It fails to beat a model that predicts "same as yesterday." A hospital trains a sepsis-alert model. It validates at AUC 0.76. Clinicians start acting on the alerts; within a year, the model's performance quietly erodes. A predictive-policing algorithm flags a neighborhood as high-crime. Police patrol more heavily. Arrests go up. The model "confirms" its own prediction.
The validation machinery of [Chapter 6](lec06-validation.qmd) — random train/test splits, cross-validation — was built for a specific world: rows are exchangeable, the deployment distribution matches the training distribution, and the model's predictions don't affect the outcome. That world is real, and when it holds, random-split validation is trustworthy. This chapter is about the cases where it doesn't. There are four failure modes arranged from mildest to deepest:
1. **Temporal leakage.** Data has time structure. A random split puts neighbors of the test point in the training set. *Fix: validate with a temporal split (backtesting).*
2. **Distribution shift.** Deployment-time data comes from a regime the training set never saw. *Fix: test on out-of-distribution datasets, stratify by subgroup, widen intervals for extreme inputs.*
3. **Feedback loops.** The model's predictions change the outcome. *No passive validation procedure, random or temporal, catches this — you need to design the deployment itself (randomized holdouts, A/B tests).*
4. **Goodhart's law.** The metric you optimized becomes a target and stops measuring the goal. *Again a structural problem, not a split problem.*
Backtesting isn't another thing that fails; for time-series forecasting, it is established and basically trustworthy methodology, with caveats for sudden regime shifts. Subgroup validation and distribution-shift testing likewise work when you do them. The thesis isn't "validation is broken." The thesis is: **random-split validation is one tool, and you need the right tool for each failure mode.**
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12
DATA_DIR = 'data'
```
# Part 1: Temporal leakage
When your validation metric looks great but deployment fails, the first place to check is the *split*. If your data has temporal structure — anything that evolves over time, from stock prices to air quality to NBA performance — the textbook random train/test split is silently broken. Observations sit close together in time, and those neighbors leak information across the split.
## Air quality in California
The EPA publishes daily Air Quality Index (AQI) measurements for every U.S. county. AQI measures how polluted the air is — 0 is pristine, 50 is "good," and anything above 150 is "unhealthy." In California, wildfire smoke can push AQI into the thousands.
Let's load four years of data and focus on a few counties with good coverage.
```{python}
# Load all four years of EPA data, filtered to California
dfs = []
for year in range(2021, 2025):
df_year = pd.read_csv(f'{DATA_DIR}/epa-air-quality/daily_aqi_by_county_{year}.csv')
df_year = df_year[df_year['State Name'] == 'California']
dfs.append(df_year)
aqi = pd.concat(dfs, ignore_index=True)
aqi['Date'] = pd.to_datetime(aqi['Date'])
aqi = aqi.sort_values(['county Name', 'Date']).reset_index(drop=True)
print(f"California records: {len(aqi):,}")
print(f"Counties: {aqi['county Name'].nunique()}")
print(f"Date range: {aqi['Date'].min().date()} to {aqi['Date'].max().date()}")
```
We'll focus on three counties with good daily coverage and distinct air quality patterns: Los Angeles (urban smog), Sacramento (valley pollution), and Mono (extreme dust events).
```{python}
# Pick 3 counties with good daily coverage and interesting patterns
counties = ['Los Angeles', 'Sacramento', 'Mono']
aqi_sub = aqi[aqi['county Name'].isin(counties)].copy()
# Check coverage
for county in counties:
n = aqi_sub[aqi_sub['county Name'] == county].shape[0]
print(f"{county}: {n} daily observations")
```
### What does air quality look like over time?
Before modeling anything, let's look at the data. Notice the **seasonal pattern** — summer and fall have worse air quality (wildfire season). And notice the **spikes**: these are smoke events that dwarf normal variation.
```{python}
fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)
for ax, county in zip(axes, counties):
county_data = aqi_sub[aqi_sub['county Name'] == county]
ax.plot(county_data['Date'], county_data['AQI'], linewidth=0.5, alpha=0.8)
ax.set_ylabel('AQI')
ax.set_title(f'{county} County')
ax.axhline(y=150, color='red', linestyle='--', alpha=0.5, label='Unhealthy threshold')
ax.legend(loc='upper right')
axes[-1].set_xlabel('Date')
fig.suptitle('Daily AQI in California Counties (2021–2024)', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()
```
Mono County is wild — AQI values above 1,000 (and once above 8,000!) from extreme dust storms near Mono Lake. Los Angeles and Sacramento show the typical California pattern: seasonal wildfire spikes layered on top of baseline urban pollution.
:::{.callout-important}
## Definition: Non-Stationarity
A time series is **non-stationary** when its statistical properties (mean, variance) change over time. The AQI data above is non-stationary: the mean and variance shift with the seasons, and extreme events (wildfire spikes) appear unpredictably. The rest of this chapter explores why non-stationarity undermines standard evaluation methods.
:::
## Building a prediction model
Let's try to predict tomorrow's AQI. A natural approach: use **lag features** — yesterday's AQI, the 7-day average, etc. This approach is just the feature engineering from Chapter 5, applied to time.
We'll focus on Los Angeles for a clean demonstration.
```{python}
# Work with Los Angeles
la = aqi_sub[aqi_sub['county Name'] == 'Los Angeles'][['Date', 'AQI']].copy()
la = la.sort_values('Date').reset_index(drop=True)
# Create lag features
la['aqi_lag1'] = la['AQI'].shift(1) # yesterday
la['aqi_lag7_avg'] = la['AQI'].rolling(7).mean().shift(1) # 7-day average (as of yesterday)
la['aqi_lag3_avg'] = la['AQI'].rolling(3).mean().shift(1) # 3-day average
la['day_of_year'] = la['Date'].dt.dayofyear # seasonality feature
# Drop rows with NaN from lagging
la = la.dropna().reset_index(drop=True)
print(f"LA dataset: {len(la)} days with features")
la.head(10)
```
## The standard split — and why it fails here
Recall from Chapter 6 that we split data randomly into training and test sets. That approach assumed each observation was independent of the others. Time-ordered observations violate that assumption.
Here's what most tutorials (and most AI tools) would do — randomly shuffle the data and split 80/20. Let's see how well it works.
```{python}
# Random 80/20 split — the wrong way for time series
np.random.seed(42)
features = ['aqi_lag1', 'aqi_lag7_avg', 'aqi_lag3_avg', 'day_of_year']
mask = np.random.rand(len(la)) < 0.8
train_std = la[mask]
test_std = la[~mask]
model_std = LinearRegression()
model_std.fit(train_std[features], train_std['AQI'])
pred_std = model_std.predict(test_std[features])
r2_std = r2_score(test_std['AQI'], pred_std)
mae_std = mean_absolute_error(test_std['AQI'], pred_std)
print(f"Random split results:")
print(f" R-squared = {r2_std:.3f}")
print(f" MAE = {mae_std:.1f}")
print(f" Train size: {len(train_std)}, Test size: {len(test_std)}")
print()
print("Looks great! Ship it?")
```
R² about 0.66 — not bad for a first attempt. But a hidden problem lurks in the evaluation itself.
:::{.callout-tip}
## Think About It
Your team built an AQI forecasting model with R²=0.95 using a random train/test split. Before deploying it for public health alerts, would you trust that number? With a random split, the test set includes days *from the middle* of the time series — a test-set day from June 2022 has its neighbors (June 1, June 3, ...) in the training set, and the lag features are computed from data *right next to* training points. What split strategy would you insist on, and why?
:::
:::{.callout-important}
## Definition: Data Leakage (Temporal)
**Data leakage** occurs when information from the future (or from the test set) contaminates training, producing optimistically biased results. In time series, a random split causes leakage: test observations are highly correlated with nearby training observations. The lag feature "yesterday's AQI" for a test-set day was likely in the training set.
:::
## Proper backtesting: temporal split
The fix is simple: **train on the past, test on the future**. This protocol mimics how the model would actually be used — you'd train it on historical data and deploy it to predict tomorrow. The diagram below shows the difference: random splitting scatters test points throughout the timeline, while a temporal split draws a clean boundary.
```{python}
# Visualize random vs temporal split
fig, axes = plt.subplots(2, 1, figsize=(10, 3), sharex=True)
dates = la['Date'].values
n = len(dates)
# Random split
colors_random = np.where(mask[:n], 'steelblue', 'orange')
axes[0].scatter(dates, np.ones(n), c=colors_random, s=2, alpha=0.6)
axes[0].set_yticks([])
axes[0].set_ylabel('Random\nsplit', rotation=0, ha='right', va='center')
axes[0].set_title('Train (blue) vs Test (orange): Random Split vs Temporal Split')
# Temporal split
cutoff = pd.Timestamp('2023-01-01')
colors_temporal = np.where(la['Date'] < cutoff, 'steelblue', 'orange')
axes[1].scatter(dates, np.ones(n), c=colors_temporal, s=2, alpha=0.6)
axes[1].axvline(x=cutoff, color='black', linewidth=1.5, linestyle='--', label='Train/test boundary')
axes[1].set_yticks([])
axes[1].set_ylabel('Temporal\nsplit', rotation=0, ha='right', va='center')
axes[1].legend(loc='upper right')
plt.tight_layout()
plt.show()
print("Top: random split scatters test points everywhere (leakage!).")
print("Bottom: temporal split puts all test points in the future (correct).")
```
:::{.callout-tip}
## Think About It
We're about to train on 2021--2022 and test on 2023. Will R-squared be higher, lower, or about the same as the random split? Why?
:::
Let's train on 2021–2022 and test on 2023.
```{python}
# Temporal split — the right approach
train_temporal = la[la['Date'] < '2023-01-01']
test_temporal = la[(la['Date'] >= '2023-01-01') & (la['Date'] < '2024-01-01')]
model_temporal = LinearRegression()
model_temporal.fit(train_temporal[features], train_temporal['AQI'])
pred_temporal = model_temporal.predict(test_temporal[features])
r2_temporal = r2_score(test_temporal['AQI'], pred_temporal)
mae_temporal = mean_absolute_error(test_temporal['AQI'], pred_temporal)
print(f"Temporal split results (train: 2021-2022, test: 2023):")
print(f" R-squared = {r2_temporal:.3f}")
print(f" MAE = {mae_temporal:.1f}")
print(f" Train size: {len(train_temporal)}, Test size: {len(test_temporal)}")
```
Now let's compare the two approaches side by side.
```{python}
# Side-by-side comparison
print("=" * 50)
print(f"{'Method':<25} {'R-sq':>8} {'MAE':>8}")
print("=" * 50)
print(f"{'Random split (WRONG)':<25} {r2_std:>8.3f} {mae_std:>8.1f}")
print(f"{'Temporal split (RIGHT)':<25} {r2_temporal:>8.3f} {mae_temporal:>8.1f}")
print("=" * 50)
print(f"{'Leakage gap':<25} {r2_std - r2_temporal:>8.3f} {mae_temporal - mae_std:>8.1f}")
print()
gap = r2_std - r2_temporal
if gap > 0.01:
print(f"The random split inflated R-squared by {gap:.3f} -- that's the leakage.")
elif gap < -0.01:
print(f"Here the temporal split slightly outperforms (gap = {gap:.3f}).")
print("The leakage didn't help much -- but the correct evaluation is still the temporal one.")
else:
print(f"The gap is small ({gap:.3f}), but the temporal split is the only honest evaluation.")
print("Even when the numbers are similar, the random split's validity is compromised.")
```
In this dataset, the two R² values are close — the random split didn't dramatically inflate performance. But **the random split is still the wrong evaluation**. Its validity is compromised: test-set days have their neighbors in the training set, so the lag features get to "cheat." The fact that the numbers happen to be similar here doesn't justify the method. The temporal split is the only honest evaluation because it mimics how the model would actually be deployed.
:::{.callout-important}
## Definition: Naive Baseline
The **naive forecast** predicts that tomorrow's value equals today's value: $\hat{y}_{t+1} = y_t$. This baseline is the simplest possible forecast and the bar any model must clear. In the forecasting literature, other standard benchmarks include the **seasonal naive** method (predict the value from the same season last year) and the **drift** method (extrapolate the line between the first and last observations). These methods are collectively called **benchmark forecasting methods**.
:::
```{python}
# Naive forecast baseline: predict "same as yesterday"
naive_pred = test_temporal['aqi_lag1'] # yesterday's AQI IS the naive forecast
mae_naive_baseline = mean_absolute_error(test_temporal['AQI'], naive_pred)
r2_naive_baseline = r2_score(test_temporal['AQI'], naive_pred)
print(f"Naive baseline (same as yesterday):")
print(f" R-squared = {r2_naive_baseline:.3f}")
print(f" MAE = {mae_naive_baseline:.1f}")
print()
print(f"Linear model (temporal split):")
print(f" R-squared = {r2_temporal:.3f}")
print(f" MAE = {mae_temporal:.1f}")
print()
print("The naive baseline is surprisingly hard to beat!")
print("This is common in time series: yesterday is a strong predictor of today.")
```
This result illustrates why backtesting is essential. In production, your model only ever sees the past. Your evaluation should reflect that constraint.
### Scale-independent accuracy: MAPE and MASE
MAE tells you the average error in the same units as your target — useful for one series, but hard to compare across series with different scales. Is an MAE of 5 good for Los Angeles (typical AQI ~50) and Mono County (typical AQI ~30, but spikes to 8,000)?
**MAPE** (Mean Absolute Percentage Error) expresses errors as percentages:
$$\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%.$$
MAPE is intuitive — "the forecast is off by 15% on average" — but it breaks when actual values are near zero (division by zero) and penalizes over-forecasts more heavily than under-forecasts. The asymmetry comes from the bound: when you under-forecast, the percentage error is capped at 100% (you can't predict less than zero); when you over-forecast, it's unbounded. As a result, MAPE-optimal forecasts tend to be biased low. When the actual value is small, even a modest error produces a large percentage — MAPE disproportionately penalizes forecasts for low-value days.
**MASE** (Mean Absolute Scaled Error) compares your model's error to the naive baseline's error:
$$\text{MASE} = \frac{\text{MAE of your model}}{\text{MAE of the naive forecast}}.$$
MASE < 1 means you beat the naive forecast; MASE > 1 means you did worse than just predicting "same as yesterday." This metric connects directly to the baseline comparison above.^[Strictly, MASE uses the in-sample naive MAE as denominator (Hyndman & Koehler, 2006); we use the test-set version for simplicity.]
```{python}
# Compute MAPE and MASE on the temporal test set
actual = test_temporal['AQI'].values
predicted = pred_temporal
naive = test_temporal['aqi_lag1'].values
# MAPE — watch out for zeros
nonzero = actual != 0
mape = np.mean(np.abs((actual[nonzero] - predicted[nonzero]) / actual[nonzero])) * 100
# MASE — scale by naive forecast error
mae_model = np.mean(np.abs(actual - predicted))
mae_naive_ref = np.mean(np.abs(actual - naive))
mase = mae_model / mae_naive_ref
print(f"LA County AQI forecasts (temporal split, 2023):")
print(f" MAE = {mae_model:.2f}")
print(f" MAPE = {mape:.1f}%")
print(f" MASE = {mase:.3f}")
print()
if mase < 1:
print(f"MASE = {mase:.3f} < 1: the model beats the naive baseline.")
else:
print(f"MASE = {mase:.3f} >= 1: the model does NOT beat the naive baseline.")
print("MAPE gives an intuitive percentage, but would blow up if AQI hit 0.")
```
A MASE near 1 means our linear model barely matches the naive "same as yesterday" baseline. That's not a failure of the method — it's a genuine finding. AQI is highly autocorrelated, so yesterday's value is a strong predictor that's hard to improve on with a simple linear model. More sophisticated models (nonlinear, with weather covariates) might do better, but the naive baseline sets a high bar.
## Walk-forward validation
A single temporal split gives you one number. But model performance can *change over time*. **Walk-forward validation** trains on an expanding window and tests on the next period, giving you a performance curve. In the forecasting literature, this is called **time series cross-validation** — the same idea as k-fold cross-validation, but respecting temporal order.
Here's the idea. Each row is one "fold." The training set (blue) expands forward in time, and the model is tested on the next month (orange):
```
Fold 1: [===== TRAIN =====][TEST]
Fold 2: [======= TRAIN =======][TEST]
Fold 3: [========= TRAIN =========][TEST]
Fold 4: [=========== TRAIN ===========][TEST]
2021 2022 2023
```
We used an **expanding window** here (all past data). An alternative is a **sliding window** (fixed-width training set that moves forward), which can be better when very old data is from a different regime — exactly the distribution shift problem we'll see later.
This distinction is especially important when the data has regime changes — like wildfire season.
```{python}
# Walk-forward: train on expanding window, test on next month
# (freq='MS' = month start: generates the 1st of each month)
results = []
test_months = pd.date_range('2023-01-01', '2024-01-01', freq='MS')
for i, test_start in enumerate(test_months[:-1]):
test_end = test_months[i + 1]
train_wf = la[la['Date'] < test_start]
test_wf = la[(la['Date'] >= test_start) & (la['Date'] < test_end)]
if len(test_wf) < 10 or len(train_wf) < 100:
continue
model_wf = LinearRegression()
model_wf.fit(train_wf[features], train_wf['AQI'])
pred_wf = model_wf.predict(test_wf[features])
month_mae = mean_absolute_error(test_wf['AQI'], pred_wf)
month_r2 = r2_score(test_wf['AQI'], pred_wf)
results.append({
'month': test_start,
'mae': month_mae,
'r2': month_r2,
'n_test': len(test_wf),
'max_aqi': test_wf['AQI'].max()
})
print(f" {test_start.strftime('%b %Y')}: MAE = {month_mae:.1f}, R² = {month_r2:.3f}")
wf_results = pd.DataFrame(results)
wf_results['month_label'] = wf_results['month'].dt.strftime('%b %Y')
print(wf_results[['month_label', 'mae', 'r2', 'max_aqi']].to_string(index=False))
```
The monthly results tell a richer story than a single annual number. Let's visualize the performance over time.
```{python}
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 7), sharex=True)
ax1.bar(wf_results['month_label'], wf_results['mae'], color='steelblue', alpha=0.8)
ax1.set_ylabel('MAE')
ax1.set_title('Walk-Forward Validation: Monthly MAE (Los Angeles, 2023)')
ax1.axhline(y=mae_temporal, color='red', linestyle='--', alpha=0.5,
label=f'Full-year MAE = {mae_temporal:.1f}')
ax1.legend()
ax2.bar(wf_results['month_label'], wf_results['max_aqi'], color='coral', alpha=0.8)
ax2.set_ylabel('Max AQI in Month')
ax2.set_title('Worst Air Quality Day Each Month')
ax2.axhline(y=150, color='red', linestyle='--', alpha=0.5, label='Unhealthy threshold')
ax2.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```
Notice how the model's error **spikes during summer months** when wildfire smoke events push AQI to extremes. The model was trained on mostly "normal" air quality patterns and hasn't seen enough extreme events to handle them.
This pattern is the value of walk-forward validation: it reveals *when* your model breaks, not just how well it does on average.
## Lag features are just feature engineering
Recall from Chapter 5 that feature engineering transforms raw data into more useful predictors. Lag features are the same idea applied to time:
| Feature | What it captures |
|---------|-----------------|
| Yesterday's AQI | Short-term momentum |
| 3-day average | Recent trend |
| 7-day average | Weekly baseline |
| Day of year | Seasonal pattern |
| Same day last year | Year-over-year pattern |
Let's see which features matter most.
```{python}
# Which features contribute most?
print("Feature coefficients (temporal model):")
for feat, coef in zip(features, model_temporal.coef_):
print(f" {feat:<20s} {coef:+.4f}")
print(f" {'intercept':<20s} {model_temporal.intercept_:+.4f}")
print()
print("Yesterday's AQI dominates -- AQI is highly autocorrelated.")
print("The seasonal feature (day_of_year) adds a small boost.")
```
## A second example: NBA performance prediction
You might be thinking: "This example is an air quality problem. My field is tech/finance/sports." The temporal-leakage trap appears anywhere data has a time dimension. Try predicting how many points an NBA player scores tonight from their last 5 games.
```{python}
# Load NBA data and keep players with full three-season coverage
nba = pd.read_csv(f'{DATA_DIR}/nba/nba_load_management.csv')
nba['GAME_DATE'] = pd.to_datetime(nba['GAME_DATE'])
nba = nba.sort_values(['PLAYER_NAME', 'GAME_DATE']).reset_index(drop=True)
active = nba.groupby('PLAYER_NAME').size()
nba_active = nba[nba['PLAYER_NAME'].isin(active[active >= 150].index)].copy()
# Rolling lag features per player
nba_active['pts_lag1'] = nba_active.groupby('PLAYER_NAME')['PTS'].shift(1)
nba_active['pts_lag5_avg'] = nba_active.groupby('PLAYER_NAME')['PTS'].transform(
lambda x: x.rolling(5).mean().shift(1))
nba_active['min_lag5_avg'] = nba_active.groupby('PLAYER_NAME')['MIN'].transform(
lambda x: x.rolling(5).mean().shift(1))
nba_model = nba_active.dropna(subset=['pts_lag1', 'pts_lag5_avg', 'min_lag5_avg']).copy()
nba_features = ['pts_lag1', 'pts_lag5_avg', 'min_lag5_avg', 'HOME']
print(f"Modeling dataset: {len(nba_model):,} player-games; seasons {list(nba_model['SEASON'].unique())}")
```
Repeat the random-vs-temporal experiment: random 80/20 split versus train on seasons 1–2, test on season 3.
```{python}
np.random.seed(42)
mask_nba = np.random.rand(len(nba_model)) < 0.8
model_r = LinearRegression().fit(nba_model[mask_nba][nba_features], nba_model[mask_nba]['PTS'])
r2_r = r2_score(nba_model[~mask_nba]['PTS'], model_r.predict(nba_model[~mask_nba][nba_features]))
train_nba = nba_model[nba_model['SEASON'].isin(['2021-22', '2022-23'])]
test_nba = nba_model[nba_model['SEASON'] == '2023-24']
model_t = LinearRegression().fit(train_nba[nba_features], train_nba['PTS'])
r2_t = r2_score(test_nba['PTS'], model_t.predict(test_nba[nba_features]))
print(f"Random split R²: {r2_r:.3f}")
print(f"Temporal split R²: {r2_t:.3f} (gap {r2_r - r2_t:+.3f})")
```
Unlike AQI, the NBA model generalizes well across seasons — player scoring patterns are more stable year to year than air quality is. When temporal structure doesn't create strong distribution shift, the random split may not inflate performance. But the temporal split is still the correct evaluation: it mirrors how you'd actually deploy.
One more twist before we leave temporal leakage behind. Even a model with a decent R² doesn't translate to profit in a betting market — the market already incorporates the information your model uses. If your "edge" comes from the last 5 games' scoring average, so does everyone else's line, and the market has already priced it in. This is the first hint of a deeper problem: the market isn't a static environment. It *adapts* to public signals. Once the whole world knows a feature predicts scoring, the feature stops predicting market outcomes. The temporal split says your R² is real; the betting profit says the market isn't sitting still. That gap between *statistical* performance and *deployment* performance is the bridge to Part 2.
## Distribution shift: when the future doesn't look like the past
Let's go back to air quality and look at something the model has *never seen*.
Mono County, near Mono Lake in eastern California, occasionally experiences extreme dust/wildfire events that push AQI into the thousands — sometimes above 8,000. That's roughly 16x the "Hazardous" threshold of 500.
:::{.callout-tip}
## Think About It
If we train a model on "normal" days (AQI below 300), what would it predict for an AQI of 8,000? How far off would it be?
:::
```{python}
# Mono County extremes
mono = aqi_sub[aqi_sub['county Name'] == 'Mono'][['Date', 'AQI']].copy()
mono = mono.sort_values('Date').reset_index(drop=True)
# Create lag features
mono['aqi_lag1'] = mono['AQI'].shift(1)
mono['aqi_lag7_avg'] = mono['AQI'].rolling(7).mean().shift(1)
mono['aqi_lag3_avg'] = mono['AQI'].rolling(3).mean().shift(1)
mono['day_of_year'] = mono['Date'].dt.dayofyear
mono = mono.dropna().reset_index(drop=True)
print("Mono County AQI distribution:")
print(mono['AQI'].describe().to_string())
print(f"\nDays above 500: {(mono['AQI'] > 500).sum()}")
print(f"Max AQI: {mono['AQI'].max()}")
```
Let's train a linear model on "normal" days (AQI below 300) and see what it predicts during extreme events.
```{python}
# Train a model on "normal" data (AQI < 300) and see what it predicts for extremes
normal_mono = mono[mono['AQI'] < 300]
extreme_mono = mono[mono['AQI'] >= 300]
model_mono = LinearRegression()
model_mono.fit(normal_mono[features], normal_mono['AQI'])
if len(extreme_mono) > 0:
pred_extreme = model_mono.predict(extreme_mono[features])
print("Model predictions vs reality during extreme events:")
print(f"{'Actual AQI':>12} {'Predicted':>12} {'Error':>12}")
print("-" * 40)
for actual, pred in list(zip(extreme_mono['AQI'].values, pred_extreme))[:10]:
print(f"{actual:>12.0f} {pred:>12.0f} {actual - pred:>12.0f}")
coef_lag1 = model_mono.coef_[features.index('aqi_lag1')]
print()
print(f"Coefficient learned on aqi_lag1 (trained on AQI < 300): {coef_lag1:.4f}")
else:
print("No extreme events found in Mono County subset.")
```
The numbers are stark, but the visualization makes the failure visceral.
```{python}
# Visualize the model's failure during extreme events
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(mono['Date'], mono['AQI'], linewidth=0.5, alpha=0.7, label='Actual AQI')
# Predict on all data
pred_all = model_mono.predict(mono[features])
ax.plot(mono['Date'], pred_all, linewidth=0.5, alpha=0.7, color='orange',
label='Model prediction')
ax.axhline(y=500, color='red', linestyle='--', alpha=0.5, label='Hazardous (500)')
ax.set_ylabel('AQI')
ax.set_xlabel('Date')
ax.set_title('Mono County: Model vs Reality During Extreme Events')
ax.legend()
plt.tight_layout()
plt.show()
print("The orange line (predictions) can't reach the spikes.")
print("A linear model trained on normal conditions extrapolates nonsensically.")
```
Why does the model fail even on days when its input — yesterday's AQI — is itself extreme? Look at the learned coefficient on `aqi_lag1`: essentially zero. Inside the training regime (AQI below 300), day-to-day changes in Mono County air quality are dominated by noise, so ordinary least squares learns that yesterday's value carries almost no predictive signal and settles for "predict the training mean." When we plug in `aqi_lag1 = 8000`, the prediction barely budges from the training mean — the model never had a chance. The failure isn't that the model refuses to extrapolate; it's that in the training distribution, the relevant signal didn't exist to be learned.
:::{.callout-important}
## Definition: Distribution Shift
**Distribution shift** occurs when future data comes from a fundamentally different regime than the training data. It is a severe form of non-stationarity — not just gradual changes in the mean, but a qualitative change the training data never captured.
:::
Distribution shift is the deepest problem in time series forecasting:
- Your training data covers *normal* conditions
- The events that matter most — wildfires, pandemics, financial crashes — are precisely the ones your model has never seen
- No amount of clever cross-validation alone fixes this; it's a fundamental limitation of purely data-driven models
> "Prediction is very difficult, especially about the future." — attr. Niels Bohr
A clinical trial on adults can't tell you about children — the inference only covers the population you sampled. Same idea here: a model trained on AQI 0--300 can't tell you what happens at AQI 8,000. You can't extrapolate beyond the range of your data. [Chapter 6](lec06-validation.qmd) previewed this: random splits hold out observations that *look like* the ones you trained on, so cross-validation protects you against in-distribution noise, not against a future that doesn't look like the past.
In Chapter 17, we'll see that AutoML tools often make this same mistake — they use standard cross-validation by default and don't know your data is temporal.
Distribution shift can also be gradual — styles of play, market conditions, or climate patterns that evolve slowly can erode model performance without triggering obvious alarms.
If models can fail silently when conditions change, point forecasts hide exactly what matters. The next section quantifies the uncertainty around a forecast — at least under the regime the data captures.
## Prediction intervals: quantifying forecast uncertainty
A point forecast — "tomorrow's AQI will be 62" — hides everything that matters for decision-making. Should the county issue a health advisory? That depends on whether 62 means "almost certainly between 55 and 70" or "could easily be 200." **Prediction intervals** give you that range.
The idea: use the bootstrap. Recall from Chapter 8 that bootstrap resampling estimates uncertainty by simulating many alternative datasets from the data we have. Here we apply the same idea to forecast uncertainty: we resample the model's training residuals, add them to each forecast, and repeat many times. The spread of the resulting forecasts gives you a prediction interval.
```{python}
# Bootstrap prediction intervals for the temporal test set
np.random.seed(42)
train_resid = train_temporal['AQI'].values - model_temporal.predict(train_temporal[features])
n_boot = 1000
# Generate bootstrap forecasts by resampling residuals
boot_preds = np.zeros((n_boot, len(test_temporal)))
point_pred = model_temporal.predict(test_temporal[features])
for b in range(n_boot):
resampled_resid = np.random.choice(train_resid, size=len(test_temporal), replace=True)
boot_preds[b, :] = point_pred + resampled_resid
# 95% prediction interval: 2.5th and 97.5th percentiles
lower = np.percentile(boot_preds, 2.5, axis=0)
upper = np.percentile(boot_preds, 97.5, axis=0)
# How often does the actual value fall inside the interval?
coverage = np.mean((test_temporal['AQI'].values >= lower) & (test_temporal['AQI'].values <= upper))
avg_width = np.mean(upper - lower)
print(f"95% bootstrap prediction interval:")
print(f" Coverage: {coverage:.1%} of test days (target: 95%)")
print(f" Average width: {avg_width:.1f} AQI points")
```
Let's visualize the prediction intervals for the first two months of the test period. The shaded band shows the 95% interval around each point forecast.
```{python}
# Visualize prediction intervals for a 60-day window
fig, ax = plt.subplots(figsize=(10, 5))
window = slice(0, 60)
dates_w = test_temporal['Date'].values[window]
ax.fill_between(dates_w, lower[window], upper[window],
alpha=0.25, color='steelblue', label='95% prediction interval')
ax.plot(dates_w, point_pred[window], color='steelblue', linewidth=1.2, label='Point forecast')
ax.plot(dates_w, test_temporal['AQI'].values[window], color='black',
linewidth=0.8, alpha=0.8, label='Actual')
ax.set_ylabel('AQI')
ax.set_xlabel('Date')
ax.set_title('LA County AQI: Forecast with 95% Prediction Interval (Jan–Feb 2023)')
ax.legend()
plt.tight_layout()
plt.show()
print("Point forecasts alone are dangerous for decision-making.")
print("Always report intervals so stakeholders can plan for the plausible range.")
```
This resampling approach is completely general: you can construct prediction intervals for *any* model, not just linear regression. The key principle is that a forecast without a measure of uncertainty isn't a complete forecast. (The intervals above assume stationary residuals — in reality, bands should widen during high-variance regimes like wildfire season.)
# Part 2: Feedback loops
## When predictions change the world
Temporal leakage is a validation *technique* failure — random splits saw the future. Feedback loops are a deeper failure: even with a perfectly clean temporal split, your model can fail in deployment because *your model is part of the world it predicts*. What happens when a prediction model doesn't just reflect the world but changes it?
**Predictive policing.** A model predicts high crime in neighborhood X. Police patrol X more heavily. More patrols produce more arrests in X. The model retrains on the new data and "confirms" that X is high-crime. The prediction is self-fulfilling — the model creates the data that justifies itself. This is not confounding in the usual sense. In a confounding story, the association already exists and we misinterpret it. In a feedback loop, the model *causes* the association.
**Credit scoring.** In *Weapons of Math Destruction*, Cathy O'Neil describes how credit scores can trap people in cycles they can't escape.^[O'Neil, *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*, Crown, 2016.] A low credit score leads to denied loans. Without loans, a borrower can't build credit history. The score stays low — not because the person is irresponsible, but because the scoring system denied them the opportunity to demonstrate otherwise. The score creates the reality it claims to measure.
**Recommendation algorithms.** YouTube recommends increasingly extreme content because users engage with it. Users watch what's recommended. The algorithm learns they "prefer" extreme content and recommends more. The model shapes the preferences it claims to predict.
**Clinical alerts.** A hospital deploys a sepsis-prediction model trained on historical data. Once the alerts go live, clinicians treat flagged patients earlier. Fewer patients progress to sepsis. The model, evaluated after deployment, looks *worse* than it did on the validation set — not because it degraded, but because the thing it was predicting stopped happening. If the hospital responds by retraining on post-deployment data, the model's ground truth is now shaped by the model itself.
**Betting markets.** Back to the NBA example. Your model beats a naive baseline on historical data. You deploy it against the sportsbook. The line already encodes the same information your model uses — recent scoring averages, rest days, matchup difficulty — because thousands of sharp bettors have moved the line. Your "edge" disappears the moment the market has processed the same signals. The model doesn't fail because of distribution shift; it fails because the market *is* the feedback loop, adapting to public information in real time.
In each case, there is a causal arrow running from the model's *prediction* back to the *outcome*:
```{python}
#| fig-cap: "Feedback loop: the model's prediction changes the outcome it measures, and the new data feeds back into training."
fig, ax = plt.subplots(figsize=(10, 3.2))
ax.set_xlim(0, 10); ax.set_ylim(0, 3); ax.axis('off')
nodes = [('Training\ndata', 1.0, 2.3), ('Model', 2.8, 2.3), ('Prediction', 4.6, 2.3),
('Action', 6.4, 2.3), ('Outcome', 8.2, 2.3), ('New training\ndata', 9.5, 2.3)]
for name, x, y in nodes:
ax.add_patch(plt.Rectangle((x - 0.65, y - 0.3), 1.3, 0.6,
facecolor='lightblue', edgecolor='navy', lw=1.8, zorder=3))
ax.text(x, y, name, ha='center', va='center', fontsize=10, fontweight='bold', zorder=4)
for (_, x1, y1), (_, x2, y2) in zip(nodes[:-1], nodes[1:]):
ax.annotate('', xy=(x2 - 0.65, y2), xytext=(x1 + 0.65, y1),
arrowprops=dict(arrowstyle='->', lw=2, color='navy'))
# Cycle return arrow: from "New training data" back to "Training data"
ax.annotate('', xy=(1.0, 2.0), xytext=(9.5, 1.6),
arrowprops=dict(arrowstyle='->', lw=2, color='#d62728',
connectionstyle='arc3,rad=-0.3'))
ax.text(5.0, 0.6, 'cycle repeats', ha='center', va='center',
fontsize=10, color='#d62728', fontstyle='italic')
plt.tight_layout(); plt.show()
```
This is why "correlation is not causation" isn't the full story. Sometimes the model *is* the cause. When a model's predictions change the environment that generates future data, standard validation breaks down. The test set no longer represents the world the model will operate in, because the model itself reshapes that world.
Whenever you deploy a model, ask: how will the predictions change the data distribution in the future? If the answer is "they won't" (a weather forecaster doesn't affect the weather, a particle-physics model doesn't affect the particles), you're safe from feedback loops. If the answer is "they will" (credit scores affect who gets loans, sepsis alerts affect who gets treated), you have a feedback loop — and your validation set can't tell you how it will behave.
:::{.callout-note collapse="true"}
## Going deeper: performative prediction
The phenomenon we're calling "feedback loop" has a formal research framework called **performative prediction** (Perdomo, Zrnic, Mendler-Dünner, Hardt 2020). It treats the deployed model $f_\theta$ as inducing a data distribution $\mathcal{D}(\theta)$ that depends on the deployed parameters, and studies when iterative retraining converges to a fixed point (a "performatively stable" $\theta$ where the model is accurate against the distribution *it induces*). The framework also subsumes *strategic classification*, where individual agents deliberately change their features to score better. The formal machinery is mostly relevant for algorithm designers; the *phenomenon* is what matters for analysts, which is why we stick with "feedback loop" in this chapter.
:::
## Weapons of Math Destruction
Cathy O'Neil calls the worst cases **Weapons of Math Destruction** (WMDs).^[O'Neil, *Weapons of Math Destruction*, Crown, 2016. O'Neil's own three properties are opacity, scale, and damage; we focus here on a related triad — inspired by her framework — that highlights feedback dynamics.] Three properties, together, turn an ordinary predictive model into a WMD:
1. The outcome is **not easily measurable** (you can't compute a clean test error)
2. The predictions can have **negative consequences** for individuals
3. The model creates **self-fulfilling feedback loops** that reinforce its own biases
College rankings are a WMD: high rankings attract students, faculty, and donors, which raises the school's quality — the ranking creates the reality it claims to measure. Parole-risk models are a WMD: keeping someone in prison longer may *cause* them to reoffend (fewer job prospects, deeper criminal networks), confirming the model's prediction. By contrast, CERN's particle-physics models are *not* a WMD: the outcome is measurable, the experiments are rigorous, and the predictions don't change the particles.
:::{.callout-tip}
## Think About It
You build a model that predicts which students will fail a course, and the university uses it to assign tutoring resources. Is this a Weapon of Math Destruction? Check the three properties: is the outcome measurable? Could the prediction harm students (e.g., through stigma or self-fulfilling expectations)? Does it create a feedback loop? What design choices would make this a virtuous intervention rather than a vicious cycle?
:::
# Part 3: Goodhart's law
## "When a measure becomes a target..."
"When a measure becomes a target, it ceases to be a good measure" (attributed to Charles Goodhart, 1975; popularized by Marilyn Strathern). The law describes something deeper than overfitting. Its core is **game-theoretic**: once a measurement is tied to consequences, the units being measured *respond* to the measurement. Those who can game the metric most effectively benefit disproportionately. The data the metric generates after it becomes a target is no longer the data it was calibrated on — it's the data that results from agents trying to score well on it.
Three ingredients define a Goodhart setup:
1. A **proxy metric** $M$ that approximates the goal $G$ we actually care about.
2. **Consequences** tied to $M$: money, status, survival, promotion.
3. **Agents** who can adjust their behavior to move $M$ — and who differ in how well they can do so.
Before the consequences are attached, $M$ and $G$ may be tightly linked. After, $M$ reflects the mix of "underlying $G$" and "skill at gaming $M$" — and the more optimization pressure you apply, the more the second term dominates.
The pattern appears everywhere once you see it.
- **Hospital readmission penalties.** CMS penalizes hospitals with high 30-day readmission rates. The metric was designed to measure care quality. Some hospitals responded by declining to admit the sickest patients in the first place — keeping readmission rates low not by providing better care, but by selecting healthier patients. The hospitals that were best at patient selection benefited most. Those that kept treating the same mix of patients looked worse by comparison, even if their care hadn't changed.
- **School test scores.** When teacher evaluations depend on standardized test scores, some teachers shift from teaching understanding to teaching the test; some schools reclassify struggling students so they don't sit for the exam. Scores rise; learning does not. The distribution of "teachers who game well" is *not* the distribution of "teachers who teach well," so the metric's meaning drifts.
- **Emissions testing.** Volkswagen installed software that detected when a vehicle was being tested and switched to a low-emission mode. VW optimized on the test distribution (the lab cycle), not the deployment distribution (actual driving). They scored perfectly on $M$ while $G$ (real-world emissions) moved the opposite direction.
- **Citations and h-index.** Researchers respond to citation-based evaluation by salami-slicing papers, publishing incremental work, and forming citation cartels. Those best at playing the citation game do best under the metric — regardless of whether their research is best.
- **p-hacking.** The $p < 0.05$ threshold was designed to measure evidence strength. When careers depend on clearing it, some researchers run analyses until they find significance. p-hacking is Goodhart's law applied to scientific inference — researchers who are more thorough about exploring specifications have an edge over those who aren't.
Notice the shared structure: the measurement changes the distribution of the thing being measured, and *how much* it changes depends on which units are best at responding to the measurement. You can't read Goodhart as "the data was fine, we just optimized too hard." The data itself is different once the measurement goes live.
## The machine-learning reading (as a special case)
The famous ML phenomenon — train loss goes down while test loss rises — is a shadow of Goodhart, not its core. Go back to [Chapter 6](lec06-validation.qmd). When you train a high-capacity model on a fixed training set, you're the sole agent optimizing a proxy (training loss) for a goal (generalization loss). Your "gaming" happens entirely inside the learning algorithm: the model memorizes idiosyncratic noise in the training rows, which looks great on the training metric but doesn't transfer. Cross-validation — reserving data the optimizer never sees — is how ML practice resists this version.
But this is the *static* case: the data is fixed, only the model is optimizing. The full Goodhart picture is **dynamic**: the data itself is generated by strategic agents who respond to the metric *after it is deployed*. A recidivism-risk score that detains suspected flight risks changes which crimes go unobserved. A sepsis-alert model that triggers early intervention changes which patients develop sepsis. A ranking signal for ad quality changes which ads advertisers write. In every case, what gets to the training set next round depends on what the model did last round — and which agents are best at adapting. No amount of held-out test data from before deployment tells you what the deployed system does.
If you repeatedly tune hyperparameters against the validation set — if you let the validation set *become* the target — even the ML-internal version of Goodhart kicks in: the validation loss stops measuring generalization. This is why serious ML practice distinguishes **training / validation / test** sets, and why the test set is used *once*, after all decisions are locked.
## Asking the Goodhart question
The pattern is the same whether the optimizer is a gradient-descent run, a hospital administrator, or a student preparing for a standardized test: the measurement is tied to consequences, the measured party responds, and some respond more effectively than others. Whenever you use a number to make decisions, ask two questions:
1. **Could the agents being measured change their behavior to move this metric without advancing the underlying goal?**
2. **Which agents are best positioned to do so — and are they the ones the policy was supposed to reward?**
If the answer to the first is yes, you have a Goodhart problem. If the answer to the second is "not the intended beneficiaries," you have an equity problem on top of a Goodhart problem. Defensive designs include auditing with a second metric the optimizer can't see, capping how often any one party can update against the metric, or — the strongest defense, and the subject of [Chapter 19](lec19-causal-inference-2.qmd) — randomizing a holdout so you can still see what the unperturbed world would look like.
# Part 4: What to do
Temporal leakage, feedback loops, and Goodhart's law are distinct failure modes, but they share a recipe for defense.
**Before deployment:**
1. **Split by the right axis.** If time matters, split by time. If the unit of prediction is a person and people recur across the dataset, split by person. Ask: *what's the axis of generalization I actually care about?* and split on that axis.
2. **Simulate the deployed action.** Don't just evaluate the prediction — evaluate what the prediction will *cause*. If the model's output triggers a decision (treat, loan, patrol), simulate the downstream effect on the outcome distribution. A model whose predictions change behavior cannot be validated on passive historical data alone.
3. **Audit proxy-vs-outcome.** For every metric you optimize, write down the goal it's a proxy for. Where does the proxy diverge from the goal? That's where Goodhart will bite.
4. **Stress-test the tails.** Models fail on data they've never seen. Construct adversarial examples and out-of-distribution test cases before deployment, not after.
**After deployment:**
1. **Monitor drift.** Track model performance against fresh ground truth. When it erodes, investigate whether the world changed, whether the model changed the world, or whether the metric is being gamed.
2. **Hold out a control.** Don't let the model decide for 100% of cases. Reserve a random sample where decisions are made by some other mechanism — a simpler baseline, a human, a coin flip. The control gives you a clean signal about what *would* have happened without the model.
3. **A/B test major changes.** We'll develop the formal machinery in [Chapter 19](lec19-causal-inference-2.qmd). The instinct to randomize some decisions — so that the model's impact is measured causally, not just predictively — is the single most effective defense against feedback-loop failures.
[Chapter 17](lec17-automl-llms.qmd) closes this course with a 15-item checklist for evaluating any analysis. Items #13 (Goodhart) and #14 (feedback loops) refer back here. When you see a production model that's been running for a year, ask: which of the three failure modes would catch it, if any?
## Key takeaways
- **Random-split validation is one tool, not the whole toolbox.** Use the validation procedure matched to the failure mode, and recognize when no validation procedure will help.
- **Temporal leakage.** Random splits leak future information into training. Backtesting — temporal splits and walk-forward validation — is established, trustworthy methodology for forecasting (with caveats for sudden regime shifts).
- **Distribution shift.** Models trained on normal conditions can't predict unprecedented events. Test on explicit out-of-distribution datasets, stratify by subgroup, and widen intervals for extreme inputs.
- **Feedback loops.** When a model's predictions change the world, *no* passive validation procedure catches the problem — the test set no longer represents the deployment environment. Ask: *how do the predictions change the outcome distribution?* If they do, you need randomized holdouts, not better splits.
- **Weapons of Math Destruction** have three properties: unmeasurable outcome, negative consequences, self-fulfilling loop. College rankings and parole-risk models are WMDs; weather forecasts and particle-physics models aren't.
- **Goodhart's law is game-theoretic.** When a measurement is tied to consequences, the agents being measured respond, and those best at responding dominate the metric. Overfitting is the special case where the only agent is the learning algorithm itself.
- **What to do.** Before deployment: split by the right axis, simulate the action the prediction drives, audit proxy-vs-outcome, stress-test tails. After deployment: monitor drift, hold out a control, A/B test changes.
- **Always report prediction intervals** rather than point forecasts — a forecast without uncertainty isn't a complete forecast.
## Study guide
### Key ideas
- **Temporal split (backtesting):** evaluating a model by training on past data and testing on future data, mimicking real deployment.
- **Data leakage:** when information from the future (or the test set) contaminates training, producing optimistically biased results. Random train/test splits create leakage in time series because neighboring observations are correlated across the split.
- **Lag features:** predictors constructed from past values (yesterday's value, 7-day rolling average, same-day-last-year).
- **Walk-forward validation:** repeatedly training on an expanding (or sliding) window and testing on the next period. Reveals *when* a model fails, not just its average performance.
- **Distribution shift:** when future data comes from a fundamentally different regime than the training data. Models trained on normal conditions cannot predict unprecedented events.
- **Prediction intervals:** a range (e.g., 95%) around a point forecast capturing plausible future values. Bootstrap resampling of training residuals is a general-purpose method.
- **Feedback loop:** a causal arrow from the model's prediction back to the outcome — prediction → action → new data → new prediction. The test set no longer represents the deployment environment.
- **Weapon of Math Destruction (WMD):** a predictive model with (1) an unmeasurable outcome, (2) negative consequences for individuals, (3) a self-fulfilling feedback loop. Cathy O'Neil's framework.
- **Goodhart's law:** when a measure becomes a target, it ceases to be a good measure. Manifests in ML as overfitting (training loss as proxy for generalization), in policy as metric gaming, in science as p-hacking.
- **Proxy-vs-outcome auditing:** for every metric you optimize, write down the goal it's a proxy for. Where does the proxy diverge? That's where Goodhart will bite.
### Computational tools
- `pd.Series.shift(n)` — create lag features by shifting values forward or backward
- `pd.Series.rolling(k).mean()` — compute rolling window averages
- Temporal train/test split: filter by date (`df[df['Date'] < cutoff]`)
- Walk-forward loop: iterate over time periods with an expanding training window
- Bootstrap residual resampling: construct prediction intervals for any model
### For the quiz
You should be able to: (1) explain why random splitting fails for time series data and implement a temporal split, (2) describe walk-forward validation (expanding and sliding window variants), (3) explain what distribution shift is and why no cross-validation method can fully address it, (4) identify the three properties of a WMD given a deployment scenario, (5) give two examples of feedback loops and diagram the prediction-action-outcome cycle, (6) state Goodhart's law and explain how overfitting is a special case, (7) list three pre-deployment and three post-deployment practices that defend against these failure modes, and (8) explain why prediction intervals matter and how bootstrap resampling produces them.