Lecture 16: When validation isn’t enough

MSE 125 — Applied Statistics

Madeleine Udell

Wednesday, May 20, 2026

logistics

HW 4 (HMDA mortgage challenge) due Mon May 25
project final: keep building; group work reminder
today: validation failures in deployment

you shipped a model that passed validation

it cleared every train/test check
in deployment it quietly falls apart

what did the random split miss?

agenda

temporal leakage
distribution shift
feedback loops
Goodhart’s law
what to do

validation rested on three assumptions

from Chapter 6:

rows are exchangeable: order doesn’t matter
deployment matches training
predictions don’t affect outcomes

this lecture: when each one breaks

four ways the assumptions break

failure mode	what happens	fix
temporal leakage	random split leaks information about the test set	train past, test future
distribution shift	deployment regime never seen	monitor accuracy after deployment
feedback loops	prediction changes the outcome	randomized holdout
Goodhart’s law	metric becomes a target	audit proxy vs. goal

temporal leakage

the data: US monthly retail sales

US Census / FRED series RSAFSNA, not seasonally adjusted

long-run trend: sales grow with population, prices
annual cycle: sharp December spike, January trough
short-range autocorrelation: this month ≈ last month

non-stationarity

non-stationary series

a time series whose mean or variance changes over time

here: retail sales climbs year after year, and the December spike grows with it

non-stationarity is what makes time-series validation different from the i.i.d. (independent, identically distributed) world of Chapter 6

scoring a forecast: MAE

mean absolute error (MAE)

\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|

average size of the forecast error, in the units of the target (here, dollars). lower is better

the bar to clear: naive baseline

strongly seasonal series → predict the same month last year: \hat{y}_t = y_{t-12}
fits no parameters, uses no features

Naive (same month last year) MAE, full series: $20.3B

any model that can’t beat this isn’t earning its complexity
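
A minimal sketch of the baseline and its score, run on a hypothetical toy series rather than the retail data. The function names `mae` and `seasonal_naive` are illustrative, not taken from the course notebook.

```python
import math

def mae(actual, predicted):
    """Mean absolute error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def seasonal_naive(series, period=12):
    """Forecast each point with the value one full period earlier.

    The returned list lines up with series[period:].
    """
    return series[:-period]

# Hypothetical toy series: linear growth plus a smooth annual cycle.
series = [100 + 2 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(48)]

actual = series[12:]
forecast = seasonal_naive(series)
baseline_mae = mae(actual, forecast)
print(round(baseline_mae, 2))
```

On this toy series the error equals the 12 months of trend growth (24 units): the seasonal part cancels, and the year-ago value lags the trend by exactly one year.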

same data, two ways to split

random: shuffle all months, hold out 20%
temporal: train early years, test later ones

data leakage

data leakage (temporal)

information unavailable at prediction time contaminates training, producing optimistically biased scores

here: a random split puts a 2010 test month’s immediate neighbors (the months just before and after) into the training set
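
One way to see the leak is to count how many test months have an adjacent month in the training set. The sketch below assumes 120 hypothetical monthly observations; the helper names are illustrative.

```python
import random

def random_split(n, test_frac=0.2, seed=0):
    """Shuffle all time indices and hold out a random fraction."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def temporal_split(n, test_frac=0.2):
    """Train on the earliest indices, test on the latest."""
    cut = n - int(n * test_frac)
    return list(range(cut)), list(range(cut, n))

def share_with_train_neighbor(train, test):
    """Fraction of test months whose previous or next month is in training."""
    train_set = set(train)
    hits = sum((t - 1 in train_set) or (t + 1 in train_set) for t in test)
    return hits / len(test)

n_months = 120  # hypothetical: ten years of monthly data
random_train, random_test = random_split(n_months)
time_train, time_test = temporal_split(n_months)

random_share = share_with_train_neighbor(random_train, random_test)
temporal_share = share_with_train_neighbor(time_train, time_test)
print(f"random: {random_share:.2f}, temporal: {temporal_share:.2f}")
```

Under the temporal split only the first test month touches the training window (1 of 24); under the random split nearly every test month has a training neighbor to interpolate from.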

three forecasts, three kinds of structure

lag features

predictors built from the series’ own past values

recent lags: last month, last quarter’s average → momentum
linear trend: a straight line through time → long-run growth
seasonal: sine/cosine of the month + the year-ago value → annual cycle

why sine/cosine, not a dummy per month? 12 monthly dummies cost 11 extra parameters and overfit a short history. two Fourier features, sin and cos, impose one smooth annual shape

Each model is built from lag features: past values of the series used as predictors for the present. Each leans on a different kind of structure. Name the term “lag features” here; it’s the quiz vocabulary, and the operational vocabulary (“lag12”, “year-ago anchor”) all comes back to it.

The seasonal bullet hides a feature-engineering choice worth naming. The obvious way to encode “which month is it?” is twelve one-hot dummies — but that spends eleven free parameters, and with only a few dozen Januaries in the training window each monthly effect is estimated from little data and chases noise. The sine/cosine pair are Fourier features: they impose a single smooth wave that repeats every 12 months, capturing the annual cycle with two parameters instead of eleven. We use them here for that parsimony; with abundant data and a genuinely irregular month-to-month pattern, dummies can win.
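
The three feature families can be sketched as one row-building function. This is an illustration under assumed names (`build_features`, `lag3_mean`, `sin12`, `cos12`), not the notebook's code; it needs at least 12 months of history before `t`.

```python
import math

def build_features(series, t):
    """Features for predicting series[t], using only values before t."""
    return {
        # recent lags: momentum
        "lag1": series[t - 1],
        "lag3_mean": sum(series[t - 3:t]) / 3,
        # linear trend: long-run growth
        "trend": t,
        # seasonal: year-ago value plus one smooth annual wave
        "lag12": series[t - 12],
        "sin12": math.sin(2 * math.pi * t / 12),
        "cos12": math.cos(2 * math.pi * t / 12),
    }

series = list(range(100, 130))  # hypothetical placeholder values
row = build_features(series, t=24)
print(row["lag1"], row["lag12"])  # 123 112
```

Every feature is computed from indices strictly below `t`, so the row is available at prediction time.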

predict before you see the answer

three forecasting models: recent lags, linear trend, seasonal

all score R² > 0.9 on a random split.

which still perform well when we train on the past and test on the future?

the reveal: random hides what temporal exposes

Model            random R²  temporal R²  temporal MAE
-----------------------------------------------------
Recent lags          0.907        0.378        $27.2B
Linear trend         0.933        0.446        $26.1B
Seasonal             0.975        0.936         $9.3B
Naive baseline           —        0.847        $15.9B

random split: all three excellent
temporal split: recent-lags & trend fall below the naive baseline
same models, same data: only the split changed

R² closer to 1 is better; MAE in $B, lower is better

why does only the seasonal model validate well?

random test month sits between its neighbors: easy to interpolate
temporal test month: training data ends years earlier
- trend extrapolates a line → falls behind as growth compounds
- recent lags: only short-range structure to lean on
- seasonal: carries the year-ago value + a stable annual shape

leakage bites the models whose accuracy needs neighbors nearby in time

walk-forward validation

repeat the temporal split at many cut points: train on an expanding window, test the next period, step forward

Fold 1: [===== TRAIN =====][TEST]
Fold 2: [======= TRAIN =======][TEST]
Fold 3: [========= TRAIN =========][TEST]

expanding window keeps all history; sliding uses a fixed width
called backtesting in finance
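
A minimal index generator for the folds drawn above, as a sketch. The parameter names (`initial_train`, `horizon`, `expanding`) are assumptions for illustration.

```python
def walk_forward_splits(n, initial_train, horizon, expanding=True):
    """Yield (train_indices, test_indices) pairs that step forward in time.

    expanding=True keeps all history; expanding=False slides a window
    of fixed width initial_train.
    """
    start = initial_train
    while start + horizon <= n:
        train_start = 0 if expanding else start - initial_train
        train = list(range(train_start, start))
        test = list(range(start, start + horizon))
        yield train, test
        start += horizon

folds = list(walk_forward_splits(n=10, initial_train=4, horizon=2))
sliding = list(walk_forward_splits(n=10, initial_train=4, horizon=2, expanding=False))

for train, test in folds:
    print(train, test)
```

Scoring each fold separately, rather than averaging them, is what lets the per-period error plot show when the model breaks.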

when does the model break?

low error through the calm years
spikes: 2008–09 crisis, far larger 2020 pandemic
a single average MAE would bury these regime changes

split by the right axis

a churn model predicts which customers will cancel
data: one row per customer per month (each customer recurs many times)

split at random by row? or some other way, and why?

match the model to the structure

decompose: trend × seasonal × residual

multiplicative: December is the same percentage above trend each year
seasonal multiplier: Dec 1.147 (about 15% above trend), Feb 0.891 (about 11% below)
residual small except the COVID crater
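
written out, assuming the standard multiplicative form (the symbols T_t, S_{m(t)}, R_t are notation introduced here, with m(t) the calendar month of period t):

y_t = T_t \times S_{m(t)} \times R_t, \qquad \log y_t = \log T_t + \log S_{m(t)} + \log R_t

with the multipliers above, a December is modeled as 1.147 \times T_t and a February as 0.891 \times T_t, before the residual. taking logs turns the product into a sum, so the same seasonal percentage applies at any level of the trend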

Holt-Winters: let the structure adapt

tracks level, trend, seasonal: each re-estimated as data arrives
three smoothing parameters \alpha, \beta, \gamma \in [0,1]: how fast each adapts (near 1 chases recent data, near 0 barely moves)
state only ever looks backward → extends naturally into the future

alpha (level)    = 0.502
beta  (trend)    = 0.000
gamma (seasonal) = 0.204

\beta \approx 0: the growth rate barely changes year to year
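
for reference, the usual textbook updates, assuming additive trend and multiplicative seasonality with period m = 12 (the notebook may use a different variant; \ell, b, s for level, trend, and seasonal state are notation introduced here):

\ell_t = \alpha \frac{y_t}{s_{t-m}} + (1-\alpha)(\ell_{t-1} + b_{t-1})

b_t = \beta(\ell_t - \ell_{t-1}) + (1-\beta)\, b_{t-1}

s_t = \gamma \frac{y_t}{\ell_{t-1} + b_{t-1}} + (1-\gamma)\, s_{t-m}

\hat{y}_{t+h} = (\ell_t + h\, b_t)\, s_{t+h-m}, \qquad 1 \le h \le m

every right-hand side uses only quantities known at time t. setting \beta = 0 gives b_t = b_{t-1}: the slope is never revised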

demo: forecasting retail sales

switch to the notebook

naive vs linear regression vs Holt-Winters
24-month forecast (2018–19), trained through 2017

colab: lec16-feedback-loops.ipynb

the forecast, and the verdict

Model                                R²       MAE
Naive (lag12)                     0.672    $18.2B
Linear regression (lag12+trend)   0.894     $9.5B
Holt-Winters                      0.907     $9.0B

Holt-Winters wins by tracking the December multiplier. but a better model still doesn’t fix validation: COVID breaks any fit trained through 2019

distribution shift

the data: California daily AQI

retail had clean repeating structure. now a series with almost none

LA, Sacramento: seasonal wildfire spikes on baseline pollution
Mono County: dust storms push AQI above 8,000
AQI: 0 clean → 150 unhealthy → 500 hazardous

predict the failure

train a linear model on yesterday’s AQI, using only normal days (AQI < 300).

a dust storm hits: actual AQI = 8,000.

does it get close, or badly miss? roughly what does it predict?

the model flatlines

selected extreme days
Actual AQI    Predicted    Error
      7835           30     7805
      3404           30     3374
      1196           29     1167

coefficient on yesterday’s AQI ≈ −0.0004
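
A toy reconstruction of the flatline, not the California data: the values below are hypothetical and deliberately built so that today's AQI is unrelated to yesterday's, mirroring the near-zero coefficient above. With a zero slope, the fitted line can only return the training mean.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical toy data: ordinary days only, no relation between the two columns.
yesterday = [20, 30, 40, 50] * 25
today = [31, 29, 29, 31] * 25

# Keep only "normal" days (AQI < 300), as in the setup above.
pairs = [(x, y) for x, y in zip(yesterday, today) if x < 300 and y < 300]
intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])

def predict(aqi_yesterday):
    return intercept + slope * aqi_yesterday

calm_eve = predict(45)     # storm arrives after an ordinary day
mid_storm = predict(7835)  # second day of a storm
print(slope, calm_eve, mid_storm)
```

Both predictions equal the training mean of 30, whatever the input. A model fit only on normal days has no way to express an extreme day.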

distribution shift

future data comes from a different regime than the training data: a qualitative change the training set never captured

the events that matter most (wildfires, pandemics, crashes) are the ones your model has never seen

no cross-validation fixes this: it holds out points that look like training data

“prediction is very difficult, especially about the future” (attr. Bohr)

report a range, not a point

bootstrap from Chapter 8: resample training residuals, add to each forecast
works for any model; here a nominal 95% interval achieves 93.7% coverage at a width of 89.5 AQI points
widens for ordinary noise, but blind to regimes never seen (Mono blew past it)
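
A sketch of the residual bootstrap for a single forecast. The residuals are hypothetical and the function name is illustrative; the point forecast of 30 and the storm value of 7835 echo the table above.

```python
import random

def residual_bootstrap_interval(point_forecast, residuals, level=0.95,
                                n_boot=2000, seed=0):
    """Interval from resampled training residuals added to a point forecast."""
    rng = random.Random(seed)
    sims = sorted(point_forecast + rng.choice(residuals) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot))
    return sims[lo_idx], sims[hi_idx]

# Hypothetical residuals from ordinary days: errors between -50 and +50 AQI.
train_residuals = list(range(-50, 51))
lower, upper = residual_bootstrap_interval(30.0, train_residuals)

storm_day = 7835  # an extreme day like those in the table above
covered = lower <= storm_day <= upper
print(f"interval: [{lower}, {upper}], covers storm day: {covered}")
```

The interval can never be wider than the largest residual seen in training, which is why it stays blind to a regime the training set never contained.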

when predictions change the world

predictive policing

Photo: Mr. Satterly / Wikimedia Commons, CC0

model flags neighborhood as high-crime
police patrol it harder
more patrols → more arrests
model retrains, “confirms” itself

the model creates the data that justifies it

a causal arrow runs backward

prediction → action → outcome → new training data → repeat
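
A deterministic toy simulation of the loop, under loudly stated assumptions: two neighborhoods with the same true crime rate, a 70/30 patrol allocation, and arrests proportional to patrols. All numbers are hypothetical.

```python
def simulate_patrol_loop(rounds=10, true_rate=0.1, recorded=(11.0, 10.0)):
    """Two neighborhoods with the SAME true crime rate; patrols follow recorded arrests."""
    recorded = list(recorded)
    share_of_a = []
    for _ in range(rounds):
        # "Model": flag whichever neighborhood has more recorded arrests.
        hot = 0 if recorded[0] >= recorded[1] else 1
        patrols = [30.0, 30.0]
        patrols[hot] = 70.0  # the flagged neighborhood gets most of the patrols
        for i in range(2):
            # Arrests scale with patrol presence, not with any real difference in crime.
            recorded[i] += patrols[i] * true_rate
        share_of_a.append(recorded[0] / sum(recorded))
    return share_of_a

shares = simulate_patrol_loop()
print(f"start: {11 / 21:.3f}, after 10 rounds: {shares[-1]:.3f}")
```

A one-arrest head start is enough: neighborhood A's share of recorded arrests climbs from about 0.52 toward the patrol share of 0.70, while the true rates never differ.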

feedback loops are everywhere

credit scoring: low score → denied loans → no history → score stays low
recommendations: extreme content → engagement → more extreme
sepsis alerts: treat flagged patients earlier → fewer cases → model looks worse
betting markets: the line already encodes your signal

ask before you deploy

will the predictions change the data distribution?

safe	feedback loop
weather forecasts	credit scores
particle-physics models	sepsis alerts

forecasters don’t change the weather; credit scores change who gets loans

Weapons of Math Destruction

Photo: GRuban / Wikimedia Commons, CC BY-SA 4.0

three traits together make a WMD:

outcome not easily measurable
negative consequences for individuals
self-fulfilling feedback loop

Cathy O’Neil, Weapons of Math Destruction (2016)

WMD or not?

college rankings: high rank attracts students, faculty, donors → raises quality → validates the rank
parole risk: longer prison can raise reoffense odds (lost job, networks) → confirms the prediction
weather, particle physics: not WMDs

is this a Weapon of Math Destruction?

a model predicts which students will fail a course; the university uses it to assign tutoring.

check the three traits:

is the outcome (failing) measurable?
could the prediction harm students?
does it create a loop?

“when a measure becomes a target,
it ceases to be a good measure”

Goodhart 1975; popularized by Strathern

Goodhart’s law is game-theoretic

attach consequences to a metric M
the agents being measured respond
those best at gaming M benefit disproportionately

three ingredients:

a proxy M for the goal G we care about
consequences tied to M (money, status, survival)
agents who can move M, and differ in how well

emissions testing

VW software detected the EPA lab cycle
full emissions controls only during the test
on the road: NOx up to ~40× the legal limit

optimized the test exactly. real-world emissions moved the opposite way

Photo: Mario R. Duran Ortiz / Wikimedia Commons, CC BY-SA 3.0; EPA Notice of Violation 2015

school accountability tests

when evaluations hinge on scores:

Chicago: answer-altering in ≥ 4–5% of classrooms
Florida: weak students reclassified as test-exempt

the score rises without learning rising

Photo: dcJohn / Wikimedia Commons, CC BY 2.0; Jacob & Levitt 2003, Figlio & Getzler 2002

the same shape, everywhere

hospital readmissions: hold patients in “observation,” divert to the ED
citations & h-index: coercive citation, citation cartels
p-hacking: try analyses until one clears p < 0.05

attach stakes → behavior changes → the numbers stop meaning what they did

Wadhera et al. 2018; Wilhite & Fong 2012; Fister et al. 2016; Simmons et al. 2011

p-hacking

Head et al., PLOS Biology 2015, CC BY 4.0

p < 0.05 meant to gauge evidence
careers depend on clearing it
researchers exploit “degrees of freedom”

a tell-tale excess of p-values just below 0.05
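
A small simulation of why trying many analyses inflates false positives. It assumes the analyses are independent z-tests on pure noise with known unit variance, which is a simplification: real researcher degrees of freedom are usually correlated.

```python
import math
import random

def two_sided_p(sample):
    """z-test p-value for mean 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_analyses, n_sims=2000, n_obs=20, seed=0):
    """Share of null studies where at least one analysis clears p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_analyses)
        ]
        hits += min(p_values) < 0.05
    return hits / n_sims

honest = false_positive_rate(n_analyses=1)
hacked = false_positive_rate(n_analyses=20)
print(f"one analysis: {honest:.2f}, best of 20: {hacked:.2f}")
```

With one pre-specified analysis the rate sits near 0.05. With 20 independent tries and only the best one reported, the expected rate is 1 - 0.95^20, about 0.64, even though there is no effect anywhere.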

you have already seen Goodhart’s law

Chapter 6: train loss falls while test loss rises
a proxy (training loss) optimized so hard it stops tracking the goal (generalization)

same mechanism as VW and the hospitals, only the cast changes

there: people adapt after the metric carries stakes
in ML: the single agent is the learning algorithm, memorizing noise

cross-validation resists it, but only partly

separate train / validation / test
use the test set once, after all decisions are locked

but tune enough against the validation set, and it stops measuring generalization

and ML sees only one side: real Goodhart is dynamic, agents respond after deployment
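
The validation-set version of Goodhart can be sketched with pure noise. Everything here is hypothetical: labels are coin flips and each "model" is a random guesser, so no candidate can truly beat 50% accuracy.

```python
import random

def accuracy(pred, truth):
    """Fraction of predictions that match the labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

rng = random.Random(0)
n_val, n_test, n_models = 100, 1000, 200

# Labels are coin flips: there is nothing to learn.
y_val = [rng.randint(0, 1) for _ in range(n_val)]
y_test = [rng.randint(0, 1) for _ in range(n_test)]

best_val_acc, test_acc_of_best = 0.0, 0.0
for _ in range(n_models):
    # Each hypothetical "model" guesses at random on both sets.
    val_pred = [rng.randint(0, 1) for _ in range(n_val)]
    test_pred = [rng.randint(0, 1) for _ in range(n_test)]
    val_acc = accuracy(val_pred, y_val)
    if val_acc > best_val_acc:
        # Select on validation accuracy, exactly as tuning does.
        best_val_acc = val_acc
        test_acc_of_best = accuracy(test_pred, y_test)

print(f"best validation accuracy: {best_val_acc:.2f}, its test accuracy: {test_acc_of_best:.2f}")
```

The winner looks better than chance on the validation set only because it was selected there; the untouched test set puts it back near 0.5.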

game the metric

your hospital is paid on its 30-day readmission rate.

name a way to lower the measured rate without treating anyone better. what hidden check would catch you?

the Goodhart question

whenever a prediction drives a decision, ask:

can the agents move this metric without advancing the goal?
who’s best positioned to? are they who the policy meant to reward?

defenses:

audit with a hidden second metric
cap how often anyone updates against it
randomize a holdout

what to do

before deployment

split by the right axis: time? person? ask what generalization you need
simulate the deployed action: score what the prediction causes, not just the prediction
audit proxy vs. outcome: write down the goal each metric stands for
stress-test the tails: build out-of-distribution cases before deployment
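
When the axis is "person", the split holds out whole individuals rather than rows. A minimal sketch with hypothetical (customer_id, month) rows; `group_split` is an illustrative helper, not a library function.

```python
import random

def group_split(rows, key, test_frac=0.2, seed=0):
    """Hold out whole groups so no group appears on both sides of the split."""
    groups = sorted({key(r) for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if key(r) not in test_groups]
    test = [r for r in rows if key(r) in test_groups]
    return train, test

# Hypothetical data: 10 customers observed for 6 months each.
rows = [(customer, month) for customer in range(10) for month in range(6)]
train_rows, test_rows = group_split(rows, key=lambda r: r[0])

train_ids = {r[0] for r in train_rows}
test_ids = {r[0] for r in test_rows}
print(len(train_rows), len(test_rows), train_ids & test_ids)
```

A row-level random split would put the same customer's other months in training, so the test score would measure recall of known customers rather than generalization to new ones.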

after deployment

monitor drift: track performance against fresh ground truth
hold out a control: a sample where the model doesn’t decide
A/B test major changes: randomize to measure impact causally

A/B testing is the single most effective defense against feedback-loop failures
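
One common way to carve out a control sample is stable hash-based assignment. The sketch below is an assumption about how this could be done, not the course's machinery: the salt, the 5% fraction, and the action labels are all hypothetical.

```python
import hashlib

def in_holdout(unit_id, holdout_frac=0.05, salt="lec16-holdout"):
    """Stable pseudo-random assignment: the same ID always lands in the same arm."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # roughly uniform in [0, 1)
    return bucket < holdout_frac

def decide(unit_id, model_score, threshold=0.5):
    """Holdout units get the default action, so the model does not shape their outcomes."""
    if in_holdout(unit_id):
        return "default"
    return "act" if model_score >= threshold else "skip"

holdout_share = sum(in_holdout(i) for i in range(10_000)) / 10_000
print(f"holdout share: {holdout_share:.3f}")
```

Because assignment does not depend on the model's score, outcomes in the holdout give fresh ground truth that the feedback loop has not touched.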

machinery: Chapter 19

what we covered

temporal leakage → walk-forward validation, not random splits
distribution shift → no split fixes a regime you never saw
feedback loops → randomized holdouts, not better splits
WMD → unmeasurable + harmful + self-fulfilling
Goodhart → metric becomes target; overfitting is the special case

next: working with AI

we can build, validate, and stress-test models.

can AutoML and LLMs do this for us?

Chapter 17: a 15-item checklist for any analysis
items #13 (Goodhart) and #14 (feedback loops) point back here

feedback

forms.gle/feedback

what worked? what didn’t? what’s still confusing?