Lecture 16: When validation isn’t enough

MSE 125 — Applied Statistics

Madeleine Udell

Wednesday, May 20, 2026

logistics

  • HW 4 (HMDA mortgage challenge) due Mon May 25
  • project final: keep building; group work reminder
  • today: validation failures in deployment

you shipped a model that passed validation

  • it cleared every train/test check
  • in deployment it quietly falls apart

what did the random split miss?

agenda

  • temporal leakage
  • distribution shift
  • feedback loops
  • Goodhart’s law
  • what to do

validation rested on three assumptions

from Chapter 6:

  • rows are exchangeable: order doesn’t matter
  • deployment matches training
  • predictions don’t affect outcomes

this lecture: when each one breaks

four ways the assumptions break

failure mode        what happens                                        fix
---------------------------------------------------------------------------------------------------------
temporal leakage    random split leaks information about the test set  train past, test future
distribution shift  deployment regime never seen                        monitor accuracy after deployment
feedback loops      prediction changes the outcome                      randomized holdout
Goodhart’s law      metric becomes a target                             audit proxy vs. goal

temporal leakage

the data: US monthly retail sales

US Census / FRED series RSAFSNA, not seasonally adjusted

  • long-run trend: sales grow with population, prices
  • annual cycle: sharp December spike, January trough
  • short-range autocorrelation: this month ≈ last month

non-stationarity

non-stationary series

a time series whose mean or variance changes over time

here: retail sales climbs year after year, and the December spike grows with it

non-stationarity is what makes time-series validation different from the i.i.d. (independent, identically distributed) world of Chapter 6

scoring a forecast: MAE

mean absolute error (MAE)

\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|

average size of the forecast error, in the units of the target (here, dollars). lower is better
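
a minimal sketch of the MAE formula in plain Python. the function name and the toy numbers are illustrative assumptions, not the retail series.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute forecast error, in the units of the target."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

# toy numbers in $B (hypothetical)
actual = [100.0, 110.0, 150.0]
predicted = [90.0, 115.0, 140.0]

# absolute errors are 10, 5, 10, so the MAE is 25 / 3
print(mean_absolute_error(actual, predicted))
```

because errors are not squared, one large miss counts in proportion to its size rather than dominating the score.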

the bar to clear: naive baseline

  • strongly seasonal series → same month last year: \hat{y}_t = y_{t-12}
  • fits no parameters, uses no features
Naive (same month last year) MAE, full series: $20.3B

any model that can’t beat this isn’t earning its complexity
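
the baseline can be sketched in a few lines. this is an illustration on a synthetic series (linear trend plus a December spike), so the numbers are hypothetical and do not reproduce the $20.3B figure.

```python
def seasonal_naive(series, season=12):
    """Pair each value with the value one season earlier (the forecast)."""
    actual = series[season:]
    forecast = series[:-season]
    return actual, forecast

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# synthetic monthly series: +0.5 per month, +30 every twelfth month
series = [100 + 0.5 * t + (30 if t % 12 == 11 else 0) for t in range(48)]

actual, forecast = seasonal_naive(series)
# the spike cancels; what is left is one year of trend growth (12 * 0.5)
print(mae(actual, forecast))  # 6.0
```

the seasonal pattern is captured for free; the baseline's error comes from whatever changed over the intervening year.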

same data, two ways to split

  • random: shuffle all months, hold out 20%
  • temporal: train early years, test later ones

data leakage

data leakage (temporal)

information unavailable at prediction time contaminates training, producing optimistically biased scores

here: a random split puts a 2010 test month’s immediate neighbors (the months just before and after) into the training set
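
one way to see the leak is to count how many test months have an adjacent month in the training set. a sketch under assumptions: 240 months of data, a 20% holdout, and a helper name (`neighbor_share`) invented for this illustration.

```python
import random

def neighbor_share(train_idx, test_idx):
    """Share of test months with month t-1 or t+1 in the training set."""
    train = set(train_idx)
    hits = sum(1 for t in test_idx if (t - 1) in train or (t + 1) in train)
    return hits / len(test_idx)

n_months = 240                      # 20 years, hypothetical length
months = list(range(n_months))
n_test = n_months // 5

# random split: shuffle all months, hold out 20%
rng = random.Random(0)
shuffled = months[:]
rng.shuffle(shuffled)
random_test, random_train = shuffled[:n_test], shuffled[n_test:]

# temporal split: train on the first 80%, test on the last 20%
temporal_train, temporal_test = months[:-n_test], months[-n_test:]

print(neighbor_share(random_train, random_test))      # most test months
print(neighbor_share(temporal_train, temporal_test))  # 1/48: only the first one
```

under the random split nearly every test month is bracketed by training months; under the temporal split only the first test month touches the training period.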

three forecasts, three kinds of structure

lag features

predictors built from the series’ own past values

  • recent lags: last month, last quarter’s average → momentum
  • linear trend: a straight line through time → long-run growth
  • seasonal: sine/cosine of the month + the year-ago value → annual cycle

why sine/cosine, not a dummy per month? monthly dummies cost 11 parameters (12 months minus a reference month) and overfit a short history. two Fourier features, sin and cos, cost 2 parameters and impose one smooth annual shape
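
a sketch of how the three feature families might be built for one row. the feature names are illustrative, and the code assumes index 0 is a January so that month-of-year is `t % 12`.

```python
import math

def build_features(series, t):
    """Features for predicting series[t] using only values before t.
    Assumes t >= 12 so every lag exists."""
    angle = 2 * math.pi * (t % 12) / 12
    return {
        "lag1": series[t - 1],               # last month (momentum)
        "lag_q": sum(series[t - 3:t]) / 3,   # last quarter's average
        "trend": t,                          # linear time index
        "sin12": math.sin(angle),            # annual cycle, Fourier pair
        "cos12": math.cos(angle),
        "lag12": series[t - 12],             # year-ago value
    }

series = list(range(100, 124))   # toy values, hypothetical
row = build_features(series, 12)
print(row)
```

every feature at time t is computed from t or from earlier values, so the row itself carries no information from the future.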

predict before you see the answer

three forecasting models: recent lags, linear trend, seasonal

all score R² > 0.9 on a random split.

which still perform well when we train on the past and test on the future?

the reveal: random hides what temporal exposes

Model            random R²  temporal R²  temporal MAE
-----------------------------------------------------
Recent lags          0.907        0.378        $27.2B
Linear trend         0.933        0.446        $26.1B
Seasonal             0.975        0.936         $9.3B
Naive baseline           —        0.847        $15.9B
  • random split: all three excellent
  • temporal split: recent-lags & trend fall below the naive baseline
  • same models, same data: only the split changed

R² closer to 1 is better; MAE in $B, lower is better

why does only the seasonal model validate well?

  • random test month sits between its neighbors: easy to interpolate
  • temporal test month: training data ends years earlier
    • trend extrapolates a line → falls behind as growth compounds
    • recent lags: only short-range structure to lean on
    • seasonal: carries the year-ago value + a stable annual shape

leakage bites the models whose accuracy needs neighbors nearby in time

walk-forward validation

walk-forward validation

repeat the temporal split at many cut points: train on an expanding window, test the next period, step forward

Fold 1: [===== TRAIN =====][TEST]
Fold 2: [======= TRAIN =======][TEST]
Fold 3: [========= TRAIN =========][TEST]
  • expanding window keeps all history; sliding uses a fixed width
  • called backtesting in finance
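
a minimal sketch of the fold generator, in plain Python. the function and parameter names are assumptions for illustration; libraries offer comparable splitters.

```python
def walk_forward_splits(n, initial_train, horizon, step=None, window=None):
    """Yield (train_indices, test_indices) for walk-forward validation.

    window=None gives an expanding window; an integer gives a sliding
    window of that fixed width.
    """
    step = step or horizon
    end = initial_train
    while end + horizon <= n:
        start = 0 if window is None else max(0, end - window)
        yield list(range(start, end)), list(range(end, end + horizon))
        end += step

for train, test in walk_forward_splits(n=10, initial_train=4, horizon=2):
    print(train[0], train[-1], "->", test)
# 0 3 -> [4, 5]
# 0 5 -> [6, 7]
# 0 7 -> [8, 9]
```

every training index precedes every test index in its fold, which is the property a random split lacks. scoring each fold separately, rather than averaging, shows when the model breaks.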

when does the model break?

  • low error through the calm years
  • spikes: 2008–09 crisis, far larger 2020 pandemic
  • a single average MAE would bury these regime changes

split by the right axis

  • a churn model predicts which customers will cancel
  • data: one row per customer per month (each customer recurs many times)

split at random by row? or some other way, and why?
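
one possible answer, sketched under assumptions: if the goal is to generalize to customers the model has not seen, hold out whole customers so that no customer appears on both sides. the row layout and function name are hypothetical.

```python
import random

def split_by_group(rows, group_key, test_fraction=0.2, seed=0):
    """Hold out whole groups so no group appears in both train and test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# 10 customers, 6 monthly rows each (synthetic)
rows = [{"customer": c, "month": m} for c in range(10) for m in range(6)]
train, test = split_by_group(rows, "customer")
overlap = {r["customer"] for r in train} & {r["customer"] for r in test}
print(len(train), len(test), overlap)  # 48 12 set()
```

if the goal is instead to predict the same customers in future months, a temporal split is the matching choice.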

match the model to the structure

decompose: trend × seasonal × residual

  • multiplicative: December is the same percentage above trend each year
  • seasonal multiplier: Dec 1.147 (about 15% above trend), Feb 0.891 (about 11% below)
  • residual small except the COVID crater

Holt-Winters: let the structure adapt

  • tracks level, trend, seasonal: each re-estimated as data arrives
  • three smoothing parameters \alpha, \beta, \gamma \in [0,1]: how fast each adapts (near 1 chases recent data, near 0 barely moves)
  • state only ever looks backward → extends naturally into the future
alpha (level)    = 0.502
beta  (trend)    = 0.000
gamma (seasonal) = 0.204

\beta \approx 0: the growth rate barely changes year to year
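
a sketch of the update equations, assuming the additive-trend, multiplicative-seasonal variant (the slides do not say which variant the notebook uses) and a crude initialization from the first two seasons. in practice a library would also fit \alpha, \beta, \gamma; here they are passed in.

```python
def holt_winters_multiplicative(y, m, alpha, beta, gamma, horizon):
    """Additive trend, multiplicative seasonality. Needs len(y) >= 2 * m."""
    # crude initial state from the first two seasons (one of many options)
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    level, trend = first, (second - first) / m
    seasonal = [y[i] / first for i in range(m)]

    for t in range(m, len(y)):
        s_old = seasonal[t % m]
        prev_level, prev_trend = level, trend
        # each component blends the new observation with its old estimate
        level = alpha * (y[t] / s_old) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        seasonal[t % m] = (gamma * (y[t] / (prev_level + prev_trend))
                           + (1 - gamma) * s_old)

    n = len(y)
    return [(level + h * trend) * seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]

# synthetic quarterly series: flat level 100, fixed seasonal shape
y = [80.0, 100.0, 120.0, 100.0] * 3
print(holt_winters_multiplicative(y, m=4, alpha=0.5, beta=0.0, gamma=0.2, horizon=4))
```

on this stationary toy series the forecast simply repeats the seasonal shape. with \beta = 0 the trend term never moves from its initial estimate.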

demo: forecasting retail sales

switch to the notebook

  • naive vs linear regression vs Holt-Winters
  • 24-month forecast (2018–19), trained through 2017

colab: lec16-feedback-loops.ipynb

the forecast, and the verdict

Model                                R²       MAE
Naive (lag12)                     0.672    $18.2B
Linear regression (lag12+trend)   0.894     $9.5B
Holt-Winters                      0.907     $9.0B

Holt-Winters wins by tracking the December multiplier. but a better model still doesn’t fix validation: COVID breaks any fit trained through 2019

distribution shift

the data: California daily AQI

retail had clean repeating structure. now a series with almost none

  • LA, Sacramento: seasonal wildfire spikes on baseline pollution
  • Mono County: dust storms push AQI above 8,000
  • AQI: 0 clean → 150 unhealthy → 500 hazardous

predict the failure

train a linear model on yesterday’s AQI, using only normal days (AQI < 300).

a dust storm hits: actual AQI = 8,000.

does it get close, or badly miss? roughly what does it predict?

the model flatlines

selected extreme days
Actual AQI    Predicted    Error
      7835           30     7805
      3404           30     3374
      1196           29     1167

coefficient on yesterday’s AQI ≈ −0.0004
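
a worked check of why the predictions barely move. the slope is the value above; the intercept of 30 is an assumed round number chosen to match the predictions in the table, not a fitted value.

```python
def predict(yesterday_aqi, intercept=30.0, slope=-0.0004):
    """Linear model: prediction = intercept + slope * yesterday's AQI.
    intercept=30.0 is an assumption for illustration."""
    return intercept + slope * yesterday_aqi

for yesterday in (20, 150, 299, 8000):
    print(yesterday, round(predict(yesterday), 1))
```

across the whole training range (AQI < 300) the prediction shifts by about 0.1. even feeding in yesterday = 8,000 moves it by only 3.2 points: with a slope this close to zero, the model predicts roughly the training mean whatever it is given.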

distribution shift

distribution shift

future data comes from a different regime than the training data: a qualitative change the training set never captured

the events that matter most (wildfires, pandemics, crashes) are the ones your model has never seen

no cross-validation fixes this: it holds out points that look like training data

“prediction is very difficult, especially about the future” (attr. Bohr)

report a range, not a point

  • bootstrap from Chapter 8: resample training residuals, add to each forecast
  • works for any model: 95% interval here, coverage 93.7%, width 89.5 AQI
  • widens for ordinary noise, but blind to regimes never seen (Mono blew past it)
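
a minimal sketch of the residual bootstrap, assuming a percentile interval. the residuals here are synthetic, so the interval does not reproduce the coverage or width quoted above.

```python
import random

def bootstrap_interval(point_forecast, residuals, level=0.95, n_boot=2000, seed=0):
    """Percentile interval: resample training residuals, add to the forecast."""
    rng = random.Random(seed)
    sims = sorted(point_forecast + rng.choice(residuals) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return sims[lo_idx], sims[hi_idx]

residuals = list(range(-50, 51))          # synthetic training residuals
lo, hi = bootstrap_interval(60.0, residuals)
print(lo, hi)
print(8000 > hi)                          # True: a dust-storm day is far outside
```

the interval can never be wider than the residuals it resamples, which is why it stays blind to a regime the training data never contained.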

when predictions change the world

predictive policing

Photo: Mr. Satterly / Wikimedia Commons, CC0

  • model flags neighborhood as high-crime
  • police patrol it harder
  • more patrols → more arrests
  • model retrains, “confirms” itself

the model creates the data that justifies it

a causal arrow runs backward

prediction → action → outcome → new training data → repeat
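
a toy simulation of the loop, under strong assumptions: two neighborhoods with identical true crime rates, a model that flags whichever has more recorded arrests, and recorded arrests proportional to patrols. all numbers are hypothetical.

```python
def simulate_patrol_loop(rounds=10, true_rate=(0.1, 0.1), recorded=(11.0, 10.0),
                         patrols=100, hot_share=0.7):
    """Each round: flag the neighborhood with more recorded arrests, send it
    hot_share of the patrols, then record arrests = patrols * true rate."""
    recorded = list(recorded)
    for _ in range(rounds):
        hot = 0 if recorded[0] >= recorded[1] else 1   # prediction
        for i in range(2):
            share = hot_share if i == hot else 1 - hot_share   # action
            recorded[i] += patrols * share * true_rate[i]      # outcome
    return recorded                                    # new training data

recorded = simulate_patrol_loop()
print(recorded, recorded[0] / sum(recorded))
```

a one-arrest head start grows into roughly 81 recorded arrests versus 40, although the true rates never differed. the records measure where patrols went, not where crime is.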

feedback loops are everywhere

  • credit scoring: low score → denied loans → no history → score stays low
  • recommendations: extreme content → engagement → more extreme
  • sepsis alerts: treat flagged patients earlier → fewer cases → model looks worse
  • betting markets: the line already encodes your signal

ask before you deploy

will the predictions change the data distribution?

safe                      feedback loop
---------------------------------------
weather forecasts         credit scores
particle-physics models   sepsis alerts

forecasters don’t change the weather; credit scores change who gets loans

Weapons of Math Destruction

Photo: GRuban / Wikimedia Commons, CC BY-SA 4.0

three traits together make a WMD:

  1. outcome not easily measurable
  2. negative consequences for individuals
  3. self-fulfilling feedback loop

Cathy O’Neil, Weapons of Math Destruction (2016)

WMD or not?

  • college rankings: high rank attracts students, faculty, donors → raises quality → validates the rank
  • parole risk: longer prison can raise reoffense odds (lost job, networks) → confirms the prediction
  • weather, particle physics: not WMDs

is this a Weapon of Math Destruction?

a model predicts which students will fail a course; the university uses it to assign tutoring.

check the three traits:

  1. is the outcome (failing) measurable?
  2. could the prediction harm students?
  3. does it create a loop?

“when a measure becomes a target,
it ceases to be a good measure”

Goodhart 1975; popularized by Strathern

Goodhart’s law is game-theoretic

  • attach consequences to a metric M
  • the agents being measured respond
  • those best at gaming M benefit disproportionately

three ingredients:

  1. a proxy M for the goal G we care about
  2. consequences tied to M (money, status, survival)
  3. agents who can move M, and differ in how well

emissions testing

  • VW software detected the EPA lab cycle
  • full emissions controls only during the test
  • on the road: NOx up to ~40× the legal limit

optimized the test exactly. real-world emissions moved the opposite way

Photo: Mario R. Duran Ortiz / Wikimedia Commons, CC BY-SA 3.0; EPA Notice of Violation 2015

school accountability tests

when evaluations hinge on scores:

  • Chicago: answer-altering in ≥ 4–5% of classrooms
  • Florida: weak students reclassified as test-exempt

the score rises without learning rising

Photo: dcJohn / Wikimedia Commons, CC BY 2.0; Jacob & Levitt 2003, Figlio & Getzler 2002

the same shape, everywhere

  • hospital readmissions: hold patients in “observation,” divert to the ED
  • citations & h-index: coercive citation, citation cartels
  • p-hacking: try analyses until one clears p < 0.05

attach stakes → behavior changes → the numbers stop meaning what they did

Wadhera et al. 2018; Wilhite & Fong 2012; Fister et al. 2016; Simmons et al. 2011

p-hacking

Head et al., PLOS Biology 2015, CC BY 4.0

  • p < 0.05 meant to gauge evidence
  • careers depend on clearing it
  • researchers exploit “degrees of freedom”

a tell-tale excess of p-values just below 0.05
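
a back-of-envelope sketch of why trying analyses works so well, assuming the analyses are independent tests of pure noise. real analyses of one dataset are correlated, so the exact numbers will differ.

```python
def chance_of_false_positive(n_analyses, alpha=0.05):
    """P(at least one p < alpha) across independent tests with no real effect."""
    return 1 - (1 - alpha) ** n_analyses

for k in (1, 5, 20):
    print(k, round(chance_of_false_positive(k), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

under this assumption, twenty tries make a "significant" result more likely than not even when nothing is there.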

you have already seen Goodhart’s law

  • Chapter 6: train loss falls while test loss rises
  • a proxy (training loss) optimized so hard it stops tracking the goal (generalization)

same mechanism as VW and the hospitals; only the cast changes

  • there: people adapt after the metric carries stakes
  • in ML: the single agent is the learning algorithm, memorizing noise

cross-validation resists it, but only partly

  • separate train / validation / test
  • use the test set once, after all decisions are locked

but tune enough against the validation set, and it stops measuring generalization
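
a small simulation of this effect, assuming the extreme case where every candidate "model" is a coin flip with no skill. selecting the best of many on the validation set still produces an impressive validation score.

```python
import random

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(0)
n_val, n_test, n_models = 100, 10_000, 200
val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

# "tune" by keeping whichever coin-flip model scores best on validation
best_val, best_seed = -1.0, None
for seed in range(n_models):
    model_rng = random.Random(1000 + seed)
    val_preds = [model_rng.randint(0, 1) for _ in range(n_val)]
    val_acc = accuracy(val_preds, val_labels)
    if val_acc > best_val:
        best_val, best_seed = val_acc, seed

# score the selected model once on the untouched test set
test_rng = random.Random(5000 + best_seed)
test_preds = [test_rng.randint(0, 1) for _ in range(n_test)]
test_acc = accuracy(test_preds, test_labels)

print(best_val, test_acc)
```

the winning validation accuracy sits well above 0.5 while the test accuracy stays near 0.5: the gap is the price of selecting on the validation set.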

and ML sees only one side: real Goodhart is dynamic, agents respond after deployment

game the metric

your hospital is paid on its 30-day readmission rate.

name a way to lower the measured rate without treating anyone better. what hidden check would catch you?

the Goodhart question

whenever a prediction drives a decision, ask:

  1. can the agents move this metric without advancing the goal?
  2. who’s best positioned to? are they who the policy meant to reward?

defenses:

  • audit with a hidden second metric
  • cap how often anyone updates against it
  • randomize a holdout
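
a sketch of one way to randomize a holdout: hash each unit's id so the assignment is stable across runs and independent of anything the model sees. the salt, fraction, and function name are assumptions for illustration.

```python
import hashlib

def in_holdout(unit_id, holdout_fraction=0.05, salt="holdout-v1"):
    """Deterministically assign a unit to the randomized holdout.
    Hashing salt + id gives a stable pseudo-random draw in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    draw = int(digest[:8], 16) / 16**8
    return draw < holdout_fraction

share = sum(in_holdout(i) for i in range(10_000)) / 10_000
print(share)                            # close to 0.05
print(in_holdout(42) == in_holdout(42))  # True: assignment is repeatable
```

for units in the holdout the model's prediction is logged but not acted on, so their outcomes stay uncontaminated by the decision.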

what to do

before deployment

  • split by the right axis: time? person? ask what generalization you need
  • simulate the deployed action: score what the prediction causes, not just the prediction
  • audit proxy vs. outcome: write down the goal each metric stands for
  • stress-test the tails: build out-of-distribution cases before deployment

after deployment

  • monitor drift: track performance against fresh ground truth
  • hold out a control: a sample where the model doesn’t decide
  • A/B test major changes: randomize to measure impact causally
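
a minimal sketch of drift monitoring: compute a rolling MAE against fresh ground truth and flag windows that exceed a multiple of the validation-time baseline. the window size, tolerance, and data are hypothetical.

```python
def rolling_mae(actual, predicted, window=30):
    """MAE over each trailing window of `window` observations."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return [sum(errors[i - window:i]) / window
            for i in range(window, len(errors) + 1)]

def drift_alerts(rolling, baseline_mae, tolerance=1.5):
    """Indices of windows whose MAE exceeds tolerance * baseline."""
    return [i for i, m in enumerate(rolling) if m > tolerance * baseline_mae]

# synthetic stream: the model keeps predicting 10 after the world moves to 30
actual = [10.0] * 40 + [30.0] * 20
predicted = [10.0] * 60

rolling = rolling_mae(actual, predicted, window=10)
alerts = drift_alerts(rolling, baseline_mae=2.0)
print(len(rolling), alerts[0])  # 51 32
```

the first alert fires two observations after the shift, once the window holds enough bad errors to cross the threshold. a shorter window reacts faster but is noisier.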

A/B testing is the single most effective defense against feedback-loop failures

machinery: Chapter 19

what we covered

  • temporal leakage → walk-forward validation, not random splits
  • distribution shift → no split fixes a regime you never saw
  • feedback loops → randomized holdouts, not better splits
  • WMD → unmeasurable + harmful + self-fulfilling
  • Goodhart → metric becomes target; overfitting is a special case

next: working with AI

we can build, validate, and stress-test models.

can AutoML and LLMs do this for us?

  • Chapter 17: a 15-item checklist for any analysis
  • items #13 (Goodhart) and #14 (feedback loops) point back here

feedback

what worked? what didn’t? what’s still confusing?