MSE 125 — Applied Statistics
Wednesday, May 20, 2026
what did the random split miss?
from Chapter 6:
this lecture: when each one breaks
| failure mode | what happens | fix |
|---|---|---|
| temporal leakage | random split leaks information about the test set | train past, test future |
| distribution shift | deployment regime never seen | monitor accuracy after deployment |
| feedback loops | prediction changes the outcome | randomized holdout |
| Goodhart’s law | metric becomes a target | audit proxy vs. goal |
temporal leakage
US Census / FRED series RSAFSNA, not seasonally adjusted
non-stationary series
a time series whose mean or variance changes over time
here: retail sales climbs year after year, and the December spike grows with it
non-stationarity is what makes time-series validation different from the i.i.d. (independent, identically distributed) world of Chapter 6
mean absolute error (MAE)
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|$$
average size of the forecast error, in the units of the target (here, dollars). lower is better
Naive (same month last year) MAE, full series: $20.3B
any model that can’t beat this isn’t earning its complexity
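a minimal sketch of the MAE formula and the same-month-last-year baseline. the series is synthetic (made-up trend plus a December spike), not the RSAFSNA data, and the helper names `mae` and `seasonal_naive` are illustrative:

```python
def mae(actual, predicted):
    """Mean absolute error: average of |y_i - yhat_i|, in the units of the target."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def seasonal_naive(series, period=12):
    """Forecast each month with the value from `period` months earlier.

    Returns (actuals, forecasts) for the months that have a year-ago value.
    """
    return series[period:], series[:-period]


# Synthetic monthly series (illustrative only, NOT the retail data):
# a linear trend plus a December spike that grows with the level.
series = []
for t in range(120):
    level = 300.0 + 1.5 * t
    spike = 0.25 * level if t % 12 == 11 else 0.0
    series.append(level + spike)

actuals, forecasts = seasonal_naive(series)
baseline_mae = mae(actuals, forecasts)
print(f"seasonal-naive MAE: {baseline_mae:.3f}")
```

any candidate model can be scored with the same `mae` call on the same months, which makes the comparison against the baseline direct.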
data leakage (temporal)
information unavailable at prediction time contaminates training, producing optimistically biased scores
here: a random split puts a 2010 test month’s immediate neighbors (the months just before and after) into the training set
lag features
predictors built from the series’ own past values
why sine/cosine, not a dummy per month? monthly dummies cost 11 parameters (12 months minus a baseline) and overfit a short history. two Fourier features, sin and cos, cost 2 and impose one smooth annual shape
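a sketch of how lag and Fourier features could be built. the lag choice `(1, 2, 12)` and the function name `build_features` are assumptions for illustration, not the lecture's feature set:

```python
import math


def build_features(series, lags=(1, 2, 12), period=12):
    """Return (X, y). Each row of X uses only values strictly before time t."""
    max_lag = max(lags)
    X, y = [], []
    for t in range(max_lag, len(series)):
        # Lag features: the series' own past values.
        lag_values = [series[t - k] for k in lags]
        # Two Fourier features encode position in the annual cycle.
        angle = 2.0 * math.pi * (t % period) / period
        fourier = [math.sin(angle), math.cos(angle)]
        X.append(lag_values + fourier)
        y.append(series[t])
    return X, y


toy = [float(v) for v in range(100, 136)]  # 36 made-up monthly values
X, y = build_features(toy)
print(len(X), "rows,", len(X[0]), "features per row")
```

the first 12 months are dropped because they have no 12-month lag. the seasonal part adds 2 columns regardless of how long the history is.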
predict before you see the answer
three forecasting models: recent lags, linear trend, seasonal
all score R² > 0.9 on a random split.
which still perform well when we train on the past and test on the future?
| Model | random R² | temporal R² | temporal MAE |
|---|---|---|---|
| Recent lags | 0.907 | 0.378 | $27.2B |
| Linear trend | 0.933 | 0.446 | $26.1B |
| Seasonal | 0.975 | 0.936 | $9.3B |
| Naive baseline | — | 0.847 | $15.9B |
R² closer to 1 is better; MAE in $B, lower is better
leakage bites the models whose accuracy needs neighbors nearby in time
walk-forward validation
repeat the temporal split at many cut points: train on an expanding window, test the next period, step forward
Fold 1: [===== TRAIN =====][TEST]
Fold 2: [======= TRAIN =======][TEST]
Fold 3: [========= TRAIN =========][TEST]
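the fold diagram can be sketched as an index generator. the name `walk_forward_splits` and the window sizes are hypothetical choices for illustration:

```python
def walk_forward_splits(n_obs, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) with an expanding training window."""
    step = horizon if step is None else step
    train_end = initial_train
    while train_end + horizon <= n_obs:
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + horizon))
        yield train_idx, test_idx
        train_end += step


# 60 monthly observations: start with 3 years of training, test 1 year at a time.
splits = list(walk_forward_splits(n_obs=60, initial_train=36, horizon=12))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index precedes every test index, so nothing leaks backward.
    assert max(train_idx) < min(test_idx)
    print(f"Fold {fold}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

scikit-learn's `TimeSeriesSplit` provides a similar expanding-window splitter, if that library is available.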
split by the right axis
split at random by row? or some other way, and why?
match the model to the structure
alpha (level) = 0.502
beta (trend) = 0.000
gamma (seasonal) = 0.204
$\beta \approx 0$: the growth rate barely changes year to year
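for reference, one standard form of the Holt-Winters updates, with level $\ell_t$, trend $b_t$, seasonal factor $s_t$, and season length $m = 12$. this assumes multiplicative seasonality, which the slides do not state explicitly; the notebook's parameterization may differ:

$$
\begin{aligned}
\ell_t &= \alpha\,\frac{y_t}{s_{t-m}} + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,\frac{y_t}{\ell_{t-1} + b_{t-1}} + (1-\gamma)\,s_{t-m} \\
\hat{y}_{t+h} &= (\ell_t + h\,b_t)\,s_{t+h-m}, \qquad 1 \le h \le m
\end{aligned}
$$

setting $\beta = 0$ in the second line gives $b_t = b_{t-1}$: the trend slope stays at its initial value, which is what a near-zero fitted $\beta$ means in this form.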
switch to the notebook
colab: lec16-feedback-loops.ipynb
| Model | R² | MAE |
|---|---|---|
| Naive (lag12) | 0.672 | $18.2B |
| Linear regression (lag12+trend) | 0.894 | $9.5B |
| Holt-Winters | 0.907 | $9.0B |
Holt-Winters wins by tracking the December multiplier. but a better model still doesn’t fix validation: COVID breaks any fit trained through 2019
distribution shift
retail had clean repeating structure. now a series with almost none
predict the failure
train a linear model on yesterday’s AQI, using only normal days (AQI < 300).
a dust storm hits: actual AQI = 8,000.
does it get close, or badly miss? roughly what does it predict?
selected extreme days
| Actual AQI | Predicted | Error |
|---|---|---|
| 7835 | 30 | 7805 |
| 3404 | 30 | 3374 |
| 1196 | 29 | 1167 |
coefficient on yesterday’s AQI ≈ −0.0004
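a small simulation of the same failure. the data are synthetic: made-up "normal" days with no day-to-day dependence, chosen to mimic a near-zero coefficient. these are not the lecture's AQI measurements, and `fit_line` is an illustrative helper:

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)
# Synthetic normal days, all far below 300 (assumed values, not real data).
aqi = [max(5.0, rng.gauss(30.0, 8.0)) for _ in range(1000)]
yesterday, today = aqi[:-1], aqi[1:]
intercept, slope = fit_line(yesterday, today)

# The storm arrives after an ordinary day, so the model's only input looks normal.
storm_actual = 8000.0
storm_prediction = intercept + slope * 35.0
error = storm_actual - storm_prediction
print(f"slope={slope:.4f}  prediction={storm_prediction:.1f}  error={error:.0f}")
```

with a slope near zero the model predicts roughly the training mean for every day, so its error on the storm day is almost the full size of the storm.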
distribution shift
future data comes from a different regime than the training data: a qualitative change the training set never captured
the events that matter most (wildfires, pandemics, crashes) are the ones your model has never seen
no cross-validation fixes this: it holds out points that look like training data
“prediction is very difficult, especially about the future” (attr. Bohr)

when predictions change the world

Photo: Mr. Satterly / Wikimedia Commons, CC0
. . .
the model creates the data that justifies it
prediction → action → outcome → new training data → repeat
will the predictions change the data distribution?
| safe | feedback loop |
|---|---|
| weather forecasts | credit scores |
| particle-physics models | sepsis alerts |
forecasters don’t change the weather; credit scores change who gets loans
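a toy simulation of the credit-score loop and of a randomized holdout. every number here (the repayment curve, the 0.6 threshold, the 5% holdout rate) is a made-up assumption for illustration:

```python
import random

rng = random.Random(1)


def repay_probability(score):
    """Hypothetical ground truth: higher scores repay more often (score in [0, 1])."""
    return 0.2 + 0.7 * score


applicants = [rng.random() for _ in range(20000)]
threshold = 0.6       # the model approves scores above this (assumed policy)
holdout_rate = 0.05   # fraction approved at random, regardless of score

approved_by_model, approved_by_holdout = [], []
for score in applicants:
    repaid = rng.random() < repay_probability(score)
    if rng.random() < holdout_rate:
        approved_by_holdout.append(repaid)  # outcome observed for a random sample
    elif score > threshold:
        approved_by_model.append(repaid)    # observed only because the model said yes
    # Everyone else is rejected, so their outcome never enters the data.


def rate(outcomes):
    return sum(outcomes) / len(outcomes)


true_rate = sum(repay_probability(s) for s in applicants) / len(applicants)
print(f"population repayment rate:       {true_rate:.3f}")
print(f"seen through model approvals:    {rate(approved_by_model):.3f}")
print(f"seen through randomized holdout: {rate(approved_by_holdout):.3f}")
```

the model's own approvals only ever show outcomes for people it already favored. the random holdout is the only slice of data that still describes the whole applicant pool.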

Photo: GRuban / Wikimedia Commons, CC BY-SA 4.0
three traits together make a WMD: opacity, scale, damage
Cathy O’Neil, Weapons of Math Destruction (2016)
is this a Weapon of Math Destruction?
a model predicts which students will fail a course; the university uses it to assign tutoring.
check the three traits:
“when a measure becomes a target,
it ceases to be a good measure”
Goodhart 1975; popularized by Strathern
three ingredients:

. . .
optimized the test exactly. real-world emissions moved the opposite way
Photo: Mario R. Duran Ortiz / Wikimedia Commons, CC BY-SA 3.0; EPA Notice of Violation 2015

when evaluations hinge on scores:
. . .
the score rises without learning rising
Photo: dcJohn / Wikimedia Commons, CC BY 2.0; Jacob & Levitt 2003, Figlio & Getzler 2002
attach stakes → behavior changes → the numbers stop meaning what they did
Wadhera et al. 2018; Wilhite & Fong 2012; Fister et al. 2016; Simmons et al. 2011

Head et al., PLOS Biology 2015, CC BY 4.0
. . .
a tell-tale excess of p-values just below 0.05
same mechanism as VW and the hospitals; only the cast changes
but tune enough against the validation set, and it stops measuring generalization
and ML sees only one side: real Goodhart is dynamic, agents respond after deployment
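a sketch of how selecting on the validation set inflates the validation score. the setup is deliberately extreme and entirely synthetic: labels are coin flips and every "model" is random guessing, so true accuracy is 50% for all of them:

```python
import random

rng = random.Random(2)
n_val, n_test, n_models = 100, 100, 1000

# Labels are coin flips, so no model can truly do better than 50%.
val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


best_val, best_test = -1.0, None
for _ in range(n_models):
    # Each "model" is pure noise: independent random guesses on both sets.
    val_preds = [rng.randint(0, 1) for _ in range(n_val)]
    test_preds = [rng.randint(0, 1) for _ in range(n_test)]
    val_acc = accuracy(val_preds, val_labels)
    if val_acc > best_val:
        # Keep the validation winner and record how it does on untouched test data.
        best_val, best_test = val_acc, accuracy(test_preds, test_labels)

print(f"best validation accuracy: {best_val:.2f}")
print(f"same model on test set:   {best_test:.2f}")
```

picking the best of many tries makes the validation number a target: the winner's validation score reflects the search, while its score on the untouched test set stays near chance.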
game the metric
your hospital is paid on its 30-day readmission rate.
name a way to lower the measured rate without treating anyone better. what hidden check would catch you?
whenever a prediction drives a decision, ask:
defenses:
what to do
A/B testing is the single most effective defense against feedback-loop failures
machinery: Chapter 19
we can build, validate, and stress-test models.
can AutoML and LLMs do this for us?
what worked? what didn’t? what’s still confusing?