Lecture 6: Validation and the Bias-Variance Tradeoff

MSE 125 — Applied Statistics

Madeleine Udell

Wednesday, April 15, 2026

the vendor promised high accuracy

the model missed 67% of cases

the epic sepsis model

67%

of sepsis cases missed

18%

of all hospitalizations got a false alarm

independent evaluation on 27,697 patients at Michigan Medicine

Wong et al., “External Validation of a Widely Implemented Proprietary Sepsis Prediction Model,” JAMA Internal Medicine, 2021

not an isolated failure

dermatology AI

  • 90%+ accuracy on light skin
  • as low as 17% on dark skin
  • test set didn’t match deployment population

Amazon hiring tool

  • high accuracy on historical data
  • systematically penalized résumés containing the word “women’s”
  • the model replicated historical bias

Epic and the dermatology AI failed by evaluating on data too similar to training

Amazon failed differently: training labels encoded historical discrimination — test R² would look fine; the labels were the problem

today

  • train/test split: honest evaluation on new data
  • bias-variance: why more isn’t always better
  • train/validate/test: choosing complexity fairly
  • regularization: lasso and ridge

training R² is not enough

the problem with training R²

training \(R^2\) measures how well the model memorizes the data

adding features never lowers training \(R^2\)

but does it improve predictions on new data?

polynomials, when they help

polynomials let the slope bend — without leaving linear regression

polynomial regression: still a linear model

\[\widehat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\]

  • nonlinear in \(x\)
  • linear in \((\beta_0, \beta_1, \beta_2, \beta_3)\)
  • just more columns in \(X\)
PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
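
a minimal sketch of the "just more columns" idea, on synthetic data (the cubic signal and noise level are assumptions for illustration, not the lecture's dataset): expand \(x\) into \([x, x^2, x^3]\), then fit ordinary least squares on the expanded columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))  # one raw feature x
# Assumed signal for illustration: y = 1 + 2x - 0.5x^3 + noise
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 3 + rng.normal(0, 0.1, size=200)

# Expand x into the columns [x, x^2, x^3]; LinearRegression adds the intercept.
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

# Still ordinary least squares: the model is linear in the coefficients.
model = LinearRegression().fit(X_poly, y)
print(X_poly.shape)
print(model.intercept_, model.coef_)
```

the fitted coefficients should land near the assumed values (1, 2, 0, −0.5), even though the fitted curve is nonlinear in \(x\).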

push the degree: the polynomial parade

training \(R^2\) climbs: 0.17 → 0.30 from degree 1 to 6

would you trust this?

it fits the data better and better!

what a great fit 🎉

…wait, is something wrong?

would you trust this model to predict a 7-bedroom listing?

training \(R^2\) says yes. your eye says no.

the fix: hold out test data

test \(R^2\) can go down when the model overfits

how much to hold out?

common defaults: 70/30 or 80/20 train/test

the tradeoff:

  • more training data → model learns more reliably
  • more test data → performance estimate is more stable

with abundant data (tens of thousands), the split matters less

with small data (hundreds), cross-validation (coming up next) is better than a single split

train R² vs test R²

train \(R^2\): computed on the data used to fit the model

  • measures how well the model explains what it has seen
  • always increases (or stays the same) with more features

test \(R^2\): computed on held-out data

  • measures how well the model predicts new observations
  • can decrease if the model overfits

the gap between them reveals overfitting
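
a small sketch of the train/test comparison on synthetic data (the sine signal, sample size, and degrees are assumptions, not the Airbnb experiment): fit polynomials of increasing degree and record both scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, size=60)  # assumed signal + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}  # degree -> (train R^2, test R^2)
for degree in (1, 3, 15):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    model.fit(X_train, y_train)
    results[degree] = (
        model.score(X_train, y_train),  # scored on data the model has seen
        model.score(X_test, y_test),    # scored on held-out data
    )
    print(degree, results[degree])
```

train \(R^2\) cannot go down as the degree grows, because each model nests the previous one; test \(R^2\) carries no such guarantee.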

seven levels of model complexity

1,500 Airbnb listings, 60/40 train/test split

level | what it adds                     | features
1     | bedrooms + bathrooms             | 2
2     | + room type dummies              | 4
3     | + borough dummies                | 8
4     | + bedroom × borough interactions | 12
5     | all degree-2 terms               | 44
6     | all degree-3 terms               | 164
7     | all degree-4 terms               | 494

Q: which level will have the best test \(R^2\)?

raise your hand: level 1? · 3? · 5? · 7?

the reveal: train vs test R²

x-axis splits into two regimes: levels 1–4 hand-picked features · levels 5–7 polynomial explosion

can test \(R^2\) be negative? yes — level 7 scores \(\approx -8.4\), far worse than predicting the mean
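
why negative values are possible: \(R^2 = 1 - \text{SS}_\text{res} / \text{SS}_\text{tot}\), so any model whose squared error exceeds that of the mean predictor scores below zero. a toy check with made-up numbers (not the lecture's data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0])

# Baseline: always predict the mean of the observed values.
mean_pred = np.full_like(y_true, y_true.mean())
# Made-up predictions that swing far from the truth, as an overfit model might.
wild_pred = np.array([400.0, -50.0, 500.0, 0.0])

r2_mean = r2_score(y_true, mean_pred)  # SS_res equals SS_tot, so R^2 = 0
r2_wild = r2_score(y_true, wild_pred)  # SS_res = 282500, SS_tot = 12500
print(r2_mean, r2_wild)                # 0.0 and about -21.6
```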

level 7 scored test R² ≈ −8.4

what went wrong — and what would you try to fix it?

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

— John von Neumann

level 7 packs in 494 features — enough to chase noise no honest pattern supports

the test set exposes it

what this split can and cannot detect

a random train/test split tests: new data from the same distribution

it cannot detect population shift — it only tests generalization within the training distribution

  • Epic trained on one hospital system, deployed at another
  • a random split within one hospital would have looked fine

distribution shift

covariate shift

where?

Epic sepsis: one hospital → another

temporal shift

when?

pre-COVID traffic model → March 2020

label shift

whom?

cancer screen: referral clinic (5% prevalence) → routine screening (0.3% prevalence)

random train/test splits see none of these

bias-variance tradeoff

the data-generating model

if we trained on a different 900 listings, would we get the same model?

we assume each observation is generated as

\[y = f(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)\]

  • \(f(x)\): the signal (the true relationship)
  • \(\varepsilon\): the noise (mean zero, variance \(\sigma^2\))
  • \(\hat{f}(x)\): the model’s prediction (fit on a training set)

\(\sigma^2\) is the irreducible noise floor — no model can beat it

(the decomposition holds for any mean-zero noise; Gaussian here for concreteness)

seeing bias and variance

what differs between the left and right panels?

the definitions

bias — average prediction misses the truth

\[\text{bias}(x) = \mathbb{E}[\hat{f}(x)] - f(x)\]

variance — predictions scatter around their average

\[\text{var}(x) = \mathbb{E}\!\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]\]

\[\text{MSE}(x) = \text{bias}(x)^2 + \text{var}(x) + \sigma^2\]

at a fixed test point \(x\), averaging over training sets
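
a sketch of where the decomposition comes from, assuming the noise \(\varepsilon\) at the test point is independent of the training set used to fit \(\hat{f}\):

\[
\begin{aligned}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
&= \mathbb{E}\big[(f(x) + \varepsilon - \hat{f}(x))^2\big] \\
&= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}\big[f(x) - \hat{f}(x)\big] + \mathbb{E}[\varepsilon^2] \\
&= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] + \sigma^2 \\
&= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathbb{E}\!\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right] + \sigma^2
\end{aligned}
\]

the second line's cross term vanishes because \(\mathbb{E}[\varepsilon] = 0\); the last step adds and subtracts \(\mathbb{E}[\hat{f}(x)]\), and its cross term vanishes because \(\mathbb{E}\big[\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big] = 0\).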

the bias-variance tradeoff

you can only measure bias and variance on simulated data

both definitions are expectations over training sets — you need many independent draws from the same distribution

real data gives you one training set and one fit — no way to average

also: bias compares \(\mathbb{E}[\hat{f}(x)]\) to \(f(x)\), the true function, which we don’t know on real data

on real data we read symptoms — train-test gaps, CV curves — not the quantities themselves

bias in the real world: parametric extrapolation

in 2020, the IHME (Institute for Health Metrics and Evaluation) COVID model fit a symmetric bell curve to daily deaths

shape assumed deaths would decline as fast as they rose

rigid model shape — assumed symmetry, not learned from data

Jewell et al., JAMA, 2020

sketch the U-curve

you know bias decreases with complexity and variance increases

sketch what happens to total test error (bias² + variance + σ²) as complexity grows

the U-curve

every point is simulated — 200 Monte Carlo training sets per degree
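
a minimal sketch of such a simulation, under assumed settings (sine signal, 20 training points, one fixed test point); it is not the code behind the lecture's figure. each degree gets 200 fresh training sets, and bias² and variance are read off the spread of predictions at \(x_0\).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Assumed true signal; known only because the data are simulated."""
    return np.sin(2 * np.pi * x)

sigma, n_train, n_sets = 0.3, 20, 200
x0 = 0.25  # fixed test point

def simulate(degree):
    preds = np.empty(n_sets)
    for s in range(n_sets):
        # Draw a fresh training set from the same distribution.
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[s] = np.polyval(np.polyfit(x, y, degree), x0)
    bias_sq = (preds.mean() - f(x0)) ** 2  # average prediction vs truth
    variance = preds.var()                 # scatter around the average
    return bias_sq, variance

results = {}
for degree in (1, 3, 9):
    results[degree] = simulate(degree)
    b2, var = results[degree]
    print(degree, b2, var, b2 + var + sigma**2)
```

the last printed column is the estimated expected test error at \(x_0\): bias² + variance + \(\sigma^2\).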

forward: random forests (ch 13) show a flat test-error curve — more trees never hurts. we’ll see why then.

mapping it back to the experiment

level | training R² | test R²  | diagnosis
1     | low         | low      | high bias (underfitting)
3-5   | moderate    | moderate | sweet spot
6-7   | high        | negative | high variance (overfitting)

  • low training \(R^2\) → high bias
  • large train-test gap → high variance

underfit? overfit? the fix is mechanical

symptom                     | diagnosis                | fix
low train R², close test R² | high bias (underfit)     | add features, use a more flexible model
high train R², low test R²  | high variance (overfit)  | reduce features · regularize · add more data
both high, close            | sweet spot               | nothing to fix
both low, close             | high bias and high noise | check data quality

“reduce features” and “add features” we just walked through — levels 1-7

what about add more data?

more data cures overfitting

level 6 — 164 polynomial features, catastrophic at 900 training rows (60% of 1,500)

same features, same model, just more rows

variance shrinks as the training set grows — bias does not
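
a simulated check of that claim, using a deliberately too-simple model (a straight line fit to an assumed sine signal; all settings are illustrative): with 100× more rows the variance collapses, while the bias stays where it was.

```python
import numpy as np

rng = np.random.default_rng(9)

def f(x):
    """Assumed true signal (simulation only)."""
    return np.sin(2 * np.pi * x)

sigma, x0, n_sets = 0.3, 0.25, 200

def bias_sq_and_variance(n_train, degree=1):
    preds = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[s] = np.polyval(np.polyfit(x, y, degree), x0)
    return (preds.mean() - f(x0)) ** 2, preds.var()

small = bias_sq_and_variance(n_train=20)    # (bias^2, variance) with few rows
large = bias_sq_and_variance(n_train=2000)  # same model, 100x more rows
print("n=20:  ", small)
print("n=2000:", large)
```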

your friend shows you a model with training R² = 0.98 and test R² = 0.42

what’s the diagnosis? what would you recommend?

train / validate / test

why not just two sets?

we used the test set to compare seven models and picked the best

but that means our “test” performance is optimistic

we’ve implicitly fit a decision (which model to deploy) to the test data

think of it like studying for an exam

  • training = the textbook (study from it)
  • validation = a practice exam (if you do badly, change your approach)
  • test = the final (take it once, that score counts)

take the final three times, report the highest — no longer a fair estimate of what you know

the three-way split

  • training — the data the model learns from
  • validation — held-out data used to choose model complexity
  • test — touched only once, at the very end

common split: 60% train / 20% validate / 20% test

the test set stays pristine — no decisions depend on it

the workflow

  1. fit each model on blue rows (train)
  2. evaluate on yellow rows (validation) — pick the best
  3. report winner’s performance on red rows (test)
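
the three steps can be sketched as follows, on synthetic data with polynomial degree standing in for "model complexity" (the data, the candidate degrees, and the 60/20/20 split via two calls to `train_test_split` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, size=500)

# 60% train / 20% validation / 20% test, as two successive splits.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0  # 0.25 of 80% = 20%
)

candidates = {
    d: make_pipeline(PolynomialFeatures(d, include_bias=False), LinearRegression())
    for d in (1, 2, 3, 5, 8)
}

# Steps 1 and 2: fit on train, score on validation, pick the best.
val_scores = {
    d: m.fit(X_train, y_train).score(X_val, y_val) for d, m in candidates.items()
}
best_degree = max(val_scores, key=val_scores.get)

# Step 3: touch the test set once, for the winner only.
test_r2 = candidates[best_degree].score(X_test, y_test)
print(best_degree, val_scores[best_degree], test_r2)
```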

cross-validation: rotate the held-out fold

single train/test split is noisy — might get lucky or unlucky

every observation plays validator exactly once
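
a hand-rolled version of the rotation, on synthetic data (assumed for illustration), that also counts how often each row is held out:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
times_held_out = np.zeros(len(X), dtype=int)

for train_idx, val_idx in kf.split(X):
    # Fit on four folds, score on the fifth.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))
    times_held_out[val_idx] += 1

print(np.mean(fold_scores), np.std(fold_scores))
print(np.unique(times_held_out))  # each row is held out exactly once
```

`cross_val_score` in scikit-learn wraps the same loop; writing it out makes the rotation visible.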

CV confirms the pattern

error bars overlap for levels 3-5 — gains from interactions are modest

why CV beats a single split

200 random seeds, same model (level 4) · orange = CV, blue = single split

same mean, much narrower spread — CV reduces the variance of the estimate

so when you compare levels 3, 4, 5: CV gives a steadier ranking; a single split’s noise can swamp the differences

leave-one-out CV

\(k = n\) folds: every observation is its own held-out fold

each fit uses \(n - 1\) training points — almost the full dataset

when to use it: small \(n\) (say, \(n < 200\) — a 50-patient clinical trial, a startup’s first revenue dataset)

when not to: larger \(n\) — 5- or 10-fold CV is faster and nearly identical
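
a sketch of leave-one-out CV on a small synthetic dataset (\(n = 50\), assumed for illustration). one practical wrinkle: \(R^2\) is undefined on a single held-out point, so this version collects the out-of-fold predictions and scores them once.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, size=50)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one fit per observation

# Each prediction comes from a model trained on the other n - 1 rows.
oof_pred = cross_val_predict(LinearRegression(), X, y, cv=loo)
loo_r2 = r2_score(y, oof_pred)
print(n_fits, loo_r2)
```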

going deeper: approximate LOO-CV (Broderick et al., AISTATS 2019) gives near-LOO accuracy from a single fit

do the procedures agree?

we ran three procedures on the same 1,500 listings — do you expect them to agree?

method                       | best level
single split (best test R²)  | 3
three-way split (validation) | 4
5-fold CV                    | 5

  • not a contradiction: each procedure has its own noise source
  • training-set size differs: CV fits on 1,200 listings per fold; single splits train on 900
  • more data supports more features, so CV leans toward more complex models
  • on 1,500 listings, levels 3–5 are indistinguishable — trust CV

taming complexity automatically

manual selection doesn’t scale

cross-validation tells us which model is best from a short list

but what if we have hundreds of candidate features?

we can’t try every subset: \(2^{200} \approx 1.6 \times 10^{60}\) is far too many to enumerate

lasso: automatic feature selection

OLS: minimize the sum of squared residuals — no penalty on the coefficients.

lasso: add an L1 penalty that drives coefficients to exactly zero.

\(x_i\) includes a leading 1 for the intercept; \(\beta_0\) is not penalized

\[\min_{\beta} \; \sum_{i=1}^n (y_i - x_i^T \beta)^2 \;+\; \alpha \sum_{j=1}^p |\beta_j|\]

\(\alpha\) is the knob:

  • large \(\alpha\) → stronger penalty → more zeros → simpler model
  • small \(\alpha\) → lighter penalty → closer to OLS

we pick \(\alpha\) by cross-validation
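
a quick look at the knob on synthetic data where only 3 of 20 features matter (an assumed setup). note that scikit-learn's `Lasso` scales the squared-error term by \(1/(2n)\), so its `alpha` is not numerically the same as the \(\alpha\) in the formula above; the qualitative behavior is the same.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.0]  # only 3 of 20 features carry signal
y = X @ true_beta + rng.normal(0, 1.0, size=200)

nonzero_counts = {}
for alpha in (0.001, 0.1, 1.0, 10.0):
    fit = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    # Count the coefficients the penalty has not driven to exactly zero.
    nonzero_counts[alpha] = int(np.sum(fit.coef_ != 0))

print(nonzero_counts)
```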

ridge: shrink but keep

ridge adds an L2 penalty (sum of squared coefficients)

\[\min_{\beta} \; \sum_{i=1}^n (y_i - x_i^T \beta)^2 \;+\; \alpha \sum_{j=1}^p \beta_j^2\]

shrinks coefficients toward zero but never zeros them out — every feature survives with a dampened weight
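
a small check of "shrink but keep" on synthetic data (assumed for illustration), comparing OLS with scikit-learn's `Ridge`, whose objective has the same form as the one above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 1.0, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=50.0).fit(X, y)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_exact_zeros = int(np.sum(ridge.coef_ == 0))

print(ols_norm, ridge_norm)  # ridge coefficients are smaller overall
print(n_exact_zeros)         # but none is exactly zero
```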

predict the stem plots

three panels coming: OLS, ridge, lasso

what does each look like — tall and dense? shrunk? mostly zero?

OLS vs ridge vs lasso

OLS · all 164 features

test \(R^2\) = 0.39

ridge · all 164, shrunk

test \(R^2\) = 0.44

lasso · 29 out of 164

test \(R^2\) = 0.52

interpret the stem plot

you’re deploying an airbnb price suggestion tool — which model do you ship?

  • which is most interpretable for explaining a quoted price?
  • hosts self-report their listing — which needs the fewest inputs to deploy?
  • a host mistypes bedrooms as “66” — which model’s prediction is most distorted?
  • many polynomial features are correlated — which model handles that best?

which features did lasso keep?

top 10 survivors at the CV-optimal \(\alpha\): bedrooms, room type, borough, a few interactions

designing a diagnostic panel for type-2 diabetes

a hospital wants to screen patients for early type-2 diabetes using a single blood draw.

the lab has 200 candidate markers (glucose, A1C, insulin, inflammatory proteins, lipid fractions, …) in their historical dataset — every one has been measured on 5,000 previously-diagnosed patients.

but the screening panel that actually gets used in clinic can only test 5 markers — budget, turnaround time, and patient tolerance all limit the size of the panel.

your job: pick the 5 markers and train the model. OLS, ridge, or lasso?

what to consider:

  • the deployment constraint — 5 inputs, not 200
  • interpretability — clinicians will ask “why did you flag this patient?”
  • correlation — many markers move together (metabolic cascades)
  • what happens if a clinician mistypes a glucose reading?

the coefficient path

left: large α, everything zero · right: small α, approaches OLS

as α decreases, features enter the model one at a time

standardize before regularizing

the penalty \(\alpha \sum_j |\beta_j|\) treats every coefficient on the same scale

but raw features live on wildly different scales:

  • bedrooms: std ≈ 0.7 → coefficient ≈ 30
  • number_of_reviews: std ≈ 35 → coefficient ≈ 0.02

same penalty \(\alpha |\beta_j|\), but one coefficient is 1000× bigger — purely from units

\[\tilde{x}_j = \frac{x_j - \bar{x}_j}{s_j}\]

compute \(\bar{x}_j\) and \(s_j\) on the training set only; apply them to both train and test
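
one way to get the train-only rule without bookkeeping is a scikit-learn pipeline: `fit` learns the scaler's mean and standard deviation from the training rows, and the same statistics are reused at prediction time. a sketch with two hypothetical features on very different scales (simulated, not the Airbnb data):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical features, loosely modeled on the scales quoted above.
bedrooms = rng.normal(2.0, 0.7, size=400)
reviews = rng.normal(40.0, 35.0, size=400)
X = np.column_stack([bedrooms, reviews])
y = 30.0 * bedrooms + 0.02 * reviews + rng.normal(0, 10.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
model.fit(X_train, y_train)  # scaler statistics come from X_train only

scaler = model.named_steps["standardscaler"]
print(scaler.mean_, scaler.scale_)
print(model.score(X_test, y_test))  # X_test is scaled with the train statistics
```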

raw vs standardized lasso: same data, different story

feature           | std(\(x_j\)) | raw coef | standardized coef
bedrooms          | 0.66         | 52.21    | 34.78
bathrooms         | 0.42         | 0.33     | 1.54
number_of_reviews | 34.52        | −0.02    | 0
minimum_nights    | 8.98         | −0.15    | −0.38
availability_365  | 133.58       | 0.01     | 0

test \(R^2\): 0.151 vs 0.152 — essentially identical

but the raw lasso keeps two features that are just noise; the standardized lasso correctly drops them

going deeper: why L1 zeros coefficients

2D illustration (\(\beta_1\), \(\beta_2\)): L1 diamond has corners on the axes → tangency hits a corner → sparsity
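
the same effect shows up algebraically in a one-coefficient version of the problem (a simplification assumed here for illustration, with \(z\) standing in for the unpenalized estimate):

\[
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\; (z - \beta)^2 + \alpha |\beta| = \operatorname{sign}(z)\,\max\!\left(|z| - \tfrac{\alpha}{2},\; 0\right)
\]

\[
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\; (z - \beta)^2 + \alpha \beta^2 = \frac{z}{1 + \alpha}
\]

lasso subtracts a fixed amount and clips at zero, so any \(|z| \le \alpha/2\) becomes exactly zero; ridge rescales, so a nonzero \(z\) stays nonzero for every finite \(\alpha\).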

choosing α by cross-validation

how do we pick the right penalty strength?

LassoCV: fit lasso at many \(\alpha\) values, score each by 5-fold CV
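
a minimal sketch with scikit-learn's `LassoCV` on synthetic data (4 real signals among 30 features, an assumed setup), standardizing inside the pipeline first:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 30))
true_beta = np.zeros(30)
true_beta[:4] = [4.0, -3.0, 2.0, 1.5]  # 4 real signals among 30 features
y = X @ true_beta + rng.normal(0, 1.0, size=300)

# LassoCV builds its own grid of alpha values and scores each by 5-fold CV.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print(lasso.alpha_)                    # the CV-selected penalty strength
print(int(np.sum(lasso.coef_ != 0)))   # how many features survive
print(lasso.mse_path_.shape)           # (number of alphas, number of folds)
```

the CV-selected \(\alpha\) tends to keep a few noise features alongside the real ones, since it is tuned for prediction error rather than for recovering the true support.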

same procedure, three sizes

data size       | test set             | hyperparameter search     | notes
small (~150)    | 20% (even if noisy)  | LOO-CV on the rest        | wide bootstrap CI is honest
medium (~1,500) | 20%                  | 5-fold CV on the rest     | the Airbnb case
large (100k+)   | 20% (or fixed 10%)   | single train/val split ok | OLS baseline is often enough

the shape of the procedure doesn’t change — only the details

which of these claims is wrong?

A. training R² never decreases with more features

B. test R² always increases with more features

C. the \(R^2\) reported by cross-validation can be negative

D. lasso sets some coefficients to exactly zero

what we still can’t answer

  • will the model survive distribution shift at deployment? → chapter 16
  • is any individual coefficient significant? → chapter 12
  • does a feature cause higher prices? → chapter 18
  • can we capture nonlinear structure polynomials miss? → chapter 13

CV checks generalization — within distribution. that’s not everything.

summary

  • train/test split: hold out data for honest evaluation
  • bias-variance: simple models underfit, complex ones overfit
  • train/validate/test: selection decisions contaminate a test set — use a separate validation fold
  • cross-validation: reduces estimate variance — use it when you need to compare models
  • standardize then regularize: lasso zeros features, ridge shrinks them

next time

so far every outcome has been a number (price, score, count)

what if the outcome is a category? (spam/not spam, disease/healthy)

chapter 7: classification

one-minute feedback

  1. what was the most useful thing you learned today?
  2. what was the most confusing?
