Practice Quiz 8: Validation in Deployment

MSE 125 — Lecture 16

Use this practice quiz to prepare for Quiz 8 (Wednesday, May 27). The real quiz will have 2 questions in 10 minutes, closed-book. This practice set has 10 questions covering Lecture 16: temporal leakage and the random-vs-temporal split, walk-forward validation, splitting by the right axis (panel data), distribution shift, prediction intervals, feedback loops, Weapons of Math Destruction, Goodhart’s law (including how overfitting is the same phenomenon and how to spot gaming before it happens), pre/post-deployment defensive practices, and Holt-Winters / decomposition.

Every concept tested on the real quiz appears somewhere on this practice set, with a different scenario.

Question 1. A video-streaming service forecasts its monthly new subscribers. The series has grown over thirteen years and swings seasonally (a year-end holiday peak). An analyst fits a model from recent-lag features (last month, the average of the last three months) and scores it three ways:

score	R²
random 80/20 split	R₁
temporal split (train first 9 years, test last 4)	R₂
naive “same month last year” (test window)	R₃

(a) Rank R₁, R₂, and R₃ from largest to smallest given the structure of this series, and explain your ranking.
(b) Suggest one change the analyst could make to her model that you think would meaningfully improve her temporal forecast. Justify your suggestion from what you see in the plot.

Solution

(a) Expected ranking: R₁ > R₃ > R₂. Actual values: R₁ = 0.89, R₃ = 0.65, R₂ = 0.60.

R₁ > R₂ (random > temporal): temporal leakage. A random shuffle drops each held-out month in among its immediate neighbors (the months just before and after it), which sit in the training set; the recent-lag model only has to interpolate among points it has effectively already seen, not forecast genuinely new months.
R₃ > R₂ (naive > model under temporal): the recent-lag model lacks seasonal capability — lag1 and lag3_avg carry last month’s level forward but can’t anticipate the year-end holiday peak; naive same-month-last-year captures exactly that annual pattern by construction.
R₁ > R₃ (random > naive): leakage inflates the model’s score above even the no-parameter baseline; the random number isn’t measuring skill, it’s measuring how well the model interpolates among neighbors it has effectively memorized.

The deployment-honest estimate is R₂ = 0.60, which falls slightly short of even the naive baseline's 0.65: a clear signal not to trust the 0.89.
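
To make the R₃ versus R₂ comparison concrete, here is a minimal sketch on a made-up series (a linear trend plus a December peak; the numbers are illustrative assumptions, not the quiz data). It scores two no-parameter forecasts on the temporal test window: "same month last year", and "carry last month forward", which stands in here for a model that only sees recent lags.

```python
import math

def r2(actual, predicted):
    """Coefficient of determination on a held-out window."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical series: 13 years of monthly data, rising trend, December peak.
n_months = 13 * 12
series = [
    100 + 0.5 * t + 40 * math.cos(2 * math.pi * ((t % 12) - 11) / 12)
    for t in range(n_months)
]

# Temporal split: first 9 years train, last 4 years test (no shuffling).
split = 9 * 12
test_idx = range(split, n_months)
actual = [series[t] for t in test_idx]

# Baseline 1: same month last year (captures the annual cycle by construction).
seasonal_naive = [series[t - 12] for t in test_idx]
# Baseline 2: last month's value (recent level only, no seasonal information).
last_month = [series[t - 1] for t in test_idx]

r2_seasonal = r2(actual, seasonal_naive)
r2_last_month = r2(actual, last_month)
print(f"same month last year:       R2 = {r2_seasonal:.2f}")
print(f"last month carried forward: R2 = {r2_last_month:.2f}")
```

On a series with a strong annual cycle, the seasonal baseline scores higher because the recent-level forecast is always one step behind the cycle.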

(b) Running several candidate modifications on the same data and the same temporal split:

model	temporal R²	temporal MAE
recent lags only (baseline)	0.60	8.4K subs
+ linear trend feature t	0.63	6.6K subs
log-scale subscribers	0.59	8.5K subs
+ seasonal features (sin, cos of month; year-ago lag)	0.93	3.8K subs

The most powerful single change is adding seasonal features. The plot shows a clear annual cycle (a holiday peak each year and a quieter mid-year) the recent-lag-only model can’t anticipate; a year-ago lag (lag12) or sine/cosine of month gives the model the seasonal pattern it needs — R² jumps from 0.60 to 0.93 and MAE drops from 8.4K to 3.8K subscribers.

A linear trend feature also helps modestly (0.60 → 0.63) by letting the model track the rising level rather than relying on recent values alone. Log-scaling alone does not help here — the model’s bottleneck is missing seasonal capability, not the scale of swings, so rescaling the outcome doesn’t give the model the year-ago information it actually needs.
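
The seasonal and trend features discussed above could be built along these lines. This is a sketch under stated assumptions: the feature names `lag1`, `lag3_avg`, and `lag12` follow the solution text, while the function name, the dictionary layout, and the convention that index 0 is a January are hypothetical.

```python
import math

def seasonal_features(series, t):
    """Features for forecasting month t from values observed before t.

    `series` is a list of monthly values; index 0 is assumed to be a January.
    Requires t >= 12 so that the year-ago lag exists.
    """
    month = t % 12  # 0 = January, ..., 11 = December
    angle = 2 * math.pi * month / 12
    return {
        "lag1": series[t - 1],                  # last month
        "lag3_avg": sum(series[t - 3:t]) / 3,   # mean of the last three months
        "lag12": series[t - 12],                # same month last year
        "sin_month": math.sin(angle),           # smooth encoding of the annual
        "cos_month": math.cos(angle),           # cycle (December next to January)
        "t": t,                                 # linear trend
    }

history = [float(v) for v in range(100, 130)]  # 30 months of toy data
row = seasonal_features(history, t=27)
print(row)
```

Every feature uses only values from before month t, so the same function can be applied inside a temporal split without leaking the target.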

The lesson is the chapter’s: match the model to the structure of the data. Full credit for any plot-grounded suggestion in this spirit (seasonal features, a year-ago lag, sin and cos of month, a linear trend, a switch to multiplicative Holt-Winters, or log-scaling subscribers), each with a justification tied to what’s visible.

Question 2. A utility validates a demand-forecasting model with walk-forward validation: train on all prior years, forecast the next year, step forward, repeat. The yearly mean absolute error (MAE):

(a) In which year does the model break down, and what does the pattern of bars suggest happened that year?
(b) The utility had been reporting a single overall MAE averaged across all test years. Why is the walk-forward view more informative for deciding whether to trust the model?

Solution

(a) 2021: its MAE is roughly 4× that of every other year. The model, trained on prior (calmer) years, could not forecast 2021 because something happened that year that the training data never contained: a regime change or demand shock (an extreme-weather year, say) outside the range the model had ever seen.

(b) A single averaged MAE blends the disastrous 2021 in with a dozen calm years, so the average looks acceptable and the failure is hidden. Walk-forward surfaces when the model fails. Knowing the model is reliable in normal years but blows up during shocks is exactly what you need to decide when you can trust it — an average across all years buries that.
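
The walk-forward procedure can be sketched as a short loop. Everything below is illustrative: the demand numbers are made up, and the "model" is just the mean of all past observations, chosen only to keep the example self-contained.

```python
# Hypothetical yearly demand samples (made-up numbers); 2021 contains a shock.
demand = {
    2018: [100, 102, 98, 101],
    2019: [101, 99, 100, 102],
    2020: [99, 100, 103, 98],
    2021: [140, 150, 145, 155],
    2022: [100, 101, 99, 102],
}

def walk_forward_mae(data):
    """Train on all prior years, forecast the next year, step forward, repeat."""
    years = sorted(data)
    results = {}
    for i in range(1, len(years)):
        train = [v for y in years[:i] for v in data[y]]
        forecast = sum(train) / len(train)  # toy model: mean of the past
        test = data[years[i]]
        results[years[i]] = sum(abs(v - forecast) for v in test) / len(test)
    return results

mae_by_year = walk_forward_mae(demand)
overall = sum(mae_by_year.values()) / len(mae_by_year)
worst_year = max(mae_by_year, key=mae_by_year.get)
for year, mae in mae_by_year.items():
    print(year, round(mae, 2))
print("overall MAE:", round(overall, 2), "| worst year:", worst_year)
```

Printing the per-year errors next to the single averaged number shows how the average describes neither the calm years nor the shock year.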

Question 3. A health insurer is training a model to predict whether a member will file a high-cost claim in the next year. The dataset has one row per member per year, and about 80% of members appear in multiple years (most stay enrolled for several years).

The data team splits the dataset 80/20 at the row level (each row independently assigned to train or test) and reports R² = 0.71 on the held-out rows.

(a) A senior actuary raises a concern. What is the leakage problem, and what split should the team use instead?
(b) The same principle applies to a customer-churn model built from one row per customer per month. (i) What is the right split for that model? (ii) What is the general question to ask whenever you set up validation on data where the same unit recurs across rows?

Solution

(a) Most members appear in multiple years. A row-level random split puts the same member’s other years into the training set — the model effectively sees the member’s profile and earlier outcomes during training and then “predicts” them on the test set. The model learns member-specific patterns rather than the generalizable claim relationship, so R² = 0.71 reflects memorization, not real predictive skill. The right split is by member — the chapter’s “split by the right axis” rule: hold out 20% of members entirely, so that for each held-out member, all of their year-rows go to test, and no member appears in both train and test.

(b)(i) Split by customer, not by row — otherwise the model sees a customer’s other months during training and the leakage is identical to (a). (ii) The general question to ask is: what generalization axis do you care about? If you care about predicting outcomes for new members or customers, split along that axis (by member / by customer); if you care about predicting future outcomes for existing ones, use a temporal split. The split has to match the question you’re trying to answer.
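
A split by member (or by customer) can be sketched in a few lines. The function name, the `member_id` column, and the toy rows are assumptions made for illustration.

```python
import random

def split_by_group(rows, key, test_frac=0.2, seed=0):
    """Hold out whole groups: every row of a held-out unit goes to test."""
    groups = sorted({row[key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_frac * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[key] not in test_groups]
    test = [r for r in rows if r[key] in test_groups]
    return train, test

# Hypothetical member-year rows: 10 members, 3 years each.
rows = [
    {"member_id": m, "year": y}
    for m in range(10)
    for y in (2021, 2022, 2023)
]
train, test = split_by_group(rows, key="member_id")
train_ids = {r["member_id"] for r in train}
test_ids = {r["member_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

The check that matters is the empty intersection: no member contributes rows to both sides. Libraries such as scikit-learn provide group-aware splitters (for example `GroupShuffleSplit`); the sketch above simply avoids the dependency.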

Question 4. A hospital forecasts daily ER visits with a model trained on normal days, and reports a 95% bootstrap prediction interval (resampling the training residuals). The test period includes a heat wave. Over the whole test period the interval’s coverage is only 82.5% (target: 95%).

(a) During the heat wave, actual visits hit 305 while the forecast stayed near 140 and the prediction interval contained none of the surge days. Name this failure mode and explain why the model could not predict the surge.
(b) Cross-validation would not have caught this either. Explain why neither cross-validation nor the bootstrap interval protects against the surge: what do both methods assume that the heat wave violated?

Solution

(a) Distribution shift (an out-of-distribution regime the training set never saw). The model learned the normal-day relationship between recent visits and today’s visits; a heat wave is a qualitative change outside the training range, so the model stays near the normal-day level (~140) and cannot extrapolate to 305. The signal it would need — how ER visits behave during extreme heat — was never in the training data to be learned.

(b) Both methods assume the future looks like the past. Cross-validation holds out points that resemble the training data (in-distribution), so it guards against ordinary noise, not a regime it never saw. The bootstrap interval is built from the spread of normal-day residuals, so it widens for ordinary day-to-day variation but is blind to a surge no training day resembled. Neither can quantify uncertainty about conditions absent from the training data — which is exactly when models fail hardest.
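
The blindness of a residual-based interval can be illustrated with a simplified sketch. The assumptions are loud ones: the model forecasts a flat 140 visits, normal-day residuals are simulated Gaussian noise, and the interval is a plain percentile interval of resampled residuals rather than whatever procedure the hospital actually used.

```python
import random

rng = random.Random(0)

# Assumed setup: a flat forecast of ~140 visits, with training residuals that
# come from normal days only (made-up Gaussian noise).
forecast = 140.0
train_residuals = [rng.gauss(0, 10) for _ in range(500)]

def bootstrap_interval(residuals, point, level=0.95, n_boot=2000):
    """Percentile interval from resampled training residuals."""
    draws = sorted(point + rng.choice(residuals) for _ in range(n_boot))
    lo = draws[int((1 - level) / 2 * n_boot)]
    hi = draws[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_interval(train_residuals, forecast)

def coverage(actuals):
    """Share of actual values that fall inside the interval."""
    return sum(lo <= a <= hi for a in actuals) / len(actuals)

normal_days = [forecast + rng.gauss(0, 10) for _ in range(200)]
surge_days = [305.0] * 10  # heat-wave level from the question

print(f"interval: [{lo:.0f}, {hi:.0f}]")
print("coverage, normal days:", coverage(normal_days))
print("coverage, surge days: ", coverage(surge_days))
```

The interval's width is set entirely by the spread of normal-day residuals, so no amount of resampling can stretch it to a level the training data never produced.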

Question 5. A navigation app predicts which roads will be congested and routes drivers onto the roads it predicts will be clear. After a few months it retrains on the fresh traffic data it has collected.

(a) Describe the feedback loop (prediction → action → outcome → new data). Why is this different from ordinary confounding?
(b) The team validates the model on a held-out set of historical trips and gets excellent accuracy. Why does this not tell them how the model performs in deployment, and what would?

Solution

(a) The model predicts road A will be clear → it routes drivers onto A (action) → A becomes congested (outcome) → the new training data records A as congested (new data) → the next prediction changes. The model’s own action changes the outcome it is trying to predict. This is not confounding: in confounding the association already exists in the world and we misread it causally, whereas here the model causes the association — it changes the data-generating environment.

(b) The historical held-out set comes from a world in which the model was not routing drivers, so it cannot reflect how the roads respond once the model acts on them. Passive historical validation cannot capture an effect the model itself creates. To measure real impact you must randomize: an A/B test that routes some drivers by the model and holds out a control routed otherwise, so you can see what the model causes versus what would have happened anyway.
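
The loop in (a) can be simulated in a toy world. All numbers and names below are assumptions: two roads, travel time that rises linearly with the number of drivers, and a "model" that simply predicts the road that was faster in the latest data will be clear.

```python
# Toy world (all numbers are assumptions): travel time on a road rises with
# the number of drivers on it.
BASE_MINUTES = {"A": 30.0, "B": 50.0}
DRIVERS = 100
MINUTES_PER_DRIVER = 0.5
ROUTED_DRIVERS = 80  # drivers who follow the app's recommendation

def observed_times(drivers_on):
    return {r: BASE_MINUTES[r] + MINUTES_PER_DRIVER * n for r, n in drivers_on.items()}

# Historical data: drivers split evenly, no model in the loop.
times = observed_times({"A": 50, "B": 50})

picks = []
for _ in range(4):
    # Prediction: the road that was faster in the latest data will be clear.
    pick = min(times, key=times.get)
    other = "B" if pick == "A" else "A"
    # Action: route most drivers onto the predicted-clear road.
    # Outcome and new data: travel times are re-observed after the routing.
    times = observed_times({pick: ROUTED_DRIVERS, other: DRIVERS - ROUTED_DRIVERS})
    picks.append((pick, times[pick], times[other]))
    print("recommended", pick, "->", times)
```

In every round the recommended road ends up slower than the alternative, and the recommendation flips the next time the model sees fresh data. The historical even-split data contained no hint of this, because it was recorded before the model was routing anyone.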

Question 6. A company sells landlords a “tenant risk score” that predicts whether an applicant will miss rent. Landlords automatically deny anyone the model flags as high-risk. The score is trained on past eviction and payment records.

(a) Is the tenant-risk score a Weapon of Math Destruction? Evaluate it against the three traits: (1) is the outcome easily measurable? (2) are there negative consequences for individuals? (3) is there a self-fulfilling loop?
(b) Contrast: a model predicts next week’s regional pollen count so a pharmacy chain can stock antihistamines. Is this a WMD? Which trait decides it?

Solution

(a) Yes. (1) Outcome not easily measurable: the training labels (past evictions and payment records) reflect prior conditions and biases, not a clean measure of who is a good tenant. (2) Negative consequences: being automatically denied housing, with no review, is a severe individual harm. (3) Self-fulfilling loop: a denied applicant becomes more housing-unstable → accumulates a worse record → gets flagged again; the score helps create the instability it claims to predict. All three traits hold.

(b) No. The pollen model fails the self-fulfilling-loop trait (and the harm trait): forecasting the pollen count does not change the pollen count, so there is no feedback from the prediction back to the outcome, and no individual is harmed by the forecast itself. The deciding trait is the loop — the prediction doesn’t change the thing it measures, so it can’t be a WMD.

Question 7. A customer-support team is given a target: cut the median time to first response, with a bonus attached. The plot shows two metrics before and after the target was introduced.

(a) The dashboard celebrates the drop in first-response time. Using both lines, explain what is actually happening. Identify the proxy metric and the true goal.
(b) Overfitting (a model’s training loss falling while its validation loss rises) is described in Chapter 16 as the same phenomenon as this kind of metric-gaming. In the overfitting case, what plays the role of the gamed proxy, and what plays the role of the true goal?

Solution

(a) After the target, time to first response collapses (~4 h → ~0.4 h) while time to full resolution stays flat (~28 h). Agents are gaming the proxy — firing off an instant canned acknowledgment to stop the first-response clock — without resolving anything faster. The proxy is time to first response (the metric being optimized); the true goal is actually solving customers’ problems (≈ time to full resolution), which did not move. When a measure becomes a target, it stops tracking the goal.

(b) The gamed proxy is the training loss; the true goal is generalization — performance on new, unseen data (validation/test loss). The learning algorithm drives training loss down so hard that it stops tracking generalization, exactly as the support team drives first-response time down without advancing resolution. The only difference is the agent: a single optimizer here, people there.

Question 8. A regional delivery company pays its drivers a bonus for completing at least 95% of deliveries within the promised time window. The company’s stated goal: reliable, on-time service to customers.

(a) Predict how the bonus will be gamed. Name at least two specific routes drivers could use to lift the measured on-time rate without actually getting more packages to customers faster.
(b) Name the proxy and the underlying goal in this scenario. Which drivers benefit most from the bonus, and is that the group the company meant to reward? Name one hidden second metric management could track to detect the gaming.

Solution

(a) Plausible gaming routes (any two from different mechanisms earn full credit):

False-attempt-marking: mark “attempted, no one home” for deliveries running late — these don’t count as missed-window deliveries.
Cherry-picking: complete the close, simple deliveries inside the window; defer the hard, distant ones, or negotiate wider windows for the harder routes.
Redefining the outcome: mark “delivered” on arrival at the address even when the package actually goes back to the truck.

(b) Proxy: the measured on-time-delivery rate. Goal: customers actually receiving their packages on time. The drivers best at gaming the proxy (cherry-picking, false-attempt-marking, definition-stretching) benefit most — not necessarily the drivers actually serving customers well, which is whom the company meant to reward. Hidden second metric (any one):

Customer complaint rate or re-delivery rate per address — rises when “delivered” doesn’t mean the customer got the package.
Share of “attempted, no one home” outcomes per driver — false-attempt gaming shows up as an anomalous spike.
Driver-level missed-package customer call-back rate — captures outcomes the operational metric does not.

The pattern: when consequences attach to a proxy, those best at gaming the proxy dominate it, not those best at the underlying goal — and a useful audit metric is one the optimizer can’t directly affect.
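
One of the audit metrics above, the per-driver share of “attempted, no one home” outcomes, could be monitored roughly as follows. The driver names and counts are invented, and the 3× multiplier is an arbitrary illustration rather than a calibrated threshold.

```python
from statistics import median

# Hypothetical per-driver delivery outcomes (counts are made up).
outcomes = {
    "driver_1": {"delivered": 188, "attempted_no_one_home": 8, "late": 4},
    "driver_2": {"delivered": 190, "attempted_no_one_home": 6, "late": 4},
    "driver_3": {"delivered": 150, "attempted_no_one_home": 48, "late": 2},
    "driver_4": {"delivered": 186, "attempted_no_one_home": 10, "late": 4},
}

def attempt_share(counts):
    """Share of a driver's stops recorded as 'attempted, no one home'."""
    return counts["attempted_no_one_home"] / sum(counts.values())

shares = {d: attempt_share(c) for d, c in outcomes.items()}
typical = median(shares.values())

# Flag drivers whose share is far above the fleet's typical level.
flagged = [d for d, s in shares.items() if s > 3 * typical]
print(shares)
print("flagged:", flagged)
```

In this toy data the flagged driver also has the fewest late deliveries, which is the signature to look for: the bonus metric looks best exactly where the audit metric looks worst.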

Question 9. A hotel chain decomposes its monthly occupancy with a multiplicative model and fits Holt-Winters. The estimated trend and the seasonal multipliers:

The fitted smoothing parameters are α = 0.13, β = 0.00, γ = 0.61.

(a) Read the seasonal multipliers: which month is busiest, which is slowest, and what does a multiplicative multiplier of 1.37 mean in plain language?
(b) The fit drove β ≈ 0. Which component does β control, and what does β ≈ 0 tell you about how this hotel’s occupancy is changing over time?

Solution

(a) Busiest: July (multiplier ≈ 1.37). Slowest: January (≈ 0.71). “Multiplicative” means the seasonal factor scales the trend rather than adding a fixed amount: July occupancy runs about 37% above the trend level (× 1.37), and because it is a percentage, the size of that summer bump grows in absolute terms as the trend rises. (January runs ~29% below trend, × 0.71.)

(b) β controls the trend (slope) update — how fast the estimated growth rate adapts to new data. β ≈ 0 means the slope is held essentially constant: the hotel’s occupancy grows at a steady, unchanging rate year over year. The level still climbs (the trend keeps rising via α and the carried-forward slope), but the rate of growth does not change, so there is nothing for β to track. (Compare α = 0.13, a slowly tracked level, and γ = 0.61, a seasonal shape that updates briskly as the summer swing widens.)
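
The roles of α, β, and γ are easier to see in a single update step. The sketch below uses one common textbook form of the multiplicative Holt-Winters recursions (libraries differ in details such as which level the seasonal update divides by); the state values and the observation are made up, and only the three smoothing parameters come from the question.

```python
def holt_winters_step(y, level, trend, season, alpha, beta, gamma):
    """One multiplicative Holt-Winters update for a new observation y.

    `season` is the multiplier currently stored for this calendar month.
    """
    new_level = alpha * (y / season) + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    new_season = gamma * (y / new_level) + (1 - gamma) * season
    return new_level, new_trend, new_season

def forecast(level, trend, season, h=1):
    """h-step-ahead forecast: the seasonal multiplier scales the trend line."""
    return (level + h * trend) * season

# Fitted parameters from the question; the state values below are made up.
alpha, beta, gamma = 0.13, 0.00, 0.61
level, trend, july = 60.0, 0.5, 1.37

new_level, new_trend, new_july = holt_winters_step(
    y=85.0, level=level, trend=trend, season=july,
    alpha=alpha, beta=beta, gamma=gamma,
)
print("level:", new_level, "trend:", new_trend, "July multiplier:", new_july)

# The same 1.37 multiplier is a larger absolute bump on a higher trend level.
print(forecast(100.0, 0.0, 1.37), forecast(200.0, 0.0, 1.37))
```

With β = 0 the slope passes through the update unchanged, while the level still moves each step by the carried-forward slope plus an α-weighted correction.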

Question 10. A bank is about to deploy a machine-learning fraud-detection model. The model will automatically block transactions it flags as high-risk before they go through. The team asks for advice on how to validate the model before deployment and how to monitor it after.

(a) Name two pre-deployment practices appropriate for this deployment, and briefly explain what each would catch.
(b) Name two post-deployment practices, and briefly explain what each would catch.

Solution

(a) Pre-deployment (any two from different mechanisms):

Split by the right axis (here: by time). Fraud patterns evolve, so a random split puts future patterns alongside past ones and inflates the score. Train on the past and test on the future to estimate how the model will perform once deployed.
Stress-test the tails / out-of-distribution. Run the model against synthetic adversarial transactions and patterns the training set never contained — the events the bank most needs to catch are exactly the ones the historical data hasn’t seen.
Evaluate against the asymmetric cost structure, not just headline accuracy. False positives (legitimate transactions blocked, customers angered) and false negatives (real fraud waved through) carry very different costs; the model should be tuned and reported with that asymmetry in mind, not just on raw precision/recall.
Simulate the deployed action. Don’t just score predictions — score what happens once transactions are blocked: customer retries, account churn, downstream business impact. The deployed model takes an action; its validation should reflect that action.

(b) Post-deployment (any two from different mechanisms):

Monitor drift. Fraudsters adapt as they probe the model; track precision and recall on freshly confirmed-fraud cases and alert when performance erodes.
Hold out a random control. Let a small random sample of flagged transactions through unblocked — this is the only way to estimate the false-positive rate (how many “fraud” flags were actually legitimate), because a blocked transaction never generates the evidence that it was fine. (A separate manual-review sample of unflagged transactions gives the false-negative rate the same way.)
A/B test major model changes. Don’t replace the production model wholesale on the strength of historical numbers; randomize a fraction of transactions between the old and the new model and measure causal impact directly.

The deep point: passive historical validation can’t catch a model that changes the data-generating environment (a blocked transaction never produces “this was actually legitimate” evidence), so the post-deployment practices have to introduce randomization to keep generating honest feedback.
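
A random control of this kind can be sketched as follows. The transactions, the 30% legitimate share, and the control size of 500 are all invented for illustration; in practice the ground truth would only become known for transactions that were allowed through.

```python
import random

rng = random.Random(0)

# Hypothetical flagged transactions. `is_legit` is the ground truth, which the
# bank can only observe for transactions it lets through.
flagged = [{"id": i, "is_legit": rng.random() < 0.3} for i in range(10_000)]

# Randomly pass a small sample of flagged transactions instead of blocking them.
CONTROL_SIZE = 500
control = rng.sample(flagged, CONTROL_SIZE)
control_ids = {t["id"] for t in control}
blocked = [t for t in flagged if t["id"] not in control_ids]

# Outcomes exist only for the control group; blocked transactions stay unlabeled.
legit_share_estimate = sum(t["is_legit"] for t in control) / len(control)
print(f"control size: {len(control)}, blocked: {len(blocked)}")
print(f"estimated share of flags that were legitimate: {legit_share_estimate:.2f}")
```

Because the control is chosen at random, its legitimate share is an unbiased estimate for all flagged transactions, including the blocked ones that never produce evidence of their own.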
