Final Exam — What to Expect

MSE 125 — Spring 2026

Published

May 25, 2026

The final exam is cumulative over Lec 1–16 (the causal inference week and Ch 20 fairness are enrichment material and are excluded). It is closed book, no devices, no AI, 90 minutes, printed in black and white, single version.

The 8 unit quizzes and their practice quizzes are your primary study resource — the final reuses the question types they introduced. Two new archetypes show up only on the final; this handout introduces them so they’re not a surprise on exam day.

Structure

| Section | Points | Time | What it tests |
|---|---|---|---|
| 1. Tool literacy | 25 | ~22 min | 8 MC + 5 fill-in. Quick decisions: which test, which model, which CV protocol, which plot. A formula strip at the top of the section gives Bonferroni, expected FP, recall/precision, \(R^2\), residual. |
| 2. Interpretation & EDA | 35 | ~33 min | 3 problems with figures: regression-table interpretation, EDA plot critique, classification + threshold reasoning. |
| 3. Diagnose & supervise | 40 | ~35 min | 3 longer problems. Starts with the AI code review (new archetype, 15 pts), then diagnose-the-phenomenon (new archetype, 12 pts), then unsupervised interpretation (12 pts). |
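
For review, the quantities named on the Section 1 formula strip have the standard textbook definitions below. This is a sketch under assumptions: the notation (\(m\) tests, overall level \(\alpha\), confusion-matrix counts \(TP\), \(FP\), \(FN\)) is chosen here for illustration, and the wording printed on the exam strip may differ.

```latex
\begin{aligned}
\text{Bonferroni threshold:} \quad & \alpha_{\text{per test}} = \frac{\alpha}{m} \\
\text{Expected false positives:} \quad & \mathbb{E}[FP] = m\,\alpha \quad (\text{if all } m \text{ nulls are true}) \\
\text{Recall:} \quad & \frac{TP}{TP + FN} \\
\text{Precision:} \quad & \frac{TP}{TP + FP} \\
\text{Coefficient of determination:} \quad & R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \\
\text{Residual:} \quad & e_i = y_i - \hat{y}_i
\end{aligned}
```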

No calculator. All arithmetic is doable on paper. Black-and-white printing — every figure distinguishes lines by linestyle, marker, and label, never by color.

Sample item 1 — AI code review

Sample

You asked an AI agent to predict customer churn from a labelled dataset. It returned this code:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
model = LogisticRegression().fit(X_tr, y_tr)

print("Accuracy:", accuracy_score(y_tr, model.predict(X_tr)))

(a) Name the bug in one short phrase.

(b) What single sanity check would have caught it?

Solution. (a) The reported accuracy is training accuracy, not test accuracy. The offending call is accuracy_score(y_tr, model.predict(X_tr)): both arguments come from the training partition, so the model is being evaluated on the same data it was fit on.

(b) Any of: compute accuracy on the held-out X_te, y_te and compare it with the reported number; treat a suspiciously high score as a red flag, since training accuracy is usually higher than test accuracy; use cross-validation to estimate generalization.
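
The sanity checks in (b) can be sketched as follows. This is a minimal illustration, not the exam's reference solution: the churn data are not shown in the item, so make_classification stands in for X and y, and the random_state and max_iter settings are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the churn dataset (assumption: the real X, y are not given).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Score the model on both partitions so the two numbers can be compared.
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))

# 5-fold cross-validation as an independent estimate of generalization.
cv_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

Reporting the training and held-out numbers side by side makes the mistake hard to miss: a headline figure that matches only the training score was not measured on unseen data.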

Sample item 2 — Diagnose the phenomenon

Sample

“You ran a two-sample \(t\)-test comparing means between two groups and got \(p = 0.001\), but you only had \(n = 4\) observations per group.” Name two plausible causes of this \(p\)-value, and how you’d check each.

Solution. Any two of the following, each with a check:

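  • The \(t\)-test’s normality assumption carries too much weight at \(n=4\): it cannot be verified from four points per group, and the central limit theorem offers little protection. Check: re-run the comparison as a permutation test on the difference in means and compare the \(p\)-values.
  • An outlier is driving the result. Check: plot the raw data; remove the most extreme observation and re-run the test.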
  • The standard error estimate is unreliable at \(n=4\). Check: bootstrap the difference in means and look at the bootstrap CI’s width vs. the observed difference.
  • The effect is real and very large. Check: report the effect size (Cohen’s \(d\)) in addition to the \(p\)-value; if it’s enormous, the small \(p\)-value is consistent with a real, large effect that doesn’t need many observations to be detected.
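
Two of the checks above, the permutation test and the effect size, can be sketched with the standard library alone. The measurements are hypothetical (they are not from the exam), and Cohen's \(d\) uses the equal-\(n\) pooled standard deviation. One point worth noticing: with \(n = 4\) per group there are only 70 ways to relabel the data, so a two-sided exact permutation test cannot return a \(p\)-value below \(2/70 \approx 0.029\).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

# Hypothetical measurements, n = 4 per group (illustrative only).
group_a = [10.1, 10.3, 9.9, 10.2]
group_b = [12.0, 12.4, 11.8, 12.1]

pooled = group_a + group_b
n_a = len(group_a)
observed = abs(mean(group_a) - mean(group_b))

# Exact permutation test: enumerate every way to label 4 of the 8 values as "A".
n_splits = 0
count_extreme = 0
for idx in combinations(range(len(pooled)), n_a):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    n_splits += 1
    # Small tolerance guards against floating-point ties.
    if abs(mean(a) - mean(b)) >= observed - 1e-12:
        count_extreme += 1

p_perm = count_extreme / n_splits

# Cohen's d with the pooled standard deviation (equal group sizes).
pooled_sd = sqrt((variance(group_a) + variance(group_b)) / 2)
cohens_d = (mean(group_b) - mean(group_a)) / pooled_sd

print(f"Relabelings: {n_splits}, as extreme as observed: {count_extreme}")
print(f"Permutation p-value: {p_perm:.4f}")
print(f"Cohen's d: {cohens_d:.2f}")
```

Because these two groups do not overlap at all, only the observed labeling and its mirror image are as extreme as the data, so the permutation \(p\)-value sits at its floor of \(2/70\). A \(t\)-test result of \(p = 0.001\) on data like this therefore depends on the parametric assumptions rather than on the eight observations alone.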

The exam awards credit for any two defensible causes with sensible checks. The exact list above is one set of right answers.