MSE 125 — Applied Statistics
Monday, April 20, 2026
a model is 85% accurate
does it catch the patients who need it?
the problem
the data
the dumbest possible model beats random, passes QA, and saves zero lives
first we need a classifier
Framingham Heart Study: 4,240 patients, 10-year follow-up
outcome: TenYearCHD \(\in \{0, 1\}\) — did the patient develop coronary heart disease within 10 years?
“positive” = the outcome we’re trying to detect, not the desirable one
a straight-line (linear regression) fit gives predictions below 0 and above 1, which are not valid probabilities
\[p = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d\]
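a minimal sketch of the formula above in code. The coefficient values are made up for illustration (they are not fitted to the Framingham data), and `predict_proba` is a hypothetical helper name.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability between 0 and 1."""
    # Two branches keep exp() from overflowing when |z| is large.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, betas):
    """z = beta0 + beta_1 * x_1 + ... + beta_d * x_d, then apply the sigmoid."""
    z = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients and inputs (age = 50, BMI = 27), illustration only.
p = predict_proba(x=[50.0, 27.0], beta0=-6.0, betas=[0.06, 0.03])
print(round(p, 3))
```

whatever the inputs, the output stays between 0 and 1, which is what the straight-line fit could not guarantee.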
sportsbook: “the Celtics are 4-to-1 underdogs tonight”
Q: what probability of winning does that imply?
odds \(= \dfrac{P(\text{win})}{P(\text{lose})}\)
4-to-1 against → lose 4 games per 1 win → \(P(\text{win}) = \tfrac{1}{1+4} = 0.20\)
probability in \([0, 1)\) \(\leftrightarrow\) odds in \([0, \infty)\): same info, different scale (at \(p = 1\) the odds are infinite)
| probability \(p\) | odds \(p/(1-p)\) | log-odds |
|---|---|---|
| 0.01 | 0.01 | \(-4.6\) |
| 0.20 | 0.25 | \(-1.4\) |
| 0.50 | 1.00 | \(\phantom{-}0.0\) |
| 0.80 | 4.00 | \(+1.4\) |
| 0.99 | 99.0 | \(+4.6\) |
log-odds range from \(-\infty\) to \(+\infty\) — perfect for a linear model
\[\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots\]
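the table rows can be reproduced with a few lines of code. This is a sketch; `logit` and `inv_logit` are names chosen here, not taken from the lecture.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z: float) -> float:
    """Inverse of logit (the sigmoid): log-odds back to probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Same probabilities as the table rows.
for p in [0.01, 0.20, 0.50, 0.80, 0.99]:
    odds = p / (1.0 - p)
    print(f"p={p:.2f}  odds={odds:6.2f}  log-odds={logit(p):+.1f}")
```

`inv_logit(logit(p))` returns `p`, so the linear model can work on the unbounded log-odds scale and still report a probability.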
Pr(Win) = 0.6 → odds = \(\frac{0.6}{0.4} = 1.5\)
I double your odds → new odds = 3.0
Q: new Pr(Win) = ?
\[\text{Pr(Win)} = \frac{3.0}{1 + 3.0} = 0.75\]
doubling odds \(\neq\) doubling probability — the scales are nonlinear
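a small sketch of the odds arithmetic from the two examples (the 4-to-1 underdog and the doubled odds). The helper names are hypothetical.

```python
def prob_to_odds(p: float) -> float:
    """Odds in favor: P(win) / P(lose)."""
    return p / (1.0 - p)

def odds_to_prob(odds: float) -> float:
    """Invert the odds: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

def scale_odds(p: float, factor: float) -> float:
    """Multiply the odds by `factor` and convert back to a probability."""
    return odds_to_prob(factor * prob_to_odds(p))

# 4-to-1 against means odds in favor of 1/4.
underdog = odds_to_prob(1 / 4)
# Doubling the odds of a 0.6 favorite.
doubled = scale_odds(0.6, 2.0)
print(round(underdog, 2), round(doubled, 2))
```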
predicted risk climbs with age — we see the lower portion of the S-curve because the base rate is low
MLE (maximum likelihood estimation): pick \(\beta\) to make the observed labels most probable
for one observation with label \(y \in \{0, 1\}\) and predicted probability \(p\):
\[P(y \mid x) = p^{\,y} \, (1-p)^{1-y}\]
likelihood of \(n\) observations (assumed independent, so the probabilities multiply):
\[L(\beta) = \prod_{i=1}^n p_i^{\,y_i} \, (1-p_i)^{1-y_i}\]
take \(-\log\): products → sums, maximize → minimize
\[-\log L(\beta) = -\sum_{i=1}^n \big[ y_i \log p_i + (1-y_i) \log(1-p_i) \big]\]
\[\ell(\beta) = -\big[ y \log(p) + (1-y) \log(1-p) \big]\]
also called cross-entropy or negative log-likelihood
penalizes confident wrong predictions especially hard
Q: if \(y = 1\) and \(p = 0.9\), loss = \(-\log(0.9)\) ≈ ? what about \(p = 0.1\)?
no closed-form minimum. we’ll need an iterative solver.
a 10% prediction for a true 1? squared error 0.81, logistic loss 2.3
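to make the comparison concrete, here is a sketch that evaluates both losses for a true label of 1. The function names are chosen for this example.

```python
import math

def log_loss(y: int, p: float) -> float:
    """Per-observation logistic loss: -[y log p + (1 - y) log(1 - p)]."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def squared_error(y: int, p: float) -> float:
    """Per-observation squared error: (y - p)^2."""
    return (y - p) ** 2

# True label is 1; compare a confident-right and a confident-wrong prediction.
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"p={p:<4}  squared={squared_error(1, p):.2f}  logistic={log_loss(1, p):.2f}")
```

squared error can never exceed 1, while logistic loss grows without bound as \(p \to 0\) for a true 1.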
the loss function determines what the model finds
\[\beta \leftarrow \beta - \eta \cdot \nabla \big[-\log L(\beta)\big]\]
logistic loss is convex — one valley, no false minima

Q: will the loss curve drop smoothly, oscillate, or bounce around?
commit to a prediction — then we run 50 iterations
50 iterations of gradient descent on a logistic regression with age + BMI (body mass index): loss drops fast, then settles.
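a minimal gradient descent sketch for logistic regression. Everything here is an assumption for illustration: the data are synthetic (standardized stand-ins for age and BMI, not the Framingham file), and the true coefficients, the step size \(\eta = 0.5\), and the zero starting point are arbitrary choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(beta, X, y):
    """Average negative log-likelihood over all observations."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def gradient(beta, X, y):
    """Gradient of the mean log loss: average of (p_i - y_i) * x_i."""
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        for j, v in enumerate(xi):
            grad[j] += (p - yi) * v
    return [g / len(y) for g in grad]

# Synthetic data: intercept column plus two standardized features.
random.seed(0)
X, y = [], []
for _ in range(500):
    age, bmi = random.gauss(0, 1), random.gauss(0, 1)
    p_true = sigmoid(-1.7 + 0.6 * age + 0.2 * bmi)  # hypothetical true model
    X.append([1.0, age, bmi])
    y.append(1 if random.random() < p_true else 0)

beta = [0.0, 0.0, 0.0]
eta = 0.5  # learning rate (assumed)
losses = [mean_log_loss(beta, X, y)]
for _ in range(50):
    g = gradient(beta, X, y)
    beta = [b - eta * gj for b, gj in zip(beta, g)]
    losses.append(mean_log_loss(beta, X, y))

print(f"start loss {losses[0]:.3f}, final loss {losses[-1]:.3f}")
```

the starting loss is \(\log 2\) because every prediction begins at 0.5. With standardized features and a modest step size the loss falls at every iteration; a much larger \(\eta\) could make it oscillate instead.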
accuracy lies under class imbalance
Test accuracy: 0.855
85.5% accurate (at default threshold 0.5) — sounds great, right?
the “always predict no CHD” baseline:
Q: what accuracy does it get?
baseline accuracy = 0.848
our model accuracy = 0.855
improvement = 0.007
our fancy classifier barely beats the null

accuracy measures how often you’re right overall — when one class dominates, that’s easy and uninformative
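the baseline number follows directly from the class counts. A sketch using the test-set figures quoted in this lecture (1,096 test patients, 167 with CHD) and the reported model accuracy of 0.855:

```python
n_test = 1096      # test patients
n_chd = 167        # actual CHD cases in the test set
model_acc = 0.855  # reported test accuracy at threshold 0.5

# "Always predict no CHD" is right on every non-CHD patient and wrong on every CHD patient.
baseline_acc = (n_test - n_chd) / n_test
improvement = model_acc - round(baseline_acc, 3)

print(f"baseline = {baseline_acc:.3f}, improvement = {improvement:.3f}")
```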
1,096 test patients · 15% CHD base rate · threshold 0.5
| | predicted no CHD | predicted CHD |
|---|---|---|
| actually no CHD | TN (true negative) = ? | FP (false positive) = ? |
| actually CHD | FN (false negative) = ? | TP (true positive) = ? |
predict: of 167 actual CHD cases, how many does the model catch?
of 167 actual CHD cases, we caught 17
precision \[\frac{\text{TP}}{\text{TP} + \text{FP}}\]
“I flagged this patient — should I trust the flag?”
recall (sensitivity) \[\frac{\text{TP}}{\text{TP} + \text{FN}}\]
“of all the sick people, how many did we find?”
F1 score = harmonic mean of precision and recall — one number, but hides the cost asymmetry
two metrics → two different stories
Precision: 0.65
→ Of patients flagged as CHD, 65% actually had it
Recall: 0.10
→ Of actual CHD patients (167 in test set), we caught 17
accuracy hid a disaster — the model catches almost nobody
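a sketch that computes the metrics from confusion-matrix counts. TP = 17 and the 167 actual cases come from the numbers above; the false-positive count of 9 is an assumption, picked because it reproduces the reported precision (0.65) and accuracy (0.855). It is not a figure stated in the lecture.

```python
n_test = 1096
tp = 17                       # CHD cases caught (from the lecture)
fn = 167 - tp                 # CHD cases missed
fp = 9                        # ASSUMED: chosen to match the reported precision
tn = n_test - tp - fn - fp    # everyone else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / n_test

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.3f}")
```

under these counts F1 lands far below the accuracy figure, because the harmonic mean is pulled toward the weaker of precision and recall.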
which type of error would you rather the model make?
for each: flag more, or flag fewer?
flag if the expected savings beat the expected cost
flag whenever P(disease | flagged) > break-even precision
\[p^* = \frac{1}{k+1}\]
where \(k = C_{FN}/C_{FP}\) is the cost ratio of a missed case to a false alarm
CHD: \(k = 100 \Rightarrow p^* \approx 1\%\) — push hard for recall
spam: \(k \ll 1 \Rightarrow p^* \approx 99\%\) — protect precision
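a short derivation of the break-even rule, as a sketch. It assumes a correct flag avoids the full cost \(C_{FN}\) of a missed case, a wrong flag costs \(C_{FP}\), and \(p\) is the probability that the flagged patient has the disease. Flagging pays off when the expected savings exceed the expected cost:

\[p \cdot C_{FN} > (1 - p) \cdot C_{FP} \;\Longleftrightarrow\; p > \frac{C_{FP}}{C_{FN} + C_{FP}} = \frac{1}{k + 1}\]

check: \(k = 100\) gives \(p^* = 1/101 \approx 0.0099\), and \(k = 0.01\) gives \(p^* = 1/1.01 \approx 0.99\).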
each annotated point is one threshold — as recall climbs, precision falls
precision is P(disease | flagged) — read the y-axis as trustworthiness of a flag
stop where precision still beats \(p^*\) — beyond that, the marginal flag loses money
AUC (area under the curve) ≈ 0.75 — concordance: pick one CHD and one non-CHD patient; the model ranks the CHD patient higher 75% of the time (ties count as half)
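the concordance reading of AUC can be computed by brute force over all (CHD, non-CHD) pairs. A sketch with made-up scores; `concordance_auc` is a name chosen here.

```python
def concordance_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, matching the definition above.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy predicted risks (illustration only, not model output).
chd_scores = [0.30, 0.22, 0.10]
no_chd_scores = [0.25, 0.10, 0.08, 0.05]

auc = concordance_auc(chd_scores, no_chd_scores)
print(round(auc, 3))
```

the double loop is quadratic in the number of patients, which is fine for a demonstration; library implementations typically get the same number from sorted ranks.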
PR curve
axes: precision · recall
depends on base rate
operational view — most honest when the positive class is rare
ROC curve
axes: TPR (true positive rate = recall) · FPR (false positive rate)
invariant to base rate
threshold-free summary (AUC) · compare models across datasets
same tradeoff, different invariances
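a sketch of why the two curves differ in invariance: hold one ROC operating point fixed and vary the base rate. The operating point (TPR = 0.6, FPR = 0.2) is hypothetical.

```python
def precision_at(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed (TPR, FPR) point at a given base rate."""
    true_pos = tpr * prevalence            # share of population: sick and flagged
    false_pos = fpr * (1.0 - prevalence)   # share of population: healthy and flagged
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.6, 0.2  # hypothetical operating point; unchanged across rows
for prevalence in [0.01, 0.15, 0.50]:
    print(f"base rate {prevalence:.2f} -> precision {precision_at(tpr, fpr, prevalence):.2f}")
```

the ROC point stays where it is in every row, while the precision it implies shifts with the base rate.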
you’re reporting to the hospital board on your CHD model
do you show the PR curve or the ROC curve?
one-sentence justification
low threshold → catches more, but each flag is less reliable
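a sketch of the threshold tradeoff on ten made-up patients (scores and labels are invented for illustration). `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag every score >= threshold, then return (precision, recall)."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    tp = sum(flagged)
    total_pos = sum(labels)
    precision = tp / len(flagged) if flagged else None  # undefined with no flags
    recall = tp / total_pos
    return precision, recall

# Toy data: predicted risk and true outcome (1 = CHD).
scores = [0.05, 0.08, 0.10, 0.12, 0.20, 0.25, 0.30, 0.45, 0.55, 0.70]
labels = [0,    0,    0,    1,    0,    0,    1,    0,    1,    1]

for t in [0.5, 0.3, 0.1]:
    print(t, precision_recall_at(t, scores, labels))
```

each threshold in the loop corresponds to one point on the PR curve.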
aggregate metrics hide local failures
a vendor pitches a cancer screening test as 99% accurate:
cancer prevalence: 0.5%
a patient tests positive — P(cancer | positive)?
A. 99% · B. 67% · C. 33% · D. 1%
false positives swamp true positives when the condition is rare
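a worked version of the screening question with Bayes' rule. The vendor's "99% accurate" is ambiguous; this sketch assumes it means both sensitivity and specificity are 0.99.

```python
prevalence = 0.005    # 0.5% of patients have cancer
sensitivity = 0.99    # ASSUMED: P(positive | cancer)
specificity = 0.99    # ASSUMED: P(negative | no cancer)

true_pos = sensitivity * prevalence                   # P(positive and cancer)
false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(positive and no cancer)

ppv = true_pos / (true_pos + false_pos)               # P(cancer | positive)
print(f"P(cancer | positive) = {ppv:.2f}")
```

per 100,000 patients that is about 495 true positives against 995 false positives, so roughly one positive in three is real.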
aggregate AUC is 0.75 · reminder: age is the model’s strongest signal
predict: which age group does the model fail on?
commit to a group — and say why
for patients under 40, AUC = 0.36 — below 0.5, but based on only ~6 CHD cases; the estimate is noisy

bin predictions into deciles
does “10% risk” mean 10% observed rate?
AUC measures ranking
calibration measures absolute probabilities
a model can have good AUC and bad calibration (or vice versa) — which matters depends on how you use the output
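a sketch of the decile check on synthetic data. The predictions here are calibrated by construction (each outcome is drawn with the predicted probability), so predicted and observed rates should roughly agree; `calibration_table` is a hypothetical helper.

```python
import random

def calibration_table(probs, labels, n_bins=10):
    """Sort by predicted risk, split into equal-size bins, and
    return (mean predicted, observed rate, count) for each bin."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    rows = []
    for b in range(n_bins):
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        if not idx:
            continue
        mean_pred = sum(probs[i] for i in idx) / len(idx)
        observed = sum(labels[i] for i in idx) / len(idx)
        rows.append((mean_pred, observed, len(idx)))
    return rows

# Synthetic, calibrated-by-construction predictions (illustration only).
random.seed(1)
probs = [random.uniform(0.0, 0.6) for _ in range(5000)]
labels = [1 if random.random() < p else 0 for p in probs]

table = calibration_table(probs, labels)
for mean_pred, observed, n in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={n}")
```

a miscalibrated model would show a systematic gap between the two columns even if its ranking, and therefore its AUC, were unchanged.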
the question that opened the lecture
a hospital’s model is “85% accurate” — deploy it?
now you know what to ask:
