Lecture 7: Classification — Logistic Regression and Metrics

MSE 125 — Applied Statistics

Madeleine Udell

Monday, April 20, 2026

logistics

  • quiz 3: Wed April 22, Lec 6-7 (validation + classification)
  • HW 2: due Fri April 24, regression, validation, classification
  • project proposal: due Fri May 1

a model is 85% accurate

does it catch the patients who need it?

the readmissions setup

the problem

  • hospital wants to flag 30-day readmissions
  • missed readmission costs up to $25,000 in care
  • CMS penalizes hospitals with excess readmission rates: up to 3% of total reimbursements

the data

  • 15% of patients get readmitted
  • 85% don’t
  • predict “no readmission” for everyone → 85% accuracy

the dumbest possible model beats random, passes QA, and saves zero lives

today

  • the mechanics: logistic regression, gradient descent
  • the evaluation trap: why accuracy lies
  • threshold choice: no free lunch
  • what accuracy hides: base rates, subgroups, calibration

first we need a classifier

the outcome is binary

Framingham Heart Study: 4,240 patients, 10-year follow-up

  • started in 1948 in Framingham, Massachusetts
  • first study to identify cholesterol, blood pressure, and smoking as heart disease risk factors
  • still running, now on its third generation of participants

outcome: TenYearCHD \(\in \{0, 1\}\): did the patient develop coronary heart disease within 10 years?

  • CHD = 1: positive class
  • no CHD = 0: negative class

“positive” = the outcome we’re trying to detect, not the desirable one

can we just run linear regression?

predictions below 0 and above 1: not valid probabilities

the trick: squeeze the line through a sigmoid

\[p = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d\]
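a minimal sketch of the sigmoid in plain Python, to show how any real-valued score \(z\) gets squeezed toward 0 or 1. the function name `sigmoid` and the sample \(z\) values are illustrative assumptions, not lecture code.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a value between 0 and 1."""
    # Branch on the sign so math.exp never sees a large positive argument
    # (which would overflow for very negative z in the naive formula).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Illustrative scores: large negative -> near 0, zero -> 0.5, large positive -> near 1.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z = {z:+.1f} -> p = {sigmoid(z):.3f}")
```

the curve is symmetric: \(\sigma(-z) = 1 - \sigma(z)\), so \(z = 0\) is exactly the 50% point.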

sports bettor’s intuition

sportsbook: “the Celtics are 4-to-1 underdogs tonight”

Q: what probability of winning does that imply?

odds \(= \dfrac{P(\text{win})}{P(\text{lose})}\)

4-to-1 against → lose 4 games per 1 win → \(P(\text{win}) = \tfrac{1}{1+4} = 0.20\)

probability in \([0, 1)\) \(\leftrightarrow\) odds in \([0, \infty)\): same info, different scale

from probabilities to log-odds

probability \(p\)    odds \(p/(1-p)\)    log-odds
0.01                 0.01                \(-4.6\)
0.20                 0.25                \(-1.4\)
0.50                 1.00                \(\phantom{-}0.0\)
0.80                 4.00                \(+1.4\)
0.99                 99.0                \(+4.6\)

log-odds range from \(-\infty\) to \(+\infty\): perfect for a linear model

\[\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots\]
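a short sketch that recomputes the probability → odds → log-odds table with the natural log. the helper names `odds` and `log_odds` are assumptions made for illustration.

```python
import math

def odds(p: float) -> float:
    """Odds in favor: P(event) / P(no event). Requires 0 <= p < 1."""
    return p / (1.0 - p)

def log_odds(p: float) -> float:
    """Natural log of the odds (the logit). Requires 0 < p < 1."""
    return math.log(odds(p))

# Same probabilities as the table in the slides.
for p in (0.01, 0.20, 0.50, 0.80, 0.99):
    print(f"p = {p:.2f}  odds = {odds(p):6.2f}  log-odds = {log_odds(p):+.1f}")
```

note the symmetry: \(p\) and \(1-p\) give log-odds of equal size and opposite sign.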

check your intuition

Pr(Win) = 0.6 → odds = \(\frac{0.6}{0.4} = 1.5\)

I double your odds → new odds = 3.0

Q: new Pr(Win) = ?

\[\text{Pr(Win)} = \frac{3.0}{1 + 3.0} = 0.75\]

doubling odds \(\neq\) doubling probability. the scales are nonlinear
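the same check in code, going odds → probability this time. `prob_to_odds` and `odds_to_prob` are hypothetical helper names.

```python
def prob_to_odds(p: float) -> float:
    """Odds in favor, p / (1 - p)."""
    return p / (1.0 - p)

def odds_to_prob(o: float) -> float:
    """Inverse map: odds / (1 + odds)."""
    return o / (1.0 + o)

p_old = 0.6
p_new = odds_to_prob(2.0 * prob_to_odds(p_old))  # double the odds, convert back
print(f"old Pr(Win) = {p_old:.2f}, new Pr(Win) = {p_new:.2f}")

# The sportsbook example: 4-to-1 against means odds in favor of 1/4.
print(f"4-to-1 underdog: Pr(Win) = {odds_to_prob(1 / 4):.2f}")
```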

logistic regression on age alone

predicted risk climbs with age. we see the lower portion of the S-curve because the base rate is low

where does the loss come from?

MLE (maximum likelihood estimation): pick \(\beta\) to make the observed labels most probable

for one observation with label \(y \in \{0, 1\}\) and predicted probability \(p\):

\[P(y \mid x) = p^{\,y} \, (1-p)^{1-y}\]

likelihood of \(n\) observations. independent, so multiply:

\[L(\beta) = \prod_{i=1}^n p_i^{\,y_i} \, (1-p_i)^{1-y_i}\]

take \(-\log\): products → sums, maximize → minimize

\[-\log L(\beta) = -\sum_{i=1}^n \big[ y_i \log p_i + (1-y_i) \log(1-p_i) \big]\]

the logistic loss

\[\ell(\beta) = -\big[ y \log(p) + (1-y) \log(1-p) \big]\]

also called cross-entropy or negative log-likelihood

penalizes confident wrong predictions especially hard

  • \(y = 1\), \(p \to 0\): \(-\log(p) \to \infty\)
  • \(y = 0\), \(p \to 1\): \(-\log(1-p) \to \infty\)

Q: if \(y = 1\) and \(p = 0.9\), loss = \(-\log(0.9)\) ≈ ? what about \(p = 0.1\)?

  • \(-\log(0.9) \approx 0.11\) (confident, correct → small)
  • \(-\log(0.1) \approx 2.3\) (confident, wrong → large)

no closed-form minimum. we’ll need an iterative solver.

squared error vs. logistic loss

a 10% prediction for a true 1? squared error 0.81, logistic loss 2.3
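a small sketch comparing the two losses for a true label \(y = 1\) at a few predicted probabilities. the function names are assumptions; the logistic loss uses the natural log and needs \(0 < p < 1\).

```python
import math

def logistic_loss(y: int, p: float) -> float:
    """Per-observation logistic (cross-entropy) loss; requires 0 < p < 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y: int, p: float) -> float:
    """Per-observation squared error."""
    return (y - p) ** 2

# True label is 1; predictions go from confident-right to confident-wrong.
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:.2f}  squared error = {squared_error(1, p):.2f}  "
          f"logistic loss = {logistic_loss(1, p):.2f}")
```

squared error can never exceed 1 here, while the logistic loss keeps growing as \(p \to 0\).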

the loss function determines what the model finds

gradient descent: hiking downhill

\[\beta \leftarrow \beta - \eta \cdot \nabla \ell(\beta)\]

  • \(\nabla \ell(\beta)\) = gradient of the logistic loss, summed over observations (uphill direction); step the opposite way
  • \(\eta\) = learning rate (step size)

logistic loss is convex: one valley, no false minima
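a minimal sketch of gradient descent on the logistic loss, in plain Python. assumptions to flag: the toy dataset (an intercept column plus one feature, labels deliberately overlapping so the data is not separable), the learning rate `eta=0.5`, and the use of the mean loss rather than the sum are all illustrative choices, not the lecture's setup.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(beta, xi):
    """Predicted probability for one row xi (first entry is the intercept 1.0)."""
    return sigmoid(sum(b * x for b, x in zip(beta, xi)))

def mean_logistic_loss(beta, X, y):
    """Average negative log-likelihood over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(beta, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)

def gradient(beta, X, y):
    """Gradient of the mean logistic loss: average of (p_i - y_i) * x_i."""
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        err = predict(beta, xi) - yi
        for j, x in enumerate(xi):
            grad[j] += err * x
    return [g / len(y) for g in grad]

def gradient_descent(X, y, eta=0.5, n_iter=50):
    """Run n_iter steps of beta <- beta - eta * gradient, recording the loss."""
    beta = [0.0] * len(X[0])
    losses = []
    for _ in range(n_iter):
        losses.append(mean_logistic_loss(beta, X, y))
        g = gradient(beta, X, y)
        beta = [b - eta * gj for b, gj in zip(beta, g)]
    return beta, losses

# Toy data (assumed): column 0 is the intercept, column 1 a standardized feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.0],
     [1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]

beta, losses = gradient_descent(X, y)
print(f"first loss = {losses[0]:.3f}, last loss = {losses[-1]:.3f}")
print(f"fitted beta = [{beta[0]:.3f}, {beta[1]:.3f}]")
```

starting from \(\beta = 0\) every prediction is 0.5, so the first recorded loss is \(\log 2\). with a small enough step size the loss can only go down from there.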

watch it converge

Q: will the loss curve drop smoothly, oscillate, or bounce around?

commit to a prediction. then we run 50 iterations

50 iterations on age + BMI (body mass index) logistic regression. loss drops fast, then settles.

accuracy lies under class imbalance

fit it, test it

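# setup: Framingham data, 9 features, standardized, 70/30 split
from sklearn.linear_model import LogisticRegression

# penalty=None (scikit-learn >= 1.2) fits plain, unregularized maximum likelihood
model = LogisticRegression(penalty=None, max_iter=1000).fit(X_train_scaled, y_train)
test_acc = model.score(X_test_scaled, y_test)  # accuracy at the default 0.5 threshold
print(f"Test accuracy: {test_acc:.3f}")
Test accuracy: 0.855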

85.5% accurate (at default threshold 0.5). sounds great, right?

but wait

the “always predict no CHD” baseline:

Q: what accuracy does it get?

baseline accuracy  = 0.848
our model accuracy = 0.855
improvement        = 0.007

our fancy classifier barely beats the null

accuracy measures how often you’re right overall. when one class dominates, that’s easy and uninformative
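a quick sketch of where the baseline number comes from, using the test-set counts quoted on the confusion-matrix slide (1,096 test patients, 167 actual CHD cases). variable names are assumptions.

```python
n_test = 1096      # test patients (from the slides)
n_chd = 167        # actual CHD cases in the test set (from the slides)
model_acc = 0.855  # reported test accuracy of the logistic regression

# "Always predict no CHD" is right exactly on the patients without CHD.
baseline_acc = (n_test - n_chd) / n_test

print(f"baseline accuracy  = {baseline_acc:.3f}")
print(f"our model accuracy = {model_acc:.3f}")
print(f"improvement        = {model_acc - baseline_acc:.3f}")
```

on real data, scikit-learn's `DummyClassifier(strategy="most_frequent")` gives the same kind of baseline without hand-counting.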

the confusion matrix

1,096 test patients · 15% CHD base rate · threshold 0.5

                   predicted no CHD            predicted CHD
actually no CHD    TN (true negative) = ?      FP (false positive) = ?
actually CHD       FN (false negative) = ?     TP (true positive) = ?

predict: of 167 actual CHD cases, how many does the model catch?

  • fewer than 20
  • 20–50
  • 50–100
  • more than 100

of 167 actual CHD cases, we caught 17

precision vs. recall

precision \[\frac{\text{TP}}{\text{TP} + \text{FP}}\]

“I flagged this patient: should I trust the flag?”

recall (sensitivity) \[\frac{\text{TP}}{\text{TP} + \text{FN}}\]

“of all the sick people, how many did we find?”

F1 score = harmonic mean of precision and recall: one number, but hides the cost asymmetry

two metrics → two different stories

our model’s precision and recall

Precision: 0.65
  → Of patients flagged as CHD, 65% actually had it

Recall:    0.10
  → Of actual CHD patients (167 in test set), we caught 17

accuracy hid a disaster. the model catches almost nobody
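a sketch of how these metrics fall out of the four confusion-matrix counts. TP = 17 and the 167 actual CHD cases come from the slides; FP = 9 is an assumption, chosen only because it reproduces the reported precision of 0.65 and accuracy of 0.855. the actual count is not in the text.

```python
n_test = 1096  # test patients (from the slides)
n_chd = 167    # actual CHD cases (from the slides)
tp = 17        # CHD cases the model caught (from the slides)
fp = 9         # ASSUMED: not stated in the slides, picked to match precision 0.65

fn = n_chd - tp                  # CHD cases the model missed
tn = n_test - n_chd - fp         # healthy patients correctly left unflagged

precision = tp / (tp + fp)       # of flagged patients, how many had CHD
recall = tp / (tp + fn)          # of CHD patients, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / n_test

print(f"TN = {tn}, FP = {fp}, FN = {fn}, TP = {tp}")
print(f"precision = {precision:.2f}, recall = {recall:.2f}, "
      f"F1 = {f1:.2f}, accuracy = {accuracy:.3f}")
```

F1 lands close to the weaker of the two metrics: the harmonic mean punishes a lopsided pair.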

which type of error would you rather the model make?

  • CHD screening:
    • FP → $500 follow-up
    • FN → $50K emergency care
  • spam filter:
    • FP → real email lost → missed meeting
    • FN → spam in inbox → mild annoyance

for each: flag more, or flag fewer?

when is a flag worth it?

flag if the expected savings beat the expected cost

flag whenever P(disease | flagged) > break-even precision

\[p^* = \frac{1}{k+1}\]

where \(k = C_{FN}/C_{FP}\) is the cost ratio of a missed case to a false alarm

CHD: \(k = 100 \Rightarrow p^* \approx 1\%\), push hard for recall

spam: \(k \ll 1 \Rightarrow p^* \approx 99\%\), protect precision
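why \(p^* = 1/(k+1)\): a short derivation, under the simplifying assumption (not stated on the slide) that a false alarm costs \(C_{FP}\), a missed case costs \(C_{FN}\), and correct decisions cost nothing extra. for a patient whose probability of disease is \(p\), flagging is worth it when the expected cost of not flagging exceeds the expected cost of flagging:

\[p \, C_{FN} > (1-p) \, C_{FP} \iff p > \frac{C_{FP}}{C_{FN} + C_{FP}} = \frac{1}{C_{FN}/C_{FP} + 1} = \frac{1}{k+1}\]

with the CHD costs above (\(C_{FP} = \$500\), \(C_{FN} = \$50{,}000\)): \(k = 100\), so \(p^* = 1/101 \approx 0.0099\).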

precision-recall across thresholds

each annotated point is one threshold. as recall climbs, precision falls

precision is P(disease | flagged): read the y-axis as trustworthiness of a flag

stop where precision still beats \(p^*\). beyond that, the marginal flag loses money

another view: ROC (receiver operating characteristic)

AUC (area under the curve) ≈ 0.75. concordance: pick one CHD and one non-CHD patient; the model ranks the CHD patient higher 75% of the time (ties count as half)
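a minimal sketch of the concordance reading of AUC: compare every (positive, negative) pair and count how often the positive gets the higher score, with ties counted as half. the function name and the toy scores are assumptions.

```python
def concordance_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))

# Toy predicted risks (assumed): three CHD patients, four non-CHD patients.
pos = [0.9, 0.6, 0.4]
neg = [0.7, 0.4, 0.2, 0.1]

print(f"AUC = {concordance_auc(pos, neg):.3f}")
```

the pairwise loop is quadratic, fine for a sketch. a score that ignores the label entirely (every patient tied) gets 0.5 under this definition.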

PR vs ROC: when to use which

PR curve

axes: precision · recall

depends on base rate

operational view: most honest when the positive class is rare

ROC curve

axes: TPR (true positive rate = recall) · FPR (false positive rate)

invariant to base rate

threshold-free summary (AUC) · compare models across datasets

same tradeoff, different invariances

you’re reporting to the hospital board on your CHD model

do you show the PR curve or the ROC curve?

one-sentence justification

threshold effect on metrics

low threshold → catches more, but each flag is less reliable
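a sketch of the threshold sweep on made-up data. the scores, labels, and the convention "flag when score ≥ threshold" are all assumptions for illustration. real curves come from the model's predicted probabilities on the test set.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Precision is undefined when nothing is flagged.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn)
    return precision, recall

# Toy predicted risks and true labels (assumed), 4 positives out of 10.
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

for t in (0.2, 0.5, 0.7):
    prec, rec = precision_recall_at(t, scores, labels)
    print(f"threshold = {t:.1f}  precision = {prec:.2f}  recall = {rec:.2f}")
```

on this toy data, lowering the threshold from 0.7 to 0.2 raises recall and lowers precision, the same tradeoff as in the plot.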

aggregate metrics hide local failures

a vendor pitches a cancer screening test as 99% accurate:

  • 99% sensitivity: catches 99% of cancer cases
  • 99% specificity: correctly clears 99% of healthy patients

cancer prevalence: 0.5%

a patient tests positive: P(cancer | positive)?

A. 99% · B. 67% · C. 33% · D. 1%

work it out on 10,000 patients

false positives swamp true positives when the condition is rare
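the 10,000-patient arithmetic as a sketch, using only the numbers from the vendor pitch (99% sensitivity, 99% specificity, 0.5% prevalence). variable names are assumptions.

```python
n = 10_000
prevalence = 0.005
sensitivity = 0.99   # P(test positive | cancer)
specificity = 0.99   # P(test negative | healthy)

sick = n * prevalence                     # patients with cancer
healthy = n - sick                        # patients without cancer
true_pos = sick * sensitivity             # cancer cases the test catches
false_pos = healthy * (1 - specificity)   # healthy patients wrongly flagged

# P(cancer | positive) = true positives / all positives
ppv = true_pos / (true_pos + false_pos)

print(f"cancer: {sick:.0f}, healthy: {healthy:.0f}")
print(f"true positives: {true_pos:.1f}, false positives: {false_pos:.1f}")
print(f"P(cancer | positive) = {ppv:.2f}")
```

about 50 sick patients produce about 50 true positives, while 9,950 healthy patients produce about 100 false positives, so roughly one positive in three is real (answer C).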

aggregate AUC is 0.75 · remember: age is the model’s strongest signal

predict: which age group does the model fail on?

  • under 40
  • 40–54
  • 55+

commit to a group. say why

subgroup AUC by age

for patients under 40, AUC = 0.36: below 0.5, but based on only ~6 CHD cases; the estimate is noisy

calibration: do probabilities mean what they say?

bin predictions into deciles

does “10% risk” mean 10% observed rate?

AUC measures ranking

calibration measures absolute probabilities

a model can have good AUC and bad calibration (or vice versa). which matters depends on how you use the output
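a minimal sketch of a calibration check: sort patients by predicted risk, split them into equal-sized bins (deciles when `n_bins=10`), and compare the mean prediction with the observed rate in each bin. the function name, the toy data, and the use of two bins in the demo are assumptions.

```python
def calibration_by_quantile(probs, labels, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per bin."""
    pairs = sorted(zip(probs, labels))  # order patients by predicted risk
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than patients: skip empty bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows

# Toy data (assumed): four patients predicted at 10% risk, four at 50%.
probs = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5]
labels = [0, 0, 0, 1, 1, 0, 1, 0]

for mean_pred, obs_rate, count in calibration_by_quantile(probs, labels, n_bins=2):
    print(f"predicted = {mean_pred:.2f}  observed = {obs_rate:.2f}  n = {count}")
```

in this toy example the low-risk bin is under-predicted (10% predicted, 25% observed) while the high-risk bin matches. with bins this small the observed rates are very noisy, so treat the gap as an illustration only.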

back to the readmissions hook

the question that opened the lecture

a hospital’s model is “85% accurate”. deploy it?

now you know what to ask:

  • what’s the baseline? (accuracy trap)
  • what’s precision and recall at the chosen threshold?
  • ROC curve: is the chosen threshold a good point on the tradeoff?
  • does the model work across all patient groups?
  • are the probabilities calibrated?

key takeaways

  • accuracy is misleading with imbalanced classes: check the baseline first
  • the loss function determines what the model finds: logistic loss penalizes confident wrong answers
  • gradient descent on a convex loss finds the global minimum (unique when the data isn’t separable)
  • precision / recall reveal what accuracy hides: the right threshold is a cost-benefit call
  • AUC summarizes ranking across thresholds; calibration checks absolute probabilities
  • always check subgroup performance: aggregate metrics hide local failures

next time

  • Chapter 8: bootstrap: quantify uncertainty in any estimate (accuracy, AUC, odds ratio, …)
  • Chapter 12: formal inference on logistic coefficients (confidence intervals, p-values)
  • Chapter 13: decision trees: same classification problem, very different model

one-minute feedback

  1. what was the most useful thing you learned today?
  2. what was the most confusing?
