Lecture 7: Classification — Logistic Regression and Metrics

MSE 125 — Applied Statistics

Madeleine Udell

Monday, April 20, 2026

logistics

  • quiz 3: Wed April 22 — Lec 6-7 (validation + classification)
  • HW 2: due Fri April 24 — regression, validation, classification
  • project proposal: due Fri May 1

a model is 85% accurate

does it catch the patients who need it?

the readmissions setup

the problem

  • hospital wants to flag 30-day readmissions
  • missed readmission costs up to $25,000 in care
  • CMS penalizes hospitals with excess readmission rates — up to 3% of total reimbursements

the data

  • 15% of patients get readmitted
  • 85% don’t
  • predict “no readmission” for everyone →
  • 85% accuracy

the dumbest possible model beats random, passes QA, and saves zero lives

today

  • the mechanics: logistic regression, gradient descent
  • the evaluation trap: why accuracy lies
  • threshold choice: no free lunch
  • what accuracy hides: base rates, subgroups, calibration

first we need a classifier

the outcome is binary

Framingham Heart Study: 4,240 patients, 10-year followup

  • started in 1948 in Framingham, Massachusetts
  • first study to identify cholesterol, blood pressure, and smoking as heart disease risk factors
  • still running — now on its third generation of participants

outcome: TenYearCHD \(\in \{0, 1\}\) — did the patient develop coronary heart disease within 10 years?

  • CHD = 1 — positive class
  • no CHD = 0 — negative class

“positive” = the outcome we’re trying to detect, not the desirable one

can we just run linear regression?

predictions below 0 and above 1 — not valid probabilities

the trick: squeeze the line through a sigmoid

\[p = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d\]
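a minimal sketch of the sigmoid squeeze, in plain Python. the coefficient values below are hypothetical and not fitted to any data; the helper names `sigmoid` and `predict_proba` are chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability between 0 and 1."""
    # Branch on the sign of z so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(beta0: float, betas: list[float], x: list[float]) -> float:
    """p = sigma(beta0 + beta1*x1 + ... + betad*xd)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients, for illustration only.
print(sigmoid(0.0))                       # 0.5
print(predict_proba(-1.0, [0.5], [2.0]))  # z = 0, so 0.5
```

whatever the value of the linear score, the output stays a valid probability, which is exactly what the straight line could not guarantee.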

sports bettor’s intuition

sportsbook: “the Celtics are 4-to-1 underdogs tonight”

Q: what probability of winning does that imply?

odds \(= \dfrac{P(\text{win})}{P(\text{lose})}\)

4-to-1 against → lose 4 games per 1 win → \(P(\text{win}) = \tfrac{1}{1+4} = 0.20\)

probability in \([0, 1)\) \(\leftrightarrow\) odds in \([0, \infty)\): same info, different scale (a probability of exactly 1 has no finite odds)

from probabilities to log-odds

probability \(p\)    odds \(p/(1-p)\)    log-odds \(\log\frac{p}{1-p}\)
0.01                 0.01                \(-4.6\)
0.20                 0.25                \(-1.4\)
0.50                 1.00                \(\phantom{-}0.0\)
0.80                 4.00                \(+1.4\)
0.99                 99.0                \(+4.6\)

log-odds range from \(-\infty\) to \(+\infty\) — perfect for a linear model

\[\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots\]
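the conversions between the three scales can be sketched in a few lines. this is an illustrative helper set (the function names are not from the lecture); the loop runs over the same probabilities as the table above.

```python
import math

def odds(p: float) -> float:
    """Odds in favor: P(event) / P(no event)."""
    return p / (1.0 - p)

def log_odds(p: float) -> float:
    """Natural log of the odds (the logit)."""
    return math.log(odds(p))

def prob_from_log_odds(z: float) -> float:
    """Inverse of log_odds: the sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

for p in [0.01, 0.20, 0.50, 0.80, 0.99]:
    print(f"p={p:.2f}  odds={odds(p):.2f}  log-odds={log_odds(p):+.1f}")
```

going from log-odds back to a probability is the sigmoid, so the logistic model is a linear model on the log-odds scale.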

check your intuition

Pr(Win) = 0.6 → odds = \(\frac{0.6}{0.4} = 1.5\)

I double your odds → new odds = 3.0

Q: new Pr(Win) = ?

\[\text{Pr(Win)} = \frac{3.0}{1 + 3.0} = 0.75\]

doubling odds \(\neq\) doubling probability — the scales are nonlinear
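the odds-doubling check can be reproduced with a small helper (the name `scale_odds` is made up for this sketch). multiplying the odds by 2 is the same as adding \(\log 2\) to the log-odds.

```python
def scale_odds(p: float, factor: float) -> float:
    """Multiply the odds of p by `factor`, then convert back to a probability."""
    new_odds = factor * p / (1.0 - p)
    return new_odds / (1.0 + new_odds)

print(round(scale_odds(0.6, 2.0), 2))  # 0.75
```

starting from 0.6, doubling the odds gives 0.75, well short of a doubled probability, which would be 1.2 and not a probability at all.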

logistic regression on age alone

predicted risk climbs with age — we see the lower portion of the S-curve because the base rate is low

where does the loss come from?

MLE (maximum likelihood estimation): pick \(\beta\) to make the observed labels most probable

for one observation with label \(y \in \{0, 1\}\) and predicted probability \(p\):

\[P(y \mid x) = p^{\,y} \, (1-p)^{1-y}\]

likelihood of \(n\) observations: they are assumed independent, so multiply:

\[L(\beta) = \prod_{i=1}^n p_i^{\,y_i} \, (1-p_i)^{1-y_i}\]

take \(-\log\): products → sums, maximize → minimize

\[-\log L(\beta) = -\sum_{i=1}^n \big[ y_i \log p_i + (1-y_i) \log(1-p_i) \big]\]
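a quick numerical check that the product and the sum say the same thing. the labels and predicted probabilities below are hypothetical, chosen only to illustrate the identity.

```python
import math

def likelihood(y, p):
    """Product of p_i^y_i * (1 - p_i)^(1 - y_i) over all observations."""
    out = 1.0
    for yi, pi in zip(y, p):
        out *= pi**yi * (1.0 - pi) ** (1 - yi)
    return out

def neg_log_likelihood(y, p):
    """Sum form: -sum[y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    return -sum(
        yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )

# Hypothetical labels and predicted probabilities.
y = [1, 0, 0, 1]
p = [0.8, 0.3, 0.1, 0.6]

print(-math.log(likelihood(y, p)))
print(neg_log_likelihood(y, p))
```

the two printed values agree up to floating-point rounding. the sum form is the one to use in practice, because a product of many probabilities underflows to zero quickly.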

the logistic loss

\[\ell(\beta) = -\big[ y \log(p) + (1-y) \log(1-p) \big]\]

also called cross-entropy or negative log-likelihood

penalizes confident wrong predictions especially hard

  • \(y = 1\), \(p \to 0\): \(-\log(p) \to \infty\)
  • \(y = 0\), \(p \to 1\): \(-\log(1-p) \to \infty\)

Q: if \(y = 1\) and \(p = 0.9\), loss = \(-\log(0.9)\) ≈ ? what about \(p = 0.1\)?

  • \(-\log(0.9) \approx 0.11\) (confident, correct → small)
  • \(-\log(0.1) \approx 2.3\) (confident, wrong → large)

no closed-form minimum. we’ll need an iterative solver.

squared error vs. logistic loss

a 10% prediction for a true 1? squared error 0.81, logistic loss 2.3

the loss function determines what the model finds
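the comparison above, sketched for a true label of 1 at a few predicted probabilities. the function names are illustrative.

```python
import math

def squared_error(y: int, p: float) -> float:
    return (y - p) ** 2

def logistic_loss(y: int, p: float) -> float:
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# True label is 1; the prediction gets steadily worse.
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"p={p:<5} squared={squared_error(1, p):.2f}  logistic={logistic_loss(1, p):.2f}")
```

squared error can never exceed 1 for a probability prediction, while logistic loss keeps growing as a wrong prediction becomes more confident.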

gradient descent: hiking downhill

\[\beta \leftarrow \beta - \eta \cdot \nabla L(\beta)\]

  • \(\nabla L(\beta)\) = gradient (uphill direction); step the opposite way
  • \(\eta\) = learning rate (step size)

logistic loss is convex — one valley, no false minima
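a minimal sketch of the update rule for a one-feature logistic regression, in plain Python. assumptions: the toy data are hypothetical (not the Framingham data), the loss is averaged over observations, which rescales the gradient by \(1/n\) without changing the minimizer, and the learning rate 0.5 is an arbitrary choice that happens to be stable here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(b0, b1, xs, ys):
    """Average logistic loss over the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def gradient(b0, b1, xs, ys):
    """Gradient of the mean loss: average of (p_i - y_i) times (1, x_i)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(b0 + b1 * x) - y
        g0 += err
        g1 += err * x
    n = len(xs)
    return g0 / n, g1 / n

# Hypothetical toy data: one standardized feature, overlapping classes.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1]

b0, b1, eta = 0.0, 0.0, 0.5
losses = []
for _ in range(50):
    losses.append(mean_loss(b0, b1, xs, ys))
    g0, g1 = gradient(b0, b1, xs, ys)
    b0 -= eta * g0  # step against the gradient
    b1 -= eta * g1

print(f"first loss = {losses[0]:.3f}, last loss = {losses[-1]:.3f}")
```

with all coefficients at zero every prediction is 0.5, so the starting loss is \(\log 2\). each step then lowers the loss, and the slope ends up positive because larger \(x\) goes with more positives in this toy set.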

watch it converge

Q: will the loss curve drop smoothly, oscillate, or bounce around?

commit to a prediction — then we run 50 iterations

50 iterations on age + BMI (body mass index) logistic regression. loss drops fast, then settles.

accuracy lies under class imbalance

fit it, test it

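# setup: Framingham data, 9 features, standardized, 70/30 split
# (X_train_scaled, X_test_scaled, y_train, y_test come from that setup)
from sklearn.linear_model import LogisticRegression

# penalty=None fits plain (unregularized) logistic regression
model = LogisticRegression(penalty=None, max_iter=1000).fit(X_train_scaled, y_train)
test_acc = model.score(X_test_scaled, y_test)  # accuracy at the default 0.5 threshold
print(f"Test accuracy: {test_acc:.3f}")
Test accuracy: 0.855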

85.5% accurate (at default threshold 0.5) — sounds great, right?

but wait

the “always predict no CHD” baseline:

Q: what accuracy does it get?

baseline accuracy  = 0.848
our model accuracy = 0.855
improvement        = 0.007

our fancy classifier barely beats the null

accuracy measures how often you’re right overall — when one class dominates, that’s easy and uninformative
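the baseline check takes one line of arithmetic. the sketch below uses the test-set counts reported in this lecture (167 CHD cases among 1,096 test patients) and the reported model accuracy of 0.855; the function name is illustrative.

```python
def majority_baseline_accuracy(n_positive: int, n_total: int) -> float:
    """Accuracy of always predicting the more common class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total

baseline = majority_baseline_accuracy(n_positive=167, n_total=1096)
model_acc = 0.855  # reported test accuracy

print(f"baseline = {baseline:.3f}, improvement = {model_acc - baseline:.3f}")
# baseline = 0.848, improvement = 0.007
```

running this check before celebrating any accuracy number is cheap, since it needs only the class counts.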

the confusion matrix

1,096 test patients · 15% CHD base rate · threshold 0.5

                   predicted no CHD            predicted CHD
actually no CHD    TN (true negative) = ?      FP (false positive) = ?
actually CHD       FN (false negative) = ?     TP (true positive) = ?

predict: of 167 actual CHD cases, how many does the model catch?

  • fewer than 20
  • 20–50
  • 50–100
  • more than 100

of 167 actual CHD cases, we caught 17

precision vs. recall

\[\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]

“I flagged this patient — should I trust the flag?”

\[\text{recall (sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]

“of all the sick people, how many did we find?”

F1 score = harmonic mean of precision and recall — one number, but hides the cost asymmetry

two metrics → two different stories

our model’s precision and recall

Precision: 0.65
  → Of patients flagged as CHD, 65% actually had it

Recall:    0.10
  → Of actual CHD patients (167 in test set), we caught 17

accuracy hid a disaster — the model catches almost nobody
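the metrics can be recomputed from confusion-matrix counts. TP = 17 and FN = 150 follow directly from the reported 17 caught out of 167 cases. FP = 9 and TN = 920 are not stated in the lecture; they are an assumption, back-solved from the reported precision of 0.65 and the 1,096 test patients.

```python
def precision(tp: int, fp: int) -> float:
    # Convention for this sketch: 0.0 when nothing is flagged.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

tp, fn = 17, 150   # from the lecture: 17 of 167 CHD cases caught
fp, tn = 9, 920    # ASSUMED: inferred from precision 0.65 and n = 1,096

prec, rec = precision(tp, fp), recall(tp, fn)
acc = (tp + tn) / (tp + tn + fp + fn)
print(f"precision={prec:.2f} recall={rec:.2f} F1={f1(prec, rec):.2f} accuracy={acc:.3f}")
```

under these counts the accuracy comes out at 0.855 while F1 is about 0.18, so the single-number summaries disagree sharply about the same model.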

which type of error would you rather the model make?

  • CHD screening:
    • FP → $500 follow-up
    • FN → $50K emergency care
  • spam filter:
    • FP → real email lost → missed meeting
    • FN → spam in inbox → mild annoyance

for each: flag more, or flag fewer?

when is a flag worth it?

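flag if the expected savings beat the expected cost: \(p \cdot C_{FN} > (1-p) \cdot C_{FP}\), where \(p\) is the probability that a flagged patient has the disease; solving for \(p\) gives the break-even point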

flag whenever P(disease | flagged) > break-even precision

\[p^* = \frac{1}{k+1}\]

where \(k = C_{FN}/C_{FP}\) is the cost ratio of a missed case to a false alarm

CHD: \(k = 100 \Rightarrow p^* \approx 1\%\) — push hard for recall

spam: \(k \ll 1 \Rightarrow p^* \approx 99\%\) — protect precision
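the break-even formula \(p^* = 1/(k+1)\) as a small function. the CHD costs are the ones quoted in this lecture ($50K missed case, $500 follow-up). the spam costs are hypothetical, picked so that \(k = 1/99\).

```python
def break_even_precision(cost_fn: float, cost_fp: float) -> float:
    """p* = 1 / (k + 1), with k = cost of a missed case / cost of a false alarm."""
    k = cost_fn / cost_fp
    return 1.0 / (k + 1.0)

# CHD screening: costs from the lecture, k = 100.
print(f"CHD:  p* = {break_even_precision(50_000, 500):.4f}")  # 0.0099

# Spam filter: hypothetical costs with k = 1/99.
print(f"spam: p* = {break_even_precision(1, 99):.2f}")        # 0.99
```

the threshold depends only on the ratio of the two costs, so rough cost estimates are often enough to pick a sensible operating point.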

precision-recall across thresholds

each annotated point is one threshold — as recall climbs, precision falls

precision is P(disease | flagged) — read the y-axis as trustworthiness of a flag

stop where precision still beats \(p^*\) — beyond that, the marginal flag loses money

another view: ROC (receiver operating characteristic)

AUC (area under the curve) ≈ 0.75 — concordance: pick one CHD and one non-CHD patient; the model ranks the CHD patient higher 75% of the time (ties count as half)
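the concordance reading of AUC can be computed directly by comparing every positive with every negative. the scores and labels below are hypothetical; the quadratic loop is fine for a sketch but slow on large data.

```python
def auc_concordance(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and true labels.
scores = [0.9, 0.4, 0.4, 0.2, 0.7, 0.1]
labels = [1, 1, 0, 0, 0, 0]
print(auc_concordance(scores, labels))  # 0.8125
```

only the ordering of the scores matters here: any monotone rescaling of the scores leaves the result unchanged.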

PR vs ROC — when to use which

PR curve

axes: precision · recall

depends on base rate

operational view — most honest when the positive class is rare

ROC curve

axes: TPR (true positive rate = recall) · FPR (false positive rate)

invariant to base rate

threshold-free summary (AUC) · compare models across datasets

same tradeoff, different invariances

you’re reporting to the hospital board on your CHD model

do you show the PR curve or the ROC curve?

one-sentence justification

threshold effect on metrics

low threshold → catches more, but each flag is less reliable
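a sketch of the threshold sweep on hypothetical scores (ten made-up patients, four with CHD). the function name and the data are illustrative only.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag every score >= threshold, then compute precision and recall."""
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# Hypothetical predicted risks and true labels.
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

for t in [0.5, 0.3, 0.1]:
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

in this toy example, lowering the threshold from 0.5 to 0.1 raises recall from 0.50 to 1.00 while precision drops from 0.67 to 0.44. recall can only rise as the threshold falls; precision usually falls but is not guaranteed to at every step.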

aggregate metrics hide local failures

a vendor pitches a cancer screening test as 99% accurate:

  • 99% sensitivity — catches 99% of cancer cases
  • 99% specificity — correctly clears 99% of healthy patients

cancer prevalence: 0.5%

a patient tests positive — P(cancer | positive)?

A. 99% · B. 67% · C. 33% · D. 1%

work it out on 10,000 patients

false positives swamp true positives when the condition is rare
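the 10,000-patient calculation, sketched with the vendor's numbers (99% sensitivity, 99% specificity, 0.5% prevalence). the function name is illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence, n=10_000):
    """Expected true positives, false positives, and P(disease | positive)."""
    sick = n * prevalence
    healthy = n - sick
    true_pos = sensitivity * sick             # sick patients who test positive
    false_pos = (1 - specificity) * healthy   # healthy patients who test positive
    return true_pos, false_pos, true_pos / (true_pos + false_pos)

tp, fp, ppv = positive_predictive_value(0.99, 0.99, 0.005)
print(f"true positives ~ {tp:.1f}, false positives ~ {fp:.1f}, P(cancer | positive) ~ {ppv:.2f}")
```

of 10,000 patients, 50 have cancer and about 49.5 of them test positive, but about 99.5 of the 9,950 healthy patients also test positive. roughly one positive in three is real, which is answer C.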

aggregate AUC is 0.75 · remember: age is the model’s strongest signal

predict: which age group does the model fail on?

  • under 40
  • 40–54
  • 55+

commit to a group — and say why

subgroup AUC by age

for patients under 40, AUC = 0.36 — below 0.5, but based on only ~6 CHD cases; the estimate is noisy

calibration — do probabilities mean what they say?

bin predictions into deciles

does “10% risk” mean 10% observed rate?

AUC measures ranking

calibration measures absolute probabilities

a model can have good AUC and bad calibration (or vice versa) — which matters depends on how you use the output
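a minimal sketch of the binning check: sort by predicted probability, cut into equal-sized bins, and compare the mean prediction with the observed rate in each bin. the data are hypothetical and small, so the example uses 2 bins; with real data `n_bins=10` gives deciles.

```python
def calibration_table(probs, labels, n_bins=10):
    """Return (mean predicted prob, observed positive rate, bin size) per bin."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins : (b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than observations
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows

# Hypothetical predictions: ten patients at 10% risk, ten at 50% risk.
probs = [0.1] * 10 + [0.5] * 10
labels = [1] + [0] * 9 + [1] * 5 + [0] * 5

rows = calibration_table(probs, labels, n_bins=2)
for mean_pred, obs_rate, size in rows:
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={size}")
```

in this toy set the predictions are well calibrated: the 10% group has a 10% observed rate and the 50% group has 50%. a gap between the two columns is what miscalibration looks like.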

back to the readmissions hook

the question that opened the lecture

a hospital’s model is “85% accurate” — deploy it?

now you know what to ask:

  • what’s the baseline? (accuracy trap)
  • what’s precision and recall at the chosen threshold?
  • ROC curve — is the chosen threshold a good point on the tradeoff?
  • does the model work across all patient groups?
  • are the probabilities calibrated?

key takeaways

  • accuracy is misleading with imbalanced classes — check the baseline first
  • the loss function determines what the model finds: logistic loss penalizes confident wrong answers
  • gradient descent on a convex loss finds the global minimum (unique when the data isn’t separable)
  • precision / recall reveal what accuracy hides — the right threshold is a cost-benefit call
  • AUC summarizes ranking across thresholds; calibration checks absolute probabilities
  • always check subgroup performance — aggregate metrics hide local failures

next time

  • Chapter 8: bootstrap — quantify uncertainty in any estimate (accuracy, AUC, odds ratio, …)
  • Chapter 12: formal inference on logistic coefficients (confidence intervals, p-values)
  • Chapter 13: decision trees — same classification problem, very different model

one-minute feedback

  1. what was the most useful thing you learned today?
  2. what was the most confusing?
