MSE 125 — Slides – Lecture 13: Decision Trees and Random Forests

logistics

HW 3 review sessions this week
Quiz 6 Wednesday May 13: regression inference + trees
project midterm report due Friday May 15

the brief

Airbnb host, NYC

goal: develop a pricing tool to help hosts set competitive nightly rates

ch 5: a reasonable R^2, but you picked the interactions, polynomial degree, and neighborhood encodings by hand

keep hand-engineering, or use a model that splits on its own?

today

decision trees: splits found automatically, no encoding needed
a single tree overfits: CV picks the depth
random forests: average overfit trees, watch variance collapse
trees vs. linear: geometry of the data picks the model
feature importance: what the forest uses, what to doubt

bridge from logistic regression

	logistic regression	decision tree
boundary	one hyperplane	many axis-aligned cuts
feature engineering	manual (Ch 5)	automatic
categories, missing	encode + impute	native
output	probability via \sigma(\beta^\top x)	mean of leaf

both follow the same recipe: fit, score, CV, predict.

how a tree splits

decision tree, in pictures

each leaf reports a predicted price + sample count

tree vocabulary

decision tree

recursive if/then splits on one feature at a time

leaf: node with no children; carries a prediction
internal node: any non-leaf node; asks a yes/no question, routes to a child
root: the single internal node at the top
depth: longest root-to-leaf path
max_depth: sklearn knob that limits depth

look at the second split

on the more-expensive side, the tree splits on bedrooms ≤ 1.5.

why did the algorithm pick bedrooms here, instead of bathrooms or accommodates?

how a split is chosen

at each node, search every (j, s): feature j, threshold s:

\min_{j,\, s} \left[ \sum_{x_i \in R_1(j,s)} (y_i - \bar{y}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \bar{y}_2)^2 \right]

(sum of squared errors, SSE)

each child predicts the mean \bar{y} of its training rows
algorithm enumerates every (feature, threshold) pair
pick the split that drops total SSE most
recurse inside each child
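
The search above can be sketched in a few lines of plain Python. This is a minimal illustration rather than sklearn's implementation: the helper names `sse` and `best_split`, the choice of midpoints as candidate thresholds, and the toy listings are all assumptions made for the example.

```python
def sse(values):
    """Sum of squared errors around the mean; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Exhaustively search (feature j, threshold s) for the lowest total SSE.

    X is a list of rows (lists of numbers), y a list of targets.
    Candidate thresholds are midpoints between consecutive distinct
    values of each feature. Returns (j, s, total_sse).
    """
    best = (None, None, sse(y))  # baseline: no split at all
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            total = sse(left) + sse(right)
            if total < best[2]:
                best = (j, s, total)
    return best


# toy data (made up): columns are [bedrooms, bathrooms], target is nightly price
X = [[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [3, 2]]
y = [100, 110, 220, 230, 300, 310]

j, s, total = best_split(X, y)
print(j, s, total)  # feature index, threshold, SSE after the split
```

On this toy data the search picks bedrooms (feature 0) at threshold 1.5, because that cut leaves the smallest total SSE in the two children. A full tree would call `best_split` again inside each child.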

greedy partitioning, step by step

greedy: best split now, never reconsidered
axis-aligned: every cut parallel to an axis; a diagonal becomes a staircase

classification: same recipe, swap the loss

Gini impurity

\text{Gini} = 1 - \sum_{k=1}^K \hat p_k^2

binary case: 2\hat p (1 - \hat p)

pure node: Gini = 0
50/50 binary node: Gini = 0.5
everything else (depth, CV, evaluation) is identical
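
A small sketch of the Gini formula, to check the pure and 50/50 cases by hand. The function names `gini` and `gini_binary` are illustrative, not library calls.

```python
def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def gini_binary(p):
    """Binary shortcut: 2 * p * (1 - p)."""
    return 2.0 * p * (1.0 - p)


pure = gini([1.0, 0.0])       # every row in one class
balanced = gini([0.5, 0.5])   # 50/50 binary node
mixed = gini([0.3, 0.7])      # agrees with gini_binary(0.3)
print(pure, balanced, mixed, gini_binary(0.3))
```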

deep enough to memorize, shallow enough to generalize

do deep trees overfit?

we’re about to fit a single tree at depths 1–20 on Airbnb, with min_samples_leaf=5.

sketch what train R^2 and test R^2 look like as depth grows.

at what depth does test R^2 peak?

a single tree overfits

shallow: underfits, high bias
deep: memorizes training rows, high variance
test R^2 peaks near depth 7, then collapses

shallow vs. deep, on a 2D slice

deep tree’s quilt shows memorization: adding a bedroom can lower the prediction

CV picks the depth

same recipe used for lasso \alpha, polynomial degree, k in kNN
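
One way the depth search might look in sklearn, as a hedged sketch. The data here is synthetic (not the Airbnb set), and the 5-fold setting and R^2 scoring are assumptions; only the depth range 1 to 20 and `min_samples_leaf=5` come from the slides.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data: one smooth effect, one threshold effect, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * (X[:, 1] > 5) + rng.normal(0, 0.3, size=400)

# try every depth from 1 to 20 and score each with 5-fold cross-validated R^2
search = GridSearchCV(
    DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
    param_grid={"max_depth": list(range(1, 21))},
    cv=5,
    scoring="r2",
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print(best_depth, round(search.best_score_, 3))
```

The mean CV score per depth is available in `search.cv_results_["mean_test_score"]` for plotting against depth.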

from regression to classification: fill in the blanks

the regression recipe: split on SSE, predict the mean \bar y.

for classification, two things change:

splitting criterion: SSE → ?
leaf prediction: mean of y → ?

a host with a 20-bedroom Manhattan apartment wants a price.

both models were fit on Airbnb listings, prices in $10–$500/night.

linear regression (simplified, Manhattan, entire home): \widehat{\text{price}} = \$80 \;+\; \$30 \cdot \text{bedrooms}

depth-3 tree (Manhattan, entire home, bedrooms branch):

bedrooms \leq 1.5 → \$135
1.5 < bedrooms \leq 2.5 → \$190
bedrooms > 2.5 → \$270

1. what does each model predict for 20 bedrooms?

2. which would you trust, and why might both be wrong?

DISCUSSION: think-pair-share (5 min). 1 min think + 2 min pair + 2 min debrief.

Numerical answers (push for these first):

Linear: 80 + 30 \cdot 20 = \$680 — the slope keeps going.
Tree: the bedrooms > 2.5 leaf, $270 — the prediction flatlines because the tree was never trained on anything bigger.
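
The two answers can be checked with a few lines of Python. The functions below simply transcribe the simplified linear formula and the three-leaf bedrooms branch from the slide; the names `linear_price` and `tree_price` are illustrative.

```python
def linear_price(bedrooms):
    """Simplified linear model from the slide: $80 + $30 per bedroom."""
    return 80 + 30 * bedrooms


def tree_price(bedrooms):
    """Bedrooms branch of the depth-3 tree from the slide."""
    if bedrooms <= 1.5:
        return 135
    if bedrooms <= 2.5:
        return 190
    return 270  # every input above 2.5 lands in this boundary leaf


for b in [1, 2, 3, 5, 20]:
    print(b, linear_price(b), tree_price(b))
```

The tree returns the same value for 3, 5, and 20 bedrooms, while the linear prediction keeps growing with the slope.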

Pedagogical points:

Linear extrapolates. It extends the fitted slope past the training range. If the true relationship really is linear out to 20 bedrooms, the extrapolation is roughly right; if it bends (luxury surcharges, capacity caps), it’s wrong in a smooth direction.
Trees flatline. A tree’s prediction is the mean training y inside one leaf. For inputs past the training range, the input routes into the boundary leaf — whose mean was calibrated on 3-, 4-, 5-bedroom listings. The tree is guaranteed to be wrong out there, by construction.
Both are wrong here. Training data was capped at $500; a real Manhattan 20-bedroom listing might be $5,000+. Neither model has any signal that far out.

Connects forward to the “when to doubt the forest” section: distribution shift is a forest’s weakness, not a linear-model strength.

If students get stuck on (2), prompt: “where does the training data end? what happens past that point in each model?”

average overfit trees

the random forest, briefly

random forest

ensemble of deep trees, each trained on a bootstrap sample of rows, considering only a random subset of features at each split. predictions are averaged (regression) or voted (classification).

bagging: each tree sees different rows
feature subsampling: each split considers different features
trees memorize different noise → averaging shrinks variance

bootstrap samples in pictures

each tree sees a different subset of rows
some rows appear multiple times, some not at all
skipped rows (\approx 37%) → out-of-bag (OOB) free validation
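
Where the 37% comes from: a row is skipped by one draw with probability 1 - 1/n, so it is skipped by all n draws with probability (1 - 1/n)^n, which approaches 1/e \approx 0.368. A quick simulation sketch follows; the helper names and the sample size are assumptions.

```python
import random


def expected_oob(n):
    """Probability that a given row is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n


def simulated_oob(n, seed=0):
    """Draw one bootstrap sample of size n and return the fraction of rows left out."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


n = 10_000
theory = expected_oob(n)
simulated = simulated_oob(n)
print(round(theory, 4), round(simulated, 4))
```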

Galton, Vox Populi (1907)

787 villagers at a country fair, 1906
each guesses the weight of an ox
guesses wildly varied: under and over by hundreds of pounds
median guess: 1207 lb
true weight: 1198 lb
median within 1% of truth

“the middlemost estimate expresses the vox populi … an excellent democratic judgment.”

Galton, “Vox Populi”, Nature 75, 450–451 (1907)

can the same idea work on overfit trees?

bootstrapped trees vs. their average

thin blue: 25 deep trees, each fit on its own bootstrap sample
thick black: the average prediction
where individual trees disagree, disagreements partly cancel

variance shrinks with B

each 4\times more trees → half the spread
curves bend toward a floor

why feature subsampling matters

variance of the average of B trees, with per-tree variance \sigma^2 and pairwise correlation \rho:

\text{Var}(\bar f_B) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2

the (1-\rho)\sigma^2/B term shrinks like 1/B: what averaging buys
the \rho\sigma^2 floor doesn’t depend on B: only on how correlated the trees are
feature subsampling lowers \rho: beats the floor down
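
Plugging numbers into the formula makes the floor visible. This is only the formula above evaluated in Python; the function name and the example values of \rho and B are assumptions.

```python
def var_of_average(rho, sigma2, B):
    """Variance of the average of B trees: rho*sigma^2 + (1 - rho)*sigma^2 / B."""
    return rho * sigma2 + (1 - rho) * sigma2 / B


sigma2 = 1.0
for rho in [0.0, 0.3, 0.6]:
    row = [round(var_of_average(rho, sigma2, B), 4) for B in [1, 25, 100, 500]]
    print(rho, row)

# with uncorrelated trees, 4x more trees halves the standard deviation
ratio = (var_of_average(0.0, sigma2, 25) / var_of_average(0.0, sigma2, 100)) ** 0.5
print(ratio)
```

With \rho = 0.6 the variance barely moves between B = 100 and B = 500, because the \rho\sigma^2 floor dominates.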

predict: more trees, more overfit?

we sweep the forest from B = 1 to B = 500 trees on Airbnb.

does test R^2 keep climbing forever, hit a ceiling, or U-curve?

raise hands.

more trees rarely hurt

no U-curve in n_estimators: test R^2 climbs like a staircase and then levels off, with no bias-variance tradeoff in B
depth is the bias-variance lever, B is variance-only
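
A sketch of how such a sweep might be run in sklearn. The data is synthetic (not the Airbnb set) and the sweep stops at 200 trees to stay quick; `warm_start=True` is used so each step adds trees to the existing forest instead of refitting from scratch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = np.sin(X[:, 0]) + 0.5 * (X[:, 1] > 5) + rng.normal(0, 0.3, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=1, warm_start=True, random_state=0)
scores = {}
for B in [1, 5, 25, 100, 200]:
    forest.set_params(n_estimators=B)
    forest.fit(X_tr, y_tr)  # keeps the trees already grown, adds the rest
    scores[B] = forest.score(X_te, y_te)  # held-out R^2

for B, r2 in scores.items():
    print(B, round(r2, 3))
```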

the surprise

test R^2 comparison

model	test R^2
linear (bedrooms + bath + room + borough)	0.571
linear + bedrooms × borough interactions	0.576
decision tree (depth 4)	0.559
random forest (100 trees)	0.657

forest improves R^2 by 0.08 over hand-engineered linear, with no manual feature engineering

the forest improves R^2 by 0.08 over the linear model.

name three reasons you might still ship the linear model.

DISCUSSION: think-pair-share (4 min). 1 min think + 2 min pair + 1 min debrief.

Tight-prompt variant of “are linear models useless?” — forces students to commit to a number (three) and defend a counterintuitive call (ship the worse-on-R² model).

Target answers — collect from class:

interpretable coefficients. “1 extra bathroom is worth +$62/night, controlling for everything else.” A regulator or a host can read it.
odds ratios for logistic. Log-odds per feature change → multiplicative odds — the host knows what to push.
small datasets. Forests need lots of rows; LR works at n = 100.
stable extrapolation. Linear models extrapolate (cleanly or otherwise); a forest’s prediction is bounded by the training y range, so for inputs outside the training distribution it returns whichever leaf the splits route them to — often a poor proxy.
sparsity / lasso. Linear models can produce a small, selectable set of features; forests use everything.
causal-flavored interpretation under strong assumptions. Coefficient = marginal effect holding other things fixed (Ch 18).

Push for at least three. Wrap with: forests dominate prediction on tabular benchmarks; linear models dominate when you also need interpretability, small data, or extrapolation.

geometry of the data picks the model

trees vs. linear, two synthetic problems

what wins where

linear models

good at smooth, monotone effects

bad at corners and thresholds

a single slope, not a staircase

trees / forests

good at corners, thresholds, interactions

bad at smooth diagonals

a slope becomes a staircase

fit both. compare held-out scores. don’t guess the geometry.

trees handle missing data

missing values in trees

modern tree libraries (sklearn ≥1.3, XGBoost, LightGBM) learn at each split which child to send missing-valued rows to, picking the direction that reduces the loss more. no imputation needed.

	linear regression	decision tree
missing values	breaks the model; must impute first	handled natively

real tabular data is mostly missing somewhere. trees skip the imputation question entirely.

which model would you reach for first, and why?

insurance underwriter

needs one-line explanation for each premium decision

music app

wants the model to discover which combinations of behavior predict upgrades. no hand-coded rules

clinic chart coding

flag billing errors

40 fields, most patients missing 5–10

DISCUSSION: think-pair-share (5 min). 1 min think + 2 min pair + 2 min debrief.

Target answers:

Insurance underwriter → logistic regression. Coefficients give the per-feature explanation a regulator wants. Trees can be visualized but lose interpretability quickly past depth 3.
Music app → tree / forest. The hidden rule is a threshold-and-interaction pattern (genre count AND playlist count, only on some devices). A tree can discover that rule from raw signals; a linear model needs to be hand-fed the cross.
Clinic charts → tree / forest. 40 features with massive missingness — trees handle missing data natively (XGBoost, LightGBM, recent sklearn). Linear regression breaks without a number to multiply.

Key insight: the right model is a property of the data + decision. Interpretability requirements, missingness, sample size, and the geometry of the truth all matter.

If running short, cut to two scenarios.

what the forest used vs. what raises the price

feature importance: MDI

mean decrease in impurity (MDI)

impurity drop at a split = parent’s impurity − weighted average of children’s impurity (impurity = SSE for regression, Gini for classification)

MDI for feature j = sum of impurity drops across every split on j, weighted by samples, averaged across trees, normalized so features’ MDIs sum to 1

answers: how much does the forest use this feature?
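
In sklearn, MDI is exposed as the fitted forest's `feature_importances_` attribute. A minimal sketch on synthetic data, where the column roles (strong, weak, pure noise) are assumptions built into the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data: column 0 matters a lot, column 1 a little, column 2 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 0.5, size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

mdi = forest.feature_importances_  # normalized, so the values sum to 1
for j, value in enumerate(mdi):
    print(j, round(value, 3))
```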

what the forest uses

high MDI ≠ “this feature causes the outcome”. only “the forest splits on it”

MDI vs. permutation importance

MDI: sums impurity drops measured on the training rows; biased toward continuous, high-cardinality columns, which offer more candidate split points
permutation: measures held-out R^2 drop after shuffling a column, model-agnostic but slow
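
Permutation importance is available as `sklearn.inspection.permutation_importance`. The sketch below runs it on a held-out split of synthetic data; the data, the 10 repeats, and the default R^2 scoring are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data: column 0 matters a lot, column 1 a little, column 2 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 5 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# shuffle each column 10 times on the held-out rows; record the mean drop in R^2
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for j, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(j, round(mean, 3), round(std, 3))
```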

when to doubt the forest

three failure modes

distribution shift: predictions are bounded by the training y range; new inputs route into some leaf whose mean may be irrelevant
importance is descriptive, not causal: high MDI on accommodates doesn’t mean adding capacity raises price
correlated features share importance: bedrooms and accommodates should be interpreted as a group

Three limitations that a strong test-set R² does not address:

Distribution shift. Each prediction is the mean outcome in a training leaf, so the forest cannot extrapolate beyond its training data. For inputs unlike anything seen at training — new neighborhood, post-pandemic price regime — there is no nearby leaf to draw from. Same lesson as Ch 6.
Importance ≠ causal. Both MDI and permutation answer “what does the model use?” — not “what makes the price go up?”. Do not act on importance as a causal claim.
Correlated features. When two features carry overlapping signal, the per-feature importance understates each one’s real importance — the forest leans on the other when one is shuffled.

Next two slides drill into (1) and (3) with concrete demonstrations.

About 1 min for the overview; drill-ins are short.

distribution shift: linear vs. forest extrapolation

linear extends its slope past the training range
forest flatlines: predicts the nearest leaf’s mean
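
A small demonstration of the contrast, on synthetic listings with 1 to 5 bedrooms. The price formula reuses the simplified slope from earlier in the lecture; the noise level and sample size are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# synthetic training data: 1 to 5 bedrooms only
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 6, size=300).reshape(-1, 1).astype(float)
price = 80 + 30 * bedrooms[:, 0] + rng.normal(0, 10, size=300)

linear = LinearRegression().fit(bedrooms, price)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(bedrooms, price)

far = np.array([[20.0]])   # far outside the training range
edge = np.array([[5.0]])   # largest value seen in training

linear_pred = linear.predict(far)[0]
forest_pred = forest.predict(far)[0]
forest_edge = forest.predict(edge)[0]
print(round(linear_pred), round(forest_pred), round(forest_edge))
```

The forest gives 20 bedrooms the same prediction as 5 bedrooms, and that value stays inside the range of training prices; the linear model follows its slope well past it.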

correlated features hide importance

permutation drop in test R²

shuffle a feature’s column; measure how much R² drops

bedrooms and accommodates: correlation 0.64

shuffle	R^2 drop
`bedrooms` alone	0.10
`accommodates` alone	0.13
both together	0.29

with near-duplicates, each alone \approx 0; the group \gg the sum
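
A hand-rolled sketch of shuffling columns alone versus as a group. The data is synthetic, with column `b` built as a near-duplicate of column `a`; the helper `r2_drop` is illustrative and uses a single shuffle rather than averaging over repeats. The exact drops depend on the data, so treat the printed numbers as one example, not as a reproduction of the table above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800
a = rng.normal(size=n)
b = a + rng.normal(0, 0.1, size=n)   # near-duplicate of a
c = rng.normal(size=n)               # independent feature
X = np.column_stack([a, b, c])
y = 2 * a + c + rng.normal(0, 0.3, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = forest.score(X_te, y_te)


def r2_drop(columns, seed=0):
    """Shuffle the given held-out columns together; return the drop in R^2."""
    order = np.random.default_rng(seed).permutation(len(X_te))
    X_shuffled = X_te.copy()
    X_shuffled[:, columns] = X_te[order][:, columns]
    return baseline - forest.score(X_shuffled, y_te)


drop_a = r2_drop([0])
drop_b = r2_drop([1])
drop_both = r2_drop([0, 1])
print(round(drop_a, 3), round(drop_b, 3), round(drop_both, 3))
```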

back to the host

Friday’s brief.

ship the forest: improves R^2 by 0.08 over hand-engineered linear, no hand-picked interactions
monitor distribution shift: drift on inputs; e.g., new neighborhoods, new price regimes
don’t present importance as causal: what the model used, not what raises a price

summary

trees split automatically: find structure, no encoding, no imputation needed
a single deep tree memorizes: train R^2 \to 1, test R^2 collapses, CV picks the depth
random forests average overfit trees: bagging + feature subsampling, more trees rarely hurt
trees vs. linear is a geometry choice: corners + thresholds vs. smooth slopes
importance is descriptive: what the forest used, not what causes the outcome

demo: trees in the notebook

colab.research.google.com/…/lec13-trees.ipynb

what to watch:

max_depth sweep: see the U-curve happen
n_estimators sweep: see the staircase
swap regression for classification: same recipe, AUC (area under ROC curve, Ch 9) instead of R^2

next: dimensionality reduction (Ch 14)

what if your features are already redundant?

PCA: find directions of maximum variance
scree plot picks the number of components
standardization is essential. otherwise the largest-magnitude column dominates

feedback

forms.gle/feedback

what worked? what didn’t? what’s still confusing?