Lecture 13: Decision Trees and Random Forests

MSE 125 — Applied Statistics

Madeleine Udell

Monday, May 11, 2026

logistics

  • HW 3 review sessions this week
  • Quiz 6 Wednesday May 13 — regression inference + trees
  • project midterm report due Friday May 15

the brief

Airbnb host, NYC

goal: develop a pricing tool to help hosts set competitive nightly rates

ch 5: a reasonable R^2 — but you picked the interactions, polynomial degree, and neighborhood encodings by hand

keep hand-engineering, or use a model that splits on its own?

today

  • decision trees — splits found automatically, no encoding needed
  • a single tree overfits — and CV picks the depth
  • random forests — average overfit trees, watch variance collapse
  • trees vs. linear — geometry of the data picks the model
  • feature importance — what the forest uses, what to doubt

bridge from logistic regression

| | logistic regression | decision tree |
|---|---|---|
| boundary | one hyperplane | many axis-aligned cuts |
| feature engineering | manual (Ch 5) | automatic |
| categories, missing | encode + impute | native |
| output | probability via \sigma(\beta^\top x) | mean of leaf |

both follow the same recipe: fit, score, CV, predict.

how a tree splits

decision tree, in pictures

each leaf reports a predicted price + sample count

tree vocabulary

decision tree

recursive if/then splits on one feature at a time

  • leaf — node with no children; carries a prediction
  • internal node — any non-leaf node; asks a yes/no question, routes to a child
  • root — the single internal node at the top
  • depth — longest root-to-leaf path
  • max_depth — sklearn knob that limits depth

look at the second split

on the more-expensive side, the tree splits on bedrooms ≤ 1.5.

why did the algorithm pick bedrooms here, instead of bathrooms or accommodates?

how a split is chosen

at each node, search every (j, s) — feature j, threshold s:

\min_{j,\, s} \left[ \sum_{x_i \in R_1(j,s)} (y_i - \bar{y}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \bar{y}_2)^2 \right]

(sum of squared errors, SSE)

  • each child predicts the mean \bar{y} of its training rows
  • algorithm enumerates every (feature, threshold) pair
  • pick the split that drops total SSE most
  • recurse inside each child
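The search above can be sketched in a few lines of plain Python. This is a minimal illustration, not sklearn's implementation: the helper names `sse` and `best_split` and the toy listings are made up, and candidate thresholds are assumed to be midpoints between consecutive distinct values (which is why cuts like bedrooms ≤ 1.5 show up).

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Exhaustively search (feature j, threshold s) for the lowest total SSE.

    X is a list of rows (lists of numbers); y is a list of targets.
    Returns (j, s, total_sse), or None if no split is possible.
    """
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: midpoints between consecutive distinct values.
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            total = sse(left) + sse(right)  # each child predicts its own mean
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data (made up): columns are [bedrooms, bathrooms]; y is nightly price.
X = [[1, 1], [1, 2], [2, 1], [3, 2], [3, 1], [4, 2]]
y = [100, 110, 150, 260, 250, 300]
j, s, total = best_split(X, y)
print(f"split on feature {j} at threshold {s}, total SSE = {total:.1f}")
```

On this toy data the search picks bedrooms (feature 0) at threshold 2.5. A full tree would call `best_split` again inside each child.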

greedy partitioning, step by step

  • greedy — best split now, never reconsidered
  • axis-aligned — every cut parallel to an axis; a diagonal becomes a staircase

classification: same recipe, swap the loss

Gini impurity

\text{Gini} = 1 - \sum_{k=1}^K \hat p_k^2

binary case: 2\hat p (1 - \hat p)

  • pure node: Gini = 0
  • 50/50 binary node: Gini = 0.5
  • everything else — depth, CV, evaluation — is identical
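A quick check of the Gini numbers above, as a small sketch in plain Python (the function name `gini` is just illustrative):

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini(["yes"] * 4)                 # one class only
fifty_fifty = gini(["yes", "no"] * 2)    # balanced binary node
skewed = gini([1] * 3 + [0] * 7)         # binary node with p = 0.3
print(pure, fifty_fifty, round(skewed, 2))
```

The skewed node should agree with the binary shortcut 2p(1 - p) = 2 · 0.3 · 0.7 = 0.42.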

deep enough to memorize, shallow enough to generalize

do deep trees overfit?

we’re about to fit a single tree at depths 1–20 on Airbnb, with min_samples_leaf=5.

sketch what train R^2 and test R^2 look like as depth grows.

at what depth does test R^2 peak?

a single tree overfits

  • shallow — underfits, high bias
  • deep — memorizes training rows, high variance
  • test R^2 peaks near depth 7, then collapses

shallow vs. deep, on a 2D slice

deep tree’s quilt shows memorization — adding a bedroom can lower the prediction

CV picks the depth

same recipe used for lasso \alpha, polynomial degree, k in kNN
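One way this could look in sklearn, as a hedged sketch: the data below is synthetic (not the Airbnb set), and the grid of depths 1–20 with min_samples_leaf=5 simply mirrors the sweep described earlier. `GridSearchCV` and `DecisionTreeRegressor` are real sklearn classes; everything else is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for listings: a stepwise signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(400, 2))
y = 80 + 30 * np.floor(X[:, 0]) + 20 * (X[:, 1] > 2.5) + rng.normal(0, 15, 400)

# 5-fold CV over max_depth; each depth is scored by mean held-out R^2.
search = GridSearchCV(
    DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
    param_grid={"max_depth": list(range(1, 21))},
    cv=5,
    scoring="r2",
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print("best max_depth:", best_depth)
print("mean CV R^2 at best depth:", round(search.best_score_, 3))
```

The depth that wins depends on the data; the point is only that the held-out score, not the training score, makes the choice.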

from regression to classification: fill in the blanks

the regression recipe: split on SSE, predict the mean \bar y.

for classification, two things change:

  1. splitting criterion: SSE → ?
  2. leaf prediction: mean of y → ?

a host with a 20-bedroom Manhattan apartment wants a price.

both models were fit on Airbnb listings, prices in $10–$500/night.

linear regression (simplified, Manhattan, entire home): \widehat{\text{price}} = \$80 \;+\; \$30 \cdot \text{bedrooms}

depth-3 tree (Manhattan, entire home, bedrooms branch):

  • bedrooms \leq 1.5 → \$135
  • 1.5 < bedrooms \leq 2.5 → \$190
  • bedrooms > 2.5 → \$270

1. what does each model predict for 20 bedrooms?

2. which would you trust — and why might both be wrong?

average overfit trees

the random forest, briefly

random forest

ensemble of deep trees, each trained on a bootstrap sample of rows, considering only a random subset of features at each split. predictions are averaged (regression) or voted (classification).

  • bagging — each tree sees different rows
  • feature subsampling — each split considers different features
  • trees memorize different noise → averaging shrinks variance
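The definition above can be assembled by hand from single trees. This is a sketch on synthetic data, not how `RandomForestRegressor` is implemented internally: it draws bootstrap rows with NumPy, lets each sklearn `DecisionTreeRegressor` consider a random half of the columns per split via `max_features=0.5` (an arbitrary choice here), and averages the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): 4 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 4))
y = 80 + 30 * X[:, 0] + 40 * (X[:, 1] > 2.5) + rng.normal(0, 15, 300)

trees = []
for b in range(25):
    # Bagging: sample row indices with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # Feature subsampling: each split looks at a random half of the columns.
    tree = DecisionTreeRegressor(max_features=0.5, random_state=b)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Two hypothetical new rows: one "small", one "large" on the driving features.
X_new = np.array([[1.0, 1.0, 2.0, 2.0], [4.0, 4.0, 2.0, 2.0]])
per_tree = np.array([t.predict(X_new) for t in trees])  # shape (25, 2)
forest_pred = per_tree.mean(axis=0)                      # regression: average

print("spread across trees (std):", per_tree.std(axis=0).round(1))
print("averaged prediction:      ", forest_pred.round(1))
```

Individual trees disagree; the average is what the forest reports. For classification the last step would be a vote instead of a mean.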

bootstrap samples in pictures

  • each tree sees a different subset of rows
  • some rows appear multiple times, some not at all
  • skipped rows (\approx 37%) → out-of-bag (OOB) free validation
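Where the roughly 37% comes from: a row is skipped by all n draws with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368. A small simulation sketch (the function name and n = 10,000 are arbitrary choices):

```python
import random


def oob_fraction(n_rows, seed=0):
    """Draw one bootstrap sample of n_rows; return the share of rows never drawn."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(drawn) / n_rows


n = 10_000
simulated = oob_fraction(n)
limit = (1 - 1 / n) ** n
print("simulated OOB share:", round(simulated, 3))
print("(1 - 1/n)^n:        ", round(limit, 3))
```

Both numbers should land close to 0.37.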

Galton, Vox Populi (1907)

  • 787 villagers at a country fair, 1906
  • each guesses the weight of an ox
  • guesses wildly varied — under and over by hundreds of pounds
  • median guess: 1207 lb
  • true weight: 1198 lb
  • median within 1% of truth

“the middlemost estimate expresses the vox populi … an excellent democratic judgment.”

Galton, “Vox Populi”, Nature 75, 450–451 (1907)

can the same idea work on overfit trees? — see next slide

bootstrapped trees vs. their average

  • thin blue: 25 deep trees, each fit on its own bootstrap sample
  • thick black: the average prediction
  • where individual trees disagree, disagreements partly cancel

variance shrinks with B

  • each 4\times more trees → half the spread
  • curves bend toward a floor

why feature subsampling matters

variance of the average of B trees, with per-tree variance \sigma^2 and pairwise correlation \rho:

\text{Var}(\bar f_B) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2

  • the (1-\rho)\sigma^2/B term shrinks like 1/B — what averaging buys
  • the \rho\sigma^2 floor doesn’t depend on B — only on how correlated the trees are
  • feature subsampling lowers \rho — beats the floor down
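Plugging numbers into the formula makes the floor visible. A minimal sketch, with \sigma^2 = 1 and a few illustrative values of \rho (not estimated from any data):

```python
def ensemble_variance(rho, sigma2, B):
    """Variance of the average of B trees: rho*sigma2 + (1 - rho)*sigma2 / B."""
    return rho * sigma2 + (1 - rho) * sigma2 / B


sigma2 = 1.0
for rho in (0.0, 0.3, 0.8):
    row = [round(ensemble_variance(rho, sigma2, B), 3) for B in (1, 4, 16, 64, 256)]
    print(f"rho={rho}: {row}")
```

With \rho = 0, each 4× in B quarters the variance, so the standard deviation halves. With \rho > 0 the sequence flattens at \rho\sigma^2, no matter how many trees are added.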

predict: more trees, more overfit?

we sweep the forest from B = 1 to B = 500 trees on Airbnb.

does test R^2 keep climbing forever, hit a ceiling, or U-curve?

raise hands.

more trees rarely hurt

  • no U-curve in n_estimators — staircase replaces the bias-variance tradeoff
  • depth is the bias-variance lever, B is variance-only
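A small version of the sweep, sketched with sklearn's `RandomForestRegressor` on synthetic data (not the Airbnb set). The values B = 1, 10, 100 and `max_features=0.5` are illustrative assumptions; note that sklearn's regressor considers all features at each split unless `max_features` is lowered.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 4 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(600, 4))
y = 80 + 30 * X[:, 0] + 40 * (X[:, 1] > 2.5) + rng.normal(0, 15, 600)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for B in (1, 10, 100):
    forest = RandomForestRegressor(n_estimators=B, max_features=0.5, random_state=0)
    forest.fit(X_train, y_train)
    scores[B] = forest.score(X_test, y_test)  # held-out R^2
    print(f"B={B:>3}  test R^2 = {scores[B]:.3f}")
```

Expect the score to rise quickly and then level off; adding trees costs compute, not accuracy.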

the surprise

test R^2 comparison

| model | test R^2 |
|---|---|
| linear (bedrooms + bath + room + borough) | 0.571 |
| linear + bedrooms × borough interactions | 0.576 |
| decision tree (depth 4) | 0.559 |
| random forest (100 trees) | 0.657 |

forest improves R^2 by 0.08 over hand-engineered linear — with no manual feature engineering

the forest improves R^2 by 0.08 over the linear model.

name three reasons you might still ship the linear model.

geometry of the data picks the model

trees vs. linear, two synthetic problems

what wins where

linear models

good at smooth, monotone effects

bad at corners and thresholds

a single slope, not a staircase

trees / forests

good at corners, thresholds, interactions

bad at smooth diagonals

a slope becomes a staircase

fit both. compare held-out scores. don’t guess the geometry.

trees handle missing data

missing values in trees

modern tree libraries (sklearn ≥1.3, XGBoost, LightGBM) learn at each split which child to send missing-valued rows to — picking the direction that reduces the loss more. no imputation needed.

| | linear regression | decision tree |
|---|---|---|
| missing values | break the model; must impute first | handled natively |

real tabular data is mostly missing somewhere. trees skip the imputation question entirely.
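The "learned direction" idea can be sketched for a single, fixed split. This is a simplified illustration in plain Python, not any library's code: real implementations evaluate the direction together with the threshold search. The helper names and the toy listings are made up.

```python
import math


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def missing_direction(x, y, threshold):
    """For the split x <= threshold, choose which child gets the rows where x is NaN.

    Tries sending all missing rows left, then right; keeps the lower total SSE.
    """
    left = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi > threshold]
    missing = [yi for xi, yi in zip(x, y) if math.isnan(xi)]
    sse_if_left = sse(left + missing) + sse(right)
    sse_if_right = sse(left) + sse(right + missing)
    if sse_if_left <= sse_if_right:
        return "left", sse_if_left
    return "right", sse_if_right


nan = float("nan")
bedrooms = [1, 1, 2, 3, 3, nan, nan]          # two listings with bedrooms missing
price = [100, 110, 150, 260, 250, 240, 270]
direction, total = missing_direction(bedrooms, price, threshold=2.5)
print(direction, total)
```

Here the two listings with missing bedrooms are priced like the larger units, so they are routed right.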

which model would you reach for first, and why?

insurance underwriter

needs one-line explanation for each premium decision

music app

wants the model to discover which combinations of behavior predict upgrades — no hand-coded rules

clinic chart coding

flag billing errors

40 fields, most patients missing 5–10

what the forest used vs. what raises the price

feature importance: MDI

mean decrease in impurity (MDI)

impurity drop at a split = parent’s impurity − weighted average of children’s impurity (impurity = SSE for regression, Gini for classification)

MDI for feature j = sum of impurity drops across every split on j, weighted by samples, averaged across trees, normalized so features’ MDIs sum to 1

answers: how much does the forest use this feature?

what the forest uses

high MDI ≠ “this feature causes the outcome” — only “the forest splits on it”

MDI vs. permutation importance

  • MDI — counts split chances, biased toward continuous, high-cardinality columns
  • permutation — measures held-out R^2 drop after shuffling a column, model-agnostic but slow
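Both measures are available in sklearn: `feature_importances_` on a fitted forest gives MDI, and `sklearn.inspection.permutation_importance` shuffles columns on held-out data. The sketch below uses synthetic columns with made-up names (`signal`, `binary`, `noise_id`) to show the two side by side; it is not the Airbnb analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.uniform(0, 5, n)      # continuous, drives the target
binary = rng.integers(0, 2, n)     # low-cardinality, weaker driver
noise_id = rng.uniform(0, 1, n)    # continuous, unrelated to the target
X = np.column_stack([signal, binary, noise_id])
y = 80 + 30 * signal + 20 * binary + rng.normal(0, 15, n)
names = ["signal", "binary", "noise_id"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

mdi = forest.feature_importances_  # normalized to sum to 1
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for name, m, p in zip(names, mdi, perm.importances_mean):
    print(f"{name:>9}  MDI={m:.3f}  permutation R^2 drop={p:.3f}")
```

Compare how each measure treats `noise_id`: MDI is computed from training splits, while the permutation drop is measured on held-out rows.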

when to doubt the forest

three failure modes

  • distribution shift — predictions are bounded by the training y range; new inputs route into some leaf whose mean may be irrelevant
  • importance is descriptive, not causal — high MDI on accommodates doesn’t mean adding capacity raises price
  • correlated features share importance: interpret bedrooms and accommodates as a group

distribution shift: linear vs. forest extrapolation

  • linear extends its slope past the training range
  • forest flatlines — predicts the nearest leaf’s mean
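The two simplified models from the 20-bedroom exercise make the contrast concrete. The sketch below hard-codes the numbers from that slide (the \$80 + \$30 · bedrooms line and the three bedroom leaves); the function names are illustrative.

```python
def linear_price(bedrooms):
    """Simplified linear model from the slide: $80 + $30 per bedroom."""
    return 80 + 30 * bedrooms


def tree_price(bedrooms):
    """Bedrooms branch of the depth-3 tree from the slide."""
    if bedrooms <= 1.5:
        return 135
    if bedrooms <= 2.5:
        return 190
    return 270  # every value above 2.5 lands in the same leaf


for b in (1, 2, 3, 20):
    print(f"{b:>2} bedrooms: linear ${linear_price(b)}, tree ${tree_price(b)}")
```

At 20 bedrooms the line predicts \$680, above the \$10–\$500 training range, while the tree stays at \$270, the mean of its last leaf. Neither number is backed by any 20-bedroom training rows.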

correlated features hide importance

permutation drop in test R^2

shuffle a feature’s column; measure how much R^2 drops

bedrooms and accommodates: correlation 0.64

| shuffle | R^2 drop |
|---|---|
| bedrooms alone | 0.10 |
| accommodates alone | 0.13 |
| both together | 0.29 |

with near-duplicates, each alone \approx 0; the group \gg the sum

back to the host

Friday’s brief.

  • ship the forest — improves R^2 by 0.08 over hand-engineered linear, no hand-picked interactions
  • monitor distribution shift — drift on inputs; e.g., new neighborhoods, new price regimes
  • don’t present importance as causal: it shows what the model used, not what raises a price

summary

  • trees split automatically — find structure, no encoding, no imputation needed
  • a single deep tree memorizes — train R^2 \to 1, test R^2 collapses, CV picks the depth
  • random forests average overfit trees — bagging + feature subsampling, more trees rarely hurt
  • trees vs. linear is a geometry choice — corners + thresholds vs. smooth slopes
  • importance is descriptive — what the forest used, not what causes the outcome

demo: trees in the notebook

colab.research.google.com/…/lec13-trees.ipynb

what to watch:

  • max_depth sweep — see the U-curve happen
  • n_estimators sweep — see the staircase
  • swap regression for classification — same recipe, AUC (area under ROC curve, Ch 9) instead of R^2

next: dimensionality reduction (Ch 14)

what if your features are already redundant?

  • PCA — find directions of maximum variance
  • scree plot picks the number of components
  • standardization is essential — otherwise the largest-magnitude column dominates

feedback

what worked? what didn’t? what’s still confusing?