MSE 125 — Applied Statistics
Monday, May 11, 2026
goal: develop a pricing tool to help hosts set competitive nightly rates
ch 5: a reasonable R^2 — but you picked the interactions, polynomial degree, and neighborhood encodings by hand
keep hand-engineering, or use a model that splits on its own?

| | logistic regression | decision tree |
|---|---|---|
| boundary | one hyperplane | many axis-aligned cuts |
| feature engineering | manual (Ch 5) | automatic |
| categories, missing | encode + impute | native |
| output | probability via \sigma(\beta^\top x) | mean of leaf |
both follow the same recipe: fit, score, CV, predict.
how a tree splits
each leaf reports a predicted price + sample count
decision tree
recursive if/then splits on one feature at a time
max_depth — sklearn knob that limits depth
on the more-expensive side, the tree splits on bedrooms ≤ 1.5.
why did the algorithm pick bedrooms here, instead of bathrooms or accommodates?
at each node, search every (j, s) — feature j, threshold s:
\min_{j,\, s} \left[ \sum_{x_i \in R_1(j,s)} (y_i - \bar{y}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \bar{y}_2)^2 \right]
(sum of squared errors, SSE)
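the (j, s) search above can be sketched in a few lines of NumPy. this is a minimal illustration, not sklearn's implementation: the helper names `sse` and `best_split`, the midpoint thresholds, and the six toy listings are all assumptions made for the example.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the region mean; 0 for an empty region."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Exhaustive search over (feature j, threshold s) minimizing SSE(R1) + SSE(R2)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # candidate thresholds: midpoints between consecutive distinct values
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= s
            total = sse(y[left]) + sse(y[~left])
            if total < best[2]:
                best = (j, float(s), total)
    return best

# toy listings (made up): columns are [bedrooms, bathrooms]
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [3, 2]], dtype=float)
y = np.array([100, 110, 200, 210, 220, 230], dtype=float)

j, s, total = best_split(X, y)
print(j, s, total)  # feature 0 (bedrooms), threshold 1.5, SSE 550.0
```

in this toy set the price jump sits between 1 and 2 bedrooms, so bedrooms ≤ 1.5 leaves the smallest total SSE and wins over every bathrooms threshold.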
Gini impurity
\text{Gini} = 1 - \sum_{k=1}^K \hat p_k^2
binary case: 2\hat p (1 - \hat p)
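a small sketch of the Gini formula, assuming class labels arrive as a NumPy array. the function name `gini` and the three example nodes are illustrative.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2, computed from an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

pure = gini(np.array([1, 1, 1, 1]))    # one class only
mixed = gini(np.array([0, 1, 0, 1]))   # 50/50 split
skewed = gini(np.array([0, 0, 0, 1]))  # p_hat = 0.25

print(pure, mixed, skewed)  # 0.0 0.5 0.375
```

the skewed node agrees with the binary shortcut: 2 × 0.25 × 0.75 = 0.375.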
deep enough to memorize, shallow enough to generalize
we’re about to fit a single tree at depths 1–20 on Airbnb, with min_samples_leaf=5.
sketch what train R^2 and test R^2 look like as depth grows.
at what depth does test R^2 peak?
deep tree’s quilt shows memorization — adding a bedroom can lower the prediction
same recipe used for lasso \alpha, polynomial degree, k in kNN
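a sketch of the depth sweep, assuming synthetic data in place of the Airbnb listings (the data-generating formula, sample size, and seed are made up). for brevity it picks the depth on a single held-out split. cross-validation on the training data would be the more careful version of the same recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for the listings: a staircase in feature 0, a slope in feature 1
rng = np.random.default_rng(0)
n = 600
X = rng.uniform(0, 10, size=(n, 3))
y = 50 + 30 * np.floor(X[:, 0] / 2) + 10 * X[:, 1] + rng.normal(0, 40, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = []
for depth in range(1, 21):
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=5, random_state=0)
    tree.fit(X_train, y_train)
    results.append((depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))

for depth, train_r2, test_r2 in results:
    print(f"depth={depth:2d}  train R^2={train_r2:.3f}  test R^2={test_r2:.3f}")

best_depth = max(results, key=lambda r: r[2])[0]
print("depth with the highest test R^2:", best_depth)
```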
the regression recipe: split on SSE, predict the mean \bar y.
for classification, two things change:
a host with a 20-bedroom Manhattan apartment wants a price.
both models were fit on Airbnb listings, prices in $10–$500/night.
linear regression (simplified, Manhattan, entire home): \widehat{\text{price}} = \$80 \;+\; \$30 \cdot \text{bedrooms}
depth-3 tree (Manhattan, entire home, bedrooms branch):
1. what does each model predict for 20 bedrooms?
2. which would you trust — and why might both be wrong?
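one way to check question 1 numerically. the linear formula is the one from the slide. the tree is fit on hypothetical listings with 0 to 5 bedrooms (an assumption, since the slide's tree is not reproduced here), so only its qualitative behavior carries over.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def linear_price(bedrooms):
    """Simplified linear model from the slide: $80 + $30 per bedroom."""
    return 80 + 30 * bedrooms

# hypothetical training data: 0-5 bedrooms, prices clipped to the $10-$500 range
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 6, size=300).astype(float)
price = np.clip(linear_price(bedrooms) + rng.normal(0, 25, size=300), 10, 500)

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(bedrooms.reshape(-1, 1), price)

linear_at_20 = linear_price(20)         # 80 + 30 * 20 = 680
tree_at_20 = tree.predict([[20.0]])[0]  # falls into the right-most leaf
tree_at_5 = tree.predict([[5.0]])[0]    # same leaf as 20 bedrooms

print(linear_at_20, tree_at_20, tree_at_5)
```

the line keeps climbing past anything in the training data, while the tree returns the same leaf mean for 5 bedrooms and for 20. neither model has seen a listing like this one.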
average overfit trees
random forest
ensemble of deep trees, each trained on a bootstrap sample of rows, considering only a random subset of features at each split. predictions are averaged (regression) or voted (classification).
“the middlemost estimate expresses the vox populi … an excellent democratic judgment.”
Galton, “Vox Populi”, Nature 75, 450–451 (1907)
can the same idea work on overfit trees? — see next slide
variance of the average of B trees, with per-tree variance \sigma^2 and pairwise correlation \rho:
\text{Var}(\bar f_B) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2
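plugging numbers into the formula shows where the floor comes from. the values of \rho, \sigma^2, and B below are illustrative assumptions, not estimates from the Airbnb forest.

```python
def var_of_average(rho, sigma2, B):
    """Variance of the mean of B trees with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / B * sigma2

sigma2 = 1.0
for rho in (0.0, 0.3, 0.9):
    row = [round(var_of_average(rho, sigma2, B), 4) for B in (1, 10, 100, 500)]
    print(f"rho={rho}: {row}")
# rho=0.0: [1.0, 0.1, 0.01, 0.002]
# rho=0.3: [1.0, 0.37, 0.307, 0.3014]
# rho=0.9: [1.0, 0.91, 0.901, 0.9002]
```

the second term shrinks like 1/B, but the first term \rho \sigma^2 does not depend on B. more trees only help down to that floor, which is why decorrelating the trees (bootstrap rows, random feature subsets) matters.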
we sweep the forest from B = 1 to B = 500 trees on Airbnb.
does test R^2 keep climbing forever, hit a ceiling, or U-curve?
raise hands.
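a sketch of the n_estimators sweep, again on synthetic data standing in for the Airbnb listings (data-generating formula, tree counts, and seed are assumptions).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-in for the listings
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 4))
y = 50 + 30 * np.floor(X[:, 0] / 2) + 10 * X[:, 1] + rng.normal(0, 40, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for B in (1, 5, 25, 100, 500):
    forest = RandomForestRegressor(n_estimators=B, random_state=0)
    forest.fit(X_train, y_train)
    scores[B] = forest.score(X_test, y_test)
    print(f"B={B:3d}  test R^2={scores[B]:.3f}")
```

compare the printed column against your raised-hand guess: look at how much the score moves between B = 1 and B = 25 versus between B = 100 and B = 500.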
n_estimators — staircase replaces the bias-variance tradeoff
test R^2 comparison
| model | test R^2 |
|---|---|
| linear (bedrooms + bath + room + borough) | 0.571 |
| linear + bedrooms × borough interactions | 0.576 |
| decision tree (depth 4) | 0.559 |
| random forest (100 trees) | 0.657 |
forest improves R^2 by 0.08 over hand-engineered linear — with no manual feature engineering
the forest improves R^2 by 0.08 over the linear model.
name three reasons you might still ship the linear model.
geometry of the data picks the model
linear models
good at smooth, monotone effects
bad at corners and thresholds
a single slope, not a staircase
trees / forests
good at corners, thresholds, interactions
bad at smooth diagonals
a slope becomes a staircase
fit both. compare held-out scores. don’t guess the geometry.
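"fit both" can be sketched on two synthetic targets with opposite geometry. both targets, the noise level, and the seed are assumptions chosen to make the contrast visible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 2))
noise = rng.normal(0, 1, size=600)

targets = {
    # a smooth diagonal: one slope per feature
    "smooth diagonal": 3 * X[:, 0] + 2 * X[:, 1] + noise,
    # a corner: the price jumps only when both features pass a threshold
    "threshold": 20 * ((X[:, 0] > 5) & (X[:, 1] > 5)) + noise,
}

scores = {}
for name, y in targets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    linear = LinearRegression().fit(X_tr, y_tr)
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    scores[name] = (linear.score(X_te, y_te), forest.score(X_te, y_te))
    print(f"{name:16s} linear R^2={scores[name][0]:.3f}  forest R^2={scores[name][1]:.3f}")
```

the same two estimators swap places depending on the target, which is the point of comparing held-out scores instead of guessing.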
missing values in trees
modern tree libraries (sklearn ≥1.3, XGBoost, LightGBM) learn at each split which child to send missing-valued rows to — picking the direction that reduces the loss more. no imputation needed.
| | linear regression | decision tree |
|---|---|---|
| missing values | break the model — must impute first | handled natively |
most real tabular datasets have missing values somewhere. trees skip the imputation question entirely.
which model would you reach for first, and why?
insurance underwriter
needs one-line explanation for each premium decision
music app
wants the model to discover which combinations of behavior predict upgrades — no hand-coded rules
clinic chart coding
flag billing errors
40 fields, most patients missing 5–10
what the forest used vs. what raises the price
mean decrease in impurity (MDI)
impurity drop at a split = parent’s impurity − weighted average of children’s impurity (impurity = SSE for regression, Gini for classification)
MDI for feature j = sum of impurity drops across every split on j, weighted by samples, averaged across trees, normalized so features’ MDIs sum to 1
answers: how much does the forest use this feature?
high MDI ≠ “this feature causes the outcome” — only “the forest splits on it”
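a sketch of reading MDI from a fitted forest via sklearn's `feature_importances_` attribute. the data is synthetic and built so that price depends on bedrooms only, while accommodates is a near-copy of bedrooms. column names and coefficients are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
bedrooms = rng.integers(0, 5, size=n).astype(float)
accommodates = 2 * bedrooms + rng.integers(1, 3, size=n)  # near-copy of bedrooms
noise_col = rng.normal(size=n)                            # unrelated to price
price = 80 + 30 * bedrooms + rng.normal(0, 20, size=n)    # bedrooms is the only driver

X = np.column_stack([bedrooms, accommodates, noise_col])
names = ["bedrooms", "accommodates", "noise"]

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, price)
mdi = forest.feature_importances_  # impurity-based importances, normalized to sum to 1

for name, value in zip(names, mdi):
    print(f"{name:13s} MDI={value:.3f}")
print("sum:", mdi.sum())
```

accommodates picks up importance even though it never enters the price formula: the forest can split on it as a proxy for bedrooms. that is "the forest splits on it", not "it raises the price".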
when to doubt the forest
accommodates doesn’t mean adding capacity raises price
bedrooms and accommodates: interpret as a group
permutation drop in test R²
shuffle a feature’s column; measure how much R² drops
bedrooms and accommodates: correlation 0.64
| shuffle | R^2 drop |
|---|---|
| bedrooms alone | 0.10 |
| accommodates alone | 0.13 |
| both together | 0.29 |
with near-duplicates, each alone \approx 0; the group \gg the sum
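a sketch of the shuffle test, including the grouped version. the helper `r2_drop` is hypothetical, and using one shared row permutation for a group (so the shuffled columns stay consistent with each other) is a design choice of this example. the data is synthetic, so the printed drops will not match the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800
bedrooms = rng.integers(0, 5, size=n).astype(float)
accommodates = 2 * bedrooms + rng.integers(1, 3, size=n)  # near-duplicate of bedrooms
other = rng.normal(size=n)
price = 80 + 30 * bedrooms + 10 * other + rng.normal(0, 10, size=n)
X = np.column_stack([bedrooms, accommodates, other])

X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.3, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = forest.score(X_te, y_te)

def r2_drop(columns, seed=0):
    """Shuffle the given test columns with one shared row permutation;
    return the resulting drop in test R^2."""
    perm = np.random.default_rng(seed).permutation(len(X_te))
    X_shuffled = X_te.copy()
    X_shuffled[:, columns] = X_te[perm][:, columns]
    return baseline - forest.score(X_shuffled, y_te)

drop_bed = r2_drop([0])
drop_acc = r2_drop([1])
drop_both = r2_drop([0, 1])

print(f"bedrooms alone:     {drop_bed:.3f}")
print(f"accommodates alone: {drop_acc:.3f}")
print(f"both together:      {drop_both:.3f}")
```

when one of the pair is shuffled, the forest can still lean on the other. shuffling both removes the information entirely, so compare the grouped drop against the two individual drops before ranking either feature on its own.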
Tip
Friday’s brief.
what to watch:
max_depth sweep — see the U-curve happen
n_estimators sweep — see the staircase
what if your features are already redundant?
what worked? what didn’t? what’s still confusing?