MSE 125 — Slides – Lecture 5: Multiple Regression and Feature Engineering

how much is a bathroom worth?

a host has a question

an Airbnb host is adding a second bathroom

how much more should they charge per night?

chapter 4’s model doesn’t know bathrooms exist

today

multiple regression
one-hot encoding
feature engineering
diagnostics

multiple regression with numeric features

adding bathrooms

chapter 4: $\widehat{\text{price}} = 57 + 66 \times \text{bedrooms}$

add bathrooms:

\[\widehat{\text{price}} = 27 + 58 \times \text{bedrooms} + 34 \times \text{bathrooms}\]

a 2BR / 1BA listing: $27 + 58(2) + 34(1) = \$177$

a 2BR / 2BA listing: $27 + 58(2) + 34(2) = \$211$

the feature matrix

each row of $X$ = one listing
each column = one feature (plus a column of ones)
$\widehat{y} = X\beta$ computes all predictions at once
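
As a small sketch of that last bullet, the two listings from the previous slide can be stacked into one matrix and scored with a single product. The coefficients are the rounded values from the slide; the variable names are illustrative.

```python
import numpy as np

# Rounded coefficients from the slide: intercept, bedrooms, bathrooms.
beta = np.array([27.0, 58.0, 34.0])

# One row per listing: [1 (intercept), bedrooms, bathrooms].
X = np.array([
    [1.0, 2.0, 1.0],  # 2BR / 1BA
    [1.0, 2.0, 2.0],  # 2BR / 2BA
])

y_hat = X @ beta  # all predictions in one matrix-vector product
print(y_hat)
```

The two entries reproduce the $177 and $211 from the worked examples.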

the span grows with more features

\[\text{span}(X) = \{X\beta : \beta \in \mathbb{R}^p\}\]

the set of all possible predictions · the span of the columns of $X$

1 feature + intercept: 2D plane in $\mathbb{R}^n$
2 features + intercept: 3D subspace
more columns → bigger span → closer projection

training R² can only go up

adding a feature never decreases training $R^2$

new column → span grows (or stays)
closer projection → higher $R^2$
$R^2_{\text{train}}$ is monotonic in # features

danger: $R^2$ rewards you for adding noise
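
A minimal sketch of that danger, using synthetic data rather than the Airbnb listings: fit once with a real feature, then again after appending a column of pure noise, and compare training $R^2$. The helper `train_r2` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)   # synthetic data, not the real listings

def train_r2(X, y):
    """Training R^2 of the least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

X_small = np.column_stack([np.ones(n), x])
noise = rng.normal(size=n)                # pure noise, unrelated to y
X_big = np.column_stack([X_small, noise])

r2_small = train_r2(X_small, y)
r2_big = train_r2(X_big, y)
print(r2_small, r2_big)                   # the second value is never smaller
```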

the normal equations

residual orthogonal to every feature column:

\[X^T \epsilon = 0 \quad \text{where } \epsilon = y - X\beta\]

rearrange:

\[X^T X \beta = X^T y\]

\[\widehat{\beta} = (X^T X)^{-1} X^T y\]

closed-form solution for the least-squares coefficients

normal equations: code

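import numpy as np

# n, bedrooms, bathrooms, y: arrays built from the listings data
X = np.column_stack([np.ones(n), bedrooms, bathrooms])  # intercept column first
beta = np.linalg.solve(X.T @ X, X.T @ y)                # solve X^T X beta = X^T y
print(beta)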

[26.66 58.42 34.38]

np.linalg.solve(A, b): don’t invert; factor and solve
matches LinearRegression().fit(...) up to floating-point error
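
To check the orthogonality claim numerically, here is a sketch on synthetic data (the real listings are not reproduced here, and the generating coefficients are just the rounded slide values). It solves the normal equations, then confirms that $X^T \epsilon$ vanishes and that a standard least-squares routine gives the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
bedrooms = rng.integers(0, 5, size=n).astype(float)
bathrooms = rng.integers(1, 4, size=n).astype(float)
# Synthetic prices; the noise level is an arbitrary assumption.
y = 27 + 58 * bedrooms + 34 * bathrooms + rng.normal(scale=20, size=n)

X = np.column_stack([np.ones(n), bedrooms, bathrooms])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

print(X.T @ resid)  # each entry is zero up to floating-point error

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))
```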

what does the coefficient mean?

\[\widehat{\text{price}} = 27 + \mathbf{58} \times \text{bedrooms} + 34 \times \text{bathrooms}\]

coefficient = association holding bathrooms constant

simple vs multiple

model	bedrooms coef
bedrooms only	$66
+ bathrooms	$58

bathrooms coefficient: $34

the bedrooms coefficient dropped $8 — why?

why did the bedrooms coefficient drop from $66 to $58 when we added bathrooms?

if stuck: think about what a 3-bedroom apartment usually has

DISCUSSION: Think-pair-share (4 min) Prompt: “Why did the bedrooms coefficient drop when we added bathrooms?” Format: 30 sec think + 90 sec pair + 90 sec share If stuck: “A 3-bedroom apartment probably has how many bathrooms? A studio?” Key insight (draw out — this is the student’s phrase to arrive at, not yours): bedrooms and bathrooms travel together. Bigger apartments have more of both. In simple regression, the bedrooms coefficient absorbed part of what is really a bathrooms signal. Multiple regression disentangles the two by holding bathrooms constant. The word for this situation is confounding — reveal the word on the next slide after students have said “travel together” or “come as a package” out loud. Preview of Chapter 18 on causal inference. These are associations, not causal effects.

bedrooms and bathrooms travel together

simple regression on bedrooms alone absorbed part of the bathrooms signal

multiple regression disentangles them by holding each constant

these coefficients measure association, not causation

the word for this: confounding (→ Ch 18)
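
A small simulation can make the absorbed signal concrete. This is a sketch on synthetic data with made-up coefficients (50 per bedroom, 40 per bathroom, and bathrooms rising by 0.5 per bedroom), not a reconstruction of the Airbnb fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
bedrooms = rng.integers(0, 5, size=n).astype(float)
# Bathrooms tend to rise with bedrooms: the two "travel together".
bathrooms = 1.0 + 0.5 * bedrooms + rng.normal(scale=0.5, size=n)
price = 30 + 50 * bedrooms + 40 * bathrooms + rng.normal(scale=25, size=n)

ones = np.ones(n)
simple, *_ = np.linalg.lstsq(
    np.column_stack([ones, bedrooms]), price, rcond=None)
multiple, *_ = np.linalg.lstsq(
    np.column_stack([ones, bedrooms, bathrooms]), price, rcond=None)

print(simple[1])    # close to 50 + 40 * 0.5 = 70: absorbs part of the bathrooms signal
print(multiple[1])  # close to 50 once bathrooms is held constant
```

The gap between the two bedroom coefficients is the bathrooms coefficient times the slope of bathrooms on bedrooms, which is the same mechanism behind the drop from $66 to $58.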

how do you put a string into a matrix?

the problem

df['room_type'].unique()
# ['Entire home/apt', 'Private room', 'Shared room']

you cannot multiply “Entire home/apt” by $\beta$

regression needs numbers

how would you convert room type to numbers for regression?

if stuck: what about entire home = 2, private room = 1, shared room = 0?

DISCUSSION: Think-pair-share (3 min) Prompt: “How would you convert room type to numbers?” If stuck: “What does a linear model do with your numbers? It multiplies them by a coefficient and adds them up — what does that assume about the spacing between categories?” Advance the “if stuck” fragment only after the think-pair phase — it is a wrong-answer anchor, not a solution. Integer encoding is the most tempting wrong answer; surfacing it gives stuck students something concrete to react to. Key insight (draw out from the share phase): integer encoding forces an ordering AND equal spacing — the model treats “entire” as literally 2× “private.” We did not tell the data that. A correct answer is one binary column per category. That is one-hot encoding, on the next slide. Call on two or three students before moving on — the wrong answer has to land out loud before the right answer means anything.

one-hot encoding

one binary (0/1) column per category

pd.get_dummies(df['room_type'])

entire home	private room	shared room
1	0	0
0	1	0
1	0	0
0	0	1

each row has exactly one 1 — the category it belongs to

the dummy variable trap

three dummy columns sum to the all-ones vector

but the intercept column is the all-ones vector

let $e$ = entire, $p$ = private, $s$ = shared
the columns are linearly dependent: $e + p + s = \mathbf{1}$
$X^TX$ is singular — no unique solution
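
The linear dependence can be seen directly in the rank of the matrix. A minimal sketch, using the four rows of the one-hot table above:

```python
import numpy as np

# Four listings: entire, private, entire, shared (same pattern as the one-hot table).
e = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.0, 1.0, 0.0, 0.0])
s = np.array([0.0, 0.0, 0.0, 1.0])
ones = np.ones(4)

X = np.column_stack([ones, e, p, s])      # intercept plus all three dummies
rank_full = np.linalg.matrix_rank(X)
print(rank_full)                          # 3, although X has 4 columns

X_ref = np.column_stack([ones, p, s])     # drop one dummy (entire home as reference)
rank_ref = np.linalg.matrix_rank(X_ref)
print(rank_ref)                           # 3 = number of columns: full column rank
```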

reference level

Reference level

drop one category; its baseline is absorbed into the intercept. every other coefficient measures a difference from the reference.

pd.get_dummies(df['room_type'], drop_first=True)
# drops "Entire home/apt" (alphabetical first)

Q: if we dropped “Shared room” instead, which coefficients would change?

A: the intercept and every room-type coefficient change, because the baseline shifts; the fitted prices stay the same

room type alone

Intercept (Entire home/apt baseline):  $186
  room_type_Private room:              −$108
  room_type_Shared room:               −$129

intercept ≈ mean price of entire homes
private room: $108 less than an entire home
shared room: $129 less than an entire home
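
With room type as the only feature, the fit reduces to group means. The sketch below uses a tiny made-up sample (six invented prices, not the real listings) to show the intercept landing on the reference-group mean and each dummy coefficient on a difference in means.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample, not the real listings.
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Shared room", "Shared room"],
    "price": [200.0, 180.0, 90.0, 70.0, 60.0, 50.0],
})

# drop_first=True drops the alphabetically first level, "Entire home/apt".
dummies = pd.get_dummies(df["room_type"], drop_first=True).astype(float)
X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
beta, *_ = np.linalg.lstsq(X, df["price"].to_numpy(), rcond=None)

group_means = df.groupby("room_type")["price"].mean()
print(beta)          # intercept, Private room, Shared room
print(group_means)   # 190, 80, 55 for the three room types
```

Here the intercept is 190 (the entire-home mean), and the two coefficients are 80 − 190 and 55 − 190.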

the full model

price ~ bedrooms + bathrooms + room type

Intercept:                  $80.79
bedrooms:                  +$43.67
bathrooms:                 +$43.13
room_type_Private room:    −$95.77
room_type_Shared room:    −$119.77
R² = 0.432

how much is a bathroom worth?

about $43 more per night

holding bedrooms and room type constant

coefficients, visualized

negative bars = discounts from the entire-home baseline · intercept = $81 (an entire home with 0 bedrooms and 0 bathrooms)

Q: compute the predicted price of a private room, 2 bedrooms, 1 bathroom

$81 + 44(2) + 43(1) - 96 = \$116$

linear in parameters, not in features

the big idea

every model we’ve fit: $\widehat{y} = X\beta$

linear in $\beta$ — yes, always
linear in $x$ — not required

the columns of $X$ can be anything we compute from the raw data

interaction terms

is an extra bedroom associated with the same price change in Manhattan and the Bronx?

\[\widehat{y} = \beta_0 + \beta_1 \text{beds} + \beta_2 \mathbf{1}_{\text{Manh}} + \beta_3 (\text{beds} \times \mathbf{1}_{\text{Manh}})\]

where $\mathbf{1}_{\text{Manh}} = 1$ if Manhattan, $0$ otherwise

left: without $\beta_3$ — parallel lines · right: with $\beta_3$ — slopes differ by borough
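
The interaction column is just the product of the two features. A minimal sketch on synthetic data, with invented slopes of 20 outside Manhattan and 80 inside, shows how $\beta_1$ and $\beta_1 + \beta_3$ recover the two bedroom slopes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
beds = rng.integers(0, 5, size=n).astype(float)
manh = rng.integers(0, 2, size=n).astype(float)   # 1 if Manhattan, else 0

# Synthetic truth: slope 20 outside Manhattan, 20 + 60 = 80 inside.
y = 50 + 20 * beds + 40 * manh + 60 * beds * manh + rng.normal(scale=15, size=n)

X = np.column_stack([np.ones(n), beds, manh, beds * manh])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

slope_other = b1          # bedroom slope when the indicator is 0
slope_manh = b1 + b3      # bedroom slope when the indicator is 1
print(slope_other, slope_manh)
```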

borough-specific bedroom slopes

borough	$/bedroom
Manhattan	$81
Brooklyn	$63
Queens	$57
Staten Island	$55
Bronx	$16

an extra bedroom in Manhattan is associated with about 5× the dollar increase of one in the Bronx ($81 vs $16)

missing values as features

df['cleaning_fee_imputed']  = df['cleaning_fee'].fillna(0)
df['fee_missing']           = df['cleaning_fee'].isna().astype(int)

~19.6% of listings leave the cleaning fee blank
missing fee = host bundles cost, skipped field, or expects guest to tidy up
impute with zero — missing contribution reads directly as $\hat{\beta}_{\text{missing}}$
indicator column fee_missing flags imputed rows

the missing indicator earns its place

Q: does the indicator column add anything beyond the imputed fee?

baseline (no fee):             R² = 0.4322
+ imputed cleaning fee:        R² = 0.4845
+ imputed fee + fee_missing:   R² = 0.4982

cleaning_fee_imputed:  $0.80 per $1 of fee
fee_missing:          +$35.71

a blank fee carries its own price signal — missingness is information
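
Here is a sketch of the impute-plus-indicator pattern on synthetic data. The generating numbers ($0.80 per $1 of fee, a flat +$35 for blank fees, about 20% missing) are assumptions chosen to resemble the slide, not the real fit.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 3000
fee = rng.uniform(20, 100, size=n)
missing = rng.random(n) < 0.2                      # about 20% of fees left blank
# Synthetic truth: $0.80 per $1 of fee when reported, a flat +$35 when blank.
price = 100 + np.where(missing, 35.0, 0.8 * fee) + rng.normal(scale=10, size=n)

df = pd.DataFrame({"cleaning_fee": np.where(missing, np.nan, fee), "price": price})
df["cleaning_fee_imputed"] = df["cleaning_fee"].fillna(0)
df["fee_missing"] = df["cleaning_fee"].isna().astype(int)

X = np.column_stack([np.ones(n), df["cleaning_fee_imputed"], df["fee_missing"]])
beta, *_ = np.linalg.lstsq(X, df["price"].to_numpy(), rcond=None)
print(beta)   # intercept, per-dollar fee slope, shift for blank fees
```

Because the imputed value is zero, the `fee_missing` coefficient reads directly as the price shift for listings with a blank fee.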

a hint hiding in the interaction coefficients

dollars vary · percentages cluster

borough	$/bedroom	% per bedroom
Manhattan	$81	51%
Brooklyn	$63	64%
Queens	$57	71%
Staten Island	$55	80%
Bronx	$16	23%

four of five boroughs cluster 51–80% per bedroom · Bronx is the outlier

if one shared percentage replaces five dollar slopes, one coefficient does the work of five

a second clue: the fat right tail

median $100 · 90th percentile $250 · tail runs to $999

the top 5% of listings hold nearly half the level model’s squared error

a $50 miss on $100 (50% off) ≡ a $50 miss on $500 (10% off) — squared dollars can’t tell them apart

a bedroom is worth more in a luxury apartment

linear model: each bedroom adds a fixed dollar amount

adding a bedroom to a luxury loft is worth more dollars
adding a bedroom to a rundown studio is worth fewer dollars
few large Airbnbs exist — they command a disproportionate premium

prices grow multiplicatively, not additively

log transform straightens the curve

ax.set_yscale('log') — log y-axis, same data. level fit curves; log-level fit is straight.

multiplicative interpretation

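model = LinearRegression()  # from sklearn.linear_model
model.fit(df[['bedrooms']], np.log(prices))  # regress log(price) on bedrooms
# β₁ = model.coef_[0] ≈ 0.33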

\[e^{0.33} \approx 1.38\]

each bedroom multiplies price by ~1.38 — about 38% more per bedroom
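
To see the multiplier come out of a fit, here is a sketch on synthetic prices that grow by a factor of 1.38 per bedroom. The base price and noise level are assumptions, and plain NumPy least squares stands in for the slide's `model`.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
bedrooms = rng.integers(0, 5, size=n).astype(float)
# Synthetic prices that grow multiplicatively: about x1.38 per bedroom.
prices = 80 * 1.38 ** bedrooms * np.exp(rng.normal(scale=0.3, size=n))

X = np.column_stack([np.ones(n), bedrooms])
b0, b1 = np.linalg.lstsq(X, np.log(prices), rcond=None)[0]

multiplier = np.exp(b1)   # factor applied to price for each extra bedroom
print(b1, multiplier)
```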

same multiplier, different dollars

log model with borough: shared multiplier per bedroom is about 1.42 (about 42%)

apply that multiplier to each borough’s base:

borough	$/bedroom (0→1)
Manhattan	$39
Brooklyn	$25
Queens	$21
Staten Island	$19
Bronx	$18

same proportional jump · bigger base → bigger dollars

honest limit: the Bronx is an outlier we accept — the price of simplification

log on x: when $x$ spans orders of magnitude

\[y \sim \log(x) \quad\Longrightarrow\quad \text{a 1\% increase in } x \text{ adds about } \tfrac{\beta_1}{100} \text{ to } y\]

use it when $x$ ranges over orders of magnitude:

square footage (200 vs 2,000)
income ($30k vs $300k)
city population (10k vs 10M)

a unit change in income means something totally different at the bottom and top — a percentage change is the natural unit

elasticity: $\log(y) \sim \log(x)$

Elasticity

when both $y$ and $x$ are on the log scale, $\beta_1$ is the approximate percentage change in $y$ associated with a 1% change in $x$.

\[\log(\text{price}) \sim \log(\text{accommodates}) \quad\Longrightarrow\quad \beta_1 \approx 0.66\]

a 10% larger listing is associated with a price about 6.6% higher

elasticities are dimensionless — compare effects across features on wildly different scales

The fourth row of the table, and the most important one for business students. Elasticity is what economists reach for when they want to compare the price sensitivity of demand across products, or wage response to education across countries, or any two features measured in different units. The key property: the elasticity is a pure number (percent per percent), so you can compare “bedroom premium” to “capacity premium” even though one is measured in integer bedrooms and the other in people. Concrete number: log(price) ~ log(accommodates) gives β₁ ≈ 0.66, so a 10% larger listing is associated with about 6.6% more price. Note that accommodates does span an order of magnitude in our data (1 to 16), so log on x is actually the right call here. Say “price elasticity of capacity” out loud once.
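
A log-log fit can be sketched on synthetic data that follows a power law with elasticity 0.66. The base price of 60 and the noise level are assumptions, not values from the listings.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
accommodates = rng.integers(1, 17, size=n).astype(float)   # 1 to 16 guests
# Synthetic power law with elasticity 0.66 and multiplicative noise.
price = 60 * accommodates ** 0.66 * np.exp(rng.normal(scale=0.3, size=n))

X = np.column_stack([np.ones(n), np.log(accommodates)])
b0, elasticity = np.linalg.lstsq(X, np.log(price), rcond=None)[0]

# Exact effect of a 10% increase in x under the fitted power law.
pct_change = (1.10 ** elasticity - 1) * 100
print(elasticity, pct_change)
```

The exact figure for a 10% increase comes out slightly below $10 \times \beta_1$, which is why the percent-per-percent reading is best treated as an approximation for small changes.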

the four combinations

model	$\beta_1$ means	example
$y \sim x$	unit change in $x$ → $\beta_1$ change in $y$	each bed adds $66
$\log y \sim x$	unit change in $x$ → multiply $y$ by $e^{\beta_1}$	each bed × 1.38
$y \sim \log x$	1% change in $x$ → $\beta_1/100$ change in $y$	10% sqft → $5 more
$\log y \sim \log x$	1% change in $x$ → $\beta_1$% change in $y$	elasticity (e.g. 0.66)

when should the response be on a log scale?

for which of these would you use log(y)?

housing prices
test scores (0–100)
household incomes
body temperatures
company revenues

diagnostics: how you catch your own mistakes

adjusted R²

training $R^2$ never decreases with more features — even random noise

penalize for the number of features:

\[R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]

where $n$ = sample size, $p$ = number of features (not counting the intercept)

useless feature ⇒ $p$ grows, $R^2$ barely moves
adjusted $R^2$ drops
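
The formula is short enough to check by hand. In this sketch the sample size and the two $R^2$ values are hypothetical, chosen so that a useless fifth feature nudges $R^2$ up by only 0.0001.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p features (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: a useless feature nudges R^2 from 0.5000 to 0.5001.
before = adjusted_r2(0.5000, n=101, p=4)
after = adjusted_r2(0.5001, n=101, p=5)
print(before, after)   # the second value is lower despite the higher raw R^2
```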

multicollinearity

beds and bedrooms are nearly the same feature · correlation ≈ 0.68

	without beds	with beds
bedrooms	+58.43	+28.48
bathrooms	+34.37	+25.52
beds	—	+30.98
$R^2$	0.2215	0.2755

bedrooms coefficient cut in half from $58 to $28 — beds absorbed the signal

symptoms of multicollinearity

coefficients become large and unstable
small changes in data → big swings in coefficients
opposite signs, blown-up magnitudes
$X^TX$ is ill-conditioned, so $(X^TX)^{-1}$ amplifies noise

predictions may still be fine — individual coefficients become uninterpretable
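
One way to put a number on the ill-conditioning is the condition number of $X^TX$. The sketch below compares an unrelated second column with a near-copy of `bedrooms`. The data are synthetic, and the near-copy is far more collinear than the real `beds` column (correlation ≈ 0.68), so it exaggerates the effect on purpose.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
bedrooms = rng.integers(0, 5, size=n).astype(float)
beds_indep = rng.integers(0, 5, size=n).astype(float)          # unrelated column
beds_collinear = bedrooms + rng.normal(scale=0.05, size=n)     # almost a copy

def cond_xtx(*cols):
    """Condition number of X^T X for an intercept plus the given columns."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.cond(X.T @ X)

cond_ok = cond_xtx(bedrooms, beds_indep)
cond_bad = cond_xtx(bedrooms, beds_collinear)
print(cond_ok, cond_bad)   # the near-duplicate column gives a much larger value
```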

residual diagnostic vocabulary

residual = observed − predicted: $\epsilon_i = y_i - \hat{y}_i$

fan → missing variance-stabilizing transform (often log)
curve → missing polynomial or log feature
clusters → missing categorical feature

The normal equations guarantee X^T epsilon = 0 — no linear pattern remains between features and residuals. Nonlinear patterns can still lurk, and when they do they are diagnosing something specific. Top-left is the reference: a well-specified linear model gives residuals that scatter evenly around zero. The other three panels are the misspecification vocabulary. Fan (top-right, variance grows with the prediction): the functional form is missing a variance-stabilizing transformation — often a log applied to y. Curve (bottom-left, U or arch shape): the true relationship has curvature the linear model cannot express; add a polynomial or log(x) feature. Clusters (bottom-right, vertically separated clouds): a categorical variable has been omitted. The next slide applies these three labels to our own Airbnb residuals.

Airbnb residuals: level vs log

level model: fan from our vocabulary · spread grows with predicted price

log model: flat band · the fan is gone
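
The fan can also be checked without a plot by comparing residual spread at the low and high ends of the predictions. This sketch uses synthetic multiplicative prices (an assumption standing in for the Airbnb data) and a hypothetical helper `spread_ratio`.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4000
bedrooms = rng.integers(0, 5, size=n).astype(float)
# Synthetic multiplicative prices, so noise scales with the price level.
price = 80 * 1.4 ** bedrooms * np.exp(rng.normal(scale=0.4, size=n))
X = np.column_stack([np.ones(n), bedrooms])

def spread_ratio(y):
    """Residual std among 4-bedroom listings divided by that among studios."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid[bedrooms == 4].std() / resid[bedrooms == 0].std()

ratio_level = spread_ratio(price)          # well above 1: a fan
ratio_log = spread_ratio(np.log(price))    # near 1: a flat band
print(ratio_level, ratio_log)
```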

what would you do next?

given the fan in the level-model residuals — fit the log model, or stay with level?

DISCUSSION: Think-pair-share (4 min — 60 sec pick a side, 90 sec share with a neighbor + what would you tell a host?, 60 sec debrief). Prompt: “Given the fan in the level-model residuals, fit the log model or stay with level?” If stuck: “Which model gives more honest uncertainty on a $500 prediction? Which gives more interpretable units to a host?” Key insight — draw out: Log model wins on residual structure and on the percentage interpretation that business stakeholders care about (a shared 42% premium per bedroom across all boroughs is one number, not five). Level model is still easier to explain to a non-technical audience and gives dollar coefficients directly. Either can be right — the residual plot is a diagnostic, not a verdict. The important thing is to look at the residuals before trusting the fit.

key takeaways

multiple regression: more columns in $X$, same projection math
normal equations: $X^TX\beta = X^Ty$
holding constant: coefficients are partial associations
one-hot encoding: categories as binary columns with a reference level
feature engineering: interactions, indicators, logs — all linear in $\beta$
diagnostics: residual plots catch what $R^2$ misses

what we still can’t answer

is the $43 bathroom coefficient real or noise? → Chapter 12 (inference)
does this model generalize to new data? → Chapter 6 (validation)
does a bathroom cause higher prices? → Chapter 18 (causation)

we can fit models. next: how to trust them.

logistics

quiz 2 — this Wednesday (Apr 15) in class
HW 2 — due Friday Apr 24
read Chapter 6 (validation) before Wednesday

one-minute feedback

what was the most useful thing you learned today?
what was the most confusing?

