Feature Engineering and Regression Diagnostics

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.ticker as mticker
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 12

# Load data
DATA_DIR = 'data'

In Chapter 4, we fit a regression predicting Airbnb prices from bedrooms and bathrooms — numeric features that regression handles directly. But the Airbnb dataset has boroughs, room types, listing descriptions. How do we use those?

Regression takes numbers. Real data has categories, text, and nonlinear relationships. Feature engineering bridges that gap — from hand-crafted inputs to AI-generated ones. The central idea: a model can be linear in its parameters while capturing wildly nonlinear relationships in its features. Every technique in this chapter exploits that fact.

The data: Airbnb NYC listings

Code
listings = pd.read_csv(f'{DATA_DIR}/airbnb/listings.csv', low_memory=False)

cols = ['id', 'name', 'description', 'price', 'bedrooms', 'bathrooms', 'beds', 'room_type',
        'neighbourhood_group_cleansed', 'accommodates', 'number_of_reviews']
df = listings[cols].dropna(subset=['price', 'bedrooms', 'bathrooms']).copy()
df = df.rename(columns={'neighbourhood_group_cleansed': 'borough'})

# Convert price to numeric
df['price'] = pd.to_numeric(df['price'].astype(str).str.replace('[$,]', '', regex=True), errors='coerce')

# Focus on reasonable prices (filter out near-free and luxury outliers)
df = df[df['price'].between(10, 500)].reset_index(drop=True)
prices = df['price']

print(f"Working with {len(df):,} listings")
df[['name', 'price', 'bedrooms', 'bathrooms', 'room_type', 'borough']].head()
Working with 28,778 listings
name price bedrooms bathrooms room_type borough
0 Cozy Stay in Queens, Easy Access to Manhattan 45 1 1.0 Private room Queens
1 Spacious room in comfortable apt. 95 1 1.0 Private room Manhattan
2 Fresh Clean & Modern: Williamsburg at its Best 60 1 1.5 Entire home/apt Brooklyn
3 Cozy Room In Perfect Location! 70 1 1.0 Private room Manhattan
4 Charming Central Park Studio 104 1 1.0 Entire home/apt Manhattan

Turning data into numbers

Log transforms change what coefficients mean

The coefficients from Chapter 4 have a clean interpretation: each bedroom adds about $33 to the predicted price. But in Chapter 2, we saw that Airbnb prices are heavily right-skewed, and a semilog scatter plot revealed a multiplicative relationship — each additional guest multiplied the price by a roughly fixed factor. When you log-transform the outcome or the features, the coefficient interpretation changes.

ImportantInterpreting coefficients with log transforms
Model Coefficient \(\beta_1\) means Example
\(y \sim x\) 1-unit increase in \(x\) → \(\beta_1\) change in \(y\) each bedroom adds $50
\(\log(y) \sim x\) 1-unit increase in \(x\) → multiply \(y\) by \(e^{\beta_1}\) each bedroom multiplies price by 1.08 (8% increase)
\(y \sim \log(x)\) 1% increase in \(x\) → \(\beta_1 / 100\) change in \(y\) 10% more sqft → $5 higher price
\(\log(y) \sim \log(x)\) 1% increase in \(x\)\(\beta_1\)% change in \(y\) (elasticity) 10% more sqft → 5% higher price

The first row is the model we fit in Chapter 4. The second row explains what happens when you regress \(\log(\text{price})\) on bedrooms: the coefficient lives on the log scale, so exponentiating converts it to a multiplicative factor. Let’s see this concretely.

Code
# Fit log(price) ~ bedrooms and interpret the coefficient
df['log_price'] = np.log(df['price'])

model_log = LinearRegression()
model_log.fit(df[['bedrooms']], df['log_price'])

beta1 = model_log.coef_[0]
multiplier = np.exp(beta1)
pct_change = (multiplier - 1) * 100

print(f"log(price) ~ bedrooms:  β₁ = {beta1:.4f}")
print(f"Each bedroom multiplies price by e^{beta1:.4f} = {multiplier:.3f}")
print(f"That's a {pct_change:.1f}% increase per bedroom.")
log(price) ~ bedrooms:  β₁ = 0.2834
Each bedroom multiplies price by e^0.2834 = 1.328
That's a 32.8% increase per bedroom.

The multiplicative structure explains why the semilog scatter plot in Chapter 2 showed a roughly straight line: the relationship between accommodates and price is approximately multiplicative, which is exactly what a \(\log(y) \sim x\) model captures.
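The code above demonstrates the \(\log(y) \sim x\) row of the table. The log-log (elasticity) row can be checked the same way on synthetic data with a known elasticity; this is a sketch, separate from the Airbnb pipeline, where the true exponent is 0.5 by construction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known elasticity of 0.5: y = 50 * x^0.5, times noise
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=1000)
y = 50 * x**0.5 * np.exp(rng.normal(0, 0.1, size=1000))

# log(y) ~ log(x): the slope estimates the elasticity
m = LinearRegression().fit(np.log(x).reshape(-1, 1), np.log(y))
print(f"Estimated elasticity: {m.coef_[0]:.3f}")
```

The fitted slope lands close to 0.5: a 1% increase in \(x\) predicts roughly a 0.5% increase in \(y\).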

One-hot encoding

Regression needs numbers. But two of our most informative features — borough and room_type — are strings:

Code
print("borough values:", df['borough'].unique())
print("room_type values:", df['room_type'].unique())
borough values: <StringArray>
['Queens', 'Manhattan', 'Brooklyn', 'Bronx', 'Staten Island']
Length: 5, dtype: str
room_type values: <StringArray>
['Private room', 'Entire home/apt', 'Shared room']
Length: 3, dtype: str

We can’t feed “Manhattan” or “Private room” into a regression equation. The solution is one-hot encoding: turn each category into a binary (0/1) column. A listing in Manhattan gets a 1 in the borough_Manhattan column and 0s elsewhere.

ImportantDefinition: Reference level

We use drop_first=True to avoid perfect collinearity. Why? If a listing isn’t in any of the other boroughs, it must be in the dropped category (the reference level). Including all dummies would make the dummy columns sum to the intercept column, so the columns would be linearly dependent (recall the column space from Chapter 4). The dropped category becomes the baseline that all other coefficients are measured against.

TipRegularization as an alternative to dropping a category

Dropping the first category is one way to resolve collinearity. Regularization is another. Even a mild L2 penalty (ridge regression) breaks the perfect linear dependence among dummy columns, making all coefficients well-defined and interpretable — no category needs to serve as a hidden baseline.

The intuition: collinearity is not inherently dangerous. It simply means the model has redundant parameters, so infinitely many coefficient vectors produce the same predictions. Regularization picks one of those vectors (the one with smallest norm), resolving the ambiguity without discarding any information. We develop regularization in Chapter 7.
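A minimal sketch of that idea on synthetic data (category names and group means invented for illustration): keep all dummy columns, with no reference level, and let a small ridge penalty pick well-defined coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic categories with known group means 10, 20, 30
rng = np.random.default_rng(1)
cat = rng.choice(['A', 'B', 'C'], size=300)
means = {'A': 10.0, 'B': 20.0, 'C': 30.0}
y = np.array([means[c] for c in cat]) + rng.normal(0, 1, size=300)

# Keep ALL dummy columns: each row sums to 1, perfectly collinear with the intercept
X = pd.get_dummies(pd.Series(cat))

# A mild L2 penalty makes the coefficients unique and finite
m = Ridge(alpha=1.0).fit(X, y)
print("Dummy coefficients:", m.coef_.round(2))

# Predictions still recover the group means, with no hidden baseline category
for c in ['A', 'B', 'C']:
    row = pd.DataFrame({k: [1.0 if k == c else 0.0] for k in ['A', 'B', 'C']})
    print(c, round(float(m.predict(row)[0]), 1))
```

With OLS this design matrix has infinitely many solutions; the ridge penalty selects the smallest-norm one, and the predicted group means are essentially unchanged.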

Code
# One-hot encode categorical features
cat_features = pd.get_dummies(df[['room_type', 'borough']], drop_first=True)
num_features = df[['bedrooms', 'bathrooms']]
X_base = pd.concat([num_features, cat_features], axis=1)

# See what one-hot encoding produced
print("Encoded columns:")
print(cat_features.columns.tolist())
print(f"\nX shape: {X_base.shape}")
cat_features.head(3)
Encoded columns:
['room_type_Private room', 'room_type_Shared room', 'borough_Brooklyn', 'borough_Manhattan', 'borough_Queens', 'borough_Staten Island']

X shape: (28778, 8)
room_type_Private room room_type_Shared room borough_Brooklyn borough_Manhattan borough_Queens borough_Staten Island
0 True False False False True False
1 True False False True False False
2 False False True False False False

Each dummy column shifts predictions for that category. Let’s fit a regression and see.

Code
model_base = LinearRegression().fit(X_base, prices)
r2_base = model_base.score(X_base, prices)

# Build coefficient table
feature_names = list(X_base.columns)
coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': model_base.coef_
}).sort_values('coefficient', key=abs, ascending=False)
coef_df['coefficient'] = coef_df['coefficient'].round(2)
print(f"R² = {r2_base:.3f}\n")
print(coef_df.to_string(index=False))
print(f"\nIntercept (baseline = Bronx, Entire home/apt): {model_base.intercept_:.2f}")
R² = 0.519

               feature  coefficient
 room_type_Shared room      -108.11
room_type_Private room       -82.35
     borough_Manhattan        61.12
              bedrooms        36.77
             bathrooms        24.61
      borough_Brooklyn        18.14
 borough_Staten Island       -11.75
        borough_Queens         8.70

Intercept (baseline = Bronx, Entire home/apt): 62.23

How to read this: the coefficient on borough_Manhattan tells you how much more (or less) a Manhattan listing costs compared to the baseline category (Bronx), holding all else equal. Shared rooms are cheaper; Manhattan is pricier.

TipThink About It: Cancer staging encoded as a number

Cancer is staged I through IV, with higher stages indicating more advanced disease. Suppose you encode these as integers (Stage I = 1, Stage II = 2, Stage III = 3, Stage IV = 4) and fit a linear regression predicting years survived. If Stage I patients average 4 years and Stage II patients average 2 years, least squares gives \(\hat{y} = 6 - 2x\). Now predict Stage IV: \(6 - 2(4) = -2\) years. The model predicts negative survival time.

The problem: encoding stages as 1–4 assumes the outcome changes by the same amount per stage and that the spacing between stages is uniform. Neither is true. One-hot encoding (a separate dummy for each stage) lets each stage have its own effect, free from these assumptions.

(Source: Udell, Feature Engineering, Cornell ORIE 4741, 2022.)
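The arithmetic in that example can be verified in a few lines:

```python
import numpy as np

# Stage means from the example: Stage I -> 4 years, Stage II -> 2 years
stages = np.array([1, 2])
survival = np.array([4.0, 2.0])

# Least squares through two points is exact: slope -2, intercept 6
slope, intercept = np.polyfit(stages, survival, deg=1)
print(f"y_hat = {intercept:.0f} - {abs(slope):.0f}x")

# Extrapolating to Stage IV yields a negative survival time
pred_stage4 = intercept + slope * 4
print(f"Predicted survival at Stage IV: {pred_stage4:.0f} years")  # -2
```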

Boolean threshold encoding for ordinals

One-hot encoding solves the cancer staging problem, but it discards the ordering: the model doesn’t know that Stage III is “between” Stage II and Stage IV. Boolean threshold encoding preserves the ordering without forcing a linear relationship. For an ordinal variable with levels \(1, 2, \ldots, K\), create \(K-1\) binary features:

\[\varphi(x) = \bigl[\mathbb{1}(x \ge 2),\; \mathbb{1}(x \ge 3),\; \ldots,\; \mathbb{1}(x \ge K)\bigr]\]

For cancer staging: \(\varphi(x) = [\mathbb{1}(x \ge \text{II}),\; \mathbb{1}(x \ge \text{III}),\; \mathbb{1}(x \ge \text{IV})]\). A Stage III patient gets the vector \([1, 1, 0]\). Each coefficient captures the incremental effect of crossing that threshold — the jump from Stage I to II can differ from the jump from II to III. The encoding uses the same number of features as one-hot (\(K-1\)), but the coefficients are cumulative rather than independent, which can improve predictions when the ordering is meaningful and training data is limited.
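A sketch of the encoding as a helper function (threshold_encode is our name for it, not a library function):

```python
import numpy as np
import pandas as pd

def threshold_encode(x, K):
    """Boolean threshold encoding: for each threshold k in 2..K, emit 1(x >= k)."""
    x = np.asarray(x)
    cols = {f'ge_{k}': (x >= k).astype(int) for k in range(2, K + 1)}
    return pd.DataFrame(cols)

# Cancer stages I-IV encoded as integers 1-4
stages = [1, 2, 3, 4]
encoded = threshold_encode(stages, K=4)
print(encoded)
# Stage III (x = 3) -> [1, 1, 0]: crossed thresholds II and III, not IV
```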

Code
# Bar chart of coefficients
fig, ax = plt.subplots(figsize=(10, 5))
coef_sorted = coef_df.sort_values('coefficient')
colors = ['#d62728' if c < 0 else '#2171b5' for c in coef_sorted['coefficient']]
ax.barh(coef_sorted['feature'], coef_sorted['coefficient'], color=colors, edgecolor='white')
ax.axvline(0, color='gray', linewidth=1)
ax.set_xlabel('Coefficient (effect on price, $)')
ax.set_title('Regression coefficients: bedrooms + bathrooms + room type + borough')
plt.tight_layout()
plt.show()

Polynomial features

What if the relationship isn’t linear in the original features? Maybe the jump from 1 to 2 bedrooms matters more than from 4 to 5. We can capture this by adding polynomial features: \(x\), \(x^2\), \(x^3\).

Here is the mantra of feature engineering: linear in parameters, not in features. We transform the inputs; the model stays linear in \(\beta\).

\[\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2\]

The equation above is nonlinear in \(x\) but linear in the parameters \((\beta_0, \beta_1, \beta_2)\). Least squares still applies — the optimization is identical to ordinary regression, just with a richer feature matrix.

Code
X_bedrooms = df[['bedrooms']]

results = []
for degree in [1, 2, 3, 4]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X_bedrooms)
    m = LinearRegression().fit(X_poly, prices)
    r2 = m.score(X_poly, prices)
    results.append({'degree': degree, 'n_features': X_poly.shape[1], 'R²': round(r2, 4)})
    print(f"Degree {degree}: R² = {r2:.4f}  ({X_poly.shape[1]} features)")
Degree 1: R² = 0.1555  (1 features)
Degree 2: R² = 0.1774  (2 features)
Degree 3: R² = 0.2257  (3 features)
Degree 4: R² = 0.2493  (4 features)

The \(R^2\) improvements are modest. The more serious problem with polynomials is not diminishing returns — it is overfitting and wild extrapolation. A degree-4 polynomial, fit on a range where most listings have 1–2 bedrooms, can produce absurd predictions outside that range.

Code
# Extrapolate polynomial fits beyond the training range
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: polynomial fits extrapolated to 12 bedrooms
ax = axes[0]
ax.scatter(df['bedrooms'], prices, alpha=0.05, s=5, color='gray', label='data')
x_extrap = np.linspace(0, 12, 300).reshape(-1, 1)

for degree, color, ls in [(1, '#2171b5', '-'), (3, '#d62728', '--'), (4, '#2ca02c', ':')]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_p = poly.fit_transform(X_bedrooms)
    m = LinearRegression().fit(X_p, prices)
    pred_extrap = m.predict(poly.transform(x_extrap))
    ax.plot(x_extrap, pred_extrap, color=color, linewidth=2.5, linestyle=ls,
            label=f'Degree {degree}')

ax.axhline(0, color='black', linewidth=0.8, linestyle='-')
ax.set_xlabel('Bedrooms')
ax.set_ylabel('Predicted price ($)')
ax.set_title('Polynomial extrapolation beyond the data')
ax.legend()
ax.set_ylim(-200, 500)

# Right: table of predictions at 9, 10, 11 bedrooms
ax = axes[1]
ax.axis('off')
bed_vals = [1, 3, 5, 7, 9, 10, 11]
table_data = []
for degree in [1, 3, 4]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_p = poly.fit_transform(X_bedrooms)
    m = LinearRegression().fit(X_p, prices)
    preds = m.predict(poly.transform(np.array(bed_vals).reshape(-1, 1)))
    table_data.append([f'${p:.0f}' for p in preds])

table = ax.table(
    cellText=table_data,
    rowLabels=['Degree 1', 'Degree 3', 'Degree 4'],
    colLabels=[f'{b} BR' for b in bed_vals],
    loc='center', cellLoc='center'
)
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1.2, 1.8)
ax.set_title('Predicted price by bedroom count', fontsize=12, pad=20)

plt.tight_layout()
plt.show()

The degree-3 and degree-4 models predict negative prices for listings with 9 or more bedrooms. This result is not a quirk of one dataset. Polynomial extrapolation is governed by the highest-order term, which eventually dominates and sends predictions to \(\pm\infty\). The linear model, while less flexible, at least gives sensible predictions outside the training range.

This failure mode is not hypothetical. In May 2020, the Institute for Health Metrics and Evaluation (IHME) at the University of Washington published influential COVID-19 death forecasts built on a cubic polynomial fit. The Washington Post reported that the model projected deaths would fall to near zero by summer — a prediction driven by the polynomial’s downward curve, not by epidemiology. The forecast was wildly wrong: by summer, US deaths were still climbing. The cubic eventually turns downward regardless of the data, just as our degree-4 bedroom model turns negative. Polynomial features fit training data well but extrapolate according to their algebraic shape, not the real world.

The secondary issue is diminishing returns: polynomials on bedrooms alone barely improve \(R^2\). The real gains come from incorporating other features.

TipThink About It

A real estate startup asks you to predict prices for 5-bedroom listings using this model. Your training data is almost entirely 1–2 bedrooms. Would you ship the model for that use case? What would you do instead?

Interaction terms

Is an extra bedroom worth the same in Manhattan and the Bronx? So far our model assumes yes — there’s one bedrooms coefficient that applies everywhere. An interaction term lets the effect of one feature depend on another:

\[\hat{y} = \beta_0 + \beta_1 \cdot \text{bedrooms} + \beta_2 \cdot \mathbb{1}_{\text{Manhattan}} + \beta_3 \cdot (\text{bedrooms} \times \mathbb{1}_{\text{Manhattan}}) + \ldots\]

The interaction coefficient \(\beta_3\) captures how much more (or less) a bedroom is worth in Manhattan compared to the reference borough.

Code
# Create interaction terms: bedrooms × each borough dummy
borough_dummies = pd.get_dummies(df['borough'], drop_first=True)
interactions = borough_dummies.multiply(df['bedrooms'], axis=0)
interactions.columns = [f'bedrooms_x_{col}' for col in interactions.columns]

# Full model with interactions
X_interact = pd.concat([X_base, interactions], axis=1)
model_interact = LinearRegression().fit(X_interact, prices)
r2_interact = model_interact.score(X_interact, prices)

print(f"Without interactions: R² = {r2_base:.3f}  ({X_base.shape[1]} features)")
print(f"With interactions:    R² = {r2_interact:.3f}  ({X_interact.shape[1]} features)")
print(f"Improvement: {r2_interact - r2_base:.4f}")
Without interactions: R² = 0.519  (8 features)
With interactions:    R² = 0.525  (12 features)
Improvement: 0.0056

The interaction terms improve \(R^2\) slightly, suggesting that the value of a bedroom depends on location. Let’s visualize the per-borough regression lines.

Code
# Visualize: different slopes per borough
fig, ax = plt.subplots(figsize=(10, 6))
colors = sns.color_palette('Set2', n_colors=df['borough'].nunique())

for color, (borough, group) in zip(colors, df.groupby('borough')):
    ax.scatter(group['bedrooms'], group['price'], alpha=0.03, s=5, color=color)
    # Fit within-borough regression line
    if len(group) > 50:
        mb = LinearRegression().fit(group[['bedrooms']], group['price'])
        x_range = np.array([[0], [group['bedrooms'].max()]])
        ax.plot(x_range, mb.predict(x_range), color=color, linewidth=2.5,
                label=f'{borough} (${mb.coef_[0]:.0f}/bedroom)')

ax.set_xlabel('Bedrooms')
ax.set_ylabel('Price ($)')
ax.set_title('Price per bedroom varies by borough')
ax.legend(loc='upper left', fontsize=9)
ax.set_xlim(-0.5, 6.5)
plt.tight_layout()
plt.show()

The slopes are quite different! An extra bedroom in Manhattan is worth much more than in the Bronx or Staten Island. Without interaction terms, our model uses a single average slope that may not represent any borough well.

Code
# Show R² progression
fig, ax = plt.subplots(figsize=(8, 4))
models = ['bedrooms only', '+ bath + room + borough', '+ interactions']
r2_list = [
    LinearRegression().fit(df[['bedrooms']], prices).score(df[['bedrooms']], prices),
    r2_base,
    r2_interact
]
bars = ax.bar(models, r2_list, color=['#aec7e8', '#6baed6', '#2171b5'], edgecolor='white')
for bar, r2_val in zip(bars, r2_list):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
            f'R²={r2_val:.3f}', ha='center', fontsize=11)
ax.set_ylabel('R²')
ax.set_title('Feature engineering progressively improves the model')
ax.set_ylim(0, 0.6)
plt.tight_layout()
plt.show()

Feature engineering is where domain knowledge meets statistics. Knowing that location affects the value of space is domain knowledge; encoding it as an interaction term is statistics. Notice the recurring pattern: dummies, polynomials, and interactions all produce new columns for the feature matrix, yet the model remains linear in parameters. Least squares handles every variant identically.

As we add features, \(R^2\) can only increase — even when the new features are noise. Adjusted \(R^2\) corrects for model complexity:

\[R^2_{\text{adj}} = 1 - \frac{s^2_{\text{residuals}} / (n-k-1)}{s^2_{\text{outcome}} / (n-1)}\]

where \(k\) is the number of predictors. A useless feature increases \(k\) without sufficiently reducing residual variance, so adjusted \(R^2\) drops. Let’s compare our models:

Code
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = len(prices)
models_adj = {
    'bedrooms only': (LinearRegression().fit(df[['bedrooms']], prices).score(df[['bedrooms']], prices), 1),
    '+ bath + room + borough': (r2_base, X_base.shape[1]),
    '+ interactions': (r2_interact, X_interact.shape[1]),
}

for name, (r2, k) in models_adj.items():
    print(f"{name:30s}  R²={r2:.4f}  adj R²={adjusted_r2(r2, n, k):.4f}  (k={k})")
bedrooms only                   R²=0.1555  adj R²=0.1555  (k=1)
+ bath + room + borough         R²=0.5190  adj R²=0.5188  (k=8)
+ interactions                  R²=0.5245  adj R²=0.5243  (k=12)

Missing values as features

A missing value can carry signal. If a hospital’s readmission count is suppressed (“Too Few to Report”), the suppression itself tells you the hospital is small — and hospital size correlates with outcomes. Dropping that row throws away useful information.

For categorical variables, the simplest approach: treat “missing” as its own category.

Code
# Example: treat a missing bedroom count as its own category
bedrooms = pd.Series([1, 2, None, 3, None, 1])
bedrooms_filled = bedrooms.fillna('Missing')
print(bedrooms_filled)
0        1.0
1        2.0
2    Missing
3        3.0
4    Missing
5        1.0
dtype: object

For numeric variables, create a binary missingness indicator alongside an imputed value:

Code
# Indicator: was the value missing?
bedrooms_missing = bedrooms.isna().astype(int)
# Impute with median for the numeric column
bedrooms_imputed = bedrooms.fillna(bedrooms.median())

result = pd.DataFrame({
    'bedrooms_imputed': bedrooms_imputed,
    'bedrooms_missing': bedrooms_missing
})
print(result)
   bedrooms_imputed  bedrooms_missing
0               1.0                 0
1               2.0                 0
2               1.5                 1
3               3.0                 0
4               1.5                 1
5               1.0                 0

The imputed value lets the model use the column normally. The indicator lets the model learn whether missingness itself predicts the outcome — for example, listings without a bedroom count might systematically differ in price.

When does a missingness indicator help?

Not all missing data carries signal. Whether a missingness indicator improves your model depends on why the data is missing — the MCAR/MAR/MNAR distinction from Chapter 3.

Informative missingness (MAR or MNAR): In ambulance data, a missing heart rate often means the EMTs were too busy saving the patient’s life to record vitals. The patients with missing heart rates have systematically worse outcomes — the missingness is the signal. Creating a binary feature heart_rate_missing gives the model access to that information, and it turns out to be one of the strongest predictors of mortality.

Uninformative missingness (MCAR): A weather station’s battery dies at random, causing missing temperature readings for a few hours. The battery failure has nothing to do with the temperature outside — a hot day is no more likely to cause a dead battery than a cold one. Creating a temp_missing indicator adds a column of noise: it’s uncorrelated with the outcome and wastes a degree of freedom.

The rule of thumb: when missingness is informative (MAR or MNAR), a binary indicator for “is this value missing?” can improve your model. When missingness is MCAR — purely random — the indicator adds noise, not signal. In practice, the test set tells you: if adding the indicator improves test-set performance, the missingness was informative.
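A synthetic sketch of the contrast (values and seeds invented for illustration): here \(x\) is made missing in a way that depends on the outcome (MNAR), and the indicator improves test \(R^2\). Rerunning with a constant missingness probability (MCAR), the gap should essentially disappear.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
n = 4000
x = rng.normal(0, 1, n)
y = 2 * x + rng.normal(0, 1, n)

# Informative missingness: x is usually missing when y is high (MNAR)
mnar_mask = rng.random(n) < np.where(y > 1, 0.8, 0.05)
x_obs = np.where(mnar_mask, np.nan, x)

# Impute with the median and record a missingness indicator
x_imputed = np.where(np.isnan(x_obs), np.nanmedian(x_obs), x_obs)
indicator = np.isnan(x_obs).astype(float)

X_no_ind = x_imputed.reshape(-1, 1)
X_with_ind = np.column_stack([x_imputed, indicator])

# Same random_state => identical row split for both designs
Xtr1, Xte1, ytr, yte = train_test_split(X_no_ind, y, test_size=0.3, random_state=0)
Xtr2, Xte2, _, _ = train_test_split(X_with_ind, y, test_size=0.3, random_state=0)

r2_without = r2_score(yte, LinearRegression().fit(Xtr1, ytr).predict(Xte1))
r2_with = r2_score(yte, LinearRegression().fit(Xtr2, ytr).predict(Xte2))
print(f"Imputed only:        test R² = {r2_without:.3f}")
print(f"Imputed + indicator: test R² = {r2_with:.3f}")
```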

WarningDon’t impute blindly

Imputation replaces missing values with plausible substitutes (median, mean, or a model-based prediction). Without a missingness indicator, the model can’t distinguish real values from imputed ones. Always pair imputation with an indicator column when missingness might be informative (MNAR or MAR — see Chapter 3).

LLM-based featurization

What about unstructured data like text? Our Airbnb listings have names and descriptions that might contain useful information — mentions of luxury, cleanliness, neighborhood character — but a regression can’t read English.

Let’s look at a few examples:

Code
# Show 3 raw listings with names and descriptions
sample = df[df['description'].notna()].sample(3, random_state=42)
for _, row in sample.iterrows():
    print(f"Name: {row['name']}")
    desc = str(row['description'])[:200]  # Truncate for display
    print(f"Description: {desc}...")
    print(f"Price: ${row['price']:.0f}")
    print()
Name: Spacious & Convenient Midtown Apt
Description: This homey apt is conveniently located in midtown east of Manhattan with subway and train access within blocks. The spacious studio has our beloved artwork, hand-picked furniture and fond memories Ton...
Price: $120

Name: Luxury 2BR Midtown East-Near the UN
Description: Located just a few minutes from the UN, Carnegie Hall and Grand Central, this is a superb Eastside address in the heart of lively midtown Manhattan. It's also near many of the most exclusive shops, re...
Price: $299

Name: Cozy room in artsy Bushwick home
Description: My place is close to The Bushwick Collective, Paper Box Music & Art Inc, Arrogant Swine, House of YES, numerous restaurants and cool Bushwick bars . My place is good for solo adventurers and business ...
Price: $35

An LLM (large language model) can read each description and extract structured features — numbers and categories that a regression can use. The approach is a form of weak supervision: instead of hand-labeling thousands of listings, you write a prompt once and let the model do the labeling. The resulting features feed into a linear model just like bedrooms or borough dummies — still linear in parameters, even though the feature extraction itself is wildly nonlinear.

Choosing what to extract

Not every feature an LLM can produce will help a regression. Good LLM features are:

  • Predictive of the outcome. A “mentions luxury” flag likely correlates with price; a “uses exclamation marks” flag probably does not.
  • Not redundant with existing features. The LLM could extract “number of bedrooms” from the text, but we already have that column. Focus on information that structured fields miss.
  • Well-defined and unambiguous. Asking for a “quality score from 0-10” invites inconsistency; asking “does the listing mention a doorman? (true/false)” gives a concrete extraction task.

For our Airbnb data, we extracted five features from each listing description:

Feature Type Rationale
mentions_luxury bool Luxury keywords signal higher-end listings
amenity_count int More amenities mentioned suggests a more complete listing
sentiment_score float (0–1) Enthusiastic descriptions may correlate with quality
neighborhood_vibe category Captures location character beyond borough name
cleanliness_mentioned bool Cleanliness emphasis may signal quality or concern

The prompt

The extraction prompt must be specific, include the expected output format, and define each field precisely. Here is the prompt that generated our features (the full extraction script is in the course code repository):

Code
# The extraction prompt used to generate LLM features:
example_prompt = """Extract structured features from this Airbnb listing description.

LISTING DESCRIPTION:
{description}

Return a JSON object with exactly these fields:
- mentions_luxury (bool): Does the description use words like "luxury", "upscale",
  "designer", "high-end", "premium", or similar?
- amenity_count (int): How many distinct amenities are explicitly mentioned?
  Count items like WiFi, AC, washer, gym, pool, doorman, etc.
- sentiment_score (float, 0 to 1): Overall tone of the description.
  0 = negative/apologetic, 0.5 = neutral/factual, 1 = enthusiastic/glowing.
- neighborhood_vibe (string, one of: "artsy", "trendy", "bustling_urban", "quiet_residential",
  "family_friendly", "upscale", "unknown"):
  The dominant character of the neighborhood as described.
- cleanliness_mentioned (bool): Does the description explicitly mention cleanliness,
  sanitization, or hygiene?

Return ONLY the JSON object, no other text."""

print(example_prompt.format(description="<listing text goes here>"))
Extract structured features from this Airbnb listing description.

LISTING DESCRIPTION:
<listing text goes here>

Return a JSON object with exactly these fields:
- mentions_luxury (bool): Does the description use words like "luxury", "upscale",
  "designer", "high-end", "premium", or similar?
- amenity_count (int): How many distinct amenities are explicitly mentioned?
  Count items like WiFi, AC, washer, gym, pool, doorman, etc.
- sentiment_score (float, 0 to 1): Overall tone of the description.
  0 = negative/apologetic, 0.5 = neutral/factual, 1 = enthusiastic/glowing.
- neighborhood_vibe (string, one of: "artsy", "trendy", "bustling_urban", "quiet_residential",
  "family_friendly", "upscale", "unknown"):
  The dominant character of the neighborhood as described.
- cleanliness_mentioned (bool): Does the description explicitly mention cleanliness,
  sanitization, or hygiene?

Return ONLY the JSON object, no other text.

In practice, you’d call the LLM API on each listing description and parse the JSON response. The code below shows the pattern (the features for our dataset are pre-computed).

Code
# How the API call works (not run here — features are pre-computed):
#
# from anthropic import Anthropic
# client = Anthropic()
#
# response = client.messages.create(
#     model="claude-haiku-4-5-20251001",  # Fast + cheap: ~$0.01 per 1000 listings
#     max_tokens=200,
#     messages=[{"role": "user", "content": prompt_with_description}],
# )
# features = json.loads(response.content[0].text)
#
# Practical tips for projects:
# - Use Claude Haiku for bulk extraction (fast, cheap, sufficient for structured tasks)
# - Use Claude Sonnet when extraction requires complex reasoning
# - Process listings in batches; save results incrementally
# - Always include "Return ONLY the JSON" to avoid preamble in the response

# Load pre-computed LLM features
llm_features = pd.read_csv(f'{DATA_DIR}/airbnb/llm_features.csv')
print(f"LLM features shape: {llm_features.shape}")
print(f"Columns: {llm_features.columns.tolist()}")
llm_features.head()
LLM features shape: (29142, 6)
Columns: ['id', 'mentions_luxury', 'amenity_count', 'sentiment_score', 'neighborhood_vibe', 'cleanliness_mentioned']
id mentions_luxury amenity_count sentiment_score neighborhood_vibe cleanliness_mentioned
0 20091785 False 4 0.75 unknown True
1 3710661 False 4 0.84 bustling_urban False
2 15055244 True 4 0.79 trendy True
3 19640913 False 3 0.80 quiet_residential True
4 11888948 True 1 0.98 upscale True
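One practical wrinkle the commented pattern above glosses over: LLM responses sometimes arrive with stray text around the JSON, or no JSON at all. A defensive parsing sketch (parse_llm_features and its defaults are hypothetical helpers, not part of the chapter's pipeline):

```python
import json

def parse_llm_features(text, defaults=None):
    """Parse an LLM JSON response defensively: strip stray text, fall back to defaults."""
    defaults = defaults or {'mentions_luxury': False, 'amenity_count': 0,
                            'sentiment_score': 0.5, 'neighborhood_vibe': 'unknown',
                            'cleanliness_mentioned': False}
    try:
        # Tolerate preamble or epilogue around the JSON object
        start, end = text.index('{'), text.rindex('}') + 1
        parsed = json.loads(text[start:end])
        return {k: parsed.get(k, v) for k, v in defaults.items()}
    except (ValueError, json.JSONDecodeError):
        return defaults

good = '{"mentions_luxury": true, "amenity_count": 5, "sentiment_score": 0.9, "neighborhood_vibe": "trendy", "cleanliness_mentioned": false}'
noisy = 'Here is the JSON:\n{"mentions_luxury": false, "amenity_count": 2, "sentiment_score": 0.6, "neighborhood_vibe": "artsy", "cleanliness_mentioned": true}'
broken = 'Sorry, I cannot...'

print(parse_llm_features(good)['amenity_count'])       # 5
print(parse_llm_features(noisy)['neighborhood_vibe'])  # artsy
print(parse_llm_features(broken)['sentiment_score'])   # 0.5 (default)
```

Falling back to neutral defaults keeps one bad response from derailing a batch run; logging the failures for inspection is also a good idea.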

Why some features help and some don’t

Not every extracted feature will improve the model. sentiment_score sounds promising — surely enthusiastic descriptions command higher prices? — but sentiment may reflect the host’s writing style more than the listing’s quality. Meanwhile, mentions_luxury directly signals a price tier that structured fields (bedrooms, borough) cannot capture. The test set tells us which features earn their place.

Code
# Join LLM features to the main dataframe on listing id
df_llm = df.merge(llm_features, on='id', how='inner')
print(f"Listings with LLM features: {len(df_llm):,} (of {len(df):,} total)")

# Show the distribution of LLM-extracted features
# Note: .describe() shows only numeric columns by default; boolean and string columns are omitted
llm_cols = [c for c in llm_features.columns if c != 'id']
print(f"\nLLM feature summary (numeric columns only):")
print(df_llm[llm_cols].describe().round(2))
Listings with LLM features: 28,778 (of 28,778 total)

LLM feature summary (numeric columns only):
       amenity_count  sentiment_score
count       28778.00         28778.00
mean            4.44             0.83
std             2.78             0.13
min             1.00             0.17
25%             2.00             0.77
50%             4.00             0.86
75%             6.00             0.93
max            23.00             0.99

Now let’s build a regression with and without the LLM features, using train_test_split to measure honest test performance.

Code
# Build feature matrix with LLM features included
y_llm = df_llm['price']

# Numeric + categorical features (same as before, on the joined subset)
cat_llm = pd.get_dummies(df_llm[['room_type', 'borough']], drop_first=True)
num_llm = df_llm[['bedrooms', 'bathrooms']]

# Without LLM features
X_no_llm = pd.concat([num_llm, cat_llm], axis=1)

# With LLM features
llm_numeric = df_llm[llm_cols].apply(pd.to_numeric, errors='coerce').fillna(0)
X_with_llm = pd.concat([num_llm, cat_llm, llm_numeric], axis=1)

# Train/test split on the LLM subset
X_tr_no, X_te_no, y_tr_llm, y_te_llm = train_test_split(
    X_no_llm, y_llm, test_size=0.3, random_state=42
)
X_tr_with = X_with_llm.loc[X_tr_no.index]
X_te_with = X_with_llm.loc[X_te_no.index]

# Fit models with and without LLM features
m_no_llm = LinearRegression().fit(X_tr_no, y_tr_llm)
m_with_llm = LinearRegression().fit(X_tr_with, y_tr_llm)

r2_no = r2_score(y_te_llm, m_no_llm.predict(X_te_no))
r2_with = r2_score(y_te_llm, m_with_llm.predict(X_te_with))

print(f"Without LLM features: Test R² = {r2_no:.3f}  ({X_no_llm.shape[1]} features)")
print(f"With LLM features:    Test R² = {r2_with:.3f}  ({X_with_llm.shape[1]} features)")
print(f"Improvement:           {r2_with - r2_no:+.3f}")
Without LLM features: Test R² = 0.524  (8 features)
With LLM features:    Test R² = 0.541  (13 features)
Improvement:           +0.018

The LLM features provide a modest improvement. Let’s see which extracted features carry the most weight by examining the regression coefficients.

Code
# Show which LLM features matter most
all_features_llm = list(X_with_llm.columns)
coef_llm_df = pd.DataFrame({
    'feature': all_features_llm,
    'coefficient': m_with_llm.coef_
})
# Highlight LLM features
coef_llm_df['source'] = ['LLM' if f in llm_cols else 'original' for f in all_features_llm]
coef_llm_df = coef_llm_df.sort_values('coefficient', key=abs, ascending=False)

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#d62728' if s == 'LLM' else '#2171b5' for s in coef_llm_df['source']]
ax.barh(range(len(coef_llm_df)), coef_llm_df['coefficient'], color=colors, edgecolor='white')
ax.set_yticks(range(len(coef_llm_df)))
ax.set_yticklabels(coef_llm_df['feature'], fontsize=9)
ax.axvline(0, color='gray', linewidth=1)
ax.set_xlabel('Coefficient (effect on price, $)')
ax.set_title('Regression coefficients: original features (blue) + LLM features (red)')
plt.tight_layout()
plt.show()

WarningLLM features can be wrong

LLM featurization replaces an entire natural language processing pipeline with a single prompt. The resulting features are imperfect — LLMs can hallucinate or misinterpret text — but held-out evaluation tells you whether they help. Always validate LLM-extracted features on a test set before trusting them in a model.
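One cheap sanity check before the modeling step: compare an LLM flag against a transparent keyword heuristic. The mini-batch below is made up for illustration (the column names mirror this chapter's llm_features, but the rows are not real listings). High agreement proves nothing by itself, but low agreement is a red flag worth investigating.

Code

```python
import pandas as pd

# Hypothetical mini-batch: an LLM-extracted flag next to the raw text it came from
sample = pd.DataFrame({
    'description': [
        'Luxury penthouse with skyline views',
        'Cozy room near the subway',
        'Upscale, luxurious loft in SoHo',
        'Simple bed in a shared apartment',
    ],
    'mentions_luxury': [True, False, True, False],
})

# Transparent keyword heuristic as a cross-check, not as ground truth
keyword_flag = sample['description'].str.contains('luxur', case=False)
agreement = (keyword_flag == sample['mentions_luxury']).mean()
print(f"LLM vs keyword agreement: {agreement:.0%}")
```

Disagreements are where to look first: either the heuristic missed a paraphrase ("high-end", "five-star") or the LLM hallucinated.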

When you’ve gone too far

R² gets better as we add features. But does the model get better? Adding redundant or noisy features can hurt predictions on new data. This section develops the diagnostic toolkit.

Multicollinearity

What if two of your features are nearly the same? Let’s compare bedrooms and beds. We create a subset df_mc with the columns we need for this analysis, dropping rows where beds is missing.

Code
# Subset for multicollinearity analysis
df['price_clean'] = df['price']  # alias for consistency with ch4 code
df_mc = df[['bedrooms', 'bathrooms', 'beds', 'price_clean',
            'accommodates', 'number_of_reviews']].dropna()
print(f"Working with {len(df_mc):,} listings for multicollinearity analysis")
df_mc.head(10)
Working with 28,761 listings for multicollinearity analysis
   bedrooms  bathrooms  beds  price_clean  accommodates  number_of_reviews
0         1        1.0   1.0           45             1                  3
1         1        1.0   1.0           95             2                101
2         1        1.5   1.0           60             2                 19
3         1        1.0   1.0           70             2                  8
4         1        1.0   1.0          104             2                  4
5         1        1.0   1.0           90             2                 10
6         2        1.0   5.0          374             5                 16
7         2        1.0   3.0          100             5                  4
8         1        3.0   2.0           60             3                 37
9         0        1.0   1.0          175             2                  1

Each column of the data frame is now a vector in \(\mathbb{R}^n\), where \(n\) is the number of listings. Let’s inspect the first few entries of each feature column.

Code
# Look at the feature columns as vectors
print(f"Each column is a vector in R^{len(df_mc)} (one entry per listing):")
print()
for col in ['bedrooms', 'bathrooms', 'beds']:
    print(f"  {col}: first 8 values = {df_mc[col].values[:8]}")
Each column is a vector in R^28761 (one entry per listing):

  bedrooms: first 8 values = [1 1 1 1 1 1 2 2]
  bathrooms: first 8 values = [1.  1.  1.5 1.  1.  1.  1.  1. ]
  beds: first 8 values = [1. 1. 1. 1. 1. 1. 5. 3.]

Bedrooms and beds look similar — most listings have roughly one bed per bedroom. The 2D histogram on the left confirms this: most listings cluster along the diagonal. The correlation heatmap on the right quantifies how tightly each pair of features tracks.

Code
fig, axes = plt.subplots(1, 2, figsize=(11, 4.8),
                         gridspec_kw={'width_ratios': [1.15, 1]})

# Left: 2D histogram — log color scale so the dense diagonal pops
h = axes[0].hist2d(
    df_mc['bedrooms'], df_mc['beds'],
    bins=[np.arange(-0.5, 7), np.arange(-0.5, 10)],
    cmap='BuPu', cmin=1,
    norm=mcolors.LogNorm(vmin=1, vmax=20000),
    edgecolors='white', linewidths=0.5)
cb = plt.colorbar(h[3], ax=axes[0], shrink=0.82, pad=0.02)
cb.set_label('Number of listings', fontsize=10)
axes[0].plot([0, 6], [0, 6], color='#d62728', ls='--', lw=2.2,
             label='beds = bedrooms', zorder=5)
axes[0].set_xlabel('Bedrooms')
axes[0].set_ylabel('Beds')
axes[0].set_title('Bedrooms vs. beds', fontweight='bold')
axes[0].legend(fontsize=9.5, loc='upper left', framealpha=0.95,
               edgecolor='#999999', fancybox=False)
axes[0].set_xlim(-0.5, 6.5)
axes[0].set_ylim(-0.5, 9.5)
axes[0].set_xticks(range(0, 7))
axes[0].set_yticks(range(0, 10))
axes[0].annotate('most listings cluster\nnear the diagonal',
                 xy=(1, 1), xytext=(3.5, 7),
                 fontsize=9, fontstyle='italic', color='#333333',
                 arrowprops=dict(arrowstyle='->', color='#555555', lw=1.2),
                 ha='center', bbox=dict(boxstyle='round,pad=0.3',
                 facecolor='white', edgecolor='#cccccc', alpha=0.92))

# Right: correlation heatmap with key correlation highlighted
corr = df_mc[['bedrooms', 'bathrooms', 'beds']].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r',
            vmin=-1, vmax=1, ax=axes[1], square=True,
            linewidths=2, linecolor='white',
            annot_kws={'fontsize': 15, 'fontweight': 'bold'},
            cbar_kws={'shrink': 0.82, 'pad': 0.02, 'label': 'Correlation'})
labels = list(corr.columns)
bed_i, br_i = labels.index('beds'), labels.index('bedrooms')
for r, c in [(br_i, bed_i), (bed_i, br_i)]:
    axes[1].add_patch(plt.Rectangle((c, r), 1, 1, fill=False,
                                    edgecolor='#d62728', lw=3, zorder=10))
axes[1].set_title('Feature correlations', fontweight='bold')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)
axes[1].set_yticklabels(axes[1].get_yticklabels(), rotation=0)

plt.tight_layout(w_pad=3)
plt.show()

print(f"\nCorrelation between bedrooms and beds: {df_mc['bedrooms'].corr(df_mc['beds']):.3f}")
print(f"Correlation between bedrooms and bathrooms: {df_mc['bedrooms'].corr(df_mc['bathrooms']):.3f}")


Correlation between bedrooms and beds: 0.652
Correlation between bedrooms and bathrooms: 0.311

Bedrooms and beds are positively correlated (r ≈ 0.65). In linear algebra terms, a large part of the beds vector lies along the bedrooms direction:

\[\text{beds} \approx \text{bedrooms}\]

The span barely grows when you add beds as a feature. Two vectors pointing in nearly the same direction span almost a line, not a plane.

This phenomenon is called multicollinearity: redundant features that barely expand the column space. High correlation between feature columns corresponds to near-collinearity of the column vectors — hence the name.

Code
# How parallel are these feature vectors?
# We center first so the angle corresponds to correlation.
bedrooms_centered = df_mc['bedrooms'].values - df_mc['bedrooms'].mean()
beds_centered = df_mc['beds'].values - df_mc['beds'].mean()
bathrooms_centered = df_mc['bathrooms'].values - df_mc['bathrooms'].mean()

cos_beds = np.dot(bedrooms_centered, beds_centered) / (
    np.linalg.norm(bedrooms_centered) * np.linalg.norm(beds_centered))
cos_bath = np.dot(bedrooms_centered, bathrooms_centered) / (
    np.linalg.norm(bedrooms_centered) * np.linalg.norm(bathrooms_centered))

angle_beds = np.degrees(np.arccos(np.clip(cos_beds, -1, 1)))
angle_bath = np.degrees(np.arccos(np.clip(cos_bath, -1, 1)))

print(f"Angle between bedrooms and beds:      {angle_beds:.1f}° (moderate angle)")
print(f"Angle between bedrooms and bathrooms:  {angle_bath:.1f}° (more independent)")
print()
print("When vectors are nearly parallel, adding the second one")
print("barely expands the span. Your model doesn't gain much.")
Angle between bedrooms and beds:      49.3° (moderate angle)
Angle between bedrooms and bathrooms:  71.9° (more independent)

When vectors are nearly parallel, adding the second one
barely expands the span. Your model doesn't gain much.

How much independent information does each feature contribute beyond bedrooms? We can project beds and bathrooms onto the bedrooms direction and examine the residual: the part that bedrooms cannot explain.

Code
# What's left after removing the bedrooms component?
# beds ≈ 1.0 * bedrooms + something_else
# The "something_else" is the part that bedrooms can't explain.
# If it's tiny, beds is mostly redundant.

proj_coeff_beds = np.dot(beds_centered, bedrooms_centered) / np.dot(bedrooms_centered, bedrooms_centered)
residual_beds = beds_centered - proj_coeff_beds * bedrooms_centered

proj_coeff_bath = np.dot(bathrooms_centered, bedrooms_centered) / np.dot(bedrooms_centered, bedrooms_centered)
residual_bath = bathrooms_centered - proj_coeff_bath * bedrooms_centered

from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)

# Shared bins on the same x-range so width difference is visible
clip_val = np.percentile(
    np.abs(np.concatenate([residual_beds, residual_bath])), 99)
x_lim = clip_val * 1.1
bin_edges = np.linspace(-x_lim, x_lim, 70)

hist_color = '#A8C4E0'
kde_color = '#2B5C8A'

for ax, resid, label in zip(
    axes, [residual_beds, residual_bath], ['beds', 'bathrooms']
):
    resid_vis = resid[np.abs(resid) <= clip_val]
    ax.hist(resid_vis, bins=bin_edges, alpha=0.45, color=hist_color,
            edgecolor='white', linewidth=0.3)
    kde = stats.gaussian_kde(resid_vis, bw_method=0.2)
    kde_x = np.linspace(-x_lim, x_lim, 400)
    kde_y = kde(kde_x) * len(resid_vis) * (bin_edges[1] - bin_edges[0])
    ax.plot(kde_x, kde_y, color=kde_color, linewidth=2)
    ax.fill_between(kde_x, kde_y, alpha=0.12, color=kde_color)
    ax.text(0.96, 0.93, f'std = {resid.std():.2f}',
            transform=ax.transAxes, ha='right', va='top',
            fontsize=13, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.35', facecolor='white',
                      edgecolor='#aaaaaa', alpha=0.92))
    ax.set_title(f'{label} after removing bedrooms', fontsize=11.5)
    ax.set_xlabel('Residual')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

axes[0].set_ylabel('Count')
plt.tight_layout()
plt.show()

Visualizing collinearity: 2D span vs near-1D span

Let’s make this concrete with a small example. When two feature vectors are linearly independent — meaning neither is a scalar multiple of the other — they span a plane. When they’re nearly parallel, the “plane” collapses to almost a line.

Code
fig, axes = plt.subplots(1, 2, figsize=(12, 5.5))

panels = [
    (np.array([1, 0.3]), np.array([0.2, 1]),
     'Independent features\n(bedrooms, bathrooms)'),
    (np.array([1, 0.3]), np.array([1.1, 0.35]),
     'Redundant features\n(bedrooms, beds)'),
]

v1_color = '#2d70b3'
v2_color = '#c74440'
span_colors = ['#e8d44d', '#ffaaaa']

for idx, (ax, (va, vb, title)) in enumerate(zip(axes, panels)):
    ax.set_facecolor('#fafafa')

    # Shaded span region (parallelogram from large linear combinations)
    K = 5
    corners = []
    for a, b in [(-K, -K), (-K, K), (K, K), (K, -K)]:
        corners.append(a * va + b * vb)
    corners = np.array(corners)
    ax.fill(corners[:, 0], corners[:, 1], alpha=0.18,
            color=span_colors[idx], zorder=0)

    # Lattice dots at integer linear combinations a*v1 + b*v2
    lattice_pts = []
    for a in np.arange(-8, 9, 1):
        for b in np.arange(-8, 9, 1):
            pt = a * va + b * vb
            if -3.6 <= pt[0] <= 3.6 and -3.6 <= pt[1] <= 3.6:
                lattice_pts.append(pt)
    lattice_pts = np.array(lattice_pts)
    dists = np.sqrt(lattice_pts[:, 0]**2 + lattice_pts[:, 1]**2)
    max_dist = max(dists.max(), 1)
    alphas = np.clip(0.65 - 0.35 * (dists / max_dist), 0.15, 0.65)
    ax.scatter(lattice_pts[:, 0], lattice_pts[:, 1],
               s=10, color='#444444', alpha=alphas, zorder=2, edgecolors='none')

    # Bold vector arrows
    scale = 2.5
    for vi, (v, color, label) in enumerate(
            [(va, v1_color, r'$\mathbf{v}_1$'),
             (vb, v2_color, r'$\mathbf{v}_2$')]):
        ax.annotate('', xy=v*scale, xytext=(0, 0),
                    arrowprops=dict(arrowstyle='->', color=color, lw=3.5,
                                   mutation_scale=20), zorder=4)
        perp = np.array([-v[1], v[0]])
        perp = perp / np.linalg.norm(perp)
        sign = (1 if vi == 0 else -1) if idx == 1 else 1
        ax.text(*(v*scale + perp*sign*0.5), label, fontsize=15, color=color,
                fontweight='bold', ha='center', va='center', zorder=5,
                bbox=dict(boxstyle='round,pad=0.1', facecolor='white',
                          edgecolor='none', alpha=0.7))

    ax.plot(0, 0, 'ko', markersize=5, zorder=5)
    ax.axhline(0, color='#cccccc', lw=0.5, zorder=1)
    ax.axvline(0, color='#cccccc', lw=0.5, zorder=1)
    ax.set_xlim(-3.8, 3.8)
    ax.set_ylim(-3.8, 3.8)
    ax.set_aspect('equal')
    ax.set_title(title, fontsize=14, fontweight='bold', pad=10)
    ax.set_xlabel('Prediction component 1', fontsize=11)
    ax.set_ylabel('Prediction component 2', fontsize=11)
    ax.tick_params(labelsize=9)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

fig.text(0.25, 0.01, 'Span fills the plane', fontsize=12,
         ha='center', color='#2d7d2d', fontweight='bold')
fig.text(0.75, 0.01, 'Span collapses to a line', fontsize=12,
         ha='center', color='#c74440', fontweight='bold')
plt.tight_layout(rect=[0, 0.04, 1, 1])
plt.show()

On the left, two independent features let you reach any point in the plane — you have full 2D freedom. On the right, nearly parallel features only let you move along one direction. The span collapsed. Adding the second feature barely helped.

Multicollinearity limits your model: redundant features provide redundant information, not new predictive power.
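A small simulation puts numbers on that claim (synthetic data, not the listings): a target driven by one feature, plus a near-duplicate of that feature. Training R² barely moves when the redundant copy is added.

Code

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # near-duplicate of x1
y = 3 * x1 + rng.normal(size=n)

def train_r2(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_one = train_r2(x1.reshape(-1, 1), y)
r2_both = train_r2(np.column_stack([x1, x2]), y)
print(f"R² with x1 alone:        {r2_one:.3f}")
print(f"R² with x1 + near-copy:  {r2_both:.3f}")
```

The near-copy points in almost the same direction as x1, so the column space (and the fit) hardly changes.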

TipThink About It

If you’re an Airbnb host deciding which features to advertise, does it help to list both “bedrooms” AND “beds”? What would you list instead?

Rank: how many truly independent features do you have?

Vectors are linearly independent if none of them can be written as a combination of the others. For example, if \(\text{bedrooms} = 0.5 \cdot \text{bathrooms} + 0.3 \cdot \text{beds}\), then bedrooms is linearly dependent on the other two — it’s redundant.
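Exact dependence like that is easy to manufacture and detect. A sketch with synthetic columns standing in for the three features:

Code

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)     # plays the role of bathrooms
x2 = rng.normal(size=100)     # plays the role of beds
x3 = 0.5 * x1 + 0.3 * x2      # "bedrooms" built exactly from the other two

X = np.column_stack([x1, x2, x3])
# Three columns, but only two independent directions
print("columns:", X.shape[1], " rank:", np.linalg.matrix_rank(X))
```

With an exact linear combination, the rank drops below the number of columns; real data is rarely this clean, which is why the next check on the Airbnb features matters.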

How many truly independent columns does our Airbnb feature matrix have? Let’s check.

Code
# Build feature matrices and check rank
features_3 = df_mc[['bedrooms', 'bathrooms', 'beds']].values
features_2 = df_mc[['bedrooms', 'bathrooms']].values

rank_3 = np.linalg.matrix_rank(features_3)
rank_2 = np.linalg.matrix_rank(features_2)

print(f"Matrix with [bedrooms, bathrooms, beds]: shape {features_3.shape}, rank = {rank_3}")
print(f"Matrix with [bedrooms, bathrooms]:        shape {features_2.shape}, rank = {rank_2}")
print()
print(f"All 3 are technically independent (beds != c * bedrooms exactly),")
print(f"so the rank is {rank_3}. But near-collinearity means the 3rd direction")
print(f"adds very little. Singular values reveal how much each direction contributes.")
Matrix with [bedrooms, bathrooms, beds]: shape (28761, 3), rank = 3
Matrix with [bedrooms, bathrooms]:        shape (28761, 2), rank = 2

All 3 are technically independent (beds != c * bedrooms exactly),
so the rank is 3. But near-collinearity means the 3rd direction
adds very little. Singular values reveal how much each direction contributes.

The number of independent columns has a name:

ImportantDefinition: Rank

The rank of a matrix \(X\) is the dimension of its column space — equivalently, the number of linearly independent columns.

  • If \(X\) has 3 columns but rank 2, one column is redundant.
  • If rank = number of columns, all features contribute non-redundant information.

Rank is an all-or-nothing count: a column is either exactly dependent or it isn’t. But in practice, near-redundancy matters more than exact dependence. The singular values of \(X\) (computed by np.linalg.svd) quantify how much variance each independent direction captures. We’ll study SVD formally in Chapter 14; for now, the bar chart tells the story.

Code
# Singular values: how much does each independent direction contribute?
# (We'll return to singular values formally in Chapter 14 on PCA —
#  for now, just look at the bar chart.)
_, singular_values, _ = np.linalg.svd(
    features_3 - features_3.mean(axis=0), full_matrices=False)

# Variance explained is proportional to squared singular values
sv_pct = singular_values**2 / (singular_values**2).sum() * 100

fig, ax = plt.subplots(figsize=(6, 3.5))

labels = ['Direction 1', 'Direction 2', 'Direction 3']
colors = ['#2563eb', '#2563eb', '#cbd5e1']
edgecolors = ['#1e40af', '#1e40af', '#94a3b8']
label_colors = ['#1e3a5f', '#1e3a5f', '#94a3b8']

bars = ax.bar(labels, sv_pct, width=0.55, color=colors,
              edgecolor=edgecolors, linewidth=0.8)

for bar, pct, lc in zip(bars, sv_pct, label_colors):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1.5,
            f'{pct:.1f}%', ha='center', va='bottom', fontsize=13,
            fontweight='bold', color=lc)

# Cumulative line
cumulative = np.cumsum(sv_pct)
x_positions = range(len(sv_pct))
ax.plot(x_positions, cumulative, 'o-', color='#dc2626', markersize=6,
        linewidth=1.8, zorder=5)
ax.text(1 + 0.15, cumulative[1] - 1, f'{cumulative[1]:.0f}%',
        fontsize=10, color='#dc2626', ha='left', va='top')
ax.text(2 - 0.15, cumulative[2] + 1, f'{cumulative[2]:.0f}%',
        fontsize=10, color='#dc2626', ha='right', va='bottom')

ax.set_ylabel('Variance explained (%)', fontsize=11)
ax.set_title('Two directions capture nearly all the variance',
             fontsize=13, fontweight='bold')
ax.set_ylim(0, 112)
ax.yaxis.set_major_locator(mticker.MultipleLocator(25))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(axis='y', alpha=0.25, linewidth=0.5)
ax.set_axisbelow(True)
plt.tight_layout()
plt.show()

print("The 3rd direction contributes very little — beds is mostly redundant with bedrooms.")

The 3rd direction contributes very little — beds is mostly redundant with bedrooms.
TipThink About It

A model with 50 features but rank 3 — what does that tell you about the data? How many features are really doing the work?

What predictions are possible?

The opening of this chapter asked: given your features, what predictions are possible? The column space of \(X\) is the set of all achievable predictions. Any target vector outside the column space cannot be reached exactly — the best a linear model can do is find the closest point inside the column space.

Let’s project actual Airbnb prices onto the column space of two feature sets and measure the gap.

Code
# Build two design matrices: one with 2 features, one with 4
X_small = df_mc[['bedrooms', 'bathrooms']].values
X_large = df_mc[['bedrooms', 'bathrooms', 'accommodates', 'number_of_reviews']].values
y = df_mc['price_clean'].values

# Project y onto each column space using least squares
beta_small = np.linalg.lstsq(X_small, y, rcond=None)[0]
beta_large = np.linalg.lstsq(X_large, y, rcond=None)[0]

y_hat_small = X_small @ beta_small
y_hat_large = X_large @ beta_large

# Residuals: what the model cannot reach
resid_small = y - y_hat_small
resid_large = y - y_hat_large

print("Projection of actual prices onto the column space:")
print(f"  2 features (bedrooms, bathrooms):")
print(f"    residual norm = {np.linalg.norm(resid_small):,.0f}")
print(f"    avg |residual| = ${np.mean(np.abs(resid_small)):.0f} per listing")
print(f"  4 features (+ accommodates, reviews):")
print(f"    residual norm = {np.linalg.norm(resid_large):,.0f}")
print(f"    avg |residual| = ${np.mean(np.abs(resid_large)):.0f} per listing")
print()
print("More independent features → larger column space → smaller residual.")
Projection of actual prices onto the column space:
  2 features (bedrooms, bathrooms):
    residual norm = 13,387
    avg |residual| = $60 per listing
  4 features (+ accommodates, reviews):
    residual norm = 11,916
    avg |residual| = $51 per listing

More independent features → larger column space → smaller residual.

The scatter plots below compare predicted prices (the projection onto the column space, computed by np.linalg.lstsq) against actual prices. Points on the dashed \(y = x\) line would be perfect predictions. Color intensity shows how many listings land in each hex bin.

Code
# Visualize: actual prices vs projected prices
fig, axes = plt.subplots(1, 2, figsize=(12, 5.2))
limit = 500
norm = mcolors.LogNorm(vmin=1, vmax=600)
ss_tot = np.sum((y - np.mean(y))**2)

for ax, y_hat, title_str in [
    (axes[0], y_hat_small, '2 features: bedrooms + bathrooms'),
    (axes[1], y_hat_large, '4 features: + accommodates + reviews')
]:
    r2 = 1 - np.sum((y - y_hat)**2) / ss_tot
    mae = np.mean(np.abs(y - y_hat))
    mask = (y_hat >= 0) & (y_hat <= limit) & (y >= 0) & (y <= limit)

    hb = ax.hexbin(y_hat[mask], y[mask], gridsize=45, cmap='YlOrRd',
                   mincnt=1, norm=norm, extent=[0, limit, 0, limit],
                   linewidths=0.1, edgecolors='face')
    ax.plot([0, limit], [0, limit], color='#222', ls='--', lw=2, zorder=5)
    ax.text(limit * 0.76, limit * 0.70, 'y = x', fontsize=9, color='#222',
            rotation=45, ha='center',
            bbox=dict(facecolor='white', edgecolor='none', alpha=0.8, pad=1))
    ax.text(0.04, 0.96,
            f'$R^2 = {r2:.2f}$\nmean |residual| = ${mae:.0f}',
            transform=ax.transAxes, fontsize=10.5, va='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='white',
                      edgecolor='#999', alpha=0.95))
    ax.set_xlabel('predicted price ($)')
    ax.set_ylabel('actual price ($)')
    ax.set_title(title_str, fontsize=11.5, pad=6)
    ax.set_xlim(0, limit)
    ax.set_ylim(0, limit)
    ax.set_aspect('equal')

fig.subplots_adjust(right=0.88, wspace=0.25)
cbar_ax = fig.add_axes([0.90, 0.15, 0.02, 0.7])
fig.colorbar(hb, cax=cbar_ax, label='listings per bin')
plt.show()

Actual vs predicted Airbnb prices for two models. A larger column space (right) pulls predictions closer to actual prices.

The gap between the \(y = x\) line and the point cloud is the residual — the part of actual prices that lives outside the column space. A richer column space (more independent features) narrows the gap, but some gap always remains. No linear combination of these features can perfectly reproduce the full complexity of Airbnb pricing.

Checking the residuals

The normal equations guarantee \(X^Te = 0\) — that’s a mathematical fact. But a good model should also have residuals that look like random noise. The orthogonality is guaranteed; the randomness is not. Let’s check.
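That guarantee is cheap to verify numerically. A self-contained sketch on synthetic data (the design matrix and coefficients below are arbitrary):

Code

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta

# X^T e should be the zero vector, up to floating-point roundoff
print("max |X^T e| =", np.abs(X.T @ e).max())
```

The orthogonality holds no matter how bad the model is; that is exactly why it cannot tell you whether the residuals look like noise.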

Code
# Residual plot — using the main df with interaction model from Part 1
y_main = prices.values
y_hat_main = model_interact.predict(X_interact)
residuals_main = y_main - y_hat_main

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].scatter(y_hat_main, residuals_main, alpha=0.05, s=10)
axes[0].axhline(0, color='red', lw=1.5)
axes[0].set_xlabel('Predicted price ($)')
axes[0].set_ylabel('Residual ($)')
axes[0].set_title('Residuals vs predicted values')

axes[1].hist(residuals_main, bins=60, edgecolor='white', alpha=0.7)
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Count')
axes[1].set_title('Distribution of residuals')

plt.tight_layout()
plt.show()

print("Notice: the spread of residuals increases with predicted price.")
print("This is heteroscedasticity — the model's errors are not constant.")

Notice: the spread of residuals increases with predicted price.
This is heteroscedasticity — the model's errors are not constant.
WarningHeteroscedasticity

When the spread of residuals changes with the predicted value, we have heteroscedasticity. Coefficient estimates remain valid, but standard errors from the usual OLS formula become unreliable. We’ll revisit this issue in Chapter 12.
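A simulation reproduces the fan shape and puts numbers on it. The data below is synthetic, with noise whose standard deviation grows with \(x\) by construction:

Code

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(1, 10, size=n)
# Heteroscedastic by design: noise standard deviation equals x
y = 20 + 5 * x + rng.normal(scale=x, size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
resid = y - fitted

# The fan shape in numbers: residual spread in the low vs high half of fitted values
median_fit = np.median(fitted)
low_std = resid[fitted <= median_fit].std()
high_std = resid[fitted > median_fit].std()
print(f"residual std, low half:  {low_std:.2f}")
print(f"residual std, high half: {high_std:.2f}")
```

The spread in the upper half is noticeably larger than in the lower half, which is what the residual plot above showed for Airbnb prices.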

Preview

More features can make predictions worse on new data — even when training R² keeps improving. In Chapter 6, we meet a model that engineers features automatically. In Chapter 7, we develop the tools to tell when we’ve gone too far.

Key takeaways

  • Feature engineering transforms raw data into numbers regression can use. Categories become dummies, text becomes structured fields, nonlinear relationships become polynomial columns.
  • Log transforms change coefficient interpretation — additive (\(y \sim x\)), multiplicative (\(\log y \sim x\)), or elasticity (\(\log y \sim \log x\)). Choose the transform that matches the data-generating process.
  • One-hot encoding converts categories to binary columns. drop_first=True avoids collinearity; the dropped category is the reference level.
  • Polynomial features and interactions let linear models capture nonlinear patterns. The mantra: “linear in parameters, not in features.”
  • Adjusted R² penalizes complexity; raw R² always increases with more features, even noise features.
  • Multicollinearity: nearly parallel features don’t expand the column space. High correlation between feature columns means the second feature adds little new information.
  • Rank tells you how many truly independent features you have. Singular values quantify each direction’s contribution.
  • Residual patterns diagnose missing structure or wrong functional form. Heteroscedasticity (fan-shaped residuals) signals non-constant error variance.
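To make the log-transform takeaway concrete: in a \(\log y \sim x\) model a coefficient acts multiplicatively. A quick numeric check with \(\beta_1 = 0.08\) (the value used in the quiz question in the study guide):

Code

```python
import numpy as np

beta1 = 0.08                  # coefficient on x in a log(y) ~ x regression
multiplier = np.exp(beta1)    # one-unit increase in x scales y by this factor
pct_change = (multiplier - 1) * 100
print(f"y is multiplied by {multiplier:.4f}, about a {pct_change:.1f}% increase")
```

For small coefficients, \(\beta_1\) itself approximates the proportional change (0.08 ≈ 8%); the exponential makes it exact.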

Study guide

Key ideas

  • Feature engineering — creating new input variables from raw data to improve a model’s predictions.
  • Log-transform interpretation — log-transforming \(y\), \(x\), or both changes what a coefficient means: additive effect, multiplicative effect, or elasticity. See the interpretation table in this chapter.
  • One-hot encoding converts a categorical variable with \(k\) categories into \(k-1\) binary (0/1) columns (the dropped category is the reference level). Each dummy coefficient measures a shift relative to that baseline.
  • Boolean threshold encoding — for ordinal variables with \(K\) levels, create \(K-1\) binary features \([\mathbb{1}(x \ge 2), \ldots, \mathbb{1}(x \ge K)]\). This preserves ordering without forcing a linear effect (as treating the variable as numeric would) or discarding order entirely (as one-hot encoding does).
  • Polynomial features — powers of a variable (\(x^2, x^3, \ldots\)) that let a linear model capture curved relationships. The model is still linear in \(\beta\), not in \(x\).
  • Interaction terms — products of two features (e.g., bedrooms \(\times\) \(\mathbb{1}_{\text{Manhattan}}\)) that let one feature’s effect depend on another.
  • Adjusted R² penalizes model complexity, preventing raw R² from rewarding useless features. Use it to compare models with different numbers of predictors.
  • Multicollinearity — when feature vectors are nearly parallel, so one is approximately a scalar multiple of another. Redundant features don’t expand the column space.
  • Rank — the number of linearly independent columns in a matrix. Singular values tell you how much each independent direction contributes.
  • Missing values as features — missing values can be turned into features. For categoricals, treat “missing” as its own category. For numerics, pair an imputed value with a binary missingness indicator so the model can learn whether missingness itself predicts the outcome. A missingness indicator helps when missingness is informative (MAR or MNAR) and adds noise when it’s MCAR.
  • Weak supervision — using noisy, programmatic labels (e.g., from an LLM) instead of expensive hand labels. LLM featurization converts unstructured text into structured features via a well-specified prompt. Good prompts define each field precisely, request a structured output format, and target features that complement (rather than duplicate) existing columns.
  • Heteroscedasticity — when the spread of residuals changes with the predicted value. Coefficient estimates remain valid, but standard errors become unreliable.
  • Residual diagnostics — patterns in residual plots (fan shapes, curves) reveal missing structure or wrong functional form.
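The boolean threshold encoding entry above can be sketched in a few lines (the rating variable here is hypothetical, not a column of the Airbnb data):

Code

```python
import pandas as pd

# Hypothetical ordinal variable with K = 4 levels
ratings = pd.Series([1, 3, 4, 2, 4], name='rating')

# K-1 indicator columns: 1(x >= k) for k = 2, ..., K
thresholds = pd.DataFrame(
    {f'rating_ge_{k}': (ratings >= k).astype(int) for k in range(2, 5)}
)
print(thresholds)
```

Each row's indicators turn on cumulatively as the rating climbs, so the model can assign a separate price shift to each step up the scale.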

Computational tools

  • pd.get_dummies(df, drop_first=True) — one-hot encode categorical columns, dropping one to avoid collinearity
  • PolynomialFeatures(degree=d) — generates polynomial features up to degree \(d\); still linear in parameters
  • df['col'].isna().astype(int) — create a missingness indicator column
  • np.linalg.matrix_rank(X) — rank of a matrix
  • np.linalg.svd(X) — singular value decomposition (more in Chapter 14)
  • df.corr() — correlation matrix (related: high correlation ↔︎ near-collinearity)
  • train_test_split(X, y, test_size=0.3, random_state=42) — splits data for honest evaluation

For the quiz

  • Know what one-hot encoding produces and why drop_first=True is needed.
  • Be able to write the formula for a model with polynomial or interaction features and explain why it’s still “linear regression.”
  • Interpret a coefficient from a log-transformed regression (e.g., if \(\beta_1 = 0.08\) in \(\log(y) \sim x\), what does that mean in original units?).
  • Be able to explain what LLM featurization does, why it’s called weak supervision, and what makes a good extraction prompt.
  • Know when a missingness indicator helps (informative/MNAR) vs. adds noise (MCAR).
  • Be able to write the boolean threshold encoding for an ordinal variable and explain how it differs from one-hot encoding.
  • Know what multicollinearity means and how it limits a model’s predictive power.
  • Understand the connection between rank and the number of independent features.
  • Recognize heteroscedasticity in a residual plot and state its consequences.