Linear Models: From Data to Geometry

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.ticker as mticker
import seaborn as sns
from sklearn.linear_model import LinearRegression
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 12

# Load data
DATA_DIR = 'data'

How much is a bathroom worth?

An Airbnb host adds a second bathroom. How much more should they charge per night? Regression gives the answer — and linear algebra tells us why it works. This chapter builds both from scratch: what a linear model can predict, what it can’t, and how regression finds the best prediction.

Vectors are everywhere in data

To answer the pricing question, we need to represent listings as mathematical objects we can compute with. The right abstraction: vectors.

There are two natural ways to see vectors in a dataset:

  1. Each row is a vector. A single Airbnb listing with (bedrooms, bathrooms, price) is a point in \(\mathbb{R}^3\). (\(\mathbb{R}^3\) just means a list of 3 real numbers — so (2, 1, 150) is a point in \(\mathbb{R}^3\).)
  2. Each column is a vector. The “bedrooms” column across all \(n\) listings is a vector in \(\mathbb{R}^n\) (one entry per listing).

You’ll use both views constantly — and they answer different questions. Let’s start with a tiny example to build intuition, then look at real data.

Code
# A tiny dataset: 5 Airbnb listings with 2 features
bedrooms  = np.array([1, 2, 3, 2, 1])
bathrooms = np.array([1, 1, 2, 2, 1])

# Each listing is a point in R^2
print(f"Listing 1: bedrooms={bedrooms[0]}, bathrooms={bathrooms[0]}")
print(f"Listing 3: bedrooms={bedrooms[2]}, bathrooms={bathrooms[2]}")
print()
print(f"The 'bedrooms' column is a vector in R^5: {bedrooms}")
print(f"The 'bathrooms' column is a vector in R^5: {bathrooms}")
Listing 1: bedrooms=1, bathrooms=1
Listing 3: bedrooms=3, bathrooms=2

The 'bedrooms' column is a vector in R^5: [1 2 3 2 1]
The 'bathrooms' column is a vector in R^5: [1 1 2 2 1]

Let’s plot the row view: each listing as a point in 2D space, with bedrooms on one axis and bathrooms on the other.

Code
# View 1: Each listing as a point in R^2
fig, ax = plt.subplots(figsize=(8, 5))

# Light background with subtle integer grid
ax.set_facecolor('#fafafa')
for x in range(0, 5):
    ax.axvline(x, color='#e0e0e0', linewidth=0.5, zorder=0)
for y in range(0, 4):
    ax.axhline(y, color='#e0e0e0', linewidth=0.5, zorder=0)

point_color = '#2d6a9f'
highlight_color = '#c44e52'
drop_color = '#999999'

# Drop-lines for Listing 3 to reinforce coordinate reading
highlight_idx = 2
hx, hy = bedrooms[highlight_idx], bathrooms[highlight_idx]
ax.plot([hx, hx], [0, hy], color=drop_color, linewidth=0.8,
        linestyle=(0, (4, 3)), zorder=2)
ax.plot([0, hx], [hy, hy], color=drop_color, linewidth=0.8,
        linestyle=(0, (4, 3)), zorder=2)

# Regular points
mask = np.ones(len(bedrooms), dtype=bool)
mask[highlight_idx] = False
ax.scatter(bedrooms[mask], bathrooms[mask], s=90, color=point_color,
           edgecolors='white', linewidths=1.0, zorder=5)

# Highlighted point
ax.scatter([hx], [hy], s=110, color=highlight_color,
           edgecolors='white', linewidths=1.0, zorder=5)

# Labels with manual offsets to avoid overlaps
label_specs = [
    (0, (12, -2), 'left', '#333333', 'normal'),
    (1, (10, 6), 'left', '#333333', 'normal'),
    (2, (10, 6), 'left', highlight_color, 'semibold'),
    (3, (-12, 8), 'right', '#333333', 'normal'),
    (4, (12, 10), 'left', '#333333', 'normal'),
]
for idx, offset, ha, color, weight in label_specs:
    ax.annotate(f'Listing {idx+1}', (bedrooms[idx], bathrooms[idx]),
                fontsize=10, color=color, fontweight=weight,
                xytext=offset, textcoords='offset points', ha=ha, va='center')

ax.annotate(f'({hx}, {hy})', (hx, hy), fontsize=9.5, color='#666666',
            xytext=(10, -10), textcoords='offset points',
            ha='left', va='top', fontstyle='italic')

ax.set_xlabel('Bedrooms', fontsize=12, labelpad=8)
ax.set_ylabel('Bathrooms', fontsize=12, labelpad=8)
ax.set_title('Nearby listings have similar features', fontsize=14, pad=12)
ax.set_xlim(-0.2, 4.2)
ax.set_ylim(-0.2, 3.2)
ax.set_xticks([0, 1, 2, 3, 4])
ax.set_yticks([0, 1, 2, 3])
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
for spine in ['bottom', 'left']:
    ax.spines[spine].set_color('#555555')
    ax.spines[spine].set_linewidth(0.8)
ax.tick_params(colors='#555555', labelsize=10)
plt.tight_layout()
plt.show()

How different are two listings? Subtract their feature vectors entry-by-entry to get a difference vector.

Code
# How different are listings 1 and 3?
listing_1 = np.array([bedrooms[0], bathrooms[0]])  # (1, 1)
listing_3 = np.array([bedrooms[2], bathrooms[2]])  # (3, 2)

diff = listing_3 - listing_1
print(f"Listing 1: {listing_1}")
print(f"Listing 3: {listing_3}")
print(f"Difference: {listing_3} - {listing_1} = {diff}")
print(f"\nListing 3 has {diff[0]} more bedrooms and {diff[1]} more bathroom.")
Listing 1: [1 1]
Listing 3: [3 2]
Difference: [3 2] - [1 1] = [2 1]

Listing 3 has 2 more bedrooms and 1 more bathroom.

The difference tells us how the listings differ — but it’s still two numbers. To collapse it into one number that measures “how far apart,” we need the length of the difference vector. The norm generalizes the Pythagorean theorem to measure the length of any vector.

Code
# The norm measures length — a generalization of the Pythagorean theorem
for i, name in [(0, 'Listing 1'), (2, 'Listing 3')]:
    vec = np.array([bedrooms[i], bathrooms[i]])
    length = np.sqrt(vec[0]**2 + vec[1]**2)
    print(f"{name}: {vec}, length = sqrt({vec[0]}^2 + {vec[1]}^2) = {length:.2f}")
Listing 1: [1 1], length = sqrt(1^2 + 1^2) = 1.41
Listing 3: [3 2], length = sqrt(3^2 + 2^2) = 3.61
ImportantDefinition: Norm

The norm (or length) of a vector \(v\) is \[\|v\| = \sqrt{v \cdot v} = \sqrt{\sum_{i=1}^n v_i^2}\] The norm measures magnitude — how large the vector is. In code: np.linalg.norm(v).
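As the definition notes, NumPy computes this directly; a quick check that np.linalg.norm matches the manual Pythagorean computation above:

```python
import numpy as np

v = np.array([3, 2])  # Listing 3's feature vector

manual = np.sqrt((v ** 2).sum())   # sqrt(3^2 + 2^2), as computed above
builtin = np.linalg.norm(v)        # NumPy's built-in norm

print(f"manual = {manual:.2f}, builtin = {builtin:.2f}")  # both 3.61
```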

Now we can collapse the difference into a single number. The distance between two listings is the norm of their difference vector.

Code
# Distance = norm of the difference vector
dist = np.linalg.norm(diff)
print(f"Difference vector: {listing_3} - {listing_1} = {diff}")
print(f"Distance = norm of difference = {dist:.2f}")
Difference vector: [3 2] - [1 1] = [2 1]
Distance = norm of difference = 2.24
ImportantDefinition: Distance

The distance between two vectors \(u\) and \(v\) is the norm of their difference: \[d(u, v) = \|u - v\|\] Smaller distance means more similar. Larger distance means more different.
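With distance in hand, we can ask which two of the five tiny listings are most similar — a minimal nearest-pair sketch (the arrays are repeated so the snippet stands alone):

```python
import numpy as np
from itertools import combinations

bedrooms = np.array([1, 2, 3, 2, 1])
bathrooms = np.array([1, 1, 2, 2, 1])
points = np.column_stack([bedrooms, bathrooms])  # each row is a listing in R^2

# Compare every pair of listings by the norm of their difference
best = min(combinations(range(len(points)), 2),
           key=lambda pair: np.linalg.norm(points[pair[0]] - points[pair[1]]))
i, j = best
print(f"Most similar pair: Listing {i+1} and Listing {j+1}, "
      f"distance = {np.linalg.norm(points[i] - points[j]):.2f}")
```

Listings 1 and 5 have identical features, so their distance is zero — the smallest possible.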

Distance tells us how far apart two listings are. But sometimes we want to answer a different question: do two listings have similar proportions of features, regardless of scale? An apartment with 4 bedrooms and 2 bathrooms has the same 2:1 ratio as one with 2 bedrooms and 1 bathroom — but they’re far apart in distance. To measure this kind of directional similarity, we need a new tool.

The inner product multiplies corresponding entries and adds them up.

Code
# Inner product: a measure of similarity
similarity = listing_1[0] * listing_3[0] + listing_1[1] * listing_3[1]
print(f"Listing 1: {listing_1}")
print(f"Listing 3: {listing_3}")
print(f"Entry-by-entry product sum: {listing_1[0]}*{listing_3[0]} + {listing_1[1]}*{listing_3[1]} = {similarity}")
Listing 1: [1 1]
Listing 3: [3 2]
Entry-by-entry product sum: 1*3 + 1*2 = 5

The inner product is positive here because both listings have positive feature values. To see negative and zero inner products, center the vectors by subtracting the mean — this reveals which listings deviate in the same direction vs. opposite directions from the average.

Code
# Center the listing vectors: how does each deviate from the average listing?
mean_vec = np.array([bedrooms.mean(), bathrooms.mean()])
centered_1 = listing_1 - mean_vec  # below average in both
centered_3 = listing_3 - mean_vec  # above average in both
centered_2 = np.array([bedrooms[1], bathrooms[1]]) - mean_vec  # avg bedrooms, below avg baths

print(f"Mean listing: ({mean_vec[0]:.1f}, {mean_vec[1]:.1f})")
print(f"Listing 1 centered: ({centered_1[0]:.1f}, {centered_1[1]:.1f})  — below average in both")
print(f"Listing 3 centered: ({centered_3[0]:.1f}, {centered_3[1]:.1f})  — above average in both")
print(f"Listing 2 centered: ({centered_2[0]:.1f}, {centered_2[1]:.1f})")
print()

dot_neg = np.dot(centered_1, centered_3)
dot_zero = np.dot(centered_1, centered_2)
print(f"Centered dot product (1 vs 3): {dot_neg:.2f}  — NEGATIVE (opposite directions from mean)")
print(f"Centered dot product (1 vs 2): {dot_zero:.2f}  — ZERO (perpendicular deviations)")
Mean listing: (1.8, 1.4)
Listing 1 centered: (-0.8, -0.4)  — below average in both
Listing 3 centered: (1.2, 0.6)  — above average in both
Listing 2 centered: (0.2, -0.4)

Centered dot product (1 vs 3): -1.20  — NEGATIVE (opposite directions from mean)
Centered dot product (1 vs 2): -0.00  — ZERO (perpendicular deviations)
ImportantDefinition: Inner Product (Dot Product)

The inner product (or dot product) of two vectors \(u, v \in \mathbb{R}^n\) is the scalar \[u \cdot v = u^T v = \sum_{i=1}^n u_i v_i\] A positive inner product indicates the vectors tend to point in the same direction; a negative value indicates opposite directions; zero indicates perpendicularity.
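In code there are several equivalent ways to compute it — entrywise multiply-and-sum, np.dot, or the @ operator:

```python
import numpy as np

u = np.array([1, 1])  # Listing 1
v = np.array([3, 2])  # Listing 3

# Three equivalent computations of the inner product
s1 = (u * v).sum()   # entrywise multiply, then add
s2 = np.dot(u, v)    # np.dot
s3 = u @ v           # the @ operator
print(s1, s2, s3)    # 5 5 5
```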

The raw inner product depends on both direction and magnitude. A listing with 10 bedrooms and 10 bathrooms would have a large inner product with almost any other listing, simply from its scale. To isolate directional similarity, divide by the norms of both vectors. (Since we already defined norms above, this formula should look natural.)

Code
# Cosine similarity: inner product divided by both norms
cos_sim = similarity / (np.linalg.norm(listing_1) * np.linalg.norm(listing_3))
print(f"Inner product: {similarity}")
print(f"Norms: {np.linalg.norm(listing_1):.2f} and {np.linalg.norm(listing_3):.2f}")
print(f"Cosine similarity: {similarity} / ({np.linalg.norm(listing_1):.2f} * {np.linalg.norm(listing_3):.2f}) = {cos_sim:.3f}")
print(f"\nA value near 1 means these listings have similar proportions of bedrooms to bathrooms.")
Inner product: 5
Norms: 1.41 and 3.61
Cosine similarity: 5 / (1.41 * 3.61) = 0.981

A value near 1 means these listings have similar proportions of bedrooms to bathrooms.
ImportantDefinition: Cosine Similarity

The cosine similarity between two vectors \(\mathbf{a}\) and \(\mathbf{b}\) measures the cosine of the angle between them:

\[\text{cosine similarity} = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}\]

Values range from \(-1\) (opposite directions) to \(+1\) (same direction). A value of \(0\) means the vectors are perpendicular (orthogonal). Cosine similarity measures direction, ignoring magnitude — two vectors that point the same way have cosine similarity 1 regardless of their lengths.
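A small sketch of the magnitude-invariance claim: scaling either vector leaves the cosine similarity unchanged.

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([2.0, 1.0])
b = np.array([4.0, 2.0])   # same direction as a, twice as long

print(f"{cosine_similarity(a, b):.6f}")        # 1.000000 — identical direction
print(f"{cosine_similarity(a, 10 * b):.6f}")   # 1.000000 — rescaling changes nothing
```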

Linear functions and linear combinations

What does a linear model actually do? It takes each feature column, multiplies it by a weight, and adds them up. Let’s watch.

Code
# Two feature vectors (columns of our data matrix)
bedrooms_vec = np.array([1, 2, 3, 2, 1])  # bedrooms column
bathrooms_vec = np.array([1, 1, 2, 2, 1])  # bathrooms column

# A linear combination: 100*bedrooms + 50*bathrooms
beta1, beta2 = 100, 50
prediction = beta1 * bedrooms_vec + beta2 * bathrooms_vec
print(f"Weights: beta1={beta1}, beta2={beta2}")
print(f"bedrooms:    {bedrooms_vec}")
print(f"bathrooms:   {bathrooms_vec}")
print(f"prediction:  {prediction}")
print()
print(f"Listing 1: {beta1}*{bedrooms_vec[0]} + {beta2}*{bathrooms_vec[0]} = ${prediction[0]}")
print(f"Listing 3: {beta1}*{bedrooms_vec[2]} + {beta2}*{bathrooms_vec[2]} = ${prediction[2]}")
Weights: beta1=100, beta2=50
bedrooms:    [1 2 3 2 1]
bathrooms:   [1 1 2 2 1]
prediction:  [150 250 400 300 150]

Listing 1: 100*1 + 50*1 = $150
Listing 3: 100*3 + 50*2 = $400

Scaling by a weight and adding — that’s all a linear model does. This operation has a name:

ImportantDefinition: Linear Combination

A linear combination of vectors \(v_1, v_2\) is any expression \(\beta_1 v_1 + \beta_2 v_2\) where \(\beta_1, \beta_2\) are scalars (plain numbers).

A linear model \(\hat{y} = \beta_1 x_1 + \beta_2 x_2\) says: “the prediction is a linear combination of the feature columns.”

Why do linear combinations matter? Because they behave predictably: doubling a weight doubles its contribution. There are no interaction surprises, no thresholds, no sudden jumps. This predictability has a formal name:

ImportantDefinition: Linear Function

A function \(f\) is linear if \[f(ax + by) = a\,f(x) + b\,f(y)\] for all vectors \(x, y\) and all scalars \(a, b\). This single identity combines two properties: scaling the input scales the output (homogeneity), and adding inputs adds outputs (additivity).

Familiar examples of linear functions: multiplying by a constant (\(f(x) = 3x\)), matrix multiplication (\(f(x) = Ax\)), and the inner product with a fixed vector (\(f(x) = w \cdot x\)). Non-examples: squaring (\(f(x) = x^2\), since \((2x)^2 \neq 2x^2\)) and adding a constant (\(f(x) = x + 1\), since \(f(0) \neq 0\)).
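Both claims are cheap to check numerically — a sketch testing the defining identity for the inner-product map, and watching it fail for squaring:

```python
import numpy as np

w = np.array([100.0, 50.0])

def f_linear(v):
    return w @ v        # inner product with a fixed vector: linear

def f_square(v):
    return v ** 2       # entrywise squaring: not linear

x, y = np.array([1.0, 1.0]), np.array([3.0, 2.0])
a, b = 2.0, -1.0

lhs = f_linear(a * x + b * y)
rhs = a * f_linear(x) + b * f_linear(y)
print(np.isclose(lhs, rhs))       # True — linearity holds

bad_lhs = f_square(a * x + b * y)
bad_rhs = a * f_square(x) + b * f_square(y)
print(np.allclose(bad_lhs, bad_rhs))  # False — squaring breaks the identity
```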

Different weights produce different prediction vectors. The plot below shows several linear combinations of two vectors \(v_1\) and \(v_2\) in \(\mathbb{R}^2\). Each black dot is one choice of weights, and the dashed arrows show the construction: follow \(\beta_1 v_1\) from the origin, then \(\beta_2 v_2\) to reach the result. The background grid shows all integer-weight combinations — notice how they tile the plane.

Code
v1 = np.array([2, 1])
v2 = np.array([1, 2])
fig, ax = plt.subplots(figsize=(8, 6))
c1, c2, c_result, c_lattice = '#2d8a4e', '#c44e52', '#2b2b2b', '#9999bb'
a_range, b_range = np.arange(-4, 6), np.arange(-4, 6)
for a in a_range:
    for b in b_range:
        pt = a * v1 + b * v2
        if -3 <= pt[0] <= 7.5 and -2.5 <= pt[1] <= 6.5:
            ax.plot(*pt, 'o', color=c_lattice, ms=3.5, alpha=0.4, zorder=1)
for a in a_range:
    line = np.array([a * v1 + t * v2 for t in np.linspace(-4, 5, 100)])
    mask = (line[:, 0] >= -3) & (line[:, 0] <= 7.5) & (line[:, 1] >= -2.5) & (line[:, 1] <= 6.5)
    idx = np.where(mask)[0]
    if len(idx) > 0:
        ax.plot(line[idx, 0], line[idx, 1], color=c_lattice, lw=0.35, alpha=0.25, zorder=0)
for b in b_range:
    line = np.array([t * v1 + b * v2 for t in np.linspace(-4, 5, 100)])
    mask = (line[:, 0] >= -3) & (line[:, 0] <= 7.5) & (line[:, 1] >= -2.5) & (line[:, 1] <= 6.5)
    idx = np.where(mask)[0]
    if len(idx) > 0:
        ax.plot(line[idx, 0], line[idx, 1], color=c_lattice, lw=0.35, alpha=0.25, zorder=0)
ax.axhline(0, color='#999999', lw=0.6, zorder=1)
ax.axvline(0, color='#999999', lw=0.6, zorder=1)
arrow_kw = dict(arrowstyle='->', mutation_scale=18)
ax.annotate('', xy=v1, xytext=(0, 0), arrowprops=dict(**arrow_kw, color=c1, lw=3), zorder=10)
ax.annotate('', xy=v2, xytext=(0, 0), arrowprops=dict(**arrow_kw, color=c2, lw=3), zorder=10)
ax.text(v1[0]+0.15, v1[1]-0.35, r'$\mathbf{v}_1$ (bedrooms)', fontsize=13, color=c1, fontweight='bold', zorder=11)
ax.text(v2[0]-0.65, v2[1]+0.15, r'$\mathbf{v}_2$ (bathrooms)', fontsize=13, color=c2, fontweight='bold', zorder=11)
def fmt_label(a, b):
    def coeff(x, var):
        if x == 1: return var
        if x == -1: return f'-{var}'
        if x == int(x): return f'{int(x)}{var}'
        return f'{x}{var}'
    pa = coeff(a, r'\mathbf{v}_1')
    pb = coeff(abs(b), r'\mathbf{v}_2')
    return f'${pa} + {pb}$' if b >= 0 else f'${pa} - {pb}$'
examples = [(2, 1, (8, -6), 'left'), (-0.5, 2, (-10, 6), 'right'), (1, -1, (8, 2), 'left')]
dash_kw = dict(arrowstyle='->', lw=1.5, linestyle=(0, (5, 3)), shrinkA=0, shrinkB=2)
for a, b, text_off, ha in examples:
    sv1 = a * v1
    result = sv1 + b * v2
    ax.annotate('', xy=sv1, xytext=(0, 0), arrowprops=dict(**dash_kw, color=c1, alpha=0.55), zorder=4)
    ax.annotate('', xy=result, xytext=sv1, arrowprops=dict(**dash_kw, color=c2, alpha=0.55), zorder=4)
    ax.plot(*result, 'o', color=c_result, ms=8, zorder=8, markeredgecolor='white', markeredgewidth=1.2)
    ax.annotate(fmt_label(a, b), xy=result, fontsize=11, color=c_result,
                xytext=text_off, textcoords='offset points', ha=ha, zorder=9)
ax.set_xlim(-2.5, 7)
ax.set_ylim(-2, 6.5)
ax.set_aspect('equal')
ax.set_xlabel('listing 1', fontsize=12)
ax.set_ylabel('listing 2', fontsize=12)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_color('#cccccc')
ax.spines['bottom'].set_color('#cccccc')
ax.tick_params(colors='#666666', which='both')
plt.tight_layout()
plt.show()

TipThink About It

In the plot above, can you reach every point in \(\mathbb{R}^2\) using linear combinations of \(v_1\) and \(v_2\)? What if \(v_1\) and \(v_2\) pointed in the same direction?

Span: the set of all possible predictions

What’s the set of ALL points you can reach by choosing different weights? For the two vectors in the plot, you can reach every point in the plane — but only because \(v_1\) and \(v_2\) point in different directions. If they were parallel, you’d be stuck on a line.
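The parallel case can be confirmed numerically: np.linalg.matrix_rank counts independent directions, and a target off the line is unreachable (a quick check on made-up vectors, not the chapter's data):

```python
import numpy as np

v1 = np.array([2.0, 1.0])
v2 = np.array([4.0, 2.0])          # parallel: v2 = 2 * v1
V = np.column_stack([v1, v2])

print(np.linalg.matrix_rank(V))    # 1 — only one independent direction

# A target off the line cannot be reached: the least-squares gap is nonzero
target = np.array([0.0, 1.0])
coeffs = np.linalg.lstsq(V, target, rcond=None)[0]
gap = np.linalg.norm(target - V @ coeffs)
print(f"gap = {gap:.3f}")          # gap = 0.894 — stuck on the line
```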

For a linear model, this set is the set of all possible predictions. It tells you what your model can and cannot predict, before you fit anything. Let’s visualize it in 3D, where the geometry is easier to see. We use np.column_stack to assemble the feature columns into a matrix, and np.linalg.lstsq to find the closest point in the span to a target vector.

Code
v1_3d = np.array([2, 1, 0])
v2_3d = np.array([0, 1, 2])
target = np.array([1, 3, 1])

# Project target onto span(v1, v2) via least squares
V = np.column_stack([v1_3d, v2_3d])
coeffs = np.linalg.lstsq(V, target, rcond=None)[0]
proj = V @ coeffs
residual = target - proj

# --- Colors ---
PLANE_COLOR = '#b3d9ff'
GRID_CLR = '#5dade2'
V1_COLOR = '#d35400'
V2_COLOR = '#1a8a4a'
TARGET_COLOR = '#c0392b'
PROJ_COLOR = '#b8860b'
RESID_COLOR = '#555555'

fig = plt.figure(figsize=(10, 7.5))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=25, azim=225)

# Translucent plane (the span)
s_range = np.linspace(-0.5, 1.7, 30)
t_range = np.linspace(-0.5, 1.7, 30)
S, T = np.meshgrid(s_range, t_range)
ax.plot_surface(S * v1_3d[0] + T * v2_3d[0],
                S * v1_3d[1] + T * v2_3d[1],
                S * v1_3d[2] + T * v2_3d[2],
                alpha=0.2, color=PLANE_COLOR, edgecolor='none', shade=False)

# Subtle grid lines on the plane
for c in np.linspace(-0.2, 1.5, 7):
    for base, other in [(v1_3d, v2_3d), (v2_3d, v1_3d)]:
        p0, p1 = -0.4 * base + c * other, 1.6 * base + c * other
        ax.plot([p0[0], p1[0]], [p0[1], p1[1]], [p0[2], p1[2]],
                color=GRID_CLR, alpha=0.12, lw=0.5)

# Basis vectors (bold arrows)
for v, color in [(v1_3d, V1_COLOR), (v2_3d, V2_COLOR)]:
    ax.quiver(0, 0, 0, *v, color=color, arrow_length_ratio=0.06, lw=3.2)

# Line from origin to projection (shows it's a linear combination)
ax.plot([0, proj[0]], [0, proj[1]], [0, proj[2]],
        color=PROJ_COLOR, ls='-', lw=1.5, alpha=0.3)

# Points: target, projection, origin
ax.scatter(*proj, color=PROJ_COLOR, s=140, zorder=10,
           edgecolors='#6d5b00', linewidths=2.0)
ax.scatter(*target, color=TARGET_COLOR, s=140, zorder=10,
           edgecolors='#641e16', linewidths=2.0)
ax.scatter(0, 0, 0, color='black', s=30, zorder=5)

# Residual (dashed line from target to projection)
ax.plot([target[0], proj[0]], [target[1], proj[1]], [target[2], proj[2]],
        color=RESID_COLOR, ls=(0, (5, 3)), lw=2.8, alpha=0.8)

# Right-angle mark at projection (shows orthogonality)
resid_dir = residual / np.linalg.norm(residual)
in_plane = v1_3d / np.linalg.norm(v1_3d)
sz = 0.3
rp1 = proj + sz * resid_dir
rp2 = proj + sz * resid_dir + sz * in_plane
rp3 = proj + sz * in_plane
ax.plot([rp1[0], rp2[0], rp3[0]], [rp1[1], rp2[1], rp3[1]],
        [rp1[2], rp2[2], rp3[2]], color=RESID_COLOR, lw=1.5, alpha=0.6)

# Labels with thin leader lines to avoid overlap
ax.text(*(v1_3d * 1.12 + [0.1, 0.1, -0.15]),
        '$v_1$', fontsize=17, fontweight='bold', color=V1_COLOR)
ax.text(*(v2_3d * 1.12 + [-0.15, 0, 0.2]),
        '$v_2$', fontsize=17, fontweight='bold', color=V2_COLOR)

tgt_lbl = target + np.array([-0.4, 0.7, 0.3])
ax.plot([target[0], tgt_lbl[0]], [target[1], tgt_lbl[1]],
        [target[2], tgt_lbl[2]], color=TARGET_COLOR, lw=0.8, alpha=0.4)
ax.text(*tgt_lbl, 'target $y$', fontsize=14,
        fontweight='bold', color=TARGET_COLOR, ha='center', va='bottom')

proj_lbl = proj + np.array([0.5, -0.8, -0.4])
ax.plot([proj[0], proj_lbl[0]], [proj[1], proj_lbl[1]],
        [proj[2], proj_lbl[2]], color=PROJ_COLOR, lw=0.8, alpha=0.4)
ax.text(*proj_lbl, 'projection $\\hat{y}$', fontsize=13,
        fontweight='bold', color=PROJ_COLOR, ha='center', va='top')

mid = (target + proj) / 2
ax.text(*(mid + [0.6, 0.3, 0.3]), 'residual\n$y - \\hat{y}$',
        fontsize=10, color=RESID_COLOR, style='italic', ha='left', va='center')

ax.text(*(1.2 * v1_3d + 1.3 * v2_3d + [0, -0.2, 0]),
        'span($v_1, v_2$)', fontsize=10, color='#2471a3', alpha=0.55,
        style='italic')
ax.text(-0.15, -0.35, -0.1, '$0$', fontsize=11, color='gray')

# Clean up axes (geometry speaks for itself)
for axis in [ax.xaxis, ax.yaxis, ax.zaxis]:
    axis.pane.set_alpha(0.02)
    axis.line.set_alpha(0.1)
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.tick_params(axis='both', which='both', length=0)
ax.grid(True, alpha=0.06)

fig.suptitle('Projection onto the span of two vectors',
             fontsize=15, fontweight='bold', y=0.95)
fig.text(0.5, 0.02,
         'The closest reachable point to $y$ is its orthogonal '
         'projection $\\hat{y}$ onto the plane',
         fontsize=11, ha='center', color='#555555')
plt.subplots_adjust(top=0.92, bottom=0.06)
plt.show()

Two vectors in \(\mathbb{R}^3\) span a plane, not all of 3D space. That means some target values \(y\) are impossible to reach exactly — there are targets that no linear combination of your features can produce. The blue surface in the plot is the set of all reachable points — it has a name:

ImportantDefinition: Span

The span of a set of vectors is the set of ALL their linear combinations: \[\text{span}(v_1, v_2) = \{\beta_1 v_1 + \beta_2 v_2 : \beta_1, \beta_2 \in \mathbb{R}\}\]

The dashed line in the plot above is the residual: the gap between the target and the closest reachable point. Later in this chapter, we’ll see that regression finds exactly this closest point — the projection of \(y\) onto the column space. The residual is what’s left over.
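The right-angle mark in the figure is not just decoration — the residual of the least-squares projection is orthogonal to every spanning vector. A quick check, recomputing the projection from the plot above:

```python
import numpy as np

v1 = np.array([2.0, 1.0, 0.0])
v2 = np.array([0.0, 1.0, 2.0])
target = np.array([1.0, 3.0, 1.0])

V = np.column_stack([v1, v2])
coeffs = np.linalg.lstsq(V, target, rcond=None)[0]
residual = target - V @ coeffs

# Orthogonality: the residual has zero inner product with each column
print(np.allclose(V.T @ residual, 0))  # True
```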

Column space: from features to matrices

A DataFrame is already organized as columns of features. To do linear algebra, we stack those columns into a matrix \(X\) — each column is a feature, each row is an observation. The function np.column_stack does this from arrays; with pandas, it’s just selecting columns. Once we have \(X\), the matrix-vector product X @ beta computes the linear combination for all listings at once.

Code
# Build the design matrix from our tiny example
X = np.column_stack([bedrooms_vec, bathrooms_vec])
print("Design matrix X (5 listings x 2 features):")
print(X)
print()
print(f"Column 1 (bedrooms): {X[:, 0]}")
print(f"Column 2 (bathrooms): {X[:, 1]}")
print()

# Any prediction is X @ beta
beta = np.array([100, 50])
y_hat = X @ beta
print(f"With beta = {beta}:")
print(f"Prediction y_hat = X @ beta = {y_hat}")
Design matrix X (5 listings x 2 features):
[[1 1]
 [2 1]
 [3 2]
 [2 2]
 [1 1]]

Column 1 (bedrooms): [1 2 3 2 1]
Column 2 (bathrooms): [1 1 2 2 1]

With beta = [100  50]:
Prediction y_hat = X @ beta = [150 250 400 300 150]

Let’s unpack the notation \(\hat{y} = X\beta\). For a single listing \(i\), the prediction is a weighted sum of its features:

\[(X\beta)_i = \sum_{j=1}^{p} X_{ij}\,\beta_j = \beta_1 \, \text{bedrooms}_i + \beta_2 \, \text{bathrooms}_i\]

The matrix product computes this for every listing simultaneously. Equivalently, \(X\beta = \beta_1 x_1 + \beta_2 x_2\), where \(x_1\) and \(x_2\) are the columns of \(X\):

\[X = \begin{bmatrix} | & | \\ x_1 & x_2 \\ | & | \end{bmatrix}\]

Every prediction is a linear combination of the columns of \(X\). The set of all such predictions — for every possible \(\beta\) — is the span of the columns:

ImportantDefinition: Column Space

The column space of \(X\) is the span of its columns: \[\text{col}(X) = \text{span}(x_1, x_2) = \{X\beta : \beta \in \mathbb{R}^2\}\] This set is exactly the collection of all possible predictions \(\hat{y} = X\beta\) for a linear model. Different \(\beta\) = different predictions, but they all live in the column space.

The column space of \(X\) determines what your model can predict. Adding features (columns) can only grow the column space, never shrink it — so more features means a richer set of possible predictions. But there’s a catch…

WarningLinear models can predict impossible values

Nothing stops \(X\beta\) from predicting a negative price. The column space doesn’t know that prices must be positive. Linear models predict what the math allows, not what makes sense. (We’ll revisit this limitation when we encounter logistic regression in Chapter 13.)
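A concrete sketch with the tiny design matrix from above: a perfectly legal weight vector that yields negative "prices".

```python
import numpy as np

X = np.column_stack([np.array([1, 2, 3, 2, 1]),   # bedrooms
                     np.array([1, 1, 2, 2, 1])])  # bathrooms

beta = np.array([-100, 50])  # a valid (if absurd) choice of weights
print(X @ beta)              # [ -50 -150 -200 -100  -50] — all negative
```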

TipThink About It

If you have 3 features but one is an exact linear combination of the other two, what is the dimension of the column space? (Hint: it’s not 3.)
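One way to check your answer numerically (a sketch with made-up columns): build a third column as an exact combination of the first two and ask np.linalg.matrix_rank how many independent directions remain.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
x2 = np.array([1.0, 1.0, 2.0, 2.0, 1.0])
x3 = 2 * x1 - x2          # exactly dependent on the other two

X = np.column_stack([x1, x2, x3])
print(X.shape[1], np.linalg.matrix_rank(X))  # 3 columns, but rank 2
```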

Simple regression on real data

The column space tells us what a linear model can predict. Let’s fit one on real data and see.

We load the NYC Airbnb dataset with pd.read_csv, drop rows with missing values using .dropna(), and filter to reasonable prices.

Code
# Load and clean Airbnb data
airbnb = pd.read_csv(f'{DATA_DIR}/airbnb/listings.csv', low_memory=False)

# Price column — convert to numeric (strip any formatting characters)
airbnb['price_clean'] = pd.to_numeric(
    airbnb['price'].astype(str).str.replace('[$,]', '', regex=True),
    errors='coerce')

# Working subset
cols = ['bedrooms', 'bathrooms', 'beds', 'price_clean',
        'accommodates', 'number_of_reviews', 'room_type']
n_before = len(airbnb)
df = airbnb[cols].dropna()
print(f"Dropped {n_before - len(df):,} rows with missing values in {cols}")
df = df[(df['price_clean'] > 0) & (df['price_clean'] <= 500)]
print(f"{len(df):,} listings, price range: ${df['price_clean'].min():.0f} – ${df['price_clean'].max():.0f}")
df.head()
Dropped 17 rows with missing values in ['bedrooms', 'bathrooms', 'beds', 'price_clean', 'accommodates', 'number_of_reviews', 'room_type']
28,761 listings, price range: $10 – $500
bedrooms bathrooms beds price_clean accommodates number_of_reviews room_type
0 1 1.0 1.0 45 1 3 Private room
1 1 1.0 1.0 95 2 101 Private room
2 1 1.5 1.0 60 2 19 Entire home/apt
3 1 1.0 1.0 70 2 8 Private room
4 1 1.0 1.0 104 2 4 Entire home/apt

Simple regression: price ~ bedrooms

Let’s start simple. How does price change with the number of bedrooms? We create a LinearRegression() object from scikit-learn, call .fit() to find the best weights, and then .predict() to generate predictions. The fitted model stores the slope in .coef_ and the intercept in .intercept_.

Code
# Scatter plot + regression line
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df['bedrooms'], df['price_clean'], alpha=0.05, s=10)

# Fit simple regression
model_simple = LinearRegression()
model_simple.fit(df[['bedrooms']], df['price_clean'])
x_line = np.linspace(0, 6, 100)
y_line = model_simple.predict(x_line.reshape(-1, 1))
ax.plot(x_line, y_line, 'r-', lw=2.5, label=f'ŷ = {model_simple.intercept_:.0f} + {model_simple.coef_[0]:.0f} × bedrooms')

ax.set_xlabel('Bedrooms')
ax.set_ylabel('Price ($/night)')
ax.set_title('Each bedroom adds ~$50/night')
ax.legend(fontsize=12)
plt.tight_layout()
plt.show()

print(f"Each additional bedroom is associated with ${model_simple.coef_[0]:.0f} more per night.")
print(f"A 0-bedroom listing (studio) is predicted at ${model_simple.intercept_:.0f}/night — the intercept.")

Each additional bedroom is associated with $49 more per night.
A 0-bedroom listing (studio) is predicted at $71/night — the intercept.

The intercept captures the base price of a listing, independent of its size. Without it, the regression line would be forced through the origin — predicting $0 for a studio. Geometrically, including an intercept appends the constant column \(\mathbf{1} = (1, 1, \ldots, 1)\) to \(X\), enlarging the column space. LinearRegression() includes an intercept by default; to omit it, pass fit_intercept=False.
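The ones-column view can be checked directly on a small synthetic sample (made-up numbers, not the Airbnb data): solve least squares on the augmented matrix \([\mathbf{1} \mid x]\) — the weight on the ones column matches the textbook intercept formula.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 70 + 50 * x + rng.normal(0, 10, size=50)   # synthetic "prices"

# Append a column of ones; its weight plays the role of the intercept
X1 = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X1, y, rcond=None)[0]

# Closed form for simple regression
slope_cf = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept_cf = y.mean() - slope_cf * x.mean()

print(np.allclose([intercept, slope], [intercept_cf, slope_cf]))  # True
```

This is exactly what LinearRegression does under the hood when fit_intercept=True.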

Code
# What happens without an intercept?
model_no_intercept = LinearRegression(fit_intercept=False)
model_no_intercept.fit(df[['bedrooms']], df['price_clean'])

print(f"With intercept:    ŷ = {model_simple.intercept_:.0f} + {model_simple.coef_[0]:.0f} × bedrooms  (R² = {model_simple.score(df[['bedrooms']], df['price_clean']):.3f})")
print(f"Without intercept: ŷ = {model_no_intercept.coef_[0]:.0f} × bedrooms               (R² = {model_no_intercept.score(df[['bedrooms']], df['price_clean']):.3f})")
print()
print("Forcing the line through the origin loses predictive power and misrepresents studios.")
With intercept:    ŷ = 71 + 49 × bedrooms  (R² = 0.156)
Without intercept: ŷ = 95 × bedrooms               (R² = -0.035)

Forcing the line through the origin loses predictive power and misrepresents studios.
NoteWhy is it called “regression”?

The name comes from Francis Galton, who in the 1880s studied the heights of fathers and sons. Unusually tall fathers tended to have sons who were tall — but less tall than their fathers. Unusually short fathers had sons who were short — but less short. Heights “regressed toward the mean.”

Galton’s method for quantifying this trend — fitting a line through the data — became “regression,” and the name stuck even though we now use it for problems that have nothing to do with reverting to averages. We formalize regression to the mean below, once we connect \(r\) to \(R^2\). (For the full story, see Salsburg, The Lady Tasting Tea, Ch. 4.)

That’s the regression line. But why this line? There are infinitely many lines we could draw through this cloud. Regression picks the one that minimizes the total squared distance from the data to the line. Why squared distances, not absolute distances or something else?

Why squared error? It estimates the conditional mean.

Squared error leads to a linear algebra problem — the projection — with a clean closed-form solution. But there’s a deeper reason: it estimates the conditional mean \(\mathbb{E}[Y \mid X]\).

To see why, consider a simpler question: what constant \(c\) minimizes \(\mathbb{E}[(Y - c)^2]\)? Taking the derivative and setting it to zero:

\[\frac{d}{dc} \mathbb{E}[(Y-c)^2] = -2\,\mathbb{E}[Y-c] = 0 \quad \Longrightarrow \quad c = \mathbb{E}[Y]\]

The constant that minimizes mean squared error is the mean of \(Y\). The same argument applies conditional on \(X\): minimizing \(\mathbb{E}[(Y - f(X))^2]\) over all functions \(f\) gives \(f(X) = \mathbb{E}[Y \mid X]\). Regression restricts \(f\) to be linear, so the regression line is the best linear approximation to the conditional mean.

Other loss functions estimate other summaries: absolute error gives the conditional median, and quantile loss gives conditional quantiles. We could choose a different loss — and we will, in Chapter 13 for classification. For now, squared error gives us the most elegant geometry.
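The loss-to-summary correspondence is easy to see empirically. A sketch on skewed synthetic data (not the Airbnb prices): grid-search the best constant under each loss and compare with the mean and median.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=100, size=2_000)     # skewed, like nightly prices

c = np.linspace(0, 300, 601)                   # candidate constant predictions
sq_loss = ((y[:, None] - c) ** 2).mean(axis=0)
abs_loss = np.abs(y[:, None] - c).mean(axis=0)

best_sq, best_abs = c[sq_loss.argmin()], c[abs_loss.argmin()]
print(f"squared-loss minimizer  ~ {best_sq:.1f}  (mean   = {y.mean():.1f})")
print(f"absolute-loss minimizer ~ {best_abs:.1f}  (median = {np.median(y):.1f})")
```

On skewed data the two minimizers differ noticeably — squared loss chases the mean, absolute loss the median.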

We can verify this visually: if regression estimates the conditional mean, the regression line should pass through the average price at each value of bedrooms.

Code
# Conditional means: average price for each number of bedrooms
cond_means = df.groupby('bedrooms')['price_clean'].mean()
cond_means = cond_means[cond_means.index <= 6]

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df['bedrooms'], df['price_clean'], alpha=0.03, s=10, color='#cccccc')
ax.scatter(cond_means.index, cond_means.values, s=150, color='#c44e52',
           edgecolors='white', lw=2, zorder=10,
           label=r'conditional mean $\mathbb{E}[\mathrm{price} \mid \mathrm{bedrooms}]$')
x_line = np.linspace(0, 6, 100)
y_line = model_simple.predict(x_line.reshape(-1, 1))
ax.plot(x_line, y_line, '-', color='#2d6a9f', lw=2.5, label='regression line')
ax.set_xlabel('Bedrooms')
ax.set_ylabel('Price ($/night)')
ax.legend(fontsize=11, loc='upper left')
plt.tight_layout()
plt.show()

The red dots (conditional means) sit close to the regression line, confirming that least squares approximates \(\mathbb{E}[\text{price} \mid \text{bedrooms}]\). The fit isn’t exact because the regression line is constrained to be linear — the true conditional mean function may be nonlinear.

How good is the fit? Defining \(R^2\)

How much of the variation in prices does the model explain? We quantify this with \(R^2\).

The prediction \(\hat{y}\) splits the target into two pieces: what the model captures (\(\hat{y}\)) and what it misses (the residual \(e = y - \hat{y}\)). \(R^2\) compares the residual variation to the total variation in \(y\) (measured from its mean \(\bar{y}\)):

\[R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\text{residual variance}}{\text{total variance}}\]

An \(R^2\) of 0 means the model does no better than predicting the mean for every listing. An \(R^2\) of 1 means the model explains all the variation — every prediction matches exactly.

ImportantDefinition: \(R^2\) (coefficient of determination)

\[R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\]

\(R^2\) measures the fraction of variance in \(y\) “explained” by the model. It ranges from 0 (the model explains nothing beyond the mean) to 1 (perfect fit). We’ll see a geometric interpretation — \(R^2\) as a projection ratio — later in this chapter.

\(R^2 = r^2\) in simple regression

Before adding more features, let’s connect \(R^2\) to a quantity you’ve seen in introductory statistics: the correlation coefficient \(r\).

The correlation coefficient \(r\) measures the strength and direction of the linear association between two variables. We compute it with .corr(). It ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive). The regression line can be written as \(\hat{y} = r \cdot \frac{SD_y}{SD_x}(x - \bar{x}) + \bar{y}\), which makes \(r\) the slope in standardized units.

Code
# Compute the correlation coefficient between price and bedrooms
r = df['price_clean'].corr(df['bedrooms'])
print(f"Correlation between price and bedrooms: r = {r:.4f}")
print(f"r² = {r**2:.4f}")
print()

# Compare to R² from the simple regression
y_hat_simple = model_simple.predict(df[['bedrooms']])
ss_res = np.sum((df['price_clean'] - y_hat_simple)**2)
ss_tot = np.sum((df['price_clean'] - df['price_clean'].mean())**2)
r2_simple_model = 1 - ss_res / ss_tot
print(f"R² from simple regression: {r2_simple_model:.4f}")
print(f"They match: r² = R² in simple regression.")
Correlation between price and bedrooms: r = 0.3944
r² = 0.1556

R² from simple regression: 0.1556
They match: r² = R² in simple regression.

For a model with one predictor, \(R^2\) is literally the square of the correlation coefficient \(r\). Why? The simple regression prediction is \(\hat{y}_i = \bar{y} + r \cdot \frac{SD_y}{SD_x}(x_i - \bar{x})\), so the explained variance is \(r^2\) times the total variance. With multiple predictors, \(R^2\) still measures the fraction of variance explained, but there’s no single \(r\) to square.

The connection between \(r\) and \(R^2\) also explains Galton’s original observation. For real data, \(|r| < 1\), so the predicted value \(\hat{y}\) is always closer to the mean than \(x\) is (in standardized units). Fathers who are 2 SDs above average height have sons whose predicted height is only \(r \times 2\) SDs above average — less extreme, pulled back toward the mean.
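To make this concrete, here is a small simulation of Galton's setup (the correlation value and sample size are invented for illustration): standardized father and son heights correlated at \(r = 0.5\).

```python
# Hypothetical simulation of regression to the mean:
# father and son heights (in SD units) with correlation r.
import numpy as np

rng = np.random.default_rng(1)
r = 0.5                                    # assumed correlation
n = 100_000
father_z = rng.standard_normal(n)
son_z = r * father_z + np.sqrt(1 - r**2) * rng.standard_normal(n)

# Condition on fathers roughly 2 SDs above average
tall = (father_z > 1.9) & (father_z < 2.1)
print(f"mean son z-score, given father z near 2: {son_z[tall].mean():.2f}")
print(f"r x 2 = {r * 2:.2f}: the prediction is pulled toward the mean")
```

Sons of very tall fathers average about \(r \times 2 = 1\) SD above the mean, less extreme than their fathers, exactly as the formula predicts.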

ImportantDefinition: Regression to the mean

Regression to the mean is the phenomenon whereby extreme observations tend to be followed by less extreme ones. It arises whenever the correlation between successive measurements is imperfect (\(|r| < 1\)) and is a purely statistical phenomenon — not a biological or physical one. Galton’s observation of heights “regressing” toward the average is the origin of the term “regression” itself.

Multiple regression: more features, bigger column space

What happens when we add more features? The column space gets bigger — and so does \(R^2\).

Our model says each bedroom adds some amount per night. But bigger apartments also have more bathrooms and can accommodate more guests. The bedroom coefficient might be picking up the effect of all of these correlated features. To isolate the bedroom effect, we need to hold other features constant.

Each new (independent) feature expands the column space, giving us a richer set of possible predictions.

To include a categorical variable like room_type, we use pd.get_dummies() to create binary (0/1) columns — one per category. We drop one category (the reference level) to avoid redundancy; drop_first=True handles this. In Chapter 5, we’ll explore this technique and many others in detail.

Code
# Multiple regression: price ~ bedrooms + bathrooms + room_type
# One-hot encode room_type
df_model = df.copy()
df_model = pd.get_dummies(df_model, columns=['room_type'], drop_first=True, dtype=float)

features = ['bedrooms', 'bathrooms', 'room_type_Private room', 'room_type_Shared room']
X_multi = df_model[features]
y_multi = df_model['price_clean']

model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)

print("Multiple regression: price ~ bedrooms + bathrooms + room_type")
print(f"  Intercept: ${model_multi.intercept_:.2f}")
print()
for feat, coef in zip(features, model_multi.coef_):
    print(f"  {feat:30s}  ${coef:+.2f}")
Multiple regression: price ~ bedrooms + bathrooms + room_type
  Intercept: $109.23

  bedrooms                        $+32.96
  bathrooms                       $+21.86
  room_type_Private room          $-90.04
  room_type_Shared room           $-112.26

Coefficient interpretation:

  • Each additional bedroom is associated with about $33 more per night, holding other features constant.
  • Each additional bathroom is associated with about $22 more per night.
  • A private room costs roughly $90 less than an entire home/apt, all else equal.

Now we can answer the opening question: our host should charge about $22 more for that second bathroom, holding bedrooms and room type fixed.

The “holding constant” part is crucial — multiple regression provides something simple regression cannot. The coefficient on bedrooms means that if we compare two listings with the same number of bathrooms, the same room type, and the same everything else — but one has one more bedroom — the model predicts the extra bedroom is worth about $33/night.

These coefficients are point estimates — we’ll quantify their uncertainty with confidence intervals in Chapter 12.

The geometric view: orthogonality and the normal equations

We’ve seen that the column space determines what a linear model can predict. Now let’s understand how regression picks the best prediction in that space — and why orthogonality is the key.

Orthogonality in real features

We saw that collinear features (small angle) carry redundant information. What about the opposite extreme — features at 90 degrees to each other? Let’s check which Airbnb features are closest to perpendicular.

Code
# Orthogonality computation: are any Airbnb features approximately orthogonal?
# Center the vectors (subtract the mean) so dot product corresponds to correlation
beds_c = df['beds'].values - df['beds'].mean()
reviews_c = df['number_of_reviews'].values - df['number_of_reviews'].mean()
bedrooms_c = df['bedrooms'].values - df['bedrooms'].mean()

# Dot products (unnormalized)
dot_beds_reviews = np.dot(beds_c, reviews_c)
dot_beds_bedrooms = np.dot(beds_c, bedrooms_c)

# Cosine similarity = correlation for centered vectors
cos_beds_reviews = dot_beds_reviews / (np.linalg.norm(beds_c) * np.linalg.norm(reviews_c))
cos_beds_bedrooms = dot_beds_bedrooms / (np.linalg.norm(beds_c) * np.linalg.norm(bedrooms_c))

print(f"beds vs number_of_reviews:")
print(f"  dot product = {dot_beds_reviews:,.0f}")
print(f"  cosine similarity (= correlation) = {cos_beds_reviews:.3f}")
print(f"  angle = {np.degrees(np.arccos(np.clip(cos_beds_reviews, -1, 1))):.1f}°")
print()
print(f"beds vs bedrooms:")
print(f"  dot product = {dot_beds_bedrooms:,.0f}")
print(f"  cosine similarity (= correlation) = {cos_beds_bedrooms:.3f}")
print(f"  angle = {np.degrees(np.arccos(np.clip(cos_beds_bedrooms, -1, 1))):.1f}°")
print()
print("Beds and reviews are nearly orthogonal — they carry independent information.")
print("Beds and bedrooms are far from orthogonal — they carry redundant information.")
beds vs number_of_reviews:
  dot product = 106,078
  cosine similarity (= correlation) = 0.095
  angle = 84.6°

beds vs bedrooms:
  dot product = 12,905
  cosine similarity (= correlation) = 0.652
  angle = 49.3°

Beds and reviews are nearly orthogonal — they carry independent information.
Beds and bedrooms are far from orthogonal — they carry redundant information.

Beds and reviews aren’t exactly perpendicular — the cosine similarity is near zero, not zero. But the contrast with beds-vs-bedrooms is stark. The exact case, where the dot product is precisely zero, has a name:

ImportantDefinition: Orthogonal Vectors

Two vectors \(u\) and \(v\) are orthogonal (perpendicular) if their inner product is zero: \[u \cdot v = 0\] For centered data vectors, orthogonality corresponds to zero correlation — the features are uncorrelated. Uncorrelated features carry non-redundant information; neither can be predicted from the other by a linear model.

Orthogonal features (90 degrees apart) expand the column space maximally; parallel features (0 degrees) don’t expand it at all. Orthogonality isn’t just a property of features — it’s the key to understanding why regression works.
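A synthetic sketch of why this matters (the features and coefficients below are invented): when a second feature is orthogonal to the first, adding it barely moves the first feature's coefficient; when the features are correlated, the simple-regression coefficient absorbs part of the other feature's effect, and only the multiple regression recovers it.

```python
# Sketch: orthogonal features leave each other's coefficients alone;
# correlated features contaminate a simple regression.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x1 = rng.standard_normal(n)
x2_orth = rng.standard_normal(n)                      # uncorrelated with x1
x2_corr = 0.9 * x1 + 0.44 * rng.standard_normal(n)    # strongly correlated with x1
noise = rng.standard_normal(n)

def slope_on_x1(X, y):
    """Least squares with intercept; return the coefficient on x1."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.solve(X1.T @ X1, X1.T @ y)[1]

y_orth = 3 * x1 + 2 * x2_orth + noise    # true x1 effect: 3
y_corr = 3 * x1 + 2 * x2_corr + noise    # true x1 effect: 3

b1_simple_orth = slope_on_x1(x1.reshape(-1, 1), y_orth)
b1_simple_corr = slope_on_x1(x1.reshape(-1, 1), y_corr)
b1_multi_corr = slope_on_x1(np.column_stack([x1, x2_corr]), y_corr)

print(f"x2 orthogonal, x1 alone:       {b1_simple_orth:.2f}  (close to 3)")
print(f"x2 correlated, x1 alone:       {b1_simple_corr:.2f}  (x1 absorbs x2's effect)")
print(f"x2 correlated, both included:  {b1_multi_corr:.2f}  (close to 3 again)")
```

This is the "holding other features constant" story in miniature: only when the correlated feature is in the model does the coefficient on x1 isolate its own effect.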

Prediction lives in the column space

Here’s the key idea:

  • Your feature matrix \(X\) has columns \(x_1, x_2, \ldots\)
  • Any prediction \(\hat{y} = X\beta\) is a linear combination of those columns
  • So \(\hat{y}\) must live in the column space of \(X\)

The true prices \(y\) probably do NOT live in the column space. So regression finds the closest point in the column space to \(y\). That closest point is the projection of \(y\) onto \(\text{col}(X)\).

As George Box put it: “All models are wrong, but some are useful.” The prediction \(\hat{y}\) won’t equal \(y\) exactly — the model is wrong — but projection gives us the best approximation within the column space.
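One way to see the projection concretely: the matrix \(H = X(X^TX)^{-1}X^T\) (often called the hat matrix) maps \(y\) to \(\hat{y}\). The sketch below uses a tiny random design matrix; note that projecting a second time changes nothing, because \(\hat{y}\) already lies in the column space.

```python
# Sketch: the projection matrix H = X (X^T X)^{-1} X^T sends y to the
# closest point y_hat in col(X).
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(6), rng.standard_normal((6, 2))])  # 6 rows, 3 columns
y = rng.standard_normal(6)

# Explicit inverse is fine for a 3x3 demo; prefer solve/QR at scale.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

print("y_hat already in col(X): H @ y_hat = y_hat?", np.allclose(H @ y_hat, y_hat))
print("H is idempotent: H @ H = H?               ", np.allclose(H @ H, H))
print("residual orthogonal to col(X): X^T e = 0? ", np.allclose(X.T @ (y - y_hat), 0))
```

Idempotence (\(H^2 = H\)) is the algebraic signature of a projection: once you are in the column space, projecting again does nothing.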

TipThink About It

If \(\hat{y}\) is the closest point in the column space to \(y\), what direction must \(y - \hat{y}\) point? (Hint: think about the closest point on a line to a point not on the line.)

Code
# Visualization code — focus on the output, not the plotting details.
# This draws a 2D cartoon: y is a point, col(X) is a line,
# and ŷ is the projection of y onto that line.

fig, ax = plt.subplots(figsize=(7, 6))

# The column space (a line through the origin in this picture)
t = np.linspace(-0.5, 2.5, 100)
ax.plot(t, 0.5 * t, 'C0-', lw=2, label='Column space of $X$')

# y (the target)
y_pt = np.array([1.5, 1.8])
ax.plot(*y_pt, 'ro', ms=10, zorder=5)
ax.annotate('$y$ (true prices)', xy=y_pt, fontsize=13,
            xytext=(15, 10), textcoords='offset points', color='red')

# ŷ (the projection)
# Project y onto the line (direction [1, 0.5])
d = np.array([1, 0.5])
d_unit = d / np.linalg.norm(d)
yhat_pt = np.dot(y_pt, d_unit) * d_unit
ax.plot(*yhat_pt, 'C0o', ms=10, zorder=5)
ax.annotate(r'$\hat{y}$ (prediction)', xy=yhat_pt, fontsize=13,
            xytext=(-80, -25), textcoords='offset points', color='C0')

# Residual (orthogonal)
ax.annotate('', xy=y_pt, xytext=yhat_pt,
            arrowprops=dict(arrowstyle='->', color='green', lw=2, ls='--'))
ax.text(1.25, 1.3, '$e = y - \\hat{y}$\n(residual)', fontsize=12, color='green')

# Right angle marker
angle_size = 0.1
corner = yhat_pt
perp = (y_pt - yhat_pt)
perp_unit = perp / np.linalg.norm(perp)
par_unit = d_unit
sq = np.array([corner, corner + angle_size*par_unit,
               corner + angle_size*par_unit + angle_size*perp_unit,
               corner + angle_size*perp_unit, corner])
ax.plot(sq[:, 0], sq[:, 1], 'k-', lw=1)

ax.set_xlim(-0.5, 2.5)
ax.set_ylim(-0.5, 2.5)
ax.set_aspect('equal')
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_title('Regression = projection onto column space')
ax.legend(loc='lower right', fontsize=11)
plt.tight_layout()
plt.show()

The residual \(e = y - \hat{y}\) is orthogonal to the column space. Orthogonality is the defining property of the least squares solution.

Why orthogonal? The perpendicular from a point to a subspace gives the closest point — just as dropping a perpendicular from a point to a line gives the nearest point on the line.

Why orthogonality gives us the answer

We’re about to derive a formula for the regression coefficients that explains WHY sklearn gives the answer it does. The argument has three steps: name the error, see why orthogonality is the optimality condition, and write it as an equation.

Step 1: Name the error. Call the residual \(e = y - X\beta\) — the gap between the true prices and what the model predicts. Different \(\beta\) produce different residuals; we want the \(\beta\) that makes \(e\) as small as possible.

Step 2: The improvement argument. Suppose the residual \(e\) is not orthogonal to some feature column \(x_j\) — that is, \(x_j \cdot e \neq 0\). Then the errors still have a systematic pattern aligned with \(x_j\). We could reduce the error by nudging \(\beta_j\) in the direction of \(x_j \cdot e\), absorbing that pattern into the prediction. So the best \(\beta\) cannot leave any such pattern behind: it must satisfy \(x_j \cdot e = 0\) for every feature \(j\).
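Here is the improvement step in code, on invented data: starting from a deliberately bad \(\beta\), one small nudge in the direction of \(x_1 \cdot e\) shrinks both the alignment with \(x_1\) and the total squared error.

```python
# Sketch of the improvement argument: if x1 . e != 0, nudging beta_1
# in that direction lowers the sum of squared errors.
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
y = 5 + 2 * X[:, 1] + rng.standard_normal(200)

beta = np.array([0.0, 0.0])                 # deliberately bad guess
e = y - X @ beta
dot0, sse0 = X[:, 1] @ e, np.sum(e**2)
print(f"before nudge: x1.e = {dot0:9.2f},  SSE = {sse0:10.2f}")

beta[1] += 0.001 * dot0                     # small step toward x1 . e
e = y - X @ beta
dot1, sse1 = X[:, 1] @ e, np.sum(e**2)
print(f"after nudge:  x1.e = {dot1:9.2f},  SSE = {sse1:10.2f}  (both shrink)")
```

Repeating this nudge drives \(x_1 \cdot e\) toward zero, which is exactly the orthogonality condition the derivation arrives at.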

Step 3: Write it as one equation. Stacking the conditions \(x_j \cdot e = 0\) for all features into a single matrix equation:

\[X^T(y - X\beta) = 0\]

Each row of this equation says “the residual has zero dot product with one column of \(X\).” Rearranging:

\[X^T X \beta = X^T y \qquad \Longrightarrow \qquad \beta = (X^T X)^{-1} X^T y\]

ImportantDefinition: Normal Equations

The normal equations \(X^TX\beta = X^Ty\) give the least squares solution \(\beta = (X^TX)^{-1}X^Ty\). The formula requires \(X^TX\) to be invertible, which means the columns of \(X\) must be linearly independent (recall the column space section above). Perfectly collinear features make the matrix singular, so no unique solution exists.

You may have heard “regression minimizes the sum of squared errors.” Now you can see why: the prediction that minimizes \(\|y - \hat{y}\|^2\) is exactly the orthogonal projection of \(y\) onto \(\text{col}(X)\).

The normal equations give us the exact solution in one step. For larger problems or different loss functions where no closed-form solution exists, we’ll need an iterative approach called gradient descent (Chapter 13).
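As a preview (not the full Chapter 13 treatment), plain gradient descent on the squared error reaches the same coefficients as the normal equations on a small synthetic problem; the learning rate and iteration count below are chosen for illustration.

```python
# Preview sketch: gradient descent on ||y - X beta||^2 converges to the
# same beta as the normal equations (synthetic data).
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(300), rng.standard_normal(300)])
y = 4 + 1.5 * X[:, 1] + rng.standard_normal(300)

beta_exact = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations

beta = np.zeros(2)
lr = 0.001                                        # illustrative learning rate
for _ in range(5_000):
    grad = -2 * X.T @ (y - X @ beta)              # gradient of the squared error
    beta -= lr * grad

print("normal equations:", np.round(beta_exact, 4))
print("gradient descent:", np.round(beta, 4))
```

The closed form wins here because the problem is small and the loss is quadratic; gradient descent earns its keep when neither is true.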

Code
# Verify: the normal equations give the same answer as sklearn

# Simple regression with an intercept: price = beta0 + beta1 * bedrooms
X_bedrooms = np.column_stack([np.ones(len(df)), df['bedrooms'].values])
prices = df['price_clean'].values

# Normal equations: beta = (X^T X)^{-1} X^T y
# We use np.linalg.solve rather than np.linalg.inv — computing matrix
# inverses explicitly is numerically unstable. For even better numerical
# stability, production code uses the QR factorization (VMLS Ch 13).
beta_normal = np.linalg.solve(X_bedrooms.T @ X_bedrooms, X_bedrooms.T @ prices)

print("Normal equations:  intercept = {:.2f}, slope = {:.2f}".format(*beta_normal))
print("sklearn:           intercept = {:.2f}, slope = {:.2f}".format(
    model_simple.intercept_, model_simple.coef_[0]))
Normal equations:  intercept = 71.30, slope = 48.59
sklearn:           intercept = 71.30, slope = 48.59

The two solutions match. Now let’s verify the orthogonality property directly: the residuals should have zero dot product with every column of \(X\).

Code
# Verify orthogonality: residuals are orthogonal to each column of X
prices_hat = X_bedrooms @ beta_normal
residuals = prices - prices_hat

dot_with_ones = np.dot(residuals, X_bedrooms[:, 0])     # dot with intercept column
dot_with_bedrooms = np.dot(residuals, X_bedrooms[:, 1])  # dot with bedrooms column

print("Residuals dot (column of 1s):  {:.6f}  (≈ 0 ✓)".format(dot_with_ones))
print("Residuals dot (bedrooms):      {:.6f}  (≈ 0 ✓)".format(dot_with_bedrooms))
print()
print("The residual is orthogonal to every column of X.")
print("This IS the normal equation: X^T(y - Xβ) = 0.")
print("The errors have no pattern left that could be captured by the features.")
Residuals dot (column of 1s):  -0.000000  (≈ 0 ✓)
Residuals dot (bedrooms):      -0.000000  (≈ 0 ✓)

The residual is orthogonal to every column of X.
This IS the normal equation: X^T(y - Xβ) = 0.
The errors have no pattern left that could be captured by the features.

\(R^2\) as a projection ratio

We defined \(R^2\) above as a ratio of sums of squares. Now that we’ve seen the geometry of projection, we can understand \(R^2\) as an angle.

After centering \(y\) (subtracting the mean), the Pythagorean theorem gives an exact decomposition:

\[\|y - \bar{y}\|^2 = \|\hat{y} - \bar{y}\|^2 + \|e\|^2\]

The two terms on the right are orthogonal — \(\hat{y} - \bar{y}\) lives in the column space, and \(e\) is perpendicular to it. This equality holds when the model includes an intercept (which ours does); the intercept ensures residuals sum to zero via the normal equation \(\mathbf{1}^T e = 0\).

Dividing both sides by \(\|y - \bar{y}\|^2\):

\[1 = \frac{\|\hat{y} - \bar{y}\|^2}{\|y - \bar{y}\|^2} + \frac{\|e\|^2}{\|y - \bar{y}\|^2} = R^2 + (1 - R^2)\]

This is precisely \(1 = \cos^2\theta + \sin^2\theta\), where \(\theta\) is the angle between the centered response \(y - \bar{y}\) and the centered predictions \(\hat{y} - \bar{y}\). So \(R^2 = \cos^2\theta\): it measures how close \(y\) is to the column space, as an angle.
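We can verify the identity \(R^2 = \cos^2\theta\) directly. The sketch below fits a simple regression on synthetic data (the slope and noise level are invented) and compares the squared cosine of the angle to \(R^2\).

```python
# Verify R^2 = cos^2(theta) between the centered response and the
# centered predictions, for a model with an intercept.
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(500)
y = 2 + 1.3 * x + rng.standard_normal(500)

X = np.column_stack([np.ones(500), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta

yc, yhc = y - y.mean(), y_hat - y_hat.mean()   # centered response and predictions
cos_theta = (yc @ yhc) / (np.linalg.norm(yc) * np.linalg.norm(yhc))
r2 = 1 - np.sum((y - y_hat)**2) / np.sum(yc**2)

print(f"cos^2(theta) = {cos_theta**2:.6f}")
print(f"R^2          = {r2:.6f}")
```

The two numbers agree to floating-point precision; with an intercept in the model, the equality is exact, not approximate.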

Code
# R² as a projection ratio: the Pythagorean picture
fig, ax = plt.subplots(figsize=(6, 6))

# Choose vectors so the right angle is visually obvious.
# A steeper column-space direction makes the triangle less flat.
origin = np.array([0.0, 0.0])
col_dir = np.array([2.0, 1.0])        # direction of column space
y_vec = np.array([1.8, 2.8])          # y - ȳ (centered response)
# Compute ŷ as the true orthogonal projection onto the column space direction.
yhat_vec = col_dir * (np.dot(y_vec, col_dir) / np.dot(col_dir, col_dir))
e_vec = y_vec - yhat_vec               # residual (orthogonal by construction)

# Column space line
slope = col_dir[1] / col_dir[0]
t = np.linspace(-0.3, 3.5, 100)
ax.plot(t, slope * t, color='#2d8a4e', lw=2.5, alpha=0.25)
ax.text(3.1, slope * 3.1 + 0.12, 'col($X$)', fontsize=11,
        color='#2d8a4e', alpha=0.7, style='italic')

# ŷ - ȳ vector (projection, along column space)
ax.annotate('', xy=yhat_vec, xytext=origin,
            arrowprops=dict(arrowstyle='->', color='#2d6a9f', lw=2.8))
ax.text(yhat_vec[0] + 0.08, yhat_vec[1] - 0.25,
        '$\\hat{y} - \\bar{y}$', fontsize=14, color='#2d6a9f', fontweight='bold')

# y - ȳ vector (hypotenuse)
ax.annotate('', xy=y_vec, xytext=origin,
            arrowprops=dict(arrowstyle='->', color='#c0392b', lw=2.8))
ax.text(y_vec[0] - 0.55, y_vec[1] + 0.12, '$y - \\bar{y}$',
        fontsize=14, color='#c0392b', fontweight='bold')

# Residual e (orthogonal to column space)
ax.annotate('', xy=y_vec, xytext=yhat_vec,
            arrowprops=dict(arrowstyle='->', color='#555555', lw=2.2, ls='--'))
mid_e = (y_vec + yhat_vec) / 2
ax.text(mid_e[0] + 0.15, mid_e[1] + 0.05, '$e$',
        fontsize=15, color='#555555', style='italic')

# Right angle marker (large enough to be clearly visible)
sz = 0.2
p_dir = col_dir / np.linalg.norm(col_dir)   # unit vector along column space
r_dir = e_vec / np.linalg.norm(e_vec)        # unit vector along residual
sq = np.array([yhat_vec + sz * p_dir,
               yhat_vec + sz * p_dir + sz * r_dir,
               yhat_vec + sz * r_dir])
ax.plot(sq[:, 0], sq[:, 1], 'k-', lw=1.2)

# Angle arc for theta at the origin
angle_y = np.arctan2(y_vec[1], y_vec[0])
angle_yhat = np.arctan2(yhat_vec[1], yhat_vec[0])
arc_r = 0.7
theta_range = np.linspace(angle_yhat, angle_y, 40)
ax.plot(arc_r * np.cos(theta_range), arc_r * np.sin(theta_range),
        color='#8e44ad', lw=2)
mid_angle = (angle_y + angle_yhat) / 2
ax.text(arc_r * np.cos(mid_angle) + 0.08,
        arc_r * np.sin(mid_angle) + 0.05,
        '$\\theta$', fontsize=15, color='#8e44ad')

# Formula box (top-right, away from the vectors)
ax.text(2.4, 3.15,
        '$R^2 = \\cos^2\\theta'
        ' = \\dfrac{\\|\\hat{y} - \\bar{y}\\|^2}{\\|y - \\bar{y}\\|^2}$',
        fontsize=13, color='#333333',
        bbox=dict(boxstyle='round,pad=0.4', facecolor='#f0f0f0',
                  edgecolor='#cccccc'))

# Origin label
ax.plot(0, 0, 'ko', ms=5, zorder=5)
ax.text(-0.18, -0.18, '$\\bar{y}$', fontsize=12, color='gray')

ax.set_xlim(-0.6, 3.8)
ax.set_ylim(-0.5, 3.6)
ax.set_aspect('equal')
for spine in ax.spines.values():
    spine.set_visible(False)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('$R^2$ measures how close $y$ is to the column space',
             fontsize=12, pad=10)
plt.tight_layout()
plt.show()

Code
# Compare simple vs multiple regression
model_bed_only = LinearRegression().fit(df[['bedrooms']], df['price_clean'])
y_hat_simple = model_bed_only.predict(df[['bedrooms']])
y_hat_multi = model_multi.predict(X_multi)

r2_simple = 1 - np.sum((y_multi - y_hat_simple)**2) / np.sum((y_multi - y_multi.mean())**2)
r2_multi = 1 - np.sum((y_multi - y_hat_multi)**2) / np.sum((y_multi - y_multi.mean())**2)

print(f"R² with bedrooms only:                          {r2_simple:.3f}")
print(f"R² with bedrooms + bathrooms + room_type:        {r2_multi:.3f}")
print()
print("More independent features → larger column space → better projection → higher R²")
R² with bedrooms only:                          0.156
R² with bedrooms + bathrooms + room_type:        0.446

More independent features → larger column space → better projection → higher R²

The Pythagorean theorem gives us an exact decomposition: total variance = explained variance + residual variance.

Code
# Verify the Pythagorean decomposition
y_vals = y_multi.values
y_bar = y_vals.mean()
y_hat_vals = y_hat_multi

ss_total = np.sum((y_vals - y_bar)**2)        # ||y - ȳ||²
ss_model = np.sum((y_hat_vals - y_bar)**2)    # ||ŷ - ȳ||²
ss_resid = np.sum((y_vals - y_hat_vals)**2)   # ||e||²

print(f"SS_total = {ss_total:,.0f}")
print(f"SS_model = {ss_model:,.0f}")
print(f"SS_resid = {ss_resid:,.0f}")
print(f"SS_model + SS_resid = {ss_model + ss_resid:,.0f}  (≈ SS_total ✓)")
print()
print(f"R² = SS_model / SS_total = {ss_model/ss_total:.4f}")
print(f"R² = 1 - SS_resid/SS_total = {1 - ss_resid/ss_total:.4f}")
SS_total = 200,842,403
SS_model = 89,496,826
SS_resid = 111,345,577
SS_model + SS_resid = 200,842,403  (≈ SS_total ✓)

R² = SS_model / SS_total = 0.4456
R² = 1 - SS_resid/SS_total = 0.4456

An \(R^2\) around 0.45 means our model explains about 45% of the variation in prices. The remaining 55% comes from factors we didn’t include — neighborhood, amenities, photos, host reputation.

TipThink About It

Can \(R^2\) ever decrease when you add a feature? No. Adding a feature expands the column space — the projection onto a larger subspace can only get closer to \(y\), never farther. So training \(R^2\) can only increase (or stay the same). But as we’ll see in Chapter 7, test \(R^2\) can absolutely decrease.
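A quick demonstration on invented data: append a column of pure noise, completely unrelated to \(y\), and training \(R^2\) still does not drop.

```python
# Demo: adding a pure-noise feature cannot lower training R^2.
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
x = rng.standard_normal(n)
y = 2 * x + rng.standard_normal(n)

def train_r2(X, y):
    """Training R^2 of a least squares fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
    e = y - X1 @ beta
    return 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

noise = rng.standard_normal(n)             # feature with no relation to y
r2_before = train_r2(x.reshape(-1, 1), y)
r2_after = train_r2(np.column_stack([x, noise]), y)

print(f"R^2 with x only:    {r2_before:.5f}")
print(f"R^2 with x + noise: {r2_after:.5f}  (never smaller)")
```

The tiny increase from the noise column is the model fitting chance alignment in the training data; on new data that "improvement" would vanish.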

Checking the residuals

The normal equations guarantee \(X^Te = 0\) — that’s a mathematical fact. But a good model should also have residuals that look like random noise. Why? If you can see a pattern in the residuals — a trend, a curve, a fan shape — then the residuals are predictable. And if you can predict the residual, you can add that prediction to \(\hat{y}\) and get a better model. Residuals that look like noise mean there’s nothing left to extract.

The orthogonality condition only guarantees no linear pattern with the included features. It says nothing about nonlinear patterns or features you left out. We check with a scatter plot of residuals vs. predicted values and a histogram (.hist()) of the residual distribution.

Code
# Residual plot
residuals_multi = y_multi - y_hat_multi

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].scatter(y_hat_multi, residuals_multi, alpha=0.05, s=10)
axes[0].axhline(0, color='red', lw=1.5)
axes[0].set_xlabel('Predicted price ($)')
axes[0].set_ylabel('Residual ($)')
axes[0].set_title('Residual spread increases with predicted price')

axes[1].hist(residuals_multi, bins=60, edgecolor='white', alpha=0.7)
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Count')
axes[1].set_title('Residuals are roughly symmetric but heavy-tailed')

plt.tight_layout()
plt.show()

print("Notice: the spread of residuals increases with predicted price.")
print("This is heteroscedasticity — the model's errors are not constant.")

Notice: the spread of residuals increases with predicted price.
This is heteroscedasticity — the model's errors are not constant.

The residual plot reveals a fan shape: the spread of residuals increases with predicted price. The residual histogram is right-skewed, not symmetric. Both patterns signal that a simple linear model on raw prices has room for improvement — transformations, additional features, or a different model family might help. We’ll develop these diagnostic tools in Chapter 5.

Preview

In Chapter 5, we’ll transform raw data — categories, text, nonlinear patterns — into features a linear model can use. We’ll also develop diagnostic tools for when things go wrong: multicollinearity, overfitting, and residual patterns that signal missing structure.

Key takeaways

  • Vectors in data: each feature column is a vector, each data point is a vector.
  • Linear combinations: a linear model predicts \(\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \cdots\) — a linear combination of feature columns.
  • Span and column space: the set of all possible predictions is the column space of \(X\).
  • Regression finds the closest point in the column space to \(y\) — the projection.
  • The residual is orthogonal to the column space — this is why the formula works.
  • \(R^2\) measures how much of \(y\) lives in the column space — the fraction of variance explained.
  • More features = larger column space = better fit (on training data) — but not always better predictions on new data.

Further reading

Want to build more intuition? The 3Blue1Brown “Essence of Linear Algebra” series is excellent and requires no prerequisites — watch at least episodes 1–4.

For a deeper formal treatment: VMLS (Boyd & Vandenberghe), Chapters 1 (Vectors) and 5 (Linear independence).

Study guide

Key ideas

  • Vector: an ordered list of numbers. Each feature column and each data row is a vector.
  • Linear function: \(f\) is linear if \(f(ax + by) = af(x) + bf(y)\). Linear models are linear in the weights \(\beta\).
  • Inner product (dot product): \(u \cdot v = \sum u_i v_i\) — measures similarity between vectors (but depends on magnitude; use cosine similarity to isolate direction).
  • Cosine similarity: \(\cos\theta = \frac{u \cdot v}{\|u\|\|v\|}\) — directional similarity, independent of magnitude. Equals correlation for centered vectors.
  • Norm: \(\|v\| = \sqrt{v \cdot v}\) — the length/magnitude of a vector.
  • Distance: \(d(u,v) = \|u - v\|\) — measures dissimilarity between vectors.
  • Linear combination: \(\beta_1 v_1 + \beta_2 v_2\) — a weighted sum of vectors. A scalar is a single number used as a weight.
  • Span: the set of all linear combinations of a set of vectors.
  • Column space: the span of the columns of a matrix \(X\) — equivalently, the set of all possible predictions \(X\beta\).
  • Orthogonal: vectors with zero inner product; for centered data, orthogonality means zero correlation (uncorrelated).
  • Projection — the closest point in a subspace to a given vector. Regression finds the orthogonal projection of \(y\) onto the column space of \(X\).
  • Residual — the error \(e = y - \hat{y}\); orthogonal to the column space. The residual is orthogonal to every column of \(X\) — that’s the normal equation.
  • Normal equations — \(X^TX\beta = X^Ty\), the algebraic form of the orthogonality condition.
  • \(R^2\) — fraction of variance explained; \(R^2 = 1 - \|e\|^2 / \|y - \bar{y}\|^2\). It never decreases when you add features (on training data).
  • Coefficient interpretation — “holding other features constant,” the predicted change in \(y\) per unit change in one feature. A regression coefficient measures association, not causation.
  • \(R^2 = r^2\) in simple regression — the correlation coefficient \(r\) measures linear association (\(-1\) to \(+1\)); \(R^2\) measures variance explained (\(0\) to \(1\)). For one predictor, \(R^2\) is just \(r\) squared. With multiple predictors, only \(R^2\) applies.
  • Regression to the mean — extreme observations tend to be followed by less extreme ones whenever correlation is imperfect (\(|r| < 1\)). A statistical phenomenon, not a causal one.

Computational tools

  • np.dot(u, v) or u @ v — inner product (dot product)
  • np.linalg.norm(v) — Euclidean norm (length) of a vector
  • np.column_stack([a, b]) — build a design matrix from column vectors
  • np.linalg.solve(A, b) — solve \(Ax = b\) without forming \(A^{-1}\)
  • LinearRegression() — create a linear regression model
  • .fit(X, y) — fit the model to data
  • .predict(X) — generate predictions
  • .coef_ — fitted coefficients (slopes)
  • .intercept_ — fitted intercept
  • pd.get_dummies() — one-hot encode categorical variables

For the quiz

  • Be able to determine whether a vector is in the span of given vectors.
  • Know what happens to the column space when you add a redundant feature.
  • Be able to write the normal equations and explain what they mean geometrically.
  • Given a regression output, interpret a coefficient in context (“holding other features constant…”).
  • Explain why training \(R^2\) can never decrease when you add a feature.