MSE 125 — Slides – Lecture 14: PCA and Dimensionality Reduction

logistics

project midterm report due Friday May 15
HW 4 out later this week: trees + PCA + clustering + Kaggle challenge

the brief

risk analyst, $5B asset manager

your book: S&P 100 stocks.

last March: every name crashed together

CIO asks: what risk are we really exposed to?

The decision frame: a risk analyst whose CIO wants to know what the book is exposed to. “95 stocks” is not what a CIO wants to hear — they want the systematic exposures: market beta, sector tilt, style tilt. PCA’s answer (block 2): PC1 is the market factor, PC2 is cyclicals-vs-defensives, PC3 is growth-vs-yield. Three numbers per stock, an interpretable risk decomposition for the one-pager.

The hook plot (right column) shows five S&P 100 stocks across 2020 — AAPL, JPM, XOM, KO, MSFT. For January and February the lines drift apart; mid-March they all crash together (red dashed = COVID shutdown); by summer they diverge — tech up, energy lagging. The COVID moment is the live demonstration of “every name crashed together” — sets up the question of how to summarize that risk.

Don’t spoil the punchline (PC1 explains ~40% of variance after standardization, and a 3-factor model wins on held-out portfolio variance). Hold for blocks 2 and 3.

About 1 min.

today

the picture: geometry of the best line, then the best subspace
reading the components: what PC1 / PC2 / PC3 mean for the S&P 100
the payoff: minimum-variance portfolios in the p > n regime

bridge from supervised learning

	regression / classification	PCA
input	features X + label y	features X only
goal	predict y from X	summarize X in k \ll p dims
evaluation	held-out prediction error	(today: held-out portfolio variance)
family	supervised	unsupervised

no y, no “right answer”. the algorithm finds structure on its own.

the picture: best line, best subspace

Pearson 1901: same picture, drawn before computers

Pearson, Phil. Mag. 1901

given a cloud of points, what line minimizes the sum of squared perpendicular distances from points to the line?

Pearson answered this in any number of dimensions, with just elementary geometry
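
Pearson's question can be checked numerically. The sketch below uses a synthetic 2-D cloud (the data, seed, and the helper name `perp_sse` are illustrative assumptions, not from the lecture) and compares the top eigenvector of the covariance matrix against a sweep of other directions through the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D cloud (illustrative assumption): correlated Gaussian points
cov_true = np.array([[3.0, 2.0], [2.0, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], cov_true, size=500)
Xc = X - X.mean(axis=0)                      # center the cloud

def perp_sse(Xc, direction):
    """Sum of squared perpendicular distances to the line through the origin."""
    u = direction / np.linalg.norm(direction)
    along = Xc @ u                           # coordinate of each point along the line
    resid = Xc - np.outer(along, u)          # perpendicular component
    return float((resid ** 2).sum())

# Candidate 1: top eigenvector of the sample covariance (PC1)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
v1 = eigvecs[:, -1]                          # eigh returns eigenvalues in ascending order

# Other candidates: a sweep of directions over the half-circle
angles = np.linspace(0.0, np.pi, 181)
sweep = [perp_sse(Xc, np.array([np.cos(a), np.sin(a)])) for a in angles]

print("PC1 perpendicular SSE :", perp_sse(Xc, v1))
print("best SSE in the sweep :", min(sweep))
```

No direction in the sweep should beat PC1, which is the sense in which PC1 is the "best line".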

same picture, real data: JPMorgan vs Bank of America

PC1 by hand: two stocks, a few lines

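import numpy as np

# assumed defined earlier: jpm_centered, bac_centered (mean-centered series)
# and mean_xy (the two means, subtracted from a new point before projecting)

# 2x2 covariance, then top eigenvector (sorted largest first)
C = np.cov(jpm_centered, bac_centered, ddof=1)
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
v1 = eigvecs[:, order[0]]                  # PC1 direction (its sign is arbitrary)

# project the example day onto PC1
score = (np.array([6.0, 3.0]) - mean_xy) @ v1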

PC1 direction       : (+0.653, +0.757)
PC1 variance share  : 94.4%
PC2 variance share  :  5.6%
Example day score on PC1: +6.12

scaling up: 95 stocks, same problem

import pandas as pd

returns = pd.read_csv('sp100_daily_returns_2014_2024.csv',
                      index_col=0, parse_dates=True)
print(returns.shape)

(2770, 95)

the data matrix X has shape n \times p: rows = 2770 days, columns = 95 stocks

now what? in 2D we drew a picture. in 95D we run the same eigendecomposition (or its rectangular cousin, the SVD; coming up).

predict before we run it.

we run PCA on the 95 stocks without standardizing (raw returns).

which stocks dominate PC1?

1. the largest companies (Apple, Microsoft, Berkshire)
2. the most volatile stocks (Tesla, semiconductors)
3. the financials (banks, insurance)
4. PC1 will be roughly equal across all 95

without standardization: high-vol stocks dominate

top 5 by |PC1 loading|

raw: ticker	raw: |loading|	standardized: ticker	standardized: |loading|
AMD	0.164	BLK	0.134
NVDA	0.162	HON	0.131
COF	0.153	MS	0.131
TSLA	0.152	JPM	0.130
BA	0.148	C	0.128

→ raw: semis, EV, aerospace. the loud names: the top 5 hold ~12% of PC1's squared-loading weight

→ standardized: financials. the most market-correlated: the top 5 hold ~8.5% (baseline if all 95 loaded equally: 5/95 ≈ 5.3%)

the fix: standardize before PCA

standardization (z-score)

subtract each feature’s mean, divide by its standard deviation

z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}

after standardization, every column has mean 0 and variance 1

→ PCA finds structure in the correlations, not the raw magnitudes

→ for returns, this is almost always what we want
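
A minimal sketch of the z-score step, on synthetic data with deliberately unequal volatilities (the data and the helper `pc1` are illustrative assumptions). It checks that the covariance of the standardized columns is the correlation matrix of the raw columns, and prints the PC1 loadings before and after standardizing.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic "returns" (illustrative assumption): 500 days, 4 assets, very different volatilities
vols = np.array([0.5, 1.0, 2.0, 4.0])
common = rng.normal(size=(500, 1))                     # shared market-like shock
X = (0.7 * common + rng.normal(size=(500, 4))) * vols

# z-score: subtract each column's mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# After standardization the covariance matrix equals the correlation matrix of X
cov_Z = np.cov(Z, rowvar=False)
corr_X = np.corrcoef(X, rowvar=False)
print("cov(Z) == corr(X):", np.allclose(cov_Z, corr_X))

def pc1(M):
    """Top eigenvector of a symmetric matrix, sign-fixed so its entries sum to a positive number."""
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]
    return v if v.sum() >= 0 else -v

print("raw PC1 loadings         :", np.round(pc1(np.cov(X, rowvar=False)), 3))
print("standardized PC1 loadings:", np.round(pc1(cov_Z), 3))
```

On the raw data the highest-volatility column should carry most of PC1; after standardizing, the loadings reflect correlation only.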

reading the components

the scree plot: variance per PC

PC1 alone: ~40% of variance
70% needs ~20 PCs; 80% needs ~35
one big factor + a long flat tail: typical of equity returns
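
The numbers behind a scree plot are the eigenvalues of the correlation matrix divided by their sum. A sketch on synthetic one-factor data (not the S&P 100 file; `n_components_for` is a hypothetical helper name) showing how the per-PC share and the "k needed for 70% / 80%" figures could be computed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 30
# Synthetic one-factor returns (illustrative assumption, not the lecture's data)
market = rng.normal(size=(n, 1))
X = 0.8 * market + rng.normal(size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PC variances = eigenvalues of the correlation matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
share = eigvals / eigvals.sum()              # height of each scree bar
cumulative = np.cumsum(share)

def n_components_for(target):
    """Smallest k whose cumulative variance share reaches `target`."""
    return int(np.searchsorted(cumulative, target) + 1)

print("PC1 share      :", round(float(share[0]), 3))
print("k for 70% / 80%:", n_components_for(0.70), n_components_for(0.80))
```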

elbow method: a heuristic, not a formula

elbow method

choose k where the scree bars stop dropping quickly: the bend in the plot

before the elbow: meaningful structure
after the elbow: noise

Warning

the elbow can mislead. on real data, even a sharp elbow may not pick the k that wins on a downstream task.

interpret PC1: the market factor

top: every stock loads positively → PC1 is the common up-down move

bottom (PC2, sectors): financials + energy on top; staples + utilities on bottom

→ PC2 = cyclicals vs defensives

The named-factor reveal. Each PC is a vector of weights, one per stock — the loadings.

Top panel (PC1): every loading is positive. A stock’s PC1 loading is its sensitivity to the market — a high-beta stock swings hard when the index swings; a defensive name has a smaller loading. PC1 is the textbook market factor.

Bottom panel (PC2, colored by GICS sector): sharp pattern by sector. Purple bars (Financials — JPM, BAC, WFC, C, COF) are positive. Red bars (Energy) also positive. Green bars (Consumer Staples — PG, KO) and dark blue bars (Utilities) are negative. PC2 is the cyclical-vs-defensive contrast — when the economy’s strong and cyclicals rally, PC2 is up; when investors flee to safety in staples and utilities, PC2 is down.

PCA discovered this without anyone telling it about sectors. The factors fell out of the eigendecomposition.

About 2 min.

interpret PC3: growth vs yield

positive: NVDA, AMZN, META, GOOGL, ADBE → growth tech

negative: utilities (DUK, SO), staples (KO), high-dividend telecom (VZ) → yield

by PC5 or PC6, the economic story runs out

we have named factors. did PCA work?

PC1 looks like the market. PC2 splits cyclicals from defensives. PC3 splits growth from yield.

have we proved PCA worked on the S&P 100?

discuss with your neighbor (3 min).

DISCUSSION: think-pair-share (3 min). 1 min think + 2 min pair. Set up the chapter’s central methodological lesson.

Target answer: no. Beautifully named factors are appealing — they pass the human-narrative test — but they’re not validation. The factors named themselves because we already know the universe: we know what financials are, we know what utilities are, so when PC2 lines up with “financials high, utilities low” we recognize the story.

What if we’d run PCA on a less-familiar dataset — gene expression, image pixels, customer behavior? We wouldn’t have the named-factor crutch. We need a method-agnostic test of whether the components carry useful information.

The chapter’s answer (block 3): use the components on a downstream task and evaluate on data we never used to fit them. Interpretability is not validation.

If a student says “yes, the named factors prove it” — that’s exactly the position the chapter argues against. Hold that and let block 3 unsettle it.

the math: what is PCA solving?

PCA (variational form)

among all k-dimensional subspaces of \mathbb{R}^p, PCA picks the one closest to the data:

S^\star = \arg\min_{\dim S = k}\; \sum_{i=1}^n \|x_i - P_S x_i\|^2

equivalently (by Pythagoras, after centering): the subspace that maximizes the total variance of the projections

with k=1: the best line through the cloud (Pearson’s picture)

with k=2: the best plane. with k general: the best k-dim subspace.
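
The "by Pythagoras" step, written out. The symbol \hat{C} for the sample covariance is introduced here for illustration; V_k is taken to have orthonormal columns spanning S.

```latex
% x_i centered; P_S = V_k V_k^T is the orthogonal projection onto S
\|x_i\|^2 = \|P_S x_i\|^2 + \|x_i - P_S x_i\|^2
\quad\Longrightarrow\quad
\sum_{i=1}^n \|x_i - P_S x_i\|^2
  = \underbrace{\sum_{i=1}^n \|x_i\|^2}_{\text{fixed by the data}}
  - \sum_{i=1}^n \|P_S x_i\|^2

% so minimizing the residual is the same as maximizing the projected variance
\sum_{i=1}^n \|P_S x_i\|^2
  = \sum_{i=1}^n \|V_k^T x_i\|^2
  = (n-1)\,\mathrm{tr}\!\left(V_k^T \hat{C}\, V_k\right),
\qquad \hat{C} = \tfrac{1}{n-1} X^T X
```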

the math: how do we solve it?

data matrix X \in \mathbb{R}^{n \times p}, where row i = x_i

2 stocks (earlier): eigendecomposed the 2 \times 2 covariance
general X: truncated SVD X \approx U_k S_k V_k^T (works directly on X; no covariance to form)

what each piece means

symbol	size	meaning
V_k	p \times k	principal directions, basis for S^\star
U_k S_k	n \times k	projected scores, each row = one day’s coordinates
\Lambda_k = S_k^2/(n{-}1)	k \times k	PC variances

→ truncated SVD = best rank-k approximation to X in Frobenius error (Eckart–Young–Mirsky)
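
A sketch of the three pieces in the table, computed from one SVD call on a synthetic centered matrix (the data and sizes are illustrative assumptions). It also checks the Eckart–Young–Mirsky statement numerically: the Frobenius error of the rank-k reconstruction should equal the norm of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 200, 12, 3
# Synthetic low-rank-plus-noise data matrix, centered (illustrative assumption)
X = rng.normal(size=(n, 4)) @ rng.normal(size=(4, p)) + 0.3 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

V_k = Vt[:k].T                       # p x k: principal directions
scores = U[:, :k] * S[:k]            # n x k: U_k S_k, one row of coordinates per observation
pc_var = S[:k] ** 2 / (n - 1)        # diagonal of Lambda_k: PC variances

# Scores are the projections of the rows of X onto the directions
print("scores == X @ V_k:", np.allclose(scores, X @ V_k))

# Rank-k reconstruction and its Frobenius error
X_k = scores @ V_k.T
err = np.linalg.norm(X - X_k, "fro")
discarded = np.sqrt((S[k:] ** 2).sum())
print("rank-k error          :", err)
print("sqrt(sum discarded^2) :", discarded)
```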

verify: `PCA()` is the truncated SVD

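import numpy as np

# assumed from earlier cells: Z (standardized returns, n x p) and pca (PCA() fit on Z)
U, S_svd, Vt_svd = np.linalg.svd(Z, full_matrices=False)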

# Match up to a sign per component (sign of a PC is arbitrary)
signs = np.sign((Vt_svd[:3] * pca.components_[:3]).sum(axis=1))
print("max difference in directions:",
      np.abs(Vt_svd[:3] * signs[:, None] - pca.components_[:3]).max())
print("max difference in variances:",
      np.abs(S_svd[:3]**2 / (n-1) - pca.explained_variance_[:3]).max())

max difference in directions: 1.4e-15
max difference in variances:  3.5e-17

PCA() and np.linalg.svd give the same answer to numerical precision

the payoff: portfolios in the p > n regime

why minimum variance? compounding

two portfolios with the same average return don’t earn the same dollars
a portfolio that loses 50% then gains 50% finishes down 25%, not flat
compounding punishes losses more than gains

→ cutting volatility while holding return constant grows real wealth over time

→ low-volatility ETFs and pension-fund mandates: this is exactly their target

full Markowitz problem trades off return and variance; we focus on variance because that’s where covariance estimation, and PCA, bites

The motivation slide. Students will instinctively want to maximize return, not minimize variance. The honest answer is: high variance kills long-run wealth.

Walk through the 50%/50% example: $1 → $0.50 → $0.75. Down 25%, not flat. Compounding penalizes losses more than gains because the loss eats your base. The general statement: E[\log(1 + r)] \approx \mu - \sigma^2/2 — variance subtracts directly from compound growth (“volatility drag”).
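
The arithmetic and the volatility-drag approximation can be checked in a few lines. The two return streams below are hypothetical (same mean, different volatility, built from the same standardized shocks); they are not market data.

```python
import numpy as np

# The slide's example: lose 50%, then gain 50%
wealth = 1.0 * (1 - 0.50) * (1 + 0.50)
print("after -50% then +50%:", wealth)             # 0.75, i.e. down 25%

# Two hypothetical return streams with the same average return, different volatility
rng = np.random.default_rng(4)
mu, n_days = 0.0005, 100_000
shocks = rng.normal(size=n_days)
shocks = (shocks - shocks.mean()) / shocks.std()   # rescaled to mean 0, std 1 in-sample

for sigma in (0.005, 0.02):
    r = mu + sigma * shocks                        # same mean mu by construction
    growth = np.log1p(r).mean()                    # realized average log growth per day
    approx = mu - sigma ** 2 / 2                   # volatility-drag approximation
    print(f"sigma={sigma}: log growth={growth:.6f}, mu - sigma^2/2={approx:.6f}")
```

Both streams have the same average return, but the higher-volatility one should show the lower average log growth.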

Practical: low-volatility ETFs (USMV, SPLV) are billion-dollar funds. Pension funds and risk-parity strategies are explicitly variance-minimizing. So minimum-variance isn’t a quirky academic objective — it’s a major chunk of how real money actually invests.

The parenthetical at the bottom acknowledges what we’re not covering: the full Markowitz problem trades off return AND variance. We focus on the variance side because PCA is what helps us estimate covariance — and you need that to do anything else with \Sigma.

About 90 sec.

minimum-variance portfolio: Markowitz 1952

minimize portfolio variance, subject to weights summing to 1:

\min_w \, w^T \Sigma w \quad \text{s.t.} \quad \mathbf{1}^T w = 1

closed form: w_{\text{min-var}} \propto \Sigma^{-1} \mathbf{1}

read it qualitatively:

low-variance stocks get larger weights: less risk per dollar
natural hedges (negative covariances) get larger weights: cancel each other
volatile + correlated names get smaller weights: redundant risk

→ to use this, we need \Sigma, and to invert it

Markowitz, J. Finance 1952
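
A minimal sketch of the closed form on a hypothetical 3-asset covariance matrix (the numbers are made up for illustration). Normalizing \Sigma^{-1}\mathbf{1} so the weights sum to 1 gives w = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^T \Sigma^{-1} \mathbf{1}).

```python
import numpy as np

# Hypothetical 3-asset covariance matrix (illustrative numbers, not estimated from data)
Sigma = np.array([
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.03],
    [0.00, 0.03, 0.16],
])
ones = np.ones(3)

# w is proportional to Sigma^{-1} 1; solve the linear system rather than forming the inverse
raw = np.linalg.solve(Sigma, ones)
w = raw / raw.sum()                       # normalize so the weights sum to 1

port_var = w @ Sigma @ w
equal_var = (ones / 3) @ Sigma @ (ones / 3)

print("min-variance weights:", np.round(w, 3))
print("portfolio variance  :", port_var, "vs equal-weight:", equal_var)
```

In this example the lowest-variance asset receives the largest weight, matching the qualitative reading on the slide.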

the failure mode: p > n

60-day window with 95 stocks: p = 95, n = 60

sample covariance is rank-deficient

59 nonzero eigenvalues out of 95 → singular → inverse doesn’t exist

even with a tiny ridge (\lambda = 10^{-6}), the “min-variance” portfolio is unusable:

max single-stock weight	+32%
min single-stock weight	−29%
gross exposure (long + short)	809%
max weight change after shifting window 1 day	9% per stock

→ not investable. we need a structured estimate of \Sigma
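
The rank deficiency is easy to reproduce. The sketch below uses synthetic standardized returns in place of the real 60-day window (an illustrative assumption), so its weights and exposures will not match the table above; it only shows the mechanism.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 95                                   # 60-day window, 95 stocks, as on the slide
# Synthetic standardized returns stand in for the real window (illustrative assumption)
R = rng.normal(size=(n, p))

S = np.cov(R, rowvar=False)                     # 95 x 95 sample covariance
eigvals = np.linalg.eigvalsh(S)
n_nonzero = int((eigvals > 1e-10 * eigvals.max()).sum())
print("nonzero eigenvalues:", n_nonzero, "out of", p)    # centering costs one: n - 1

# A tiny ridge makes the matrix invertible, but it stays badly conditioned
lam = 1e-6
S_ridge = S + lam * np.eye(p)
cond = np.linalg.cond(S_ridge)
print("condition number with ridge:", cond)

w = np.linalg.solve(S_ridge, np.ones(p))
w = w / w.sum()                                 # weights sum to 1
print("gross exposure (sum of |w|):", np.abs(w).sum())
```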

PCA gives us \Sigma: the factor model

approximate \Sigma as k factors + diagonal noise:

\hat\Sigma_k = V_k \Lambda_k V_k^T + \mathrm{diag}(D)

V_k = top k PC directions (the factors)
\Lambda_k = diagonal of PC variances
\mathrm{diag}(D) = stock-specific residual; ensures \mathrm{diag}(\hat\Sigma_k) = 1 on standardized data

→ \hat\Sigma_k^{-1} exists, plugs cleanly into Markowitz

→ how big should k be? ask cross-validation
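
One way to build this estimator, sketched on synthetic data in the p > n regime. The function name `pca_factor_cov` is hypothetical, and D is taken as one minus the diagonal of the low-rank part, which is one choice that satisfies the diag = 1 condition on standardized data.

```python
import numpy as np

def pca_factor_cov(Z, k):
    """Factor-model covariance V_k Lambda_k V_k^T + diag(D) from standardized data Z (n x p)."""
    n = Z.shape[0]
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    V_k = Vt[:k].T                               # p x k factor directions
    lam_k = S[:k] ** 2 / (n - 1)                 # PC variances
    low_rank = (V_k * lam_k) @ V_k.T             # V_k Lambda_k V_k^T
    D = 1.0 - np.diag(low_rank)                  # residual variances so the diagonal equals 1
    return low_rank + np.diag(D)

rng = np.random.default_rng(6)
n, p, k = 60, 95, 3
# Synthetic one-factor returns with p > n (illustrative assumption)
X = 0.6 * rng.normal(size=(n, 1)) + rng.normal(size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

Sigma_hat = pca_factor_cov(Z, k)
min_eig = np.linalg.eigvalsh(Sigma_hat).min()
print("diag is 1     :", np.allclose(np.diag(Sigma_hat), 1.0))
print("min eigenvalue:", min_eig)                # > 0 means the inverse exists

w = np.linalg.solve(Sigma_hat, np.ones(p))
w = w / w.sum()
print("gross exposure:", np.abs(w).sum())
```

The low-rank part is positive semidefinite and the diagonal residual is positive whenever k is below the rank of the data, so the sum is invertible even though the sample covariance is not.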

the four estimators

method	formula	role
equal-weight	w = \mathbf{1}/p	baseline (no covariance)
ridge sample	\Sigma + \lambda I	invertibility hack
random projection	random V_k + diag correction	sanity check
PCA factor	top-k PC factors + diag correction	the contender

we backtest with walk-forward evaluation: train on the past, hold for the next 21 days, slide

The four estimators we’ll compare. The first two are the standard competitors any new estimator has to beat. The third — random projection — tests whether PCA’s variance-aligned directions actually matter, or whether any low-rank approximation would do. If random does as well as PCA, then the “structure” PCA finds is illusory.

The walk-forward evaluation: train on the first 60 days, hold for the next 21 (about a month between rebalances), then slide the window forward and repeat. Time-respecting CV — never train on data that comes after the test window. We’ll come back to walk-forward properly in Ch 16.

Why walk-forward instead of random 5-fold? Returns are correlated across time (volatility clusters, market regimes persist). A random fold puts March 15 in training and March 14, 16 in validation — same regime, leaks information. Random-fold validation underestimates true generalization error.
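
A sketch of how the walk-forward index ranges could be generated. The function name is hypothetical, and sliding by one holding period (21 days) per step is an assumption; the notes only say the window slides forward.

```python
def walk_forward_splits(n_obs, train_len=60, hold_len=21):
    """Yield (train, test) index ranges; each test block starts right after its training block."""
    start = 0
    while start + train_len + hold_len <= n_obs:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + hold_len)
        yield train, test
        start += hold_len                         # assumed step: one holding period

splits = list(walk_forward_splits(n_obs=2770))
first_train, first_test = splits[0]
print("number of rebalances:", len(splits))
print("first train days:", first_train[0], "to", first_train[-1])
print("first test days :", first_test[0], "to", first_test[-1])
```

Every training index precedes every test index in its split, which is the time-respecting property a random 5-fold split lacks.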

About 90 sec.

the result: variance vs k

PCA factor (blue) drops fast 1→3, flat-ish 3→10, rises past 10
CV picks k = 3, not the scree plot’s “elbow at 1”
random projection (gray) lies above PCA throughout: variance-aligned directions matter

The CV plot. Validation-half mean daily portfolio variance vs. number of components k, log-scale on k.

Blue (PCA factor): drops sharply from k=1 to k=3 as PC2 and PC3 stabilize the covariance estimate. Flat between k=3 and k=10. Rises past k=10 as we start fitting noise — same bias-variance trade-off as in regression, but for covariance estimation.

Gray (random projection): consistently above PCA. A random low-rank approximation captures some of the structure, but always less than PCA’s variance-aligned directions. So PCA is finding signal, not just any low-rank summary.

Red dashed (ridge): worst of all. The ridge estimate is so unstable that PCA wins by a wide margin at every k.

Green dashed (equal-weight): beats ridge but lags PCA. Equal-weight is a tough baseline — it’s parameter-free and robust.

The CV pick is k=3. Three factors are enough to capture the systematic structure; more components start adding noise.

About 2 min.

the result: cumulative variance over time

PCA factor portfolio sits lowest the entire test period

every method jumps at COVID; PCA’s jump is smallest

interpretability is not validation

PC1 = market, PC2 = cyclicals / defensives, PC3 = growth / yield

→ a clean story. but a story is not the test.

the test: held-out portfolio variance, on data we never used to fit

→ pick the model that wins on the task you care about

cautionary tale: genes mirror geography

197K SNPs, 1,387 Europeans

no geographic info given to the algorithm

→ it drew a map of Europe (r^2 \approx 0.7)

→ math result: PCA on spatial data with distance-decaying similarity generically produces these patterns

→ recovering an obvious pattern is not evidence of hidden structure

Novembre 2008; Novembre & Stephens 2008

The cautionary callout from the chapter. Novembre et al. ran PCA on ~197K SNPs across 1,387 Europeans, gave the algorithm zero geographic information, and the first two principal components — when plotted against each other — reproduced the map of Europe with high fidelity. PC1 lined up with latitude, PC2 with longitude.

The temptation is to read this as “PCA discovered hidden European population structure.” The same research community published the math showing otherwise: Novembre & Stephens (2008, Nature Genetics) proved that PCA on any spatial data with distance-decaying similarity will produce gradient/sinusoidal patterns generically — no historical demographic event required.

So the famous figure was always going to come out — the math made it inevitable given the sampling. The structure was built into the input geography, not discovered as European prehistory.
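
The generic effect can be illustrated with a toy setup that contains no history at all: hypothetical sampling sites on a line whose similarity decays exponentially with distance. This is a sketch of the phenomenon under those assumptions, not a reproduction of the Novembre & Stephens analysis.

```python
import numpy as np

# Hypothetical setup: 50 sampling sites on a line, similarity decaying with distance
n_sites, length_scale = 50, 10.0
pos = np.arange(n_sites)
C = np.exp(-np.abs(pos[:, None] - pos[None, :]) / length_scale)

eigvals, eigvecs = np.linalg.eigh(C)
top = eigvecs[:, ::-1][:, :3]                    # three leading eigenvectors

def sign_changes(v):
    """Number of times the entries of v change sign along the line."""
    s = np.sign(v)
    return int((s[1:] != s[:-1]).sum())

changes = [sign_changes(top[:, j]) for j in range(3)]
for j, c in enumerate(changes, start=1):
    print(f"PC{j}: sign changes along the line = {c}")
```

The leading components come out as smooth waves over position, with one more sign change per component: a gradient-like PC2 appears from the distance-decay structure alone.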

The lesson — and a parallel to today’s named-factor discussion: a PCA figure that reproduces something obvious about the data (a map, a known taxonomy) is not by itself evidence that PCA discovered hidden structure. Same lesson next lecture, on clustering. Genes mirror geography → PredPol mirrors policing → MBTI mirrors cutpoints. Three fields, same artifact.

About 2 min.

PCA or pick five?

instead of computing principal components, could we just pick the five best stocks and form a portfolio from them?

when would PCA be the right move? when would a sparse subset (Lasso) be the right move?

discuss with your neighbor (3 min).

DISCUSSION: think-pair-share (3 min). 1 min think + 2 min pair + brief debrief.

Target answers:

Lasso is the right move when the truth really is sparse — only a handful of stocks matter for predicting some outcome, the rest are noise. Example: “predict next-quarter earnings from one of 200 specific accounting ratios.” You want a short list.

PCA is the right move when many features carry useful information and you want a low-dimensional summary that pools them. Example: today’s covariance matrix where every stock is partly driven by a common market factor. A sparse subset would miss the market hitting all of them.

The third option — sparse PCA — splits the difference: principal components in which most loadings are zero. Less common, but it exists.

The deeper point: PCA and Lasso answer different questions. Lasso says “which features matter?” PCA says “what’s a low-dim summary?” Both are valid; pick by what your downstream task needs.

demo: PCA in the notebook

colab.research.google.com/…/lec14-pca.ipynb

what to watch:

toggle standardization on/off: watch PC1 flip from “TSLA-dominated” to “market”
vary k from 1 to 20: watch held-out variance drop, flatten, rise
swap in your own time window: does CV’s pick of k change?

summary

PCA finds the directions of maximum variance: best line, best plane, best k-dim subspace
standardize first: otherwise the highest-variance features dominate
PC1, PC2, PC3 on the S&P 100 are the market, cyclical-vs-defensive, growth-vs-yield
truncated SVD is what PCA computes (Eckart–Young–Mirsky: best rank-k approximation)
factor-model covariance \hat\Sigma_k = V_k \Lambda_k V_k^T + \mathrm{diag}(D) unlocks portfolios when p > n
interpretability is not validation: pick k by held-out task, not the scree elbow

next: clustering, with a different constraint

PCA compresses each row into continuous coordinates in a k-dim subspace

K-means compresses each row into one of k discrete labels

same compress-into-k idea, different geometry

→ Monday: K-means on Airbnb (and the same “your features pick your structure” lesson, on a different method)

feedback

forms.gle/feedback

what worked? what didn’t? what’s still confusing?
