Practice Quiz 7: PCA + Clustering

MSE 125 — Lectures 14–15

Use this practice quiz to prepare for Quiz 7 (Wednesday, May 20). The real quiz will have 2 questions in 10 minutes, closed-book. This practice set has 10 questions covering Lectures 14–15: standardization before PCA, reading a scree plot, interpreting loadings, PCA vs. Lasso, the SVD / Eckart-Young-Mirsky theorem, the k-means algorithm and SSE, the elbow + silhouette heuristics, random-initialization sensitivity, how feature choice determines clusters, and comparing clusterings with ARI.

Every concept tested on the real quiz appears somewhere on this practice set, with a different scenario.

Question 1. A wine analyst runs PCA on 1,000 wines with 5 features. The bar chart below shows the raw variance of each feature (log scale).

Without standardizing, what fraction of the variance do you expect PC1 to capture — closer to 20%, 50%, or 95%? Briefly explain.
After standardizing, PC1 captures 30% of the variance. Is standardization making PCA worse? Explain in 2–3 sentences.

Solution

(a) Closer to 95%. Without standardization, PCA decomposes the covariance matrix, and PC1 aligns with the largest-variance direction. Here price has variance \(\approx 4{,}500\) — more than two orders of magnitude larger than the next-biggest (sugar at 25). The total variance is dominated by price, so PC1 ≈ price axis, capturing essentially all of the raw variance (about 99% of the variance lives in the price coordinate alone).

(b) No. The pre-standardization 95% was an artifact of measurement scale — price was measured in dollars and ranged into the hundreds, while color intensity was on a 0-1 scale, so price’s raw variance was thousands of times larger only because of the unit choice. After standardizing, every feature gets variance 1 and contributes equally. PC1’s 30% honestly reflects how much shared structure the features carry once the scale artifact is gone. Standardization didn’t damage PCA; it stopped PCA from being fooled by units.

Question 2. A trainer runs PCA on 800 Pokémon with 6 standardized stats: HP, Attack, Defense, Special Attack (Sp. Atk), Special Defense (Sp. Def), and Speed. The PC1 + PC2 loadings:

Describe what PC1 represents in plain language. (1–2 sentences.)
Describe what PC2 represents in plain language, referencing the sign pattern. (2–3 sentences.)

Solution

(a) PC1 is “overall power” (or “overall stat total”). Every loading on PC1 is positive and roughly the same size (0.36-0.45), so a Pokémon scores high on PC1 if it has high values across the board — strong on every stat. A weak Pokémon scores low on PC1 on every stat at once.

(b) PC2 is “offense vs. defense” (with a hint of “physical vs. special” on the offense side, but the dominant pattern is offense–vs.–defense). The positive loadings are on Attack (+0.55), Sp. Atk (+0.48), and Speed (+0.28) — the offensive stats. The negative loadings are on Defense (-0.46) and Sp. Def (-0.42) — the defensive stats. A Pokémon with high PC2 is a glass cannon (great offense, poor defense); a Pokémon with low PC2 is a tank (great defense, weak offense). PCA discovered this trade-off without anyone telling it about “offense” vs. “defense.”

Question 3. A team runs PCA on a 10-feature dataset and produces the scree + cumulative variance plot below.

How many PCs are needed to capture at least 70% of the variance? 80%? Read both off the right panel.
Where is the elbow in the left panel? How many PCs would the elbow method recommend? (1 sentence.)

Solution

(a) 70% threshold: 4 PCs (cumulative reaches 73% at \(k=4\)). 80% threshold: 6 PCs (cumulative reaches 84% at \(k=6\)). Read both off the right-panel cumulative curve where it crosses the dashed thresholds.

(b) The bend is around PC3-PC4: PC1, PC2, PC3 each contribute meaningfully (35%, 18%, 12%); the drops after that get small (PC4 at 8%, PC5 at 6%, then mostly 4-5% each). The elbow method is a heuristic, not a formula, and reasonable readings include \(k=2\), \(k=3\), or \(k=4\). Any of those answers with a stated justification (“PC1 alone captures 35%, drop is biggest after that,” or “the bend is sharpest after PC3,” etc.) is acceptable. The honest answer for picking \(k\) on a downstream task is cross-validation, not the scree elbow.

Question 4. A geneticist has a dataset of 10,000 gene expression levels measured on 500 patients, and wants to predict a binary disease label. She is considering two methods:

Lasso regression: fit a logistic regression with an L1 penalty.
PCA preprocessing: project the 10,000 features onto the top-\(k\) principal components, then fit a logistic regression on the \(k\) PC scores.

Suppose the disease is driven by a small handful of specific genes (maybe 5–10), and the rest are irrelevant noise. Which method is the better fit? Why? (1–2 sentences.)
Suppose instead the disease reflects a broad shift in expression across hundreds of correlated genes (no individual gene matters much, but the joint signal is strong). Which method is the better fit? Why? (1–2 sentences.)

Solution

(a) Lasso. When the truth is sparse — a few genes carry all the signal — Lasso’s L1 penalty drives most coefficients to exactly zero and keeps only the genes that matter. The geneticist also gets an interpretable shortlist of which specific genes are predictive (useful for follow-up biology). PCA would mix the relevant genes into combined components and lose the named-gene interpretation.

(b) PCA. When the signal is spread across many correlated features, no individual feature is “important” enough for Lasso to pick out, and aggressive L1 might drop everything. PCA constructs a small number of summary features (combinations of many genes) that pool the diffuse signal — one principal component can capture the joint shift across hundreds of correlated genes. This is exactly the setting PCA was built for: many features carry partial information, and the structure lives in linear combinations of them.

Question 5. PCA is implemented by the truncated SVD: \(X \approx U_k S_k V_k^T\).

The Eckart-Young-Mirsky theorem says the truncated SVD is “the best rank-\(k\) approximation” to \(X\). Best in what sense? (1 sentence.)
Why does this matter for dimensionality reduction? Connect to PCA’s variational definition. (1–2 sentences.)

Solution

(a) Best in squared error. Among all rank-\(k\) matrices, the truncated SVD minimizes \(\|X - \hat{X}\|_F^2\) — the sum of squared entries of the residual matrix (the Frobenius norm of the error). No other rank-\(k\) matrix can do better.

(b) PCA’s variational definition picks the \(k\)-dimensional subspace that minimizes the total squared distance from data points to their projections (equivalently, maximizes the variance of the projections). The truncated SVD literally produces that subspace — \(V_k\) holds the optimal directions and \(U_k S_k\) holds the projected coordinates. So Eckart-Young-Mirsky guarantees that no other \(k\)-component summary can reconstruct the data with smaller squared error: PCA is provably optimal in this sense.

Question 6. You are running one iteration of k-means by hand. The plot below shows 6 points (labeled A–F) and 2 initial centroids (the stars).

Which centroid is each point assigned to in the first iteration? (List by letter.)
After the recompute step, where do the two centroids land? Compute by hand.

Solution

(a) Each point is assigned to the nearer centroid (Euclidean distance). Centroid 1 is at \((3, 3)\); Centroid 2 is at \((6, 5)\).

A at \((1, 1)\): distance to C1 \(= \sqrt{4 + 4} \approx 2.83\); to C2 \(= \sqrt{25 + 16} \approx 6.40\). A \(\to\) C1.
B at \((2, 1)\): dist to C1 \(\approx 2.24\); to C2 \(\approx 5.66\). B \(\to\) C1.
C at \((1, 2)\): dist to C1 \(\approx 2.24\); to C2 \(\approx 5.83\). C \(\to\) C1.
D at \((7, 6)\): dist to C1 \(\approx 5.00\); to C2 \(\approx 1.41\). D \(\to\) C2.
E at \((8, 7)\): dist to C1 \(\approx 6.40\); to C2 \(\approx 2.83\). E \(\to\) C2.
F at \((9, 6)\): dist to C1 \(\approx 6.71\); to C2 \(\approx 3.16\). F \(\to\) C2.

Cluster 1: {A, B, C}. Cluster 2: {D, E, F}.

(b) New centroids are the mean of their cluster’s points:

New C1 = mean of A, B, C = \(\big(\tfrac{1+2+1}{3}, \tfrac{1+1+2}{3}\big) = \big(\tfrac{4}{3}, \tfrac{4}{3}\big) \approx \mathbf{(1.33, 1.33)}\).
New C2 = mean of D, E, F = \(\big(\tfrac{7+8+9}{3}, \tfrac{6+7+6}{3}\big) = \big(8, \tfrac{19}{3}\big) \approx \mathbf{(8.00, 6.33)}\).

The centroids move into the heart of their clusters; the next iteration’s assignments will be unchanged, and k-means has converged.

Question 7. You run k-means for \(k = 2, 3, \ldots, 10\) on a dataset. The elbow plot (SSE) and silhouette plot are below.

What \(k\) does the elbow plot suggest? What \(k\) does the silhouette plot suggest? (1 sentence each.)
The two heuristics disagree. Is one of them “right”? How would you actually pick \(k\)? (2–3 sentences.)

Solution

(a) Elbow plot suggests \(k = 4\). The SSE drops sharply from \(k=2\) to \(k=4\), then bends and flattens past \(k=4\) — the elbow is at \(k = 4\). Silhouette plot suggests \(k = 3\). The silhouette score peaks at \(k=3\) (about 0.58) before declining.

(b) Neither heuristic is “right” — both are heuristics that can disagree. The elbow measures within-cluster compactness; silhouette measures separation between clusters. They optimize different things and don’t have to agree. The honest way to pick \(k\) is to tie it to the downstream decision the clusters will inform: if the team needs marketable archetypes, pick \(k\) from a few candidates (\(k=3\) and \(k=4\) here) and ask which one produces actionable, interpretable groups for the actual use case. If the clustering is feeding a downstream predictive task, pick \(k\) by cross-validation on that task. The heuristics are a starting point, not a final answer.

Question 8. A data scientist runs k-means 10 times on the same dataset with 10 different random seeds and gets 10 different cluster assignments — the labels disagree and the within-cluster compositions shift from run to run.

Is k-means broken? Explain in 1–2 sentences why this happens.
Name two things the data scientist can do to address the instability (without changing the data or features).

Solution

(a) K-means is not broken — it is sensitive to initialization. The SSE objective is non-convex with many local minima, and Lloyd’s algorithm converges to whichever local minimum is closest to its starting centroids. Different random seeds start the algorithm in different places, so they converge to different local minima — and therefore different cluster assignments.

(b) Two reasonable answers (any two get full credit):

Use n_init \(> 1\) (sklearn default is n_init = 10). The algorithm runs k-means from multiple random initializations and keeps the one with the lowest SSE. Each run might land in a different local minimum, but reporting the best of many runs gives a far more stable answer than a single run.
Use k-means++ initialization (also sklearn default). Instead of choosing the initial centroids uniformly at random, k-means++ spreads them out: the first is random, then each subsequent centroid is picked with probability proportional to squared distance from the nearest existing centroid. This puts the initial centroids on different parts of the data and dramatically reduces the chance of falling into a bad local minimum.
Report stability explicitly: run k-means many times, compute ARI across pairs of runs, and present the cluster assignments only if they’re stable (high pairwise ARI) — treat instability as a finding, not a bug to hide.

Neither of these makes the call reproducible (a fixed random_state does that). They make the call more stable — less sensitive to the seed — which is a different thing.

Question 9. A city’s Department of Transportation has clustered 9 named streets under two feature sets (\(k = 3\) for each):

Set A (traffic features): daily traffic volume + accidents per year
Set B (pedestrian features): tree canopy coverage + sidewalk width

Pick one named street that ends up in a different cluster under the two feature sets, and explain in 1–2 sentences why the same street gets grouped differently.
The DOT director wants to pick “the right” clustering. What is your response? (2–3 sentences.)

Solution

(a) Example: Ash. Under Set A, Ash sits with the medium-traffic streets (Cluster 2: 6,300 vehicles/day, 8 accidents/year). Under Set B, Ash sits in the wide-sidewalk, low-canopy cluster with Main, Pine, and Cedar (Cluster 1: 12% canopy, 15-ft sidewalks). Ash isn’t contradictory — it’s both a moderate-traffic street and a wide-sidewalk street. The two feature sets place Ash near different neighbors in feature space, so k-means groups it differently. (Other valid examples: Oak, Maple, Birch, Walnut — any street whose cluster ID differs across the two panels.)

(b) Neither clustering is “the right one” — the right choice depends on the decision DOT is making. If DOT is planning traffic-calming investments (where to put speed bumps, where to lower the speed limit), Set A is the relevant grouping. If DOT is planning pedestrian-infrastructure investments (where to widen sidewalks, where to plant trees), Set B is the relevant grouping. The director should pick the feature set that matches the action — feature choice is a value judgment about what matters for the decision at hand.

Question 10. Two analysts cluster the same dataset of 200 customers with two different methods (k-means vs. hierarchical, both at \(k = 4\)). They compare the two clusterings with the Adjusted Rand Index (ARI):

\[ \text{ARI} = 0.85. \]

What does ARI = 0.85 tell you about how similar the two clusterings are? (1–2 sentences.) What ARI value would indicate identical clusterings? What value would indicate clusterings as similar as random guessing?
Suppose instead the two analysts both ran k-means with \(k=4\) but used different random seeds, and got ARI = 0.92. Is this high ARI always good news? Explain in 1–2 sentences.

Solution

(a) ARI = 0.85 means the two clusterings agree substantially but not perfectly — they place most customers into corresponding clusters, but disagree on the cluster assignment for some (about 15% of pairs of customers are assigned differently relative to chance). ARI = 1.0 indicates identical clusterings (up to label permutation). ARI = 0 indicates clusterings as similar as random guessing (no more agreement than you’d expect by chance).

(b) Yes, high ARI across seeds is generally good news — it indicates the clustering is stable (different random starts converge to essentially the same answer). High ARI alone doesn’t guarantee the clustering is meaningful — it could still be carving up noise — but stability is a necessary condition for taking the clusters seriously. Low ARI across seeds, by contrast, would be a red flag that the clusters depend on the seed and shouldn’t be reported as findings.

Other Formats