MSE 125 — Applied Statistics
Monday, May 18, 2026
goal: sign a wing who can space the floor and attack closeouts
the five-position label is no help — Curry and Westbrook are both “point guards” and play nothing alike
what is the real taxonomy of NBA players in 2024?

| | PCA | k-means |
|---|---|---|
| structure found | directions of variance | groups of rows |
| representation | continuous (low-dim coords) | discrete (one label per row) |
| standardize first | yes — distances depend on units | yes — same reason |
| labels needed | no | no |
both are unsupervised. PCA compresses; k-means partitions.
317 players, five court zones
shot-attempt counts from the 2023-24 NBA regular season
filter to players with ≥ 200 FGA → 317 players
for each player, the share of attempts in five court zones: pct_RA, pct_PAINT, pct_MID, pct_C3, pct_ATB3
the five shares sum to 1 — a shot-mix fingerprint
before any algorithm runs, we commit to a notion of distance
Euclidean distance (on standardized shot mix)
d(x_i, x_j) = \sqrt{\sum_{z=1}^{5} (x_{iz} - x_{jz})^2}
square the differences zone by zone, add, square-root
two players are “close” when their five shot shares line up — regardless of height, defense, rebounding, or total volume
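a minimal sketch of the distance above, in plain Python. the three shot mixes are made-up numbers, not real player data, and standardization is skipped here to keep the arithmetic easy to follow.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical shot-mix shares in the order
# (rim, paint, midrange, corner 3, above-break 3); each row sums to 1.
player_a = [0.30, 0.15, 0.10, 0.10, 0.35]
player_b = [0.28, 0.17, 0.12, 0.08, 0.35]   # nearly the same mix as a
player_c = [0.60, 0.25, 0.10, 0.02, 0.03]   # rim-heavy, few threes

d_ab = euclidean(player_a, player_b)
d_ac = euclidean(player_a, player_c)
print(f"d(a, b) = {d_ab:.3f}, d(a, c) = {d_ac:.3f}")
```

a and b differ by 0.02 in four zones, so their distance is small; a and c differ by about 0.3 in two zones, so they land far apart.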
gray dots = all 317 players. colored lines = three famous pairs in shot-mix distance.
assign, recompute, repeat
k-means clustering
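partition n points into k groups by repeating:
1. pick k initial centroids
2. assign each point to its nearest centroid
3. recompute each centroid as the mean of its assigned points
loop on 2-3 until assignments stop changing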
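a minimal sketch of the loop (pick starting centroids, assign, recompute, repeat), using only the standard library. the function name `kmeans`, the random-row initialization, and the toy points are assumptions for illustration, not the code behind the slides.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: k random rows as starting centroids
    labels = None
    for _ in range(max_iter):
        # step 2: assign each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:           # assignments stopped changing
            break
        labels = new_labels
        # step 3: recompute each centroid as the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Toy data: two well-separated blobs of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

which blob gets label 0 depends on the random start, so compare runs by which points share a label, not by the label numbers themselves.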
k=3, two features (rim share, above-break-3 share), four iterations
sum of squared errors (SSE)
\text{SSE} = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2
total squared distance from each point to its assigned centroid
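a small sketch of the SSE formula on hypothetical one-dimensional-style data (toy numbers, chosen so the sums can be checked by hand).

```python
def sse(points, labels, centroids):
    """Total squared distance from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Toy data: two pairs of points on a line, with each centroid at a pair's midpoint.
points = [(0, 0), (2, 0), (10, 0), (14, 0)]
centroids = [(1, 0), (12, 0)]

sse_good = sse(points, [0, 0, 1, 1], centroids)   # 1 + 1 + 4 + 4
sse_bad = sse(points, [0, 1, 1, 1], centroids)    # 1 + 100 + 4 + 4
print(sse_good, sse_bad)
```

moving the point (2, 0) to the far centroid raises SSE from 10 to 109, which is why the assignment step always picks the nearest centroid.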
alternating minimization — same engine as PCA in Ch 14
k-means uses Euclidean distance
→ a feature with range 0-1000 dominates one with range 0-1
even when shot shares already sum to 1, standardize before clustering — safe default for any distance-based method
same lesson as PCA: standardize before any distance-based method
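a sketch of why units matter, assuming z-score standardization (subtract the column mean, divide by the column standard deviation). the three rows are hypothetical: a volume-like feature on a 0-1000 scale next to a share on a 0-1 scale.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, sds))
        for row in rows
    ]

# Hypothetical rows: (attempts on a 0-1000 scale, rim share on a 0-1 scale).
rows = [(900, 0.10), (880, 0.90), (300, 0.12)]

raw_01 = math.dist(rows[0], rows[1])   # similar volume, opposite shot mix
raw_02 = math.dist(rows[0], rows[2])   # similar shot mix, very different volume

z = standardize(rows)
z_01 = math.dist(z[0], z[1])
z_02 = math.dist(z[0], z[2])

print(f"raw:          d01 = {raw_01:.2f}, d02 = {raw_02:.2f}")
print(f"standardized: d01 = {z_01:.2f}, d02 = {z_02:.2f}")
```

in raw units the attempts column decides everything and the shot-mix gap between rows 0 and 1 barely registers; after standardizing, the two gaps are of comparable size.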
five clusters, sizes 54-76. what’s in each?
each panel = one archetype’s mean shot mix across the five zones
k-means didn’t see position labels at all — yet the midrange-masters cluster mixes a guard (Brunson), a wing (Edwards), and a forward (Durant).
one iteration of k-means — on paper.
four points: (0, 0), (1, 0), (4, 3), (5, 3).
two initial centroids: \mu_1 = (1, 1), \mu_2 = (3, 2).
what are the new centroids?
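one way to check the paper exercise: a short sketch that runs a single assign-then-recompute pass on the four points and two starting centroids given above.

```python
import math

points = [(0, 0), (1, 0), (4, 3), (5, 3)]
centroids = [(1, 1), (3, 2)]

# assign: index of the nearest centroid for each point
labels = [
    min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
    for p in points
]

# recompute: each new centroid is the mean of its assigned points
new_centroids = []
for j in range(len(centroids)):
    members = [p for p, lab in zip(points, labels) if lab == j]
    new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))

print(labels)          # [0, 0, 1, 1]
print(new_centroids)   # [(0.5, 0.0), (4.5, 3.0)]
```

(0, 0) and (1, 0) are closer to \mu_1, (4, 3) and (5, 3) are closer to \mu_2, so the centroids move to the midpoint of each pair.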
how do you pick k?
elbow method
plot SSE vs. k; look for a bend where adding clusters stops helping
silhouette score
for each point, s(i) = \dfrac{b(i) - a(i)}{\max(a(i), b(i))} \in [-1, 1]
near +1 = sits comfortably in its cluster, 0 = on a boundary, -1 = probably misassigned
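a sketch of the per-point score, using the standard definitions: a(i) is the mean distance from point i to the other members of its own cluster, and b(i) is the smallest mean distance from point i to the members of any other cluster. the toy points and the choice to score singleton clusters as 0 are assumptions for illustration.

```python
import math

def mean_dist(p, others):
    """Mean Euclidean distance from p to a non-empty list of points."""
    return sum(math.dist(p, q) for q in others) / len(others)

def silhouette_scores(points, labels):
    """Per-point s(i) = (b - a) / max(a, b); needs at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:                        # singleton cluster: scored 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist(p, own)
        b = min(
            mean_dist(p, [q for q, lab in zip(points, labels) if lab == other])
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy data: two tight pairs on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_scores(points, [0, 0, 1, 1])   # natural split
bad = silhouette_scores(points, [0, 0, 0, 1])    # point at 10 put in the wrong group
print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

under the natural split every score is near +1; under the bad split the point at 10 gets a negative score because it sits closer to the other cluster than to its own.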
each bar = one player’s silhouette score, grouped by archetype
silhouette says k=3. we picked k=5. why?
what information did we use that the silhouette score does not see?
k=3
three coarse buckets — bigs, wings, guards
k=5
five archetypes — finishers, wings, midrange, spacers, creators
don’t commit to k up front
if the right k is unclear — build a tree instead
agglomerative hierarchical clustering
cut the tree at any height → clustering at that granularity
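a naive sketch of the idea: start with every point as its own cluster, repeatedly merge the two closest clusters, and record the height of each merge. average linkage is an assumption here (the slides do not say which linkage was used), and the function names and toy points are hypothetical.

```python
import math
from itertools import combinations

def average_linkage(a, b, points):
    """Mean pairwise distance between two clusters given as index sets."""
    return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points):
    """Merge history: a list of (height, clusters after that merge)."""
    clusters = [frozenset([i]) for i in range(len(points))]
    history = [(0.0, list(clusters))]
    while len(clusters) > 1:
        # find the closest pair of clusters and merge them
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        height = average_linkage(a, b, points)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((height, list(clusters)))
    return history

def cut(history, height):
    """Clusters formed by all merges at or below the given height."""
    result = history[0][1]
    for h, clusters in history:
        if h <= height:
            result = clusters
    return result

# Toy data: two tight pairs and one far-away point.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
history = agglomerate(points)
print(len(cut(history, 2)), len(cut(history, 10)))   # 3 2
```

the same tree gives three clusters when cut at height 2 and two clusters when cut at height 10, so k is chosen after the tree is built rather than before.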
top 3 players by FGA in each archetype labeled. neighboring leaves play similarly.
how stable is any of this?
two clusterings of the same n points — how much do they agree?
Rand index (RI)
for every pair of points, the two clusterings either agree or disagree:
\text{RI} = \frac{\text{TP} + \text{TN}}{\text{all pairs}}
adjusted Rand index (ARI)
RI inflates by chance — most pairs are “apart” under any partition. ARI corrects for that.
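a sketch of both indices by pair counting, reading TP as pairs placed together in both clusterings and TN as pairs placed apart in both. the ARI below uses the standard contingency-table form; the label vectors are toy examples.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(u, v):
    """Share of point pairs on which the two clusterings agree."""
    agree = sum(
        (u[i] == u[j]) == (v[i] == v[j])
        for i, j in combinations(range(len(u)), 2)
    )
    return agree / comb(len(u), 2)

def adjusted_rand_index(u, v):
    """Rand index corrected for chance agreement.

    Undefined (division by zero) when both clusterings are trivial.
    """
    n = len(u)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_u = sum(comb(c, 2) for c in Counter(u).values())
    sum_v = sum(comb(c, 2) for c in Counter(v).values())
    expected = sum_u * sum_v / comb(n, 2)
    max_index = (sum_u + sum_v) / 2
    return (sum_cells - expected) / (max_index - expected)

u = [0, 0, 1, 1, 2, 2]
v = [2, 2, 0, 0, 1, 1]   # same partition as u, labels renamed
w = [0, 1, 2, 0, 1, 2]   # splits every pair that u keeps together

print(rand_index(u, v), adjusted_rand_index(u, v))
print(rand_index(u, w), adjusted_rand_index(u, w))
```

renaming the labels leaves both scores at 1. against w, RI is still 0.6 because most pairs are apart in both clusterings, while ARI drops to -0.25.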
k-means picks random initial centroids → different starts can converge to different local optima
we run k-means 10 times at k=5 with n_init=1 — how often do they agree?
n_init=10: runs k-means 10 times, keeps the lowest-SSE partition
k-means++: picks initial centroids spread apart, not just random rows
Cluster 3 in one run is Cluster 0 in the next
your features pick your clusters
we have 317 players and k=5 in all three runs
shot mix
pct_RA, pct_PAINT, pct_MID, pct_C3, pct_ATB3
→ archetypes
shot volume
FGA (one feature)
→ tiers: stars to bench
shot efficiency
efg_pct (effective FG%)
→ tiers: skill ladder
same data, same k, same algorithm. three different stories.
each row = one player labeled three ways. color encodes the tier.
three sources of disagreement, ranked by ARI:
feature choice scrambles the partition an order of magnitude more than the random seed or the algorithm does
there’s no “true” clustering of these players. each recipe answers a different question; the analyst picks which question the clustering should answer.
four cases from the literature



Drysdale et al., Nat. Med. 2017, Fig. 1; Dinga et al., NeuroImage: Clinical 2019

the algorithm always returns clusters. the verdict comes from what you do after the labels come out.
before reporting any cluster, ask:
n_init=10+, multiple seeds, ARI heatmap
we’ve trained models, tested them, validated them.
what happens when the world changes after we deploy?
what worked? what didn’t? what’s still confusing?