MSE 125 — Slides – Lecture 15: Clustering

logistics

HW 4 out — HMDA Kaggle challenge, due Mon May 25
Quiz 7 Wed May 20 — PCA + clustering
project midterm returned this week

the brief

Pacers front office

goal: sign a wing who can space the floor and attack closeouts

the five-position label is no help — Curry and Westbrook are both “point guards” and play nothing alike

what is the real taxonomy of NBA players in 2024?

today

k-means — assign, recompute, repeat
choosing k, checking stability — what the standard tools tell you (and what they don’t)
your features pick your clusters — and that’s the whole game
four published case studies — when clustering works and when it doesn’t

bridge from PCA

	PCA	k-means
structure found	directions of variance	groups of rows
representation	continuous (low-dim coords)	discrete (one label per row)
standardize first	yes — distances depend on units	yes — same reason
labels needed	no	no

both are unsupervised. PCA compresses; k-means partitions.

317 players, five court zones

the data

shot-attempt counts from the 2023-24 NBA regular season

filter to players with ≥ 200 FGA → 317 players

for each player, the share of attempts in five court zones:

RA — restricted area (right at the rim)
PAINT — non-restricted paint
MID — midrange
C3 — corner three
ATB3 — above-the-break three

the five shares sum to 1 — a shot-mix fingerprint

what does “similar” mean?

before any algorithm runs, we commit to a notion of distance

Euclidean distance (on standardized shot mix)

d(x_i, x_j) = \sqrt{\sum_{z=1}^{5} (x_{iz} - x_{jz})^2}

square the differences zone by zone, add, square-root

two players are “close” when their five shot shares line up — regardless of height, defense, rebounding, or total volume
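A minimal sketch of this distance, assuming each player is a length-5 vector of standardized shot shares. The function name and the three vectors are made up for illustration; they are not real players.

```python
import numpy as np

def shot_mix_distance(x_i, x_j):
    """Euclidean distance between two shot-mix vectors."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    # square the differences zone by zone, add, square-root
    return float(np.sqrt(np.sum((x_i - x_j) ** 2)))

# Hypothetical standardized values in the order RA, PAINT, MID, C3, ATB3
player_a = [1.2, 0.3, -0.5, -0.8, -0.9]    # rim-heavy
player_b = [1.0, 0.1, -0.4, -0.6, -1.0]    # also rim-heavy
player_c = [-1.1, -0.7, 0.2, 1.3, 1.4]     # three-heavy

d_ab = shot_mix_distance(player_a, player_b)
d_ac = shot_mix_distance(player_a, player_c)
print(f"a-b: {d_ab:.2f}, a-c: {d_ac:.2f}")
```

Nothing about height, defense, or volume enters the function: only the five shares do.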

famous pairs

gray dots = all 317 players. colored lines = three famous pairs in shot-mix distance.

assign, recompute, repeat

k-means in three steps

k-means clustering

partition n points into k groups by repeating:

1. initialize k centroids (randomly)
2. assign each point to the nearest centroid
3. recompute each centroid as the mean of its assigned points

loop on steps 2-3 until assignments stop changing

centroid — the mean of all points in a cluster
converges in finitely many steps (finitely many possible assignments; SSE never increases)
the solution is a local optimum, not necessarily the global one
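The three steps can be sketched in plain NumPy. This is an illustrative implementation, not sklearn's: the function name `kmeans`, the choice to initialize from randomly chosen rows, and the toy data are all assumptions.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Plain k-means: initialize, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct rows as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy data: two well-separated blobs (synthetic, for illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

On data this cleanly separated every start reaches the same split; on real data different seeds can stop at different local optima.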

watch it iterate

k=3, two features (rim share, above-break-3 share), four iterations

the objective

sum of squared errors (SSE)

\text{SSE} = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2

total squared distance from each point to its assigned centroid

C_k = set of points assigned to cluster k
\mu_k = centroid of cluster k (the mean of those points)
same form as regression SSE — “prediction” for x_i is its cluster’s centroid

alternating minimization — same engine as PCA in Ch 14
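As a sanity check on the formula, the SSE can be computed by hand and compared with the `inertia_` attribute that sklearn's `KMeans` reports. The data here is random noise standing in for the standardized shot mix, and `sse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(X, labels, centroids):
    """Total squared distance from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # synthetic stand-in for the standardized shot mix

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
manual = sse(X, km.labels_, km.cluster_centers_)
print(f"manual SSE = {manual:.3f}, sklearn inertia_ = {km.inertia_:.3f}")
```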

one more thing first — standardize

k-means uses Euclidean distance

→ a feature with range 0-1000 dominates one with range 0-1

even when shot shares already sum to 1, standardize before clustering — safe default for any distance-based method

same lesson as PCA: standardize before any distance-based method
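A small sketch of why units matter, using two made-up features on very different scales. It measures what fraction of the total squared pairwise distance each feature contributes, before and after `StandardScaler`; the helper name is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features: one on a 0-1000 scale, one on a 0-1 scale
big = rng.uniform(0, 1000, size=100)
small = rng.uniform(0, 1, size=100)
X = np.column_stack([big, small])

def share_of_squared_distance(X):
    """Fraction of total squared pairwise distance contributed by each feature."""
    diffs = X[:, None, :] - X[None, :, :]
    per_feature = (diffs ** 2).sum(axis=(0, 1))
    return per_feature / per_feature.sum()

raw_share = share_of_squared_distance(X)
scaled_share = share_of_squared_distance(StandardScaler().fit_transform(X))
print("raw:   ", raw_share.round(6))
print("scaled:", scaled_share.round(6))
```

Before scaling, the 0-1000 feature accounts for essentially all of the distance; after scaling, the two features contribute equally.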

run it — all five zones

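from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['pct_RA', 'pct_PAINT', 'pct_MID', 'pct_C3', 'pct_ATB3']

# nba: DataFrame with one row per player; standardize each zone share
X = StandardScaler().fit_transform(nba[features].values)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)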

Cluster sizes: {0: 54, 1: 76, 2: 67, 3: 60, 4: 60}

five clusters, sizes 54-76. what’s in each?

five archetypes

each panel = one archetype’s mean shot mix across the five zones

Each panel is one archetype’s average shot mix. Read left to right:

Archetype 0 — Interior finishers (Giannis, Zion, Sabonis): overwhelming share at the rim, almost no threes. Bigs and downhill drivers.
Archetype 1 — Two-way wings (LeBron, Kuzma, Jaylen Brown): balanced across all zones. Versatile attackers, no single specialty.
Archetype 2 — Midrange masters (Brunson, Edwards, Durant): highest midrange share of any cluster — about 3× the league baseline. Classic iso scorers and pull-up artists.
Archetype 3 — Floor-spacers (Hield, DiVincenzo, Strus): threes dominate (61% of attempts), very little midrange. Specialists who spot up for kickouts.
Archetype 4 — Three-point creators (Curry, Luka, Tatum): mostly above-the-break threes with some rim attempts. Guards who pull up off the dribble.

Note: archetype numbering is not what sklearn returned. We relabeled by mean rim share so archetype 0 = most rim-heavy. Sklearn’s labels are arbitrary across runs.
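One way the relabeling could be done, sketched with a hypothetical helper and six made-up players. It renumbers clusters so that 0 is the cluster with the highest mean rim share.

```python
import numpy as np

def relabel_by_feature(labels, feature_values):
    """Renumber clusters so that 0 has the highest mean of feature_values."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values, dtype=float)
    old_ids = np.unique(labels)
    means = np.array([feature_values[labels == c].mean() for c in old_ids])
    order = old_ids[np.argsort(-means)]  # old ids, most rim-heavy first
    mapping = {old: new for new, old in enumerate(order)}
    return np.array([mapping[c] for c in labels])

# Hypothetical raw labels and rim shares for six players
raw_labels = np.array([1, 1, 2, 2, 0, 0])
rim_share = np.array([0.10, 0.12, 0.55, 0.60, 0.30, 0.28])

archetype = relabel_by_feature(raw_labels, rim_share)
print(archetype)  # [2 2 0 0 1 1]
```

The partition is unchanged; only the integer names are made reproducible.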

About 2 min.

the archetypes spread across the floor

k-means never saw position labels, and its clusters cut across them: the midrange-masters cluster mixes a guard (Brunson), a wing (Edwards), and a forward (Durant).

one iteration of k-means — on paper.

four points: (0, 0), (1, 0), (4, 3), (5, 3).

two initial centroids: \mu_1 = (1, 1), \mu_2 = (3, 2).

assign each point to its nearest centroid (Euclidean).
recompute each centroid as the mean of its assigned points.

what are the new centroids?

DISCUSSION: think-pair (2 min think + 1 min pair). Pure mechanics — tests whether students can carry out one iteration on a tiny example.

Round 1 (the actual exercise):

Distances:
- (0,0): to \mu_1=(1,1) → \sqrt{2} \approx 1.4; to \mu_2=(3,2) → \sqrt{13} \approx 3.6 → assign to \mu_1.
- (1,0): to \mu_1 → 1; to \mu_2 → \sqrt{8}\approx 2.8 → \mu_1.
- (4,3): to \mu_1 → \sqrt{13}\approx 3.6; to \mu_2 → \sqrt{2}\approx 1.4 → \mu_2.
- (5,3): to \mu_1 → \sqrt{20}\approx 4.5; to \mu_2 → \sqrt{5}\approx 2.2 → \mu_2.

Cluster 1 = {(0,0), (1,0)}; Cluster 2 = {(4,3), (5,3)}.

Recompute:
- \mu_1 = mean of {(0,0), (1,0)} = (0.5, 0)
- \mu_2 = mean of {(4,3), (5,3)} = (4.5, 3)

Round 2 (if time): re-run with the new centroids. Assignments stay the same → converged. The lesson: convergence is fast on well-separated points; the algorithm finishes whenever an iteration leaves assignments unchanged.
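The paper exercise can be checked in a few lines of NumPy. The helper name `one_iteration` is an assumption; the points and starting centroids are the ones from the slide.

```python
import numpy as np

points = np.array([[0, 0], [1, 0], [4, 3], [5, 3]], dtype=float)
centroids = np.array([[1, 1], [3, 2]], dtype=float)

def one_iteration(points, centroids):
    """Assign each point to its nearest centroid, then recompute the means."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

labels_1, centroids_1 = one_iteration(points, centroids)    # Round 1
labels_2, centroids_2 = one_iteration(points, centroids_1)  # Round 2

print(labels_1)                            # [0 0 1 1]
print(centroids_1.tolist())                # [[0.5, 0.0], [4.5, 3.0]]
print(np.array_equal(labels_1, labels_2))  # True: converged
```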

Key payoff to surface: the algorithm has no idea what the points represent. It’s geometry, nothing more. The real decisions are (a) the choice of distance, (b) the features, and (c) the initial centroids, and all three sit upstream of this iteration.

About 4 min total.

how do you pick k?

elbow + silhouette: the standard tools

elbow method

plot SSE vs. k; look for a bend where adding clusters stops helping

silhouette score

for each point, s(i) = \dfrac{b(i) - a(i)}{\max(a(i), b(i))} \in [-1, 1]

a(i) = mean within-cluster distance
b(i) = mean distance to nearest other cluster

near +1 = sits comfortably in its cluster, 0 = on a boundary, -1 = probably misassigned
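A sketch of the per-point formula, checked against sklearn's `silhouette_samples`. The helper name and the six-point dataset are made up, and the helper assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_of_point(X, labels, i):
    """s(i) = (b - a) / max(a, b) for a single point i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # a(i) excludes the point itself
    a = dists[own].mean()
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Synthetic 1-D example: two tight groups
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

s0 = silhouette_of_point(X, labels, 0)
# For point 0: a = (1 + 2) / 2 = 1.5, b = (10 + 11 + 12) / 3 = 11
print(round(s0, 4))  # 0.8636
```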

what do the heuristics say?

elbow — gradual, no sharp bend
silhouette — peaks at k=3 (k=2 a close second), then drifts down
both rule out clearly bad k, neither picks k for you
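A sketch of how both heuristics can be scanned over a range of k. The data is a synthetic `make_blobs` set with three planted groups, not the NBA table, so the clean peak here is by construction; real data tends to give the more ambiguous picture described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 3 planted groups
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=5,
                      cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X_raw)

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (sse_k, sil_k) in results.items():
    print(f"k={k}  SSE={sse_k:8.1f}  silhouette={sil_k:.3f}")

best_k = max(results, key=lambda k: results[k][1])
```

The loop only produces the two curves; choosing k from them still depends on the question being asked.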

silhouette, per player

each bar = one player’s silhouette score, grouped by archetype

silhouette says k=3. we picked k=5. why?

what information did we use that the silhouette score does not see?

k=3

three coarse buckets — bigs, wings, guards

k=5

five archetypes — finishers, wings, midrange, spacers, creators

DISCUSSION: think-pair-share (4 min). 1 min think + 2 min pair + 1 min debrief.

Target answer: the question being asked. The Pacers don’t want “is this a guard or a forward?” — they want “what kind of guard is this, and does he do the same things as Hield?” That question demands archetype-level resolution, finer than the silhouette score alone recommends.

Other defensible answers:
- We know there are more than three modern positions (basketball domain knowledge).
- We want recognizable categories, not just the most-separated clusters.
- The next-best signal: the SSE elbow at k=5 is mild but visible if you squint.

Key insight to surface: the heuristics rule out clearly bad choices (k=2 too coarse, k=20 over-segments), but the granularity a task needs is not in the data — it’s in the question.

If a pair says “just trust the silhouette” — fine, push back: “you take that to the Pacers’ GM. He hands you back a cluster with Curry, LeBron, and Jokic in it and asks which one is a Hield replacement.”

don’t commit to k up front

hierarchical clustering: don’t commit to k up front

if the right k is unclear — build a tree instead

agglomerative hierarchical clustering

start with every point as its own cluster
merge the two closest clusters
repeat until one cluster remains

cut the tree at any height → clustering at that granularity
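A sketch of build-once, cut-anywhere using SciPy's `linkage` and `fcluster`. The data is synthetic, and Ward linkage is used because it is the linkage named in the dendrogram notes.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic data: three loose groups in 2-D
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0, 4, 8)])

Z = linkage(X, method="ward")  # the full merge tree, built once

# Cutting the same tree at different granularities
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_5 = fcluster(Z, t=5, criterion="maxclust")

print(np.bincount(labels_3)[1:])  # cluster sizes at k=3 (fcluster ids start at 1)
```

Because every cut comes from the same tree, the finer partitions are nested inside the coarser ones.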

a tree of merges

top 3 players by FGA in each archetype labeled. neighboring leaves play similarly.

cut high → coarse partition (a few big groups)
cut low → fine partition (many small groups)
a second algorithm on the same data — how much does it agree with k-means?

The dendrogram with names on the leaves. Each leaf is a player; the y-axis is the Ward linkage distance at which two clusters merged. Read horizontally: players near each other are similar by shot mix; the higher the merge, the more different the groups being combined.

The interior finishers (Sabonis, Williamson, Antetokounmpo — red, far left) sit alone all the way up. The floor-spacers (Hield, DiVincenzo, Bogdanovic — light blue) sit close to the midrange masters (Fox, Brunson, Gilgeous-Alexander — yellow). The two-way wings (James, Wagner, Maxey — orange) and three-point creators (Edwards, Doncic, Curry — dark blue) merge at moderate heights on the right.

Two algorithms, two partitions of the same 317 players — how much do they agree? Eyeballing the tree against the k-means archetypes won’t answer that. We need a number. That motivates the next section.

Hierarchical is slower than k-means (memory grows as n^2, time superlinear). For hundreds of points where k is unclear, it’s a useful complement; for very large n, stick with k-means.

About 90 sec.

how stable is any of this?

comparing two clusterings

two clusterings of the same n points — how much do they agree?

Rand index (RI)

for every pair of points, the two clusterings either agree or disagree:

TP — both put the pair together
TN — both put the pair apart
FP / FN — one says together, the other says apart

\text{RI} = \frac{\text{TP} + \text{TN}}{\text{all pairs}}
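The pair-counting definition translates directly into code. This is a brute-force sketch over all pairs (fine for a few hundred points); the function name and the toy labelings are illustrative.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        agree += together_a == together_b  # counts both TP and TN
        total += 1
    return agree / total

a = [0, 0, 1, 1, 2, 2]
b = [3, 3, 0, 0, 1, 1]  # same partition as a, different integer labels
c = [0, 0, 0, 1, 1, 1]  # a different partition

ri_ab = rand_index(a, b)
ri_ac = rand_index(a, c)
print(ri_ab)            # 1.0
print(round(ri_ac, 3))  # 0.667 (2 TP + 8 TN out of 15 pairs)
```

Swapping the integer names, as in `b`, leaves the index at 1 because only pair membership is compared.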

adjusted Rand index (ARI)

RI inflates by chance — most pairs are “apart” under any partition. ARI corrects for that.

\text{ARI} = 1 → identical clusterings
\text{ARI} = 0 → no better than random
\text{ARI} < 0 → worse than random (rare)

Same TP / TN / FP / FN vocabulary as the chapter 7 confusion matrix — only the unit of analysis has changed. There, we classified individual points (predicted vs. actual label). Here, we classify pairs of points (together vs. apart under two clusterings).

Why pairs? Cluster integer labels are arbitrary — cluster “3” in one run could be cluster “0” in the next. The question “do these two players land in the same cluster?” is invariant to relabeling, so we compare partitions at the pair level.

Rand index = (agreeing pairs)/(all pairs). Easy to compute, easy to interpret — but inflated by chance: even random partitions of 317 players into 5 clusters agree on most pairs simply because most pairs are apart under any reasonable partition.

ARI subtracts the expected RI under random labelings and rescales so 0 = chance and 1 = identical. We skip the closed-form formula — the headline is the same: bigger = more agreement.
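To see the chance inflation concretely, here is a small simulation with purely random labelings of 317 items into 5 clusters, using sklearn's `rand_score` and `adjusted_rand_score`. With uniform random labels, a pair is together in both labelings with probability 1/25 and apart in both with probability 16/25, so the expected RI is about 17/25 = 0.68 even though the labelings are unrelated.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score

rng = np.random.default_rng(0)
n_players, k = 317, 5

ri_values, ari_values = [], []
for _ in range(200):
    # Two unrelated labelings: each player gets a uniformly random cluster
    a = rng.integers(0, k, size=n_players)
    b = rng.integers(0, k, size=n_players)
    ri_values.append(rand_score(a, b))
    ari_values.append(adjusted_rand_score(a, b))

mean_ri = float(np.mean(ri_values))
mean_ari = float(np.mean(ari_values))
print(f"mean RI = {mean_ri:.3f}, mean ARI = {mean_ari:.3f}")
```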

About 90 sec.

different seeds, different clusters

k-means picks random initial centroids → different starts can converge to different local optima

we run k-means 10 times at k=5 with n_init=1 — how often do they agree?

10 runs, pairwise ARI

mean ARI = 0.62, range 0.37 – 1.00 across 45 pairs
some pairs agree perfectly (ARI = 1); some find genuinely different partitions
same data, same algorithm, same k — different clusters across runs
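A sketch of the seed experiment: ten single-start runs, then ARI for all 45 pairs. The data below is random noise standing in for the standardized shot mix, so the numbers it prints will not match the 0.62 on the slide. The `init` argument is left at sklearn's default, which is an assumption about how the chapter ran it.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(317, 5))  # synthetic stand-in; swap in the real standardized features

# One k-means run per seed, each with a single start
runs = [
    KMeans(n_clusters=5, n_init=1, random_state=seed).fit_predict(X)
    for seed in range(10)
]

pair_ari = [adjusted_rand_score(runs[i], runs[j])
            for i, j in combinations(range(10), 2)]

print(f"{len(pair_ari)} pairs, mean ARI = {np.mean(pair_ari):.2f}, "
      f"range {min(pair_ari):.2f} to {max(pair_ari):.2f}")
```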

The heatmap from the chapter. Each cell is the ARI between two of the 10 runs (with n_init=1); diagonal blanked. 45 pairs total.

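Reveal the numbers after hearing student guesses. Mean = 0.62, which is moderate. Range = 0.37 to 1.00: at least one pair converged to identical partitions, while the worst pair scores 0.37, clearly above chance (0) but far from identical. ARI is chance-corrected agreement over pairs of players, not the share of players that match.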

Most cells are warm — those pairs converged to identical or near-identical local optima. A few cells are much paler — those pairs found genuinely different solutions.

The lesson: most runs agree, a few find different solutions, and you can’t tell from the algorithm output which is the “right” one. There’s no global ranking — they’re all local optima.

Hold the 0.62 number — we’ll compare it against algorithm-to-algorithm and feature-recipe-to-recipe agreement in the next section.

About 75 sec.

sklearn’s defenses (and what they don’t fix)

# sklearn settings (n_init=10 was the default before scikit-learn 1.4; later versions default to n_init='auto'):
KMeans(n_clusters=5, n_init=10, init='k-means++')

n_init=10 — runs k-means 10 times, keeps the lowest-SSE partition
k-means++ — picks initial centroids spread apart, not just random rows
together: most pairs of runs now agree closely; a few still diverge
cluster IDs are arbitrary — Cluster 3 in one run is Cluster 0 in the next
describe clusters by profile, not by integer label
boundary players flip across seeds — that’s signal, not noise: they’re hybrids sitting between archetypes

What sklearn does to defend against bad starts. n_init=10 keeps the lowest-SSE result out of 10 starts; it was the default through scikit-learn 1.3, while 1.4 and later default to n_init='auto' (a single start when init='k-means++'), so set it explicitly. k-means++ picks initial centroids spread apart: the first centroid is random, then each subsequent one is chosen with probability proportional to its squared distance from the nearest existing centroid. Together these reduce, but don’t eliminate, sensitivity to the random seed.
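The seeding rule described above can be sketched in a few lines. This is a textbook version under a hypothetical function name; sklearn's own implementation additionally tries several candidates at each step, so it is not identical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread the initial centroids apart."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = [X[rng.integers(n)]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()  # far-away points are more likely to be picked
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Synthetic data: three tight, far-apart groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in (0, 10, 20)])
init = kmeans_pp_init(X, k=3)
print(init.round(1))
```

On data like this the three seeds almost always land in three different groups, which is the failure that purely random row selection does not prevent.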

The arbitrary-integer-ID problem is separate from instability. Even when two runs converge to exactly the same partition, the integer labels can be shuffled (Cluster 0 ↔︎ Cluster 3). That’s why you describe clusters by their content (mean features, named members) and give them names. “Floor-spacers” travels across runs; “Cluster 3” doesn’t.

Boundary players are the third issue. A player who sits between two archetypes will flip across seeds — that’s not noise to fix, it’s signal about the player’s hybrid nature. Hield is unambiguous; a Hield with a broader shot mix would be a boundary case.

About 90 sec.

your features pick your clusters

same players, three feature sets

we have 317 players and k=5 in all three runs

shot mix

pct_RA, pct_PAINT,
pct_MID, pct_C3,
pct_ATB3

→ archetypes

shot volume

FGA (one feature)

→ tiers: stars to bench

shot efficiency

efg_pct (effective FG%)

→ tiers: skill ladder

same data, same k, same algorithm. three different stories.

Three feature recipes on the same 317 players, all at k=5.

Shot mix (5 features): asks “what kind of shots?” → archetypes (this section so far).
Shot volume (1 feature, FGA): asks “how many?” → stars-vs-role-players tiers. K-means on one feature reduces to sorting players and splitting them into five contiguous ranges.
Shot efficiency (1 feature, eFG%): asks “how well?” → skill ladder. eFG% counts each made three as 1.5 makes; it’s the NBA’s standard shooting efficiency stat.

The single-feature versions are still valid clusterings, just one-dimensional. Volume tier 0 holds the highest-volume scorers; tier 4 holds the bench. Efficiency tier 0 holds the most efficient shooters (rim finishers and rebounders); tier 4 holds players shooting below league average.
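The claim that one-feature k-means amounts to sort-and-split can be checked directly. The attempt counts below are simulated (a made-up gamma distribution shifted above 200), not real FGA.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
fga = rng.gamma(shape=2.0, scale=300.0, size=317) + 200  # made-up attempt counts

x = ((fga - fga.mean()) / fga.std()).reshape(-1, 1)  # standardize the single feature
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(x)

# Sort players by FGA and count how often the label changes along the sorted list
sorted_labels = labels[np.argsort(fga)]
n_switches = int(np.sum(sorted_labels[1:] != sorted_labels[:-1]))
print(n_switches)  # 4: five contiguous tiers, four boundaries
```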

About 60 sec.

three stories, eight spotlight players

each row = one player labeled three ways. color encodes the tier.

features scramble clusters more than seeds do

three sources of disagreement, ranked by ARI:

across seeds — 10 k-means runs at k=5: mean ARI ≈ 0.62
across algorithms — k-means k=5 vs. hierarchical k=5: ARI ≈ 0.57
across feature sets — shot mix vs. volume / shot mix vs. efficiency / volume vs. efficiency: ARI ≈ 0

feature choice scrambles the partition an order of magnitude more than the random seed or the algorithm does
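A sketch of the feature-recipe comparison. The three feature sets below are simulated and mutually independent by construction, so near-zero ARI is built in here; the point is only to show the mechanics of clustering the same rows three ways and comparing the partitions. All names and distributions are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 317
# Synthetic, mutually independent stand-ins for the three recipes
shot_mix = rng.dirichlet(np.ones(5), size=n)        # five shares that sum to 1
volume = rng.gamma(2.0, 300.0, size=(n, 1)) + 200   # FGA-like counts
efficiency = rng.normal(0.54, 0.04, size=(n, 1))    # eFG%-like values

def cluster(features, k=5):
    """Standardize, then run k-means with 10 starts."""
    X = StandardScaler().fit_transform(features)
    return KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)

labels = {name: cluster(f) for name, f in
          [("mix", shot_mix), ("volume", volume), ("efficiency", efficiency)]}

ari_mix_volume = adjusted_rand_score(labels["mix"], labels["volume"])
ari_mix_eff = adjusted_rand_score(labels["mix"], labels["efficiency"])
ari_volume_eff = adjusted_rand_score(labels["volume"], labels["efficiency"])
print(round(ari_mix_volume, 3), round(ari_mix_eff, 3), round(ari_volume_eff, 3))
```

On real players the features are not independent, so whether the ARIs land near zero is an empirical finding rather than a given.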

The pedagogical payoff for ARI. Three ways two clusterings of the same 317 players might disagree, ranked from least to most:

Seeds (within k-means at k=5): mean cross-seed ARI ≈ 0.62 — moderate, with some pairs at 1 and some as low as 0.37.
Algorithms (k-means vs. hierarchical, both k=5): ARI ≈ 0.57 — similar magnitude to seed instability. Two different algorithms agree about as much as two different starts of the same algorithm.
Feature sets (shot mix vs. shot volume vs. shot efficiency, all at k=5): all three pairwise ARIs land near zero — the volume-vs-efficiency pair is even slightly negative. Two analysts using different feature recipes produce essentially unrelated partitions of the same players.

The order of magnitude matters. Worry about your random seed, sure. Worry about which clustering algorithm, sure. But worry about your feature choice an order of magnitude more — that’s where the big variability lives.

About 90 sec.

which clustering for which question?

GM looking for a Hield replacement → cluster on shot mix (find players who fill the same role)
GM ranking trade targets by raw production → cluster on volume (find players who do the same amount)
GM valuing a contract → cluster on efficiency (find players who score well per attempt)

there’s no “true” clustering of these players. each recipe answers a different question; the analyst picks which question the clustering should answer.

four cases from the literature

clustering as the scaffolding of a cell atlas

100,605 mouse cells, 20 organs, no labels
PCA + graph clustering → ~50 clusters
each cluster annotated by its marker genes — decades of cell biology
almost every named cell type came back out
validation came from outside the algorithm

Tabula Muris Consortium, Nature 2018, Fig. 2

Tabula Muris (2018) clustered single-cell RNA-seq expression vectors across 100,605 cells from 20 mouse organs. Method: PCA on the variable genes, then graph-based (Seurat / Louvain) clustering on the PC scores. Each cluster was annotated by overlap with known marker genes for cell types — decades of cell-biology priors. Almost every named cell type was recovered.

The point for our students: this is a good use of clustering, but the goodness comes from what they did after the labels came out — they validated each cluster against an external standard (marker genes). The clustering didn’t reveal the truth; it organized the data well enough that biologists could verify each group against existing knowledge.

Optional debrief prompt: would you trust this clustering if there were no curated marker-gene database to validate against?

About 90 sec.

clusters that change which drug you should take

184 + 187 + 68 asthma patients, 8 clinical features
hierarchical + k-means → 4 clusters
one cluster is discordant: high inflammation, few symptoms
RCT: treat by inflammation → exacerbations 3.53 → 0.38/yr (p = 0.002)
cluster membership predicted which treatment helped

Haldar et al., Am. J. Respir. Crit. Care Med. 2008, Fig. 1

Haldar et al. 2008 clustered three asthma cohorts (primary-care n=184, secondary-care refractory n=187, longitudinal RCT n=68) on 8 clinical and physiologic features: age of onset, sex, atopic status, BMI, peak-flow variability, FEV1 bronchodilator response, sputum eosinophil count, exhaled NO. Method: Ward’s hierarchical clustering to estimate k, then k-means with that k.

Two clusters appeared in both cohorts (early-onset atopic; obese non-eosinophilic). Two more appeared only in the refractory cohort: symptom-predominant (lots of symptoms, no eosinophils) and inflammation-predominant (eosinophilic, few symptoms). These two clusters are discordant — symptoms don’t match the underlying inflammation.

The validation: an RCT randomized patients in the inflammation-predominant cluster to be managed either by symptoms (the usual practice) or by inflammation markers. Exacerbations fell from 3.53 to 0.38 per patient per year (p = 0.002). Symptom-predominant patients managed by symptoms got 1,829 μg less beclomethasone-equivalent daily (p = 0.02).

Why this is the strongest case in this section: cluster membership predicted which treatment helped. That’s a much harder test than “clusters look biologically interpretable” or “clusters separate cleanly in feature space.” If different treatments work better in different clusters, the clusters carve at a real joint.

Discussion prompt: is “different treatments work better in different clusters” a stronger or weaker validation than “the clusters look biologically interpretable”?

About 90 sec.

beautiful clusters, no replication

Drysdale 2017, Nature Medicine. n = 1,188 depressed patients
CCA + hierarchical → 4 depression biotypes, 82–93% accuracy
Dinga 2019 replication, n = 187: permutation p of canonical correlations 0.64, 0.99
cluster p = 0.45. cross-validated correlations ~0
noise made visible by an under-regularized projection

Drysdale et al., Nat. Med. 2017, Fig. 1; Dinga et al., NeuroImage: Clinical 2019

The canonical recent failure of high-dimensional clustering. Drysdale et al. 2017 (Nature Medicine) used canonical correlation analysis (CCA) on resting-state fMRI connectivity features + clinical symptom scores, then hierarchical clustering on the CCA-projected fMRI scores. Four “biotypes” of depression, claimed 82–93% sensitivity/specificity, claimed predictive of TMS treatment response.

Dinga et al. 2019 ran the same pipeline on independent data (n = 187 from NESDA + MOTAR). Permutation tests on the canonical correlations: p = 0.64 and 0.99 — non-significant. Cross-validated correlations on held-out data dropped to ~0. The four-cluster solution itself had p = 0.45 (Calinski-Harabasz) and p = 0.19 (silhouette) under the null. Cluster assignments were unstable to leaving out a single subject.

The mechanism: CCA can always find linear combinations of two high-dimensional feature sets that look correlated, even when the data is noise. Once you permutation-test against the null, the structure disappears. The original analysis didn’t do that step.
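The mechanism can be reproduced on pure noise. This is a generic NumPy illustration of unregularized CCA with hypothetical dimensions, not the pipeline of either paper: it fits the top canonical pair on independent random matrices, then checks the same weights on held-out noise and against a permutation null.

```python
import numpy as np

def first_canonical_pair(X, Y):
    """Return (correlation, x_weights, y_weights) for the top canonical pair."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    a = np.linalg.solve(Rx, U[:, 0])  # weights such that Xc @ a = Qx @ U[:, 0]
    b = np.linalg.solve(Ry, Vt[0])
    return s[0], a, b

rng = np.random.default_rng(0)
n, p, q = 100, 60, 10
# Pure noise: X and Y are independent by construction
X_train, Y_train = rng.normal(size=(n, p)), rng.normal(size=(n, q))
X_test, Y_test = rng.normal(size=(1000, p)), rng.normal(size=(1000, q))

r_in, a, b = first_canonical_pair(X_train, Y_train)
r_out = np.corrcoef((X_test - X_train.mean(axis=0)) @ a,
                    (Y_test - Y_train.mean(axis=0)) @ b)[0, 1]

# Permutation test: shuffle the rows of Y to break any X-Y link
perm_r = [first_canonical_pair(X_train, Y_train[rng.permutation(n)])[0]
          for _ in range(200)]
p_value = (1 + sum(r >= r_in for r in perm_r)) / (1 + len(perm_r))

print(f"in-sample r = {r_in:.2f}, held-out r = {r_out:.2f}, permutation p = {p_value:.2f}")
```

The in-sample correlation is high even though nothing links X and Y; the held-out correlation and the permutation null are what expose it.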

This echoes our chapter’s ARI experiment: clustering will always find something, in high dimensions especially; the question is whether what it finds replicates. Without a permutation test or out-of-sample replication, you can’t tell.

Discussion prompt: if you ran clustering on a high-dimensional dataset and got beautiful-looking groups, what would you do before publishing them?

About 90 sec.

same lesson, even in physics

APOGEE survey. 153,847 stellar spectra. k-means in raw flux
some real populations recover: dwarfs vs. giants, bulge vs. halo
authors: “a discrete classification in flux space does not result in a neat organisation in the parameters space”
sensitive to initialization too
populations real, features wrong — flux distance ≠ physics distance

Garcia-Dias et al., Astronomy & Astrophysics 2018, Fig. 9

Garcia-Dias et al. 2018 ran k-means on 153,847 stellar spectra from APOGEE (resolution R ≈ 22,500). Features: normalized flux as a function of wavelength — the raw spectra. Not pre-computed astrophysical parameters (temperature, gravity, metallicity).

The clusters recovered some known stellar populations: dwarfs vs. giants, sub-giants, red-clump and red-giant-branch stars, bulge vs. halo. Partially. The authors flagged two problems explicitly: “a discrete classification in flux space does not result in a neat organisation in the parameters space,” and the algorithm was sensitive to initial seeds.

The lesson: the underlying objects (stellar populations) are physically real: they differ in mass, age, and metallicity. Yet k-means in flux space only partially recovered them, because Euclidean distance in raw flux space is not a clean proxy for distance in the parameter space astronomers actually care about. Same lesson as the NBA chapter (cluster on shot mix and you get archetypes; cluster on volume and you get tiers): the features pick the clusters, and raw flux is the wrong feature set for this question.

Discussion prompt: if the stellar populations are physically real, why didn’t clustering on raw spectra recover them cleanly?

About 90 sec.

same lesson, four fields

biology — Tabula Muris: clustering as scaffolding, validated by external marker-gene priors → GOOD
medicine — Haldar asthma: clusters validated by RCT treatment response → GOOD
neuroscience — Drysdale biotypes: beautiful clusters, no permutation test, no replication → BAD
astrophysics — APOGEE: real populations, wrong feature space, partial recovery → MIXED

the algorithm always returns clusters. the verdict comes from what you do after the labels come out.

Synthesis. Four published cases, four different validations, two GOOD / one BAD / one MIXED.

The thread: the algorithm always returns clusters. Whether to trust them depends on what validation step you can do — external marker priors (Tabula Muris), randomized treatment outcomes (Haldar), permutation + replication (Drysdale → Dinga), or known underlying objects (APOGEE).

Plus our case from class: NBA — archetypes work because the question is interpretive (which shot mix? which players play alike?), and feature choice maps cleanly onto the GM’s decision (Hield replacement vs. trade ranking vs. contract valuation).

Students should leave with the instinct: before I trust any cluster summary, I ask what features were fed in, what validation step is available, and whether anyone has actually done it.

About 90 sec.

the clustering gotcha checklist

before reporting any cluster, ask:

what features did I include? — IDs, geography, dates often dominate
what features did I drop, and why? — feature choice is the model
is k defensible? — elbow + silhouette + the actual question
are the clusters stable? — n_init=10+, multiple seeds, ARI heatmap
can I name each cluster from its profile? — if not, it’s an artifact
does the clustering answer a question someone has? — if no, don’t ship
what value judgment does the feature choice embed? — every feature set is a theory of similarity

summary

k-means — assign, recompute, repeat; minimizes SSE = \sum_k \sum_{i \in C_k} \|x_i - \mu_k\|^2
standardize — same lesson as PCA, Euclidean distance can’t see units
elbow + silhouette — rule out bad choices, not oracles for k
non-deterministic — different seeds, different clusters; Rand-index / ARI quantifies agreement
your features pick your clusters — same 317 players → archetypes, tiers, or skill ladder; feature choice scrambles partitions more than seed or algorithm
validate from outside — trust comes from external evidence (marker genes, RCT outcomes, replication), not from the clustering itself

next: when validation isn’t enough

we’ve trained models, tested them, validated them.

what happens when the world changes after we deploy?

temporal leakage — random splits leak future information
feedback loops — predictions change the outcomes we measure
Goodhart’s law — when a measure becomes a target, it stops being a good measure

feedback

forms.gle/feedback

what worked? what didn’t? what’s still confusing?
