---
title: "Clustering — K-Means"
execute:
  enabled: true
jupyter: python3
---
## Finding structure that nobody labeled
An Airbnb host wants to know: who is my competition? If you're renting a cozy Brooklyn studio, your competitive set isn't a Manhattan penthouse. Can we automatically discover these market segments — and should we trust the groups the algorithm finds?
This task is **unsupervised learning**: there's no "right answer" column. We're looking for patterns in the data itself. In Chapter 14, PCA found a few directions that explain most of the variation in our Airbnb data. Today, we ask a different question: instead of *directions*, can we find *groups*? PCA compresses; k-means clusters. Both are unsupervised — no labels, just structure.
We'll use **k-means clustering** — one of the simplest and most widely used clustering algorithms — and discover that finding clusters is easy, but *trusting* them is hard.
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Load data
DATA_DIR = 'data'
```
## Preparing the data
Let's load the Airbnb data and select features that describe each listing's "market position": price, size, location, reviews, and availability.
```{python}
# Load Airbnb data
airbnb = pd.read_csv(f'{DATA_DIR}/airbnb/listings.csv', low_memory=False)
# Clean price column (often stored as "$150.00" string)
airbnb['price'] = pd.to_numeric(
    airbnb['price'].astype(str).str.replace(r'[\$,]', '', regex=True),
    errors='coerce'
)
```
We select features that describe each listing's market position and drop rows with missing values. Note that `.dropna()` silently removes incomplete rows — we should check how many are lost.
```{python}
# Select clustering features
cluster_features = ['price', 'latitude', 'longitude', 'accommodates',
                    'bedrooms', 'number_of_reviews', 'availability_365',
                    'minimum_nights']
n_before = len(airbnb)
listings = airbnb[cluster_features + ['room_type', 'neighbourhood_group_cleansed']].dropna()
print(f"After dropping NaN: {len(listings):,} rows (dropped {n_before - len(listings):,})")
```
We need to handle `room_type`, which is a categorical variable. We'll use **one-hot encoding** — creating a separate binary column for each room type. Why not ordinal encoding (Shared=0, Private=1, Entire=2)? Because k-means uses Euclidean distance, and ordinal encoding would imply that "Private room" sits exactly halfway between "Shared room" and "Entire home/apt," which has no meaningful interpretation.
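To see the distortion concretely, here is a toy calculation (the encodings below are illustrative, not taken from our data):

```{python}
import numpy as np

# Hypothetical ordinal encoding: Shared=0, Private=1, Entire=2
shared, private, entire = 0.0, 1.0, 2.0
print(abs(shared - entire), abs(shared - private))  # 2.0 vs 1.0: a fake ordering

# One-hot encoding: each room type is its own unit vector
onehot = {'Shared': np.array([1, 0, 0]),
          'Private': np.array([0, 1, 0]),
          'Entire': np.array([0, 0, 1])}
d1 = np.linalg.norm(onehot['Shared'] - onehot['Entire'])
d2 = np.linalg.norm(onehot['Shared'] - onehot['Private'])
print(d1, d2)  # both sqrt(2): all categories equally far apart
```

Under one-hot encoding, every pair of distinct categories is the same distance apart, which is exactly what we want when the categories have no natural order.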
```{python}
# One-hot encode room_type for distance-based clustering
room_dummies = pd.get_dummies(listings['room_type'], prefix='room')
listings = pd.concat([listings, room_dummies], axis=1)
room_cols = list(room_dummies.columns)
print(f"Data for clustering: {len(listings):,} rows")
listings[cluster_features].head()
```
Let's check the feature scales — these will matter for k-means, which uses Euclidean distance.
```{python}
listings[cluster_features].describe().round(1)
```
## How k-means works
The algorithm is beautifully simple:
1. **Pick k initial centroids** (randomly)
2. **Assign** each point to the nearest centroid
3. **Recompute** each centroid as the mean of its assigned points
4. **Repeat** steps 2-3 until assignments stop changing
:::{.callout-important}
## Definition: Inertia
K-means minimizes the total within-cluster sum of squared distances — called the **inertia**:
$$\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2$$
:::
Each step of the algorithm is guaranteed to decrease this number (or keep it the same). But the objective is non-convex, so different starting points can converge to different **local minima** — more on that shortly.
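To make "different local minima" concrete, here is a minimal from-scratch sketch of Lloyd's algorithm on made-up blob data. `kmeans_once` and `X_toy` are illustrative and not part of our Airbnb pipeline:

```{python}
import numpy as np

def kmeans_once(X, k, seed):
    """One run of Lloyd's algorithm from a random init; returns final inertia."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(100):
        # Assign: squared Euclidean distance to each centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute: mean of assigned points (keep old centroid if a cluster empties)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return d.min(axis=1).sum()

# Four synthetic blobs but k=3: which blobs get merged depends on the init
toy_rng = np.random.default_rng(0)
X_toy = np.concatenate([toy_rng.normal(loc, 0.3, size=(50, 2))
                        for loc in [(0, 0), (3, 0), (0, 3), (3, 3)]])
toy_inertias = [kmeans_once(X_toy, k=3, seed=s) for s in range(5)]
print([round(v, 1) for v in toy_inertias])
```

Depending on the seed, the final inertia can differ: each run is a genuine local minimum of the same objective, and none of them is guaranteed to be the global one.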
Let's watch it work. We'll use just two features — price and number of reviews — so we can visualize the steps.
```{python}
# Set up 2D visualization data
np.random.seed(42)
viz_features = ['price', 'number_of_reviews']
X_viz = listings[viz_features].values
scaler_viz = StandardScaler()
X_viz_scaled = scaler_viz.fit_transform(X_viz)
# Sample for cleaner visualization
sample_idx = np.random.choice(len(X_viz_scaled), size=2000, replace=False)
X_sample = X_viz_scaled[sample_idx]
# Initialize k=3 centroids randomly from data points
k = 3
init_idx = np.random.choice(len(X_sample), size=k, replace=False)
centroids = X_sample[init_idx].copy()
```
Now we run the assign-recompute loop manually, plotting each iteration.
```{python}
# Watch k-means iterate: assign, recompute, repeat
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
colors = ['#e74c3c', '#3498db', '#2ecc71']
for step in range(4):
    ax = axes[step // 2, step % 2]
    distances = np.array([np.sum((X_sample - c)**2, axis=1) for c in centroids])
    labels = distances.argmin(axis=0)
    for j in range(k):
        mask = labels == j
        ax.scatter(X_sample[mask, 0], X_sample[mask, 1],
                   c=colors[j], alpha=0.3, s=10)
        ax.scatter(centroids[j, 0], centroids[j, 1],
                   c=colors[j], marker='X', s=200, edgecolors='black', linewidths=2)
    ax.set_xlabel('Price (standardized)')
    ax.set_ylabel('Reviews (standardized)')
    ax.set_title(f'Iteration {step + 1}')
    # Recompute centroids
    for j in range(k):
        if (labels == j).sum() > 0:
            centroids[j] = X_sample[labels == j].mean(axis=0)
plt.suptitle('K-Means: Watch the centroids converge', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
```
Panel 1 shows random initial centroids. By Panel 4, the centroids have stabilized — assignments barely change. That's k-means: dead simple, fast, and guaranteed to converge. (Convergence takes finitely many steps: there are only finitely many possible assignments, inertia never increases, and the algorithm halts once assignments stop changing.) The solution is only a local optimum, though, as we'll see shortly.
But there's a lot hiding behind that simplicity. Let's dig in.
## Running k-means on the full feature set
Let's use all our clustering features with k=3 and see what emerges.
Just like PCA (Chapter 14), k-means uses Euclidean distance — so features on larger scales dominate. We **standardize** to give each feature equal voice. Try running this without standardization and you'll see price (in dollars) completely dominate latitude (in degrees). Same lesson, same fix.
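A toy distance calculation shows the problem (the two listings below are hypothetical):

```{python}
import numpy as np

# Two hypothetical listings: $10 apart in price, far apart on the map
listing_a = np.array([150.0, 40.65, -74.00])   # price, latitude, longitude
listing_b = np.array([160.0, 40.85, -73.85])

# Squared contribution of each feature to the Euclidean distance
contrib = (listing_a - listing_b) ** 2
print((contrib / contrib.sum()).round(4))  # price dwarfs the location gap
```

The modest $10 price difference contributes over 99% of the squared distance, while a location gap spanning much of the city is nearly invisible. Standardization fixes this imbalance.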
```{python}
# Standardize all features for clustering
features_for_clustering = cluster_features + room_cols
X_scaled = StandardScaler().fit_transform(listings[features_for_clustering].values)
# Fit k-means with k=3
kmeans_3 = KMeans(n_clusters=3, random_state=42, n_init=10)
listings['cluster_k3'] = kmeans_3.fit_predict(X_scaled)
print("Cluster sizes:")
print(listings['cluster_k3'].value_counts().sort_index())
```
What do these clusters look like? The mean feature values per cluster tell the story.
```{python}
# What do the clusters look like? Show mean features per cluster
cluster_profiles = listings.groupby('cluster_k3')[cluster_features].mean()
print("Cluster profiles (mean values):")
cluster_profiles.round(1).T
```
:::{.callout-tip}
## Think About It
We're clustering on price, location, size, reviews, and availability. What do you think the clusters will capture — price tiers? listing types? something else? Make a prediction before running the next cell.
:::
```{python}
# Scatter: price vs reviews, colored by cluster
fig, ax = plt.subplots(figsize=(8, 5))
for c in range(3):
    mask = listings['cluster_k3'] == c
    ax.scatter(listings.loc[mask, 'price'], listings.loc[mask, 'number_of_reviews'],
               alpha=0.3, s=10, label=f'Cluster {c}')
ax.set_xlabel('Price ($)')
ax.set_ylabel('Number of Reviews')
ax.set_title('Clusters: Price vs Reviews')
ax.legend(markerscale=5)
ax.set_xlim(0, 600)
ax.set_ylim(0, 300)
plt.show()
```
Now let's map the clusters geographically.
```{python}
# Map: lat/lon colored by cluster
fig, ax = plt.subplots(figsize=(8, 5))
for c in range(3):
    mask = listings['cluster_k3'] == c
    ax.scatter(listings.loc[mask, 'longitude'], listings.loc[mask, 'latitude'],
               alpha=0.3, s=5, label=f'Cluster {c}')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Clusters on the NYC Map')
ax.legend(markerscale=5)
ax.text(-73.97, 40.78, 'Manhattan', fontsize=9, fontstyle='italic')
ax.text(-73.95, 40.68, 'Brooklyn', fontsize=9, fontstyle='italic')
plt.show()
```
**Look at the map.** The clusters partially reflect geographic patterns — listings in different boroughs do tend to cluster together — but they're also driven by price and listing type. With lat/lon in the feature set, k-means mixes geography and market segment together, so the clusters don't cleanly match boroughs.
Is that useful? We already *have* the borough column. The model spent its effort on geography instead of finding genuine market segments.
## Choosing k: elbow plot and silhouette score
How did we know k=3 was right? We didn't. K is a **choice**, not a discovery. But there are heuristics to help.
The **elbow method** plots the inertia (the k-means objective we defined above) as k increases. We look for an "elbow" where adding more clusters stops helping much.
:::{.callout-important}
## Definition: Silhouette Score
The **silhouette score** measures how similar each point is to its own cluster vs the nearest other cluster. Range is [-1, 1]: a score near +1 means points sit comfortably inside their cluster; near 0 means they're on the boundary between clusters; near -1 means they're probably in the wrong cluster.
:::
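A small worked example makes the definition concrete. The five points and labels below are made up, with one point deliberately stranded between the two clusters:

```{python}
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1D data: two tight clusters, plus one point stranded between them
pts = np.array([[0.0], [0.2], [5.0], [5.2], [2.5]])
lab = np.array([0, 0, 1, 1, 0])   # the stranded point is assigned to cluster 0
s = silhouette_samples(pts, lab)
print(np.round(s, 2))  # the stranded point scores near zero
```

For the stranded point, a = 2.4 (mean distance to its own cluster) and b = 2.6 (mean distance to the nearest other cluster), so s = (2.6 − 2.4)/2.6 ≈ 0.08: right on the boundary, exactly as the definition predicts.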
```{python}
from sklearn.metrics import silhouette_samples
k_range = range(2, 11)
inertias = []
silhouettes = []
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    # sample_size speeds up the silhouette computation; fix random_state
    # so the subsample (and the score) is reproducible
    silhouettes.append(silhouette_score(X_scaled, labels_k,
                                        sample_size=5000, random_state=42))
```
Let's visualize both metrics side by side.
```{python}
# Elbow plot
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].plot(k_range, inertias, 'o-', color='steelblue')
axes[0].axvline(x=3, color='gray', linestyle='--', alpha=0.5, label='k=3 (our choice)')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia (within-cluster sum of squares)')
axes[0].set_title('Elbow Plot')
axes[0].annotate('No clear elbow — typical of real data',
                 xy=(5, inertias[3]), fontsize=9, fontstyle='italic')
axes[0].legend()
axes[1].plot(k_range, silhouettes, 'o-', color='coral')
axes[1].axvline(x=3, color='gray', linestyle='--', alpha=0.5, label='k=3 (our choice)')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score')
axes[1].legend()
plt.tight_layout()
plt.show()
```
Neither plot gives a definitive answer. The "elbow" is gradual — there's no sharp bend. The silhouette score is highest at k=2 (just splitting the data in half). Choosing k requires judgment, not just a formula.
These heuristics won't give you a magic number, but they can rule out bad choices. If the silhouette score is terrible at k=10, don't use 10 clusters. Think of them as sanity checks, not oracles.
But there's an even bigger issue than choosing k. The clusters you get depend on which features you include — and that choice can completely change the story.
## Geography dominates (unless you remove it)
Let's run k-means *without* latitude and longitude and see if we get genuinely different market segments.
```{python}
# Clustering WITHOUT geography
features_no_geo = ['price', 'accommodates', 'bedrooms',
                   'number_of_reviews', 'availability_365',
                   'minimum_nights'] + room_cols
X_no_geo_scaled = StandardScaler().fit_transform(listings[features_no_geo].values)
kmeans_no_geo = KMeans(n_clusters=3, random_state=42, n_init=10)
listings['cluster_no_geo'] = kmeans_no_geo.fit_predict(X_no_geo_scaled)
```
How do the cluster profiles change without geography?
```{python}
# Compare cluster profiles
print("=== WITH geography (clusters partly track boroughs) ===")
print(listings.groupby('cluster_k3')[['price', 'accommodates', 'bedrooms']].mean().round(1))
print()
print("=== WITHOUT geography (clusters are market segments) ===")
print(listings.groupby('cluster_no_geo')[['price', 'accommodates', 'bedrooms']].mean().round(1))
```
The side-by-side maps make the difference vivid.
```{python}
# Side-by-side maps: WITH vs WITHOUT geography
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for c in range(3):
    mask = listings['cluster_k3'] == c
    axes[0].scatter(listings.loc[mask, 'longitude'], listings.loc[mask, 'latitude'],
                    alpha=0.3, s=5, label=f'Cluster {c}')
axes[0].set_title('WITH lat/lon: Clusters = Geography')
axes[0].set_xlabel('Longitude')
axes[0].set_ylabel('Latitude')
axes[0].legend(markerscale=5)
for c in range(3):
    mask = listings['cluster_no_geo'] == c
    axes[1].scatter(listings.loc[mask, 'longitude'], listings.loc[mask, 'latitude'],
                    alpha=0.3, s=5, label=f'Cluster {c}')
axes[1].set_title('WITHOUT lat/lon: Clusters = Market Segments')
axes[1].set_xlabel('Longitude')
axes[1].set_ylabel('Latitude')
axes[1].legend(markerscale=5)
plt.tight_layout()
plt.show()
```
**Completely different!** Without geography, the clusters are no longer separated on the map — they're mixed across all boroughs. Instead, they capture differences in price, size, and listing type. These are actual market segments, not just zip codes.
:::{.callout-warning}
## Your features determine your clusters
If you include latitude and longitude (even standardized), they often dominate — spatial separation is strong and clean. Remove them, and you discover listing-type structure instead.
:::
```{python}
# Visualize market segments: price vs accommodates
fig, ax = plt.subplots(figsize=(8, 5))
for c in range(3):
    mask = listings['cluster_no_geo'] == c
    n = mask.sum()
    ax.scatter(listings.loc[mask, 'accommodates'], listings.loc[mask, 'price'],
               alpha=0.3, s=10, label=f'Cluster {c} ({n:,} listings)')
ax.set_xlabel('Accommodates')
ax.set_ylabel('Price ($)')
ax.set_title('Market Segments (without geography)')
ax.set_ylim(0, 600)
ax.legend(markerscale=5)
plt.show()
```
Let's name the segments by examining each cluster's characteristics.
```{python}
# Interpret the market-segment clusters
for c in range(3):
    sub = listings[listings['cluster_no_geo'] == c]
    print(f"Cluster {c} ({len(sub):,} listings):")
    print(f"  Avg price: ${sub['price'].mean():.0f}/night | "
          f"Avg bedrooms: {sub['bedrooms'].mean():.1f} | "
          f"Avg reviews: {sub['number_of_reviews'].mean():.0f}")
    print(f"  Top room type: {sub['room_type'].value_counts().index[0]}")
    print()
```
Now we can tell a story: one cluster is shared-room budget listings (~$57/night), another is private-room moderate listings (~$77/night), and the third is entire-home premium listings (~$186/night). Reviews per listing are similar across all three clusters — the key split is room type and price, not activity level.
These segments are useful for an Airbnb host trying to figure out their competitive set. The borough-based clusters told us nothing we didn't already know.
## Different random seeds, different clusters
K-means starts with random initial centroids. Different starting points can lead to different final clusters — the algorithm finds a **local optimum**, not the global best.
:::{.callout-tip}
## Think About It
If you ran k-means again with different starting points, would you get the same clusters?
:::
We deliberately set `n_init=1` below (a single random start) to expose the instability. Using `n_init=10`, as we do elsewhere in this chapter, helps by keeping the best of 10 runs, but even that doesn't guarantee the same answer every time. (scikit-learn also uses **k-means++** initialization by default, which spreads out the initial centroids to avoid poor starts, though local minima remain a fact of life.)
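Here is a sketch of the k-means++ seeding idea on synthetic blobs. It is simplified (scikit-learn's implementation adds refinements such as trying several candidate points per step), and `kmeans_pp_init` is illustrative:

```{python}
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Sketch of k-means++ seeding: after a uniform first pick, each later
    centroid is drawn with probability proportional to its squared distance
    from the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Three synthetic blobs: seeds tend to land in different blobs
rng_pp = np.random.default_rng(0)
X_demo = np.concatenate([rng_pp.normal(m, 0.2, size=(30, 2))
                         for m in [(0, 0), (4, 0), (0, 4)]])
init_pp = kmeans_pp_init(X_demo, k=3, rng=rng_pp)
print(init_pp.round(2))
```

Because a point already chosen as a centroid has squared distance zero, it can never be picked again, and far-away regions are strongly favored. That is why k-means++ starts tend to be much better than purely random ones.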
To compare two different clusterings, we use the **Adjusted Rand Index (ARI)**. The ARI asks: if we take any two listings, do both runs agree on whether they're in the same cluster or different clusters? ARI = 1 means perfect agreement; ARI = 0 means the two runs agree no more than random chance.
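A quick toy check shows what the ARI measures. Note that it ignores the arbitrary label ids and compares only the groupings:

```{python}
from sklearn.metrics import adjusted_rand_score

run_a = [0, 0, 1, 1, 2, 2]
run_b = [2, 2, 0, 0, 1, 1]   # identical grouping, different (arbitrary) label ids
run_c = [0, 1, 0, 1, 0, 1]   # a genuinely different grouping

print(adjusted_rand_score(run_a, run_b))  # 1.0: perfect agreement
print(adjusted_rand_score(run_a, run_c))  # near or below 0: no real agreement
```

This label-invariance matters for our stability check below: two k-means runs may find the same clusters but number them differently, and the ARI still correctly reports perfect agreement.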
```{python}
# Run k-means 10 times with different random seeds
from sklearn.metrics import adjusted_rand_score
all_labels = []
for seed in range(10):
    km = KMeans(n_clusters=3, random_state=seed, n_init=1)
    all_labels.append(km.fit_predict(X_no_geo_scaled))
```
Now we compare all pairs of runs using the ARI.
```{python}
# Compute pairwise ARI between all 10 runs
ari_matrix = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        ari_matrix[i, j] = adjusted_rand_score(all_labels[i], all_labels[j])
fig, ax = plt.subplots(figsize=(8, 6))
# Mask diagonal to focus on off-diagonal agreement
display_matrix = ari_matrix.copy()
np.fill_diagonal(display_matrix, np.nan)
sns.heatmap(display_matrix, annot=True, fmt='.2f', cmap='YlOrRd',
            ax=ax, vmin=0, vmax=1)
ax.set_xlabel('Run')
ax.set_ylabel('Run')
ax.set_title('Agreement Between 10 K-Means Runs (n_init=1)')
plt.tight_layout()
plt.show()
off_diag = ari_matrix[np.triu_indices(10, k=1)]
print(f"ARI range: {off_diag.min():.3f} to {off_diag.max():.3f}")
print(f"ARI mean: {off_diag.mean():.3f}")
```
Some pairs of runs produce nearly identical clusters (ARI close to 1). Others show moderate variation. The clusters are **not stable** — they depend on where k-means happened to start.
:::{.callout-tip}
## Think About It
If your clusters change depending on the random seed, can you really trust the labels? What does "Cluster 1" even mean if it's different every time?
:::
## The AI gotcha
An AI assistant asked to "cluster these Airbnb listings" would:
1. Run k-means with default settings
2. Report 3 or 5 "beautiful" clusters
3. Maybe make a scatter plot
4. Move on
It would **not**:
- Check whether the clusters are dominated by geography
- Try different feature sets to see how clusters change
- Run with multiple random seeds to check stability
- Question whether k=3 is the right choice or just an arbitrary default
- Interpret what each cluster actually represents
Clustering is easy to **run** and hard to **validate**. The algorithm will always produce clusters — even from random noise. Your job is to ask: are these clusters *real*, *useful*, and *stable*?
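You can verify the "clusters from noise" claim directly: feed k-means uniformly random points and it still returns a confident-looking partition.

```{python}
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Pure noise: uniform random points with no cluster structure at all
rng_noise = np.random.default_rng(42)
noise = rng_noise.uniform(size=(1000, 2))

km_noise = KMeans(n_clusters=3, random_state=42, n_init=10)
noise_labels = km_noise.fit_predict(noise)
sil_noise = silhouette_score(noise, noise_labels)

print(np.bincount(noise_labels))       # three sizeable "clusters"
print(f"silhouette: {sil_noise:.2f}")  # positive, despite zero structure
```

The silhouette score comes out positive even though there is no structure at all, a reminder that these diagnostics are relative scores, not proof that clusters exist.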
:::{.callout-note}
## George Box on models
"All models are wrong, but some are useful." The clusters are not "true" categories hiding in the data. They're a simplification — and the question is whether that simplification helps you make a better decision.
:::
## Key Takeaways
- **K-means** minimizes within-cluster sum of squares (inertia): $\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2$. The algorithm always converges in finitely many steps (finitely many possible assignments, inertia never increasing), but only to a **local** optimum.
- **Standardize first.** K-means uses Euclidean distance, so features on larger scales dominate — same lesson as PCA.
- **Your features determine your clusters.** Including latitude and longitude makes k-means rediscover boroughs. Remove them and you find listing-type segments. Neither is "right" — they answer different questions.
- **Choosing k is a judgment call**, not a formula. The elbow plot and silhouette score provide guidance, but there's no single correct answer.
- **K-means is not deterministic.** Different random seeds produce different clusters. Always check stability by running multiple times.
- **Always interpret and validate.** Clustering algorithms will find structure in *any* data, including pure noise. Ask: are the clusters meaningful? Are they stable? Are they useful for the decision you need to make?
## Looking ahead
We've now covered three major types of learning:
- **Supervised** (regression in Chapters 5-7, classification in Chapter 13)
- **Unsupervised for structure** (PCA in Chapter 14 — finding directions that explain variation)
- **Unsupervised for grouping** (clustering today — assigning points to discrete clusters)
Mathematically, PCA and k-means are cousins. PCA approximates the data matrix with a rank-k matrix; k-means approximates it with a matrix assigning each row to one of k prototypes. Both find low-dimensional summaries, but PCA gives a continuous representation while k-means forces each point into exactly one cluster.
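One way to see the kinship is to compare reconstruction errors on synthetic data. Both approximations have rank at most k, and PCA's is optimal among rank-k matrices by construction (this comparison is a sketch, not part of the chapter's pipeline):

```{python}
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic centered data; both methods build a low-rank approximation
rng_lr = np.random.default_rng(0)
X_syn = rng_lr.normal(size=(200, 5))
X_syn = X_syn - X_syn.mean(axis=0)

# PCA: best rank-2 approximation (continuous scores along 2 directions)
pca2 = PCA(n_components=2).fit(X_syn)
X_hat_pca = pca2.inverse_transform(pca2.transform(X_syn))
err_pca = ((X_syn - X_hat_pca) ** 2).sum()

# k-means: each row replaced by one of 2 prototypes (hard assignment)
km2 = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_syn)
X_hat_km = km2.cluster_centers_[km2.labels_]
err_km = ((X_syn - X_hat_km) ** 2).sum()

print(f"PCA rank-2 error:       {err_pca:.1f}")
print(f"k-means 2-proto error:  {err_km:.1f}")  # never beats PCA's optimum
```

The k-means approximation (a matrix with at most 2 distinct rows) is itself rank at most 2, so its error can never undercut PCA's: the hard assignment constraint is exactly what k-means trades for interpretable groups.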
In Chapters 18-19, we'll ask a fundamentally different question: not "what groups exist?" but "did this intervention cause that outcome?" The question shifts from prediction to **causal inference**.
## Study guide
### Key ideas
- **Unsupervised learning:** finding structure in data without labeled outcomes.
- **K-means:** clustering algorithm that alternates between assigning points to nearest centroid and recomputing centroids. It minimizes within-cluster sum of squares: $\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2$.
- **Centroid:** the mean of all points in a cluster.
- **Inertia:** the k-means objective — total within-cluster sum of squared distances.
- **Silhouette score:** measures how well each point fits its own cluster vs. the nearest other cluster (range [-1, 1]).
- **Elbow method:** plotting inertia vs. k and looking for diminishing returns.
- **Local optimum:** a solution that can't be improved by small changes, but may not be the global best.
- **Adjusted Rand Index (ARI):** measures agreement between two clusterings, adjusted for chance (1 = identical, 0 = random).
- Standardization is essential for k-means, which uses Euclidean distance — same reason as PCA.
- Feature selection determines what the clusters capture (geography vs. market segments).
- K-means is sensitive to initialization — different random seeds can produce different clusters. The k-means++ initialization strategy picks initial centroids that are spread apart, reducing sensitivity to random starts.
- The algorithm always finds clusters, even in pure noise — validation is your responsibility.
### Computational tools
- `KMeans(n_clusters=k, random_state=42, n_init=10)` — fit k-means clustering
- `km.fit_predict(X)` — fit and return cluster labels
- `km.inertia_` — the k-means objective value (within-cluster sum of squares)
- `StandardScaler().fit_transform(X)` — standardize features to zero mean, unit variance
- `silhouette_score(X, labels)` — average silhouette score across all points
- `silhouette_samples(X, labels)` — per-point silhouette values
- `adjusted_rand_score(labels1, labels2)` — compare two clusterings
### For the quiz
You should be able to:
- Describe the k-means algorithm (assign, recompute, repeat) and state what it optimizes
- Explain why standardization matters for k-means (and connect to PCA)
- Interpret an elbow plot and a silhouette plot
- Explain why different runs can produce different clusters (local optima, initialization)
- Given cluster profiles, interpret what each cluster represents
- Explain why feature selection changes the clusters you find