Clustering — K-Means

Finding structure that nobody labeled

An Airbnb host wants to know: who is my competition? If you’re renting a cozy Brooklyn studio, your competitive set isn’t a Manhattan penthouse. Can we automatically discover these market segments — and should we trust the groups the algorithm finds?

This task is unsupervised learning: there’s no “right answer” column. We’re looking for patterns in the data itself. In Chapter 14, PCA found a few directions that explain most of the variation in our Airbnb data. Today, we ask a different question: instead of directions, can we find groups? PCA compresses; k-means clusters. Both are unsupervised — no labels, just structure.

We’ll use k-means clustering — one of the simplest and most widely used clustering algorithms — and discover that finding clusters is easy, but trusting them is hard.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load data
DATA_DIR = 'data'

Preparing the data

Let’s load the Airbnb data and select features that describe each listing’s “market position”: price, size, location, reviews, and availability.

Code
# Load Airbnb data
airbnb = pd.read_csv(f'{DATA_DIR}/airbnb/listings.csv', low_memory=False)

# Clean price column (often stored as "$150.00" string)
airbnb['price'] = pd.to_numeric(
    airbnb['price'].astype(str).str.replace(r'[\$,]', '', regex=True),
    errors='coerce'
)

We select features that describe each listing’s market position and drop rows with missing values. Note that .dropna() silently removes incomplete rows — we should check how many are lost.

Code
# Select clustering features
cluster_features = ['price', 'latitude', 'longitude', 'accommodates',
                    'bedrooms', 'number_of_reviews', 'availability_365',
                    'minimum_nights']

n_before = len(airbnb)
listings = airbnb[cluster_features + ['room_type', 'neighbourhood_group_cleansed']].dropna()
print(f"After dropping NaN: {len(listings):,} rows (dropped {n_before - len(listings):,})")
After dropping NaN: 29,142 rows (dropped 0)

We need to handle room_type, which is a categorical variable. We’ll use one-hot encoding — creating a separate binary column for each room type. Why not ordinal encoding (Shared=0, Private=1, Entire=2)? Because k-means uses Euclidean distance, and ordinal encoding would place “Private room” exactly halfway between “Shared room” and “Entire home/apt,” making those two endpoints twice as far apart as adjacent categories. None of those distances mean anything.
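A quick standalone check makes the distance argument concrete (toy vectors for illustration, not the pipeline’s actual encoding):

```python
import numpy as np

# Ordinal codes impose an artificial geometry: Shared vs Entire come out
# twice as far apart as adjacent categories
ordinal = {'Shared room': 0.0, 'Private room': 1.0, 'Entire home/apt': 2.0}
print(abs(ordinal['Shared room'] - ordinal['Entire home/apt']))  # 2.0

# One-hot vectors make every pair of room types equally distant
onehot = {
    'Shared room':     np.array([1.0, 0.0, 0.0]),
    'Private room':    np.array([0.0, 1.0, 0.0]),
    'Entire home/apt': np.array([0.0, 0.0, 1.0]),
}
for a, b in [('Shared room', 'Private room'),
             ('Shared room', 'Entire home/apt')]:
    # Both pairs are sqrt(2) apart — no false ordering
    print(a, '<->', b, round(np.linalg.norm(onehot[a] - onehot[b]), 3))
```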

Code
# One-hot encode room_type for distance-based clustering
room_dummies = pd.get_dummies(listings['room_type'], prefix='room')
listings = pd.concat([listings, room_dummies], axis=1)
room_cols = list(room_dummies.columns)

print(f"Data for clustering: {len(listings):,} rows")
listings[cluster_features].head()
Data for clustering: 29,142 rows
   price   latitude  longitude  accommodates  bedrooms  number_of_reviews  availability_365  minimum_nights
0     45  40.754444 -73.895933             1         1                  3                 0               1
1     95  40.785097 -73.942337             2         1                101                74               2
2     60  40.709891 -73.957495             2         1                 19                 0               6
3     70  40.850388 -73.938551             2         1                  8               182               2
4    104  40.766609 -73.968214             2         1                  4               244              12

Let’s check the feature scales — these will matter for k-means, which uses Euclidean distance.

Code
listings[cluster_features].describe().round(1)
         price  latitude  longitude  accommodates  bedrooms  number_of_reviews  availability_365  minimum_nights
count  29142.0   29142.0    29142.0       29142.0   29142.0            29142.0           29142.0         29142.0
mean     133.0      40.7      -74.0           2.9       1.2               24.5             126.6             3.9
std      103.6       0.1        0.0           1.9       0.7               38.4             135.5            10.2
min        0.0      40.5      -74.2           1.0       0.0                1.0               0.0             1.0
25%       67.0      40.7      -74.0           2.0       1.0                3.0               0.0             1.0
50%      100.0      40.7      -74.0           2.0       1.0                9.0              74.0             2.0
75%      165.0      40.8      -73.9           4.0       1.0               29.0             259.0             3.0
max      999.0      40.9      -73.7          16.0      11.0              521.0             365.0           600.0

How k-means works

The algorithm is beautifully simple:

  1. Pick k initial centroids (randomly)
  2. Assign each point to the nearest centroid
  3. Recompute each centroid as the mean of its assigned points
  4. Repeat steps 2-3 until assignments stop changing
Definition: Inertia

K-means minimizes the total within-cluster sum of squared distances — called the inertia:

\[\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2\]

Each step of the algorithm is guaranteed to decrease this number (or keep it the same). But the objective is non-convex, so different starting points can converge to different local minima — more on that shortly.
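To make the objective concrete, here is a small standalone check (synthetic blobs, not the Airbnb data) that recomputes the inertia by hand from sklearn’s fitted labels and centroids and compares it to `km.inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs so the clustering is unambiguous
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)),
                   rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Inertia by hand: squared distance from each point to its assigned centroid
manual = sum(np.sum((X_toy[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
             for j in range(2))
print(np.isclose(manual, km.inertia_))  # True
```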

Let’s watch it work. We’ll use just two features — price and number of reviews — so we can visualize the steps.

Code
# Set up 2D visualization data
np.random.seed(42)
viz_features = ['price', 'number_of_reviews']
X_viz = listings[viz_features].values
scaler_viz = StandardScaler()
X_viz_scaled = scaler_viz.fit_transform(X_viz)

# Sample for cleaner visualization
sample_idx = np.random.choice(len(X_viz_scaled), size=2000, replace=False)
X_sample = X_viz_scaled[sample_idx]

# Initialize k=3 centroids randomly from data points
k = 3
init_idx = np.random.choice(len(X_sample), size=k, replace=False)
centroids = X_sample[init_idx].copy()

Now we run the assign-recompute loop manually, plotting each iteration.

Code
# Watch k-means iterate: assign, recompute, repeat
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
colors = ['#e74c3c', '#3498db', '#2ecc71']

for step in range(4):
    ax = axes[step // 2, step % 2]
    distances = np.array([np.sum((X_sample - c)**2, axis=1) for c in centroids])
    labels = distances.argmin(axis=0)
    for j in range(k):
        mask = labels == j
        ax.scatter(X_sample[mask, 0], X_sample[mask, 1],
                   c=colors[j], alpha=0.3, s=10)
        ax.scatter(centroids[j, 0], centroids[j, 1],
                   c=colors[j], marker='X', s=200, edgecolors='black', linewidths=2)
    ax.set_xlabel('Price (standardized)')
    ax.set_ylabel('Reviews (standardized)')
    ax.set_title(f'Iteration {step + 1}')
    # Recompute centroids
    for j in range(k):
        if (labels == j).sum() > 0:
            centroids[j] = X_sample[labels == j].mean(axis=0)

plt.suptitle('K-Means: Watch the centroids converge', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

Panel 1 shows random initial centroids. By Panel 4, the centroids have stabilized — assignments barely change. That’s k-means: dead simple, fast, and guaranteed to converge. (The algorithm converges in finite steps: there are finitely many possible assignments, and inertia decreases at every step.) The solution is only a local optimum, though, as we’ll see shortly.

But there’s a lot hiding behind that simplicity. Let’s dig in.

Running k-means on the full feature set

Let’s use all our clustering features with k=3 and see what emerges.

Just like PCA (Chapter 14), k-means is scale-sensitive: it measures Euclidean distance, so features on larger scales dominate. We standardize to give each feature equal voice. Try running this without standardization and you’ll see price (in dollars) completely dominate latitude (in degrees). Same lesson, same fix.
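A small synthetic illustration of the scale problem (made-up price and latitude columns, not the listings data — just two features with very different spreads):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-feature toy: price in dollars, latitude in degrees
rng = np.random.default_rng(0)
price = rng.uniform(50, 500, size=1000)   # spread of hundreds of dollars
lat = rng.uniform(40.5, 40.9, size=1000)  # spread of ~0.4 degrees
X_raw = np.column_stack([price, lat])

# Without scaling, k-means splits almost entirely along price
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_raw)
gap = np.abs(km.cluster_centers_[0] - km.cluster_centers_[1])
print(gap)  # large price gap, negligible latitude gap
```

Standardizing both columns first would put them on equal footing, and the split would no longer be driven by price alone.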

Code
# Standardize all features for clustering
features_for_clustering = cluster_features + room_cols
X_scaled = StandardScaler().fit_transform(listings[features_for_clustering].values)

# Fit k-means with k=3
kmeans_3 = KMeans(n_clusters=3, random_state=42, n_init=10)
listings['cluster_k3'] = kmeans_3.fit_predict(X_scaled)

print("Cluster sizes:")
print(listings['cluster_k3'].value_counts().sort_index())
Cluster sizes:
cluster_k3
0    13442
1      651
2    15049
Name: count, dtype: int64

What do these clusters look like? The mean feature values per cluster tell the story.

Code
# What do the clusters look like? Show mean features per cluster
cluster_profiles = listings.groupby('cluster_k3')[cluster_features].mean()
print("Cluster profiles (mean values):")
cluster_profiles.round(1).T
Cluster profiles (mean values):
cluster_k3              0      1      2
price                76.8   57.3  186.4
latitude             40.7   40.7   40.7
longitude           -73.9  -73.9  -74.0
accommodates          2.0    1.8    3.8
bedrooms              1.0    1.0    1.3
number_of_reviews    24.3   20.5   24.7
availability_365    136.3  174.4  115.8
minimum_nights        3.5    3.6    4.2
Think About It

We’re clustering on price, location, size, reviews, and availability. What do you think the clusters will capture — price tiers? listing types? something else? Make a prediction before running the next cell.

Code
# Scatter: price vs reviews, colored by cluster
fig, ax = plt.subplots(figsize=(8, 5))
for c in range(3):
    mask = listings['cluster_k3'] == c
    ax.scatter(listings.loc[mask, 'price'], listings.loc[mask, 'number_of_reviews'],
               alpha=0.3, s=10, label=f'Cluster {c}')
ax.set_xlabel('Price ($)')
ax.set_ylabel('Number of Reviews')
ax.set_title('Clusters: Price vs Reviews')
ax.legend(markerscale=5)
ax.set_xlim(0, 600)
ax.set_ylim(0, 300)
plt.show()

Now let’s map the clusters geographically.

Code
# Map: lat/lon colored by cluster
fig, ax = plt.subplots(figsize=(8, 5))
for c in range(3):
    mask = listings['cluster_k3'] == c
    ax.scatter(listings.loc[mask, 'longitude'], listings.loc[mask, 'latitude'],
               alpha=0.3, s=5, label=f'Cluster {c}')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Clusters on the NYC Map')
ax.legend(markerscale=5)
ax.text(-73.97, 40.78, 'Manhattan', fontsize=9, fontstyle='italic')
ax.text(-73.95, 40.68, 'Brooklyn', fontsize=9, fontstyle='italic')
plt.show()

Look at the map. The clusters partially reflect geographic patterns — listings in different boroughs do tend to cluster together — but they’re also driven by price and listing type. With lat/lon in the feature set, k-means mixes geography and market segment together, so the clusters don’t cleanly match boroughs.

Is that useful? We already have the borough column. The model spent its effort on geography instead of finding genuine market segments.

Choosing k: elbow plot and silhouette score

How did we know k=3 was right? We didn’t. K is a choice, not a discovery. But there are heuristics to help.

The elbow method plots the inertia (the k-means objective we defined above) as k increases. We look for an “elbow” where adding more clusters stops helping much.

Definition: Silhouette Score

The silhouette score measures how similar each point is to its own cluster vs the nearest other cluster. Range is [-1, 1]: a score near +1 means points sit comfortably inside their cluster; near 0 means they’re on the boundary between clusters; near -1 means they’re probably in the wrong cluster.
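The per-point formula is s = (b − a) / max(a, b), where a is the mean distance to the point’s own cluster and b the mean distance to the nearest other cluster. It can be checked by hand on a tiny made-up example:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny made-up 1-D dataset: two well-separated clusters
X_toy = np.array([[0.0], [0.2], [0.4], [10.0], [10.2]])
labels = np.array([0, 0, 0, 1, 1])

# Silhouette of the first point by hand: s = (b - a) / max(a, b)
a = np.mean([0.2, 0.4])    # mean distance to other points in its own cluster
b = np.mean([10.0, 10.2])  # mean distance to points in the nearest other cluster
s_manual = (b - a) / max(a, b)

print(np.isclose(s_manual, silhouette_samples(X_toy, labels)[0]))  # True
```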

Code
from sklearn.metrics import silhouette_samples

k_range = range(2, 11)
inertias = []
silhouettes = []

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels_k, sample_size=5000))

Let’s visualize both metrics side by side.

Code
# Elbow plot
fig, axes = plt.subplots(1, 2, figsize=(11, 4))

axes[0].plot(k_range, inertias, 'o-', color='steelblue')
axes[0].axvline(x=3, color='gray', linestyle='--', alpha=0.5, label='k=3 (our choice)')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia (within-cluster sum of squares)')
axes[0].set_title('Elbow Plot')
axes[0].annotate('No clear elbow — typical of real data',
                 xy=(5, inertias[3]), fontsize=9, fontstyle='italic')
axes[0].legend()

axes[1].plot(k_range, silhouettes, 'o-', color='coral')
axes[1].axvline(x=3, color='gray', linestyle='--', alpha=0.5, label='k=3 (our choice)')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score')
axes[1].legend()

plt.tight_layout()
plt.show()

Neither plot gives a definitive answer. The “elbow” is gradual — there’s no sharp bend. The silhouette score is highest at k=2 (just splitting the data in half). Choosing k requires judgment, not just a formula.

These heuristics won’t give you a magic number, but they can rule out bad choices. If the silhouette score is terrible at k=10, don’t use 10 clusters. Think of them as sanity checks, not oracles.

But there’s an even bigger issue than choosing k. The clusters you get depend on which features you include — and that choice can completely change the story.

Geography dominates (unless you remove it)

Let’s run k-means without latitude and longitude and see if we get genuinely different market segments.

Code
# Clustering WITHOUT geography
features_no_geo = ['price', 'accommodates', 'bedrooms',
                   'number_of_reviews', 'availability_365',
                   'minimum_nights'] + room_cols
X_no_geo_scaled = StandardScaler().fit_transform(listings[features_no_geo].values)

kmeans_no_geo = KMeans(n_clusters=3, random_state=42, n_init=10)
listings['cluster_no_geo'] = kmeans_no_geo.fit_predict(X_no_geo_scaled)

How do the cluster profiles change without geography?

Code
# Compare cluster profiles
print("=== WITH geography (clusters are basically boroughs) ===")
print(listings.groupby('cluster_k3')[['price', 'accommodates', 'bedrooms']].mean().round(1))
print()
print("=== WITHOUT geography (clusters are market segments) ===")
print(listings.groupby('cluster_no_geo')[['price', 'accommodates', 'bedrooms']].mean().round(1))
=== WITH geography (clusters are basically boroughs) ===
            price  accommodates  bedrooms
cluster_k3                               
0            76.8           2.0       1.0
1            57.3           1.8       1.0
2           186.4           3.8       1.3

=== WITHOUT geography (clusters are market segments) ===
                price  accommodates  bedrooms
cluster_no_geo                               
0                76.7           2.0       1.0
1               186.5           3.8       1.3
2                57.3           1.8       1.0

The side-by-side maps make the difference vivid.

Code
# Side-by-side maps: WITH vs WITHOUT geography
fig, axes = plt.subplots(1, 2, figsize=(11, 4))

for c in range(3):
    mask = listings['cluster_k3'] == c
    axes[0].scatter(listings.loc[mask, 'longitude'], listings.loc[mask, 'latitude'],
                    alpha=0.3, s=5, label=f'Cluster {c}')
axes[0].set_title('WITH lat/lon: Clusters = Geography')
axes[0].set_xlabel('Longitude')
axes[0].set_ylabel('Latitude')
axes[0].legend(markerscale=5)

for c in range(3):
    mask = listings['cluster_no_geo'] == c
    axes[1].scatter(listings.loc[mask, 'longitude'], listings.loc[mask, 'latitude'],
                    alpha=0.3, s=5, label=f'Cluster {c}')
axes[1].set_title('WITHOUT lat/lon: Clusters = Market Segments')
axes[1].set_xlabel('Longitude')
axes[1].set_ylabel('Latitude')
axes[1].legend(markerscale=5)

plt.tight_layout()
plt.show()

Completely different! Without geography, the clusters are no longer separated on the map — they’re mixed across all boroughs. Instead, they capture differences in price, size, and listing type. These are actual market segments, not just zip codes.

Warning: Your features determine your clusters

If you include latitude and longitude (even standardized), they often dominate — spatial separation is strong and clean. Remove them, and you discover listing-type structure instead.

Code
# Visualize market segments: price vs accommodates
fig, ax = plt.subplots(figsize=(8, 5))
for c in range(3):
    mask = listings['cluster_no_geo'] == c
    n = mask.sum()
    ax.scatter(listings.loc[mask, 'accommodates'], listings.loc[mask, 'price'],
               alpha=0.3, s=10, label=f'Cluster {c} ({n:,} listings)')
ax.set_xlabel('Accommodates')
ax.set_ylabel('Price ($)')
ax.set_title('Market Segments (without geography)')
ax.set_ylim(0, 600)
ax.legend(markerscale=5)
plt.show()

Let’s name the segments by examining each cluster’s characteristics.

Code
# Interpret the market-segment clusters
for c in range(3):
    sub = listings[listings['cluster_no_geo'] == c]
    print(f"Cluster {c} ({len(sub):,} listings):")
    print(f"  Avg price: ${sub['price'].mean():.0f}/night | "
          f"Avg bedrooms: {sub['bedrooms'].mean():.1f} | "
          f"Avg reviews: {sub['number_of_reviews'].mean():.0f}")
    print(f"  Top room type: {sub['room_type'].value_counts().index[0]}")
    print()
Cluster 0 (13,438 listings):
  Avg price: $77/night | Avg bedrooms: 1.0 | Avg reviews: 24
  Top room type: Private room

Cluster 1 (15,053 listings):
  Avg price: $186/night | Avg bedrooms: 1.3 | Avg reviews: 25
  Top room type: Entire home/apt

Cluster 2 (651 listings):
  Avg price: $57/night | Avg bedrooms: 1.0 | Avg reviews: 20
  Top room type: Shared room

Now we can tell a story: one cluster is shared-room budget listings (~$57/night), another is private-room moderate listings (~$77/night), and the third is entire-home premium listings (~$186/night). Reviews per listing are similar across all three clusters — the key split is room type and price, not activity level.

These segments are useful for an Airbnb host trying to figure out their competitive set. The borough-based clusters told us nothing we didn’t already know.

Different random seeds, different clusters

K-means starts with random initial centroids. Different starting points can lead to different final clusters — the algorithm finds a local optimum, not the global best.

Think About It

If you ran k-means again with different starting points, would you get the same clusters?

We deliberately set n_init=1 below (one random start) to expose the instability. sklearn’s default n_init=10 helps by picking the best of 10 runs, but even that doesn’t guarantee the same answer every time. (sklearn also uses k-means++ initialization by default, which spreads out the initial centroids to avoid poor starts — but local minima remain a fact of life.)

To compare two different clusterings, we use the Adjusted Rand Index (ARI). The ARI asks: if we take any two listings, do both runs agree on whether they’re in the same cluster or different clusters? ARI = 1 means perfect agreement; ARI = 0 means the two runs agree no more than random chance (it can even dip slightly below zero for worse-than-chance agreement).
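Two properties worth verifying on toy labelings (standalone sketch, not the chapter’s experiment): ARI compares groupings rather than label names, and unrelated labelings score near zero.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Identical grouping under different integer names still scores 1.0
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_score(a, b))  # 1.0

# Two unrelated random labelings land near 0 (chance-level agreement)
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=2000)
y = rng.integers(0, 3, size=2000)
print(abs(adjusted_rand_score(x, y)) < 0.05)  # True
```

The label-permutation invariance matters here: two k-means runs may find the same groups but number them differently, and ARI correctly treats those runs as identical.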

Code
# Run k-means 10 times with different random seeds
from sklearn.metrics import adjusted_rand_score

all_labels = []
for seed in range(10):
    km = KMeans(n_clusters=3, random_state=seed, n_init=1)
    all_labels.append(km.fit_predict(X_no_geo_scaled))

Now we compare all pairs of runs using the ARI.

Code
# Compute pairwise ARI between all 10 runs
ari_matrix = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        ari_matrix[i, j] = adjusted_rand_score(all_labels[i], all_labels[j])

fig, ax = plt.subplots(figsize=(8, 6))
# Mask diagonal to focus on off-diagonal agreement
display_matrix = ari_matrix.copy()
np.fill_diagonal(display_matrix, np.nan)
sns.heatmap(display_matrix, annot=True, fmt='.2f', cmap='YlOrRd',
            ax=ax, vmin=0, vmax=1)
ax.set_xlabel('Run')
ax.set_ylabel('Run')
ax.set_title('Agreement Between 10 K-Means Runs (n_init=1)')
plt.tight_layout()
plt.show()

off_diag = ari_matrix[np.triu_indices(10, k=1)]
print(f"ARI range: {off_diag.min():.3f} to {off_diag.max():.3f}")
print(f"ARI mean:  {off_diag.mean():.3f}")

ARI range: 0.719 to 1.000
ARI mean:  0.903

Some pairs of runs produce nearly identical clusters (ARI close to 1). Others show moderate variation. The clusters are not stable — they depend on where k-means happened to start.

Think About It

If your clusters change depending on the random seed, can you really trust the labels? What does “Cluster 1” even mean if it’s different every time?

The AI gotcha

An AI assistant asked to “cluster these Airbnb listings” would:

  1. Run k-means with default settings
  2. Report 3 or 5 “beautiful” clusters
  3. Maybe make a scatter plot
  4. Move on

It would not:

  • Check whether the clusters are dominated by geography
  • Try different feature sets to see how clusters change
  • Run with multiple random seeds to check stability
  • Question whether k=3 is the right choice or just an arbitrary default
  • Interpret what each cluster actually represents

Clustering is easy to run and hard to validate. The algorithm will always produce clusters — even from random noise. Your job is to ask: are these clusters real, useful, and stable?
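A standalone demonstration of the noise point (synthetic uniform data, not the Airbnb listings): k-means happily partitions structureless data and even reports a positive silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Uniform random points: by construction there is no cluster structure
rng = np.random.default_rng(0)
X_noise = rng.uniform(size=(1000, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_noise)

# k-means still returns three sizeable "clusters" and a positive silhouette —
# the score alone cannot certify that real structure exists
print(np.bincount(km.labels_))
print(silhouette_score(X_noise, km.labels_) > 0)  # True
```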

Note: George Box on models

“All models are wrong, but some are useful.” The clusters are not “true” categories hiding in the data. They’re a simplification — and the question is whether that simplification helps you make a better decision.

Key Takeaways

  • K-means minimizes within-cluster sum of squares (inertia): \(\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2\). The algorithm always converges in finite steps (finitely many assignments, inertia decreasing at each step), but only to a local optimum.

  • Standardize first. K-means uses Euclidean distance, so features on larger scales dominate — same lesson as PCA.

  • Your features determine your clusters. Including latitude and longitude makes k-means rediscover boroughs. Remove them and you find listing-type segments. Neither is “right” — they answer different questions.

  • Choosing k is a judgment call, not a formula. The elbow plot and silhouette score provide guidance, but there’s no single correct answer.

  • K-means is not deterministic. Different random seeds produce different clusters. Always check stability by running multiple times.

  • Always interpret and validate. Clustering algorithms will find structure in any data, including pure noise. Ask: are the clusters meaningful? Are they stable? Are they useful for the decision you need to make?

Looking ahead

We’ve now covered three major types of learning:

  • Supervised (regression in Chapters 5-7, classification in Chapter 13)
  • Unsupervised for structure (PCA in Chapter 14 — finding directions that explain variation)
  • Unsupervised for grouping (clustering today — assigning points to discrete clusters)

Mathematically, PCA and k-means are cousins. PCA approximates the data matrix with a rank-k matrix; k-means approximates it with a matrix assigning each row to one of k prototypes. Both find low-dimensional summaries, but PCA gives a continuous representation while k-means forces each point into exactly one cluster.
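One standard way to make this kinship precise (a sketch, using notation the chapter has not introduced: for centered data \(X \in \mathbb{R}^{n \times p}\), \(Z\) is a score/assignment matrix and \(M\) a matrix of components or centroids), both methods minimize the same reconstruction objective:

```latex
\[
\min_{Z,\,M}\; \| X - Z M \|_F^2,
\qquad Z \in \mathbb{R}^{n \times k},\; M \in \mathbb{R}^{k \times p}
\]
% PCA: Z is unconstrained, so the optimum is the best rank-k approximation
%      (Eckart--Young), with M holding the top-k principal directions.
% k-means: each row of Z is constrained to a one-hot indicator, so row i of
%      ZM is the centroid of point i's cluster, and the objective reduces to
%      the inertia  \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2.
```

The only difference is the constraint on \(Z\): continuous scores for PCA, hard one-hot assignments for k-means.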

In Chapters 18-19, we’ll ask a fundamentally different question: not “what groups exist?” but “did this intervention cause that outcome?” The question shifts from prediction to causal inference.

Study guide

Key ideas

  • Unsupervised learning: finding structure in data without labeled outcomes.
  • K-means: clustering algorithm that alternates between assigning points to nearest centroid and recomputing centroids. It minimizes within-cluster sum of squares: \(\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2\).
  • Centroid: the mean of all points in a cluster.
  • Inertia: the k-means objective — total within-cluster sum of squared distances.
  • Silhouette score: measures how well each point fits its own cluster vs. the nearest other cluster (range [-1, 1]).
  • Elbow method: plotting inertia vs. k and looking for diminishing returns.
  • Local optimum: a solution that can’t be improved by small changes, but may not be the global best.
  • Adjusted Rand Index (ARI): measures agreement between two clusterings, adjusted for chance (1 = identical, 0 = random).
  • Standardization is essential for k-means, which uses Euclidean distance — same reason as PCA.
  • Feature selection determines what the clusters capture (geography vs. market segments).
  • K-means is sensitive to initialization — different random seeds can produce different clusters. The k-means++ initialization strategy picks initial centroids that are spread apart, reducing sensitivity to random starts.
  • The algorithm always finds clusters, even in pure noise — validation is your responsibility.

Computational tools

  • KMeans(n_clusters=k, random_state=42, n_init=10) — fit k-means clustering
  • km.fit_predict(X) — fit and return cluster labels
  • km.inertia_ — the k-means objective value (within-cluster sum of squares)
  • StandardScaler().fit_transform(X) — standardize features to zero mean, unit variance
  • silhouette_score(X, labels) — average silhouette score across all points
  • silhouette_samples(X, labels) — per-point silhouette values
  • adjusted_rand_score(labels1, labels2) — compare two clusterings

For the quiz

You should be able to:

  • Describe the k-means algorithm (assign, recompute, repeat) and state what it optimizes
  • Explain why standardization matters for k-means (and connect to PCA)
  • Interpret an elbow plot and a silhouette plot
  • Explain why different runs can produce different clusters (local optima, initialization)
  • Given cluster profiles, interpret what each cluster represents
  • Explain why feature selection changes the clusters you find