Backtesting and Time Series Validation

You build a model to predict tomorrow’s air quality. It scores R² = 0.66 on your test set — not bad. But when you deploy it, the model can’t even beat the simplest baseline: predicting “same as yesterday.” What happened?

The problem isn’t just your model — it’s how you evaluated it. When data has a temporal structure, the standard random train/test split can silently mislead you, and metrics like R² can hide the fact that your model adds no value over a trivial baseline. This chapter shows exactly how, and demonstrates the right way to validate time series models.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12

DATA_DIR = 'data'

Air quality in California

The EPA publishes daily Air Quality Index (AQI) measurements for every U.S. county. AQI measures how polluted the air is — 0 is pristine, 50 is “good,” and anything above 150 is “unhealthy.” In California, wildfire smoke can push AQI into the thousands.

Let’s load four years of data and focus on a few counties with good coverage.

Code
# Load all four years of EPA data, filtered to California
dfs = []
for year in range(2021, 2025):
    df_year = pd.read_csv(f'{DATA_DIR}/epa-air-quality/daily_aqi_by_county_{year}.csv')
    df_year = df_year[df_year['State Name'] == 'California']
    dfs.append(df_year)

aqi = pd.concat(dfs, ignore_index=True)
aqi['Date'] = pd.to_datetime(aqi['Date'])
aqi = aqi.sort_values(['county Name', 'Date']).reset_index(drop=True)

print(f"California records: {len(aqi):,}")
print(f"Counties: {aqi['county Name'].nunique()}")
print(f"Date range: {aqi['Date'].min().date()} to {aqi['Date'].max().date()}")
California records: 75,449
Counties: 53
Date range: 2021-01-01 to 2024-12-31

We’ll focus on three counties with good daily coverage and distinct air quality patterns: Los Angeles (urban smog), Sacramento (valley pollution), and Mono (extreme dust events).

Code
# Pick 3 counties with good daily coverage and interesting patterns
counties = ['Los Angeles', 'Sacramento', 'Mono']
aqi_sub = aqi[aqi['county Name'].isin(counties)].copy()

# Check coverage
for county in counties:
    n = aqi_sub[aqi_sub['county Name'] == county].shape[0]
    print(f"{county}: {n} daily observations")
Los Angeles: 1461 daily observations
Sacramento: 1461 daily observations
Mono: 1460 daily observations

What does air quality look like over time?

Before modeling anything, let’s look at the data. Notice the seasonal pattern — summer and fall have worse air quality (wildfire season). And notice the spikes: these are smoke events that dwarf normal variation.

Code
fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)

for ax, county in zip(axes, counties):
    county_data = aqi_sub[aqi_sub['county Name'] == county]
    ax.plot(county_data['Date'], county_data['AQI'], linewidth=0.5, alpha=0.8)
    ax.set_ylabel('AQI')
    ax.set_title(f'{county} County')
    ax.axhline(y=150, color='red', linestyle='--', alpha=0.5, label='Unhealthy threshold')
    ax.legend(loc='upper right')

axes[-1].set_xlabel('Date')
fig.suptitle('Daily AQI in California Counties (2021–2024)', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

Mono County is wild — AQI values above 1,000 (and once above 8,000!) from extreme dust storms near Mono Lake. Los Angeles and Sacramento show the typical California pattern: seasonal wildfire spikes layered on top of baseline urban pollution.

ImportantDefinition: Non-Stationarity

A time series is non-stationary when its statistical properties (mean, variance) change over time. The AQI data above is non-stationary: the mean and variance shift with the seasons, and extreme events (wildfire spikes) appear unpredictably. The rest of this chapter explores why non-stationarity undermines standard evaluation methods.

Building a prediction model

Let’s try to predict tomorrow’s AQI. A natural approach: use lag features — yesterday’s AQI, the 7-day average, etc. This approach is just the feature engineering from Chapter 6, applied to time.

We’ll focus on Los Angeles for a clean demonstration.

Code
# Work with Los Angeles
la = aqi_sub[aqi_sub['county Name'] == 'Los Angeles'][['Date', 'AQI']].copy()
la = la.sort_values('Date').reset_index(drop=True)

# Create lag features
la['aqi_lag1'] = la['AQI'].shift(1)       # yesterday
la['aqi_lag7_avg'] = la['AQI'].rolling(7).mean().shift(1)  # 7-day average (as of yesterday)
la['aqi_lag3_avg'] = la['AQI'].rolling(3).mean().shift(1)  # 3-day average
la['day_of_year'] = la['Date'].dt.dayofyear  # seasonality feature

# Drop rows with NaN from lagging
la = la.dropna().reset_index(drop=True)
print(f"LA dataset: {len(la)} days with features")
la.head(10)
LA dataset: 1454 days with features
Date AQI aqi_lag1 aqi_lag7_avg aqi_lag3_avg day_of_year
0 2021-01-08 88 84.0 96.857143 86.000000 8
1 2021-01-09 96 88.0 83.142857 87.333333 9
2 2021-01-10 71 96.0 85.857143 89.333333 10
3 2021-01-11 69 71.0 84.285714 85.000000 11
4 2021-01-12 69 69.0 83.142857 78.666667 12
5 2021-01-13 77 69.0 81.000000 69.666667 13
6 2021-01-14 86 77.0 79.142857 71.666667 14
7 2021-01-15 82 86.0 79.428571 77.333333 15
8 2021-01-16 93 82.0 78.571429 81.666667 16
9 2021-01-17 84 93.0 78.142857 87.000000 17

The standard split — and why it fails here

Recall from Chapter 7 that we split data randomly into training and test sets. That approach assumed each observation was independent of the others. Time-ordered observations violate that assumption.

Here’s what most tutorials (and most AI tools) would do — randomly shuffle the data and split 80/20. Let’s see how well it works.

Code
# Random 80/20 split -- THE WRONG WAY
np.random.seed(42)
features = ['aqi_lag1', 'aqi_lag7_avg', 'aqi_lag3_avg', 'day_of_year']

mask = np.random.rand(len(la)) < 0.8
train_std = la[mask]
test_std = la[~mask]

model_std = LinearRegression()
model_std.fit(train_std[features], train_std['AQI'])

pred_std = model_std.predict(test_std[features])
r2_std = r2_score(test_std['AQI'], pred_std)
mae_std = mean_absolute_error(test_std['AQI'], pred_std)

print(f"Random split results:")
print(f"  R-squared = {r2_std:.3f}")
print(f"  MAE = {mae_std:.1f}")
print(f"  Train size: {len(train_std)}, Test size: {len(test_std)}")
print()
print("Looks great! Ship it?")
Random split results:
  R-squared = 0.664
  MAE = 15.4
  Train size: 1151, Test size: 303

Looks great! Ship it?

R² about 0.66 — not bad for a first attempt. But a hidden problem lurks in the evaluation itself.

TipThink About It

With a random split, your test set includes days from the middle of the time series. A test-set day from June 2022 has its neighbors (June 1, June 3, …) in the training set. The lag features — “yesterday’s AQI” — are computed from data that’s right next to training points.

ImportantDefinition: Data Leakage (Temporal)

Data leakage occurs when information from the future (or from the test set) contaminates training, producing optimistically biased results. In time series, a random split causes leakage: test observations are highly correlated with nearby training observations. The lag feature “yesterday’s AQI” for a test-set day was likely in the training set.

Proper backtesting: temporal split

The fix is simple: train on the past, test on the future. This protocol mimics how the model would actually be used — you’d train it on historical data and deploy it to predict tomorrow. The diagram below shows the difference: random splitting scatters test points throughout the timeline, while a temporal split draws a clean boundary.

Code
# Visualize random vs temporal split
fig, axes = plt.subplots(2, 1, figsize=(10, 3), sharex=True)

dates = la['Date'].values
n = len(dates)

# Random split
colors_random = np.where(mask[:n], 'steelblue', 'orange')
axes[0].scatter(dates, np.ones(n), c=colors_random, s=2, alpha=0.6)
axes[0].set_yticks([])
axes[0].set_ylabel('Random\nsplit', rotation=0, ha='right', va='center')
axes[0].set_title('Train (blue) vs Test (orange): Random Split vs Temporal Split')

# Temporal split
cutoff = pd.Timestamp('2023-01-01')
colors_temporal = np.where(la['Date'] < cutoff, 'steelblue', 'orange')
axes[1].scatter(dates, np.ones(n), c=colors_temporal, s=2, alpha=0.6)
axes[1].axvline(x=cutoff, color='black', linewidth=1.5, linestyle='--', label='Train/test boundary')
axes[1].set_yticks([])
axes[1].set_ylabel('Temporal\nsplit', rotation=0, ha='right', va='center')
axes[1].legend(loc='upper right')

plt.tight_layout()
plt.show()

print("Top: random split scatters test points everywhere (leakage!).")
print("Bottom: temporal split puts all test points in the future (correct).")

Top: random split scatters test points everywhere (leakage!).
Bottom: temporal split puts all test points in the future (correct).
TipThink About It

We’re about to train on 2021–2022 and test on 2023. Will R-squared be higher, lower, or about the same as the random split? Why?

Let’s train on 2021–2022 and test on 2023.

Code
# Temporal split -- THE RIGHT APPROACH
train_temporal = la[la['Date'] < '2023-01-01']
test_temporal = la[(la['Date'] >= '2023-01-01') & (la['Date'] < '2024-01-01')]

model_temporal = LinearRegression()
model_temporal.fit(train_temporal[features], train_temporal['AQI'])

pred_temporal = model_temporal.predict(test_temporal[features])
r2_temporal = r2_score(test_temporal['AQI'], pred_temporal)
mae_temporal = mean_absolute_error(test_temporal['AQI'], pred_temporal)

print(f"Temporal split results (train: 2021-2022, test: 2023):")
print(f"  R-squared = {r2_temporal:.3f}")
print(f"  MAE = {mae_temporal:.1f}")
print(f"  Train size: {len(train_temporal)}, Test size: {len(test_temporal)}")
Temporal split results (train: 2021-2022, test: 2023):
  R-squared = 0.686
  MAE = 16.4
  Train size: 723, Test size: 365

Now let’s compare the two approaches side by side.

Code
# Side-by-side comparison
print("=" * 50)
print(f"{'Method':<25} {'R-sq':>8} {'MAE':>8}")
print("=" * 50)
print(f"{'Random split (WRONG)':<25} {r2_std:>8.3f} {mae_std:>8.1f}")
print(f"{'Temporal split (RIGHT)':<25} {r2_temporal:>8.3f} {mae_temporal:>8.1f}")
print("=" * 50)
print(f"{'Leakage gap':<25} {r2_std - r2_temporal:>8.3f} {mae_temporal - mae_std:>8.1f}")
print()
gap = r2_std - r2_temporal
if gap > 0.01:
    print(f"The random split inflated R-squared by {gap:.3f} -- that's the leakage.")
elif gap < -0.01:
    print(f"Here the temporal split slightly outperforms (gap = {gap:.3f}).")
    print("The leakage didn't help much -- but the correct evaluation is still the temporal one.")
else:
    print(f"The gap is small ({gap:.3f}), but the temporal split is the only honest evaluation.")
    print("Even when the numbers are similar, the random split's validity is compromised.")
==================================================
Method                        R-sq      MAE
==================================================
Random split (WRONG)         0.664     15.4
Temporal split (RIGHT)       0.686     16.4
==================================================
Leakage gap                 -0.021      1.0

Here the temporal split slightly outperforms (gap = -0.021).
The leakage didn't help much -- but the correct evaluation is still the temporal one.

In this dataset, the two R² values are close — the random split didn’t dramatically inflate performance. But the random split is still the wrong evaluation. Its validity is compromised: test-set days have their neighbors in the training set, so the lag features get to “cheat.” The fact that the numbers happen to be similar here doesn’t justify the method. The temporal split is the only honest evaluation because it mimics how the model would actually be deployed.

ImportantDefinition: Naive Baseline

The naive forecast predicts that tomorrow’s value equals today’s value: \(\hat{y}_{t+1} = y_t\). This baseline is the simplest possible forecast and the bar any model must clear. In the forecasting literature, other standard benchmarks include the seasonal naive method (predict the value from the same season last year) and the drift method (extrapolate the line between the first and last observations). These methods are collectively called benchmark forecasting methods.

Code
# Naive forecast baseline: predict "same as yesterday"
naive_pred = test_temporal['aqi_lag1']  # yesterday's AQI IS the naive forecast
mae_naive_baseline = mean_absolute_error(test_temporal['AQI'], naive_pred)
r2_naive_baseline = r2_score(test_temporal['AQI'], naive_pred)

print(f"Naive baseline (same as yesterday):")
print(f"  R-squared = {r2_naive_baseline:.3f}")
print(f"  MAE = {mae_naive_baseline:.1f}")
print()
print(f"Linear model (temporal split):")
print(f"  R-squared = {r2_temporal:.3f}")
print(f"  MAE = {mae_temporal:.1f}")
print()
print("The naive baseline is surprisingly hard to beat!")
print("This is common in time series: yesterday is a strong predictor of today.")
Naive baseline (same as yesterday):
  R-squared = 0.662
  MAE = 16.1

Linear model (temporal split):
  R-squared = 0.686
  MAE = 16.4

The naive baseline is surprisingly hard to beat!
This is common in time series: yesterday is a strong predictor of today.

This result illustrates why backtesting is essential. In production, your model only ever sees the past. Your evaluation should reflect that constraint.

Scale-independent accuracy: MAPE and MASE

MAE tells you the average error in the same units as your target — useful for one series, but hard to compare across series with different scales. Is an MAE of 5 good for Los Angeles (typical AQI ~50) and Mono County (typical AQI ~30, but spikes to 8,000)?

MAPE (Mean Absolute Percentage Error) expresses errors as percentages: \[\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%.\] MAPE is intuitive — “the forecast is off by 15% on average” — but it breaks when actual values are near zero (division by zero) and penalizes under-forecasts more heavily than over-forecasts. When the actual value is small, even a modest error produces a large percentage — MAPE disproportionately penalizes forecasts for low-value days.

MASE (Mean Absolute Scaled Error) compares your model’s error to the naive baseline’s error: \[\text{MASE} = \frac{\text{MAE of your model}}{\text{MAE of the naive forecast}}.\] MASE < 1 means you beat the naive forecast; MASE > 1 means you did worse than just predicting “same as yesterday.” This metric connects directly to the baseline comparison above.1

Code
# Compute MAPE and MASE on the temporal test set
actual = test_temporal['AQI'].values
predicted = pred_temporal
naive = test_temporal['aqi_lag1'].values

# MAPE — watch out for zeros
nonzero = actual != 0
mape = np.mean(np.abs((actual[nonzero] - predicted[nonzero]) / actual[nonzero])) * 100

# MASE — scale by naive forecast error
mae_model = np.mean(np.abs(actual - predicted))
mae_naive_ref = np.mean(np.abs(actual - naive))
mase = mae_model / mae_naive_ref

print(f"LA County AQI forecasts (temporal split, 2023):")
print(f"  MAE  = {mae_model:.2f}")
print(f"  MAPE = {mape:.1f}%")
print(f"  MASE = {mase:.3f}")
print()
if mase < 1:
    print(f"MASE = {mase:.3f} < 1: the model beats the naive baseline.")
else:
    print(f"MASE = {mase:.3f} >= 1: the model does NOT beat the naive baseline.")
print("MAPE gives an intuitive percentage, but would blow up if AQI hit 0.")
LA County AQI forecasts (temporal split, 2023):
  MAE  = 16.40
  MAPE = 20.4%
  MASE = 1.019

MASE = 1.019 >= 1: the model does NOT beat the naive baseline.
MAPE gives an intuitive percentage, but would blow up if AQI hit 0.

A MASE near 1 means our linear model barely matches the naive “same as yesterday” baseline. That’s not a failure of the method — it’s a genuine finding. AQI is highly autocorrelated, so yesterday’s value is a strong predictor that’s hard to improve on with a simple linear model. More sophisticated models (nonlinear, with weather covariates) might do better, but the naive baseline sets a high bar.

Walk-forward validation

A single temporal split gives you one number. But model performance can change over time. Walk-forward validation trains on an expanding window and tests on the next period, giving you a performance curve. In the forecasting literature, this is called time series cross-validation — the same idea as k-fold cross-validation, but respecting temporal order.

Here’s the idea. Each row is one “fold.” The training set (blue) expands forward in time, and the model is tested on the next month (orange):

Fold 1: [===== TRAIN =====][TEST]
Fold 2: [======= TRAIN =======][TEST]
Fold 3: [========= TRAIN =========][TEST]
Fold 4: [=========== TRAIN ===========][TEST]
         2021              2022              2023

We used an expanding window here (all past data). An alternative is a sliding window (fixed-width training set that moves forward), which can be better when very old data is from a different regime — exactly the distribution shift problem we’ll see later.

This distinction is especially important when the data has regime changes — like wildfire season.

Code
# Walk-forward: train on expanding window, test on next month
# (freq='MS' = month start: generates the 1st of each month)
results = []
test_months = pd.date_range('2023-01-01', '2024-01-01', freq='MS')

for i, test_start in enumerate(test_months[:-1]):
    test_end = test_months[i + 1]

    train_wf = la[la['Date'] < test_start]
    test_wf = la[(la['Date'] >= test_start) & (la['Date'] < test_end)]

    if len(test_wf) < 10 or len(train_wf) < 100:
        continue

    model_wf = LinearRegression()
    model_wf.fit(train_wf[features], train_wf['AQI'])
    pred_wf = model_wf.predict(test_wf[features])

    month_mae = mean_absolute_error(test_wf['AQI'], pred_wf)
    month_r2 = r2_score(test_wf['AQI'], pred_wf)
    results.append({
        'month': test_start,
        'mae': month_mae,
        'r2': month_r2,
        'n_test': len(test_wf),
        'max_aqi': test_wf['AQI'].max()
    })
    print(f"  {test_start.strftime('%b %Y')}: MAE = {month_mae:.1f}, R² = {month_r2:.3f}")

wf_results = pd.DataFrame(results)
wf_results['month_label'] = wf_results['month'].dt.strftime('%b %Y')
print(wf_results[['month_label', 'mae', 'r2', 'max_aqi']].to_string(index=False))
  Jan 2023: MAE = 9.3, R² = 0.130
  Feb 2023: MAE = 9.6, R² = -0.260
  Mar 2023: MAE = 6.8, R² = -0.607
  Apr 2023: MAE = 14.9, R² = 0.479
  May 2023: MAE = 15.0, R² = 0.576
  Jun 2023: MAE = 17.4, R² = 0.309
  Jul 2023: MAE = 27.6, R² = -0.145
  Aug 2023: MAE = 23.8, R² = 0.609
  Sep 2023: MAE = 22.3, R² = 0.187
  Oct 2023: MAE = 20.6, R² = 0.507
  Nov 2023: MAE = 14.3, R² = 0.222
  Dec 2023: MAE = 13.6, R² = 0.199
month_label       mae        r2  max_aqi
   Jan 2023  9.266960  0.130164       82
   Feb 2023  9.586316 -0.260122       97
   Mar 2023  6.789728 -0.607446       64
   Apr 2023 14.934882  0.479284      156
   May 2023 14.992056  0.575711      136
   Jun 2023 17.394106  0.309244      192
   Jul 2023 27.599075 -0.144907      210
   Aug 2023 23.828699  0.608855      208
   Sep 2023 22.339099  0.186713      164
   Oct 2023 20.577634  0.507455      172
   Nov 2023 14.298563  0.222461      164
   Dec 2023 13.556200  0.199320      153

The monthly results tell a richer story than a single annual number. Let’s visualize the performance over time.

Code
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 7), sharex=True)

ax1.bar(wf_results['month_label'], wf_results['mae'], color='steelblue', alpha=0.8)
ax1.set_ylabel('MAE')
ax1.set_title('Walk-Forward Validation: Monthly MAE (Los Angeles, 2023)')
ax1.axhline(y=mae_temporal, color='red', linestyle='--', alpha=0.5,
            label=f'Full-year MAE = {mae_temporal:.1f}')
ax1.legend()

ax2.bar(wf_results['month_label'], wf_results['max_aqi'], color='coral', alpha=0.8)
ax2.set_ylabel('Max AQI in Month')
ax2.set_title('Worst Air Quality Day Each Month')
ax2.axhline(y=150, color='red', linestyle='--', alpha=0.5, label='Unhealthy threshold')
ax2.legend()

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Notice how the model’s error spikes during summer months when wildfire smoke events push AQI to extremes. The model was trained on mostly “normal” air quality patterns and hasn’t seen enough extreme events to handle them.

This pattern is the value of walk-forward validation: it reveals when your model breaks, not just how well it does on average.

Lag features are just feature engineering

Recall from Chapter 6 that feature engineering transforms raw data into more useful predictors. Lag features are the same idea applied to time:

Feature What it captures
Yesterday’s AQI Short-term momentum
3-day average Recent trend
7-day average Weekly baseline
Day of year Seasonal pattern
Same day last year Year-over-year pattern

Let’s see which features matter most.

Code
# Which features contribute most?
print("Feature coefficients (temporal model):")
for feat, coef in zip(features, model_temporal.coef_):
    print(f"  {feat:<20s} {coef:+.4f}")
print(f"  {'intercept':<20s} {model_temporal.intercept_:+.4f}")
print()
print("Yesterday's AQI dominates -- AQI is highly autocorrelated.")
print("The seasonal feature (day_of_year) adds a small boost.")
Feature coefficients (temporal model):
  aqi_lag1             +0.7579
  aqi_lag7_avg         +0.2048
  aqi_lag3_avg         -0.1952
  day_of_year          +0.0103
  intercept            +18.2827

Yesterday's AQI dominates -- AQI is highly autocorrelated.
The seasonal feature (day_of_year) adds a small boost.

A second example: NBA performance prediction

You might be thinking: “This example is an air quality problem. My field is tech/finance/sports.” The temporal leakage trap can appear anywhere data has a time dimension. Let’s see what happens in a completely different domain.

If you could reliably predict how many points an NBA player will score tonight, you could make a fortune betting. Let’s try — predicting a player’s performance from their last 5 games. The temporal split is still the right evaluation: you’d train on past seasons and predict the current one.

Code
# Load NBA data
nba = pd.read_csv(f'{DATA_DIR}/nba/nba_load_management.csv')
nba['GAME_DATE'] = pd.to_datetime(nba['GAME_DATE'])
nba = nba.sort_values(['PLAYER_NAME', 'GAME_DATE']).reset_index(drop=True)

# Focus on players with enough games across all 3 seasons
games_per_player = nba.groupby('PLAYER_NAME').size()
active_players = games_per_player[games_per_player >= 150].index
nba_active = nba[nba['PLAYER_NAME'].isin(active_players)].copy()

print(f"Active players (150+ games): {len(active_players)}")
print(f"Records: {len(nba_active):,}")
print(f"Seasons: {nba_active['SEASON'].unique()}")
Active players (150+ games): 255
Records: 49,266
Seasons: <StringArray>
['2021-22', '2022-23', '2023-24']
Length: 3, dtype: str

We construct rolling lag features for each player using groupby combined with .shift() and .rolling() — the same temporal feature engineering from the AQI example, but now grouped by player.

Code
# Create rolling features: last-5-game averages
nba_active = nba_active.sort_values(['PLAYER_NAME', 'GAME_DATE']).reset_index(drop=True)
nba_active['pts_lag1'] = nba_active.groupby('PLAYER_NAME')['PTS'].shift(1)
nba_active['pts_lag5_avg'] = nba_active.groupby('PLAYER_NAME')['PTS'].transform(
    lambda x: x.rolling(5).mean().shift(1)
)
nba_active['min_lag5_avg'] = nba_active.groupby('PLAYER_NAME')['MIN'].transform(
    lambda x: x.rolling(5).mean().shift(1)
)

nba_model = nba_active.dropna(subset=['pts_lag1', 'pts_lag5_avg', 'min_lag5_avg']).copy()
nba_features = ['pts_lag1', 'pts_lag5_avg', 'min_lag5_avg', 'HOME']

print(f"Modeling dataset: {len(nba_model):,} player-games")
Modeling dataset: 47,991 player-games

Now we repeat the same experiment: random split versus temporal split. For the temporal split, we train on the first two seasons and test on the third.

Code
# Standard random split vs temporal split
np.random.seed(42)
mask_nba = np.random.rand(len(nba_model)) < 0.8

# Random split
model_r = LinearRegression()
model_r.fit(nba_model[mask_nba][nba_features], nba_model[mask_nba]['PTS'])
pred_r = model_r.predict(nba_model[~mask_nba][nba_features])
r2_r = r2_score(nba_model[~mask_nba]['PTS'], pred_r)

# Temporal split: train on seasons 1-2, test on season 3
train_nba = nba_model[nba_model['SEASON'].isin(['2021-22', '2022-23'])]
test_nba = nba_model[nba_model['SEASON'] == '2023-24'].copy()

model_t = LinearRegression()
model_t.fit(train_nba[nba_features], train_nba['PTS'])
pred_t = model_t.predict(test_nba[nba_features])
r2_t = r2_score(test_nba['PTS'], pred_t)

print(f"NBA prediction (points scored):")
print(f"  Random split R-squared:   {r2_r:.3f}")
print(f"  Temporal split R-squared: {r2_t:.3f}")
print(f"  Gap:                      {r2_r - r2_t:.3f}")
print()
gap_nba = r2_r - r2_t
if gap_nba > 0.01:
    print(f"Same pattern as AQI -- the random split is optimistically biased (gap = {gap_nba:.3f}).")
elif gap_nba < -0.01:
    print(f"Interesting: the temporal split slightly outperforms (gap = {gap_nba:.3f}).")
    print("The model generalizes well across seasons -- no leakage bias here.")
else:
    print(f"The two splits give similar results (gap = {gap_nba:.3f}).")
    print("The model generalizes well across seasons, but the temporal split is still the valid evaluation.")
NBA prediction (points scored):
  Random split R-squared:   0.468
  Temporal split R-squared: 0.484
  Gap:                      -0.017

Interesting: the temporal split slightly outperforms (gap = -0.017).
The model generalizes well across seasons -- no leakage bias here.

Unlike the AQI example, the NBA model generalizes well across seasons — the temporal split performs comparably to the random split. Player scoring patterns are more stable across years than air quality is across seasons. When temporal structure doesn’t create strong distribution shift, the random split may not inflate performance. But the temporal split is still the correct evaluation: it’s the only one that mirrors how you’d actually deploy the model (train on past seasons, predict the current one).

Even a model with decent R-squared doesn’t necessarily translate to profit in a betting market — the market already incorporates the information your model uses. The gap between statistical performance and economic value is a recurring theme in applied statistics.

When the future doesn’t look like the past

Let’s go back to air quality and look at something the model has never seen.

Mono County, near Mono Lake in eastern California, occasionally experiences extreme dust/wildfire events that push AQI into the thousands — sometimes above 8,000. That’s roughly 16x the “Hazardous” threshold of 500.

TipThink About It

If we train a model on “normal” days (AQI below 300), what would it predict for an AQI of 8,000? How far off would it be?

Code
# Mono County extremes
mono = aqi_sub[aqi_sub['county Name'] == 'Mono'][['Date', 'AQI']].copy()
mono = mono.sort_values('Date').reset_index(drop=True)

# Create lag features
mono['aqi_lag1'] = mono['AQI'].shift(1)
mono['aqi_lag7_avg'] = mono['AQI'].rolling(7).mean().shift(1)
mono['aqi_lag3_avg'] = mono['AQI'].rolling(3).mean().shift(1)
mono['day_of_year'] = mono['Date'].dt.dayofyear
mono = mono.dropna().reset_index(drop=True)

print("Mono County AQI distribution:")
print(mono['AQI'].describe().to_string())
print(f"\nDays above 500: {(mono['AQI'] > 500).sum()}")
print(f"Max AQI: {mono['AQI'].max()}")
Mono County AQI distribution:
count    1453.000000
mean       57.946318
std       351.039623
min         0.000000
25%        14.000000
50%        23.000000
75%        39.000000
max      8368.000000

Days above 500: 16
Max AQI: 8368

Let’s train a linear model on “normal” days (AQI below 300) and see what it predicts during extreme events.

Code
# Train a model on "normal" data (AQI < 300) and see what it predicts for extremes
normal_mono = mono[mono['AQI'] < 300]
extreme_mono = mono[mono['AQI'] >= 300]

model_mono = LinearRegression()
model_mono.fit(normal_mono[features], normal_mono['AQI'])

if len(extreme_mono) > 0:
    pred_extreme = model_mono.predict(extreme_mono[features])

    print("Model predictions vs reality during extreme events:")
    print(f"{'Actual AQI':>12} {'Predicted':>12} {'Error':>12}")
    print("-" * 40)
    for actual, pred in list(zip(extreme_mono['AQI'].values, pred_extreme))[:10]:
        print(f"{actual:>12.0f} {pred:>12.0f} {actual - pred:>12.0f}")
    print()
    print("The model can't extrapolate to values it's never seen.")
    print("This is DISTRIBUTION SHIFT -- the future can be fundamentally")
    print("different from the past.")
else:
    print("No extreme events found in Mono County subset.")
Model predictions vs reality during extreme events:
  Actual AQI    Predicted        Error
----------------------------------------
        1196           29         1167
         662           29          633
        2971           30         2941
         414           29          385
         761           29          732
         304           30          274
        3404           30         3374
         768           30          738
        7835           30         7805
        2050           38         2012

The model can't extrapolate to values it's never seen.
This is DISTRIBUTION SHIFT -- the future can be fundamentally
different from the past.

The numbers are stark, but the visualization makes the failure visceral.

Code
# Visualize the model's failure during extreme events
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(mono['Date'], mono['AQI'], linewidth=0.5, alpha=0.7, label='Actual AQI')

# Predict on all data
pred_all = model_mono.predict(mono[features])
ax.plot(mono['Date'], pred_all, linewidth=0.5, alpha=0.7, color='orange',
        label='Model prediction')

ax.axhline(y=500, color='red', linestyle='--', alpha=0.5, label='Hazardous (500)')
ax.set_ylabel('AQI')
ax.set_xlabel('Date')
ax.set_title('Mono County: Model vs Reality During Extreme Events')
ax.legend()
plt.tight_layout()
plt.show()

print("The orange line (predictions) can't reach the spikes.")
print("A linear model trained on normal conditions extrapolates nonsensically.")

The orange line (predictions) can't reach the spikes.
A linear model trained on normal conditions extrapolates nonsensically.
ImportantDefinition: Distribution Shift

Distribution shift occurs when future data comes from a fundamentally different regime than the training data. It is a severe form of non-stationarity — not just gradual changes in the mean, but a qualitative change the training data never captured.

Distribution shift is the deepest problem in time series forecasting:

  • Your training data covers normal conditions
  • The events that matter most — wildfires, pandemics, financial crashes — are precisely the ones your model has never seen
  • No amount of clever cross-validation alone fixes this; it’s a fundamental limitation of purely data-driven models

“Prediction is very difficult, especially about the future.” — attr. Niels Bohr

In Chapter 10, we saw that a clinical trial on adults can’t tell you about children. Same idea here: a model trained on AQI 0–300 can’t tell you what happens at AQI 8,000. You can’t extrapolate beyond the range of your data.

In Chapter 17, we’ll see that AutoML tools often make this same mistake — they use standard cross-validation by default and don’t know your data is temporal.

Distribution shift can also be gradual — styles of play, market conditions, or climate patterns that evolve slowly can erode model performance without triggering obvious alarms.

Prediction intervals: quantifying forecast uncertainty

A point forecast — “tomorrow’s AQI will be 62” — hides everything that matters for decision-making. Should the county issue a health advisory? That depends on whether 62 means “almost certainly between 55 and 70” or “could easily be 200.” Prediction intervals give you that range.

The idea: use the bootstrap. Resample residuals from the training set, add them to each forecast, and repeat many times. The spread of the resulting forecasts gives you a prediction interval.

Code
# Bootstrap prediction intervals for the temporal test set
np.random.seed(42)
train_resid = train_temporal['AQI'].values - model_temporal.predict(train_temporal[features])
n_boot = 1000

# Generate bootstrap forecasts by resampling residuals
boot_preds = np.zeros((n_boot, len(test_temporal)))
point_pred = model_temporal.predict(test_temporal[features])
for b in range(n_boot):
    resampled_resid = np.random.choice(train_resid, size=len(test_temporal), replace=True)
    boot_preds[b, :] = point_pred + resampled_resid

# 95% prediction interval: 2.5th and 97.5th percentiles
lower = np.percentile(boot_preds, 2.5, axis=0)
upper = np.percentile(boot_preds, 97.5, axis=0)

# How often does the actual value fall inside the interval?
coverage = np.mean((test_temporal['AQI'].values >= lower) & (test_temporal['AQI'].values <= upper))
avg_width = np.mean(upper - lower)

print(f"95% bootstrap prediction interval:")
print(f"  Coverage: {coverage:.1%} of test days (target: 95%)")
print(f"  Average width: {avg_width:.1f} AQI points")
95% bootstrap prediction interval:
  Coverage: 93.7% of test days (target: 95%)
  Average width: 89.5 AQI points

Let’s visualize the prediction intervals for the first two months of the test period. The shaded band shows the 95% interval around each point forecast.

Code
# Visualize prediction intervals for a 60-day window
fig, ax = plt.subplots(figsize=(10, 5))
window = slice(0, 60)
dates_w = test_temporal['Date'].values[window]

ax.fill_between(dates_w, lower[window], upper[window],
                alpha=0.25, color='steelblue', label='95% prediction interval')
ax.plot(dates_w, point_pred[window], color='steelblue', linewidth=1.2, label='Point forecast')
ax.plot(dates_w, test_temporal['AQI'].values[window], color='black',
        linewidth=0.8, alpha=0.8, label='Actual')
ax.set_ylabel('AQI')
ax.set_xlabel('Date')
ax.set_title('LA County AQI: Forecast with 95% Prediction Interval (Jan–Feb 2023)')
ax.legend()
plt.tight_layout()
plt.show()

print("Point forecasts alone are dangerous for decision-making.")
print("Always report intervals so stakeholders can plan for the plausible range.")

Point forecasts alone are dangerous for decision-making.
Always report intervals so stakeholders can plan for the plausible range.

This resampling approach is completely general: you can construct prediction intervals for any model, not just linear regression. And you’re not limited to our residual-resampling scheme — you can devise your own null model, simulate from it, and compare. The key principle is that a forecast without a measure of uncertainty isn’t a complete forecast.

These intervals assume stationary residuals. In practice, intervals should be wider during high-variance periods like wildfire season. A model that uses the same residual distribution year-round will undercover extreme events and overcover calm ones.

Key takeaways

  • Never randomly split time series data. Use temporal splits: train on the past, test on the future.

  • Walk-forward validation reveals when your model breaks, not just how well it does on average.

  • Lag features are feature engineering — yesterday’s value, rolling averages, same-day-last-year are all transformations of the raw data.

  • Backtesting is not prediction. A strategy that looks profitable in backtest may fail in production — the market already knows the same signals.

  • Distribution shift is the hardest problem. Models trained on normal conditions fail during extreme events.

  • Use scale-independent metrics (MAPE, MASE) to compare forecasts across different series. MASE < 1 means you beat the naive baseline.

  • Always report prediction intervals, not just point forecasts. Bootstrap resampling of residuals is a simple, general-purpose method.

Study guide

Key ideas

  • Backtesting (temporal split): Evaluating a model by training on past data and testing on future data, mimicking real deployment.
  • Data leakage: When information from the future (or the test set) contaminates training, producing optimistically biased results. Random train/test splits create leakage in time series: test observations are correlated with nearby training observations.
  • Lag features: Predictors constructed from past values of the time series (e.g., yesterday’s value, 7-day rolling average).
  • Non-stationarity: A time series whose statistical properties (mean, variance) change over time.
  • Walk-forward validation (time series cross-validation): Repeatedly training on an expanding (or sliding) window of past data and testing on the next period. Reveals when a model fails, not just its average performance.
  • Distribution shift: When future data comes from a fundamentally different regime than the training data. Models trained on normal conditions cannot predict unprecedented events.
  • Benchmark forecasting methods: The naive forecast (“same as yesterday”), seasonal naive (“same day last year”), and drift method (“extrapolate the trend”) are standard baselines. Any useful model must beat these.
  • MAPE (Mean Absolute Percentage Error): expresses forecast error as a percentage. Intuitive but undefined when actual values are zero.
  • MASE (Mean Absolute Scaled Error): ratio of your model’s MAE to the naive forecast’s MAE. MASE < 1 means you beat the naive baseline. Scale-independent, so useful for comparing across different series.
  • Prediction intervals: A range (e.g., 95%) around a point forecast capturing plausible future values. Bootstrap resampling of training residuals is a general-purpose method for constructing them.
  • Backtesting a strategy does not guarantee future profitability — markets already incorporate public information.

Computational tools

  • pd.Series.shift(n) — create lag features by shifting values forward or backward
  • pd.Series.rolling(k).mean() — compute rolling window averages
  • Temporal train/test split: filter by date (df[df['Date'] < cutoff])
  • Walk-forward loop: iterate over time periods with an expanding training window

For the quiz

You should be able to: (1) explain why random splitting fails for time series data, (2) implement a temporal train/test split given a date cutoff, (3) describe walk-forward validation (expanding and sliding window variants), (4) explain what distribution shift is and why no cross-validation method can fully address it, (5) compute and interpret MAPE and MASE, and (6) explain why prediction intervals matter and how bootstrap resampling produces them.


  1. Strictly, MASE uses the in-sample naive MAE as denominator (Hyndman & Koehler, 2006); we use the test-set version for simplicity.↩︎