Regression Inference + Diagnostics

Should an NBA team restructure its schedule around rest?

An NBA team’s analytics department claims that resting players boosts performance by 0.3 points per game and wants to restructure the schedule around that finding. Should the front office invest in roster flexibility on that basis? For contrast, we also examine Airbnb listings, where the effect IS real and large. Same tools, very different conclusions.

NoteWhy inference, not just prediction?

In Act 1, we built regression models to predict — given a listing’s features, what price should we expect? Prediction works well when you observe a new unit from the same population: a new listing appears, you plug in its features, you get a price.

But now consider a different question. An Airbnb host is thinking about spending $50,000 to convert a closet into a second bedroom. The regression coefficient on bedrooms is about +$34/night. Does that mean the renovation will pay for itself in four years?

Not necessarily. The coefficient tells you that listings with more bedrooms tend to cost more — but it doesn’t tell you that adding a bedroom to your listing would increase your price by $34. Maybe the real driver is apartment size, or building quality, or neighborhood. Larger apartments happen to have more bedrooms and higher prices, but the extra bedroom isn’t doing the work.

Prediction asks: what will happen to a new unit from this population? Inference asks: what would happen if I changed something? The second question is harder — and it’s the one that matters when the decision involves an investment, a policy, or an intervention. This chapter gives you the tools to start answering it.

NoteAct 1 meets Act 2

Act 1 built regression models for prediction. Act 2 introduced hypothesis testing and confidence intervals. This chapter combines both toolkits: we apply inference to regression coefficients and check whether the model’s assumptions justify that inference.

The central question: is each predictor in our model useful, or could we drop it without loss? Parsimony favors the simplest model that adequately fits the data (an idea sometimes called Occam’s razor). We also check whether the model’s assumptions hold using regression diagnostics.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.nonparametric.smoothers_lowess import lowess
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 12

DATA_DIR = 'data'

We define a helper function format_pvalue() to display very small p-values in scientific notation throughout this chapter.

Code
def format_pvalue(p):
    """Format p-values: scientific notation for very small values."""
    if p < 0.001:
        return f"{p:.2e}"
    return f"{p:.3f}"

Building a regression model for NBA performance

Recall from Chapter 11 that the raw correlation between rest and points was misleading – Simpson’s paradox showed that bench players dominate the extended-rest group. Can regression help us control for player quality and isolate the rest effect?

Code
# Load NBA data
nba = pd.read_csv(f'{DATA_DIR}/nba/nba_load_management.csv')
nba['GAME_DATE'] = pd.to_datetime(nba['GAME_DATE'])
nba_games = nba.dropna(subset=['REST_DAYS']).copy()

# Keep rest days of 7 or fewer to exclude long injury absences
nba_games = nba_games[nba_games['REST_DAYS'] <= 7].copy()

# Compute player-season averages as a measure of player quality
nba_games['PLAYER_SEASON_AVG'] = nba_games.groupby(
    ['PLAYER_ID', 'SEASON']
)['GAME_SCORE'].transform('mean')

# Compute opponent strength (opponent's average GAME_SCORE allowed)
opp_strength = nba_games.groupby('OPPONENT')['GAME_SCORE'].mean()
nba_games['OPP_STRENGTH'] = nba_games['OPPONENT'].map(opp_strength)

# Derive additional covariates from existing columns
# Back-to-back indicator: did the player play yesterday?
nba_games['BACK_TO_BACK'] = (nba_games['REST_DAYS'] == 1).astype(int)

# Minutes in previous game (proxy for fatigue/usage)
nba_games = nba_games.sort_values(['PLAYER_ID', 'GAME_DATE'])
nba_games['PREV_MIN'] = nba_games.groupby('PLAYER_ID')['MIN'].shift(1)

print(f"Observations: {len(nba_games):,}")
print(f"Players: {nba_games['PLAYER_NAME'].nunique()}")
print()
print("Variables for regression:")
print(f"  GAME_SCORE (response): mean={nba_games['GAME_SCORE'].mean():.2f}, std={nba_games['GAME_SCORE'].std():.2f}")
print(f"  REST_DAYS: mean={nba_games['REST_DAYS'].mean():.2f}")
print(f"  PLAYER_SEASON_AVG: mean={nba_games['PLAYER_SEASON_AVG'].mean():.2f}")
print(f"  HOME: {nba_games['HOME'].mean():.2%} home games")
print(f"  BACK_TO_BACK: {nba_games['BACK_TO_BACK'].mean():.2%} of games")
print(f"  PREV_MIN: mean={nba_games['PREV_MIN'].mean():.1f} (previous game minutes)")
Observations: 72,156
Players: 787

Variables for regression:
  GAME_SCORE (response): mean=8.85, std=7.85
  REST_DAYS: mean=2.27
  PLAYER_SEASON_AVG: mean=8.85
  HOME: 49.92% home games
  BACK_TO_BACK: 15.64% of games
  PREV_MIN: mean=23.5 (previous game minutes)

Before fitting any model, let’s look at the data. Is there a visible relationship between rest days and performance?

Code
# Visualize: game score by rest days
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(x='REST_DAYS', y='GAME_SCORE', data=nba_games, ax=axes[0], color='steelblue')
axes[0].set_xlabel('Rest Days')
axes[0].set_ylabel('Game Score')
axes[0].set_title('Game Score by Rest Days')

# Scatter with player quality on x-axis
axes[1].scatter(nba_games['PLAYER_SEASON_AVG'], nba_games['GAME_SCORE'],
               alpha=0.02, s=5, color='steelblue')
axes[1].set_xlabel('Player Season Average')
axes[1].set_ylabel('Game Score')
axes[1].set_title('Game Score vs Player Quality\n(the dominant predictor)')

plt.tight_layout()
plt.show()

print("The box plots look almost identical across rest days.")
print("Player quality clearly matters more. Can regression disentangle these effects?")

The box plots look almost identical across rest days.
Player quality clearly matters more. Can regression disentangle these effects?

Step-by-step model building: what does “controlling for” mean?

One of the most important ideas in regression is controlling for other variables. We build the model one predictor at a time and watch how the REST_DAYS coefficient changes.

TipThink About It

Before looking at the numbers, do you think the rest effect will be positive or negative? Large or small?

NoteThe matrix API

Statsmodels also provides a lower-level matrix interface, sm.OLS(y, X), where you pass arrays directly. The formula API (smf.ols('y ~ x1 + x2', data=df)) is more readable and handles categorical encoding automatically, so we use it throughout this course.

Code
# Model 1: naive — just rest days
model_1 = smf.ols('GAME_SCORE ~ REST_DAYS', data=nba_games).fit()

# Model 2: add player quality
model_2 = smf.ols('GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG', data=nba_games).fit()

# Model 3: full model with all controls
model_full = smf.ols(
    'GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG + HOME + OPP_STRENGTH',
    data=nba_games
).fit()

print("REST_DAYS coefficient as we add controls:")
print(f"  Model 1 (REST_DAYS only):           {model_1.params['REST_DAYS']:+.4f}  (p = {format_pvalue(model_1.pvalues['REST_DAYS'])})")
print(f"  Model 2 (+ PLAYER_SEASON_AVG):      {model_2.params['REST_DAYS']:+.4f}  (p = {format_pvalue(model_2.pvalues['REST_DAYS'])})")
print(f"  Model 3 (+ HOME + OPP_STRENGTH):    {model_full.params['REST_DAYS']:+.4f}  (p = {format_pvalue(model_full.pvalues['REST_DAYS'])})")
print()
print("This progression illustrates what 'controlling for' means — isolating the rest")
print("effect from confounders like player quality.")
REST_DAYS coefficient as we add controls:
  Model 1 (REST_DAYS only):           -0.6771  (p = 2.13e-130)
  Model 2 (+ PLAYER_SEASON_AVG):      -0.1464  (p = 4.56e-12)
  Model 3 (+ HOME + OPP_STRENGTH):    -0.1531  (p = 4.28e-13)

This progression illustrates what 'controlling for' means — isolating the rest
effect from confounders like player quality.

The full regression model

We predict game score from:

  • REST_DAYS: our variable of interest
  • PLAYER_SEASON_AVG: controls for player quality
  • HOME: home court advantage
  • OPP_STRENGTH: how good is the opponent?

\[\text{GAME\_SCORE}_i = \beta_0 + \beta_1 \cdot \text{REST\_DAYS}_i + \beta_2 \cdot \text{PLAYER\_SEASON\_AVG}_i + \beta_3 \cdot \text{HOME}_i + \beta_4 \cdot \text{OPP\_STRENGTH}_i + \varepsilon_i\]

Code
# Full model summary
print(model_full.summary().tables[1])
print()
print(f"R-squared: {model_full.rsquared:.4f}")
print(f"Observations: {int(model_full.nobs)}")
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -8.6638      0.493    -17.565      0.000      -9.631      -7.697
REST_DAYS            -0.1531      0.021     -7.248      0.000      -0.195      -0.112
PLAYER_SEASON_AVG     0.9965      0.004    232.879      0.000       0.988       1.005
HOME                  0.2913      0.044      6.627      0.000       0.205       0.377
OPP_STRENGTH          1.0050      0.055     18.224      0.000       0.897       1.113
=====================================================================================

R-squared: 0.4356
Observations: 72156

Interpreting the coefficients

ImportantDefinition: Coefficient Interpretation

Each coefficient has a specific interpretation: **“Holding everything else constant,** a one-unit increase in \(x_j\) is associated with a \(\hat{\beta}_j\) change in game score.”

This phrasing describes what the model estimates – it does not mean we can literally hold everything else constant in the real world. In a more advanced course, you may see the same regression coefficients interpreted from three perspectives: prediction, inference, and causality. We explore the distinction in Chapter 18.

Code
# Extract and interpret coefficients
coefs = model_full.params
ses = model_full.bse
pvals = model_full.pvalues

print("Coefficient interpretations:")
print()
print(f"  REST_DAYS:        {coefs['REST_DAYS']:+.3f}")
print(f"    -> One more rest day is associated with a {coefs['REST_DAYS']:.3f} change in game score")
print(f"    -> Over a 3-day difference: {coefs['REST_DAYS']*3:.2f} points")
print()
print(f"  PLAYER_SEASON_AVG: {coefs['PLAYER_SEASON_AVG']:+.3f}")
print(f"    -> Better players score more (unsurprisingly)")
print()
print(f"  HOME:             {coefs['HOME']:+.3f}")
print(f"    -> Home court advantage is worth ~{coefs['HOME']:.1f} game score points")
print()
print(f"  OPP_STRENGTH:     {coefs['OPP_STRENGTH']:+.3f}")
print(f"    -> Facing a stronger opponent (by 1 GS point avg) shifts score by {coefs['OPP_STRENGTH']:.2f}")
Coefficient interpretations:

  REST_DAYS:        -0.153
    -> One more rest day is associated with a -0.153 change in game score
    -> Over a 3-day difference: -0.46 points

  PLAYER_SEASON_AVG: +0.997
    -> Better players score more (unsurprisingly)

  HOME:             +0.291
    -> Home court advantage is worth ~0.3 game score points

  OPP_STRENGTH:     +1.005
    -> Facing a stronger opponent (by 1 GS point avg) shifts score by 1.00

Raw coefficients are hard to compare across predictors with different units. Standardized coefficients multiply each coefficient by its predictor’s standard deviation, putting all effects on a common scale: “the effect of a one-SD change.”

Code
# Visualize CIs for all coefficients (exclude intercept for scale)
fig, ax = plt.subplots(figsize=(8, 5))
ci = model_full.conf_int().drop('Intercept')
coef_vals = model_full.params.drop('Intercept')

# Standardize: multiply each coefficient by its predictor's SD
predictor_sds = nba_games[['REST_DAYS', 'PLAYER_SEASON_AVG', 'HOME', 'OPP_STRENGTH']].std()
std_coefs = coef_vals * predictor_sds
std_ci_lower = ci[0] * predictor_sds
std_ci_upper = ci[1] * predictor_sds

labels = ['Rest Days', 'Player Quality\n(Season Avg)', 'Home Court', 'Opponent\nStrength']
y_pos = range(len(labels))

ax.barh(y_pos, std_coefs.values, xerr=[
    std_coefs.values - std_ci_lower.values,
    std_ci_upper.values - std_coefs.values
], color='steelblue', edgecolor='black', capsize=5, alpha=0.7)
ax.axvline(x=0, color='red', linestyle='--', linewidth=1)
ax.set_yticks(list(y_pos))
ax.set_yticklabels(labels)
ax.set_xlabel('Effect of 1 SD Change in Predictor (Game Score Points)')
ax.set_title('Standardized Regression Coefficients with 95% CIs')
plt.tight_layout()
plt.show()

print("Player quality dominates. REST_DAYS is barely distinguishable from zero.")

Player quality dominates. REST_DAYS is barely distinguishable from zero.

Which of these coefficients should we trust? The bars for player quality and home court are clearly separated from zero. The bar for rest days barely clears the line. We need a formal framework to decide which coefficients are distinguishable from noise.

Which coefficients matter? The null hypothesis

For each coefficient, the null hypothesis states that the predictor contributes nothing to the model after accounting for all other predictors:

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_a: \beta_j \neq 0\]

If \(\beta_j = 0\), then predictor \(j\) is dead weight – the model would be equally good without it. The question: can we reject that null hypothesis?

Bootstrap approach: computing a p-value by resampling

One way to assess whether the REST_DAYS coefficient is distinguishable from zero is to bootstrap the coefficient and see how often the resampled estimate crosses zero.

Code
# Bootstrap the REST_DAYS coefficient
np.random.seed(42)
n_boot = 2000
boot_coefs = np.zeros(n_boot)

for i in range(n_boot):
    boot_idx = np.random.choice(len(nba_games), size=len(nba_games), replace=True)
    boot_data = nba_games.iloc[boot_idx]
    boot_model = smf.ols(
        'GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG + HOME + OPP_STRENGTH',
        data=boot_data
    ).fit()
    boot_coefs[i] = boot_model.params['REST_DAYS']

boot_se = boot_coefs.std()
boot_ci = np.percentile(boot_coefs, [2.5, 97.5])
# Two-sided p-value: fraction of bootstrap samples on the opposite side of zero
if np.mean(boot_coefs) >= 0:
    boot_p = 2 * np.mean(boot_coefs <= 0)
else:
    boot_p = 2 * np.mean(boot_coefs >= 0)

print("Bootstrap inference for REST_DAYS coefficient:")
print(f"  Original estimate: {coefs['REST_DAYS']:.4f}")
print(f"  Bootstrap SE:      {boot_se:.4f}")
print(f"  Bootstrap 95% CI:  [{boot_ci[0]:.4f}, {boot_ci[1]:.4f}]")
print(f"  Bootstrap p-value: {format_pvalue(boot_p)}")
Bootstrap inference for REST_DAYS coefficient:
  Original estimate: -0.1531
  Bootstrap SE:      0.0193
  Bootstrap 95% CI:  [-0.1890, -0.1136]
  Bootstrap p-value: 0.00e+00

The bootstrap gives us a distribution of the coefficient across resampled datasets. A reported p-value of 0.00e+00 simply means that none of the 2,000 resampled coefficients crossed zero, so all we can say is \(p < 1/2000\). Now we derive the same result using a formula.

The t-statistic: a formula shortcut

The bootstrap standard error and the formula-based standard error measure the same thing: how much the coefficient would vary across repeated samples. The t-statistic divides the coefficient by its standard error.

Code
# The coefficient and its SE
beta_rest = coefs['REST_DAYS']
se_rest = ses['REST_DAYS']
# df_resid = degrees of freedom of the residuals (not a DataFrame!)
df_resid = model_full.df_resid

print(f"REST_DAYS coefficient: {beta_rest:.4f}")
print(f"Formula SE:            {se_rest:.4f}")
print(f"Bootstrap SE:          {boot_se:.4f}  (close match)")
print(f"Ratio (coef / SE):     {beta_rest / se_rest:.4f}")
print()
print("The ratio is about -7 -- the coefficient sits roughly 7 SEs from zero.")
REST_DAYS coefficient: -0.1531
Formula SE:            0.0211
Bootstrap SE:          0.0193  (close match)
Ratio (coef / SE):     -7.2478

The ratio is about -7 -- the coefficient sits roughly 7 SEs from zero.
ImportantDefinition: Coefficient t-test

This ratio has a name – the t-statistic – and we know its distribution under the null hypothesis.

For each coefficient, we test:

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_a: \beta_j \neq 0\]

The test statistic is:

\[t = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}\]

Under \(H_0\), this statistic follows a \(t\)-distribution with \(n - p - 1\) degrees of freedom.

(Strictly, this derivation assumes normally distributed errors. With a sample as large as ours, \(n > 70{,}000\), the central limit theorem makes the t-approximation excellent even when the errors are not exactly normal. With smaller samples, you would want to check a residual Q-Q plot.)

Code
# T-test for REST_DAYS coefficient — verify by hand
t_stat = beta_rest / se_rest
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

print("T-test for REST_DAYS coefficient (by hand):")
print(f"  t = {beta_rest:.4f} / {se_rest:.4f} = {t_stat:.4f}")
print(f"  df = {df_resid:.0f}")
print(f"  p-value = {format_pvalue(p_value)}")
print()
print("From statsmodels:")
print(f"  t = {model_full.tvalues['REST_DAYS']:.4f}")
print(f"  p = {format_pvalue(model_full.pvalues['REST_DAYS'])}")
print()
print("They match! The formula gives the same answer as bootstrap, much faster.")
T-test for REST_DAYS coefficient (by hand):
  t = -0.1531 / 0.0211 = -7.2478
  df = 72151
  p-value = 4.28e-13

From statsmodels:
  t = -7.2478
  p = 4.28e-13

They match! The formula gives the same answer as bootstrap, much faster.

Confidence interval for a coefficient

A 95% confidence interval for \(\beta_j\):

\[\hat{\beta}_j \pm t^* \times \text{SE}(\hat{\beta}_j)\]

where \(t^*\) is the critical value from the \(t\)-distribution.

Code
# 95% CI for REST_DAYS coefficient
t_star = stats.t.ppf(0.975, df_resid)
ci_lower = beta_rest - t_star * se_rest
ci_upper = beta_rest + t_star * se_rest

print(f"95% CI for REST_DAYS coefficient:")
print(f"  {beta_rest:.4f} +/- {t_star:.3f} x {se_rest:.4f}")
print(f"  [{ci_lower:.4f}, {ci_upper:.4f}]")
print()
print(f"Interpretation: we are 95% confident that one extra day of rest is associated")
print(f"with between {ci_lower:.3f} and {ci_upper:.3f} additional game score points.")
print()
print(f"From statsmodels: {model_full.conf_int().loc['REST_DAYS'].values}")
print(f"From bootstrap:   [{boot_ci[0]:.4f}, {boot_ci[1]:.4f}]")
95% CI for REST_DAYS coefficient:
  -0.1531 +/- 1.960 x 0.0211
  [-0.1945, -0.1117]

Interpretation: we are 95% confident that one extra day of rest is associated
with between -0.195 and -0.112 additional game score points.

From statsmodels: [-0.19453639 -0.11171749]
From bootstrap:   [-0.1890, -0.1136]

Can additional covariates flip the rest-days sign?

The REST_DAYS coefficient is negative (or near zero) – disappointing if you hoped rest helps performance. Can adding more covariates change the story?

We derive two additional predictors from columns already in the dataset:

  • BACK_TO_BACK: a binary indicator for playing on consecutive days (REST_DAYS = 1)
  • PREV_MIN: minutes played in the previous game (a fatigue/usage proxy)
Code
# Extended model with additional covariates
nba_extended = nba_games.dropna(subset=['PREV_MIN']).copy()

model_extended = smf.ols(
    'GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG + HOME + OPP_STRENGTH + BACK_TO_BACK + PREV_MIN',
    data=nba_extended
).fit()

print("REST_DAYS coefficient across model specifications:")
print(f"  Base model (4 predictors):     {model_full.params['REST_DAYS']:+.4f}  (p = {format_pvalue(model_full.pvalues['REST_DAYS'])})")
print(f"  Extended model (6 predictors): {model_extended.params['REST_DAYS']:+.4f}  (p = {format_pvalue(model_extended.pvalues['REST_DAYS'])})")
print()
print("Extended model coefficients:")
for var in ['REST_DAYS', 'BACK_TO_BACK', 'PREV_MIN', 'PLAYER_SEASON_AVG', 'HOME', 'OPP_STRENGTH']:
    print(f"  {var:25s} {model_extended.params[var]:+.4f}  (p = {format_pvalue(model_extended.pvalues[var])})")
print()
print("The coefficient for REST_DAYS remains negative (or near zero) regardless of")
print("which covariates we include. Additional rest does not appear to improve performance")
print("in this dataset — a genuine finding, not an artifact of omitted variables.")
print("Sometimes the data simply do not support the hypothesis we hoped for.")
REST_DAYS coefficient across model specifications:
  Base model (4 predictors):     -0.1531  (p = 4.28e-13)
  Extended model (6 predictors): -0.0726  (p = 0.004)

Extended model coefficients:
  REST_DAYS                 -0.0726  (p = 0.004)
  BACK_TO_BACK              +0.0420  (p = 0.557)
  PREV_MIN                  +0.0653  (p = 9.67e-119)
  PLAYER_SEASON_AVG         +0.9096  (p = 0.00e+00)
  HOME                      +0.2691  (p = 1.08e-09)
  OPP_STRENGTH              +1.0120  (p = 1.56e-74)

The coefficient for REST_DAYS remains negative (or near zero) regardless of
which covariates we include. Additional rest does not appear to improve performance
in this dataset — a genuine finding, not an artifact of omitted variables.
Sometimes the data simply do not support the hypothesis we hoped for.

Regression diagnostics: checking the model’s assumptions

The t-tests and confidence intervals above assume residuals are approximately normal with constant variance. Diagnostics let us look under the hood.

ImportantDefinition: LINE Conditions

The assumptions for regression inference are remembered by the LINE mnemonic:

  • Linearity: the relationship between predictors and response is linear.
  • Independence: observations are independent of each other.
  • Normality: residuals are approximately normally distributed.
  • Equal variance: the spread of residuals is roughly constant across fitted values (no heteroscedasticity).

Each LINE condition has a characteristic diagnostic signature. As a homework exercise, you’ll generate synthetic data that violates each assumption and create the corresponding diagnostic plots.

Residual plot: do residuals have constant spread?

ImportantDefinition: Residual Plot

A residual plot shows residuals (observed \(-\) predicted) vs. fitted values. We want to see a random scatter with no pattern. If the spread of residuals changes with the fitted value, that is heteroscedasticity – the model’s uncertainty is not uniform.

Code
# Residual plot for the NBA model
fitted_vals = model_full.fittedvalues
residuals = model_full.resid

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residuals vs fitted
axes[0].scatter(fitted_vals, residuals, alpha=0.02, s=5, color='steelblue')
axes[0].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted Values')

# Add a LOWESS smoother to highlight any trend
smooth = lowess(residuals, fitted_vals, frac=0.1)
axes[0].plot(smooth[:, 0], smooth[:, 1], color='orange', linewidth=2, label='LOWESS smooth')
axes[0].legend()

# Scale-location plot: sqrt of |standardized residuals| vs fitted
std_resid = model_full.get_influence().resid_studentized_internal
axes[1].scatter(fitted_vals, np.sqrt(np.abs(std_resid)), alpha=0.02, s=5, color='steelblue')
axes[1].set_xlabel('Fitted Values')
axes[1].set_ylabel(r'$\sqrt{|\mathrm{Standardized\ Residuals}|}$')
axes[1].set_title('Scale-Location Plot\n(check for heteroscedasticity)')
smooth2 = lowess(np.sqrt(np.abs(std_resid)), fitted_vals, frac=0.1)
axes[1].plot(smooth2[:, 0], smooth2[:, 1], color='orange', linewidth=2)

plt.tight_layout()
plt.show()

print("The spread of residuals looks roughly constant across fitted values — reassuring.")

The spread of residuals looks roughly constant across fitted values — reassuring.

Q-Q plot: are residuals approximately normal?

A Q-Q plot compares residual quantiles to what we would expect from a normal distribution. Points on the diagonal indicate normality. Deviations in the tails indicate heavy tails or skew.

Code
# Q-Q plot using scipy.stats.probplot for correct scaling
fig, ax = plt.subplots(figsize=(6, 6))
(theoretical_quantiles, ordered_residuals), (slope, intercept, r) = stats.probplot(
    residuals, dist='norm'
)
ax.scatter(theoretical_quantiles, ordered_residuals, alpha=0.02, s=5, color='steelblue')
# Reference line: if residuals are normal, points lie on y = intercept + slope * x
ref_x = np.array([theoretical_quantiles.min(), theoretical_quantiles.max()])
ax.plot(ref_x, intercept + slope * ref_x, 'r-', linewidth=1.5, label='Normal reference line')
ax.set_xlabel('Theoretical Quantiles (Normal)')
ax.set_ylabel('Sample Quantiles (Residuals)')
ax.set_title('Q-Q Plot of Residuals\n(NBA regression model)')
ax.legend()
plt.tight_layout()
plt.show()

print("The middle of the distribution follows the normal line closely.")
print("The tails deviate — more extreme game scores occur than a normal")
print("distribution would predict. This pattern is common in sports data.")

The middle of the distribution follows the normal line closely.
The tails deviate — more extreme game scores occur than a normal
distribution would predict. This pattern is common in sports data.

Why residuals reveal what raw data hides

Looking at residuals instead of the raw response variable is not just a formality. Other predictors add scatter to the raw data that can mask violations of the model’s assumptions. Residuals remove the explained variation, leaving only the unexplained part — and that is where patterns become visible.

For example, a plot of price vs. bedrooms for Airbnb listings shows enormous scatter because neighborhood, room type, and amenities all vary across listings. But the residuals from a model that controls for those variables are much cleaner. A nonlinear pattern — say, diminishing returns from extra bedrooms — that was invisible in the raw scatter may jump out in the residual plot.

TipThink About It

Why do residuals reveal nonlinearity that the raw data hides? Because they strip away the variation explained by the other predictors, so any remaining pattern stands out.
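A minimal synthetic illustration of this idea (not the Airbnb data): y has a small curved dependence on x2 that is swamped by a large linear effect of x1 in the raw scatter, but stands out in the residuals once x1 is controlled for.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 3000
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
# Large linear effect of x1 plus a small curved effect of x2
df['y'] = 5.0 * df['x1'] - 1.0 * df['x2']**2 + rng.normal(scale=0.5, size=n)

# Fit a model that is linear in both predictors (misses the curvature)
fit = smf.ols('y ~ x1 + x2', data=df).fit()
df['resid'] = fit.resid

# How strongly does a quadratic term track each quantity?
curv = df['x2']**2
r_raw = np.corrcoef(curv, df['y'])[0, 1]
r_res = np.corrcoef(curv, df['resid'])[0, 1]
print(f"corr(x2^2, raw y):     {r_raw:+.2f}")   # diluted by x1's variation
print(f"corr(x2^2, residuals): {r_res:+.2f}")   # curvature now stands out
```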

WarningWhen diagnostics reveal problems

If residuals show strong non-normality or heteroscedasticity, the formula-based standard errors and p-values may be unreliable. Options include:

  • Bootstrap confidence intervals (Chapter 8) do not assume normality – they let the data speak for themselves.
  • Robust standard errors adjust for heteroscedasticity without changing the coefficient estimates.
  • Transform the response (e.g., log price) to stabilize variance.

For the NBA model, our sample is large (\(n > 70{,}000\)) and the Q-Q plot is reasonable in the middle, so the formula-based inference is trustworthy here. With smaller samples or heavier tails, bootstrap CIs would be the safer choice.
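Here is a sketch of the robust-standard-error option on synthetic heteroscedastic data. Refitting with cov_type='HC3' leaves the point estimates untouched and adjusts only the standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({'x': rng.uniform(1, 10, size=n)})
# Heteroscedastic noise: SD grows with x
df['y'] = 2.0 + 0.8 * df['x'] + rng.normal(scale=0.4 * df['x'].to_numpy(), size=n)

fit_ols = smf.ols('y ~ x', data=df).fit()
fit_hc3 = smf.ols('y ~ x', data=df).fit(cov_type='HC3')

# Same coefficients, different (heteroscedasticity-robust) standard errors
print(f"coef on x:  OLS {fit_ols.params['x']:.4f}   HC3 {fit_hc3.params['x']:.4f}")
print(f"SE of coef: OLS {fit_ols.bse['x']:.4f}   HC3 {fit_hc3.bse['x']:.4f}")
```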

The significance-vs-importance tension

TipThink About It

The t-test says REST_DAYS is “statistically significant” (\(p < 0.001\)). Yet the rest-days bar in the coefficient plot above barely clears zero. The model scarcely changes when we add rest days. Should we care?

This example illustrates the tension between statistical significance and practical importance.

“Significant” does not mean “important”

A small p-value does not mean a large effect. This principle is one of the most frequently misunderstood ideas in applied statistics.

Let us return to the NBA model. The REST_DAYS coefficient was statistically significant (\(p < 0.001\)). Look at the effect size.

Code
# Effect size for REST_DAYS in NBA model
rest_coef = model_full.params['REST_DAYS']
rest_se = model_full.bse['REST_DAYS']
game_score_std = nba_games['GAME_SCORE'].std()

# Standardized effect size: coefficient divided by outcome SD
std_effect = rest_coef / game_score_std

print("=== NBA Rest Effect: Significance vs Importance ===")
print()
print(f"Coefficient: {rest_coef:.3f} game score points per rest day")
print(f"Standard deviation of game score: {game_score_std:.2f}")
print(f"Standardized effect size: {std_effect:.4f} SD")
print(f"p-value: {format_pvalue(model_full.pvalues['REST_DAYS'])}")
print()
print("For context, Cohen's d benchmarks for effect sizes:")
print("  Small effect:  d = 0.2")
print("  Medium effect: d = 0.5")
print("  Large effect:  d = 0.8")
print(f"  Our effect:    {std_effect:.4f}  <-- essentially zero")
print()
print(f"With n = {len(nba_games):,} observations, we can detect effects that are")
print(f"completely negligible in practice.")
=== NBA Rest Effect: Significance vs Importance ===

Coefficient: -0.153 game score points per rest day
Standard deviation of game score: 7.85
Standardized effect size: -0.0195 SD
p-value: 4.28e-13

For context, Cohen's d benchmarks for effect sizes:
  Small effect:  d = 0.2
  Medium effect: d = 0.5
  Large effect:  d = 0.8
  Our effect:    -0.0195  <-- essentially zero

With n = 72,156 observations, we can detect effects that are
completely negligible in practice.

Placing the rest effect alongside home court advantage and player quality makes the relative magnitudes concrete.

Code
# Visualize: comparing effect sizes in the model
fig, ax = plt.subplots(figsize=(8, 5))

effects = {
    'One rest day': rest_coef,
    'Home court': model_full.params['HOME'],
    '1 SD player quality': model_full.params['PLAYER_SEASON_AVG'] * nba_games['PLAYER_SEASON_AVG'].std(),
}
colors = ['#e74c3c', '#3498db', '#2ecc71']
bars = ax.bar(effects.keys(), effects.values(), color=colors, edgecolor='black')
ax.set_ylabel('Effect on Game Score')
ax.set_title('Comparing Effect Sizes\n(rest effect is tiny)')
for bar, val in zip(bars, effects.values()):
    ax.text(bar.get_x() + bar.get_width()/2, val + 0.1, f'{val:.2f}',
            ha='center', fontsize=10)
plt.tight_layout()
plt.show()

print("The rest effect is dwarfed by player quality and even home court advantage.")

The rest effect is dwarfed by player quality and even home court advantage.
TipThink About It

A sports analytics team comes to you and says “We ran a regression and found a negative coefficient for rest days – extra rest actually hurts performance, with \(p < 0.001\)!” What questions should you ask?

  1. How big is the effect? (Tiny – a fraction of a point, standardized effect near zero.)
  2. Is there confounding? (Yes – coaches strategically rest players before tough opponents.)
  3. Does the model capture all relevant factors? (No – schedule difficulty, travel, fatigue history, etc.)
  4. Would this result hold out of sample? (Uncertain.)

Statistical significance with large samples can detect effects too small to matter.

Association is not causation

WarningAssociation is not causation

Our regression controls for player quality, home court, and opponent strength, but it does NOT estimate a causal effect of rest. Coaches choose when to rest players for strategic reasons we cannot observe in the data. The coefficient tells us about association, not causation.

We formalize this distinction in Chapter 18 using causal DAGs. For now, remember: regression can adjust for measured confounders, but it cannot eliminate unmeasured ones.
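To make the warning concrete, here is a minimal simulation (entirely synthetic, not the NBA data) in which rest has no true effect, yet a model that omits player quality manufactures a negative rest coefficient because coaches rest weaker players more often. When the confounder is unmeasured, the naive model is the only one you can fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 20000
quality = rng.normal(10, 3, size=n)
# Confounded assignment: lower-quality players receive more rest days
rest = np.clip(np.round(4 - 0.3 * (quality - 10) + rng.normal(size=n)), 0, 7)
# True model: rest has ZERO effect on performance
score = quality + rng.normal(size=n)

df = pd.DataFrame({'score': score, 'rest': rest, 'quality': quality})
naive = smf.ols('score ~ rest', data=df).fit()
adjusted = smf.ols('score ~ rest + quality', data=df).fit()

print(f"Naive rest coefficient (quality omitted): {naive.params['rest']:+.3f}")
print(f"Adjusted rest coefficient:                {adjusted.params['rest']:+.3f}")
```

Controlling for the measured confounder removes the bias here, but only because we simulated quality and can include it. An unmeasured confounder would leave us stuck with the naive estimate.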

Regression inference on Airbnb data

The NBA example showed statistical significance without practical importance. Regression inference is not always so underwhelming – sometimes the effect IS real and large.

The question: is the price premium for an extra bathroom statistically significant, and is it large enough to matter?

We restrict to entire-home listings priced at $500/night or less, focusing on typical properties and avoiding the extreme luxury outliers that would distort the regression.

Code
# Load and clean Airbnb data
airbnb = pd.read_csv(f'{DATA_DIR}/airbnb/listings.csv', low_memory=False)

# Convert price to numeric (handles both numeric and "$150.00" string formats)
airbnb['price_clean'] = pd.to_numeric(
    airbnb['price'].astype(str).str.replace(r'[\$,]', '', regex=True),
    errors='coerce'
)

# Shorten the NYC borough column name
airbnb = airbnb.rename(columns={'neighbourhood_group_cleansed': 'borough'})

# Filter to typical entire-home listings (<= $500/night removes luxury outliers)
airbnb_clean = airbnb[
    (airbnb['price_clean'] > 0) &
    (airbnb['price_clean'] <= 500) &
    (airbnb['bathrooms'].notna()) &
    (airbnb['bedrooms'].notna()) &
    (airbnb['room_type'] == 'Entire home/apt')
].copy()

print(f"Airbnb listings (filtered): {len(airbnb_clean):,}")
print(f"Price range: ${airbnb_clean['price_clean'].min():.0f} - ${airbnb_clean['price_clean'].max():.0f}")
print(f"Bathrooms range: {airbnb_clean['bathrooms'].min():.0f} - {airbnb_clean['bathrooms'].max():.0f}")
Airbnb listings (filtered): 14,689
Price range: $10 - $500
Bathrooms range: 0 - 6

Before modeling, let us visualize. The borough column represents NYC boroughs (Manhattan, Brooklyn, Queens, Bronx, Staten Island).

Code
# Visualize price by bathrooms
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(x='bathrooms', y='price_clean', data=airbnb_clean, ax=axes[0], color='steelblue')
axes[0].set_xlabel('Number of Bathrooms')
axes[0].set_ylabel('Price per Night ($)')
axes[0].set_title('Airbnb Price by Bathrooms')

sns.boxplot(x='borough', y='price_clean',
            data=airbnb_clean, ax=axes[1], color='steelblue')
axes[1].set_xlabel('Borough')
axes[1].set_ylabel('Price per Night ($)')
axes[1].set_title('Airbnb Price by Borough')
axes[1].tick_params(axis='x', rotation=30)

plt.tight_layout()
plt.show()

print("More bathrooms = higher price. Does this hold after controlling for bedrooms and borough?")

More bathrooms = higher price. Does this hold after controlling for bedrooms and borough?

We fit a regression using smf.ols() with the formula interface. The C() wrapper tells statsmodels to treat borough as a categorical variable, creating indicator (dummy) columns automatically.

Code
# Regression: price ~ bathrooms + bedrooms + borough
airbnb_model = smf.ols(
    'price_clean ~ bathrooms + bedrooms + C(borough)',
    data=airbnb_clean
).fit()

print(airbnb_model.summary().tables[1])
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                      -2.3836      5.679     -0.420      0.675     -13.515       8.747
C(borough)[T.Brooklyn]         37.5698      5.481      6.854      0.000      26.826      48.313
C(borough)[T.Manhattan]        95.5836      5.464     17.492      0.000      84.873     106.294
C(borough)[T.Queens]           17.9948      5.794      3.106      0.002       6.638      29.352
C(borough)[T.Staten Island]   -20.1103      9.645     -2.085      0.037     -39.015      -1.206
bathrooms                      62.4363      1.861     33.557      0.000      58.789      66.083
bedrooms                       34.0280      0.734     46.331      0.000      32.588      35.468
===============================================================================================
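Each t statistic in this table is just the coefficient divided by its standard error. Here is a minimal check on synthetic data; the same check works on airbnb_model above.

Code
```python
# Verify by hand that t = beta_hat / SE(beta_hat)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(size=200)})
df['y'] = 2.0 * df['x'] + rng.normal(size=200)

fit = smf.ols('y ~ x', data=df).fit()
t_manual = fit.params['x'] / fit.bse['x']
print(f"manual t:      {t_manual:.3f}")
print(f"statsmodels t: {fit.tvalues['x']:.3f}")  # identical by construction
```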

Let’s isolate the bathrooms coefficient and build a confidence interval for it.

Code
# Focus on the bathrooms coefficient
beta_bath = airbnb_model.params['bathrooms']
se_bath = airbnb_model.bse['bathrooms']
ci_bath = airbnb_model.conf_int().loc['bathrooms']
p_bath = airbnb_model.pvalues['bathrooms']

print(f"Bathrooms coefficient: ${beta_bath:.2f} per night")
print(f"95% CI: [${ci_bath[0]:.2f}, ${ci_bath[1]:.2f}]")
print(f"p-value: {format_pvalue(p_bath)}")
print()
print(f"Interpretation: controlling for bedrooms and borough, each additional bathroom")
print(f"is associated with ${beta_bath:.2f} more per night.")
print()
print("This IS practically meaningful -- unlike the NBA rest effect!")
Bathrooms coefficient: $62.44 per night
95% CI: [$58.79, $66.08]
p-value: 6.28e-238

Interpretation: controlling for bedrooms and borough, each additional bathroom
is associated with $62.44 more per night.

This IS practically meaningful -- unlike the NBA rest effect!
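One way to gauge practical importance is to translate the coefficient into annual revenue. The occupancy rate below is a made-up assumption for illustration — and remember, the coefficient is an association, not a guaranteed causal return on a renovation.

Code
```python
# Back-of-envelope revenue calculation (occupancy rate is a hypothetical assumption)
coef_bath = 62.44           # $/night premium from the regression above
occupancy = 0.70            # hypothetical occupancy rate, for illustration only
extra_revenue = coef_bath * 365 * occupancy
print(f"~${extra_revenue:,.0f} extra revenue per year per additional bathroom")
```

Even under rough assumptions, tens of thousands of dollars per year is the kind of number a host can weigh against renovation costs — unlike a fraction of a point per NBA game.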

Prediction intervals vs. confidence intervals

We have built confidence intervals for coefficients – but what about predictions? When we use a regression model to predict at a particular value of \(x\), there are two different questions, and they give different intervals.

  • Confidence interval for the mean response: How uncertain are we about the average \(Y\) at this \(x\)? This interval reflects estimation uncertainty in \(\hat{\beta}\).
  • Prediction interval for a new observation: Where might a single new \(Y\) fall? This interval adds individual variability (\(\varepsilon\)) on top of estimation uncertainty.
ImportantPrediction intervals are always wider

A prediction interval accounts for two sources of uncertainty: (1) estimation uncertainty in the regression line, and (2) individual variability around the line (\(\sigma^2\)). A confidence interval for the mean includes only source (1). In formulas, the underlying standard errors satisfy

\[\text{SE}_{\text{pred}}^2 \;=\; \text{SE}_{\text{mean}}^2 \;+\; \hat{\sigma}^2\]

where \(\hat{\sigma}^2\) is the estimated residual variance. The prediction interval is therefore always wider, and the gap grows with how noisy the response is.
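The standard errors underlying the two intervals satisfy \(\text{SE}_{\text{pred}}^2 = \text{SE}_{\text{mean}}^2 + \hat{\sigma}^2\). We can verify this numerically on synthetic data (the same identity holds for the Airbnb model), recovering each SE from its interval half-width divided by the t critical value.

Code
```python
# Verify SE_pred^2 = SE_mean^2 + sigma_hat^2 at a single prediction point
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
df = pd.DataFrame({'x': rng.normal(size=300)})
df['y'] = 1.5 * df['x'] + rng.normal(scale=2.0, size=300)
fit = smf.ols('y ~ x', data=df).fit()

# Both intervals at x = 1.0; back out the SEs from the half-widths
sf = fit.get_prediction(pd.DataFrame({'x': [1.0]})).summary_frame(alpha=0.05)
t_crit = stats.t.ppf(0.975, fit.df_resid)
se_mean = (sf['mean_ci_upper'] - sf['mean_ci_lower']).iloc[0] / (2 * t_crit)
se_pred = (sf['obs_ci_upper'] - sf['obs_ci_lower']).iloc[0] / (2 * t_crit)

print(f"se_pred^2:             {se_pred**2:.4f}")
print(f"se_mean^2 + sigma^2:   {se_mean**2 + fit.scale:.4f}")  # fit.scale = residual variance
```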

Code
# Prediction interval vs confidence interval — Airbnb model
bath_vals = np.arange(1, airbnb_clean['bathrooms'].max() + 0.5, 0.5)
grid = pd.DataFrame({
    'bathrooms': bath_vals,
    'bedrooms': [airbnb_clean['bedrooms'].median()] * len(bath_vals),
    'borough': ['Manhattan'] * len(bath_vals),
})
grid_pred = airbnb_model.get_prediction(grid).summary_frame(alpha=0.05)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(bath_vals, grid_pred['mean'], 'o-', color='steelblue', label='Predicted mean')
ax.fill_between(bath_vals, grid_pred['mean_ci_lower'], grid_pred['mean_ci_upper'],
                alpha=0.4, color='steelblue', label='95% CI for mean response')
ax.fill_between(bath_vals, grid_pred['obs_ci_lower'], grid_pred['obs_ci_upper'],
                alpha=0.15, color='orange', label='95% prediction interval')
ax.set_xlabel('Number of Bathrooms')
ax.set_ylabel('Predicted Price ($/night)')
ax.set_title('CI for Mean Response vs. Prediction Interval\n(Manhattan, median bedrooms)')
ax.legend()
plt.tight_layout()
plt.show()

print("The blue band (CI for the mean) is narrow — we are fairly sure about the average price.")
print("The orange band (prediction interval) is wide — individual listings vary greatly.")

The blue band (CI for the mean) is narrow — we are fairly sure about the average price.
The orange band (prediction interval) is wide — individual listings vary greatly.

Computing both intervals at a single prediction point makes the difference concrete.

Code
# Numeric example: predicted price for a 2-bathroom, 2-bedroom Manhattan listing
example = pd.DataFrame({
    'bathrooms': [2],
    'bedrooms': [2],
    'borough': ['Manhattan'],
})
example_pred = airbnb_model.get_prediction(example).summary_frame(alpha=0.05)

print("Prediction for a 2-bath, 2-bed Manhattan listing:")
print(f"  Predicted price:        ${example_pred['mean'].values[0]:.2f}/night")
print(f"  95% CI for mean price:  [${example_pred['mean_ci_lower'].values[0]:.2f}, ${example_pred['mean_ci_upper'].values[0]:.2f}]")
print(f"  95% prediction interval:[${example_pred['obs_ci_lower'].values[0]:.2f}, ${example_pred['obs_ci_upper'].values[0]:.2f}]")
print()
print("The CI says the *average* listing like this rents for roughly")
print(f"${example_pred['mean_ci_lower'].values[0]:.0f}-${example_pred['mean_ci_upper'].values[0]:.0f}/night.")
print("The PI says a *single new* listing could fall anywhere in a much wider range,")
print("reflecting the large variation among individual Airbnb listings.")
Prediction for a 2-bath, 2-bed Manhattan listing:
  Predicted price:        $286.13/night
  95% CI for mean price:  [$282.79, $289.46]
  95% prediction interval:[$151.65, $420.60]

The CI says the *average* listing like this rents for roughly
$283-$289/night.
The PI says a *single new* listing could fall anywhere in a much wider range,
reflecting the large variation among individual Airbnb listings.

Putting it together: NBA vs Airbnb

Both analyses used the same tools – t-tests and confidence intervals on regression coefficients. The conclusions differ dramatically.

Code
# Side-by-side comparison
print("=" * 60)
print(f"{'':30s} {'NBA Rest':>12s}  {'Airbnb Bath':>12s}")
print("=" * 60)
print(f"{'Coefficient':30s} {rest_coef:12.3f}  {beta_bath:12.2f}")
print(f"{'SE':30s} {rest_se:12.4f}  {se_bath:12.2f}")
print(f"{'p-value':30s} {format_pvalue(model_full.pvalues['REST_DAYS']):>12s}  {format_pvalue(p_bath):>12s}")
ci_rest = model_full.conf_int().loc['REST_DAYS']
print(f"{'95% CI lower':30s} {ci_rest[0]:12.3f}  {ci_bath[0]:12.2f}")
print(f"{'95% CI upper':30s} {ci_rest[1]:12.3f}  {ci_bath[1]:12.2f}")
std_effect_label = "Standardized effect (coef/SD)"
print(f"{std_effect_label:30s} {std_effect:12.4f}  {beta_bath/airbnb_clean['price_clean'].std():12.4f}")
print(f"{'n':30s} {len(nba_games):12,d}  {int(airbnb_model.nobs):12,d}")
print(f"{'Practically meaningful?':30s} {'No':>12s}  {'Yes':>12s}")
print("=" * 60)
print()
print("Both are 'statistically significant.' Only one is large enough to act on.")
============================================================
                                   NBA Rest   Airbnb Bath
============================================================
Coefficient                          -0.153         62.44
SE                                   0.0211          1.86
p-value                            4.28e-13     6.28e-238
95% CI lower                         -0.195         58.79
95% CI upper                         -0.112         66.08
Standardized effect (coef/SD)       -0.0195        0.7299
n                                    72,156        14,689
Practically meaningful?                  No           Yes
============================================================

Both are 'statistically significant.' Only one is large enough to act on.

Key takeaways

  1. Regression inference asks: is each predictor useful? The coefficient t-test tests \(H_0: \beta_j = 0\), asking whether the variable contributes above noise. Parsimony favors dropping predictors that fail this test.

  2. Building models step by step reveals what “controlling for” means. Watch how coefficients change as you add predictors – this progression makes the idea concrete.

  3. Confidence intervals tell you the range of plausible effect sizes. A CI that barely excludes zero should make you cautious.

  4. Statistical significance \(\neq\) practical importance. With large \(n\), you can detect effects too small to matter. Always report effect sizes alongside p-values.

  5. Context determines whether a result is actionable. The bathroom premium (about $60/night) is actionable for Airbnb hosts. The rest effect (a fraction of a point per game) is not actionable for NBA coaches.

  6. Regression estimates associations, not causal effects (unless all confounders are controlled for – more in Chapter 18).

We revisit this topic in Chapter 18 when we ask: coaches choose when to rest players – so is the “rest effect” actually a “coaching strategy” effect? That question is causal, and regression alone cannot answer it.

Next up: Chapter 13 is the Act 2 capstone on classification, where we apply inference ideas to categorical outcomes.

Study guide

Key ideas

  • Coefficient t-test: Tests whether a single regression coefficient is significantly different from zero. \(t = \hat{\beta}_j / \text{SE}(\hat{\beta}_j)\).
  • Standard error (of a coefficient): The estimated standard deviation of \(\hat{\beta}_j\) across repeated samples. Measures uncertainty in the coefficient estimate.
  • Controlling for / holding constant: The regression interpretation – the effect of one predictor after accounting for the others in the model.
  • Practical significance: Whether an effect is large enough to matter in context, regardless of the p-value.
  • Effect size: A standardized measure of how large an effect is (e.g., coefficient divided by the outcome’s SD).
  • Residual plot: Scatter plot of residuals vs. fitted values; used to check for patterns, non-linearity, and heteroscedasticity.
  • Q-Q plot: Quantile-quantile plot comparing residual quantiles to a normal distribution; points on the diagonal indicate normality.
  • Heteroscedasticity: When the variance of residuals changes across fitted values (the spread “fans out”). Violates the constant-variance assumption of OLS inference.
  • LINE conditions: The four assumptions for regression inference – Linearity, Independence, Normality, Equal variance.
  • Prediction interval: An interval for where a single new observation might fall at a given \(x\). Always wider than the confidence interval for the mean – it includes both estimation uncertainty and individual variability.
  • Parsimony / Occam’s razor: The best model is the simplest one that adequately fits the data. Drop predictors that are not useful.
  • The t-test for a coefficient asks: is this predictor useful, or could the observed coefficient be due to noise?
  • A confidence interval for a coefficient gives the range of plausible effect sizes, not just a yes/no answer.
  • Building models step by step (adding one predictor at a time) reveals what “controlling for” actually does to coefficients.
  • Statistical significance with large n can detect effects too small to matter – always check effect size.
  • Regression controls for measured confounders but does not establish causation.
  • Residual plots and Q-Q plots check whether the model’s assumptions (LINE conditions) hold; when they do not, use bootstrap CIs (Chapter 8) instead of formula CIs.
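The bootstrap fallback in the last bullet can be sketched as follows: resample rows with replacement, refit, and take percentile endpoints. This sketch uses synthetic heavy-tailed data; on real data you would resample rows of airbnb_clean instead.

Code
```python
# Sketch: bootstrap percentile CI for a regression slope (fallback when LINE conditions fail)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({'x': rng.normal(size=400)})
df['y'] = 3.0 * df['x'] + rng.standard_t(df=3, size=400)  # heavy-tailed errors

boot_coefs = []
for _ in range(200):
    resampled = df.sample(n=len(df), replace=True)  # resample rows with replacement
    boot_coefs.append(smf.ols('y ~ x', data=resampled).fit().params['x'])

lo, hi = np.percentile(boot_coefs, [2.5, 97.5])
print(f"bootstrap 95% CI for slope: [{lo:.3f}, {hi:.3f}]")
```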

Computational tools

  • smf.ols('y ~ x1 + x2', data=df).fit() – fit an OLS regression with formula syntax
  • model.summary() – full regression output including coefficients, SEs, t-stats, p-values
  • model.params – Series of fitted coefficients
  • model.pvalues – Series of p-values for each coefficient’s t-test
  • model.conf_int() – 95% confidence intervals for all coefficients
  • model.bse – standard errors of all coefficients
  • model.fittedvalues, model.resid – fitted values and residuals for diagnostic plots
  • model.get_prediction(new_data).summary_frame() – confidence interval for the mean and prediction interval for a new observation
  • stats.probplot(residuals, dist='norm') – Q-Q plot data with correct scaling
  • format_pvalue(p) – format p-values in scientific notation when below 0.001

For the quiz

  • Be able to interpret a coefficient table: read off coefficients, SEs, t-statistics, and p-values.
  • Know what “controlling for” means in a regression context and how adding a predictor can change other coefficients.
  • Distinguish statistical significance (small p-value) from practical importance (large effect size).
  • Know the formula \(t = \hat{\beta}_j / \text{SE}(\hat{\beta}_j)\) and what it tests.
  • Be able to interpret a confidence interval for a regression coefficient in context.
  • Know what a residual plot and Q-Q plot show, and what patterns indicate problems (heteroscedasticity, non-normality).
  • Know the LINE conditions (Linearity, Independence, Normality, Equal variance) and which diagnostic plots check which conditions.
  • Know that bootstrap CIs (Chapter 8) are a fallback when normality assumptions fail.