---
title: "Regression Inference + Diagnostics"
execute:
  enabled: true
jupyter: python3
---
## Should an NBA team restructure its schedule around rest?
An NBA team's analytics department says resting players boosts performance by 0.3 points
per game. They want to restructure the schedule around it. Should the front office invest
in roster flexibility based on this finding? We also examine Airbnb listings to see
what happens when the effect IS real and large. Same tools, very different conclusions.
:::{.callout-note}
## Why inference, not just prediction?
In Act 1, we built regression models to predict — given a listing's features, what price should we expect? Prediction works well when you observe a new unit from the same population: a new listing appears, you plug in its features, you get a price.
But now consider a different question. An Airbnb host is thinking about spending \$50,000 to convert a closet into a second bedroom. The regression coefficient on `bedrooms` is about +\$34/night. Does that mean the renovation will pay for itself in four years?
Not necessarily. The coefficient tells you that listings *with* more bedrooms tend to cost more — but it doesn't tell you that *adding* a bedroom to *your* listing would increase your price by \$34. Maybe the real driver is apartment size, or building quality, or neighborhood. Larger apartments happen to have more bedrooms *and* higher prices, but the extra bedroom isn't doing the work.
Prediction asks: **what will happen to a new unit from this population?** Inference asks: **what would happen if I changed something?** The second question is harder — and it's the one that matters when the decision involves an investment, a policy, or an intervention. This chapter gives you the tools to start answering it.
:::
:::{.callout-note}
## Act 1 meets Act 2
Act 1 built regression models for prediction. Act 2 introduced hypothesis testing and
confidence intervals. This chapter combines both toolkits: we apply inference to
regression coefficients and check whether the model's assumptions justify that inference.
:::
The central question: **is each predictor in our model useful, or could we
drop it without loss?** Parsimony favors the simplest model that
adequately fits the data (an idea sometimes called *Occam's razor*).
We also check whether the model's assumptions hold using **regression diagnostics**.
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.nonparametric.smoothers_lowess import lowess
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 12
DATA_DIR = 'data'
```
We define a helper function `format_pvalue()` to display very small p-values in scientific notation throughout this chapter.
```{python}
def format_pvalue(p):
    """Format p-values: scientific notation for very small values."""
    if p < 0.001:
        return f"{p:.2e}"
    return f"{p:.3f}"
```
## Building a regression model for NBA performance
Recall from Chapter 11 that the raw correlation between rest and points was misleading --
Simpson's paradox showed that bench players dominate the extended-rest group.
Can regression help us **control for** player quality and isolate the rest effect?
```{python}
# Load NBA data
nba = pd.read_csv(f'{DATA_DIR}/nba/nba_load_management.csv')
nba['GAME_DATE'] = pd.to_datetime(nba['GAME_DATE'])
nba_games = nba.dropna(subset=['REST_DAYS']).copy()
# Cap rest days at 7 to avoid long injury absences
nba_games = nba_games[nba_games['REST_DAYS'] <= 7].copy()
# Compute player-season averages as a measure of player quality
nba_games['PLAYER_SEASON_AVG'] = nba_games.groupby(
['PLAYER_ID', 'SEASON']
)['GAME_SCORE'].transform('mean')
# Compute opponent strength (opponent's average GAME_SCORE allowed)
opp_strength = nba_games.groupby('OPPONENT')['GAME_SCORE'].mean()
nba_games['OPP_STRENGTH'] = nba_games['OPPONENT'].map(opp_strength)
# Derive additional covariates from existing columns
# Back-to-back indicator: did the player play yesterday?
nba_games['BACK_TO_BACK'] = (nba_games['REST_DAYS'] == 1).astype(int)
# Minutes in previous game (proxy for fatigue/usage)
nba_games = nba_games.sort_values(['PLAYER_ID', 'GAME_DATE'])
nba_games['PREV_MIN'] = nba_games.groupby('PLAYER_ID')['MIN'].shift(1)
print(f"Observations: {len(nba_games):,}")
print(f"Players: {nba_games['PLAYER_NAME'].nunique()}")
print()
print("Variables for regression:")
print(f" GAME_SCORE (response): mean={nba_games['GAME_SCORE'].mean():.2f}, std={nba_games['GAME_SCORE'].std():.2f}")
print(f" REST_DAYS: mean={nba_games['REST_DAYS'].mean():.2f}")
print(f" PLAYER_SEASON_AVG: mean={nba_games['PLAYER_SEASON_AVG'].mean():.2f}")
print(f" HOME: {nba_games['HOME'].mean():.2%} home games")
print(f" BACK_TO_BACK: {nba_games['BACK_TO_BACK'].mean():.2%} of games")
print(f" PREV_MIN: mean={nba_games['PREV_MIN'].mean():.1f} (previous game minutes)")
```
Before fitting any model, let's look at the data. Is there a visible relationship
between rest days and performance?
```{python}
# Visualize: game score by rest days
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.boxplot(x='REST_DAYS', y='GAME_SCORE', data=nba_games, ax=axes[0], color='steelblue')
axes[0].set_xlabel('Rest Days')
axes[0].set_ylabel('Game Score')
axes[0].set_title('Game Score by Rest Days')
# Scatter with player quality on x-axis
axes[1].scatter(nba_games['PLAYER_SEASON_AVG'], nba_games['GAME_SCORE'],
alpha=0.02, s=5, color='steelblue')
axes[1].set_xlabel('Player Season Average')
axes[1].set_ylabel('Game Score')
axes[1].set_title('Game Score vs Player Quality\n(the dominant predictor)')
plt.tight_layout()
plt.show()
print("The box plots look almost identical across rest days.")
print("Player quality clearly matters more. Can regression disentangle these effects?")
```
## Step-by-step model building: what does "controlling for" mean?
One of the most important ideas in regression is **controlling for** other variables.
We build the model one predictor at a time and watch how the REST_DAYS coefficient changes.
:::{.callout-tip}
## Think About It
Before looking at the numbers, do you think the rest effect will
be positive or negative? Large or small?
:::
:::{.callout-note}
## The matrix API
Statsmodels also provides a lower-level matrix interface, `sm.OLS(y, X)`, where you pass arrays directly. The formula API (`smf.ols('y ~ x1 + x2', data=df)`) is more readable and handles categorical encoding automatically, so we use it throughout this course.
:::
```{python}
# Model 1: naive — just rest days
model_1 = smf.ols('GAME_SCORE ~ REST_DAYS', data=nba_games).fit()
# Model 2: add player quality
model_2 = smf.ols('GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG', data=nba_games).fit()
# Model 3: full model with all controls
model_full = smf.ols(
'GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG + HOME + OPP_STRENGTH',
data=nba_games
).fit()
print("REST_DAYS coefficient as we add controls:")
print(f" Model 1 (REST_DAYS only): {model_1.params['REST_DAYS']:+.4f} (p = {format_pvalue(model_1.pvalues['REST_DAYS'])})")
print(f" Model 2 (+ PLAYER_SEASON_AVG): {model_2.params['REST_DAYS']:+.4f} (p = {format_pvalue(model_2.pvalues['REST_DAYS'])})")
print(f" Model 3 (+ HOME + OPP_STRENGTH): {model_full.params['REST_DAYS']:+.4f} (p = {format_pvalue(model_full.pvalues['REST_DAYS'])})")
print()
print("This progression illustrates what 'controlling for' means — isolating the rest")
print("effect from confounders like player quality.")
```
## The full regression model
We predict game score from:
- **REST_DAYS**: our variable of interest
- **PLAYER_SEASON_AVG**: controls for player quality
- **HOME**: home court advantage
- **OPP_STRENGTH**: how good is the opponent?
$$\text{GAME\_SCORE}_i = \beta_0 + \beta_1 \cdot \text{REST\_DAYS}_i + \beta_2 \cdot \text{PLAYER\_SEASON\_AVG}_i + \beta_3 \cdot \text{HOME}_i + \beta_4 \cdot \text{OPP\_STRENGTH}_i + \varepsilon_i$$
```{python}
# Full model summary
print(model_full.summary().tables[1])
print()
print(f"R-squared: {model_full.rsquared:.4f}")
print(f"Observations: {int(model_full.nobs)}")
```
## Interpreting the coefficients
:::{.callout-important}
## Definition: Coefficient Interpretation
Each coefficient has a specific interpretation: **"Holding everything else constant,** a one-unit
increase in $x_j$ is associated with a $\hat{\beta}_j$ change in game score."
:::
This phrasing describes what the model estimates -- it does not mean we can literally
hold everything else constant in the real world. In a more advanced course, you may see the same regression coefficients interpreted from three perspectives: prediction, inference, and causality. We explore the distinction in Chapter 18.
```{python}
# Extract and interpret coefficients
coefs = model_full.params
ses = model_full.bse
pvals = model_full.pvalues
print("Coefficient interpretations:")
print()
print(f" REST_DAYS: {coefs['REST_DAYS']:+.3f}")
print(f" -> One more rest day is associated with a {coefs['REST_DAYS']:.3f} change in game score")
print(f" -> Over a 3-day difference: {coefs['REST_DAYS']*3:.2f} points")
print()
print(f" PLAYER_SEASON_AVG: {coefs['PLAYER_SEASON_AVG']:+.3f}")
print(f" -> Better players score more (unsurprisingly)")
print()
print(f" HOME: {coefs['HOME']:+.3f}")
print(f" -> Home court advantage is worth ~{coefs['HOME']:.1f} game score points")
print()
print(f" OPP_STRENGTH: {coefs['OPP_STRENGTH']:+.3f}")
print(f" -> Facing a stronger opponent (by 1 GS point avg) shifts score by {coefs['OPP_STRENGTH']:.2f}")
```
Raw coefficients are hard to compare across predictors with different units. **Standardized coefficients** multiply each coefficient by its predictor's standard deviation, putting all effects on a common scale: "the effect of a one-SD change."
```{python}
# Visualize CIs for all coefficients (exclude intercept for scale)
fig, ax = plt.subplots(figsize=(8, 5))
ci = model_full.conf_int().drop('Intercept')
coef_vals = model_full.params.drop('Intercept')
# Standardize: multiply each coefficient by its predictor's SD
predictor_sds = nba_games[['REST_DAYS', 'PLAYER_SEASON_AVG', 'HOME', 'OPP_STRENGTH']].std()
std_coefs = coef_vals * predictor_sds
std_ci_lower = ci[0] * predictor_sds
std_ci_upper = ci[1] * predictor_sds
labels = ['Rest Days', 'Player Quality\n(Season Avg)', 'Home Court', 'Opponent\nStrength']
y_pos = range(len(labels))
ax.barh(y_pos, std_coefs.values, xerr=[
std_coefs.values - std_ci_lower.values,
std_ci_upper.values - std_coefs.values
], color='steelblue', edgecolor='black', capsize=5, alpha=0.7)
ax.axvline(x=0, color='red', linestyle='--', linewidth=1)
ax.set_yticks(list(y_pos))
ax.set_yticklabels(labels)
ax.set_xlabel('Effect of 1 SD Change in Predictor (Game Score Points)')
ax.set_title('Standardized Regression Coefficients with 95% CIs')
plt.tight_layout()
plt.show()
print("Player quality dominates. REST_DAYS is barely distinguishable from zero.")
```
Which of these coefficients should we trust? The bars for player quality and home court
are clearly separated from zero. The bar for rest days barely clears the line. We need a
formal framework to decide which coefficients are distinguishable from noise.
## Which coefficients matter? The null hypothesis
For each coefficient, the **null hypothesis** states that the predictor contributes nothing
to the model after accounting for all other predictors:
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_a: \beta_j \neq 0$$
If $\beta_j = 0$, then predictor $j$ is dead weight -- the model would be equally good
without it. The question: can we reject that null hypothesis?
## Bootstrap approach: computing a p-value by resampling
One way to assess whether the REST_DAYS coefficient is distinguishable from zero
is to bootstrap the coefficient and see how often the resampled estimate crosses zero.
```{python}
# Bootstrap the REST_DAYS coefficient
np.random.seed(42)
n_boot = 2000
boot_coefs = np.zeros(n_boot)
for i in range(n_boot):
    boot_idx = np.random.choice(len(nba_games), size=len(nba_games), replace=True)
    boot_data = nba_games.iloc[boot_idx]
    boot_model = smf.ols(
        'GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG + HOME + OPP_STRENGTH',
        data=boot_data
    ).fit()
    boot_coefs[i] = boot_model.params['REST_DAYS']
boot_se = boot_coefs.std()
boot_ci = np.percentile(boot_coefs, [2.5, 97.5])
# Two-sided p-value: fraction of bootstrap samples on the opposite side of zero
if np.mean(boot_coefs) >= 0:
    boot_p = 2 * np.mean(boot_coefs <= 0)
else:
    boot_p = 2 * np.mean(boot_coefs >= 0)
print("Bootstrap inference for REST_DAYS coefficient:")
print(f" Original estimate: {coefs['REST_DAYS']:.4f}")
print(f" Bootstrap SE: {boot_se:.4f}")
print(f" Bootstrap 95% CI: [{boot_ci[0]:.4f}, {boot_ci[1]:.4f}]")
print(f" Bootstrap p-value: {format_pvalue(boot_p)}")
```
The bootstrap gives us a distribution of the coefficient across resampled datasets.
Now we derive the same result using a formula.
## The t-statistic: a formula shortcut
The bootstrap standard error and the formula-based standard error measure the same thing:
how much the coefficient would vary across repeated samples. The **t-statistic** divides
the coefficient by its standard error.
```{python}
# The coefficient and its SE
beta_rest = coefs['REST_DAYS']
se_rest = ses['REST_DAYS']
# df_resid = degrees of freedom of the residuals (not a DataFrame!)
df_resid = model_full.df_resid
print(f"REST_DAYS coefficient: {beta_rest:.4f}")
print(f"Formula SE: {se_rest:.4f}")
print(f"Bootstrap SE: {boot_se:.4f} (close match)")
print(f"Ratio (coef / SE): {beta_rest / se_rest:.4f}")
print()
print("The ratio is about -7 -- the coefficient sits roughly 7 SEs from zero.")
```
:::{.callout-important}
## Definition: Coefficient t-test
This ratio has a name -- the **t-statistic** -- and we know its distribution under the null hypothesis.
For each coefficient, we test:
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_a: \beta_j \neq 0$$
The test statistic is:
$$t = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}$$
Under $H_0$, this statistic follows a $t$-distribution with $n - p - 1$ degrees of freedom.
:::
(Exact t-distribution results assume normally distributed errors. With a sample as large
as ours, $n > 70{,}000$, the central limit theorem makes the t-approximation accurate even
when the errors are not normal; with smaller samples, you would want to check a residual Q-Q plot.)
```{python}
# T-test for REST_DAYS coefficient — verify by hand
t_stat = beta_rest / se_rest
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)
print("T-test for REST_DAYS coefficient (by hand):")
print(f" t = {beta_rest:.4f} / {se_rest:.4f} = {t_stat:.4f}")
print(f" df = {df_resid:.0f}")
print(f" p-value = {format_pvalue(p_value)}")
print()
print("From statsmodels:")
print(f" t = {model_full.tvalues['REST_DAYS']:.4f}")
print(f" p = {format_pvalue(model_full.pvalues['REST_DAYS'])}")
print()
print("They match! The formula gives the same answer as bootstrap, much faster.")
```
## Confidence interval for a coefficient
A 95% **confidence interval** for $\beta_j$:
$$\hat{\beta}_j \pm t^* \times \text{SE}(\hat{\beta}_j)$$
where $t^*$ is the critical value from the $t$-distribution.
```{python}
# 95% CI for REST_DAYS coefficient
t_star = stats.t.ppf(0.975, df_resid)
ci_lower = beta_rest - t_star * se_rest
ci_upper = beta_rest + t_star * se_rest
print(f"95% CI for REST_DAYS coefficient:")
print(f" {beta_rest:.4f} +/- {t_star:.3f} x {se_rest:.4f}")
print(f" [{ci_lower:.4f}, {ci_upper:.4f}]")
print()
print(f"Interpretation: we are 95% confident that one extra day of rest is associated")
print(f"with between {ci_lower:.3f} and {ci_upper:.3f} additional game score points.")
print()
print(f"From statsmodels: {model_full.conf_int().loc['REST_DAYS'].values}")
print(f"From bootstrap: [{boot_ci[0]:.4f}, {boot_ci[1]:.4f}]")
```
## Can additional covariates flip the rest-days sign?
The REST_DAYS coefficient is negative (or near zero) -- disappointing if you hoped rest
helps performance. Can adding more covariates change the story?
We derive two additional predictors from columns already in the dataset:
- **BACK_TO_BACK**: a binary indicator for playing on consecutive days (REST_DAYS = 1)
- **PREV_MIN**: minutes played in the previous game (a fatigue/usage proxy)
```{python}
# Extended model with additional covariates
nba_extended = nba_games.dropna(subset=['PREV_MIN']).copy()
model_extended = smf.ols(
'GAME_SCORE ~ REST_DAYS + PLAYER_SEASON_AVG + HOME + OPP_STRENGTH + BACK_TO_BACK + PREV_MIN',
data=nba_extended
).fit()
print("REST_DAYS coefficient across model specifications:")
print(f" Base model (4 predictors): {model_full.params['REST_DAYS']:+.4f} (p = {format_pvalue(model_full.pvalues['REST_DAYS'])})")
print(f" Extended model (6 predictors): {model_extended.params['REST_DAYS']:+.4f} (p = {format_pvalue(model_extended.pvalues['REST_DAYS'])})")
print()
print("Extended model coefficients:")
for var in ['REST_DAYS', 'BACK_TO_BACK', 'PREV_MIN', 'PLAYER_SEASON_AVG', 'HOME', 'OPP_STRENGTH']:
    print(f"  {var:25s} {model_extended.params[var]:+.4f} (p = {format_pvalue(model_extended.pvalues[var])})")
print()
print("The coefficient for REST_DAYS remains negative (or near zero) regardless of")
print("which covariates we include. Additional rest does not appear to improve performance")
print("in this dataset — a genuine finding, not an artifact of omitted variables.")
print("Sometimes the data simply do not support the hypothesis we hoped for.")
```
## Regression diagnostics: checking the model's assumptions
The t-tests and confidence intervals above assume residuals are
approximately normal with constant variance. Diagnostics let us look under the hood.
:::{.callout-important}
## Definition: LINE Conditions
The assumptions for regression inference are remembered by the **LINE** mnemonic:
- **L**inearity: the relationship between predictors and response is linear.
- **I**ndependence: observations are independent of each other.
- **N**ormality: residuals are approximately normally distributed.
- **E**qual variance: the spread of residuals is roughly constant across fitted values (no heteroscedasticity).
:::
Each LINE condition has a characteristic diagnostic signature. As a homework exercise, you'll generate synthetic data that violates each assumption and create the corresponding diagnostic plots.
### Residual plot: do residuals have constant spread?
:::{.callout-important}
## Definition: Residual Plot
A **residual plot** shows residuals (observed $-$ predicted) vs. fitted values. We want
to see a random scatter with no pattern. If the spread of residuals changes with the
fitted value, that is **heteroscedasticity** -- the model's uncertainty is not uniform.
:::
```{python}
# Residual plot for the NBA model
fitted_vals = model_full.fittedvalues
residuals = model_full.resid
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Residuals vs fitted
axes[0].scatter(fitted_vals, residuals, alpha=0.02, s=5, color='steelblue')
axes[0].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted Values')
# Add a LOWESS smoother to highlight any trend
smooth = lowess(residuals, fitted_vals, frac=0.1)
axes[0].plot(smooth[:, 0], smooth[:, 1], color='orange', linewidth=2, label='LOWESS smooth')
axes[0].legend()
# Scale-location plot: sqrt of |standardized residuals| vs fitted
std_resid = model_full.get_influence().resid_studentized_internal
axes[1].scatter(fitted_vals, np.sqrt(np.abs(std_resid)), alpha=0.02, s=5, color='steelblue')
axes[1].set_xlabel('Fitted Values')
axes[1].set_ylabel(r'$\sqrt{|\mathrm{Standardized\ Residuals}|}$')
axes[1].set_title('Scale-Location Plot\n(check for heteroscedasticity)')
smooth2 = lowess(np.sqrt(np.abs(std_resid)), fitted_vals, frac=0.1)
axes[1].plot(smooth2[:, 0], smooth2[:, 1], color='orange', linewidth=2)
plt.tight_layout()
plt.show()
print("The spread of residuals looks roughly constant across fitted values — reassuring.")
```
### Q-Q plot: are residuals approximately normal?
A **Q-Q plot** compares residual quantiles to what we would expect from a normal distribution.
Points on the diagonal indicate normality. Deviations in the tails indicate heavy tails or skew.
```{python}
# Q-Q plot using scipy.stats.probplot for correct scaling
fig, ax = plt.subplots(figsize=(6, 6))
(theoretical_quantiles, ordered_residuals), (slope, intercept, r) = stats.probplot(
residuals, dist='norm'
)
ax.scatter(theoretical_quantiles, ordered_residuals, alpha=0.02, s=5, color='steelblue')
# Reference line: if residuals are normal, points lie on y = intercept + slope * x
ref_x = np.array([theoretical_quantiles.min(), theoretical_quantiles.max()])
ax.plot(ref_x, intercept + slope * ref_x, 'r-', linewidth=1.5, label='Normal reference line')
ax.set_xlabel('Theoretical Quantiles (Normal)')
ax.set_ylabel('Sample Quantiles (Residuals)')
ax.set_title('Q-Q Plot of Residuals\n(NBA regression model)')
ax.legend()
plt.tight_layout()
plt.show()
print("The middle of the distribution follows the normal line closely.")
print("The tails deviate — more extreme game scores occur than a normal")
print("distribution would predict. This pattern is common in sports data.")
```
### Reading a Q-Q plot: a visual gallery
The NBA Q-Q plot above shows heavy tails — but what do other departures from normality look like? The gallery below shows five common patterns alongside the ideal (normal) case. Each panel uses simulated data so the pattern is unmistakable.
```{python}
#| label: fig-qq-gallery
#| fig-cap: "Q-Q plot gallery: six patterns and what they reveal about the residual distribution."
from scipy.stats import probplot, skewnorm, t as t_dist, uniform
np.random.seed(42)
n_sim = 500
# Simulate six residual distributions
sim_data = {
'Normal\n(assumptions met)': np.random.normal(0, 1, n_sim),
'Left-skewed\n(concave / bows down)': skewnorm.rvs(-8, size=n_sim),
'Right-skewed\n(convex / bows up)': skewnorm.rvs(8, size=n_sim),
'Heavy-tailed\n(S-shaped)': t_dist.rvs(df=3, size=n_sim),
'Light-tailed\n(reverse-S)': uniform.rvs(loc=-1, scale=2, size=n_sim),
'Bimodal\n(two clusters)': np.concatenate([
np.random.normal(-2, 0.5, n_sim // 2),
np.random.normal(2, 0.5, n_sim // 2)
]),
}
fig, axes = plt.subplots(2, 3, figsize=(14, 9))
for ax, (label, data) in zip(axes.flat, sim_data.items()):
    (tq, sq), (slope, intercept, _) = probplot(data, dist='norm')
    ax.scatter(tq, sq, s=10, alpha=0.5, color='steelblue')
    ref_x = np.array([tq.min(), tq.max()])
    ax.plot(ref_x, intercept + slope * ref_x, 'r-', linewidth=1.5)
    ax.set_title(label, fontsize=11)
    ax.set_xlabel('')
    ax.set_ylabel('')
# Shared axis labels
fig.supxlabel('Theoretical Quantiles (Normal)', fontsize=12)
fig.supylabel('Sample Quantiles', fontsize=12)
plt.tight_layout()
plt.show()
```
:::{.callout-note}
## How to read the gallery
| Q-Q shape | What it means | Example cause |
|:--|:--|:--|
| Straight line | Normal residuals — assumptions satisfied | Well-specified model |
| Concave (bows down) | Left-skewed residuals | Floor effect or bounded response |
| Convex (bows up) | Right-skewed residuals | Untransformed prices or counts |
| S-shaped | Heavy tails — more extreme values than normal | Outlier-prone measurements |
| Reverse-S | Light tails — fewer extreme values than normal | Bounded or truncated data |
| Staircase or gap | Multimodal residuals — distinct subgroups | Missing categorical predictor |
The key diagnostic reflex: when the Q-Q plot bows or bends, ask *why* the residuals deviate from normality. The shape often points directly to a modeling fix (a log transform, a missing predictor, or an outlier filter).
:::
### Why residuals reveal what raw data hides
Looking at residuals instead of the raw response variable is not just a formality. Other predictors add scatter to the raw data that can mask violations of the model's assumptions. Residuals remove the explained variation, leaving only the unexplained part — and that is where patterns become visible.
For example, a plot of price vs. bedrooms for Airbnb listings shows enormous scatter because neighborhood, room type, and amenities all vary across listings. But the residuals from a model that controls for those variables are much cleaner. A nonlinear pattern — say, diminishing returns from extra bedrooms — that was invisible in the raw scatter may jump out in the residual plot.
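A small simulation illustrates the point (all numbers hypothetical). The curvature in `x1`'s effect is hidden by `x2`'s large contribution to the raw scatter, but surfaces in the residuals:

```{python}
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1000
sim = pd.DataFrame({'x1': rng.uniform(0, 4, n), 'x2': rng.normal(0, 3, n)})
# True model: diminishing returns in x1, plus a large linear x2 effect
sim['y'] = 10 * np.sqrt(sim['x1']) + 5 * sim['x2'] + rng.normal(0, 1, n)

# Fit a (misspecified) linear model and inspect the residuals against x1
fit_sim = smf.ols('y ~ x1 + x2', data=sim).fit()
sim['resid'] = fit_sim.resid

# Concave signature: residuals positive in the middle of x1's range,
# negative at the ends, even though the raw y-vs-x1 scatter looks like noise
mid = sim.loc[sim['x1'].between(1, 3), 'resid'].mean()
ends = sim.loc[~sim['x1'].between(1, 3), 'resid'].mean()
print(f"mean residual, middle of x1 range: {mid:+.2f}")
print(f"mean residual, ends of x1 range:   {ends:+.2f}")
```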
:::{.callout-tip}
## Think About It
Residuals reveal nonlinearity that raw data hides. Why? Because residuals strip away the variation explained by *other* predictors, making any remaining pattern easier to see.
:::
:::{.callout-warning}
## When diagnostics reveal problems
If residuals show strong non-normality or heteroscedasticity, the formula-based
standard errors and p-values may be unreliable. Options include:
- **Bootstrap confidence intervals** (Chapter 8) do not assume normality -- they let
the data speak for themselves.
- **Robust standard errors** adjust for heteroscedasticity without changing the
coefficient estimates.
- **Transform the response** (e.g., log price) to stabilize variance.
:::
For the NBA model, our sample is large ($n > 70{,}000$) and the Q-Q plot is
reasonable in the middle, so the formula-based inference is trustworthy here.
With smaller samples or heavier tails, bootstrap CIs would be the safer choice.
## The significance-vs-importance tension
:::{.callout-tip}
## Think About It
The t-test says REST_DAYS is "statistically significant" ($p < 0.001$).
Yet the rest-days bar in the coefficient plot above barely clears zero.
The model scarcely changes when we add rest days. Should we care?
:::
This example illustrates the tension between **statistical significance** and **practical importance**.
## "Significant" does not mean "important"
A small p-value does not mean a large effect. This principle is one of the most
frequently misunderstood ideas in applied statistics.
Let us return to the NBA model. The REST_DAYS coefficient was statistically significant ($p < 0.001$).
Look at the **effect size**.
```{python}
# Effect size for REST_DAYS in NBA model
rest_coef = model_full.params['REST_DAYS']
rest_se = model_full.bse['REST_DAYS']
game_score_std = nba_games['GAME_SCORE'].std()
# Standardized effect size: coefficient divided by outcome SD
std_effect = rest_coef / game_score_std
print("=== NBA Rest Effect: Significance vs Importance ===")
print()
print(f"Coefficient: {rest_coef:.3f} game score points per rest day")
print(f"Standard deviation of game score: {game_score_std:.2f}")
print(f"Standardized effect size: {std_effect:.4f} SD")
print(f"p-value: {format_pvalue(model_full.pvalues['REST_DAYS'])}")
print()
print("For context, Cohen's d benchmarks for effect sizes:")
print(" Small effect: d = 0.2")
print(" Medium effect: d = 0.5")
print(" Large effect: d = 0.8")
print(f" Our effect: {std_effect:.4f} <-- essentially zero")
print()
print(f"With n = {len(nba_games):,} observations, we can detect effects that are")
print(f"completely negligible in practice.")
```
Placing the rest effect alongside home court advantage and player quality makes the relative magnitudes concrete.
```{python}
# Visualize: comparing effect sizes in the model
fig, ax = plt.subplots(figsize=(8, 5))
effects = {
'One rest day': rest_coef,
'Home court': model_full.params['HOME'],
'1 SD player quality': model_full.params['PLAYER_SEASON_AVG'] * nba_games['PLAYER_SEASON_AVG'].std(),
}
colors = ['#e74c3c', '#3498db', '#2ecc71']
bars = ax.bar(effects.keys(), effects.values(), color=colors, edgecolor='black')
ax.set_ylabel('Effect on Game Score')
ax.set_title('Comparing Effect Sizes\n(rest effect is tiny)')
for bar, val in zip(bars, effects.values()):
    ax.text(bar.get_x() + bar.get_width()/2, val + 0.1, f'{val:.2f}',
            ha='center', fontsize=10)
plt.tight_layout()
plt.show()
print("The rest effect is dwarfed by player quality and even home court advantage.")
```
:::{.callout-tip}
## Think About It
A sports analytics team comes to you and says "We ran a regression and found a
negative coefficient for rest days -- extra rest actually *hurts* performance,
with $p < 0.001$!" What questions should you ask?
:::
1. How big is the effect? (Tiny -- a fraction of a point, standardized effect near zero.)
2. Is there confounding? (Yes -- coaches strategically rest players before tough opponents.)
3. Does the model capture all relevant factors? (No -- schedule difficulty, travel, fatigue history, etc.)
4. Would this result hold out of sample? (Uncertain.)
Statistical significance with large samples can detect effects **too small to matter**.
## Association is not causation
:::{.callout-warning}
## Association is not causation
Our regression **controls for** player quality, home court, and opponent
strength, but it does NOT estimate a **causal effect** of rest. Coaches choose when to
rest players for strategic reasons we cannot observe in the data. The coefficient tells
us about association, not causation.
We formalize this distinction in Chapter 18 using causal DAGs. For now, remember:
regression can adjust for measured confounders, but it cannot eliminate unmeasured ones.
:::
## Regression inference on Airbnb data
The NBA example showed statistical significance without practical importance. Regression
inference is not always so underwhelming -- sometimes the effect IS real and large.
The question: **is the price premium for an extra bathroom statistically significant,
and is it large enough to matter?**
We filter to listings under \$500/night to focus on typical listings and avoid extreme
luxury outliers that would distort the regression.
```{python}
# Load and clean Airbnb data
airbnb = pd.read_csv(f'{DATA_DIR}/airbnb/listings.csv', low_memory=False)
# Convert price to numeric (handles both numeric and "$150.00" string formats)
airbnb['price_clean'] = pd.to_numeric(
airbnb['price'].astype(str).str.replace(r'[\$,]', '', regex=True),
errors='coerce'
)
# Filter to typical listings (under $500/night removes extreme luxury outliers)
airbnb = airbnb.rename(columns={'neighbourhood_group_cleansed': 'borough'})
airbnb_clean = airbnb[
(airbnb['price_clean'] > 0) &
(airbnb['price_clean'] <= 500) &
(airbnb['bathrooms'].notna()) &
(airbnb['bedrooms'].notna()) &
(airbnb['room_type'] == 'Entire home/apt')
].copy()
print(f"Airbnb listings (filtered): {len(airbnb_clean):,}")
print(f"Price range: ${airbnb_clean['price_clean'].min():.0f} - ${airbnb_clean['price_clean'].max():.0f}")
print(f"Bathrooms range: {airbnb_clean['bathrooms'].min():.1f} - {airbnb_clean['bathrooms'].max():.1f}")
```
Before modeling, let us visualize. The `borough` column represents
NYC boroughs (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
```{python}
# Visualize price by bathrooms
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.boxplot(x='bathrooms', y='price_clean', data=airbnb_clean, ax=axes[0], color='steelblue')
axes[0].set_xlabel('Number of Bathrooms')
axes[0].set_ylabel('Price per Night ($)')
axes[0].set_title('Airbnb Price by Bathrooms')
sns.boxplot(x='borough', y='price_clean',
data=airbnb_clean, ax=axes[1], color='steelblue')
axes[1].set_xlabel('Borough')
axes[1].set_ylabel('Price per Night ($)')
axes[1].set_title('Airbnb Price by Borough')
axes[1].tick_params(axis='x', rotation=30)
plt.tight_layout()
plt.show()
print("More bathrooms = higher price. Does this hold after controlling for bedrooms and borough?")
```
We fit a regression using `smf.ols()` with the formula interface. The `C()` wrapper tells statsmodels to treat borough as a categorical variable, creating indicator (dummy) columns automatically.
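As a toy illustration of the encoding (made-up numbers): `C()` uses treatment coding, dropping the alphabetically first level as the reference category, and each dummy coefficient is that group's mean shift relative to the reference.

```{python}
import pandas as pd
import statsmodels.formula.api as smf

toy = pd.DataFrame({
    'price': [100, 120, 200, 210, 90, 95],
    'boro':  ['Bronx', 'Bronx', 'Manhattan', 'Manhattan', 'Queens', 'Queens'],
})
toy_fit = smf.ols('price ~ C(boro)', data=toy).fit()
# Intercept = mean price for the reference level 'Bronx' (110);
# each C(boro)[T.*] coefficient = that group's mean minus the Bronx mean
print(toy_fit.params)
```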
```{python}
# Regression: price ~ bathrooms + bedrooms + borough
airbnb_model = smf.ols(
'price_clean ~ bathrooms + bedrooms + C(borough)',
data=airbnb_clean
).fit()
print(airbnb_model.summary().tables[1])
```
Let's isolate the bathrooms coefficient and build a confidence interval for it.
```{python}
# Focus on the bathrooms coefficient
beta_bath = airbnb_model.params['bathrooms']
se_bath = airbnb_model.bse['bathrooms']
ci_bath = airbnb_model.conf_int().loc['bathrooms']
p_bath = airbnb_model.pvalues['bathrooms']
print(f"Bathrooms coefficient: ${beta_bath:.2f} per night")
print(f"95% CI: [${ci_bath[0]:.2f}, ${ci_bath[1]:.2f}]")
print(f"p-value: {format_pvalue(p_bath)}")
print()
print(f"Interpretation: controlling for bedrooms and borough, each additional bathroom")
print(f"is associated with ${beta_bath:.2f} more per night.")
print()
print("This IS practically meaningful -- unlike the NBA rest effect!")
```
## Prediction intervals vs. confidence intervals
We have built confidence intervals for **coefficients** -- but what about predictions?
When we use a regression model to predict at a particular value of $x$, there are
two different questions, and they give different intervals.
- **Confidence interval for the mean response**: How uncertain are we about the
*average* $Y$ at this $x$? This interval reflects estimation uncertainty in $\hat{\beta}$.
- **Prediction interval for a new observation**: Where might a *single new* $Y$ fall?
This interval adds individual variability ($\varepsilon$) on top of estimation uncertainty.
:::{.callout-important}
## Prediction intervals are always wider
A prediction interval accounts for **two sources of uncertainty**: (1) estimation
uncertainty in the regression line, and (2) individual variability around the line
($\sigma^2$). A confidence interval for the mean includes only source (1). In formulas:
$$\text{SE}_{\text{pred}}^2 \;=\; \text{SE}_{\text{mean}}^2 \;+\; \hat{\sigma}^2$$
where $\hat{\sigma}^2$ is the estimated residual variance, $\text{SE}_{\text{mean}}$ is the
standard error of the fitted mean at the chosen $x$, and $\text{SE}_{\text{pred}}$ determines
the prediction interval's width. The prediction interval is always wider, and the gap grows
with how noisy the response is.
:::
```{python}
# Prediction interval vs confidence interval — Airbnb model
bath_vals = np.arange(1, airbnb_clean['bathrooms'].max() + 0.5, 0.5)
grid = pd.DataFrame({
'bathrooms': bath_vals,
'bedrooms': [airbnb_clean['bedrooms'].median()] * len(bath_vals),
'borough': ['Manhattan'] * len(bath_vals),
})
grid_pred = airbnb_model.get_prediction(grid).summary_frame(alpha=0.05)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(bath_vals, grid_pred['mean'], 'o-', color='steelblue', label='Predicted mean')
ax.fill_between(bath_vals, grid_pred['mean_ci_lower'], grid_pred['mean_ci_upper'],
alpha=0.4, color='steelblue', label='95% CI for mean response')
ax.fill_between(bath_vals, grid_pred['obs_ci_lower'], grid_pred['obs_ci_upper'],
alpha=0.15, color='orange', label='95% prediction interval')
ax.set_xlabel('Number of Bathrooms')
ax.set_ylabel('Predicted Price ($/night)')
ax.set_title('CI for Mean Response vs. Prediction Interval\n(Manhattan, median bedrooms)')
ax.legend()
plt.tight_layout()
plt.show()
print("The blue band (CI for the mean) is narrow — we are fairly sure about the average price.")
print("The orange band (prediction interval) is wide — individual listings vary greatly.")
```
Computing both intervals at a specific prediction point makes the difference concrete.
```{python}
# Numeric example: predicted price for a 2-bathroom, 2-bedroom Manhattan listing
example = pd.DataFrame({
'bathrooms': [2],
'bedrooms': [2],
'borough': ['Manhattan'],
})
example_pred = airbnb_model.get_prediction(example).summary_frame(alpha=0.05)
print("Prediction for a 2-bath, 2-bed Manhattan listing:")
print(f" Predicted price: ${example_pred['mean'].values[0]:.2f}/night")
print(f" 95% CI for mean price: [${example_pred['mean_ci_lower'].values[0]:.2f}, ${example_pred['mean_ci_upper'].values[0]:.2f}]")
print(f" 95% prediction interval: [${example_pred['obs_ci_lower'].values[0]:.2f}, ${example_pred['obs_ci_upper'].values[0]:.2f}]")
print()
print("The CI says the *average* listing like this rents for roughly")
print(f"${example_pred['mean_ci_lower'].values[0]:.0f}-${example_pred['mean_ci_upper'].values[0]:.0f}/night.")
print("The PI says a *single new* listing could fall anywhere in a much wider range,")
print("reflecting the large variation among individual Airbnb listings.")
```
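The variance decomposition from the callout can be checked numerically. Here is a minimal simulation, independent of the Airbnb data (all numbers are made up): fit a simple OLS model, convert the two interval half-widths back to standard errors, and confirm that the prediction SE splits into the mean SE plus the residual variance.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated data (hypothetical): y = 2 + 3x + Normal(0, 5) noise
rng = np.random.default_rng(0)
sim = pd.DataFrame({'x': rng.uniform(0, 10, 200)})
sim['y'] = 2 + 3 * sim['x'] + rng.normal(0, 5, 200)

m = smf.ols('y ~ x', data=sim).fit()
pred = m.get_prediction(pd.DataFrame({'x': [5.0]})).summary_frame(alpha=0.05)

# Half-width = t_crit * SE, so divide by t_crit to recover the SEs
t_crit = stats.t.ppf(0.975, m.df_resid)
se_mean = (pred['mean_ci_upper'] - pred['mean_ci_lower']).values[0] / (2 * t_crit)
se_pred = (pred['obs_ci_upper'] - pred['obs_ci_lower']).values[0] / (2 * t_crit)

# SE_pred^2 = SE_mean^2 + sigma_hat^2 (the two sources of uncertainty)
print(f"SE_pred^2            = {se_pred**2:.4f}")
print(f"SE_mean^2 + sigma^2  = {se_mean**2 + m.mse_resid:.4f}")
```

The two printed numbers agree: statsmodels builds the prediction interval from exactly this sum, which is why the orange band in the earlier plot dwarfs the blue one whenever $\hat{\sigma}^2$ is large.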
## Putting it together: NBA vs Airbnb
Both analyses used the same tools -- t-tests and confidence intervals on regression
coefficients. The conclusions differ dramatically.
```{python}
# Side-by-side comparison
print("=" * 60)
print(f"{'':30s} {'NBA Rest':>12s} {'Airbnb Bath':>12s}")
print("=" * 60)
print(f"{'Coefficient':30s} {rest_coef:12.3f} {beta_bath:12.2f}")
print(f"{'SE':30s} {rest_se:12.4f} {se_bath:12.2f}")
print(f"{'p-value':30s} {format_pvalue(model_full.pvalues['REST_DAYS']):>12s} {format_pvalue(p_bath):>12s}")
ci_rest = model_full.conf_int().loc['REST_DAYS']
print(f"{'95% CI lower':30s} {ci_rest[0]:12.3f} {ci_bath[0]:12.2f}")
print(f"{'95% CI upper':30s} {ci_rest[1]:12.3f} {ci_bath[1]:12.2f}")
std_effect_label = "Standardized effect (coef/SD)"
print(f"{std_effect_label:30s} {std_effect:12.4f} {beta_bath/airbnb_clean['price_clean'].std():12.4f}")
print(f"{'n':30s} {len(nba_games):12,d} {int(airbnb_model.nobs):12,d}")
print(f"{'Practically meaningful?':30s} {'No':>12s} {'Yes':>12s}")
print("=" * 60)
print()
print("Both are 'statistically significant.' Only one is large enough to act on.")
```
## Key takeaways
1. **Regression inference asks: is each predictor useful?** The **coefficient t-test**
tests $H_0: \beta_j = 0$, asking whether the variable contributes above noise.
Parsimony favors dropping predictors that fail this test.
2. **Building models step by step reveals what "controlling for" means.** Watch how
coefficients change as you add predictors -- this progression makes the idea concrete.
3. **Confidence intervals tell you the range of plausible effect sizes.** A CI that barely
excludes zero should make you cautious.
4. **Statistical significance $\neq$ practical importance.** With large $n$, you can detect
effects too small to matter. Always report **effect sizes** alongside p-values.
5. **Context determines whether a result is actionable.** The bathroom premium (about \$60/night) is actionable for Airbnb hosts.
The rest effect (a fraction of a game score point) is not actionable for NBA coaches.
6. **Regression estimates associations, not causal effects** (unless all confounders are
controlled for -- more in Chapter 18).
We revisit this topic in Chapter 18 when we ask: coaches *choose* when to rest players --
so is the "rest effect" actually a "coaching strategy" effect? That question is causal,
and regression alone cannot answer it.
**Next up:** Chapter 13 is the Act 2 capstone on classification, where we apply
inference ideas to categorical outcomes.
## Study guide
### Key ideas
- **Coefficient t-test**: Tests whether a single regression coefficient is significantly different from zero. $t = \hat{\beta}_j / \text{SE}(\hat{\beta}_j)$.
- **Standard error (of a coefficient)**: The estimated standard deviation of $\hat{\beta}_j$ across repeated samples. Measures uncertainty in the coefficient estimate.
- **Controlling for** / **holding constant**: The regression interpretation -- the effect of one predictor after accounting for the others in the model.
- **Practical significance**: Whether an effect is large enough to matter in context, regardless of the p-value.
- **Effect size**: A standardized measure of how large an effect is (e.g., coefficient divided by the outcome's SD).
- **Residual plot**: Scatter plot of residuals vs. fitted values; used to check for patterns, non-linearity, and heteroscedasticity.
- **Q-Q plot**: Quantile-quantile plot comparing residual quantiles to a normal distribution; points on the diagonal indicate normality.
- **Heteroscedasticity**: When the variance of residuals changes across fitted values (the spread "fans out"). Violates the constant-variance assumption of OLS inference.
- **LINE conditions**: The four assumptions for regression inference -- Linearity, Independence, Normality, Equal variance.
- **Prediction interval**: An interval for where a *single new* observation might fall at a given $x$. Always wider than the confidence interval for the mean -- it includes both estimation uncertainty and individual variability.
- **Parsimony / Occam's razor**: The best model is the simplest one that adequately fits the data. Drop predictors that are not useful.
- The t-test for a coefficient asks: is this predictor useful, or could the observed coefficient be due to noise?
- A confidence interval for a coefficient gives the range of plausible effect sizes, not just a yes/no answer.
- Building models step by step (adding one predictor at a time) reveals what "controlling for" actually does to coefficients.
- Statistical significance with large n can detect effects too small to matter -- always check effect size.
- Regression controls for measured confounders but does not establish causation.
- Residual plots and Q-Q plots check whether the model's assumptions (LINE conditions) hold; when they do not, use bootstrap CIs (Chapter 8) instead of formula CIs.
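The bootstrap fallback works the same way here as in Chapter 8: resample rows with replacement, refit the model, and take percentiles of the resampled coefficients. A minimal sketch on simulated data (the dataset and numbers are hypothetical, standing in for something like `airbnb_clean`):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: price depends on bathrooms with a true slope of 40
rng = np.random.default_rng(1)
n = 300
sim = pd.DataFrame({'bathrooms': rng.integers(1, 4, n).astype(float)})
sim['price'] = 80 + 40 * sim['bathrooms'] + rng.normal(0, 30, n)

# Bootstrap: resample rows with replacement, refit, keep the coefficient
boot_coefs = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    fit = smf.ols('price ~ bathrooms', data=sim.iloc[idx]).fit()
    boot_coefs.append(fit.params['bathrooms'])

# Percentile CI -- no normality assumption on the residuals needed
lo, hi = np.percentile(boot_coefs, [2.5, 97.5])
print(f"95% bootstrap CI for the slope: [{lo:.1f}, {hi:.1f}]")
```

When the LINE conditions hold, this interval lands close to `model.conf_int()`; when they fail (heavy tails, heteroscedasticity), the bootstrap interval is the more trustworthy of the two.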
### Computational tools
- `smf.ols('y ~ x1 + x2', data=df).fit()` -- fit an OLS regression with formula syntax
- `model.summary()` -- full regression output including coefficients, SEs, t-stats, p-values
- `model.params` -- pandas Series of fitted coefficients, indexed by term name
- `model.pvalues` -- pandas Series of p-values for each coefficient's t-test
- `model.conf_int()` -- 95% confidence intervals for all coefficients
- `model.bse` -- standard errors of all coefficients
- `model.fittedvalues`, `model.resid` -- fitted values and residuals for diagnostic plots
- `model.get_prediction(new_data).summary_frame()` -- confidence interval for the mean and prediction interval for a new observation
- `stats.probplot(residuals, dist='norm')` -- Q-Q plot data with correct scaling
- `format_pvalue(p)` -- format p-values in scientific notation when below 0.001
### For the quiz
- Be able to interpret a coefficient table: read off coefficients, SEs, t-statistics, and p-values.
- Know what "controlling for" means in a regression context and how adding a predictor can change other coefficients.
- Distinguish statistical significance (small p-value) from practical importance (large effect size).
- Know the formula $t = \hat{\beta}_j / \text{SE}(\hat{\beta}_j)$ and what it tests.
- Be able to interpret a confidence interval for a regression coefficient in context.
- Know what a residual plot and Q-Q plot show, and what patterns indicate problems (heteroscedasticity, non-normality).
- Know the LINE conditions (Linearity, Independence, Normality, Equal variance) and which diagnostic plots check which conditions.
- Know that bootstrap CIs (Chapter 8) are a fallback when normality assumptions fail.