MSE 125 — Applied Statistics
Wednesday, May 6, 2026
$50,000 renovation for a second bathroom?
listings with more bathrooms charge more — is the premium real, large enough to recoup, stable enough to bet on?
analytics team’s claim: rest boosts performance — restructure the schedule around it?
before they act, let’s check: does the data support that claim?
same toolkit. very different answers.
| | question | tool | doesn’t answer |
|---|---|---|---|
| Q1 | nonzero? | t-test, p-value, CI excludes 0 | how big |
| Q2 | big enough? | coefficient, CI width, Cohen’s d | are assumptions met |
| Q3 | trustworthy? | residual plot, Q-Q, LINE | confounding |
we’ll walk through these questions twice:
regression for decisions
n = 14,689 listings (after filtering)
obvious patterns: more bathrooms → higher price; Manhattan > Brooklyn > rest.
but bathrooms and bedrooms are correlated.
sketch your guess for two numbers in the regression below — sign and rough magnitude.
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 89.0594 | 3.487 | 25.541 | 0.000 | 82.225 | 95.894 |
| C(borough)[T.Brooklyn] | 26.4317 | 3.347 | 7.898 | 0.000 | 19.871 | 32.992 |
| C(borough)[T.Manhattan] | 78.9870 | 3.354 | 23.546 | 0.000 | 72.412 | 85.562 |
| C(borough)[T.Queens] | -3.6129 | 3.563 | -1.014 | 0.310 | -10.597 | 3.371 |
| C(borough)[T.Staten Island] | -8.6260 | 7.063 | -1.221 | 0.222 | -22.471 | 5.219 |
| bathrooms | 62.4408 | 1.860 | 33.566 | 0.000 | 58.795 | 66.087 |
| bedrooms | 34.0263 | 0.961 | 35.394 | 0.000 | 32.142 | 35.911 |
every regression output you’ll see has these columns. we’ll walk through bathrooms end to end, then have a template for the rest of the course.
| column | what it answers | bathrooms |
|---|---|---|
| coef | \hat\beta_j | $62.44 / bath |
| std err | how precise? | $1.86 |
| t | \hat\beta / \widehat{\text{SE}} | 33.6 |
| P>\|t\| | reject H_0: \beta=0? | < 0.001 |
| [0.025, 0.975] | 95% CI | [$58.80, $66.09] |
plus header: R^2 = 0.40, footer: Cond. No. = 35
“holding everything else constant, a one-unit increase in x_j is associated with a \hat\beta_j change in y”
would adding a bedroom raise YOUR listing price by $34?
think about what changes when you add a bedroom to your apartment.
coefficient null hypothesis
H_0: \beta_j = 0 \quad \text{vs} \quad H_a: \beta_j \neq 0
predictor j contributes nothing given the other predictors in this model
coefficient t-test
t = \frac{\hat\beta_j}{\widehat{\text{SE}}(\hat\beta_j)}
ratio of an estimate to its estimated SE
formula: \hat\beta_j \pm t^* \cdot \widehat{\text{SE}}(\hat\beta_j)
plug in: 62.44 \pm 1.96 \cdot 1.86 = [\$58.80,\ \$66.09]
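the plug-in step can be checked in a few lines. a minimal sketch, using only the bathrooms row of the output (coef 62.4408, std err 1.860) and the large-sample critical value 1.96; the helper name `wald_summary` is ours, not a library call.

```python
def wald_summary(coef, se, crit=1.96):
    """Return (t statistic, CI lower bound, CI upper bound) for one coefficient."""
    t_stat = coef / se          # estimate divided by its estimated SE
    margin = crit * se          # half-width of the interval
    return t_stat, coef - margin, coef + margin


# bathrooms row from the regression output
t_stat, lo, hi = wald_summary(62.4408, 1.860)
print(f"t = {t_stat:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

the t value differs from the printed 33.566 in the third decimal because the std err in the table is itself rounded.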
bootstrap (Ch 8) on 1,000 resamples agrees:
formula CI: [$58.80, $66.09]
bootstrap CI: [$58.78, $66.04]
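one way the bootstrap check might look, as a sketch rather than the course's actual code: resample rows with replacement, refit, and take percentiles of the refitted slopes. assumptions to flag: the data here are synthetic (not the Airbnb listings), the model is a one-predictor regression to keep it short, and `ols_slope` and `bootstrap_slope_ci` are hypothetical helper names.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the slope, resampling rows with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # one resample of row indices
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    k = round(n_boot * alpha / 2)                    # number of slopes cut from each tail
    return slopes[k], slopes[n_boot - 1 - k]


# Synthetic stand-in data (assumption): price = 90 + 60 * bathrooms + noise.
data_rng = random.Random(125)
xs = [data_rng.choice([1, 1.5, 2, 2.5, 3]) for _ in range(400)]
ys = [90 + 60 * x + data_rng.gauss(0, 50) for x in xs]

slope_hat = ols_slope(xs, ys)
boot_lo, boot_hi = bootstrap_slope_ci(xs, ys)
print(f"slope = {slope_hat:.1f}, bootstrap 95% CI = [{boot_lo:.1f}, {boot_hi:.1f}]")
```

resampling whole rows keeps each listing's x and y together, so the interval reflects the spread in the data without leaning on the normality assumption behind the formula CI.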
$59–$66 per bathroom is the range we’re betting on.
yes, the data support the renovation. but two caveats —
the prediction interval (PI) tracks the gray scatter where data are dense; at high bathroom counts it is extrapolated (red shading).
NBA rest: significant, but not important
does extra rest help an NBA player’s game score?
what tangles a raw comparison?
so we control for player quality, opponent strength, home/away.
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0072 | 0.071 | -0.101 | 0.919 | -0.146 | 0.131 |
| REST_DAYS | -0.1531 | 0.022 | -7.020 | 0.000 | -0.196 | -0.110 |
| PLAYER_SEASON_AVG | 1.0163 | 0.005 | 192.471 | 0.000 | 1.006 | 1.027 |
| HOME | 0.5022 | 0.057 | 8.764 | 0.000 | 0.390 | 0.614 |
| OPP_GS_ALLOWED | 1.0049 | 0.011 | 87.953 | 0.000 | 0.982 | 1.027 |
REST_DAYS: \hat\beta = -0.15, p < 0.001. statistically significant. done?
Cohen's d = -0.0195 SD
Cohen’s d
coefficient divided by response SD — how many SDs of the outcome does a one-unit predictor change move the prediction
| | NBA REST_DAYS | Airbnb bathrooms |
|---|---|---|
| coefficient | -0.15 | $62.44 |
| p-value | < 0.001 | < 0.001 |
| 95% CI excludes 0 | yes | yes |
| Cohen’s d | -0.02 | 0.30 |
| practically meaningful? | no | yes |
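
the Cohen’s d row is just the coefficient rescaled by the response SD, as defined in these notes. a small sketch of that arithmetic; the response SDs are not reported in the notes, so the code backs them out from the reported coefficient and d (an approximation, since d is rounded). both function names are ours.

```python
def standardized_effect(coef, response_sd):
    """Coefficient in SDs of the response per one-unit change in the predictor."""
    return coef / response_sd


def implied_response_sd(coef, d):
    """Response SD consistent with a reported coefficient and standardized effect."""
    return coef / d


# Reported values: REST_DAYS coef and d, bathrooms coef and d.
nba_sd = implied_response_sd(-0.1531, -0.0195)   # in game-score points
airbnb_sd = implied_response_sd(62.4408, 0.30)   # in dollars

print(f"implied SD of game score: {nba_sd:.2f}")
print(f"implied SD of price: {airbnb_sd:.1f}")
print(f"round trip, NBA d: {standardized_effect(-0.1531, nba_sd):.4f}")
```

read this way, one extra rest day moves the prediction by about 0.15 points against a spread of roughly 8 points, while one extra bathroom moves it by about $62 against a spread of roughly $208.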
“statistical significance is no substitute for practical importance.”
a sports analytics team comes to you and says:
“we ran a regression and found a negative coefficient for rest days — extra rest hurts performance, p < 0.001”
before they act, what questions should you ask?
player quality dominates. REST_DAYS barely clears zero.
opponent strength: we called it OPP_GS_ALLOWED, the mean game score players post against this opponent
coefficient: +1.00 — facing a porous defense → higher game score.
aggregate slope (dashed) steeper than within-quality slopes (colored).
adding PLAYER_SEASON_AVG is the algebraic version of switching from the dashed line to the colored ones — Simpson’s paradox in regression form.
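a toy version of the dashed-versus-colored comparison, assuming made-up numbers (not the NBA data): three quality tiers, a within-tier slope of -0.5 per rest day, and weaker tiers resting more. subtracting each tier's means before fitting gives the same slope as including tier as a categorical control, which is the analogue of adding PLAYER_SEASON_AVG.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def demean_by_group(rows):
    """Subtract each tier's mean rest and mean score from its members."""
    groups = {}
    for tier, rest, score in rows:
        groups.setdefault(tier, []).append((rest, score))
    xs, ys = [], []
    for members in groups.values():
        rest_bar = sum(r for r, _ in members) / len(members)
        score_bar = sum(s for _, s in members) / len(members)
        for rest, score in members:
            xs.append(rest - rest_bar)
            ys.append(score - score_bar)
    return xs, ys


# Hypothetical rows: (quality tier, rest days, game score).
rows = [
    (20, 0, 20.0), (20, 1, 19.5), (20, 2, 19.0),
    (15, 1, 14.5), (15, 2, 14.0), (15, 3, 13.5),
    (10, 2, 9.0), (10, 3, 8.5), (10, 4, 8.0),
]

aggregate = slope([r for _, r, _ in rows], [s for _, _, s in rows])
within = slope(*demean_by_group(rows))
print(f"aggregate slope: {aggregate:.2f}")   # ignores tier
print(f"within-tier slope: {within:.2f}")    # controls for tier
```

the aggregate slope is much steeper than the within-tier slope because it mixes the effect of rest with the difference between tiers.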
regression controls for measured covariates. it does not estimate causal effects.
coaches choose when to rest:
Q3: is the inference trustworthy?
LINE
assumptions for OLS inference:
each letter has a diagnostic signature. residual plot covers L and E; Q-Q plot covers N; I needs you to ask whether rows are clustered.
Airbnb: spread fans out as fitted price grows (“funnel”). mild heteroscedasticity.
Airbnb residuals are right-skewed: prices have a hard floor at $0 but a long right tail. fix: log-transform price.
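why the log helps, sketched on synthetic numbers: a variable with a floor at zero and a long right tail becomes roughly symmetric on the log scale. assumptions: the prices below are simulated lognormal draws, not the Airbnb listings, and the check is on raw prices rather than regression residuals; `skewness` is a hand-rolled helper.

```python
import math
import random


def skewness(values):
    """Sample skewness: mean cubed deviation divided by the SD cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(125)
# Hypothetical prices: strictly positive with a long right tail.
prices = [rng.lognormvariate(5.0, 0.7) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness of price: {raw_skew:.2f}")
print(f"skewness of log(price): {log_skew:.2f}")
```

with a log response, coefficients are read as approximate percentage changes in price rather than dollar changes.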
Warning
residual plots, Q-Q plots, and the LINE conditions catch statistical problems: non-normality, heteroscedasticity, the wrong functional form
they do not catch the systematic problems that bias a coefficient as an answer to a causal question:
CIs widen with sample noise or omitted variables.
they do not necessarily widen with confounding.
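a small simulation can show a narrow interval sitting confidently around the wrong answer. this is a sketch on synthetic data with an assumed setup: an unmeasured z drives both x and y, and x has no effect on y at all. `slope_with_ci` is a hypothetical helper for a one-predictor regression.

```python
import math
import random


def slope_with_ci(xs, ys, crit=1.96):
    """OLS slope of ys on xs with its large-sample 95% CI."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(ssr / (n - 2) / sxx)   # estimated SE of the slope
    return b1, b1 - crit * se, b1 + crit * se


rng = random.Random(125)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]    # unmeasured confounder
x = [zi + rng.gauss(0, 1) for zi in z]     # predictor, partly driven by z
y = [zi + rng.gauss(0, 1) for zi in z]     # outcome, driven by z only (true effect of x is 0)

b1, ci_lo, ci_hi = slope_with_ci(x, y)
print(f"slope = {b1:.3f}, 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

the interval is tight and excludes 0 even though the true effect of x is 0; nothing in the residual diagnostics would flag it.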
you fit a regression for monthly health-insurance claim costs:
cost ~ age + sex + smoking + zip code.
the Q-Q plot of residuals is on the right.
which LINE assumption is failing?
which fix would you try first?

inference for logistic regression
from Ch 7: logistic models a binary outcome via the log-odds.
| linear regression | logistic regression |
|---|---|
| t = \hat\beta_j / \widehat{\text{SE}} | z = \hat\beta_j / \widehat{\text{SE}} |
| exact t_{n-p-1} under LINE | asymptotic normal (Wald, MLE) |
| coef in original units | e^{\hat\beta_j} = odds ratio |
| [\hat\beta - 1.96 \cdot \text{SE},\ \hat\beta + 1.96 \cdot \text{SE}] | exponentiate to get OR CI |
read the table the same way; just remember z instead of t, and exponentiate at the end.
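the right-hand column in code form, as a minimal sketch. the coefficient and SE below are hypothetical placeholders chosen for illustration, not values from a model fitted in these notes, and `odds_ratio_ci` is our own helper name.

```python
import math


def odds_ratio_ci(beta, se, crit=1.96):
    """Wald z, odds ratio, and OR confidence interval from a log-odds coefficient."""
    z = beta / se
    lo = beta - crit * se        # CI on the log-odds scale
    hi = beta + crit * se
    return z, math.exp(beta), math.exp(lo), math.exp(hi)


# Hypothetical log-odds coefficient and standard error.
beta, se = 0.50, 0.10
z, or_hat, or_lo, or_hi = odds_ratio_ci(beta, se)
print(f"z = {z:.2f}, OR = {or_hat:.2f}, 95% CI = [{or_lo:.2f}, {or_hi:.2f}]")
```

the interval is built on the log-odds scale and then exponentiated, so it is not symmetric around the odds ratio; "excludes 0" for the coefficient becomes "excludes 1" for the OR.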
age, smoking, sysBP, glucose: OR > 1, CIs exclude 1
BMI: CI crosses 1 (not distinguishable from “no effect”)
age \widehat{\text{OR}} \approx 1.78 per 1 SD; 1 SD of age \approx 9 years
linear in log-odds \Rightarrow multiplicative on odds
\frac{\text{odds at } x + \Delta}{\text{odds at } x} \;=\; \widehat{\text{OR}}^{\,\Delta / \sigma_x}
age 50 \to 60 is 10 years \approx \tfrac{10}{9} SD:
\frac{\text{odds}(60)}{\text{odds}(50)} \;\approx\; 1.78^{10/9} \;\approx\; 1.90
odds of CHD at 60 are about 2× the odds at 50
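the rescaling is one line of arithmetic. a sketch using the numbers in these notes (OR of 1.78 per SD, 1 SD of age about 9 years); `rescale_odds_ratio` is our own helper name.

```python
def rescale_odds_ratio(or_per_sd, delta, sd):
    """Odds ratio for a change of `delta` raw units, given the OR per 1 SD."""
    return or_per_sd ** (delta / sd)


or_10_years = rescale_odds_ratio(1.78, delta=10, sd=9)
print(f"odds(60) / odds(50) = {or_10_years:.2f}")
```

the same ratio applies to any 10-year gap (40 to 50, 55 to 65), because the model is linear in the log-odds.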
how do we ask “is this classifier real?” with the same toolkit?
same template throughout. only the test statistic and the response variable change.
what worked? what didn’t? what’s still confusing?