Applied Statistics: From Data to Decisions
Wednesday, April 1, 2026
four datasets. identical mean, SD, correlation, regression line.
so they must look the same, right?

Matejka & Fitzmaurice, “Same Stats, Different Graphs” (Autodesk Research, 2017)
always look at your data
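a quick check of the "identical summaries" claim. this sketch uses Anscombe's quartet (1973) as a stand-in, on the assumption that it behaves like the four datasets on the slide; the slide's own data is not reproduced here.

```python
import numpy as np

# Anscombe's quartet (1973): a stand-in for the slide's four datasets (assumption).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares regression line
    summary[name] = {
        "mean_y": float(y.mean()),
        "sd_y": float(y.std(ddof=1)),
        "r": float(np.corrcoef(x, y)[0, 1]),
        "slope": float(slope),
        "intercept": float(intercept),
    }
    print(name, {k: round(v, 2) for k, v in summary[name].items()})
```

the printed rows agree to about two decimals, yet a scatter plot of each dataset looks nothing like the others.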
today we focus on summary — what’s in the data, and what traps are hiding in it
part 1: first look at the data
29,142 listings from Inside Airbnb — every active rental in New York City
you’re a traveler trying to find a good deal. where do you start?
dtypes: mix of int64, float64, object
.describe() count row: some columns have fewer entries, which signals missing data
six semantic types: continuous, discrete, nominal, ordinal, text, identifier
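a minimal sketch of that first look with pandas. the four rows and the column names are made up for illustration and are not the Inside Airbnb schema.

```python
import numpy as np
import pandas as pd

# Toy rows; column names are hypothetical, not the Inside Airbnb schema.
listings = pd.DataFrame({
    "listing_id": [101, 102, 103, 104],        # identifier
    "price": [85.0, 120.0, np.nan, 60.0],      # continuous, one value missing
    "accommodates": [2, 4, 1, 2],              # discrete
    "room_type": ["Private room", "Entire home", "Shared room", "Private room"],  # nominal
})

print(listings.dtypes)

# describe() summarizes numeric columns; its "count" row excludes missing values
counts = listings.describe().loc["count"]
n_missing = len(listings) - counts
print(n_missing)
```

note that listing_id is stored as an integer but is an identifier, so its mean is meaningless: the dtype is not the semantic type.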
part 2: distributions
what does “typical” mean?
dramatic right skew — long tail to $999/night
which single number represents a “typical” listing?
mean
\(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)
median
the middle value when sorted
the mean is $133. the median is $100.
if you price your listing at the mean, roughly what fraction of listings are cheaper than yours?
well over half — the right tail pulls the mean above most of the data
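a small simulation of the same effect. the prices are synthetic draws from a lognormal distribution, with parameters picked (as an assumption) to give a median near $100 and a mean near $133; they are not the real listings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices"; the lognormal parameters are illustrative assumptions.
prices = rng.lognormal(mean=np.log(100), sigma=0.75, size=20_000)

mean_price = prices.mean()
median_price = np.median(prices)
frac_below_mean = (prices < mean_price).mean()  # share of listings cheaper than the mean

print(f"mean {mean_price:.0f}, median {median_price:.0f}, "
      f"share below mean {frac_below_mean:.2f}")
```

for these parameters the share below the mean comes out near two thirds, while the share below the median is one half by definition.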
quartiles
divide sorted data into four equal parts
interquartile range (IQR)
Q3 − Q1 — the width of the middle 50%. a robust measure of spread.
for Airbnb prices: Q1 = $67, Q3 = $165, IQR = $98
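a sketch of the quartile arithmetic, and of why the IQR counts as robust. the nine prices are toy values chosen so the quartiles match the slide.

```python
import numpy as np

# Toy prices chosen so the quartiles match the slide; not real listings.
prices = np.array([40, 55, 67, 80, 100, 120, 165, 250, 999], dtype=float)

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Stretch the most expensive listing: the IQR stays put, the standard deviation does not.
stretched = prices.copy()
stretched[-1] = 9_999
iqr_stretched = np.percentile(stretched, 75) - np.percentile(stretched, 25)

print(q1, q3, iqr, iqr_stretched)
print(prices.std(ddof=1), stretched.std(ddof=1))
```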
raw scale
histogram bunched at left, invisible tail
mean pulled far from center
log scale
roughly symmetric
structure in the tail becomes visible
use a log axis to look at skewed data; log transform to model it
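one way to put a number on "roughly symmetric" is skewness before and after taking logs. this sketch uses synthetic lognormal prices (an assumption, not the listings data) and a hand-rolled skewness helper.

```python
import numpy as np

def skewness(values):
    """Third standardized moment: near 0 for symmetric data, positive for a right tail."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(1)

# Synthetic right-skewed prices; parameters are illustrative assumptions.
prices = rng.lognormal(mean=np.log(100), sigma=0.75, size=20_000)

raw_skew = skewness(prices)
log_skew = skewness(np.log10(prices))
print(f"raw skew {raw_skew:.2f}, log10 skew {log_skew:.2f}")
```

synthetic lognormal data is symmetric after the log by construction; real prices will only be approximately so.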
part 3: relationships between variables
I’m about to plot price (log scale, y-axis) vs. number of guests (x-axis)
1 min sketch or discuss. then I’ll reveal the plot.
29,000 points stacked on top of each other — you can’t see the pattern
2D histogram: bin both axes, color by count
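a sketch of the binning step with NumPy. the guest counts and log prices are synthetic, and the relationship between them is invented purely so there is something to bin.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 29_000

# Synthetic stand-ins: guests 1 to 8, log10 price rising slightly with guests (assumption).
guests = rng.integers(1, 9, size=n)
log_price = np.log10(50) + 0.08 * guests + rng.normal(0, 0.25, size=n)

guest_edges = np.arange(0.5, 9.5, 1.0)                             # one bin per guest count
price_edges = np.linspace(log_price.min(), log_price.max(), 31)    # 30 price bins

# counts[i, j] = number of listings in guest bin i and price bin j
counts, _, _ = np.histogram2d(guests, log_price, bins=[guest_edges, price_edges])
print(counts.shape, int(counts.sum()))
```

coloring each cell by its count gives the plot; matplotlib's `hist2d` does the binning and coloring in one call.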
part 4: categorical variables
room types
boroughs
bar charts make these comparisons easy — pie charts don’t
you’ve seen price distributions for the whole city
30 sec think. 1 min pair. then I’ll show the box plot.
three ways to show room type by borough:
Manhattan has a higher proportion of entire homes than Brooklyn
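counts and within-borough proportions answer different questions, and `pd.crosstab` gives both. the 20 rows below are hypothetical toy data, not the real listing counts.

```python
import pandas as pd

# Hypothetical toy rows: 10 listings per borough.
rows = (
    [("Manhattan", "Entire home")] * 6
    + [("Manhattan", "Private room")] * 3
    + [("Manhattan", "Shared room")] * 1
    + [("Brooklyn", "Entire home")] * 4
    + [("Brooklyn", "Private room")] * 5
    + [("Brooklyn", "Shared room")] * 1
)
df = pd.DataFrame(rows, columns=["borough", "room_type"])

counts = pd.crosstab(df["borough"], df["room_type"])                      # raw counts
shares = pd.crosstab(df["borough"], df["room_type"], normalize="index")   # each row sums to 1

print(counts)
print(shares)
```

raw counts mix borough size with room-type mix; row proportions isolate the mix, which is what a claim about "higher proportion" needs.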
what’s missing?
11,827 out of 29,142 listings have no deposit listed
is that random, or does it mean something?
MCAR (missing completely at random)
missingness has no pattern — unrelated to any variable
MAR (missing at random)
missingness depends on observed variables
MNAR (missing not at random)
missingness depends on the unobserved value itself — the most dangerous pattern
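a simulation sketch of the three mechanisms. everything here is synthetic: the price and deposit distributions, and the missingness probabilities (0.4, or 0.6 vs 0.2), are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Synthetic data (assumption): price is always observed, deposit may go missing.
price = rng.lognormal(np.log(100), 0.75, size=n)
deposit = 0.5 * price * rng.lognormal(0, 0.2, size=n)
high_price = price > np.median(price)

# Probability that each deposit is missing under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.4),                                    # same for everyone
    "MAR": np.where(high_price, 0.6, 0.2),                      # depends on observed price
    "MNAR": np.where(deposit > np.median(deposit), 0.6, 0.2),   # depends on the deposit itself
}

true_mean = deposit.mean()
seen, observed = {}, {}
for name, p in p_missing.items():
    seen[name] = rng.random(n) >= p
    observed[name] = deposit[seen[name]].mean()  # naive mean of what is left

# MAR can be repaired with the observed variable: average within price groups, then reweight.
mar_adjusted = sum(
    deposit[seen["MAR"] & group].mean() * group.mean()
    for group in (high_price, ~high_price)
)

print(round(true_mean, 1), {k: round(v, 1) for k, v in observed.items()},
      round(mar_adjusted, 1))
```

the naive mean holds up under MCAR and is biased under both MAR and MNAR. under MAR the bias can be removed using price; under MNAR no observed column can fix it, which is why it is the dangerous case.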
a host lists a $500/night apartment but doesn’t fill in the security deposit field
1 min think. 2 min pair. 1 min share.


CMS suppresses hospitals with “too few” readmissions to report — the value itself determines whether you see it
association is not causation
median price: Manhattan $135, Brooklyn $90
Manhattan is $45/night more expensive
but is that comparing like with like?
| | Manhattan | Brooklyn | gap |
|---|---|---|---|
| Entire home | $180 | $140 | $40 |
| Private room | $85 | $62 | $23 |
| Shared room | $60 | $35 | $25 |
| Overall | $135 | $90 | $45 |
the overall gap ($45) exceeds every room-type gap ($23–$40)
confounder
a variable that affects both the treatment and the outcome
room type is tied to both borough (Manhattan has proportionally more entire homes) and price
Manhattan isn’t as much more expensive as it looks — room type is doing some of the work
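a sketch of how a difference in room-type mix can inflate the overall gap. the per-room-type prices come from the table; the room-type shares are hypothetical, and the table's medians are treated as if they were averages so they can be blended, which is only an approximation.

```python
# Room-type prices from the table (USD per night).
price = {
    "Manhattan": {"Entire home": 180, "Private room": 85, "Shared room": 60},
    "Brooklyn": {"Entire home": 140, "Private room": 62, "Shared room": 35},
}

# HYPOTHETICAL room-type shares; the slides do not give these.
share = {
    "Manhattan": {"Entire home": 0.60, "Private room": 0.35, "Shared room": 0.05},
    "Brooklyn": {"Entire home": 0.40, "Private room": 0.55, "Shared room": 0.05},
}

def blended(prices, shares):
    """Share-weighted price across room types."""
    return sum(prices[room] * shares[room] for room in prices)

# Raw comparison: each borough with its own room-type mix.
raw_gap = (blended(price["Manhattan"], share["Manhattan"])
           - blended(price["Brooklyn"], share["Brooklyn"]))

# Like-with-like comparison: both boroughs given Brooklyn's mix.
adjusted_gap = (blended(price["Manhattan"], share["Brooklyn"])
                - blended(price["Brooklyn"], share["Brooklyn"]))

print(round(raw_gap, 2), round(adjusted_gap, 2))
```

with these assumed shares the raw gap exceeds every room-type gap, while the adjusted gap falls back inside the $23 to $40 range.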
how much of the grade boost is from office hours, and how much from the kind of student who shows up?
motivation drives both OH attendance and studying — the raw correlation overstates the causal effect
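a simulation sketch of the office-hours story. every number is an assumption: a true effect of 2 grade points, motivation as a standard normal score, and a logistic attendance model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Synthetic students (all parameters are illustrative assumptions).
motivation = rng.normal(size=n)
attends = rng.random(n) < 1 / (1 + np.exp(-1.5 * motivation))  # motivated students attend more
true_effect = 2.0                                              # assumed causal boost from OH
grade = 75 + 5 * motivation + true_effect * attends + rng.normal(0, 5, size=n)

# Raw comparison: attendees vs non-attendees.
naive_gap = grade[attends].mean() - grade[~attends].mean()

# Adjusted comparison: least squares on attendance AND motivation.
X = np.column_stack([np.ones(n), attends, motivation])
coef, *_ = np.linalg.lstsq(X, grade, rcond=None)
adjusted_effect = coef[1]

print(round(naive_gap, 2), round(adjusted_effect, 2))
```

the raw gap is several times the assumed effect; adjusting for motivation recovers it. in a real class motivation is not measured, so this adjustment is not available and the raw gap is all you see.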
what’s wrong with each of these?
Reuters / C. Chan, 2014. Data: Florida Dept. of Law Enforcement
Georgia Dept. of Public Health, May 2020
we can describe and diagnose the data
how do we clean it? what do we do with missing values, wrong types, messy text?
