Applied Statistics: From Data to Decisions
Monday, April 6, 2026
df.describe() says the Airbnb price column has 3,818 values
df.shape says the DataFrame has 4,229 rows
Q: where did the other 411 rows go?
A: .describe() silently skips NaN — 411 listings have missing prices
.describe() counts non-missing values; .shape counts all rows
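A minimal sketch of the discrepancy, using a small synthetic stand-in for the Airbnb data (the real file and column values are not reproduced here):

```python
import numpy as np
import pandas as pd

# synthetic stand-in: a price column with some missing values
df = pd.DataFrame({"price": [120.0, 85.0, np.nan, 200.0, np.nan]})

print(df.shape[0])               # 5 rows in the DataFrame
print(df["price"].count())       # 3 — .count() (and .describe()) skip NaN
print(df["price"].isna().sum())  # 2 — the missing values explain the gap
```

The gap between `.shape` and `.describe()` is exactly `.isna().sum()`.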
Reinhart & Rogoff (2010) — Excel error discovered by Thomas Herndon, UMass Amherst
=AVERAGE(L30:L44) — but the data goes to row 49
Reinhart-Rogoff (2010)
cell range one row too short
→ austerity budgets for millions
Public Health England (2020)
Excel row limit: 65,536
→ 15,841 COVID cases lost, contacts never traced
gene naming (2016–2020)
SEPT1 → “1-Sep”
→ 20% of genomics papers corrupted, 27 genes renamed
all three: a tool made a decision without warning
every cleaning decision changes the answer
.isna() and .describe() handle the easy case
today: real datasets are rarely so cooperative
. . .
contrast: a random CSV on Kaggle with no documentation

data provenance
the chain of custody from data collection to your notebook
in the AI era: synthetic datasets, undocumented web scrapes, training data contamination
| type | examples | pandas dtype |
|---|---|---|
| continuous | 4.2, \(\pi\), price | float64 |
| discrete | 0, 4, 994, enrollment | int64 |
| nominal | apple, banana, “Public” | object / str |
| ordinal | rarely, sometimes, often | category |
| text | doctor’s note, review | object / str |
| identifier | UNITID, ZIP code, OPEID | looks numeric, isn’t |
Q: which type is MD_EARN_WNE_P10 (median earnings)? what dtype does pandas assign it?
MD_EARN_WNE_P10 — median earnings 10 years after enrollment
dtype: object (str)
Q: why would an earnings column be stored as a string?
Non-numeric values in 'MD_EARN_WNE_P10':
PrivacySuppressed 480
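One way to surface non-numeric values like this — shown here on a tiny hypothetical slice of the earnings column, not the full Scorecard file:

```python
import pandas as pd

# hypothetical slice of MD_EARN_WNE_P10, read as strings
earnings = pd.Series(
    ["45600", "PrivacySuppressed", "52300", "PrivacySuppressed"],
    name="MD_EARN_WNE_P10",
)

# values that fail numeric conversion are the suppression markers
numeric = pd.to_numeric(earnings, errors="coerce")
print(earnings[numeric.isna()].value_counts())
# PrivacySuppressed    2
```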
| encoding | where you’ll find it | pandas detects it? |
|---|---|---|
| NaN, None | standard pandas | yes |
| "" (empty string) | CSV exports | no |
| "N/A", "NA" | manual data entry | sometimes |
| "PrivacySuppressed" | government data | no |
| -999, 99, 0 | sensors, surveys | no |
| absent rows | panel/time-series | no rows to detect |
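`read_csv` can be told which encodings mean "missing" via `na_values`. A sketch with an invented four-row CSV (treating `-999` as a sentinel is an assumption you should confirm against the codebook):

```python
import io
import pandas as pd

csv = io.StringIO(
    "school,earnings\n"
    "A,45600\n"
    "B,PrivacySuppressed\n"
    "C,\n"            # empty string — detected by default
    "D,-999\n"        # sentinel — only treat as NA if the codebook says so
)

df = pd.read_csv(csv, na_values=["PrivacySuppressed", "-999"])
print(df["earnings"].isna().sum())  # 3: suppressed, empty, sentinel
```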
chapter 2 covered row 1. everything else requires active detection.
classify each scenario as MCAR, MAR, or MNAR:
1 min individual. 1 min compare with a neighbor. 2 min class.
MCAR / MAR / MNAR
College Scorecard earnings: MNAR
small cohort → suppressed → small schools are systematically different
the Scorecard comes in two files:
scorecard.csv — one row per institution (SAT, enrollment, earnings)
field_of_study.csv — one row per school-program pair (earnings by major)
richer questions require combining tables
| student | major_id |
|---|---|
| Alice | 101 |
| Bob | 102 |
| Carol | 103 |
| major_id | major_name |
|---|---|
| 101 | Physics |
| 102 | English |
| 104 | History |
Q: which students survive an inner join on major_id?
| student | major_id | major_name |
|---|---|---|
| Alice | 101 | Physics |
| Bob | 102 | English |

diagram: Wickham & Grolemund, R for Data Science
| student | major_id | major_name |
|---|---|---|
| Alice | 101 | Physics |
| Bob | 102 | English |
| Carol | 103 | NaN |
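The two join results above, reproduced with `merge` (table contents taken from the slides):

```python
import pandas as pd

students = pd.DataFrame({"student": ["Alice", "Bob", "Carol"],
                         "major_id": [101, 102, 103]})
majors = pd.DataFrame({"major_id": [101, 102, 104],
                       "major_name": ["Physics", "English", "History"]})

inner = students.merge(majors, on="major_id", how="inner")
left = students.merge(majors, on="major_id", how="left")

print(len(inner))  # 2 — Carol's 103 has no match, so she is dropped
print(len(left))   # 3 — Carol survives, with major_name = NaN
```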
| left_id | key |
|---|---|
| L1 | A |
| L2 | A |
| key | right_val |
|---|---|
| A | X |
| A | Y |
| A | Z |
. . .
| left_id | key | right_val |
|---|---|---|
| L1 | A | X |
| L1 | A | Y |
| L1 | A | Z |
| L2 | A | X |
| L2 | A | Y |
| L2 | A | Z |
every key matches → inner join = left join here. the difference only shows when keys are missing.
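The many-to-many blowup above in code: two left rows times three right rows on the same key yields six rows.

```python
import pandas as pd

left = pd.DataFrame({"left_id": ["L1", "L2"], "key": ["A", "A"]})
right = pd.DataFrame({"key": ["A", "A", "A"],
                      "right_val": ["X", "Y", "Z"]})

# every left row matches every right row with the same key: 2 x 3 = 6
merged = left.merge(right, on="key")
print(len(merged))  # 6
```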
an assert that passes is documentation
an assert that fails is an alarm
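One way to put that idea into practice, sketched with invented tables: `merge(validate=...)` and a pair of asserts state your expectations about the join, and fail loudly if the data violates them.

```python
import pandas as pd

students = pd.DataFrame({"student": ["Alice", "Bob"],
                         "major_id": [101, 102]})
majors = pd.DataFrame({"major_id": [101, 102],
                       "major_name": ["Physics", "English"]})

n_before = len(students)
merged = students.merge(majors, on="major_id", how="left",
                        validate="many_to_one")  # raises if majors has duplicate keys

# passing asserts are documentation; a failure is an alarm
assert len(merged) == n_before, "left join should not add rows"
assert merged["major_name"].notna().all(), "every student should have a major"
```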
1 min think. 2 min share. 2 min class.
| join | gained rows? | lost rows? | what happened? |
|---|---|---|---|
| inner | yes (one-to-many) | yes (unmatched keys) | both at once |
| left | yes (one-to-many) | no | NaN for unmatched |
the join type is a decision, not a default
US voter registration: 100M+ records across 50 states
match on name + date of birth → 800,000 apparent duplicates
Q: are these really the same person?
Goel, Meredith, Morse, Rothschild & Shirani-Mehr, “One Person, One Vote,” APSR 2020
141 registered voters named “John Smith” born in 1970
exactly what chance predicts — no fraud, just combinatorics
the same tradeoff appears everywhere:
duplicates in the Scorecard (shared OPEID6, different UNITID)
same question: “what is the average median earnings?”
| approach | method | result |
|---|---|---|
| drop rows | dropna(subset=['earnings']) | same mean, fewer rows |
| fill with mean | fillna(mean_val) | same mean, shrunken std |
| ignore NaN | pandas default (.mean() skips NaN) | same mean, full DataFrame |
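The three approaches, demonstrated on a five-row synthetic earnings column (not the real Scorecard data): all produce the same mean, but mean-filling shrinks the standard deviation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"school": list("ABCDE"),
                   "earnings": [40000.0, 50000.0, np.nan, 60000.0, np.nan]})

dropped = df.dropna(subset=["earnings"])
filled = df["earnings"].fillna(df["earnings"].mean())

print(dropped["earnings"].mean())  # 50000.0, but only 3 rows remain
print(filled.mean())               # 50000.0 — filling with the mean can't move it
print(df["earnings"].mean())       # 50000.0 — .mean() skips NaN by default

print(df["earnings"].std())        # 10000.0
print(filled.std())                # ~7071 — imputation fakes certainty
```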
Q: if they all give the same mean, why does it matter?
dropping rows with missing earnings changes which schools remain
an AI assistant “cleans” the College Scorecard by calling .dropna()
2 min think. 3 min pair. 2 min class.
the Scorecard website says Stanford median earnings = $136,000 (4yr after graduation)
but program-level data tells a different story:
| program | credential | earnings (4yr) |
|---|---|---|
| Business Admin | Master’s | $262K |
| Computer Science | Master’s | $256K |
| Human Biology | Bachelor’s | $82K |
| English Literature | Bachelor’s | $82K |
| Ethnic Studies | Bachelor’s | $46K |
one number. five very different realities.
imagine an income column with these non-numeric values:
| value | meaning |
|---|---|
| "PrivacySuppressed" | cohort too small |
| ">100,000" | top-coded (Census) |
| "<LOD" | below limit of detection |
| "Refused" | survey respondent declined |
| "N/A" | question didn’t apply |
pd.to_numeric(errors='coerce') turns all five into NaN
five different reasons for missingness → one undifferentiated blank
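The collapse in one line, using the five values from the table above:

```python
import pandas as pd

income = pd.Series(["52000", "PrivacySuppressed", ">100,000",
                    "<LOD", "Refused", "N/A"])

# coercion keeps the one parseable value and flattens the rest to NaN
coerced = pd.to_numeric(income, errors="coerce")
print(int(coerced.isna().sum()))  # 5 — five distinct reasons, one blank
```

If the distinction matters, record *why* each value is missing (e.g. in a companion indicator column) before coercing.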
-999 and 999999 aren’t missing — they’re wrong
Mean WITH sentinels: -208.6°F
Mean WITHOUT sentinels: 72.9°F
unlike NaN, sentinel values participate in every computation without warning
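A sketch with invented temperature readings (the -208.6°F / 72.9°F figures above come from the slide's own dataset, not this example):

```python
import pandas as pd

# hypothetical sensor readings where -999 marks a failed reading
temps = pd.Series([71.0, 74.0, -999.0, 73.0, -999.0])

print(temps.mean())  # -356.0 — sentinels silently join the computation

clean = temps.mask(temps == -999)  # replace sentinels with NaN
print(clean.mean())  # ~72.67 — NaN-aware mean over the real readings
```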
OPEID 00102100 → stored as integer → becomes 102100
ZIP codes, phone numbers, IDs — codes, not numbers
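The leading-zero failure and its fix, shown on a one-row invented CSV:

```python
import io
import pandas as pd

csv = io.StringIO("OPEID,name\n00102100,Example U\n")

as_int = pd.read_csv(csv)           # OPEID inferred as int: zeros lost
print(as_int["OPEID"].iloc[0])      # 102100

csv.seek(0)
as_str = pd.read_csv(csv, dtype={"OPEID": str})  # keep codes as strings
print(as_str["OPEID"].iloc[0])      # 00102100
```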
this code merges two tables and computes mean earnings:
how many problems can you find?
the code on the previous slide has at least four bugs
small groups, 3 min. then 2 min share-out.
spreadsheets hide logic
code makes decisions visible
good code is deterministic. same script, same data → same answer every time.
| | AI analyzes data directly | AI helps you write code |
|---|---|---|
| cleaning decisions | hidden | visible as lines you can read |
| reproducibility | different answers each session | deterministic |
| what you learn | nothing about the data | everything about the data |
use AI to write code, not to replace understanding
str columns that should be numeric?
“the combination of some data and an aching desire for an answer does not ensure that a reasonable one can be extracted from a given body of data.”
— John Tukey

NaN is just the easy case
we can clean and join data. now we need a language for modeling it.
read: Chapter 4 in the course notes
