Lecture 3: Data munging

Applied Statistics: From Data to Decisions

Professor Madeleine Udell

Monday, April 6, 2026

concept check from last time

df.describe() says the Airbnb price column has 3,818 values

df.shape says the DataFrame has 4,229 rows

Q: where did the other 411 rows go?

  • .describe() silently skips NaN — 411 listings have missing prices
  • always compare .describe() counts against .shape
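a minimal sketch of that check, with a toy stand-in for the Airbnb data:

```python
import pandas as pd
import numpy as np

# toy stand-in for the Airbnb data: 4 rows, one missing price
df = pd.DataFrame({"price": [120.0, np.nan, 85.0, 60.0]})

n_rows = df.shape[0]              # counts every row, missing or not
n_priced = df["price"].count()    # .count(), like .describe(), skips NaN
n_missing = n_rows - n_priced     # the rows that "went missing"
```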

a cell range dragged one row too short. austerity budgets for millions of people.

Reinhart & Rogoff (2010) — Excel error discovered by Thomas Herndon, UMass Amherst

the spreadsheet

=AVERAGE(L30:L44) — but the data goes to row 49

three data disasters

Reinhart-Rogoff (2010)

cell range one row too short

→ austerity budgets for millions

Public Health England (2020)

Excel row limit: 65,536

→ 15,841 COVID cases lost, contacts never traced

gene naming (2016–2020)

SEPT1 → “1-Sep”

→ 20% of genomics papers corrupted, 27 genes renamed

all three: a tool made a decision without warning

every cleaning decision changes the answer

today’s concepts

  • missing data: the many forms
  • joining datasets: inner vs. left, one-to-many
  • string standardization + deduplication
  • three approaches to handling missing data
  • AI gotchas in data cleaning

recap: what we found last time

  • Airbnb data: relatively tidy, NaN for missing values
  • pandas handled it transparently: .isna(), .describe()
  • but we noticed some dtypes looked wrong…

today: real datasets are rarely so cooperative

  1. the many faces of missing data

today’s dataset: College Scorecard

  • US Department of Education
  • 7,703 institutions, 122 columns
  • SAT scores, enrollment, earnings, debt
  • data dictionary, methodology, public updates



contrast: a random CSV on Kaggle with no documentation

data provenance


the chain of custody from data collection to your notebook

  • documented methodology? ✓
  • data dictionary defining every column? ✓
  • known collection procedures? ✓
  • public and versioned? ✓

in the AI era: synthetic datasets, undocumented web scrapes, training data contamination

data types

type         examples                   pandas dtype
continuous   4.2, \(\pi\), price        float64
discrete     0, 4, 994, enrollment      int64
nominal      apple, banana, “Public”    object / str
ordinal      rarely, sometimes, often   category
text         doctor’s note, review      object / str
identifier   UNITID, ZIP code, OPEID    looks numeric, isn’t

Q: which type is MD_EARN_WNE_P10 (median earnings)? what dtype does pandas assign it?

a column that should be numeric

MD_EARN_WNE_P10 — median earnings 10 years after enrollment

dtype: object (pandas stores strings as object)

Q: why would an earnings column be stored as a string?

the culprit: PrivacySuppressed

Non-numeric values in 'MD_EARN_WNE_P10':
PrivacySuppressed    480
  • a string sitting in a numeric column
  • cohort too small → Department of Education suppresses data to protect privacy
  • pandas treats the whole column as str
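a sketch of how to surface those culprits (toy values standing in for the real column):

```python
import pandas as pd

# toy values standing in for MD_EARN_WNE_P10
col = pd.Series(["41250", "PrivacySuppressed", "38700", "PrivacySuppressed"],
                name="MD_EARN_WNE_P10")

as_num = pd.to_numeric(col, errors="coerce")  # failed conversions become NaN
culprits = col[as_num.isna()].value_counts()  # which strings failed, and how often
print(culprits)
```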

the missingness zoo

encoding             where you’ll find it   pandas detects it?
NaN, None            standard pandas        yes
"" (empty string)    CSV exports            no
"N/A", "NA"          manual data entry      sometimes
"PrivacySuppressed"  government data        no
-999, 99, 0          sensors, surveys       no
absent rows          panel/time-series      no rows to detect

chapter 2 covered row 1. everything else requires active detection.

Demo: College Scorecard exploration

open in Colab

scorecard = pd.read_csv('college-scorecard/scorecard.csv')
scorecard.dtypes.value_counts()

classify each scenario as MCAR, MAR, or MNAR:

  1. a lab assistant drops a tray; 12 random blood samples are destroyed
  2. older patients are less likely to complete a follow-up survey (age is recorded)
  3. College Scorecard suppresses earnings when the cohort is too small

1 min individual. 1 min compare with a neighbor. 2 min class.

missingness is not random

three patterns of missingness

MCAR / MAR / MNAR

  • MCAR: no pattern — mold destroys random records
  • MAR: depends on observed data — older patients skip surveys
  • MNAR: depends on the missing value itself — sickest patients don’t report

College Scorecard earnings: MNAR

small cohort → suppressed → small schools are systematically different
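a quick diagnostic for that pattern, sketched with toy numbers (column names are invented, not Scorecard fields): compare the missingness rate across groups.

```python
import pandas as pd
import numpy as np

# toy schools: earnings suppressed exactly when the cohort is small (MNAR)
df = pd.DataFrame({
    "enrollment": [120, 150, 5000, 8000, 90, 12000],
    "earnings":   [np.nan, np.nan, 52000.0, 61000.0, np.nan, 58000.0],
})

df["small"] = df["enrollment"] < 500
# missingness rate per group; a large gap is evidence against MCAR
rate = df.groupby("small")["earnings"].agg(lambda s: s.isna().mean())
```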

  2. joining datasets

why join?

the Scorecard comes in two files:

  • scorecard.csv — one row per institution (SAT, enrollment, earnings)
  • field_of_study.csv — one row per school-program pair (earnings by major)

richer questions require combining tables

  • “do engineering programs pay more than humanities at the same school?”
  • “should I take on more debt for a prestigious school?”

inner join vs. left join: a toy example

Students
  student  major_id
  Alice    101
  Bob      102
  Carol    103

Majors
  major_id  major_name
  101       Physics
  102       English
  104       History

Q: which students survive an inner join on major_id?

inner join: only matching rows

Inner join result
  student  major_id  major_name
  Alice    101       Physics
  Bob      102       English

  • Carol is gone (no match for 103)
  • History is gone (no match for 104)
  • only IDs in both tables survive

diagram: Wickham & Grolemund, R for Data Science

left join: keep all left rows

Left join result
  student  major_id  major_name
  Alice    101       Physics
  Bob      102       English
  Carol    103       NaN

  • all three students survive
  • Carol’s major becomes NaN (no match)
  • History still dropped (not in left table)
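both joins on the toy tables, in pandas:

```python
import pandas as pd

students = pd.DataFrame({"student": ["Alice", "Bob", "Carol"],
                         "major_id": [101, 102, 103]})
majors = pd.DataFrame({"major_id": [101, 102, 104],
                       "major_name": ["Physics", "English", "History"]})

# inner: only keys present in both tables survive
inner = pd.merge(students, majors, on="major_id", how="inner")
# left: every student survives; Carol's major_name becomes NaN
left = pd.merge(students, majors, on="major_id", how="left")
```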

one-to-many: row multiplication

Left (2 rows)
  left_id  key
  L1       A
  L2       A

Right (3 rows)
  key  right_val
  A    X
  A    Y
  A    Z


Result: 2 × 3 = 6 rows
  left_id  key  right_val
  L1       A    X
  L1       A    Y
  L1       A    Z
  L2       A    X
  L2       A    Y
  L2       A    Z

every key matches → inner join = left join here. the difference only shows when keys are missing.
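the multiplication above, reproduced in pandas; the commented `validate=` line shows how to make an unexpected one-to-many fail loudly:

```python
import pandas as pd

left = pd.DataFrame({"left_id": ["L1", "L2"], "key": ["A", "A"]})
right = pd.DataFrame({"key": ["A", "A", "A"], "right_val": ["X", "Y", "Z"]})

result = pd.merge(left, right, on="key", how="inner")  # 2 x 3 = 6 rows

# to forbid surprise multiplication, declare the expected cardinality:
# pd.merge(left, right, on="key", validate="one_to_one")  # raises MergeError here
```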

defense: check your key before joining

assert scorecard['UNITID'].is_unique, \
    "UNITID is not unique — join will multiply rows!"

an assert that passes is documentation

an assert that fails is an alarm

  • scorecard: 7,703 institutions
  • programs: 228,000 rows across 5,500 institutions
  • predict: how many rows will the inner join produce?
  • more or fewer than 7,703?

1 min think. 2 min share. 2 min class.

Demo: join explosion

open in Colab

inner = pd.merge(scorecard, programs, on='UNITID', how='inner')
print(f"Before: {scorecard.shape[0]:,} → After: {inner.shape[0]:,}")

after every join: check row counts

print(f"Scorecard rows:  {scorecard.shape[0]:,}")
print(f"Inner join rows: {inner.shape[0]:,}")
print(f"Left join rows:  {left.shape[0]:,}")

join   gained rows?        lost rows?            what happened?
inner  yes (one-to-many)   yes (unmatched keys)  both at once
left   yes (one-to-many)   no                    NaN for unmatched

the join type is a decision, not a default
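one way to see exactly what a join kept and dropped is pandas’ `indicator=` flag (toy tables for illustration):

```python
import pandas as pd

a = pd.DataFrame({"UNITID": [1, 2, 3], "name": ["A", "B", "C"]})
b = pd.DataFrame({"UNITID": [2, 3, 4], "val": [10, 20, 30]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = pd.merge(a, b, on="UNITID", how="outer", indicator=True)
counts = merged["_merge"].value_counts()
```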

when is a duplicate not a duplicate?

US voter registration: 100M+ records across 50 states

match on name + date of birth → 800,000 apparent duplicates

Q: are these really the same person?

Goel, Meredith, Morse, Rothschild & Shirani-Mehr, “One Person, One Vote,” APSR 2020

the birthday paradox strikes

141 registered voters named “John Smith” born in 1970

  • expected matching-birthday pairs: \(\binom{141}{2} \times \frac{1}{365} \approx 27\)
  • actual matching pairs found: 27

exactly what chance predicts — no fraud, just combinatorics

deduplication is a value judgment

  • merge too aggressively → remove legitimate distinct voters
  • merge too conservatively → count the same person twice

the same tradeoff appears everywhere:

  • patient records across hospitals
  • customer records across systems
  • College Scorecard branch campuses (same OPEID6, different UNITID)
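a minimal standardize-then-dedupe sketch (hypothetical school names):

```python
import pandas as pd

names = pd.DataFrame({"school": ["Stanford University",
                                 " stanford  university ",
                                 "STANFORD UNIVERSITY",
                                 "MIT"]})

# lowercase, strip, and collapse internal whitespace before comparing
names["clean"] = (names["school"]
                  .str.lower()
                  .str.strip()
                  .str.replace(r"\s+", " ", regex=True))
deduped = names.drop_duplicates(subset="clean")
```

note the value judgment hiding in this code: these three Stanford variants are merged, but a legitimate "Stanford Online" record would survive as distinct.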

  3. when tools betray you

three approaches to missing earnings

same question: “what is the average median earnings?”

approach        method                              result
drop rows       dropna(subset=['earnings'])         same mean, fewer rows
fill with mean  fillna(mean_val)                    same mean, shrunken std
ignore NaN      pandas default (.mean() skips NaN)  same mean, full DataFrame

Q: if they all give the same mean, why does it matter?

the damage shows in group comparisons

dropping rows with missing earnings changes which schools remain
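the shrinkage from mean-filling is easy to see with toy numbers (hypothetical earnings, not Scorecard values):

```python
import pandas as pd
import numpy as np

earnings = pd.Series([30000.0, 50000.0, np.nan, 70000.0, np.nan])

mean_ignore = earnings.mean()            # pandas default: skip NaN
mean_drop = earnings.dropna().mean()     # identical value
filled = earnings.fillna(earnings.mean())
mean_fill = filled.mean()                # identical again

std_ignore = earnings.std()              # spread of the 3 observed values
std_fill = filled.std()                  # smaller: imputed points add no spread
```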

an AI assistant “cleans” the College Scorecard by calling .dropna()

  • it reports the average earnings is $45,000
  • name two questions you’d ask before trusting that number
  • focus on what was lost in the cleaning step

2 min think. 3 min pair. 2 min class.

Stanford: a case study in hidden complexity

the Scorecard website says Stanford median earnings = $136,000 (4yr after graduation)

but program-level data tells a different story:

program             credential  earnings (4yr)
Business Admin      Master’s    $262K
Computer Science    Master’s    $256K
Human Biology       Bachelor’s  $82K
English Literature  Bachelor’s  $82K
Ethnic Studies      Bachelor’s   $46K

one number. five very different realities.

when coercion erases distinctions

imagine an income column with these non-numeric values:

value                meaning
"PrivacySuppressed"  cohort too small
">100,000"           top-coded (Census)
"<LOD"               below limit of detection
"Refused"            survey respondent declined
"N/A"                question didn’t apply

pd.to_numeric(col, errors='coerce') turns all five into NaN

five different reasons for missingness → one undifferentiated blank
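one way to keep the distinctions is to record the original string before coercing (using the five values above):

```python
import pandas as pd

income = pd.Series(["52000", "PrivacySuppressed", ">100,000",
                    "<LOD", "Refused", "N/A"])

numeric = pd.to_numeric(income, errors="coerce")
# keep the original string wherever conversion failed
reason = income.where(numeric.isna())

df = pd.DataFrame({"income": numeric, "missing_reason": reason})
```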

sentinel values

-999 and 999999 aren’t missing — they’re wrong

Mean WITH sentinels:    -208.6°F
Mean WITHOUT sentinels:   72.9°F

unlike NaN, sentinel values participate in every computation without warning
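a sketch with made-up temperatures showing why sentinels must become NaN before aggregating:

```python
import pandas as pd
import numpy as np

temps = pd.Series([71.0, 74.0, -999.0, 73.0, -999.0])

bad_mean = temps.mean()                           # sentinels poison the average
good_mean = temps.replace(-999.0, np.nan).mean()  # NaN is skipped by .mean()
```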

the identifier trap

OPEID 00102100 → stored as integer → becomes 102100

  • leading zeros vanish
  • later join against zero-padded strings: zero matches, no error

ZIP codes, phone numbers, IDs — codes, not numbers
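the defense is to declare the column a code, not a number, at load time (a sketch using the OPEID from above with made-up school names):

```python
import pandas as pd
from io import StringIO

# hypothetical two-school CSV; 00102100 is the OPEID from the slide
text = "OPEID,name\n00102100,School A\n00105200,School B\n"

wrong = pd.read_csv(StringIO(text))                        # OPEID inferred as int64
right = pd.read_csv(StringIO(text), dtype={"OPEID": str})  # leading zeros preserved
```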

spot the bugs

this code merges two tables and computes mean earnings:

merged = pd.merge(scorecard, programs, on='UNITID')
mean_earn = merged['EARN_MDN_4YR'].mean()

how many problems can you find?

the code on the previous slide has at least four bugs

  • find as many as you can
  • for each, explain what goes wrong and how to fix it

small groups, 3 min. then 2 min share-out.

why we write code

spreadsheets hide logic

  • formula in cell G47 excludes rows
  • dragged range stops one row short
  • no one notices unless they click every cell

code makes decisions visible

  • every drop, coerce, join is a readable line
  • when wrong, the bug is findable
  • when right, the logic is reproducible

good code is deterministic. same script, same data → same answer every time.

AI analysis vs. AI-assisted code

                    AI analyzes data directly       AI helps you write code
cleaning decisions  hidden                          visible as lines you can read
reproducibility     different answers each session  deterministic
what you learn      nothing about the data          everything about the data

use AI to write code, not to replace understanding

data munging checklist

  1. check provenance — where did the data come from?
  2. check types — any str columns that should be numeric?
  3. find the missing data — NaN, sentinels, string placeholders, absent rows
  4. understand why data is missing — MCAR, MAR, MNAR?
  5. check joins — row counts after every merge. assert.
  6. standardize strings — lowercase, strip, collapse variants
  7. verify type conversions — how many values became NaN?
  8. compare groups — is missingness evenly distributed?

“the combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

— John Tukey

synthesis

  • missing data hides in many forms — NaN is just the easy case
  • joins can grow or shrink your data — the type is a decision
  • every cleaning choice changes which observations enter your analysis
  • code makes decisions visible — spreadsheets and AI bury them
  • domain knowledge has no substitute — choosing the right metric, spotting leakage, interpreting missingness

next time: linear algebra

we can clean and join data. now we need a language for modeling it.

  • vectors as data points, vectors as features
  • linear combinations → predictions
  • column space → what a model can reach

read: Chapter 4 in the course notes

before next class

  • read Chapter 3 in the course notes
  • try the notebook exercises on your own
  • Quiz 1 is Wednesday — covers Lectures 1-3

one-minute feedback

  1. what was the most useful thing you learned today?
  2. what was the most confusing?

give feedback