Lecture 3: Data munging

Applied Statistics: From Data to Decisions

Professor Madeleine Udell

Monday, April 6, 2026

concept check from last time

df.describe() says the Airbnb price column has 3,818 values

df.shape says the DataFrame has 4,229 rows

Q: where did the other 411 rows go?

  • .describe() silently skips NaN — 411 listings have missing prices
  • always compare .describe() counts against .shape

a cell range dragged one row too short. austerity budgets for millions of people.

Reinhart & Rogoff (2010) — Excel error discovered by Thomas Herndon, UMass Amherst

the spreadsheet

=AVERAGE(L30:L44) — but the data goes to row 49

three data disasters

Reinhart-Rogoff (2010)

cell range one row too short

→ austerity budgets for millions

Public Health England (2020)

.xls row limit: 65,536

→ 15,841 COVID cases lost, contacts never traced

gene naming (2016–2020)

SEPT1 → “1-Sep”

→ 20% of genomics papers corrupted, 27 genes renamed

all three are bugs — but in a spreadsheet, the bug is invisible

why we write code

spreadsheets hide logic

  • formula in cell G47 excludes rows
  • dragged range stops one row short
  • no one notices unless they click every cell

code makes decisions visible

  • every drop, coerce, join is a readable line
  • when wrong, the bug is findable
  • when right, the logic is reproducible

good code is deterministic. same script, same data → same answer every time.

AI analysis vs. AI-assisted code

AI analyzes data directly AI helps you write code
cleaning decisions hidden visible as lines you can read
reproducibility different answers each session deterministic
what you learn nothing about the data everything about the data

use AI to write code, not to replace understanding

today’s concepts

every cleaning decision changes the answer

  1. missing data — the many forms, MCAR / MAR / MNAR
  2. joining datasets — inner vs. left, one-to-many
  3. strings and deduplication — standardization, the birthday paradox
  4. when tools betray you — handling missing data, coercion traps, AI-assisted code

recap: what we found last time

  • Airbnb data: relatively tidy, NaN for missing values
  • pandas handled it transparently: .isna(), .describe()
  • but we noticed some dtypes looked wrong…

today: real datasets are rarely so cooperative

  1. the many faces of missing data

today’s dataset: College Scorecard

  • US Department of Education
  • 7,703 institutions, 122 columns
  • SAT scores, enrollment, earnings, debt
  • data dictionary, methodology, public updates

. . .

contrast: a random CSV on Kaggle with no documentation

data provenance

data provenance

the chain of custody from data collection to your notebook

  • documented methodology? ✓
  • data dictionary defining every column? ✓
  • known collection procedures? ✓
  • public and versioned? ✓

in the AI era: synthetic datasets, undocumented web scrapes, training data contamination

Stanford in the Scorecard

the Scorecard website says Stanford median earnings = $136,000 (4 years after completing)

but program-level data tells a different story:

program credential earnings (4yr)
Business Admin Master’s $262K
Computer Science Master’s $256K
Human Biology Bachelor’s $82K
English Literature Bachelor’s $82K
Ethnic Studies Bachelor’s $46K

$136K is a weighted average dominated by large graduate programs

a column that should be numeric

MD_EARN_WNE_P10 — median earnings, 10 years after entry (working and not enrolled)

dtype: str

Q: why would an earnings column be stored as a string?

the culprit: PrivacySuppressed

Non-numeric values in 'MD_EARN_WNE_P10':
PrivacySuppressed    480
  • a string sitting in a numeric column
  • cohort too small → Department of Education suppresses data to protect privacy
  • pandas treats the whole column as str

the missingness zoo

encoding where you’ll find it pandas detects it?
NaN, None standard pandas yes
"" (empty string) CSV exports no
"N/A", "NA" manual data entry sometimes
"PrivacySuppressed" government data no
-999, 99, 0 sensors, surveys no
absent rows panel/time-series no rows to detect

chapter 2 covered row 1. everything else requires active detection.

why categories are stored as integers

CONTROL: 1 = Public, 2 = Private nonprofit, 3 = Private for-profit

why not store the string?

  • "Management Science and Engineering" = 37 bytes
  • 7 = 1 byte

across 7,703 rows × hundreds of columns, integer codes cut storage by 10–50×

the tradeoff: you need the data dictionary to know what the numbers mean

Demo: College Scorecard exploration

open in Colab

missingness is not random

three patterns of missingness

MCAR / MAR / MNAR

  • MCAR: no pattern — mold destroys random records
  • MAR: depends on observed data — SAT scores missing because school doesn’t require them (degree type is recorded)
  • MNAR: depends on the missing value itself — sickest patients don’t report

College Scorecard earnings: MNAR

small cohort → suppressed → small schools are systematically different

classify each scenario as MCAR, MAR, or MNAR:

  1. a lab assistant drops a tray; 12 random blood samples are destroyed
  2. SAT scores are missing for most certificate programs in the Scorecard
  3. College Scorecard suppresses earnings when the cohort is too small

data leakage

suppose you want to predict graduation rate from admission data

the Scorecard has COMP_ORIG_YR4_RT (4-year completion rate) and MD_EARN_WNE_P10 (earnings 10 years later)

if you include earnings as a predictor of graduation… you’re using the future to predict the past

data leakage

when your model accidentally sees information from the future or the target variable

  1. joining datasets

why join?

the Scorecard comes in two files:

  • scorecard.csv — one row per institution (SAT, enrollment, earnings)
  • field_of_study.csv — one row per school-program pair (earnings by major)

richer questions require combining tables

  • “do engineering programs pay more than humanities at the same school?”
  • “should I take on more debt for a prestigious school?”

joining = matching rows by a shared key

pd.merge(left, right, on='key')    # match rows where key values agree

two flavors:

  • inner join — keep only rows with a match in both tables
  • left join — keep every row from the left table; fill NaN where no match exists

inner join vs. left join: a toy example

Students
student major_id
Alice 101
Bob 102
Carol 103
Majors
major_id major_name
101 Physics
102 English
104 History

Q: which students survive an inner join on major_id?

inner join: only matching rows

Inner join result
student major_id major_name
Alice 101 Physics
Bob 102 English
  • Carol is gone (no match for 103)
  • History is gone (no match for 104)
  • only IDs in both tables survive

diagram: Wickham & Grolemund, R for Data Science

left join: keep all left rows

Left join result
student major_id major_name
Alice 101 Physics
Bob 102 English
Carol 103 NaN
  • all three students survive
  • Carol’s major becomes NaN (no match)
  • History still dropped (not in left table)

one-to-many: row multiplication

Left (2 rows)
left_id key
L1 A
L2 A
Right (3 rows)
key right_val
A X
A Y
A Z

. . .

Result: 2 × 3 = 6 rows
left_id key right_val
L1 A X
L1 A Y
L1 A Z
L2 A X
L2 A Y
L2 A Z

every key matches → inner join = left join here. the difference only shows when keys are missing.

defense: check your key before joining

assert scorecard['UNITID'].is_unique, \
    "UNITID is not unique — join will multiply rows!"

an assert that passes is documentation

an assert that fails is an alarm

  • scorecard: 7,703 institutions
  • programs: 228,000 rows across 5,500 institutions
  • predict: how many rows will the inner join produce?
  • more or fewer than 7,703?

Demo: join explosion

open in Colab

after every join: check row counts

print(f"Scorecard rows:  {scorecard.shape[0]:,}")
print(f"Inner join rows: {inner.shape[0]:,}")
print(f"Left join rows:  {left.shape[0]:,}")
gained rows? lost rows? what happened?
inner yes (one-to-many) yes (unmatched keys) both at once
left yes (one-to-many) no NaN for unmatched

the join type is a decision, not a default

  1. strings and deduplication

string standardization

scorecard['INSTNM'].value_counts().head()
University of Phoenix-Arizona          1
University Of Phoenix-Arizona          1
UNIVERSITY OF PHOENIX - ARIZONA        1
scorecard['INSTNM_clean'] = (scorecard['INSTNM']
    .str.lower()
    .str.strip()
    .str.replace(r'\s+', ' ', regex=True))

lowercase → strip → collapse whitespace. every merge and groupby depends on this.

quick poll: voter registration

  • ✋ who is registered to vote here (Santa Clara County)?
  • ✋ who is registered at home (where your parents live)?
  • ✋ who canceled their previous registration when they registered here?
  • ✋ who plans to vote here? there? …both places?

when is a duplicate not a duplicate?

US voter registration: 104 million records across 50 states

match on name + date of birth → 761,875 apparent duplicates

Q: are these really the same person?

Goel, Meredith, Morse, Rothschild & Shirani-Mehr, “One Person, One Vote,” APSR 2020

the birthday paradox strikes

suppose 141 voters named “John Smith” were born in 1970

  • expected matching-birthday pairs: \(\binom{141}{2} \times \frac{1}{365} \approx 27\)
  • that’s 27 “double voters” — from one common name in one birth year

scale this across every name in the country: most of those 762K matches are coincidences

deduplication is a value judgment

  • merge too aggressively → remove legitimate distinct records
  • merge too conservatively → count the same entity twice

in the voter file: canceling registrations by name+DOB match would disenfranchise 300+ legitimate voters per double vote prevented

the same tradeoff appears everywhere:

  • patient records across hospitals
  • customer records across systems
  • College Scorecard branch campuses (same OPEID6, different UNITID)

  1. when tools betray you

three approaches to missing earnings

same question: “what is the average median earnings?”

approach method mean side effect
drop rows dropna(subset=['earnings']) $42,290 removes schools from all columns
fill with mean fillna(mean_val) $42,290 shrinks std, distorts correlations
ignore NaN pandas default (.mean() skips NaN) $42,290 preserves full DataFrame

same mean — but dropping rows changes which schools remain in your sample

the damage shows in group comparisons

dropping rows with missing earnings changes which schools remain

an AI assistant “cleans” the College Scorecard by calling .dropna()

  • it reports the average earnings is $45,000
  • name two questions you’d ask before trusting that number
  • focus on what was lost in the cleaning step

when coercion erases distinctions

imagine an income column with these non-numeric values:

value meaning
"PrivacySuppressed" cohort too small
">100,000" top-coded (Census)
"<LOD" below limit of detection
"Refused" survey respondent declined
"N/A" question didn’t apply

pd.to_numeric(errors='coerce') turns all five into NaN

five different reasons for missingness → one undifferentiated blank

sentinel values

-999 and 999999 aren’t missing — they’re wrong

Mean WITH sentinels:    -208.6°F
Mean WITHOUT sentinels:   72.9°F

unlike NaN, sentinel values participate in every computation without warning

the identifier trap

OPEID 00102100 → stored as integer → becomes 102100

  • leading zeros vanish
  • later join against zero-padded strings: zero matches, no error

ZIP codes, phone numbers, IDs — codes, not numbers

this code merges two tables and computes mean earnings. it has at least four bugs.

merged = pd.merge(scorecard, programs, on='UNITID')
mean_earn = merged['EARN_MDN_4YR'].mean()

find as many as you can. for each, explain what goes wrong and how to fix it.

data munging checklist

  1. check provenance — where did the data come from?
  2. check types — any str columns that should be numeric?
  3. find the missing data — NaN, sentinels, string placeholders, absent rows
  4. understand why data is missing — MCAR, MAR, MNAR?
  5. check joins — row counts after every merge. assert.
  6. standardize strings — lowercase, strip, collapse variants
  7. verify type conversions — how many values became NaN?
  8. compare groups — is missingness evenly distributed?

“the combination of some data and an aching desire for an answer does not ensure that a reasonable one can be extracted from a given body of data.”

— John Tukey

synthesis

  1. missing data hides in many forms — NaN is just the easy case; MNAR is the dangerous one
  2. joins can grow or shrink your data — check row counts after every merge
  3. strings and deduplication — standardize before matching; the birthday paradox generates false positives at scale
  4. tools betray you — coercion erases distinctions, dropping rows changes the sample, code makes every decision visible

next time: linear algebra

we can clean and join data. now we need a language for modeling it.

  • vectors as data points, vectors as features
  • linear combinations → predictions
  • span → what a model can reach

preview: Chapter 4 in the course notes

before next class

  • fill out the homework discussion section scheduling survey (see Ed)
  • read Chapter 3 in the course notes
  • try the notebook exercises on your own
  • Quiz 1 is Wednesday — covers Lectures 1-3. be on time!

one-minute feedback

  1. what was the most useful thing you learned today?
  2. what was the most confusing?

give feedback