MSE 125 — Slides – Lecture 3: Data munging

concept check from last time

df.describe() says the Airbnb price column has 3,818 values

df.shape says the DataFrame has 4,229 rows

Q: where did the other 411 rows go?

.describe() silently skips NaN — 411 listings have missing prices
always compare .describe() counts against .shape

a cell range dragged one row too short. austerity budgets for millions of people.

Reinhart & Rogoff (2010) — Excel error discovered by Thomas Herndon, UMass Amherst

the spreadsheet

=AVERAGE(L30:L44) — but the data goes to row 49

three data disasters

Reinhart-Rogoff (2010)

cell range one row too short

→ austerity budgets for millions

Public Health England (2020)

.xls row limit: 65,536

→ 15,841 COVID cases lost, contacts never traced

gene naming (2016–2020)

SEPT1 → “1-Sep”

→ 20% of genomics papers corrupted, 27 genes renamed

all three are bugs — but in a spreadsheet, the bug is invisible

why we write code

spreadsheets hide logic

formula in cell G47 excludes rows
dragged range stops one row short
no one notices unless they click every cell

code makes decisions visible

every drop, coerce, join is a readable line
when wrong, the bug is findable
when right, the logic is reproducible

good code is deterministic. same script, same data → same answer every time.

AI analysis vs. AI-assisted code

	AI analyzes data directly	AI helps you write code
cleaning decisions	hidden	visible as lines you can read
reproducibility	different answers each session	deterministic
what you learn	nothing about the data	everything about the data

use AI to write code, not to replace understanding

today’s concepts

every cleaning decision changes the answer

missing data — the many forms, MCAR / MAR / MNAR
joining datasets — inner vs. left, one-to-many
strings and deduplication — standardization, the birthday paradox
when tools betray you — handling missing data, coercion traps, AI-assisted code

recap: what we found last time

Airbnb data: relatively tidy, NaN for missing values
pandas handled it transparently: .isna(), .describe()
but we noticed some dtypes looked wrong…

today: real datasets are rarely so cooperative

the many faces of missing data

today’s dataset: College Scorecard

US Department of Education
7,703 institutions, 122 columns
SAT scores, enrollment, earnings, debt
data dictionary, methodology, public updates

. . .

contrast: a random CSV on Kaggle with no documentation

data provenance

the chain of custody from data collection to your notebook

documented methodology? ✓
data dictionary defining every column? ✓
known collection procedures? ✓
public and versioned? ✓

in the AI era: synthetic datasets, undocumented web scrapes, training data contamination

Stanford in the Scorecard

the Scorecard website says Stanford median earnings = $136,000 (4 years after completing)

but program-level data tells a different story:

program	credential	earnings (4yr)
Business Admin	Master’s	$262K
Computer Science	Master’s	$256K
Human Biology	Bachelor’s	$82K
English Literature	Bachelor’s	$82K
Ethnic Studies	Bachelor’s	$46K

$136K is a weighted average dominated by large graduate programs

navigating a data dictionary

the College Scorecard dictionary: 3,308 variables across 8 tabs

variable name	what it actually means
`CONTROL`	institution type: 1=Public, 2=Private nonprofit, 3=Private for-profit
`MD_EARN_WNE_P10`	median earnings of students working and not enrolled, 10 years after entry
`PREDDEG`	predominant degree: 0–4 coded as integers, but the values are categories
`OPEID`	defined as a string — store as integer and you lose leading zeros

CONTROL = 2 looks numeric. mean(CONTROL) = 1.7. what does that mean?

a column that should be numeric

MD_EARN_WNE_P10 — median earnings, 10 years after entry (working and not enrolled)

dtype: str

Q: why would an earnings column be stored as a string?

the culprit: PrivacySuppressed

Non-numeric values in 'MD_EARN_WNE_P10':
PrivacySuppressed    480

a string sitting in a numeric column
cohort too small → Department of Education suppresses data to protect privacy
pandas treats the whole column as str

the missingness zoo

encoding	where you’ll find it	pandas detects it?
`NaN`, `None`	standard pandas	yes
`""` (empty string)	CSV exports	no
`"N/A"`, `"NA"`	manual data entry	sometimes
`"PrivacySuppressed"`	government data	no
`-999`, `99`, `0`	sensors, surveys	no
absent rows	panel/time-series	no rows to detect

chapter 2 covered row 1. everything else requires active detection.

why categories are stored as integers

CONTROL: 1 = Public, 2 = Private nonprofit, 3 = Private for-profit

why not store the string?

"Management Science and Engineering" = 37 bytes
7 = 1 byte

across 7,703 rows × hundreds of columns, integer codes cut storage by 10–50×

the tradeoff: you need the data dictionary to know what the numbers mean

Demo: College Scorecard exploration

open in Colab

missingness is not random

three patterns of missingness

MCAR / MAR / MNAR

MCAR: no pattern — mold destroys random records
MAR: depends on observed data — SAT scores missing because school doesn’t require them (degree type is recorded)
MNAR: depends on the missing value itself — sickest patients don’t report

College Scorecard earnings: MNAR

small cohort → suppressed → small schools are systematically different

classify each scenario as MCAR, MAR, or MNAR:

a lab assistant drops a tray; 12 random blood samples are destroyed
SAT scores are missing for most certificate programs in the Scorecard
College Scorecard suppresses earnings when the cohort is too small

data leakage

suppose you want to predict graduation rate from admission data

the Scorecard has COMP_ORIG_YR4_RT (4-year completion rate) and MD_EARN_WNE_P10 (earnings 10 years later)

if you include earnings as a predictor of graduation… you’re using the future to predict the past

data leakage

when your model accidentally sees information from the future or the target variable

joining datasets

why join?

the Scorecard comes in two files:

scorecard.csv — one row per institution (SAT, enrollment, earnings)
field_of_study.csv — one row per school-program pair (earnings by major)

richer questions require combining tables

“do engineering programs pay more than humanities at the same school?”
“should I take on more debt for a prestigious school?”

joining = matching rows by a shared key

pd.merge(left, right, on='key')    # match rows where key values agree

two flavors:

inner join — keep only rows with a match in both tables
left join — keep every row from the left table; fill NaN where no match exists

inner join vs. left join: a toy example

**Students**
student	major_id
Alice	101
Bob	102
Carol	103

**Majors**
major_id	major_name
101	Physics
102	English
104	History

Q: which students survive an inner join on major_id?

inner join: only matching rows

**Inner join result**
student	major_id	major_name
Alice	101	Physics
Bob	102	English

Carol is gone (no match for 103)
History is gone (no match for 104)
only IDs in both tables survive

diagram: Wickham & Grolemund, R for Data Science

left join: keep all left rows

**Left join result**
student	major_id	major_name
Alice	101	Physics
Bob	102	English
Carol	103	NaN

all three students survive
Carol’s major becomes NaN (no match)
History still dropped (not in left table)

one-to-many: row multiplication

**Left (2 rows)**
left_id	key
L1	A
L2	A

**Right (3 rows)**
key	right_val
A	X
A	Y
A	Z

. . .

**Result: 2 × 3 = 6 rows**
left_id	key	right_val
L1	A	X
L1	A	Y
L1	A	Z
L2	A	X
L2	A	Y
L2	A	Z

every key matches → inner join = left join here. the difference only shows when keys are missing.

defense: check your key before joining

assert scorecard['UNITID'].is_unique, \
    "UNITID is not unique — join will multiply rows!"

an assert that passes is documentation

an assert that fails is an alarm

scorecard: 7,703 institutions
programs: 228,000 rows across 5,500 institutions
predict: how many rows will the inner join produce?
more or fewer than 7,703?

Demo: join explosion

open in Colab

after every join: check row counts

print(f"Scorecard rows:  {scorecard.shape[0]:,}")
print(f"Inner join rows: {inner.shape[0]:,}")
print(f"Left join rows:  {left.shape[0]:,}")

	gained rows?	lost rows?	what happened?
inner	yes (one-to-many)	yes (unmatched keys)	both at once
left	yes (one-to-many)	no	NaN for unmatched

the join type is a decision, not a default

strings and deduplication

string standardization

scorecard['INSTNM'].value_counts().head()

University of Phoenix-Arizona          1
University Of Phoenix-Arizona          1
UNIVERSITY OF PHOENIX - ARIZONA        1

scorecard['INSTNM_clean'] = (scorecard['INSTNM']
    .str.lower()
    .str.strip()
    .str.replace(r'\s+', ' ', regex=True))

lowercase → strip → collapse whitespace. every merge and groupby depends on this.

quick poll: voter registration

✋ who is registered to vote here (Santa Clara County)?
✋ who is registered at home (where your parents live)?
✋ who canceled their previous registration when they registered here?
✋ who plans to vote here? there? …both places?

when is a duplicate not a duplicate?

US voter registration: 104 million records across 50 states

match on name + date of birth → 761,875 apparent duplicates

Q: are these really the same person?

Goel, Meredith, Morse, Rothschild & Shirani-Mehr, “One Person, One Vote,” APSR 2020

the birthday paradox strikes

suppose 141 voters named “John Smith” were born in 1970

expected matching-birthday pairs: $\binom{141}{2} \times \frac{1}{365} \approx 27$

that’s 27 “double voters” — from one common name in one birth year

scale this across every name in the country: most of those 762K matches are coincidences

deduplication is a value judgment

merge too aggressively → remove legitimate distinct records
merge too conservatively → count the same entity twice

in the voter file: canceling registrations by name+DOB match would disenfranchise 300+ legitimate voters per double vote prevented

the same tradeoff appears everywhere:

patient records across hospitals
customer records across systems
College Scorecard branch campuses (same OPEID6, different UNITID)

when tools betray you

three approaches to missing earnings

same question: “what is the average median earnings?”

approach	method	mean	side effect
drop rows	`dropna(subset=['earnings'])`	$42,290	removes schools from all columns
fill with mean	`fillna(mean_val)`	$42,290	shrinks std, distorts correlations
ignore NaN	pandas default (`.mean()` skips NaN)	$42,290	preserves full DataFrame

same mean — but dropping rows changes which schools remain in your sample

the damage shows in group comparisons

dropping rows with missing earnings changes which schools remain

an AI assistant “cleans” the College Scorecard by calling .dropna()

it reports the average earnings is $45,000
name two questions you’d ask before trusting that number
focus on what was lost in the cleaning step

when coercion erases distinctions

imagine an income column with these non-numeric values:

value	meaning
`"PrivacySuppressed"`	cohort too small
`">100,000"`	top-coded (Census)
`"<LOD"`	below limit of detection
`"Refused"`	survey respondent declined
`"N/A"`	question didn’t apply

pd.to_numeric(errors='coerce') turns all five into NaN

five different reasons for missingness → one undifferentiated blank

sentinel values

-999 and 999999 aren’t missing — they’re wrong

Mean WITH sentinels:    -208.6°F
Mean WITHOUT sentinels:   72.9°F

unlike NaN, sentinel values participate in every computation without warning

the identifier trap

OPEID 00102100 → stored as integer → becomes 102100

leading zeros vanish
later join against zero-padded strings: zero matches, no error

ZIP codes, phone numbers, IDs — codes, not numbers

this code merges two tables and computes mean earnings. it has at least four bugs.

merged = pd.merge(scorecard, programs, on='UNITID')
mean_earn = merged['EARN_MDN_4YR'].mean()

find as many as you can. for each, explain what goes wrong and how to fix it.

data munging checklist

check provenance — where did the data come from?
check types — any str columns that should be numeric?
find the missing data — NaN, sentinels, string placeholders, absent rows
understand why data is missing — MCAR, MAR, MNAR?
check joins — row counts after every merge. assert.
standardize strings — lowercase, strip, collapse variants
verify type conversions — how many values became NaN?
compare groups — is missingness evenly distributed?

“the combination of some data and an aching desire for an answer does not ensure that a reasonable one can be extracted from a given body of data.”

— John Tukey

synthesis

missing data hides in many forms — NaN is just the easy case; MNAR is the dangerous one
joins can grow or shrink your data — check row counts after every merge
strings and deduplication — standardize before matching; the birthday paradox generates false positives at scale
tools betray you — coercion erases distinctions, dropping rows changes the sample, code makes every decision visible

next time: linear algebra

we can clean and join data. now we need a language for modeling it.

vectors as data points, vectors as features
linear combinations → predictions
span → what a model can reach

preview: Chapter 4 in the course notes

before next class

fill out the homework discussion section scheduling survey (see Ed)
read Chapter 3 in the course notes
try the notebook exercises on your own
Quiz 1 is Wednesday — covers Lectures 1-3. be on time!

one-minute feedback

what was the most useful thing you learned today?
what was the most confusing?

give feedback