Lecture 17: Working with AI

MSE 125 — Applied Statistics

Madeleine Udell

Wednesday, May 27, 2026

logistics

HW 5 (AI-assisted analysis audit) due Mon Jun 1
project final report due Fri Jun 5
last week handoff: next two lectures on causal inference taught by I-han Lai (TA, causal inference expert)
today: AI tools, their failure modes, and the audit checklist

you paste a dataset into a chat tool

10 seconds later: fitted model, written analysis, a confident number
the arithmetic ran cleanly
the judgment is still yours

when should you trust the result?

agenda

AutoML
tabular foundation models
LLM agents
where they fail
two worked audits
the checklist

AutoML

from Lecture 13: trees, forests, gradient boosting

decision tree: recursive splits on one feature at a time
random forest: many trees in parallel, predictions averaged
gradient boosting: trees in sequence, each fitting the previous residuals

gradient boosting wins most tabular competitions (XGBoost, LightGBM, CatBoost)

random forest vs. gradient boosting

forest averages independently grown trees
boosting grows the next tree to fix what the ensemble missed

gradient boosting = gradient descent in function space

each new tree fits the residual of the ensemble so far

r_i = y_i - \hat y_i

like gradient descent (Lec 7), but updating a function instead of parameters
works for any loss (logistic, etc.) via the gradient
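
The residual-fitting loop above can be written out by hand. This is a minimal sketch on synthetic data, not the Airbnb set: the sine-shaped target, the learning rate of 0.1, and the depth-2 trees are all illustrative assumptions. For squared-error loss the negative gradient with respect to the prediction is exactly the residual, so each tree is one small gradient step in function space.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption: stands in for real data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1                      # illustrative step size
pred = np.full_like(y, y.mean())         # start from the constant prediction
trees, train_mse = [], []

for _ in range(100):
    residual = y - pred                  # r_i = y_i - yhat_i
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                # next tree fits what the ensemble missed
    pred += learning_rate * tree.predict(X)   # small step in function space
    trees.append(tree)
    train_mse.append(float(np.mean((y - pred) ** 2)))

print(f"train MSE after 1 tree: {train_mse[0]:.3f}, after 100 trees: {train_mse[-1]:.3f}")
```

Training error can only go down here; whether held-out error follows is what cross-validation is for.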

on Airbnb prices

forest = RandomForestRegressor(n_estimators=200, random_state=42)
gboost = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42)

Random Forest:     CV R-sq = 0.642
Gradient Boosting: CV R-sq = 0.668

sequential error-correction picks up feature interactions and price discontinuities the forest averages over
tree-based models still beat neural networks on tabular data (Grinsztajn et al., NeurIPS 2022)

AutoML

AutoML (automated machine learning)

software that, given a dataset and a target, searches model families and hyperparameters and returns a fitted pipeline. examples: AutoGluon, FLAML, auto-sklearn, H2O AutoML, lightautoml, MLJAR; Vertex AI, SageMaker Autopilot, Azure AutoML

Gijsbers et al., AMLB: an AutoML Benchmark, JMLR 2024 (Fig. 4); scaled to tuned RF = 0, best observed = 1

CASH: combined algorithm selection and hyperparameter optimization

Lec 13: you swept tree depth, scored with 5-fold CV, picked the best
AutoML: sweep which algorithm and all its hyperparameters at once
fold “which algorithm” into one top-level categorical hyperparameter

named CASH in Auto-WEKA (Thornton et al., KDD 2013)
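
One way to see the "algorithm as a categorical hyperparameter" idea is scikit-learn's grid search over a pipeline whose model step is itself a search dimension. This is a sketch under stated assumptions: the data come from `make_regression`, the three candidate families and their grids are arbitrary, and real AutoML systems use smarter search than an exhaustive grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (assumption: placeholder for a real dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# The "model" step is a placeholder; the search swaps in each candidate.
pipe = Pipeline([("model", Ridge())])

# One search space: which algorithm, plus that algorithm's own hyperparameters.
search_space = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [RandomForestRegressor(n_estimators=50, random_state=0)],
     "model__max_depth": [3, 6, None]},
    {"model": [GradientBoostingRegressor(n_estimators=50, random_state=0)],
     "model__max_depth": [2, 4]},
]

search = GridSearchCV(pipe, search_space, cv=5, scoring="r2")
search.fit(X, y)

winner = type(search.best_params_["model"]).__name__
print(f"selected: {winner}, CV R-sq = {search.best_score_:.3f}")
```

The search returns a model family and its settings in one step, which is the CASH framing. It still says nothing about whether R² was the right metric or whether these were the right features.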

mini AutoML on Airbnb

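from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X, y: Airbnb feature matrix and price target, prepared earlier
models = {
    'Linear Regression': Pipeline([('scaler', StandardScaler()),
                                   ('model', LinearRegression())]),
    'Random Forest':     RandomForestRegressor(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42),
}
for name, model in models.items():
    # 5-fold cross-validated R-squared, one score per fold
    r2 = cross_val_score(model, X, y, cv=5, scoring='r2')
    ...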

3 model families, one CV loop, pick the winner. AutoML automates this at scale.

a close call between the top two

linear regression clearly worst: Airbnb price is not linear in these features
RF vs GB margin (~0.03) is small relative to the linear-vs-trees gap (~0.10), and comparable to the shift a single outlier can cause

what did AutoML not do?

it searched models and hyperparameters.

which decisions did you make before any model was fit?

AutoGluon wins by ensembling, not by smarter search

confirms a 2004 ensemble result that predates the AutoML era

Gijsbers et al., JMLR 2024 · Caruana et al., ICML 2004
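
A simplified sketch of greedy ensemble selection, in the spirit of the 2004 result cited above. It is not AutoGluon's implementation: the three-model library, the synthetic data, and the single validation split are assumptions for illustration. Models are added one at a time, with repeats allowed, whenever the averaged prediction improves the validation error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data and a small model library (both illustrative assumptions).
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

library = {
    "ridge": Ridge(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boost": GradientBoostingRegressor(n_estimators=50, random_state=0),
}
val_preds = {name: m.fit(X_tr, y_tr).predict(X_val) for name, m in library.items()}


def mse(pred):
    """Validation mean squared error of a prediction vector."""
    return float(np.mean((y_val - pred) ** 2))


best_single = min(mse(p) for p in val_preds.values())

chosen = []                              # models picked so far, repeats allowed
ensemble_sum = np.zeros_like(y_val)
ensemble_mse = float("inf")
for _ in range(10):
    # Try adding each model once more; keep the addition that helps most.
    scores = {name: mse((ensemble_sum + p) / (len(chosen) + 1))
              for name, p in val_preds.items()}
    name = min(scores, key=scores.get)
    if scores[name] >= ensemble_mse:
        break                            # no candidate improves validation error
    chosen.append(name)
    ensemble_sum += val_preds[name]
    ensemble_mse = scores[name]

print(f"chosen: {chosen}")
print(f"best single MSE = {best_single:.1f}, ensemble MSE = {ensemble_mse:.1f}")
```

The first pick is always the best single model, so the ensemble can only match or beat it on the validation set. Scoring the ensemble on the same split used to build it is optimistic; a real system keeps a separate holdout.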

tabular foundation models

foundation models

TabPFN

TabPFN (Tabular Prior-data Fitted Network)

a transformer pretrained once, offline on millions of synthetic tabular datasets. at deployment, no training, no tuning

Hollmann et al., ICLR 2023; Nature 2025 (v2)

TabPFN on the Airbnb data

model	5-fold CV R²
random forest (200 trees)	0.636
gradient boosting (200 trees, depth 4)	0.652
TabPFN v2 (no tuning)	0.668

no tuning, beats the tuned booster on this subsample

10,000-row subsample (TabPFN v2 row cap)

LLM agents

the loop

data-analysis agent: an LLM + the loop around it
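
The loop can be sketched in a few lines: propose code, run it, feed the output or the traceback back, repeat. Everything here is a hypothetical stand-in: `scripted_llm` replaces a real model call with two canned drafts, and `run_code` uses a bare `exec`, which is unsafe for untrusted code and is only acceptable in a toy.

```python
import contextlib
import io
import traceback


def run_code(code):
    """Execute a snippet and return (succeeded, printed output or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})               # toy only: never exec untrusted code like this
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue().strip()


def scripted_llm(history):
    """Hypothetical stand-in for a model call: first draft has a bug, second fixes it."""
    drafts = [
        "values = [3, 5, 8]\nprint(sum(values) / len(value))",    # NameError
        "values = [3, 5, 8]\nprint(sum(values) / len(values))",
    ]
    return drafts[min(len(history), len(drafts) - 1)]


def agent_loop(llm, max_steps=5):
    """Ask for code, run it, and hand any traceback back until something runs."""
    history = []                         # tracebacks seen so far
    for _ in range(max_steps):
        code = llm(history)
        ok, output = run_code(code)
        if ok:
            return output, history
        history.append(output)           # the traceback forces another attempt
    raise RuntimeError("no working code within the step budget")


answer, errors = agent_loop(scripted_llm)
print(f"answer: {answer}  (failed attempts: {len(errors)})")
```

The loop guarantees the final code ran. It does not check that the mean was the right statistic to compute.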

two examples

CAAFE: LLM reads a dataset description, proposes new features as code + explanation, feeds them into a tabular model (TabPFN, often)
AIDE: runs ML engineering as a tree search over candidate code

both call AutoML / foundation models as tools: composition, not reimplementation

Hollmann et al., NeurIPS 2023; Jiang et al., 2025

how well do they actually do?

MLE-bench (OpenAI): 75 real Kaggle competitions, scored against the human leaderboards

original paper: best setup reaches at least a bronze medal in 16.9% of competitions
early 2026 leader (R&D-Agent, o3 + GPT-4.1): ~30% medal rate

~70% of competitions: no medal at all

Chan et al., ICLR 2025

where they fail

DABStep: easy vs hard

450+ multi-step data-analysis tasks from a financial-analytics platform: same model, same dataset, split by task tier

Easy tier (single-step, well-specified): ~76%
Hard tier (reasoning across documents + code): ~15% at launch (mid-2025)
by late 2025, multi-agent systems (DS-STAR + Gemini 2.5 Pro) pushed Hard to ~45%, still well short of an analyst

vendor claims of “94% accuracy” online usually mean the Easy tier or an aggregate: the context-stripped number this lecture is about

Egg et al., 2025 (launch) · Yoon et al., DS-STAR, 2025 (current SOTA)

what makes a DABStep task Hard?

Easy (~76%)

example questions:

“How many transactions in 2023?”
“Total payment volume by card scheme?”

source: one dataset (payments.csv)

approach: aggregate → answer

Hard (~15%)

example questions:

“Which card scheme had the highest average fraud rate in 2023?”
“If merchant X changed business category, how would fees change?”

sources: payments.csv + fees.json + manual.md

approach: read rules from manual → join → multi-step compute → cross-check

Egg et al., DABstep on Adyen payments + fees + scheme manuals

same pattern across four benchmarks

the tools converge on a narrow set of analytic choices, not the breadth an expert would weigh

DiscoveryBench: Majumder et al. 2024 · DA-Code: Huang et al. EMNLP 2024 · BLADE: Gu et al. EMNLP 2024

one-shot vs. tool-using

code that runs	code that doesn’t run
traceback forces a fix	plausible wrong number printed in prose
can’t fabricate output	“p = 0.03” for a test the code never ran

execution lets you trust the arithmetic. it does not let you trust the analysis design, the assumption checking, or the causal interpretation

the well-documented failure: omission

the tool does the work you specify and skips the work you do not

what we don’t know yet

popular claims the evidence does not support strongly:

“LLMs are confidently wrong”: base models are well-calibrated; chat-model miscalibration is partly recoverable with simple prompts
“LLMs confuse correlation with causation”: failures shown on text benchmarks, not measured on analysis agents
“agents p-hack on their own”: documented version is researchers prompt-hacking the tool

GPT-4 Tech Report Fig. 8 · Tian et al. EMNLP 2023 · Kadavath et al. 2022 · CLadder NeurIPS 2023 · Causal Parrots TMLR 2023 · arXiv:2504.14571, arXiv:2509.08825 (2025)
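
"Well-calibrated" has a concrete meaning: among answers given with 80% confidence, about 80% are right. A minimal sketch of one common summary, expected calibration error, on simulated confidences. The simulated "models" are assumptions for illustration and say nothing about any real LLM.

```python
import numpy as np


def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap     # weight each bin by its share of answers
    return ece


rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=20000)

# Simulated answer streams (assumptions, not measurements of any model).
calibrated = rng.random(20000) < conf               # right as often as claimed
overconfident = rng.random(20000) < (conf - 0.2)    # right 20 points less often

ece_calibrated = expected_calibration_error(conf, calibrated)
ece_overconfident = expected_calibration_error(conf, overconfident)
print(f"calibrated ECE = {ece_calibrated:.3f}, overconfident ECE = {ece_overconfident:.3f}")
```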

Course discipline: don’t overclaim what the evidence shows. Three popular claims have weak or no direct empirical support.

Calibration: OpenAI’s own technical report shows GPT-4’s pretraining checkpoint near-perfectly calibrated on multiple-choice questions (Fig. 8); RLHF post-training degrades it, but Tian et al. (EMNLP 2023) recover ~half the calibration with simple prompting; Anthropic’s Kadavath et al. (2022), “Language Models (Mostly) Know What They Know”, makes the same point.

Causal: CLadder (Jin et al., NeurIPS 2023) and Causal Parrots (Zečević et al., TMLR 2023) show formal-reasoning-on-text failures; no published study measures how often a data-analysis agent reports a regression coefficient as causal.

P-hacking: the documented phenomenon is researchers exploiting LLM configuration to fish for results; see “Prompt-Hacking: The New p-Hacking?” (arXiv 2504.14571) and “Large Language Model Hacking” (arXiv 2509.08825). That’s human misuse amplified by the tool, not autonomous agent behavior.

four axes you can move along

axis	reliable when…	unreliable when…
task complexity	single-step, well-specified	multi-step, open-ended
prompt specificity	analyst names test + assumptions	bare prompt
execution	agent runs code and reads output	one-shot text only
model variant	base, or chat with calibration prompt	unprompted RLHF chat

three of these you control

two worked audits

example 1: College Scorecard

one row per US institution: enrollment, financial aid, graduate earnings

task: predict 10-year earnings from school characteristics

features = ['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN', 'RET_FT4',
            'C150_4_POOLED_SUPP', 'CONTROL']
scorecard_complete = scorecard[features + ['MD_EARN_WNE_P10']].dropna()

.dropna() is the standard recipe. what gets dropped?

who survives the filter?

public + private nonprofit: most schools kept
private for-profit: ~0% retained; most lack SAT_AVG

this is MNAR

MNAR (missing not at random)

missingness depends on unobserved factors. the rows you lose are systematically different from the ones you keep.

SAT requirement excludes community colleges, trade schools, for-profits
precisely the schools where earnings ↔︎ school traits is different

AutoML reports a respectable R². nothing in its output tells you which schools the analysis no longer applies to.
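
A quick audit of who survives `.dropna()` is a retention rate per group. The sketch below uses a made-up stand-in for the Scorecard: the values are random, `CONTROL` is given readable labels rather than the real file's coding, and the rule that for-profits never report `SAT_AVG` is an assumption built in to mimic the pattern above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Scorecard (all values are made up).
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "CONTROL": rng.choice(["public", "private nonprofit", "private for-profit"], size=n),
    "SAT_AVG": rng.normal(1100, 120, size=n),
    "PCTPELL": rng.uniform(0.1, 0.8, size=n),
    "MD_EARN_WNE_P10": rng.normal(45000, 9000, size=n),
})
# Assumption for the sketch: for-profit schools never report SAT_AVG.
toy.loc[toy["CONTROL"] == "private for-profit", "SAT_AVG"] = np.nan

kept = toy.notna().all(axis=1)           # rows that .dropna() would keep
retention = kept.groupby(toy["CONTROL"]).mean()

print(f"rows before: {len(toy)}, after dropna: {int(kept.sum())}")
print(retention.round(2))
```

If retention differs sharply across groups, the fitted model describes only the groups that survived.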

example 2: NBA rest days

task: do players score more when rested?

from scipy import stats

# pooled comparison: every player-game in one bucket, split only by rest
rested     = nba[nba['REST_DAYS'] >= 3]['PTS']
not_rested = nba[nba['REST_DAYS'] <= 1]['PTS']
t_stat, p  = stats.ttest_ind(rested, not_rested)

Rested (3+ days):      9.5 PPG (n=47,612)
Not rested (0-1 days): 11.4 PPG (n=63,830)
t-statistic: -38.86
p-value: 0.00e+00

rested players score fewer points, p \approx 0. that’s backwards.

how have you used AI agents this quarter?

share one specific case:

a mistake the AI made that you caught: what tipped you off?
a case where you’re not sure if it was right: what would you need to check?

DISCUSSION: Think-pair-share (5 min). Think and jot first, then pair and compare; debrief 2–3 students.

This is the live audit. The point isn’t to embarrass anyone — it’s to surface the kinds of errors you’ve already started catching (so we can name them) and the kinds you might be missing (so we can name those too). Listen for: arithmetic errors, plausible-but-wrong code, hallucinated functions, missing data handling, wrong test choice, ignored assumptions, confident framings that turn out to be guesses. Map what students report to the five checklist clusters you’re about to introduce.

If quiet: “raise your hand if you’ve used Claude or ChatGPT or Cursor this quarter on a problem set or project” — almost everyone will. Then ask one of those students to walk through a specific instance.

controlling for player identity

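from scipy import stats

results = []  # one (player, mean rested-minus-tired gap, p-value) row per eligible player
for player in nba['PLAYER_NAME'].unique():
    pdata = nba[nba['PLAYER_NAME'] == player]
    rested_p = pdata[pdata['REST_DAYS'] >= 3]['GAME_SCORE']
    tired_p  = pdata[pdata['REST_DAYS'] <= 1]['GAME_SCORE']
    # require at least 10 games in each condition
    if len(rested_p) >= 10 and len(tired_p) >= 10:
        t, p = stats.ttest_ind(rested_p, tired_p)
        results.append((player, rested_p.mean() - tired_p.mean(), p))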

per-player rested − tired in game score
removes the player-quality confound

within-player effects

distribution centered near zero
mean shift is under one game-score point, in the direction against rest

aggregate effect came from confounding, not from rest
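
The same reversal can be reproduced on simulated data where rest has no effect at all. This is a sketch, not the NBA data: the skill range, the noise level, and the rule that stronger players are rested less often are assumptions chosen to create the confound.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows = []
for player in range(200):
    skill = rng.uniform(2, 25)                      # player's true scoring level
    p_rested = 0.8 - 0.6 * (skill - 2) / 23         # assumption: stars are rested less often
    rest = np.where(rng.random(80) < p_rested, "rested", "tired")
    pts = skill + rng.normal(scale=4, size=80)      # rest has NO effect by construction
    rows.append(pd.DataFrame({"player": player, "rest": rest, "pts": pts}))
games = pd.concat(rows, ignore_index=True)

# Pooled comparison: mixes player quality with rest.
pooled = games.groupby("rest")["pts"].mean()
pooled_gap = pooled["rested"] - pooled["tired"]

# Within-player comparison: each player is their own control.
per_player = games.groupby(["player", "rest"])["pts"].mean().unstack()
within_gap = (per_player["rested"] - per_player["tired"]).dropna().mean()

print(f"pooled rested minus tired: {pooled_gap:.2f}")
print(f"mean within-player gap:    {within_gap:.2f}")
```

The pooled gap is large and negative only because low scorers make up most of the rested group. Within players it disappears.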

the arithmetic was right. the judgment was wrong.

step	tool delivers?
run the t-test	✓
print the numbers	✓
flag Simpson’s paradox	✗
switch to within-player comparison	✗
adjust for multiple testing	✗

knowing this required Lec 11 (Simpson’s, multiple testing), Lec 12 (practical significance)
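
Running one t-test per player is a multiple-testing setup. A minimal sketch with simulated players for whom rest truly does nothing: the player count, games per condition, and score distribution are assumptions, and Bonferroni is used only because it is the simplest correction to write down.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_players, n_games = 300, 30

# Simulated game scores with no rest effect for anyone (assumption).
rested = rng.normal(10, 4, size=(n_players, n_games))
tired = rng.normal(10, 4, size=(n_players, n_games))

# One two-sample t-test per player (rows are players).
p_values = stats.ttest_ind(rested, tired, axis=1).pvalue

alpha = 0.05
n_raw = int((p_values < alpha).sum())                      # unadjusted "discoveries"
n_bonferroni = int((p_values < alpha / n_players).sum())   # Bonferroni threshold

print(f"players with p < 0.05, unadjusted: {n_raw} of {n_players}")
print(f"players passing Bonferroni:        {n_bonferroni} of {n_players}")
```

With no true effects, roughly 5% of unadjusted tests still cross 0.05, which is why a list of "significant players" needs a correction before anyone reads it.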

what remains human

the irreducibly human work

asking the right question: “what predicts X” vs. “what causes X”
knowing the domain: why are for-profits missing SAT data?
questioning assumptions: independence, representativeness, causality
understanding stakes: wrong prediction → real consequences
communicating uncertainty honestly: wide CIs, missing populations
choosing values: fairness has incompatible definitions; choice is yours (Ch 20)

what’s your goal?

trustworthy analysis (working analyst)

let the AI draft
you audit
the rest of this section covers this mode

learning (homework, study, building intuition)

try the problem yourself first
then ask the AI to teach, hint, critique, quiz
not to hand you the answer

HW 5: the audit case. on practice problems, invert it.

use AI, then audit

use AI for…	always check…
boilerplate code	missing-data handling
quick EDA	axis labels, scale, units
trying many models	what got dropped; assumptions
generating hypotheses	correlation vs. causation
drafting reports	honest uncertainty

let the AI draft. apply your statistical judgment.

prompt decomposition

don’t say: “analyze this dataset”

say:

load + inspect: show first rows, dtypes, missingness
check missingness: which columns? related to other variables?
fit a model: name the model, name the features
check your work: what assumptions? what could go wrong?

Ruta et al. (2025) measured this directly: the same model went from 32.5% accuracy on bare prompts to 92.5% when the analyst named the test and the assumptions to check

the checklist

chapter 1’s promise, delivered

we promised: by the end of this course, you’d know when not to trust a model’s output

five questions to ask before trusting any analysis: yours, a colleague’s, a vendor’s, an AI’s

1. the data: where did it come from, who is missing?

source: data dictionary? who collected it, why?
who’s missing: would their exclusion change the conclusion?
types: numeric truly numeric? IDs treated as numbers?

Scorecard: SAT filter dropped ~100% of for-profits

Ch 2–3 · today’s worked example 1

2. the model: was it scored on data it had never seen?

metric: right for the decision?
split: truly held-out? temporal leakage?
leakage: does any input encode the outcome?
distribution shift: same population train and deploy?

Quiz 8 revenue: random R² beat temporal R² by 0.4. the gap is leakage.

Ch 6, 7, 16
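
The random-versus-temporal gap is easy to reproduce. This sketch uses a synthetic trending series, not the Quiz 8 revenue data; the trend slope, noise level, and the choice of a random forest on the time index are assumptions picked to make the effect visible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic revenue series: upward trend plus noise (assumption).
rng = np.random.default_rng(0)
t = np.arange(500)
revenue = 100 + 0.5 * t + rng.normal(scale=5, size=500)
X = t.reshape(-1, 1)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Random split: test points sit between training points in time.
X_tr, X_te, y_tr, y_te = train_test_split(X, revenue, test_size=0.2, random_state=0)
r2_random = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))

# Temporal split: train on the first 400 periods, test on the last 100.
r2_temporal = r2_score(revenue[400:], model.fit(X[:400], revenue[:400]).predict(X[400:]))

print(f"random-split R-sq:   {r2_random:.2f}")
print(f"temporal-split R-sq: {r2_temporal:.2f}")
```

The random split lets the model interpolate between neighbors it has already seen. The temporal split asks the question deployment will actually ask.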

3. the signal: real, or one knob the analyst turned?

base rate: is “99% accuracy” trivial?
multiple testing: the only hypothesis tested, or only the one reported?
uncertainty: CI? how wide?
practical significance: large enough for the decision?

NBA: aggregate effect vanished once we controlled for player

Ch 7, 8, 10, 11, 12 · today’s worked example 2
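
The base-rate check is one line: compare the reported accuracy with a model that always predicts the majority class. A sketch on simulated labels with about 1% positives; the class balance and the signal-free features are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.01).astype(int)    # about 1% positives (assumption)
X = rng.normal(size=(5000, 3))               # features carry no signal

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()

print(f"positive rate: {y.mean():.3f}")
print(f"majority-class accuracy: {baseline_acc:.3f}")
```

Any reported accuracy has to be read against this number, not against 50%.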

4. the claim: causal effect, or two things that move together?

confounding: does correlation imply causation here?
causal structure: what would have to be true for the claim to hold?
alternative explanations: what else could produce this pattern?

the one checklist question that gets its own unit: the next two lectures with I-han turn it into a formal tool

Ch 11 → Ch 18-19 (DAGs, identification)

5. incentives + dynamics: who benefits, how does it change after deployment?

Goodhart: could optimizing this metric cause people to game it?
feedback loops: do predictions change the outcome being measured?
who paid for it: does the vendor have an incentive to show a particular result?

static analyses miss what changes after deployment

Ch 16 + common sense

audit an analysis

pick an analysis from this week’s news.

walk through the five questions. which items can you check from the article? which would need data or code?

“far better an approximate answer to the right question,
which is often vague, than an exact answer to the wrong question,
which can always be made precise”

John Tukey

what we covered

AutoML: automates CASH; doesn’t pick the question, the metric, or the data source
the 2026 frontier: not smarter search; ensembling (AutoGluon) and pretraining across datasets (TabPFN)
LLM agents: execute code and read back the output; ~76% on DABStep Easy, ~15% on Hard at launch
the failure profile: arithmetic class solved, judgment class still yours
the checklist: five questions, every analysis, forever

next week: causal inference with I-han Lai

we can fit models, validate them, audit them

but does X actually cause Y?

Lec 18: DAGs, confounders, colliders
Lec 19: randomization, natural experiments

I-han is the causal inference expert on staff. you are in great hands.

thank you

it’s been an honor to teach you this course
the work I’m proudest of: the discrimination skill, not the generation
you can run any model. you can audit any analysis. that’s the durable skill.

stay in touch.

feedback

forms.gle/feedback

what worked? what didn’t? what would you change?