Lecture 17: Working with AI

MSE 125 — Applied Statistics

Madeleine Udell

Wednesday, May 27, 2026

logistics

  • HW 5 (AI-assisted analysis audit) due Mon Jun 1
  • project final report due Fri Jun 5
  • last week handoff: next two lectures on causal inference taught by I-han Lai (TA, causal inference expert)
  • today: AI tools, their failure modes, and the audit checklist

you paste a dataset into a chat tool

  • 10 seconds later: fitted model, written analysis, a confident number
  • the arithmetic ran cleanly
  • the judgment is still yours

when should you trust the result?

agenda

  • AutoML
  • tabular foundation models
  • LLM agents
  • where they fail
  • two worked audits
  • the checklist

AutoML

from Lecture 13: trees, forests, gradient boosting

  • decision tree: recursive splits on one feature at a time
  • random forest: many trees in parallel, predictions averaged
  • gradient boosting: trees in sequence, each fitting the previous residuals

gradient boosting wins most tabular competitions (XGBoost, LightGBM, CatBoost)

random forest vs. gradient boosting

  • forest averages independently grown trees
  • boosting grows the next tree to fix what the ensemble missed

gradient boosting = gradient descent in function space

each new tree fits the residual of the ensemble so far

r_i = y_i - \hat y_i

  • like gradient descent (Lec 7), but updating a function instead of parameters
  • works for any loss (logistic, etc.) via the gradient
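a minimal sketch of the residual-fitting loop, assuming squared loss (where the negative gradient of ½(y_i − \hat y_i)² with respect to \hat y_i is exactly the residual r_i). the one-split stump learner, the toy data, the learning rate, and the round count are all made up for illustration; real libraries grow deeper trees and add regularization.

```python
# Toy gradient boosting for squared loss: each stump fits the current residuals.
# Data, learning rate, and number of rounds are invented for illustration.

def fit_stump(x, r):
    """Best single-split regression stump on one feature, fit to residuals r."""
    best = None
    for threshold in sorted(set(x))[:-1]:          # keep both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - left_mean) ** 2 for ri in left)
               + sum((ri - right_mean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]                          # a nonlinear target

learning_rate = 0.5
pred = [sum(y) / len(y)] * len(y)                  # round 0: predict the mean
history = [mse(y, pred)]
for _ in range(20):
    residuals = [yi - pi for yi, pi in zip(y, pred)]   # r_i = y_i - y_hat_i
    stump = fit_stump(x, residuals)
    pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    history.append(mse(y, pred))

print(f"training MSE: round 0 = {history[0]:.3f}, round 20 = {history[-1]:.3f}")
```

training error can only go down here because each stump is a least-squares fit to the residuals; whether held-out error also goes down is a separate question that needs cross-validation.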

on Airbnb prices

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

forest = RandomForestRegressor(n_estimators=200, random_state=42)
gboost = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42)

Random Forest:     CV R-sq = 0.642
Gradient Boosting: CV R-sq = 0.668
  • sequential error-correction picks up feature interactions and price discontinuities the forest averages over
  • tree-based models still beat neural networks on tabular data (Grinsztajn et al., NeurIPS 2022)

AutoML

AutoML (automated machine learning)

software that, given a dataset and a target, searches model families and hyperparameters and returns a fitted pipeline. examples: AutoGluon, FLAML, auto-sklearn, H2O AutoML, lightautoml, MLJAR; Vertex AI, SageMaker Autopilot, Azure AutoML

Gijsbers et al., AMLB: an AutoML Benchmark, JMLR 2024 (Fig. 4); scores scaled so that a tuned random forest = 0 and the best observed result = 1

CASH: combined algorithm selection and hyperparameter optimization

  • Lec 13: you swept tree depth, scored with 5-fold CV, picked the best
  • AutoML: sweep which algorithm and all its hyperparameters at once
  • fold “which algorithm” into one top-level categorical hyperparameter

named CASH in Auto-WEKA (Thornton et al., KDD 2013)
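a rough sketch of the CASH idea: every candidate is an (algorithm, hyperparameters) pair in one flat search space, scored by the same CV loop. the three toy model families, the 1-D data, and the helper names (fit_mean, fit_line, fit_knn, cv_r2) are invented for illustration; real AutoML systems search far larger spaces with smarter strategies than exhaustive enumeration.

```python
# CASH as one flat search: each candidate is an (algorithm, hyperparameters) pair.
# Toy 1-D data and toy model families, invented for illustration.
import random

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k):
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

# "which algorithm" is just one more categorical choice in the space
search_space = [("mean", fit_mean, {}), ("line", fit_line, {})] + [
    ("knn", fit_knn, {"k": k}) for k in (1, 5, 25)
]

def cv_r2(fit, params, xs, ys, folds=5):
    """Mean held-out R^2 over interleaved folds."""
    scores = []
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        test = [i for i in range(len(xs)) if i % folds == f]
        model = fit([xs[i] for i in train], [ys[i] for i in train], **params)
        y_test = [ys[i] for i in test]
        mean_test = sum(y_test) / len(y_test)
        ss_res = sum((ys[i] - model(xs[i])) ** 2 for i in test)
        ss_tot = sum((y - mean_test) ** 2 for y in y_test)
        scores.append(1 - ss_res / ss_tot)
    return sum(scores) / folds

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [(x - 5) ** 2 + rng.gauss(0, 1) for x in xs]   # curved, so a line should lose

results = {(name, tuple(params.items())): cv_r2(fit, params, xs, ys)
           for name, fit, params in search_space}
best = max(results, key=results.get)
print(best, round(results[best], 3))
```

the search picks a winner among the candidates it was handed. it never asks whether the target, the features, or the rows were the right ones.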

mini AutoML on Airbnb

models = {
    'Linear Regression': Pipeline([('scaler', StandardScaler()),
                                   ('model', LinearRegression())]),
    'Random Forest':     RandomForestRegressor(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=200, max_depth=4, ...),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring='r2')
    ...

3 model families, one CV loop, pick the winner. AutoML automates this at scale.

a close call between the top two

  • linear regression clearly worst: Airbnb price is not linear in these features
  • RF vs GB margin (~0.03) is small next to the linear-vs-trees gap (~0.10); a single outlier could move a score by a comparable amount
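one way to judge a close call is to compare the margin with its fold-to-fold noise, pairing the two models on the same folds. the per-fold scores below are invented placeholders (chosen only so their means match the averages quoted earlier), not the actual Airbnb folds, and the standard error is rough because CV folds share training data.

```python
# Is a CV margin bigger than fold-to-fold noise? Paired comparison over the same folds.
# The per-fold scores are invented placeholders, not real results.
from statistics import mean, stdev

rf_folds = [0.61, 0.66, 0.63, 0.67, 0.64]   # hypothetical random forest R^2 per fold
gb_folds = [0.65, 0.66, 0.68, 0.66, 0.69]   # hypothetical gradient boosting R^2 per fold

diffs = [g - r for g, r in zip(gb_folds, rf_folds)]
margin = mean(diffs)
std_err = stdev(diffs) / len(diffs) ** 0.5   # rough: folds are not independent

print(f"margin = {margin:.3f}, standard error = {std_err:.3f}")
```

with these made-up folds the margin is only about two standard errors wide, which is the sense in which the top two are a close call.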

what did AutoML not do?

it searched models and hyperparameters.

which decisions did you make before any model was fit?

tabular foundation models

foundation models

TabPFN

TabPFN (Tabular Prior-data Fitted Network)

a transformer pretrained once, offline on millions of synthetic tabular datasets. at deployment, no training, no tuning

Hollmann et al., ICLR 2023; Nature 2025 (v2)

TabPFN on the Airbnb data

model                                  | 5-fold CV R²
random forest (200 trees)              | 0.636
gradient boosting (200 trees, depth 4) | 0.652
TabPFN v2 (no tuning)                  | 0.668

no tuning, beats the tuned booster on this subsample

10,000-row subsample (TabPFN v2 row cap)

LLM agents

the loop

data-analysis agent: an LLM + the loop around it
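a toy version of that loop, assuming a propose → execute → read back → revise cycle. fake_llm is a hypothetical stand-in for a model call (it returns a buggy snippet first and a fixed one after seeing a traceback); run and agent are illustrative names, not any real framework's API.

```python
# Minimal agent loop: propose code, execute it, feed the output or traceback back in.
# `fake_llm` is a hypothetical stub standing in for a real LLM call.
import contextlib
import io
import traceback

def fake_llm(task, feedback):
    """Return code as a string; 'fixes' its bug only after seeing an error."""
    if feedback is None:
        return "print(sum(values) / len(value))"    # bug: `value` is undefined
    return "print(sum(values) / len(values))"

def run(code, env):
    """Execute code with captured stdout; return (ok, output_or_traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, dict(env))
        return True, buffer.getvalue().strip()
    except Exception:
        return False, traceback.format_exc(limit=1)

def agent(task, env, max_steps=3):
    feedback, history = None, []
    for _ in range(max_steps):
        code = fake_llm(task, feedback)
        ok, output = run(code, env)
        history.append((code, ok))
        if ok:
            return output, history
        feedback = output                            # the traceback forces a revision
    return None, history

answer, history = agent("mean of values", {"values": [2, 4, 9]})
print(answer, [ok for _, ok in history])
```

the loop catches code that crashes. it has no way to catch code that runs cleanly and answers the wrong question.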

two examples

  • CAAFE: LLM reads a dataset description, proposes new features as code + explanation, feeds them into a tabular model (TabPFN, often)
  • AIDE: runs ML engineering as a tree search over candidate code

both call AutoML / foundation models as tools: composition, not reimplementation

Hollmann et al., NeurIPS 2023; Jiang et al., 2025

how well do they actually do?

MLE-bench (OpenAI): 75 real Kaggle competitions, scored against the human leaderboards

  • original paper: best setup reaches at least a bronze medal in 16.9% of competitions
  • early 2026 leader (R&D-Agent, o3 + GPT-4.1): ~30% medal rate

~70% of competitions: no medal at all

Chan et al., ICLR 2025

where they fail

DABstep: easy vs hard

450+ multi-step data-analysis tasks from a financial-analytics platform: same model, same dataset, split by task tier

  • Easy tier (single-step, well-specified): ~76%
  • Hard tier (reasoning across documents + code): ~15% at launch (mid-2025)
  • by late 2025, multi-agent systems (DS-STAR + Gemini 2.5 Pro) pushed Hard to ~45%, still well short of an analyst

vendor claims of “94% accuracy” online usually refer to the Easy tier or to an aggregate score: exactly the context-stripped kind of number this lecture is about

Egg et al., 2025 (launch) · Yoon et al., DS-STAR, 2025 (current SOTA)

what makes a DABstep task Hard?

Easy (~76%)

example questions:

  • “How many transactions in 2023?”
  • “Total payment volume by card scheme?”

source: one dataset (payments.csv)

approach: aggregate → answer

Hard (~15%)

example questions:

  • “Which card scheme had the highest average fraud rate in 2023?”
  • “If merchant X changed business category, how would fees change?”

sources: payments.csv + fees.json + manual.md

approach: read rules from manual → join → multi-step compute → cross-check

Egg et al., DABstep on Adyen payments + fees + scheme manuals

same pattern across four benchmarks

the tools converge on a narrow set of analytic choices, not the breadth an expert would weigh

DiscoveryBench: Majumder et al. 2024 · DA-Code: Huang et al. EMNLP 2024 · BLADE: Gu et al. EMNLP 2024

one-shot vs. tool-using

code that runs         | code that doesn’t run
traceback forces a fix | plausible wrong number printed in prose
can’t fabricate output | “p = 0.03” for a test the code never ran

execution lets you trust the arithmetic. it does not let you trust the analysis design, the assumption checking, or the causal interpretation

the well-documented failure: omission

the tool does the work you specify and skips the work you do not

what we don’t know yet

popular claims the evidence does not support strongly:

  • “LLMs are confidently wrong”: base models are well-calibrated; chat-model miscalibration recoverable with simple prompts
  • “LLMs confuse correlation with causation”: failures shown on text benchmarks, not measured on analysis agents
  • “agents p-hack on their own”: documented version is researchers prompt-hacking the tool

GPT-4 Tech Report Fig. 8 · Tian et al. EMNLP 2023 · Kadavath et al. 2022 · CLadder NeurIPS 2023 · Causal Parrots TMLR 2023 · arXiv:2504.14571, arXiv:2509.08825 (2025)

four axes you can move along

axis               | reliable when…                        | unreliable when…
task complexity    | single-step, well-specified           | multi-step, open-ended
prompt specificity | analyst names test + assumptions      | bare prompt
execution          | agent runs code and reads output      | one-shot text only
model variant      | base, or chat with calibration prompt | unprompted RLHF chat

three of these you control

two worked audits

example 1: College Scorecard

one row per US institution: enrollment, financial aid, graduate earnings

task: predict 10-year earnings from school characteristics

features = ['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN', 'RET_FT4',
            'C150_4_POOLED_SUPP', 'CONTROL']
scorecard_complete = scorecard[features + ['MD_EARN_WNE_P10']].dropna()

.dropna() is the standard recipe. what gets dropped?

who survives the filter?

  • public + private nonprofit: most schools kept
  • private for-profit: ~0% retained; most lack SAT_AVG

this is MNAR

MNAR (missing not at random)

missingness depends on unobserved factors. the rows you lose are systematically different from the ones you keep.

  • SAT requirement excludes community colleges, trade schools, for-profits
  • precisely the schools where earnings ↔︎ school traits is different

AutoML reports a respectable R². none of it tells you which schools the analysis no longer applies to.
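a small sketch of the check that would have caught this: compute the retention rate of the complete-case filter separately for each group. the rows are invented, and CONTROL is shown with text labels for readability (the real file may code it numerically); only the column names come from the snippet above.

```python
# Audit what a complete-case filter drops, group by group.
# The rows are invented; only the column names come from the lecture's snippet.
from collections import defaultdict

rows = [
    {"CONTROL": "public",             "SAT_AVG": 1100, "MD_EARN_WNE_P10": 48000},
    {"CONTROL": "public",             "SAT_AVG": None, "MD_EARN_WNE_P10": 39000},
    {"CONTROL": "public",             "SAT_AVG": 1250, "MD_EARN_WNE_P10": 55000},
    {"CONTROL": "private nonprofit",  "SAT_AVG": 1300, "MD_EARN_WNE_P10": 60000},
    {"CONTROL": "private nonprofit",  "SAT_AVG": 1180, "MD_EARN_WNE_P10": 51000},
    {"CONTROL": "private for-profit", "SAT_AVG": None, "MD_EARN_WNE_P10": 31000},
    {"CONTROL": "private for-profit", "SAT_AVG": None, "MD_EARN_WNE_P10": 28000},
]
required = ["SAT_AVG", "MD_EARN_WNE_P10"]

kept, total = defaultdict(int), defaultdict(int)
for row in rows:
    total[row["CONTROL"]] += 1
    # a row survives only if every required column is present
    kept[row["CONTROL"]] += all(row[col] is not None for col in required)

retention = {group: kept[group] / total[group] for group in total}
for group, share in retention.items():
    print(f"{group:20s} kept {share:.0%} of {total[group]} rows")
```

with pandas, the same table should come from something like scorecard[features + ['MD_EARN_WNE_P10']].notna().all(axis=1).groupby(scorecard['CONTROL']).mean(), run before the .dropna() call.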

example 2: NBA rest days

task: do players score more when rested?

from scipy import stats

rested     = nba[nba['REST_DAYS'] >= 3]['PTS']
not_rested = nba[nba['REST_DAYS'] <= 1]['PTS']
t_stat, p  = stats.ttest_ind(rested, not_rested)  # pooled two-sample t-test across all players

Rested (3+ days):      9.5 PPG (n=47,612)
Not rested (0-1 days): 11.4 PPG (n=63,830)
t-statistic: -38.86
p-value: 0.00e+00

rested players score fewer points, p \approx 0. that’s backwards.

how have you used AI agents this quarter?

share one specific case:

  1. a mistake the AI made that you caught: what tipped you off?
  2. a case where you’re not sure if it was right: what would you need to check?

controlling for player identity

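within = []  # one entry per player with enough games in both conditions
for player in nba['PLAYER_NAME'].unique():
    pdata = nba[nba['PLAYER_NAME'] == player]
    rested_p = pdata[pdata['REST_DAYS'] >= 3]['GAME_SCORE']
    tired_p  = pdata[pdata['REST_DAYS'] <= 1]['GAME_SCORE']
    if len(rested_p) >= 10 and len(tired_p) >= 10:
        t, p = stats.ttest_ind(rested_p, tired_p)
        # keep the per-player gap (rested - tired) alongside its test result
        within.append((player, rested_p.mean() - tired_p.mean(), t, p))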
  • per-player rested − tired in game score
  • removes the player-quality confound

within-player effects

  • distribution centered near zero
  • mean shift: under one game-score point, in the direction against rest

aggregate effect came from confounding, not from rest
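a simulated illustration of how this can happen, assuming a world where rest has no effect at all. the two player types, their scoring averages, and their rest rates are invented; the only point is that a pooled comparison can show a large gap that the within-player comparison does not.

```python
# Simulated confounding: rest has NO effect on scoring here, by construction.
# Player types, scoring averages, and rest rates are invented for illustration.
import random
from statistics import mean

rng = random.Random(0)
games = []  # (player, rested, points)
for player in range(40):
    star = player < 20
    avg_points = 25 if star else 5        # stars score more ...
    p_rested = 0.2 if star else 0.7       # ... and show up as "rested" less often
    for _ in range(200):
        rested = rng.random() < p_rested
        games.append((player, rested, avg_points + rng.gauss(0, 3)))

# Pooled comparison: mixes who the player is with whether they rested.
pooled_gap = (mean(pts for _, r, pts in games if r)
              - mean(pts for _, r, pts in games if not r))

# Within-player comparison: each player serves as their own control.
within_gaps = []
for player in range(40):
    rested_pts = [pts for q, r, pts in games if q == player and r]
    tired_pts = [pts for q, r, pts in games if q == player and not r]
    within_gaps.append(mean(rested_pts) - mean(tired_pts))

print(f"pooled gap:      {pooled_gap:+.2f} points")
print(f"mean within gap: {mean(within_gaps):+.2f} points")
```

the pooled gap is large and negative even though no simulated player scores differently when rested: the sign comes entirely from who tends to be in each group.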

the arithmetic was right. the judgment was wrong.

step tool delivers?
run the t-test
print the numbers
flag Simpson’s paradox
switch to within-player comparison
adjust for multiple testing

knowing this required Lec 11 (Simpson’s, multiple testing), Lec 12 (practical significance)
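running one test per player means many tests at once, so the per-player p-values need a correction. a minimal sketch of two standard ones, Bonferroni and Benjamini-Hochberg, written from their textbook definitions; the p-values are invented placeholders, not results from the NBA data.

```python
# Two standard corrections for many simultaneous tests.
# The p-values are invented placeholders, not results from the NBA data.

def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
n_uncorrected = sum(p <= 0.05 for p in p_values)
n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(f"uncorrected: {n_uncorrected}, Bonferroni: {n_bonferroni}, BH: {n_bh}")
```

five of the eight made-up tests clear 0.05 on their own; one survives Bonferroni and two survive Benjamini-Hochberg.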

what remains human

the irreducibly human work

  • asking the right question: “what predicts X” vs. “what causes X”
  • knowing the domain: why are for-profits missing SAT data?
  • questioning assumptions: independence, representativeness, causality
  • understanding stakes: wrong prediction → real consequences
  • communicating uncertainty honestly: wide CIs, missing populations
  • choosing values: fairness has incompatible definitions; choice is yours (Ch 20)

what’s your goal?

trustworthy analysis (working analyst)

  • let the AI draft
  • you audit
  • the rest of this section

learning (homework, study, building intuition)

  • try the problem yourself first
  • then ask the AI to teach, hint, critique, quiz
  • not to hand you the answer

HW 5: the audit case. on practice problems, invert it.

use AI, then audit

use AI for…           | always check…
boilerplate code      | missing-data handling
quick EDA             | axis labels, scale, units
trying many models    | what got dropped; assumptions
generating hypotheses | correlation vs. causation
drafting reports      | honest uncertainty

let the AI draft. apply your statistical judgment.

prompt decomposition

don’t say: “analyze this dataset”

say:

  1. load + inspect: show first rows, dtypes, missingness
  2. check missingness: which columns? related to other variables?
  3. fit a model: name the model, name the features
  4. check your work: what assumptions? what could go wrong?

Ruta et al. (2025) measured this directly: the same model went from 32.5% accuracy on bare prompts to 92.5% when the analyst named the test and the assumptions to check

the checklist

chapter 1’s promise, delivered

we promised: by the end of this course, you’d know when not to trust a model’s output

five questions to ask before trusting any analysis: yours, a colleague’s, a vendor’s, an AI’s

1. the data: where did it come from, who is missing?

  • source: data dictionary? who collected it, why?
  • who’s missing: exclusion changes the conclusion?
  • types: numeric truly numeric? IDs treated as numbers?

Scorecard: SAT filter dropped ~100% of for-profits

Ch 2–3 · today’s worked example 1

2. the model: was it scored on data it had never seen?

  • metric: right for the decision?
  • split: truly held-out? temporal leakage?
  • leakage: does any input encode the outcome?
  • distribution shift: same population train and deploy?

Quiz 8 revenue: random R² beat temporal R² by 0.4. the gap is leakage.
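a toy illustration of one way a random split can flatter a model on time-ordered data: test points sit between training points, so the model can copy its neighbors. the trending series and the nearest-in-time "model" are invented for illustration and are not the Quiz 8 data.

```python
# Random vs. temporal split on a trending series, scored with the same model.
# The series and the nearest-in-time "model" are invented for illustration.
import random

rng = random.Random(0)
n = 200
y = [0.05 * t + rng.gauss(0, 0.5) for t in range(n)]   # upward trend + noise

def r2_nearest_in_time(train_idx, test_idx):
    """Predict each test point with the y of the closest training time; return R^2."""
    preds = [y[min(train_idx, key=lambda s: abs(s - t))] for t in test_idx]
    actual = [y[t] for t in test_idx]
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, preds))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Random split: test points are surrounded by training points in time.
test_random = sorted(rng.sample(range(n), 40))
train_random = [t for t in range(n) if t not in test_random]

# Temporal split: train on the past, test on the future.
train_temporal, test_temporal = list(range(160)), list(range(160, n))

r2_random = r2_nearest_in_time(train_random, test_random)
r2_temporal = r2_nearest_in_time(train_temporal, test_temporal)
print(f"random split R^2 = {r2_random:.2f}, temporal split R^2 = {r2_temporal:.2f}")
```

the model and the data are identical in both runs; only the split changes, and the temporal split is the one that resembles deployment.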

Ch 6, 7, 16

3. the signal: real, or one knob the analyst turned?

  • base rate: is “99% accuracy” trivial?
  • multiple testing: the only hypothesis tested, or just the only one reported?
  • uncertainty: CI? how wide?
  • practical significance: large enough for the decision?
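the base-rate question in a few lines, assuming invented labels with a 1% positive rate: a "model" that never predicts the rare class already scores 99% accuracy.

```python
# A classifier that never predicts the rare class, on invented labels with a 1% base rate.
labels = [1] * 10 + [0] * 990          # 10 positives in 1,000 cases
predictions = [0] * len(labels)        # "model": always predict the majority class

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
recall = true_positives / sum(labels)  # share of actual positives that were found

print(f"accuracy = {accuracy:.2%}, recall = {recall:.0%}")
```

any accuracy claim needs the majority-class baseline next to it before it means anything.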

NBA: aggregate effect vanished once we controlled for player

Ch 7, 8, 10, 11, 12 · today’s worked example 2

4. the claim: causal effect, or two things that move together?

  • confounding: does correlation imply causation here?
  • causal structure: what would have to be true for the claim to hold?
  • alternative explanations: what else could produce this pattern?

the audit’s lone unit: next two lectures with I-han turn this into a formal tool

Ch 11 → Ch 18-19 (DAGs, identification)

5. incentives + dynamics: who benefits, how does it change after deployment?

  • Goodhart: could optimizing this metric cause people to game it?
  • feedback loops: do predictions change the outcome being measured?
  • who paid for it: does the vendor have an incentive to show a particular result?

static analyses miss what changes after deployment

Ch 16 + common sense

audit an analysis

pick an analysis from this week’s news.

walk through the five questions. which items can you check from the article? which would need data or code?

“far better an approximate answer to the right question,
which is often vague, than an exact answer to the wrong question,
which can always be made precise”

John Tukey

what we covered

  • AutoML: automates CASH; doesn’t pick the question, the metric, or the data source
  • the 2026 frontier: not smarter search; ensembling (AutoGluon) and pretraining across datasets (TabPFN)
  • LLM agents: execute and read back; ~76% on easy, ~15% on hard
  • the failure profile: arithmetic class solved, judgment class still yours
  • the checklist: five questions, every analysis, forever

next week: causal inference with I-han Lai

we can fit models, validate them, audit them

but does X actually cause Y?

  • Lec 18: DAGs, confounders, colliders
  • Lec 19: randomization, natural experiments

I-han is the causal inference expert on staff. you are in great hands.

thank you

  • it’s been an honor to teach you this course
  • the work I’m proudest of: the discrimination skill, not the generation
  • you can run any model. you can audit any analysis. that’s the durable skill.

stay in touch.

feedback

what worked? what didn’t? what would you change?