MSE 125 — Applied Statistics
Wednesday, May 27, 2026
when should you trust the result?
AutoML
gradient boosting wins most tabular competitions (XGBoost, LightGBM, CatBoost)
each new tree fits the residual of the ensemble so far
r_i = y_i - \hat y_i
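a minimal sketch of the residual-fitting loop, assuming squared-error loss, one-split stumps as the weak learner, and a made-up 1-D dataset. the stump learner, learning rate, and data are illustrative assumptions, not what XGBoost, LightGBM, or CatBoost do internally.

```python
# Minimal gradient-boosting sketch for squared error, pure Python.
# The data, learning rate, and stump learner are illustrative assumptions.
import math


def fit_stump(x, r):
    """Fit a one-split regression stump to targets r by least squares."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds; right side stays non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def boost(x, y, n_rounds=30, learning_rate=0.5):
    """Return the training MSE after each boosting round."""
    pred = [sum(y) / len(y)] * len(y)  # start from the mean prediction
    mse_history = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # r_i = y_i - y_hat_i
        stump = fit_stump(x, residuals)  # the new tree fits the residual
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return mse_history


x = [i / 10 for i in range(40)]
y = [math.sin(xi) + 0.1 * xi for xi in x]
mse_history = boost(x, y)
print(f"training MSE: round 1 = {mse_history[0]:.4f}, round 30 = {mse_history[-1]:.4f}")
```

training error can only go down here, because each stump is the least-squares fit to the current residuals. that says nothing about held-out error, which is why the numbers that follow are cross-validated.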
Random Forest: CV R² = 0.642
Gradient Boosting: CV R² = 0.668
AutoML (automated machine learning)
software that, given a dataset and a target, searches model families and hyperparameters and returns a fitted pipeline. examples: AutoGluon, FLAML, auto-sklearn, H2O AutoML, lightautoml, MLJAR; Vertex AI, SageMaker Autopilot, Azure AutoML
Gijsbers et al., AMLB: an AutoML Benchmark, JMLR 2024 (Fig. 4); scaled to tuned RF = 0, best observed = 1
this search problem was named CASH (combined algorithm selection and hyperparameter optimization) in Auto-WEKA (Thornton et al., KDD 2013)
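the benchmark scaling (tuned RF = 0, best observed = 1) can be sketched as a linear rescale. this assumes the scaling is linear, as the caption suggests. the function name and the scores below are hypothetical, not taken from the benchmark.

```python
def scaled_score(score, rf_score, best_score):
    """Rescale so the tuned random forest maps to 0 and the best observed score maps to 1."""
    return (score - rf_score) / (best_score - rf_score)


# Hypothetical scores on one task (higher is better); not benchmark results.
rf_score, best_score = 0.80, 0.90
for name, score in [("tuned RF", 0.80), ("framework A", 0.85), ("framework B", 0.75)]:
    print(f"{name}: {scaled_score(score, rf_score, best_score):+.2f}")
```

a negative scaled score means the framework did worse than the tuned random forest on that task.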
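from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

models = {
    'Linear Regression': Pipeline([('scaler', StandardScaler()),
                                   ('model', LinearRegression())]),
    'Random Forest': RandomForestRegressor(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=200, max_depth=4, ...),
}
for name, model in models.items():
    # 5-fold cross-validated R² for each model family
    r2 = cross_val_score(model, X, y, cv=5, scoring='r2')
    ...
3 model families, one CV loop, pick the winner. AutoML automates this at scale.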
what did AutoML not do?
it searched models and hyperparameters.
which decisions did you make before any model was fit?
confirms a 2004 ensemble result that predates the AutoML era
Gijsbers et al., JMLR 2024 · Caruana et al., ICML 2004
tabular foundation models
TabPFN (Tabular Prior-data Fitted Network)
a transformer pretrained once, offline, on millions of synthetic tabular datasets. at deployment: no training, no tuning
Hollmann et al., ICLR 2023; Nature 2025 (v2)
| model | 5-fold CV R² |
|---|---|
| random forest (200 trees) | 0.636 |
| gradient boosting (200 trees, depth 4) | 0.652 |
| TabPFN v2 (no tuning) | 0.668 |
no tuning, beats the tuned booster on this subsample
10,000-row subsample (TabPFN v2 row cap)
LLM agents
data-analysis agent: an LLM + the loop around it
both call AutoML / foundation models as tools: composition, not reimplementation
Hollmann et al., NeurIPS 2023; Jiang et al., 2025
MLE-bench (OpenAI): 75 real Kaggle competitions, scored against the human leaderboards
~70% of competitions: no medal at all
Chan et al., ICLR 2025
where they fail
450+ multi-step data-analysis tasks from a financial-analytics platform: same model, same dataset, split by task tier
vendor claims of “94% accuracy” online usually refer to the Easy tier or an aggregate: the kind of context-stripped number this lecture is about
Egg et al., 2025 (launch) · Yoon et al., DS-STAR, 2025 (current SOTA)
Easy (~76%)
example questions:
source: one dataset (payments.csv)
approach: aggregate → answer
Hard (~15%)
example questions:
sources: payments.csv + fees.json + manual.md
approach: read rules from manual → join → multi-step compute → cross-check
Egg et al., DABstep on Adyen payments + fees + scheme manuals
the tools converge on a narrow set of analytic choices, not the breadth an expert would weigh
DiscoveryBench: Majumder et al. 2024 · DA-Code: Huang et al. EMNLP 2024 · BLADE: Gu et al. EMNLP 2024
| code that runs | code that doesn’t run |
|---|---|
| traceback forces a fix | plausible wrong number printed in prose |
| can’t fabricate output | “p = 0.03” for a test the code never ran |
execution lets you trust the arithmetic. it does not let you trust the analysis design, the assumption checking, or the causal interpretation
the tool does the work you specify and skips the work you do not
popular claims the evidence does not support strongly:
GPT-4 Tech Report Fig. 8 · Tian et al. EMNLP 2023 · Kadavath et al. 2022 · CLadder NeurIPS 2023 · Causal Parrots TMLR 2023 · arXiv:2504.14571, arXiv:2509.08825 (2025)
| axis | reliable when… | unreliable when… |
|---|---|---|
| task complexity | single-step, well-specified | multi-step, open-ended |
| prompt specificity | analyst names test + assumptions | bare prompt |
| execution | agent runs code and reads output | one-shot text only |
| model variant | base, or chat with calibration prompt | unprompted RLHF chat |
three of these you control
two worked audits
one row per US institution: enrollment, financial aid, graduate earnings
task: predict 10-year earnings from school characteristics
.dropna() is the standard recipe. what gets dropped?
MNAR (missing not at random)
missingness depends on unobserved factors. the rows you lose are systematically different from the ones you keep.
AutoML reports a respectable R². none of it tells you which schools the analysis no longer applies to.
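a minimal sketch of the check to run before dropping rows: count what gets dropped, by group. the rows below are hypothetical (not Scorecard data), and the pure-Python filter stands in for the `.dropna()` step.

```python
from collections import Counter

# Hypothetical rows: in this toy data, for-profit schools never report an SAT average.
schools = [
    {"sector": "public", "sat_avg": 1150, "earnings": 48000},
    {"sector": "public", "sat_avg": 1080, "earnings": 44000},
    {"sector": "public", "sat_avg": None, "earnings": 41000},
    {"sector": "private nonprofit", "sat_avg": 1290, "earnings": 56000},
    {"sector": "private nonprofit", "sat_avg": 1210, "earnings": 52000},
    {"sector": "for-profit", "sat_avg": None, "earnings": 31000},
    {"sector": "for-profit", "sat_avg": None, "earnings": 29000},
    {"sector": "for-profit", "sat_avg": None, "earnings": 33000},
]


def has_missing(row):
    """True if any field is missing, i.e. the row a dropna-style filter would discard."""
    return any(value is None for value in row.values())


# Share of rows dropped within each sector.
# With pandas, roughly: df.isna().any(axis=1).groupby(df["sector"]).mean()
total = Counter(row["sector"] for row in schools)
dropped = Counter(row["sector"] for row in schools if has_missing(row))
drop_rate = {sector: dropped[sector] / total[sector] for sector in total}

# Compare the outcome on all rows with the outcome on the rows that survive.
kept = [row for row in schools if not has_missing(row)]
mean_all = sum(row["earnings"] for row in schools) / len(schools)
mean_kept = sum(row["earnings"] for row in kept) / len(kept)

for sector, rate in drop_rate.items():
    print(f"{sector}: {rate:.0%} of rows dropped")
print(f"mean earnings, all rows: {mean_all:.0f}; after dropping: {mean_kept:.0f}")
```

if one group loses nearly all its rows, the fitted model describes the remaining groups only, whatever the R² says.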
task: do players score more when rested?
Rested (3+ days): 9.5 PPG (n=47,612)
Not rested (0-1 days): 11.4 PPG (n=63,830)
t-statistic: -38.86
p-value: 0.00e+00
rested players score fewer points, p \approx 0. that’s backwards.
how have you used AI agents this quarter?
share one specific case:
aggregate effect came from confounding, not from rest
| step | tool delivers? |
|---|---|
| run the t-test | ✓ |
| print the numbers | ✓ |
| flag Simpson’s paradox | ✗ |
| switch to within-player comparison | ✗ |
| adjust for multiple testing | ✗ |
knowing this required Lec 11 (Simpson’s, multiple testing), Lec 12 (practical significance)
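a toy sketch of how the sign can flip, assuming made-up game logs in which the high scorer mostly plays on short rest and the bench player mostly plays rested. the players and point totals are hypothetical, not the NBA data above.

```python
# Hypothetical game logs: (player, rested, points). Not the lecture's NBA data.
games = (
    [("star", False, 25)] * 8 + [("star", True, 27)] * 2
    + [("bench", False, 5)] * 2 + [("bench", True, 7)] * 8
)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Aggregate comparison: pool all games and ignore who played them.
aggregate_diff = (mean(p for _, r, p in games if r)
                  - mean(p for _, r, p in games if not r))

# Within-player comparison: rested minus not rested for each player, then average.
players = sorted({name for name, _, _ in games})
within_diffs = [
    mean(p for n, r, p in games if n == name and r)
    - mean(p for n, r, p in games if n == name and not r)
    for name in players
]
within_diff = mean(within_diffs)

print(f"aggregate rested - not rested: {aggregate_diff:+.1f} PPG")
print(f"within-player rested - not rested: {within_diff:+.1f} PPG")
```

every player scores more when rested, yet the pooled comparison says the opposite, because who plays on short rest is not random.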
what remains human
trustworthy analysis (working analyst)
learning (homework, study, building intuition)
HW 5: the audit case. on practice problems, invert it.
| use AI for… | always check… |
|---|---|
| boilerplate code | missing-data handling |
| quick EDA | axis labels, scale, units |
| trying many models | what got dropped; assumptions |
| generating hypotheses | correlation vs. causation |
| drafting reports | honest uncertainty |
let the AI draft. apply your statistical judgment.
don’t say: “analyze this dataset”
say:
Ruta et al. (2025) measured this directly: the same model went from 32.5% accuracy on bare prompts to 92.5% when the analyst named the test and the assumptions to check
the checklist
we promised: by the end of this course, you’d know when not to trust a model’s output
five questions to ask before trusting any analysis: yours, a colleague’s, a vendor’s, an AI’s

Scorecard: SAT filter dropped ~100% of for-profits
Ch 2–3 · today’s worked example 1

Quiz 8 revenue: random R² beat temporal R² by 0.4. the gap is leakage.
Ch 6, 7, 16
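one way a random split can flatter a model on time-ordered data: test rows sit between training rows, so a model that interpolates in time looks accurate. a minimal sketch, assuming a made-up trending series and a nearest-neighbor-in-time predictor. neither is the quiz data or the quiz model.

```python
import math
import random

# Hypothetical daily series with trend and seasonality; not the quiz data.
t_all = list(range(100))
y_all = [0.5 * t + 5 * math.sin(t / 4) for t in t_all]


def predict_nearest(train_t, train_y, t):
    """Predict with the target of the training point closest in time (1-NN on t)."""
    i = min(range(len(train_t)), key=lambda j: abs(train_t[j] - t))
    return train_y[i]


def r2_for_split(test_idx):
    """Hold out test_idx, train on the remaining time points, return test R²."""
    test = set(test_idx)
    train_t = [t for t in t_all if t not in test]
    train_y = [y_all[t] for t in train_t]
    test_t = sorted(test)
    y_true = [y_all[t] for t in test_t]
    y_pred = [predict_nearest(train_t, train_y, t) for t in test_t]
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


random_test = random.Random(0).sample(t_all, 20)  # test rows scattered through time
temporal_test = t_all[-20:]                       # test rows strictly after training rows

r2_random = r2_for_split(random_test)
r2_temporal = r2_for_split(temporal_test)
print(f"random split R² = {r2_random:.3f}, temporal split R² = {r2_temporal:.3f}")
```

under the temporal split this predictor can only repeat the last training value, so its R² cannot exceed 0. the random split hides that by letting it look up neighbors from the future.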

NBA: aggregate effect vanished once we controlled for player
Ch 7, 8, 10, 11, 12 · today’s worked example 2

the audit’s lone unit: next two lectures with I-han turn this into a formal tool
Ch 11 → Ch 18-19 (DAGs, identification)

static analyses miss what changes after deployment
Ch 16 + common sense
audit an analysis
pick an analysis from this week’s news.
walk through the five questions. which items can you check from the article? which would need data or code?
“far better an approximate answer to the right question,
which is often vague, than an exact answer to the wrong question,
which can always be made precise”
John Tukey
we can fit models, validate them, audit them
but does X actually cause Y?
I-han is the causal inference expert on staff. you are in great hands.
stay in touch.
what worked? what didn’t? what would you change?