AutoML, LLMs, and the Future of Data Analysis

You’ve been using random forests since Lecture 6 — fitting them to Airbnb data, watching test error plateau in Lecture 7. Today you learn why they work. And you’ll see what happens when you hand your data to an AI and say “analyze this.”

By the end of today, you’ll understand why knowing statistics makes you a better user of AI tools, not an obsolete one. You’ll also open the black box on trees: how recursive splitting works, why bootstrap samples and feature subsets make trees decorrelated, and how gradient boosting connects back to the gradient descent you saw in Lecture 13.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree, export_text
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12

DATA_DIR = 'data'

How trees and forests work

In Lectures 6–7, you built decision trees and random forests for prediction. Trees split data recursively to minimize prediction error; forests average many overfit trees, using bootstrap samples and feature subsets to decorrelate them so averaging reduces variance. Here we examine which features matter most.

Code
# Load College Scorecard data
scorecard = pd.read_csv(f'{DATA_DIR}/college-scorecard/scorecard.csv')

feature_cols = ['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN', 'RET_FT4',
                'C150_4_POOLED_SUPP', 'CONTROL']
target_col = 'MD_EARN_WNE_P10'

for col in feature_cols + [target_col]:
    scorecard[col] = pd.to_numeric(scorecard[col], errors='coerce')

model_data = scorecard[feature_cols + [target_col]].dropna()
print(f"Complete cases: {len(model_data)} out of {len(scorecard)} ({len(model_data)/len(scorecard):.1%})")
Complete cases: 1236 out of 7703 (16.0%)
Code
earnings_features = model_data[feature_cols]
earnings_target = model_data[target_col]

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(earnings_features, earnings_target)
forest_cv = cross_val_score(forest, earnings_features, earnings_target, cv=5, scoring='r2')
Code
# Feature importances: one of the most practically useful RF outputs
importances = forest.feature_importances_
sorted_idx = np.argsort(importances)

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(np.array(feature_cols)[sorted_idx], importances[sorted_idx], color='#4C72B0')
ax.set_xlabel('Feature importance (mean decrease in impurity)')
ax.set_title('Random forest feature importances for predicting graduate earnings')
plt.tight_layout()
plt.show()

Tip: Think About It

Why does averaging many overfitting trees produce a good model? Hint: each tree overfits to different noise, because each sees a different bootstrap sample and a different feature subset.
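One way to convince yourself is a quick simulation (entirely synthetic data, not the Scorecard): fit one deep tree, then average 50 deep trees grown on bootstrap resamples, and compare test error against the known true signal.

```python
# Toy demonstration: averaging many deep trees grown on bootstrap
# resamples beats any single deep tree, even though every tree overfits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)   # signal + noise
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_true = np.sin(X_test[:, 0])                        # noise-free target

# One fully grown tree: it memorizes the training noise
single = DecisionTreeRegressor(random_state=0).fit(X, y)
err_single = np.mean((single.predict(X_test) - y_true) ** 2)

# 50 fully grown trees, each on its own bootstrap sample; each still
# overfits, but to *different* noise, so the errors partly cancel
preds = []
for b in range(50):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap resample
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    preds.append(tree.predict(X_test))
err_bagged = np.mean((np.mean(preds, axis=0) - y_true) ** 2)

print(f"Single deep tree MSE: {err_single:.3f}")
print(f"Average of 50 trees:  {err_bagged:.3f}")
```

On this toy problem the bagged average has noticeably lower test error even though every individual tree is badly overfit.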

Gradient boosting: learning from mistakes

There’s another way to combine trees. Instead of growing them independently and averaging, gradient boosting grows them sequentially: each new tree is trained on the errors of the previous trees.

Each new tree fits the negative gradient (pseudo-residuals) of the loss. Just as gradient descent (Lecture 13) takes steps in parameter space, gradient boosting takes steps in function space by adding a tree that approximates the negative gradient. Each added tree nudges the ensemble’s prediction a little closer to the truth.
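For squared-error loss the negative gradient is simply the residual, so the whole procedure can be hand-rolled in a few lines (toy data; the depth and learning-rate values below are illustrative choices, not from the lecture):

```python
# Hand-rolled gradient boosting for squared-error loss.
# With L(y, F) = (y - F)^2 / 2, the negative gradient is the residual y - F.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

F = np.full(len(y), y.mean())     # step 0: the constant prediction
learning_rate = 0.1               # illustrative choice
for m in range(100):
    residuals = y - F                         # pseudo-residuals (negative gradient)
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += learning_rate * stump.predict(X)     # one step in function space

mse_start = np.mean((y - y.mean()) ** 2)
mse_final = np.mean((y - F) ** 2)
print(f"MSE of constant model:        {mse_start:.3f}")
print(f"MSE after 100 boosting steps: {mse_final:.3f}")
```

Each iteration mirrors one gradient-descent step: compute the (negative) gradient, then move a small amount in that direction — except the "direction" is a tree, not a parameter update.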

Code
# Gradient boosting: sequential error correction
gboost = GradientBoostingRegressor(n_estimators=100, random_state=42)
gboost.fit(earnings_features, earnings_target)
gboost_cv = cross_val_score(gboost, earnings_features, earnings_target, cv=5, scoring='r2')

print(f"Random Forest:     CV R-sq = {forest_cv.mean():.3f}")
print(f"Gradient Boosting: CV R-sq = {gboost_cv.mean():.3f}")
print("\nGradient boosting often edges out random forests on tabular data -- though not here.")
Random Forest:     CV R-sq = 0.532
Gradient Boosting: CV R-sq = 0.504

Gradient boosting often edges out random forests on tabular data -- though not here.

Now you know what’s inside the black box. Let’s see how AutoML uses these tools — and what it still can’t do.

The AutoML promise

Definition: AutoML

AutoML (automated machine learning) takes this further: tools like auto-sklearn, H2O AutoML, and Google Vertex AI automatically try hundreds of model configurations — including trees, forests, gradient boosting, and others — and pick the best one. No manual feature engineering, no hand-tuned hyperparameters.

Under the hood, some AutoML systems also try additive models and splines — smooth nonlinear models that are more interpretable than random forests but more flexible than linear regression.

Let’s simulate this with a “poor man’s AutoML”: try several models with cross-validation and pick the winner.

Code
# Poor man's AutoML: try 3 models, pick the best by 5-fold CV
# Use Pipeline to prevent data leakage (scaler only sees training folds)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Linear Regression': Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())]),
    'Random Forest\n(100 trees)': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting\n(100 trees)': GradientBoostingRegressor(n_estimators=100, random_state=42),
}

cv_results = {}
print(f"{'Model':<35} {'Mean R-sq':>10} {'Std':>8} {'Mean MAE':>10}")
print("=" * 68)
best_score = -np.inf
best_name = None

features = model_data[feature_cols].values
target = model_data[target_col].values

for name, model in models.items():
    r2_scores = cross_val_score(model, features, target, cv=kf, scoring='r2')
    mae_scores = -cross_val_score(model, features, target, cv=kf, scoring='neg_mean_absolute_error')
    cv_results[name] = r2_scores
    display_name = name.replace('\n', ' ')
    print(f"{display_name:<35} {r2_scores.mean():>10.3f} {r2_scores.std():>8.3f} {mae_scores.mean():>10.0f}")
    if r2_scores.mean() > best_score:
        best_score = r2_scores.mean()
        best_name = display_name

print(f"\nWinner: {best_name} (R-sq = {best_score:.3f})")
print("AutoML would do this with hundreds of models and fancier tuning.")
Model                                Mean R-sq      Std   Mean MAE
====================================================================
Linear Regression                        0.496    0.042       5113
Random Forest (100 trees)                0.524    0.063       4834
Gradient Boosting (100 trees)            0.522    0.044       4770

Winner: Random Forest (100 trees) (R-sq = 0.524)
AutoML would do this with hundreds of models and fancier tuning.
Code
# Visualize the CV results
fig, ax = plt.subplots(figsize=(8, 5))
bp = ax.boxplot([scores for scores in cv_results.values()],
                labels=list(cv_results.keys()),
                patch_artist=True)
colors = ['#4C72B0', '#F0A30A', '#C44E52']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
ax.set_ylabel('R-squared (5-fold CV)')
ax.set_title('Mini AutoML Shootout: Predicting Graduate Earnings')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

Not bad! Random Forest edges out gradient boosting here. AutoML works — for the modeling step.

Tip: Think About It

Before we continue — what did AutoML NOT do here? What decisions did we make before any model was fit?

What AutoML can’t do

Let’s look at what got swept under the rug. What kinds of schools did we lose by requiring complete data?

Code
# What did we lose by dropping missing values?
print("Institutions with complete data vs. all institutions:")
print(f"  All institutions:    {len(scorecard):,}")
print(f"  With SAT_AVG:        {scorecard['SAT_AVG'].notna().sum():,}")
print(f"  With earnings data:  {pd.to_numeric(scorecard[target_col], errors='coerce').notna().sum():,}")
print(f"  Complete cases:      {len(model_data):,}")
Institutions with complete data vs. all institutions:
  All institutions:    7,703
  With SAT_AVG:        1,304
  With earnings data:  5,693
  Complete cases:      1,236
Code
# What kinds of schools are we dropping?
scorecard_analysis = scorecard.copy()
scorecard_analysis['has_data'] = scorecard.index.isin(model_data.index)
control_labels = {1: 'Public', 2: 'Private nonprofit', 3: 'Private for-profit'}
scorecard_analysis['control_label'] = pd.to_numeric(scorecard_analysis['CONTROL'], errors='coerce').map(control_labels)

retention_data = []
print("Representation by institution type:")
for label in ['Public', 'Private nonprofit', 'Private for-profit']:
    total = (scorecard_analysis['control_label'] == label).sum()
    kept = ((scorecard_analysis['control_label'] == label) & scorecard_analysis['has_data']).sum()
    if total > 0:
        pct = kept / total
        print(f"  {label:<25s} {kept:>5d} / {total:>5d} ({pct:.0%} kept)")
        retention_data.append({'type': label, 'kept': kept, 'total': total, 'pct': pct})
Representation by institution type:
  Public                      483 /  2044 (24% kept)
  Private nonprofit           748 /  1956 (38% kept)
  Private for-profit            5 /  3703 (0% kept)
Code
# Visualize the selection bias
retention_df = pd.DataFrame(retention_data)
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(retention_df['type'], retention_df['pct'], color=['#4C72B0', '#F0A30A', '#C44E52'])
ax.set_xlabel('Fraction retained in analysis')
ax.set_title('Which schools survive the missing-data filter?')
ax.set_xlim(0, 1)
for bar, row in zip(bars, retention_data):
    ax.text(bar.get_width() + 0.02, bar.get_y() + bar.get_height()/2,
            f"{row['pct']:.0%}", va='center')
plt.tight_layout()
plt.show()

This exclusion pattern is selection bias from Lecture 3. By requiring SAT scores, we’ve excluded community colleges, trade schools, and most for-profit institutions — precisely the schools where the relationship between institutional characteristics and earnings might be different.

AutoML optimized R-squared on the available data. It has no way to know the available data isn’t representative.
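A small simulation (entirely synthetic; the numbers are our own toy choices) makes the danger concrete: a model fit only to the selected subsample can estimate a relationship that is badly wrong for the excluded majority, and cross-validation on the selected rows alone will never flag it.

```python
# Synthetic illustration of selection bias: fit on the "complete cases",
# then see what that misses for the excluded schools.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(0, 1, n)                  # an institution characteristic
selective = rng.random(n) < 0.3          # only 30% report complete data
# The x-earnings relationship differs sharply between the two groups
y = 40000 + 8000 * x * selective + 2000 * x * ~selective + rng.normal(0, 3000, n)

model = LinearRegression().fit(x[selective].reshape(-1, 1), y[selective])
slope = model.coef_[0]
print(f"Slope fit on complete cases: {slope:,.0f} (true value 8,000)")
print("True slope among excluded schools: 2,000")
# Cross-validating on the complete cases alone would report a good fit
# and never reveal the model is ~4x too steep for the excluded majority.
```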

The LLM revolution

Large Language Models (LLMs) like ChatGPT, Claude, and Gemini go further than AutoML. You saw LLM featurization in Lecture 6 — using an LLM to extract structured features from unstructured text. Today we see how LLMs go beyond featurization to full analysis: you can say “analyze this dataset” in plain English and get working code, visualizations, and interpretations back.

Let’s see what happens when we ask an LLM to analyze data. Below is the kind of code an LLM might produce — and it looks reasonable.

Code
# What an LLM might generate when asked:
# "Load the College Scorecard data and find what predicts higher earnings"
# (LLM-generated code tends to use short variable names like this)

# Step 1: Load data (LLM gets this right)
sc = pd.read_csv(f'{DATA_DIR}/college-scorecard/scorecard.csv')

# Step 2: "Clean" the data (LLM drops rows with any missing values)
numeric_cols = ['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN',
                'MD_EARN_WNE_P10', 'RET_FT4']
for col in numeric_cols:
    sc[col] = pd.to_numeric(sc[col], errors='coerce')

sc_clean = sc.dropna(subset=numeric_cols)
print(f"LLM 'cleaned' data: {len(sc_clean)} rows (dropped {len(sc) - len(sc_clean)})")
LLM 'cleaned' data: 1239 rows (dropped 6464)
Code
# Step 3: Fit a regression (LLM picks a standard model)
X_llm = sc_clean[['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN', 'RET_FT4']]
y_llm = sc_clean['MD_EARN_WNE_P10']

model_llm = LinearRegression()
model_llm.fit(X_llm, y_llm)

# Notice: this is R-squared on the *training* data -- no cross-validation.
# From Lecture 7, we know this is optimistic.
print(f"R-squared: {model_llm.score(X_llm, y_llm):.3f}")
print("\nCoefficients:")
for feat, coef in zip(X_llm.columns, model_llm.coef_):
    print(f"  {feat:<20s} {coef:+.2f}")
print(f"\nLLM conclusion: 'SAT scores are the strongest predictor of earnings.'")
R-squared: 0.499

Coefficients:
  SAT_AVG              +24.92
  UGDS                 -0.00
  PCTPELL              -12459.42
  PCTFLOAN             +2207.48
  RET_FT4              +29289.83

LLM conclusion: 'SAT scores are the strongest predictor of earnings.'

The code is clean. The R-squared is decent. The LLM would probably wrap this up with a confident interpretation: “Higher SAT scores are associated with higher post-graduation earnings, suggesting that more selective institutions produce better-earning graduates.”

The code runs. The output looks reasonable. But the conclusions are wrong. Let’s count the problems.

Where LLMs fail

Problem 1: Silent data loss

The LLM dropped most of the data without flagging it. The missingness here is not random — schools that don’t report SAT scores are systematically different from those that do. This pattern is the non-random missingness discussed in Lecture 2: schools with low SAT averages may be less likely to report them, and schools that don’t require SATs (community colleges, trade schools) have no score to report at all.

Code
# The LLM didn't check what it dropped
print("What the LLM threw away:")
print(f"  Total institutions: {len(sc):,}")
print(f"  Kept: {len(sc_clean):,} ({len(sc_clean)/len(sc):.0%})")
print(f"  Dropped: {len(sc) - len(sc_clean):,} ({(len(sc) - len(sc_clean))/len(sc):.0%})")
What the LLM threw away:
  Total institutions: 7,703
  Kept: 1,239 (16%)
  Dropped: 6,464 (84%)

Problem 2: Confusing correlation with causation

The LLM said SAT scores “predict” higher earnings. But does attending a higher-SAT school cause you to earn more? Or do higher-SAT schools simply admit students who would have earned more anyway?

This result reflects a confounding problem — the kind you saw in Lecture 11, formalized with DAGs in Lecture 18. Family income, parental education, and geographic location all affect both SAT scores and earnings. The LLM has no way to distinguish confounded association from causal effect.
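A quick simulation (our own toy numbers, not the Scorecard data) shows how this plays out: even when SAT has zero causal effect on earnings, regressing earnings on SAT alone recovers a large coefficient, while adjusting for the confounder recovers roughly zero.

```python
# Simulated confounding: family income drives both SAT and earnings,
# while SAT has NO causal effect on earnings at all.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 2000
family_income = rng.normal(0, 1, n)                        # the confounder
sat = 1100 + 150 * family_income + rng.normal(0, 60, n)
earnings = 45000 + 9000 * family_income + rng.normal(0, 5000, n)  # no SAT term

naive = LinearRegression().fit(sat.reshape(-1, 1), earnings)
adjusted = LinearRegression().fit(np.column_stack([sat, family_income]), earnings)

print(f"Naive SAT coefficient:    {naive.coef_[0]:.1f} dollars/point")
print(f"Adjusted SAT coefficient: {adjusted.coef_[0]:.1f} dollars/point")
```

The naive regression finds a large SAT "effect" that is entirely the confounder's doing; conditioning on family income drives the coefficient back toward zero.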

Code
# Show the confounding: Pell Grant % (a proxy for family income) correlates with both
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].scatter(sc_clean['PCTPELL'], sc_clean['SAT_AVG'], alpha=0.4, s=15)
axes[0].set_xlabel('Pell Grant Recipients (%)')
axes[0].set_ylabel('Average SAT Score')
axes[0].set_title('Pell % vs. SAT Average')

axes[1].scatter(sc_clean['PCTPELL'], sc_clean['MD_EARN_WNE_P10'], alpha=0.4, s=15)
axes[1].set_xlabel('Pell Grant Recipients (%)')
axes[1].set_ylabel('Median Earnings (10yr post)')
axes[1].set_title('Pell % vs. Median Earnings')

plt.tight_layout()
plt.show()

print("Family income (proxied by Pell %) drives BOTH SAT scores AND earnings.")
print("The SAT-earnings correlation is largely confounded by socioeconomic status.")

Family income (proxied by Pell %) drives BOTH SAT scores AND earnings.
The SAT-earnings correlation is largely confounded by socioeconomic status.

Problem 3: No assumption checking

The LLM fit a linear regression without checking:

  • Are the residuals approximately normal? (Probably not — earnings are right-skewed)
  • Is the relationship actually linear?
  • Are there influential outliers?

Recall from Lecture 12: regression diagnostics aren’t optional. They tell you whether your model’s conclusions are trustworthy.

Code
# Quick residual check the LLM skipped
residuals = y_llm - model_llm.predict(X_llm)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].scatter(model_llm.predict(X_llm), residuals, alpha=0.3, s=10)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Earnings')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Fitted (LLM skipped this)')
# Annotate the fan shape
axes[0].annotate('Variance increases\nwith predicted earnings',
                 xy=(0.95, 0.95), xycoords='axes fraction',
                 ha='right', va='top', fontsize=10, color='red', style='italic')

axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=np.median(residuals), color='red', linestyle='--', label='Median')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Count')
axes[1].set_title('Residual Distribution (right-skewed)')
axes[1].legend()

plt.tight_layout()
plt.show()

print("The residuals are right-skewed and show heteroscedasticity.")
print("The linear model's confidence intervals are unreliable.")
print("The LLM didn't check.")

The residuals are right-skewed and show heteroscedasticity.
The linear model's confidence intervals are unreliable.
The LLM didn't check.

Problem 4: Hallucinated confidence

Warning: LLMs Don't Calibrate Confidence

LLMs don’t reliably calibrate their confidence. They may hedge on a trivial calculation, then deliver a complex causal claim in an equally confident tone. An LLM will:

  • Report a p-value without checking if the test’s assumptions are met
  • Claim a result is “statistically significant” without considering multiple comparisons
  • Make causal claims from observational data
Tip: Think About It

Would you trust a colleague who never says “I don’t know”? What would you do differently if you knew your analysis tool couldn’t distinguish what it knows from what it’s guessing?

LLM analysis vs. your analysis

Let’s test this concretely. If you asked an LLM “Do NBA players perform better after rest?”, here’s roughly what it would do.

Code
# Load NBA data
nba = pd.read_csv(f'{DATA_DIR}/nba/nba_load_management.csv')
nba['GAME_DATE'] = pd.to_datetime(nba['GAME_DATE'])
print(f"NBA dataset: {len(nba):,} player-games")
print(f"Seasons: {nba['SEASON'].unique()}")
NBA dataset: 78,335 player-games
Seasons: <StringArray>
['2021-22', '2022-23', '2023-24']
Length: 3, dtype: str
Code
# What an LLM would do: run a simple comparison
rested = nba[nba['REST_DAYS'] >= 3]['PTS']
not_rested = nba[nba['REST_DAYS'] <= 1]['PTS']

t_stat, p_value = stats.ttest_ind(rested, not_rested)

print("LLM's analysis: 'Do players score more when rested?'")
print(f"  Rested (3+ days):      {rested.mean():.1f} PPG (n={len(rested):,})")
print(f"  Not rested (0-1 days): {not_rested.mean():.1f} PPG (n={len(not_rested):,})")
print(f"  t-statistic: {t_stat:.2f}")
print(f"  p-value: {p_value:.2e}")
print()
if rested.mean() < not_rested.mean():
    print("Wait -- rested players score FEWER points?")
    print("The LLM would either report this surprising finding uncritically")
    print("or quietly reverse the groups to match its prior belief.")
LLM's analysis: 'Do players score more when rested?'
  Rested (3+ days):      9.3 PPG (n=23,653)
  Not rested (0-1 days): 11.3 PPG (n=11,288)
  t-statistic: -20.50
  p-value: 7.78e-93

Wait -- rested players score FEWER points?
The LLM would either report this surprising finding uncritically
or quietly reverse the groups to match its prior belief.

Wait — rested players score fewer points? That’s backwards from what you’d expect.

Tip: Think About It

Why might this be backwards? Think about which players tend to have many rest days, and which players play on back-to-backs. Take 30 seconds before reading on.

This reversal is Simpson’s paradox from Lecture 11. The aggregate comparison is backwards because rest is confounded with player quality:

  • Bench players (low scorers) have more rest days — they don’t play every game
  • Stars (high scorers) play on back-to-backs — coaches need them

The LLM also used ttest_ind without noting that the same players appear many times — the observations within a player are correlated, violating the independence assumption.
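One standard fix is to collapse the data to one summary per player and test those summaries, since per-player averages are independent across players. A sketch on simulated data (the player counts, noise levels, and the +1 "true effect" are our own toy numbers):

```python
# Fix for clustered observations: collapse to one rested-minus-tired
# difference per player, then run a single one-sample t-test on those
# independent per-player summaries.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_players, games_each = 200, 40
true_effect = 1.0                     # assumed small rest benefit (toy value)
diffs = []
for _ in range(n_players):
    skill = rng.normal(10, 5)         # player-level mean: the clustering
    rested = skill + true_effect + rng.normal(0, 6, games_each)
    tired = skill + rng.normal(0, 6, games_each)
    diffs.append(rested.mean() - tired.mean())  # one summary per player

t, p = stats.ttest_1samp(diffs, 0)    # observations independent across players
print(f"t = {t:.2f}, p = {p:.2e}")
```

Player skill cancels inside each per-player difference, so the clustering that breaks `ttest_ind` on the pooled rows does no harm here.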

The LLM doesn’t know any of this. It would report the naive comparison and either:

  1. Confidently claim rest hurts performance (wrong interpretation)
  2. Hedge with “more analysis needed” (correct but unhelpful)

For the within-player analysis we switch from PTS to GAME_SCORE, a composite metric that accounts for rebounds, assists, turnovers, and efficiency — a better measure of overall performance than raw points.
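GAME_SCORE is plausibly Hollinger's game score — that is an assumption about this dataset, so check its documentation — but for reference the standard formula is:

```python
# Hollinger's game score. Whether the dataset's GAME_SCORE column uses
# exactly this formula is an assumption; verify against the data docs.
def game_score(pts, fgm, fga, fta, ftm, orb, drb, stl, ast, blk, pf, tov):
    return (pts + 0.4 * fgm - 0.7 * fga - 0.4 * (fta - ftm)
            + 0.7 * orb + 0.3 * drb + stl + 0.7 * ast + 0.7 * blk
            - 0.4 * pf - tov)

# An efficient 30-point night (11/18 shooting, 8/8 free throws):
gs = game_score(30, 11, 18, 8, 8, 1, 5, 2, 4, 1, 2, 3)
print(f"Game score: {gs:.1f}")
```

Inefficient shooting and turnovers pull the score down, which is why it tracks overall performance better than raw points.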

Code
# What WE know to do: control for player identity
# Within-player comparison (from Lecture 11)
player_effects = []
players_tested = 0
significant_naive = 0

for player in nba['PLAYER_NAME'].unique():
    pdata = nba[nba['PLAYER_NAME'] == player]
    rested_p = pdata[pdata['REST_DAYS'] >= 3]['GAME_SCORE']
    tired_p = pdata[pdata['REST_DAYS'] <= 1]['GAME_SCORE']

    if len(rested_p) >= 10 and len(tired_p) >= 10:
        t, p = stats.ttest_ind(rested_p, tired_p)
        players_tested += 1
        if p < 0.05:
            significant_naive += 1
        player_effects.append({
            'player': player,
            'diff': rested_p.mean() - tired_p.mean(),
            'p_value': p
        })
Code
# Summarize the within-player results
effects_df = pd.DataFrame(player_effects)

print(f"Within-player analysis (the right way):")
print(f"  Players tested: {players_tested}")
print(f"  'Significant' at p<0.05: {significant_naive} ({significant_naive/players_tested:.1%})")
print(f"  Expected by chance (5%): {players_tested * 0.05:.0f}")
print()
print(f"  Mean effect (rested - tired): {effects_df['diff'].mean():+.2f} game score points")
print(f"  Median effect:                {effects_df['diff'].median():+.2f}")
print()
print("After controlling for player identity, the effect shrinks dramatically.")
print("Most 'significant' results are likely false positives from")
print("multiple testing (Lecture 11).")
print()
print("Even this within-player t-test is a simplification -- a proper analysis")
print("would account for game context. But the point is that controlling for")
print("player identity greatly reduces the naive finding.")
Within-player analysis (the right way):
  Players tested: 431
  'Significant' at p<0.05: 37 (8.6%)
  Expected by chance (5%): 22

  Mean effect (rested - tired): -0.59 game score points
  Median effect:                -0.63

After controlling for player identity, the effect shrinks dramatically.
Most 'significant' results are likely false positives from
multiple testing (Lecture 11).

Even this within-player t-test is a simplification -- a proper analysis
would account for game context. But the point is that controlling for
player identity greatly reduces the naive finding.

The LLM gave you a result in seconds. But it took seventeen lectures of statistical training to know:

  1. The aggregate comparison is confounded — this is Simpson’s paradox (Lecture 11), formalized causally in Lecture 18
  2. You need within-player comparisons (Lecture 11)
  3. Testing many players inflates false positives (Lecture 11)
  4. The effect size is tiny even when “significant” (Lecture 12)

The LLM got the computation right. It got the thinking wrong.

What remains irreducibly human

AI tools are getting better fast. But some things remain fundamentally human:

Asking the right question. “What predicts earnings?” is a different question from “What causes higher earnings?” or “Which schools provide the most value-added?” The data can’t tell you which question to ask.

Note: Tukey on Asking the Right Question

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

You first heard this quote in Lecture 1. Seventeen lectures later, the meaning is concrete: the LLM gave a precise answer to the wrong question.

Knowing the domain. Why are for-profit schools missing SAT data? Why do bench players have more rest days? Why does AQI spike in August? Domain knowledge is the difference between a plausible-looking wrong answer and a correct one.

Questioning assumptions. Every statistical method makes assumptions. LLMs rarely check them. You now know to ask: Is this data representative? Is the relationship causal? Are the residuals well-behaved? Are we testing too many hypotheses?

Understanding stakes and consequences. A wrong prediction about air quality could mean people go outside during a hazardous event. A wrong conclusion about school quality could redirect billions in funding. The statistical method is the same; the consequences are not.

Communicating uncertainty honestly. LLMs project confidence. Good statisticians communicate uncertainty. “The model suggests X, but the confidence interval is wide and the data is missing for the most vulnerable populations” is more honest — and more useful — than “X.”

Tip: Think About It

Which of these skills could an AI eventually learn? Which require human judgment that no amount of training data can replace?

A practical guide: using AI tools well

AI tools aren’t going away. Here’s how to use them as a statistically literate person:

Use AI for…                    But always check…
Writing boilerplate code       Does it handle missing data correctly?
Quick EDA and visualization    Are the axes labeled? Is the scale misleading?
Trying multiple models         What data was dropped? What assumptions were violated?
Generating hypotheses          Is this correlation or causation?
Drafting reports               Are the uncertainty statements honest?

The pattern: let the AI draft, then apply your statistical judgment. This principle is why you took the course.

This checklist is worth keeping — save it for when you use AI tools in your projects.

Prompt decomposition: how to use AI effectively

Don’t ask an LLM to “analyze this dataset.” Break the task into smaller steps where you can verify each one:

  1. Load and inspect: “Load this CSV and show me the first few rows, dtypes, and missing value counts.”
  2. Check missingness: “Which columns have missing values? Show me how missingness relates to other variables.”
  3. Fit a model: “Fit a linear regression of Y on X1, X2, X3. Show residual diagnostics.”
  4. Check your work: “What assumptions did this analysis make? What could go wrong?”

Each step gives you a checkpoint. If the LLM silently drops 84% of the data in step 1, you catch it before it propagates through the entire analysis.
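Step 1 can be wrapped in a small helper you reuse at the top of every analysis (the function name and output format here are our own illustrative choices):

```python
# Hypothetical helper for step 1: surface shape, head, dtypes, and
# missingness before any modeling happens.
import pandas as pd

def inspect_dataframe(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Print basic structure and return a per-column missingness summary."""
    print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(df.head(n))
    summary = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'pct_missing': (df.isna().mean() * 100).round(1),
    })
    print(summary)
    return summary

# Usage on a toy frame:
toy = pd.DataFrame({'a': [1, 2, None], 'b': ['x', None, 'z']})
report = inspect_dataframe(toy)
```

If the missingness summary shows 84% of rows incomplete, you find out at the checkpoint — not after the model is fit.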

Tip: Think About It

You’ve probably used AI tools for homework this quarter. Think of a time the AI gave you something that looked right but wasn’t — or a time you caught a mistake the AI missed. What course concept helped you catch it?

Key takeaways

  • AutoML finds good models automatically, but it can’t choose the right question, check data quality, or interpret results in context. It optimizes math, not thinking.

  • LLMs write working code fast, but they don’t check assumptions, notice subtle data issues, or distinguish correlation from causation. They’re confidently wrong about causal claims.

  • Trees, forests, and boosting are the workhorse models behind most AutoML systems. Decision trees split data recursively; random forests average many trees to reduce variance; gradient boosting trains trees sequentially on residuals.

  • The biggest errors aren’t computational. They’re conceptual: asking the wrong question, using biased data, confusing correlation with causation, ignoring multiple testing. These are human judgment calls.

  • AI tools are powerful assistants. Use them for drafting code, trying models, and generating hypotheses. But always apply your statistical judgment before trusting results.

Note: The Goal

The goal of this course was never to make you compute faster than a machine. It was to make you think better than one.

Study guide

Key ideas

  • Decision tree: a model that recursively partitions the feature space by splitting on one feature at a time, predicting the average outcome in each leaf.
  • Node, split, leaf: a node is a decision point in the tree; a split is the threshold that divides data at a node; a leaf is a terminal node that makes a prediction.
  • Random forest: an ensemble of many decision trees, each trained on a bootstrap sample with random feature subsets, whose predictions are averaged.
  • Bagging (bootstrap aggregation): training many models on random resamples of the data and averaging their predictions to reduce variance.
  • Gradient boosting: an ensemble method that trains trees sequentially, with each tree correcting the errors of the previous ones by following the gradient of the loss function.
  • AutoML: automated machine learning — systematic search over models, features, and hyperparameters with cross-validation.
  • Selection bias: systematic exclusion of certain groups from an analysis, leading to conclusions that don’t generalize.
  • LLM (Large Language Model): an AI model trained on text that can generate code and analysis, but lacks the ability to check statistical assumptions or reason causally.
  • Prompt decomposition: breaking an analysis into small, verifiable steps so you can catch AI errors before they propagate.

Computational tools

  • DecisionTreeRegressor(max_depth=k) — fit a decision tree with controlled depth
  • plot_tree(tree, feature_names=...) — visualize a fitted decision tree
  • RandomForestRegressor(n_estimators=100) — fit a random forest (average of 100 trees)
  • GradientBoostingRegressor(n_estimators=100) — fit a gradient boosting model (100 sequential trees)
  • cross_val_score(model, X, y, cv=5) — evaluate model with k-fold cross-validation
  • Pipeline([('scaler', StandardScaler()), ('model', ...)]) — chain preprocessing and modeling to prevent data leakage

For the quiz

  • Understand how a decision tree makes a prediction (follow splits from root to leaf).
  • Know why deep trees overfit and how random forests fix this (bagging reduces variance).
  • Be able to explain the connection between gradient boosting and gradient descent.
  • Know the four main ways LLMs fail at data analysis (silent data loss, confounding correlation with causation, skipping assumption checks, hallucinated confidence).
  • Understand why prompt decomposition helps catch AI errors.
  • Be able to identify selection bias in a dataset where rows are dropped for missing values.