---
title: "AutoML, LLMs, and the Future of Data Analysis"
execute:
enabled: true
jupyter: python3
---
You've been using random forests since Lecture 6 — fitting them to Airbnb data, watching test error plateau in Lecture 7. Today you learn *why* they work. And you'll see what happens when you hand your data to an AI and say "analyze this."
By the end of today, you'll understand why knowing statistics makes you a *better* user of AI tools, not an obsolete one. You'll also open the black box on trees: how recursive splitting works, why bootstrap samples and feature subsets make trees decorrelated, and how gradient boosting connects back to the gradient descent you saw in Lecture 13.
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree, export_text
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['font.size'] = 12
DATA_DIR = 'data'
```
## How trees and forests work
In Lectures 6--7, you built decision trees and random forests for prediction. Trees split data recursively to minimize prediction error; forests average many overfit trees, using bootstrap samples and feature subsets to decorrelate them so averaging reduces variance. Here we examine which features matter most.
```{python}
# Load College Scorecard data
scorecard = pd.read_csv(f'{DATA_DIR}/college-scorecard/scorecard.csv')
feature_cols = ['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN', 'RET_FT4',
'C150_4_POOLED_SUPP', 'CONTROL']
target_col = 'MD_EARN_WNE_P10'
for col in feature_cols + [target_col]:
scorecard[col] = pd.to_numeric(scorecard[col], errors='coerce')
model_data = scorecard[feature_cols + [target_col]].dropna()
print(f"Complete cases: {len(model_data)} out of {len(scorecard)} ({len(model_data)/len(scorecard):.1%})")
```
```{python}
earnings_features = model_data[feature_cols]
earnings_target = model_data[target_col]
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(earnings_features, earnings_target)
forest_cv = cross_val_score(forest, earnings_features, earnings_target, cv=5, scoring='r2')
```
```{python}
# Feature importances: one of the most practically useful RF outputs
importances = forest.feature_importances_
sorted_idx = np.argsort(importances)
fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(np.array(feature_cols)[sorted_idx], importances[sorted_idx], color='#4C72B0')
ax.set_xlabel('Feature importance (mean decrease in impurity)')
ax.set_title('Random forest feature importances for predicting graduate earnings')
plt.tight_layout()
plt.show()
```
:::{.callout-tip}
## Think About It
Why does averaging many overfitting trees produce a good model? Hint: each tree overfits to *different* noise, given that it sees different bootstrap samples and different feature subsets.
:::
### Gradient boosting: learning from mistakes
There's another way to combine trees. Instead of growing them independently and averaging, **gradient boosting** grows them *sequentially*: each new tree is trained on the *errors* of the previous trees.
Each new tree fits the negative gradient (pseudo-residuals) of the loss. Just as gradient descent (Lecture 13) takes steps in parameter space, gradient boosting takes steps in *function space* by adding a tree that approximates the negative gradient. Each added tree nudges the ensemble's prediction a little closer to the truth.
```{python}
# Gradient boosting: sequential error correction
gboost = GradientBoostingRegressor(n_estimators=100, random_state=42)
gboost.fit(earnings_features, earnings_target)
gboost_cv = cross_val_score(gboost, earnings_features, earnings_target, cv=5, scoring='r2')
print(f"Random Forest: CV R-sq = {forest_cv.mean():.3f}")
print(f"Gradient Boosting: CV R-sq = {gboost_cv.mean():.3f}")
print(f"\nGradient boosting often edges out random forests on tabular data.")
```
Now you know what's inside the black box. Let's see how AutoML uses these tools — and what it still can't do.
## The AutoML promise
:::{.callout-important}
## Definition: AutoML
**AutoML** (automated machine learning) takes this further: tools like auto-sklearn, H2O AutoML, and Google Vertex AI automatically try hundreds of model configurations — including trees, forests, gradient boosting, and others — and pick the best one. No feature engineering, no hyperparameter tuning.
:::
Under the hood, some AutoML systems also try additive models and splines — smooth nonlinear models that are more interpretable than random forests but more flexible than linear regression.
Let's simulate this with a "poor man's AutoML": try several models with cross-validation and pick the winner.
```{python}
# Poor man's AutoML: try 3 models, pick the best by 5-fold CV
# Use Pipeline to prevent data leakage (scaler only sees training folds)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
'Linear Regression': Pipeline([('scaler', StandardScaler()), ('model', LinearRegression())]),
'Random Forest\n(100 trees)': RandomForestRegressor(n_estimators=100, random_state=42),
'Gradient Boosting\n(100 trees)': GradientBoostingRegressor(n_estimators=100, random_state=42),
}
cv_results = {}
print(f"{'Model':<35} {'Mean R-sq':>10} {'Std':>8} {'Mean MAE':>10}")
print("=" * 68)
best_score = -np.inf
best_name = None
for name, model in models.items():
features = model_data[feature_cols].values
target = model_data[target_col].values
r2_scores = cross_val_score(model, features, target, cv=kf, scoring='r2')
mae_scores = -cross_val_score(model, features, target, cv=kf, scoring='neg_mean_absolute_error')
cv_results[name] = r2_scores
display_name = name.replace('\n', ' ')
print(f"{display_name:<35} {r2_scores.mean():>10.3f} {r2_scores.std():>8.3f} {mae_scores.mean():>10.0f}")
if r2_scores.mean() > best_score:
best_score = r2_scores.mean()
best_name = display_name
print(f"\nWinner: {best_name} (R-sq = {best_score:.3f})")
print("AutoML would do this with hundreds of models and fancier tuning.")
```
```{python}
# Visualize the CV results
fig, ax = plt.subplots(figsize=(8, 5))
bp = ax.boxplot([scores for scores in cv_results.values()],
labels=list(cv_results.keys()),
patch_artist=True)
colors = ['#4C72B0', '#F0A30A', '#C44E52']
for patch, color in zip(bp['boxes'], colors):
patch.set_facecolor(color)
patch.set_alpha(0.7)
ax.set_ylabel('R-squared (5-fold CV)')
ax.set_title('Mini AutoML Shootout: Predicting Graduate Earnings')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
```
Not bad! Random Forest edges out gradient boosting here. AutoML *works* — for the modeling step.
:::{.callout-tip}
## Think About It
Before we continue — what did AutoML NOT do here? What decisions did *we* make before any model was fit?
:::
## What AutoML can't do
Let's look at what got swept under the rug. What kinds of schools did we lose by requiring complete data?
```{python}
# What did we lose by dropping missing values?
print("Institutions with complete data vs. all institutions:")
print(f" All institutions: {len(scorecard):,}")
print(f" With SAT_AVG: {scorecard['SAT_AVG'].notna().sum():,}")
print(f" With earnings data: {pd.to_numeric(scorecard[target_col], errors='coerce').notna().sum():,}")
print(f" Complete cases: {len(model_data):,}")
```
```{python}
# What kinds of schools are we dropping?
scorecard_analysis = scorecard.copy()
scorecard_analysis['has_data'] = scorecard.index.isin(model_data.index)
control_labels = {1: 'Public', 2: 'Private nonprofit', 3: 'Private for-profit'}
scorecard_analysis['control_label'] = pd.to_numeric(scorecard_analysis['CONTROL'], errors='coerce').map(control_labels)
retention_data = []
print("Representation by institution type:")
for label in ['Public', 'Private nonprofit', 'Private for-profit']:
total = (scorecard_analysis['control_label'] == label).sum()
kept = ((scorecard_analysis['control_label'] == label) & scorecard_analysis['has_data']).sum()
if total > 0:
pct = kept / total
print(f" {label:<25s} {kept:>5d} / {total:>5d} ({pct:.0%} kept)")
retention_data.append({'type': label, 'kept': kept, 'total': total, 'pct': pct})
```
```{python}
# Visualize the selection bias
retention_df = pd.DataFrame(retention_data)
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(retention_df['type'], retention_df['pct'], color=['#4C72B0', '#F0A30A', '#C44E52'])
ax.set_xlabel('Fraction retained in analysis')
ax.set_title('Which schools survive the missing-data filter?')
ax.set_xlim(0, 1)
for bar, row in zip(bars, retention_data):
ax.text(bar.get_width() + 0.02, bar.get_y() + bar.get_height()/2,
f"{row['pct']:.0%}", va='center')
plt.tight_layout()
plt.show()
```
This exclusion pattern is **selection bias** from Lecture 3. By requiring SAT scores, we've excluded community colleges, trade schools, and most for-profit institutions — precisely the schools where the relationship between institutional characteristics and earnings might be *different*.
AutoML optimized R-squared on the available data. It has no way to know the available data isn't representative.
## The LLM revolution
**Large Language Models** (**LLMs**) like ChatGPT, Claude, and Gemini go further than AutoML. You saw LLM featurization in Lecture 6 — using an LLM to extract structured features from unstructured text. Today we see how LLMs go beyond featurization to full analysis: you can say "analyze this dataset" in plain English and get working code, visualizations, and interpretations back.
Let's see what happens when we ask an LLM to analyze data. Below is the kind of code an LLM might produce — and it *looks* reasonable.
```{python}
# What an LLM might generate when asked:
# "Load the College Scorecard data and find what predicts higher earnings"
# (LLM-generated code tends to use short variable names like this)
# Step 1: Load data (LLM gets this right)
sc = pd.read_csv(f'{DATA_DIR}/college-scorecard/scorecard.csv')
# Step 2: "Clean" the data (LLM drops rows with any missing values)
numeric_cols = ['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN',
'MD_EARN_WNE_P10', 'RET_FT4']
for col in numeric_cols:
sc[col] = pd.to_numeric(sc[col], errors='coerce')
sc_clean = sc.dropna(subset=numeric_cols)
print(f"LLM 'cleaned' data: {len(sc_clean)} rows (dropped {len(sc) - len(sc_clean)})")
```
```{python}
# Step 3: Fit a regression (LLM picks a standard model)
X_llm = sc_clean[['SAT_AVG', 'UGDS', 'PCTPELL', 'PCTFLOAN', 'RET_FT4']]
y_llm = sc_clean['MD_EARN_WNE_P10']
model_llm = LinearRegression()
model_llm.fit(X_llm, y_llm)
# Notice: this is R-squared on the *training* data -- no cross-validation.
# From Lecture 7, we know this is optimistic.
print(f"R-squared: {model_llm.score(X_llm, y_llm):.3f}")
print("\nCoefficients:")
for feat, coef in zip(X_llm.columns, model_llm.coef_):
print(f" {feat:<20s} {coef:+.2f}")
print(f"\nLLM conclusion: 'SAT scores are the strongest predictor of earnings.'")
```
The code is clean. The R-squared is decent. The LLM would probably wrap this up with a confident interpretation: *"Higher SAT scores are associated with higher post-graduation earnings, suggesting that more selective institutions produce better-earning graduates."*
The code runs. The output looks reasonable. But the *conclusions* are wrong. Let's count the problems.
## Where LLMs fail
### Problem 1: Silent data loss
The LLM dropped most of the data without flagging it. The missingness here is not random — schools that don't report SAT scores are *systematically different* from those that do. This pattern is the non-random missingness discussed in Lecture 2: schools with low SAT averages may be less likely to report them, and schools that don't require SATs (community colleges, trade schools) have no score to report at all.
```{python}
# The LLM didn't check what it dropped
print("What the LLM threw away:")
print(f" Total institutions: {len(sc):,}")
print(f" Kept: {len(sc_clean):,} ({len(sc_clean)/len(sc):.0%})")
print(f" Dropped: {len(sc) - len(sc_clean):,} ({(len(sc) - len(sc_clean))/len(sc):.0%})")
```
### Problem 2: Confusing correlation with causation
The LLM said SAT scores "predict" higher earnings. But does attending a higher-SAT school *cause* you to earn more? Or do higher-SAT schools simply admit students who would have earned more anyway?
This result reflects a **confounding** problem — the kind you saw in Lecture 11, formalized with DAGs in Lecture 18. Family income, parental education, and geographic location all affect both SAT scores and earnings. The LLM has no way to distinguish confounded association from causal effect.
```{python}
# Show the confounding: Pell Grant % (a proxy for family income) correlates with both
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(sc_clean['PCTPELL'], sc_clean['SAT_AVG'], alpha=0.4, s=15)
axes[0].set_xlabel('Pell Grant Recipients (%)')
axes[0].set_ylabel('Average SAT Score')
axes[0].set_title('Pell % vs. SAT Average')
axes[1].scatter(sc_clean['PCTPELL'], sc_clean['MD_EARN_WNE_P10'], alpha=0.4, s=15)
axes[1].set_xlabel('Pell Grant Recipients (%)')
axes[1].set_ylabel('Median Earnings (10yr post)')
axes[1].set_title('Pell % vs. Median Earnings')
plt.tight_layout()
plt.show()
print("Family income (proxied by Pell %) drives BOTH SAT scores AND earnings.")
print("The SAT-earnings correlation is largely confounded by socioeconomic status.")
```
### Problem 3: No assumption checking
The LLM fit a linear regression without checking:
- Are the residuals approximately normal? (Probably not — earnings are right-skewed)
- Is the relationship actually linear?
- Are there influential outliers?
Recall from Lecture 12: regression diagnostics aren't optional. They tell you whether your model's conclusions are trustworthy.
```{python}
# Quick residual check the LLM skipped
residuals = y_llm - model_llm.predict(X_llm)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model_llm.predict(X_llm), residuals, alpha=0.3, s=10)
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted Earnings')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Fitted (LLM skipped this)')
# Annotate the fan shape
axes[0].annotate('Variance increases\nwith predicted earnings',
xy=(0.95, 0.95), xycoords='axes fraction',
ha='right', va='top', fontsize=10, color='red', style='italic')
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=np.median(residuals), color='red', linestyle='--', label='Median')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Count')
axes[1].set_title('Residual Distribution (right-skewed)')
axes[1].legend()
plt.tight_layout()
plt.show()
print("The residuals are right-skewed and show heteroscedasticity.")
print("The linear model's confidence intervals are unreliable.")
print("The LLM didn't check.")
```
### Problem 4: Hallucinated confidence
:::{.callout-warning}
## LLMs Don't Calibrate Confidence
LLMs don't reliably calibrate their confidence. They may hedge on a simple calculation and then present a complex causal claim with the same confident tone. An LLM will:
- Report a p-value without checking if the test's assumptions are met
- Claim a result is "statistically significant" without considering multiple comparisons
- Make causal claims from observational data
:::
:::{.callout-tip}
## Think About It
Would you trust a colleague who never says "I don't know"? What would you do differently if you knew your analysis tool couldn't distinguish what it knows from what it's guessing?
:::
## LLM analysis vs. your analysis
Let's test this concretely. If you asked an LLM "Do NBA players perform better after rest?", here's roughly what it would do.
```{python}
# Load NBA data
nba = pd.read_csv(f'{DATA_DIR}/nba/nba_load_management.csv')
nba['GAME_DATE'] = pd.to_datetime(nba['GAME_DATE'])
print(f"NBA dataset: {len(nba):,} player-games")
print(f"Seasons: {nba['SEASON'].unique()}")
```
```{python}
# What an LLM would do: run a simple comparison
rested = nba[nba['REST_DAYS'] >= 3]['PTS']
not_rested = nba[nba['REST_DAYS'] <= 1]['PTS']
t_stat, p_value = stats.ttest_ind(rested, not_rested)
print("LLM's analysis: 'Do players score more when rested?'")
print(f" Rested (3+ days): {rested.mean():.1f} PPG (n={len(rested):,})")
print(f" Not rested (0-1 days): {not_rested.mean():.1f} PPG (n={len(not_rested):,})")
print(f" t-statistic: {t_stat:.2f}")
print(f" p-value: {p_value:.2e}")
print()
if rested.mean() < not_rested.mean():
print("Wait -- rested players score FEWER points?")
print("The LLM would either report this surprising finding uncritically")
print("or quietly reverse the groups to match its prior belief.")
```
Wait — rested players score *fewer* points? That's backwards from what you'd expect.
:::{.callout-tip}
## Think About It
Why might this be backwards? Think about *which* players tend to have many rest days, and *which* players play on back-to-backs. Take 30 seconds before reading on.
:::
This reversal is **Simpson's paradox** from Lecture 11. The aggregate comparison is *backwards*, given that rest is confounded with player quality:
- Bench players (low scorers) have more rest days — they don't play every game
- Stars (high scorers) play on back-to-backs — coaches need them
The LLM also used `ttest_ind` without noting that the same players appear many times — the observations within a player are correlated, violating the independence assumption.
The LLM doesn't know any of this. It would report the naive comparison and either:
1. Confidently claim rest *hurts* performance (wrong interpretation)
2. Hedge with "more analysis needed" (correct but unhelpful)
For the within-player analysis we switch from `PTS` to `GAME_SCORE`, a composite metric that accounts for rebounds, assists, turnovers, and efficiency — a better measure of overall performance than raw points.
```{python}
# What WE know to do: control for player identity
# Within-player comparison (from Lecture 11)
player_effects = []
players_tested = 0
significant_naive = 0
for player in nba['PLAYER_NAME'].unique():
pdata = nba[nba['PLAYER_NAME'] == player]
rested_p = pdata[pdata['REST_DAYS'] >= 3]['GAME_SCORE']
tired_p = pdata[pdata['REST_DAYS'] <= 1]['GAME_SCORE']
if len(rested_p) >= 10 and len(tired_p) >= 10:
t, p = stats.ttest_ind(rested_p, tired_p)
players_tested += 1
if p < 0.05:
significant_naive += 1
player_effects.append({
'player': player,
'diff': rested_p.mean() - tired_p.mean(),
'p_value': p
})
```
```{python}
# Summarize the within-player results
effects_df = pd.DataFrame(player_effects)
print(f"Within-player analysis (the right way):")
print(f" Players tested: {players_tested}")
print(f" 'Significant' at p<0.05: {significant_naive} ({significant_naive/players_tested:.1%})")
print(f" Expected by chance (5%): {players_tested * 0.05:.0f}")
print()
print(f" Mean effect (rested - tired): {effects_df['diff'].mean():+.2f} game score points")
print(f" Median effect: {effects_df['diff'].median():+.2f}")
print()
print("After controlling for player identity, the effect shrinks dramatically.")
print("Most 'significant' results are likely false positives from")
print("multiple testing (Lecture 11).")
print()
print("Even this within-player t-test is a simplification -- a proper analysis")
print("would account for game context. But the point is that controlling for")
print("player identity greatly reduces the naive finding.")
```
The LLM gave you a result in seconds. But it took 16 lectures of statistical training to know:
1. The aggregate comparison is confounded — this is Simpson's paradox (Lecture 11), formalized causally in Lecture 18
2. You need within-player comparisons (Lecture 11)
3. Testing many players inflates false positives (Lecture 11)
4. The effect size is tiny even when "significant" (Lecture 12)
The LLM got the *computation* right. It got the *thinking* wrong.
## What remains irreducibly human
AI tools are getting better fast. But some things remain fundamentally human:
**Asking the right question.** "What predicts earnings?" is a different question from "What *causes* higher earnings?" or "Which schools provide the most *value-added*?" The data can't tell you which question to ask.
:::{.callout-note}
## Tukey on Asking the Right Question
"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." — John Tukey
You first heard this quote in Lecture 1. Seventeen lectures later, the meaning is concrete: the LLM gave a precise answer to the wrong question.
:::
**Knowing the domain.** Why are for-profit schools missing SAT data? Why do bench players have more rest days? Why does AQI spike in August? Domain knowledge is the difference between a plausible-looking wrong answer and a correct one.
**Questioning assumptions.** Every statistical method makes assumptions. LLMs rarely check them. You now know to ask: Is this data representative? Is the relationship causal? Are the residuals well-behaved? Are we testing too many hypotheses?
**Understanding stakes and consequences.** A wrong prediction about air quality could mean people go outside during a hazardous event. A wrong conclusion about school quality could redirect billions in funding. The statistical method is the same; the consequences are not.
**Communicating uncertainty honestly.** LLMs project confidence. Good statisticians communicate uncertainty. "The model suggests X, but the confidence interval is wide and the data is missing for the most vulnerable populations" is more honest — and more useful — than "X."
:::{.callout-tip}
## Think About It
Which of these skills could an AI eventually learn? Which require human judgment that no amount of training data can replace?
:::
## A practical guide: using AI tools well
AI tools aren't going away. Here's how to use them as a *statistically literate* person:
| Use AI for... | But always check... |
|--------------|-------------------|
| Writing boilerplate code | Does it handle missing data correctly? |
| Quick EDA and visualization | Are the axes labeled? Is the scale misleading? |
| Trying multiple models | What data was dropped? What assumptions were violated? |
| Generating hypotheses | Is this correlation or causation? |
| Drafting reports | Are the uncertainty statements honest? |
The pattern: **let the AI draft, then apply your statistical judgment.** This principle is why you took the course.
This checklist is worth keeping — save it for when you use AI tools in your projects.
### Prompt decomposition: how to use AI effectively
Don't ask an LLM to "analyze this dataset." Break the task into smaller steps where you can verify each one:
1. **Load and inspect:** "Load this CSV and show me the first few rows, dtypes, and missing value counts."
2. **Check missingness:** "Which columns have missing values? Show me how missingness relates to other variables."
3. **Fit a model:** "Fit a linear regression of Y on X1, X2, X3. Show residual diagnostics."
4. **Check your work:** "What assumptions did this analysis make? What could go wrong?"
Each step gives you a checkpoint. If the LLM silently drops 84% of the data in step 1, you catch it before it propagates through the entire analysis.
:::{.callout-tip}
## Think About It
You've probably used AI tools for homework this quarter. Think of a time the AI gave you something that looked right but wasn't — or a time you caught a mistake the AI missed. What course concept helped you catch it?
:::
## Key takeaways
- **AutoML finds good models automatically**, but it can't choose the right question, check data quality, or interpret results in context. It optimizes math, not thinking.
- **LLMs write working code fast**, but they don't check assumptions, notice subtle data issues, or distinguish correlation from causation. They're confidently wrong about causal claims.
- **Trees, forests, and boosting** are the workhorse models behind most AutoML systems. Decision trees split data recursively; random forests average many trees to reduce variance; gradient boosting trains trees sequentially on residuals.
- **The biggest errors aren't computational.** They're conceptual: asking the wrong question, using biased data, confusing correlation with causation, ignoring multiple testing. These are human judgment calls.
- **AI tools are powerful assistants.** Use them for drafting code, trying models, and generating hypotheses. But always apply your statistical judgment before trusting results.
:::{.callout-note}
## The Goal
The goal of this course was never to make you compute faster than a machine. It was to make you *think* better than one.
:::
## Study guide
### Key ideas
- **Decision tree:** a model that recursively partitions the feature space by splitting on one feature at a time, predicting the average outcome in each leaf.
- **Node, split, leaf:** a node is a decision point in the tree; a split is the threshold that divides data at a node; a leaf is a terminal node that makes a prediction.
- **Random forest:** an ensemble of many decision trees, each trained on a bootstrap sample with random feature subsets, whose predictions are averaged.
- **Bagging (bootstrap aggregation):** training many models on random resamples of the data and averaging their predictions to reduce variance.
- **Gradient boosting:** an ensemble method that trains trees sequentially, with each tree correcting the errors of the previous ones by following the gradient of the loss function.
- **AutoML:** automated machine learning — systematic search over models, features, and hyperparameters with cross-validation.
- **Selection bias:** systematic exclusion of certain groups from an analysis, leading to conclusions that don't generalize.
- **LLM (Large Language Model):** an AI model trained on text that can generate code and analysis, but lacks the ability to check statistical assumptions or reason causally.
- **Prompt decomposition:** breaking an analysis into small, verifiable steps so you can catch AI errors before they propagate.
### Computational tools
- `DecisionTreeRegressor(max_depth=k)` — fit a decision tree with controlled depth
- `plot_tree(tree, feature_names=...)` — visualize a fitted decision tree
- `RandomForestRegressor(n_estimators=100)` — fit a random forest (average of 100 trees)
- `GradientBoostingRegressor(n_estimators=100)` — fit a gradient boosting model (100 sequential trees)
- `cross_val_score(model, X, y, cv=5)` — evaluate model with k-fold cross-validation
- `Pipeline([('scaler', StandardScaler()), ('model', ...)])` — chain preprocessing and modeling to prevent data leakage
### For the quiz
- Understand how a decision tree makes a prediction (follow splits from root to leaf).
- Know why deep trees overfit and how random forests fix this (bagging reduces variance).
- Be able to explain the connection between gradient boosting and gradient descent.
- Know the four main ways LLMs fail at data analysis (silent data loss, confounding correlation with causation, skipping assumption checks, hallucinated confidence).
- Understand why prompt decomposition helps catch AI errors.
- Be able to identify selection bias in a dataset where rows are dropped for missing values.