Appendix: Computational Tools
This appendix introduces the software tools used throughout this book and collects the Python functions you’ll encounter most often. If you’ve never written code before, start here. If you have programming experience, skim the first few sections and use the function reference at the end as a cheat sheet.
The toolkit
Four tools form the backbone of the computational work in this course:
| Tool | What it is | Role in this course |
|---|---|---|
| Python | A general-purpose programming language | Write and run data analysis code |
| Jupyter notebooks | An interactive environment that mixes code, text, and plots | Where you do your homework and projects |
| Google Colab | A free, cloud-based Jupyter environment from Google | Run notebooks without installing anything |
| AI assistants | Tools like ChatGPT, Claude, and Gemini | Help you write code, debug, and learn |
You don’t need to master any of these tools before starting the course. You’ll pick them up as you go.
Python
Python is the most widely used language for data science. The code in this book uses Python along with a handful of libraries:
- pandas — load, clean, and transform data
- numpy — numerical operations and linear algebra
- matplotlib and seaborn — visualization
- scipy — statistical distributions and hypothesis tests
- scikit-learn — machine learning models
You don’t need to memorize these. Each chapter introduces the functions it uses, and the reference tables at the end of this appendix collect them in one place.
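These libraries are conventionally imported under short aliases, and the code in this book follows that convention. A typical first cell looks something like this (the commented lines show the same pattern for the remaining libraries):

```python
# Conventional import aliases used throughout this book
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# The remaining libraries follow the same pattern, for example:
# import seaborn as sns
# from scipy import stats
# from sklearn.linear_model import LinearRegression
```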
Jupyter notebooks
A Jupyter notebook is a document made of cells. Each cell is either code (Python that you can run) or text (explanations, formatted with Markdown). You run cells one at a time, and each code cell prints its output directly below it — a table, a plot, a number.
This format is ideal for data analysis because you can see each step and its result before moving on. Every chapter of this book is also available as a Jupyter notebook.
Running code: local vs. cloud
There are two ways to run a Jupyter notebook.
Cloud (Google Colab). Open colab.research.google.com, upload a notebook (or click the “Open in Colab” badge on any chapter), and start running cells. Google provides a free virtual machine with Python and the major libraries pre-installed. Your code runs on Google’s servers. You need nothing on your own computer except a web browser.
Local. Install Python (we recommend the Anaconda distribution), then launch Jupyter from a terminal with jupyter notebook or jupyter lab. Your code runs on your own machine. You have full control over installed packages and file access, but you’re responsible for setup and troubleshooting.
Which should you use? For this course, Colab is the path of least resistance. It works on any laptop, requires no installation, and is free for the workloads in this course. If you already have a local Python setup you’re comfortable with, use that.
| | Google Colab | Local (Anaconda) |
|---|---|---|
| Setup | None — open a browser | Install Anaconda, configure environment |
| Hardware | Google’s servers (free GPU available) | Your laptop’s CPU and RAM |
| File access | Upload files or mount Google Drive | Direct access to local files |
| Packages | Most pre-installed; !pip install for others | You manage your own environment |
| Persistence | Notebooks saved to Google Drive; runtime resets after ~12 hours of inactivity | Everything stays on your machine |
| Best for | Homework, quick exploration, collaboration | Large projects, custom environments |
AI tools
AI assistants are useful at every stage of a data analysis — writing code, debugging errors, and understanding concepts. Here’s how to think about the options.
Built-in AI in Colab
Google Colab has AI features built directly into the notebook:
- Code completion. Write a comment describing what you want (e.g., # plot a histogram of prices), press Tab, and Colab suggests code. Review the suggestion before accepting it.
- Ask a question. Highlight code or an error message, right-click, and choose “Ask Gemini.” Colab explains what the code does or what went wrong.
- Generate cells. Click the “+ Code” button with the sparkle icon, type a natural-language prompt, and Colab generates a code cell.
These features are fast and convenient for small, well-defined tasks: “make a scatter plot,” “compute the mean by group,” “why am I getting a KeyError?”
Advanced AI assistants
For larger or more open-ended tasks, use a full AI assistant: ChatGPT, Claude, or Gemini. These are better when you need to:
- Understand a concept. “Explain what a p-value means in the context of A/B testing.”
- Debug a tricky error. Paste your code and the full traceback. The assistant can trace through the logic in a way that Colab’s inline help cannot.
- Plan an analysis. “I have a dataset of Airbnb listings with price, neighborhood, and number of reviews. What’s a good way to explore whether neighborhood predicts price?”
- Review your work. Paste your homework solution and ask: “Is my interpretation of this coefficient correct?”
Tip: Upload course materials (lecture notes, homework prompts, the textbook chapter) to give the assistant context about what you’re learning and the notation we use. A question like “explain regularization” gets a generic answer; a question like “explain regularization the way it’s covered in Chapter 6 of this textbook” gets one tailored to your course.
The boundary
AI can write code and generate analyses, but it cannot replace your understanding. In this course, you’ll be asked to defend your work in person — in review sessions, quizzes, and the project presentation. The test is not whether the code runs, but whether you can explain why it’s the right analysis and what the results mean.
Use AI to write code, not to replace understanding. Use it to get unstuck faster, not to skip the thinking.
Python function reference
The tables below collect the most important functions used in this book, organized by library. Each function links to the chapter where it first appears. This is a reference, not a reading assignment — come back to it when you need a reminder.
pandas: loading and inspecting data
| Function | What it does |
|---|---|
| pd.read_csv(path) | Load a CSV file into a DataFrame |
| df.head(), df.tail() | Preview the first or last rows |
| df.shape | Number of rows and columns |
| df.info() | Column names, types, and missing-value counts |
| df.dtypes | Data type of each column |
| df.describe() | Summary statistics for numeric columns |
| df.columns | List of column names |
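As a quick illustration, here is a minimal sketch using a small made-up table in place of a real CSV file (with a file on disk you would start from pd.read_csv instead):

```python
import pandas as pd

# A small, made-up listings table standing in for a real CSV
# (with a file you would write: df = pd.read_csv("listings.csv"))
df = pd.DataFrame({
    "neighborhood": ["Mission", "Sunset", "Mission"],
    "price": [150, 95, 210],
})

print(df.shape)       # (rows, columns)
print(df.dtypes)      # type of each column
print(df.describe())  # summary statistics for the numeric column
```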
pandas: selecting and filtering
| Function | What it does |
|---|---|
| df['col'] | Select a single column (returns a Series) |
| df[['col1', 'col2']] | Select multiple columns (returns a DataFrame) |
| df[df['col'] > value] | Filter rows by a condition |
| df.loc[rows, cols] | Select by label |
| df.iloc[rows, cols] | Select by position |
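The same made-up listings table shows the selection patterns side by side:

```python
import pandas as pd

# Small made-up table for illustration
df = pd.DataFrame({
    "neighborhood": ["Mission", "Sunset", "Mission"],
    "price": [150, 95, 210],
})

prices = df["price"]                    # single column -> Series
subset = df[["neighborhood", "price"]]  # multiple columns -> DataFrame
expensive = df[df["price"] > 100]       # keep rows where the condition holds
first_row = df.iloc[0]                  # select the first row by position
```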
pandas: cleaning
| Function | What it does |
|---|---|
| df.dropna() | Remove rows with missing values |
| df.fillna(value) | Replace missing values |
| df.isna() | Boolean mask of missing values |
| df.duplicated() | Boolean mask of duplicate rows |
| df.drop_duplicates() | Remove duplicate rows |
| df.astype(dtype) | Convert column types |
| df['col'].str.lower(), .str.strip(), .str.replace() | String cleaning methods on a text column |
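A short sketch of a typical cleaning pass, again on made-up data with the usual problems (stray whitespace, inconsistent case, missing values, duplicates):

```python
import pandas as pd
import numpy as np

# Made-up messy data: whitespace, mixed case, a missing price, a missing city
df = pd.DataFrame({
    "city": [" Mission ", "sunset", "sunset", None],
    "price": [150.0, np.nan, 95.0, 95.0],
})

df["city"] = df["city"].str.strip().str.lower()          # tidy the text column
df["price"] = df["price"].fillna(df["price"].median())   # fill missing prices
deduped = df.drop_duplicates()                           # drop exact duplicates
```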
pandas: grouping and aggregation
| Function | What it does |
|---|---|
| df.groupby('col') | Group rows by a column |
| .mean(), .sum(), .std(), .median() | Aggregate a group |
| .agg({'col': 'mean'}) | Aggregate with specific functions per column |
| df['col'].value_counts() | Count occurrences of each value |
| df['col'].unique() | Array of unique values |
| pd.crosstab(df['a'], df['b']) | Contingency table |
| pd.pivot_table(df, values, index, columns) | Pivot table with aggregation |
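The groupby-then-aggregate pattern, sketched on the same kind of made-up listings data:

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["Mission", "Sunset", "Mission", "Sunset"],
    "price": [150, 95, 210, 105],
})

mean_price = df.groupby("neighborhood")["price"].mean()  # one mean per group
counts = df["neighborhood"].value_counts()               # rows per neighborhood
```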
pandas: combining DataFrames
| Function | What it does |
|---|---|
| pd.merge(left, right, on='key') | Join two DataFrames on a shared column |
| pd.concat([df1, df2]) | Stack DataFrames vertically or horizontally |
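A minimal sketch of both operations, with two made-up tables that share an id column:

```python
import pandas as pd

listings = pd.DataFrame({"id": [1, 2], "price": [150, 95]})
hosts = pd.DataFrame({"id": [1, 2], "host": ["Ana", "Bo"]})

merged = pd.merge(listings, hosts, on="id")  # join on the shared "id" column
stacked = pd.concat([listings, listings])    # stack rows vertically
```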
pandas: reshaping and encoding
| Function | What it does |
|---|---|
| pd.get_dummies(df['col']) | One-hot encode a categorical column |
| pd.cut(series, bins) | Bin a continuous variable into intervals |
| df.unstack() | Pivot from long to wide format |
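One-hot encoding and binning, sketched on made-up values:

```python
import pandas as pd

# One-hot encode a small categorical series: one 0/1 column per category
dummies = pd.get_dummies(pd.Series(["a", "b", "a"]))

# Bin continuous prices into labeled intervals
prices = pd.Series([95, 150, 210, 320])
bins = pd.cut(prices, bins=[0, 100, 200, 400], labels=["low", "mid", "high"])
```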
numpy
| Function | What it does |
|---|---|
| np.array(list) | Create an array |
| np.zeros(n), np.ones(n) | Arrays of zeros or ones |
| np.arange(start, stop, step) | Evenly spaced values with a fixed step |
| np.linspace(start, stop, n) | n evenly spaced values from start to stop |
| np.mean(x), np.std(x), np.median(x) | Summary statistics |
| np.sqrt(x), np.exp(x), np.log(x) | Element-wise math |
| np.dot(a, b) | Dot product |
| np.linalg.solve(A, b) | Solve \(Ax = b\) |
| np.where(cond, x, y) | Element-wise conditional selection |
| np.random.normal(mu, sigma, n) | Sample from a normal distribution |
| np.random.permutation(x) | Randomly shuffle an array |
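A few of these side by side, as a quick sketch:

```python
import numpy as np

x = np.arange(0, 10, 2)           # steps of 2: array([0, 2, 4, 6, 8])
y = np.linspace(0, 1, 5)          # 5 evenly spaced floats from 0 to 1
z = np.where(x > 4, 1, 0)         # 1 where the condition holds, else 0
d = np.dot([1, 2], [3, 4])        # 1*3 + 2*4 = 11

samples = np.random.normal(0, 1, 1000)  # 1000 draws from a standard normal
```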
matplotlib
| Function | What it does |
|---|---|
| fig, ax = plt.subplots() | Create a figure and axes |
| ax.plot(x, y) | Line plot |
| ax.scatter(x, y) | Scatter plot |
| ax.hist(x, bins=...) | Histogram |
| ax.bar(x, height) | Bar chart |
| ax.set_xlabel(), ax.set_ylabel(), ax.set_title() | Axis labels and title |
| ax.axhline(y), ax.axvline(x) | Horizontal or vertical reference line |
| ax.legend() | Add a legend |
| plt.tight_layout() | Auto-adjust spacing |
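A typical plotting cell combines several of these calls. A minimal sketch (the backend line is only needed when running outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")  # line plot
ax.axhline(0, color="gray")            # horizontal reference line
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A first line plot")
ax.legend()
plt.tight_layout()
```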
seaborn
| Function | What it does |
|---|---|
| sns.histplot(data, x=...) | Histogram with optional density curve |
| sns.scatterplot(data, x=..., y=..., hue=...) | Scatter plot with grouping |
| sns.boxplot(data, x=..., y=...) | Box plot by category |
| sns.violinplot(data, x=..., y=...) | Distribution shape by category |
| sns.heatmap(matrix, annot=True) | Annotated heatmap |
| sns.regplot(data, x=..., y=...) | Scatter plot with regression line |
| sns.set_style('whitegrid') | Set plot theme |
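Seaborn functions take a DataFrame plus column names, which keeps plotting code short. A sketch with made-up listings data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "neighborhood": ["Mission", "Sunset", "Mission", "Sunset"],
    "price": [150, 95, 210, 105],
})

sns.set_style("whitegrid")
ax = sns.boxplot(data=df, x="neighborhood", y="price")  # one box per category
ax.set_title("Price by neighborhood")
```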
scipy.stats
| Function | What it does |
|---|---|
| norm.pdf(x, mu, sigma) | Normal density at \(x\) |
| norm.cdf(x, mu, sigma) | Normal CDF (probability \(\leq x\)) |
| norm.ppf(q, mu, sigma) | Inverse CDF (quantile function) |
| binom.pmf(k, n, p) | Binomial probability of exactly \(k\) successes |
| stats.ttest_ind(a, b) | Two-sample t-test (pass equal_var=False for Welch’s) |
| stats.ttest_1samp(a, mu) | One-sample t-test |
| stats.pearsonr(x, y) | Pearson correlation and p-value |
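A short sketch of the distribution functions and a t-test, using made-up samples:

```python
from scipy import stats
from scipy.stats import norm, binom

p_below = norm.cdf(0, 0, 1)      # P(X <= 0) for a standard normal: 0.5
cutoff = norm.ppf(0.975, 0, 1)   # ~1.96, the familiar 95% z-value
p_heads = binom.pmf(5, 10, 0.5)  # P(exactly 5 heads in 10 fair flips)

# Welch's two-sample t-test on two small made-up samples
a = [1.1, 2.3, 1.9, 2.8]
b = [2.0, 3.1, 2.7, 3.5]
t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)
```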
scikit-learn
| Function | What it does |
|---|---|
| train_test_split(X, y, test_size=...) | Split data into training and test sets |
| cross_val_score(model, X, y, cv=...) | Cross-validation scores |
| StandardScaler() | Standardize features to mean 0, std 1 |
| PolynomialFeatures(degree=...) | Generate polynomial features |
| LinearRegression() | Ordinary least squares |
| Lasso(alpha=...) | L1-regularized regression |
| Ridge(alpha=...) | L2-regularized regression |
| DecisionTreeRegressor() | Decision tree for regression |
| RandomForestRegressor() | Random forest for regression |
| LogisticRegression() | Logistic regression for classification |
| PCA(n_components=...) | Principal component analysis |
| KMeans(n_clusters=...) | K-means clustering |
| model.fit(X, y) | Train a model |
| model.predict(X) | Generate predictions |
| model.score(X, y) | R\(^2\) (regression) or accuracy (classification) |
| r2_score(y_true, y_pred) | Coefficient of determination |
| mean_squared_error(y_true, y_pred) | Mean squared error |
| confusion_matrix(y_true, y_pred) | Classification confusion matrix |
| roc_curve(y_true, y_score) | ROC curve arrays |
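Nearly every scikit-learn model follows the same split / fit / predict / score pattern. A minimal sketch on synthetic data (generated here so the example is self-contained):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=100)

# Hold out a test set, fit on the rest, evaluate on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(r2_score(y_test, preds))  # close to 1 on this nearly linear data
```

The same fit/predict/score calls work unchanged if you swap LinearRegression for Ridge, Lasso, or RandomForestRegressor.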