AI Professionals Bootcamp | Week 2
2025-12-24
Goal: ship a clean EDA notebook + exported figures + a short written summary (offline‑first, reproducible).
Bootcamp • SDAIA Academy
notebooks/day4_eda.ipynb + export figures + write reports/summary.md
By the end of today, you can:
- export figures to reports/figures/*.html (offline-friendly)
- write reports/summary.md with findings + definitions + caveats

Day 4 EDA reads from processed data (not raw).
From repo root:
macOS/Linux
Windows PowerShell
Checkpoint: prints True.
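One way to run this check from Python, assuming the file in question is data/processed/orders_enriched.parquet (the input that the Day 4 feature script reads) and that you start from the repo root. The function name is hypothetical.

```python
from pathlib import Path


def processed_input_exists(repo_root: Path = Path(".")) -> bool:
    """Return True when the enriched orders file is present under data/processed/."""
    # Assumption: orders_enriched.parquet is the file this checkpoint is about.
    return (repo_root / "data" / "processed" / "orders_enriched.parquet").is_file()


print(processed_input_exists())
```

If this prints False, the processed data has not been built yet on this machine.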
macOS/Linux
Windows PowerShell
orders_enriched.parquet
Note
Today is the “EDA handoff”: the notebook answers questions and the repo contains exported figures + a written summary.
- notebooks/day4_eda.ipynb (reads from data/processed/)
- reports/figures/ with 3–6 Plotly exports (.html required; .png optional)
- reports/summary.md (findings + definitions + caveats)
- data/processed/orders_features.parquet (analysis-ready; built once, reused)

Chart choice (data‑to‑viz thinking) + notebook workflow rules
By the end of this session, you can:
Notebooks are for exploration + explanation.
Rules:
- read from data/processed/ (never data/raw/)

Warning
If your notebook can’t be re-run top-to-bottom without manual fixes, it’s not a deliverable.
You’ll choose charts based on question type + variable types.
Use the simplest chart that answers the question:
Tip
Your default should be “table → bar/line/scatter/hist”. Make pies rare.
| You want to… | Data types | Best default | Common trap |
|---|---|---|---|
| Compare categories | categorical + numeric | bar (sorted) | unsorted bars, too many categories |
| See a trend | datetime + numeric | line | mixing time zones / missing dates |
| Understand spread | numeric | histogram | using mean only on skewed data |
| Compare distributions | categorical + numeric | histogram facets / box | box plots with tiny groups |
| Find relationships | numeric + numeric | scatter | overplotting without aggregation |
| Show composition | categorical + categorical | stacked bar | pie charts for many categories |
For each question, pick the best default chart:
Checkpoint: answer with 4 chart choices (one each).
status_clean (plus show group sizes)
Note
Most chart quality comes from: sorting, labeling, and choosing the right level of aggregation.
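The "best default" column of the table above can be written down as a small lookup, which is handy when you annotate each notebook question with its chart type. This is only a sketch: `DEFAULT_CHART` and `default_chart` are hypothetical names, and the keys simply mirror the table rows.

```python
# Hypothetical lookup that mirrors the "Best default" column of the table.
DEFAULT_CHART = {
    "compare categories": "bar (sorted)",
    "see a trend": "line",
    "understand spread": "histogram",
    "compare distributions": "histogram facets / box",
    "find relationships": "scatter",
    "show composition": "stacked bar",
}


def default_chart(goal: str) -> str:
    """Return the default chart for a question type, ignoring case and outer spaces."""
    key = goal.strip().lower()
    if key not in DEFAULT_CHART:
        raise KeyError(f"unknown goal {goal!r}; expected one of {sorted(DEFAULT_CHART)}")
    return DEFAULT_CHART[key]


print(default_chart("See a trend"))
```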
A job-ready EDA notebook answers 3–6 questions:
For each question:
Question: Why do we export figures to reports/figures/ instead of leaving them only inside the notebook?
Answer: figures are stable artifacts you can share, review, and reference from the written summary (and diff in Git if needed).
When you return: we’ll learn the Plotly patterns we’ll use for every chart today.
Plotly Express patterns + exporting figures (HTML/PNG)
By the end of this session, you can:
- build charts with plotly.express
- export them as .html (offline-friendly) and optionally .png

The building blocks:
- plotly.express → fast chart creation (px.bar, px.line, …)
- fig.update_layout(...) → titles, margins, legend, theme
- fig.update_xaxes/yaxes(...) → axis labels, tick formatting
- fig.write_html(...) ✅ works offline (recommended)
- fig.write_image(...) ✅ requires kaleido (optional)

If you want .png exports, install kaleido first.
If you can’t install it (offline), export .html and keep going.
```python
import plotly.express as px

# bar (categories)
fig = px.bar(df, x="country", y="revenue", title="Revenue by country")

# line (trend)
fig = px.line(df, x="month", y="revenue", title="Monthly revenue")

# histogram (distribution)
fig = px.histogram(df, x="amount", nbins=50, title="Amount distribution")

# scatter (relationship)
fig = px.scatter(df, x="quantity", y="amount", title="Amount vs quantity")
```

Plotly won’t magically sort categories the way you intend.
Do this:
Tip
If you copy a chart into a slide with no context, it should still make sense.
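Because Plotly draws categories in the order they first appear in the data, one approach is to aggregate and sort in pandas before plotting. A minimal sketch with made-up rows; the `orders` frame and its values are illustrative only.

```python
import pandas as pd

# Illustrative data only; in the notebook this would be the processed orders table.
orders = pd.DataFrame(
    {
        "country": ["SA", "AE", "SA", "EG", "AE", "SA"],
        "amount": [120.0, 80.0, 60.0, 40.0, 30.0, 100.0],
    }
)

# Aggregate first, then sort so the biggest bar comes first.
by_country = (
    orders.groupby("country", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "revenue"})
    .sort_values("revenue", ascending=False)
)

# Explicit category order, largest revenue first.
country_order = by_country["country"].tolist()
print(country_order)  # ['SA', 'AE', 'EG']
```

Plotting the sorted frame keeps the bars in this order. If you want to pin the order regardless of row order, the list can be passed as `category_orders={"country": country_order}` to `px.bar`.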
Prefer HTML exports with embedded JS:
Optional PNG (if Kaleido installed):
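A minimal sketch of both exports in one helper. `export_figure` is a hypothetical name; it only assumes the figure object has `write_html` and `write_image`, as Plotly figures do. The broad `except` is deliberate, since the exact error raised when Kaleido is missing may differ between Plotly versions (an assumption, not something verified here).

```python
from pathlib import Path


def export_figure(fig, out_dir: Path, name: str, *, want_png: bool = False) -> list[Path]:
    """Write <name>.html (always) and <name>.png (best effort); return the files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []

    html_path = out_dir / f"{name}.html"
    # include_plotlyjs=True embeds plotly.js in the file, so it opens offline.
    fig.write_html(html_path, include_plotlyjs=True)
    written.append(html_path)

    if want_png:
        png_path = out_dir / f"{name}.png"
        try:
            fig.write_image(png_path)  # needs Kaleido installed
            written.append(png_path)
        except Exception as exc:  # broad on purpose, see the note above
            print(f"skipped PNG export for {name}: {exc}")

    return written
```

The HTML file is always written first, so a failed PNG export never blocks the deliverable.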
You have a bar chart that:
Write the 3 lines to fix it.
Checkpoint: you wrote: sort + update_layout + axis labels.
If you want multiple metrics on the same chart or facet:
- melt() to go wide → long

Example:
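A small sketch of the wide → long reshape, using made-up monthly numbers. The column names `revenue`, `refunds`, `metric`, and `value` are illustrative assumptions.

```python
import pandas as pd

# Wide: one column per metric (illustrative values).
monthly_wide = pd.DataFrame(
    {
        "month": ["2025-01", "2025-02"],
        "revenue": [1000.0, 1200.0],
        "refunds": [50.0, 30.0],
    }
)

# Long: one row per (month, metric) pair.
monthly_long = monthly_wide.melt(
    id_vars="month",
    value_vars=["revenue", "refunds"],
    var_name="metric",
    value_name="value",
)
print(monthly_long)
```

With the long frame, `px.line(monthly_long, x="month", y="value", color="metric")` draws one line per metric, and `facet_col="metric"` gives one panel per metric instead.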
When you return: we’ll add “comparison thinking” + bootstrap intervals so our conclusions are honest.
Comparison thinking + bootstrap uncertainty + writing the summary
By the end of this session, you can:
- write reports/summary.md

When you compare groups, always include:
Warning
Averages without n are how analysts accidentally lie.
Example: refund rate by country
- n_refund
- n_total
- n_refund / n_total

Your report should show all three.
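The three numbers can come out of a single groupby. A sketch on made-up rows, assuming refunded orders carry `status_clean == "refund"` (the exact label in your data may differ).

```python
import pandas as pd

# Illustrative rows; "refund" as the status label is an assumption.
orders = pd.DataFrame(
    {
        "country": ["SA", "SA", "SA", "AE", "AE"],
        "status_clean": ["paid", "refund", "paid", "refund", "paid"],
    }
)

refunds = (
    orders.assign(is_refund=orders["status_clean"].eq("refund"))
    .groupby("country")
    .agg(n_refund=("is_refund", "sum"), n_total=("is_refund", "size"))
    .reset_index()
)
# Keep the rate next to its numerator and denominator.
refunds["refund_rate"] = refunds["n_refund"] / refunds["n_total"]
print(refunds)
```

Showing `n_total` beside the rate makes it obvious when a high rate rests on only a handful of orders.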
Bootstrap answers: “If I re-sampled my data, how much could this estimate vary?”
Use it when:
Keep it reproducible:
```python
import numpy as np
import pandas as pd


def bootstrap_diff_means(a: pd.Series, b: pd.Series, *, n_boot: int = 2000, seed: int = 0) -> dict:
    """Percentile-bootstrap 95% interval for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the interval reproducible
    # Coerce to numeric and drop missing values before resampling.
    a = pd.to_numeric(a, errors="coerce").dropna().to_numpy()
    b = pd.to_numeric(b, errors="coerce").dropna().to_numpy()
    assert len(a) > 0 and len(b) > 0

    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement at its original size.
        sa = rng.choice(a, size=len(a), replace=True)
        sb = rng.choice(b, size=len(b), replace=True)
        diffs.append(sa.mean() - sb.mean())
    diffs = np.array(diffs)

    return {
        "diff_mean": float(a.mean() - b.mean()),
        "ci_low": float(np.quantile(diffs, 0.025)),
        "ci_high": float(np.quantile(diffs, 0.975)),
        "n_a": int(len(a)),
        "n_b": int(len(b)),
    }
```

You compute:
- diff_mean = +3.2
- 95% CI = [-0.4, +6.9]

Write one honest sentence.
Checkpoint: mention uncertainty and that zero is inside the CI.
“Group A’s mean is about 3.2 higher than Group B’s, but the bootstrap interval [-0.4, +6.9] includes 0, so the difference is uncertain with this data (may be small or not real).”
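If you report several comparisons, a tiny formatter keeps the wording consistent. This is a sketch; `describe_diff` is a hypothetical helper and the phrasing is only one reasonable choice.

```python
def describe_diff(diff_mean: float, ci_low: float, ci_high: float) -> str:
    """Turn a point estimate and a 95% interval into one cautious sentence."""
    interval = f"95% CI [{ci_low:+.1f}, {ci_high:+.1f}]"
    if ci_low <= 0.0 <= ci_high:
        return (
            f"Difference in means is {diff_mean:+.1f} ({interval}); "
            "the interval includes 0, so the difference is uncertain with this data."
        )
    return f"Difference in means is {diff_mean:+.1f} ({interval}); the interval excludes 0."


print(describe_diff(3.2, -0.4, 6.9))
```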
reports/summary.md (template)
Your summary should include:
Tip
If your summary can’t be read in under 2 minutes, it’s too long.
When you return: we’ll build the Day 4 notebook and export your figures + summary.
Build: notebooks/day4_eda.ipynb → figures + summary (Plotly)
By the end, you should have:
- data/processed/orders_features.parquet (created once, reused)
- notebooks/day4_eda.ipynb that runs top-to-bottom
- reports/figures/ with 3–6 exports (.html required)
- reports/summary.md including at least one bootstrap CI

From repo root:
(Windows PowerShell)
orders_features.parquet (20 minutes)
Create scripts/run_day4_make_features.py that:
- reads data/processed/orders_enriched.parquet
- parses created_at (UTC)
- adds time parts (month, dow, hour)
- adds amount_winsor
- writes data/processed/orders_features.parquet (idempotent)

Checkpoint: the file exists and reloading shows > 0 rows.
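To make the winsorizing and outlier-flag steps concrete, here is a rough sketch of what such helpers might do. These are stand-ins, not the `bootcamp_data.transforms` implementations: the 1%/99% clipping bounds, the IQR fence rule, and the `<col>_is_outlier` column name are all assumptions for illustration.

```python
import pandas as pd


def winsorize_sketch(s: pd.Series, lo: float = 0.01, hi: float = 0.99) -> pd.Series:
    """Clip values to the [lo, hi] quantile range (assumed bounds)."""
    low, high = s.quantile(lo), s.quantile(hi)
    return s.clip(lower=low, upper=high)


def add_outlier_flag_sketch(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Add a boolean column marking values outside the IQR fences; rows are kept."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    is_outlier = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    # assign returns a new frame, so the input is not modified.
    return df.assign(**{f"{col}_is_outlier": is_outlier})


demo = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
demo_flagged = add_outlier_flag_sketch(demo, "amount")
demo_flagged["amount_winsor"] = winsorize_sketch(demo["amount"])
print(demo_flagged)
```

Both functions return new objects, which is one way to keep the feature script safe to re-run.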
scripts/run_day4_make_features.py

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd

from bootcamp_data.io import read_parquet, write_parquet
from bootcamp_data.checks import require_columns, assert_non_empty
from bootcamp_data.transforms import parse_datetime, add_time_parts, winsorize, add_outlier_flag, add_missing_flags


def main() -> None:
    # scripts/ sits one level below the repo root.
    root = Path(__file__).resolve().parents[1]
    in_path = root / "data" / "processed" / "orders_enriched.parquet"
    out_path = root / "data" / "processed" / "orders_features.parquet"

    df = read_parquet(in_path)
    assert_non_empty(df, df_name="orders_enriched")
    require_columns(df, ["order_id", "user_id", "amount", "created_at", "country", "status_clean"], df_name="orders_enriched")

    d = df.copy()
    d = parse_datetime(d, "created_at", utc=True)
    d = add_time_parts(d, "created_at")
    d["amount_winsor"] = winsorize(d["amount"])
    d = add_outlier_flag(d, "amount", k=1.5)
    d = add_missing_flags(d, ["amount", "created_at", "country", "status_clean"])

    # Overwrites the output on each run, so re-running is safe.
    write_parquet(d, out_path)
    print("wrote:", out_path, "rows:", len(d))


if __name__ == "__main__":
    main()
```

macOS/Linux
Windows PowerShell
notebooks/day4_eda.ipynb (10 minutes)
In VS Code:
- create notebooks/day4_eda.ipynb
- select the uv environment as the kernel

Your notebook should have these sections:
- reports/summary.md

```python
from pathlib import Path

import pandas as pd
import plotly.express as px

ROOT = Path("..")  # notebooks/ → repo root
FIGS = ROOT / "reports" / "figures"
FIGS.mkdir(parents=True, exist_ok=True)


def save_html(fig, name: str) -> Path:
    p = FIGS / f"{name}.html"
    # include_plotlyjs=True embeds plotly.js in the file, so it opens offline.
    fig.write_html(p, include_plotlyjs=True)
    return p
```

Checkpoint: you see rows, and created_at is datetime-like.
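One way to turn this checkpoint into a cell that fails loudly. `check_loaded` is a hypothetical helper, and the demo frame stands in for the real data so the sketch runs on its own.

```python
import pandas as pd


def check_loaded(df: pd.DataFrame, time_col: str = "created_at") -> None:
    """Raise if the frame is empty or the time column is not datetime-like."""
    if len(df) == 0:
        raise ValueError("no rows loaded")
    if not pd.api.types.is_datetime64_any_dtype(df[time_col]):
        raise TypeError(f"{time_col} is {df[time_col].dtype}, expected a datetime dtype")


# Stand-in for the processed orders table.
demo = pd.DataFrame(
    {
        "created_at": pd.to_datetime(["2025-12-01T10:00:00Z", "2025-12-02T11:30:00Z"], utc=True),
        "amount": [10.0, 20.0],
    }
)
check_loaded(demo)
print(len(demo), demo["created_at"].dtype)
```

In the notebook, the frame would come from something like `pd.read_parquet(ROOT / "data" / "processed" / "orders_features.parquet")` before calling the check.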
Make at least these three:
- amount distribution (amount_winsor)

Export each with save_html(fig, "day4_<name>").
```python
by_country = (
    df.groupby("country", dropna=False)["amount"]
    .sum(min_count=1)
    .rename("revenue")
    .reset_index()
    .sort_values("revenue", ascending=False)
    .head(15)
)

fig = px.bar(by_country, x="country", y="revenue", title="Revenue by country (top 15)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue")
save_html(fig, "day4_revenue_by_country")
```

Choose one:
- amount in top country vs 2nd country
- amount for paid vs refund

Compute a bootstrap interval and include it as a bullet in your summary.
Interpretation rule: if CI crosses 0, be cautious.
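For the country comparison, a sketch of how the two groups might be picked and compared. It uses synthetic data and a vectorized variant of the bootstrap (`bootstrap_diff_means_vec`, a hypothetical name) so it runs on its own. The names `top2` and `res` match the placeholders used in the summary template.

```python
import numpy as np
import pandas as pd


def bootstrap_diff_means_vec(a, b, *, n_boot: int = 2000, seed: int = 0) -> dict:
    """Percentile-bootstrap 95% interval for mean(a) - mean(b), without a Python loop."""
    rng = np.random.default_rng(seed)
    a = pd.to_numeric(pd.Series(a), errors="coerce").dropna().to_numpy(dtype=float)
    b = pd.to_numeric(pd.Series(b), errors="coerce").dropna().to_numpy(dtype=float)
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both groups need at least one numeric value")
    # Each row of the index matrix is one resample drawn with replacement.
    means_a = a[rng.integers(0, len(a), size=(n_boot, len(a)))].mean(axis=1)
    means_b = b[rng.integers(0, len(b), size=(n_boot, len(b)))].mean(axis=1)
    diffs = means_a - means_b
    return {
        "diff_mean": float(a.mean() - b.mean()),
        "ci_low": float(np.quantile(diffs, 0.025)),
        "ci_high": float(np.quantile(diffs, 0.975)),
        "n_a": int(len(a)),
        "n_b": int(len(b)),
    }


# Synthetic stand-in for the orders table (values are made up).
gen = np.random.default_rng(1)
df = pd.DataFrame(
    {
        "country": ["SA"] * 60 + ["AE"] * 40 + ["EG"] * 10,
        "amount": np.concatenate(
            [gen.normal(100, 20, 60), gen.normal(90, 20, 40), gen.normal(80, 20, 10)]
        ),
    }
)

# Top two countries by total revenue, largest first.
top2 = df.groupby("country")["amount"].sum().sort_values(ascending=False).head(2).index.tolist()
res = bootstrap_diff_means_vec(
    df.loc[df["country"] == top2[0], "amount"],
    df.loc[df["country"] == top2[1], "amount"],
)
print(top2, res)
```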
reports/summary.md from the notebook (20 minutes)
Minimum content:
```python
# Uses by_country, top2, and res from the earlier notebook cells.
summary = f"""# Project Summary (Day 4)
## Key findings
- Top revenue country: **{by_country.iloc[0]['country']}** (revenue={by_country.iloc[0]['revenue']:.2f})
- (Add 2–4 more bullets from your tables/charts.)
- Bootstrap mean diff ({top2[0]} - {top2[1]}): **{res['diff_mean']:.2f}** (95% CI [{res['ci_low']:.2f}, {res['ci_high']:.2f}], n={res['n_a']} vs {res['n_b']})
## Figures
- [Revenue by country](figures/day4_revenue_by_country.html)
- (Add your other figure links.)
## Caveats / assumptions
- Totals use raw `amount`; `amount_winsor` is for visualization only.
- Outliers are flagged, not removed.
- Date parsing errors become missing timestamps.
"""

out = ROOT / "reports" / "summary.md"
out.write_text(summary, encoding="utf-8")
print("wrote:", out)
```

Commit your Day 4 deliverables.
If you get stuck:
- scripts.run_day2_validate_and_report)
- uv add plotly
- uv add ipykernel

In 1–2 sentences:
How do you decide between a bar chart, line chart, histogram, and scatter plot? (Answer using: question type + variable types.)
Due: before Day 5 starts (2025-12-25)
- notebooks/day4_eda.ipynb runs top-to-bottom
- reports/figures/*.html
- reports/summary.md including one bootstrap CI

Tip
If your repo ignores data/processed/, that’s fine—your scripts must recreate it reliably.
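Before submitting, a quick file-level check can catch missing deliverables. This is a sketch: `check_day4_deliverables` is a hypothetical helper, the "CI" text search is a crude heuristic, and it cannot confirm that the notebook actually runs top-to-bottom.

```python
from pathlib import Path


def check_day4_deliverables(root: Path) -> list[str]:
    """Return a list of problems; an empty list means the file-level checks pass."""
    problems: list[str] = []

    if not (root / "notebooks" / "day4_eda.ipynb").is_file():
        problems.append("missing notebooks/day4_eda.ipynb")

    figs_dir = root / "reports" / "figures"
    n_html = len(list(figs_dir.glob("*.html"))) if figs_dir.is_dir() else 0
    if not 3 <= n_html <= 6:
        problems.append(f"expected 3-6 .html figures, found {n_html}")

    summary = root / "reports" / "summary.md"
    if not summary.is_file():
        problems.append("missing reports/summary.md")
    elif "CI" not in summary.read_text(encoding="utf-8"):
        # Crude heuristic: only looks for the letters "CI" in the text.
        problems.append("reports/summary.md does not mention a CI")

    return problems


if __name__ == "__main__":
    for problem in check_day4_deliverables(Path(".")):
        print("FIX:", problem)
```

Run it from the repo root; no output means the expected files are in place.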