Data Work (ETL + EDA)

AI Professionals Bootcamp | Week 2

2025-12-24

Day 4: EDA with Plotly — Chart Choice + Uncertainty

Goal: ship a clean EDA notebook + exported figures + a short written summary (offline‑first, reproducible).

Bootcamp • SDAIA Academy

Today’s Flow

Session 1 (60m): Chart choice (data‑to‑viz thinking) + notebook workflow rules
Asr Prayer (20m)
Session 2 (60m): Plotly Express patterns + exporting figures (HTML/PNG)
Maghrib Prayer (20m)
Session 3 (60m): Comparison thinking + bootstrap uncertainty + writing the summary
Isha Prayer (20m)
Hands-on (120m): Build notebooks/day4_eda.ipynb + export figures + write reports/summary.md

Learning Objectives

By the end of today, you can:

choose an appropriate visualization based on question + data types (data-to-viz style)
use Plotly Express as your one plotting library (and customize labels/layout)
export interactive figures to reports/figures/*.html (offline-friendly)
answer 3–6 concrete questions with tables + charts + short interpretations
compute a practical uncertainty estimate using a bootstrap interval
ship a readable reports/summary.md with findings + definitions + caveats

Warm-up (5 minutes): verify inputs exist

Day 4 EDA reads from processed data (not raw).

From repo root:

macOS/Linux

uv run python -m scripts.run_day2_validate_and_report
uv run python -c "from pathlib import Path; print(Path('data/processed/orders_enriched.parquet').exists())"

Windows PowerShell

uv run python -m scripts.run_day2_validate_and_report
uv run python -c "from pathlib import Path; print(Path('data/processed/orders_enriched.parquet').exists())"

Checkpoint: prints True.

Warm-up (5 minutes): verify Plotly is available

macOS/Linux

uv run python -c "import plotly; print('plotly ok', plotly.__version__)"

Windows PowerShell

uv run python -c "import plotly; print('plotly ok', plotly.__version__)"

Tip

If this fails, install Plotly now (required today):

uv add plotly

Quick review: Week 2 so far

Day 1: scaffold + typed I/O + offline-first data layout
Day 2: checks + cleaning + safe join → orders_enriched.parquet
Day 3: datetimes + outliers (flag/cap), EDA helpers (tables)

Note

Today is the “EDA handoff”: the notebook answers questions and the repo contains exported figures + a written summary.

Today’s end state (what you will commit)

notebooks/day4_eda.ipynb (reads from data/processed/)
reports/figures/ with 3–6 Plotly exports (.html required; .png optional)
reports/summary.md (findings + definitions + caveats)
data/processed/orders_features.parquet (analysis-ready; built once, reused)

Session 1

Chart choice (data‑to‑viz thinking) + notebook workflow rules

Session 1 objectives

By the end of this session, you can:

map a question → a chart type
recognize your variable types (categorical / numeric / datetime)
avoid the most common chart mistakes (pie, unsorted bars, missing n, dual axes)
explain “why notebook, not script” for EDA (and what belongs where)

Notebook workflow rules (bootcamp standard)

Notebooks are for exploration + explanation.

Rules:

Notebook reads from data/processed/ (never data/raw/)
ETL belongs in scripts/modules (idempotent), not hidden notebook cells
Every chart has:
- a real title (time window + metric + segment)
- axis labels with units
- enough context to not mislead

Warning

If your notebook can’t be re-run top-to-bottom without manual fixes, it’s not a deliverable.

Variable types (fast mental model)

Categorical: country, status, product_category
Numeric: amount, quantity, rating
Datetime: created_at, event_time, month
Boolean: is_refund, is_outlier

You’ll choose charts based on question type + variable types.
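A quick way to check variable types before picking a chart is to ask pandas. The sketch below is one possible helper; the name `classify_columns` and the toy values are illustrative assumptions, not part of the bootcamp package.

```python
import pandas as pd
from pandas.api import types as ptypes


def classify_columns(df: pd.DataFrame) -> dict[str, str]:
    """Label each column as boolean, datetime, numeric, or categorical."""
    kinds = {}
    for col in df.columns:
        s = df[col]
        # Check bool before numeric: pandas counts bool as a numeric dtype.
        if ptypes.is_bool_dtype(s):
            kinds[col] = "boolean"
        elif ptypes.is_datetime64_any_dtype(s):
            kinds[col] = "datetime"
        elif ptypes.is_numeric_dtype(s):
            kinds[col] = "numeric"
        else:
            kinds[col] = "categorical"
    return kinds


# Toy frame reusing column names from the list above (values are made up).
toy = pd.DataFrame(
    {
        "country": ["SA", "AE", "SA"],
        "amount": [10.0, 25.5, 7.25],
        "created_at": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-02-01"], utc=True),
        "is_refund": [False, True, False],
    }
)
kinds = classify_columns(toy)
print(kinds)
```

Treat the result as a first guess: an ID stored as an integer will come back as numeric even though it behaves like a category.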

Chart choice: question → chart (default moves)

Use the simplest chart that answers the question:

Compare categories (cat → value) → sorted bar / dot plot
Trend over time (time → value) → line
Distribution (numeric) → histogram (add a box plot for a quick comparison across groups)
Relationship (num vs num) → scatter
Composition (parts of whole across a dimension) → stacked bar / stacked area
Rates vs volume → use two charts (avoid dual axes)

Tip

Your default should be “table → bar/line/scatter/hist”. Make pies rare.

Chart choice cheat sheet (data-to-viz style)

You want to…	Data types	Best default	Common trap
Compare categories	categorical + numeric	bar (sorted)	unsorted bars, too many categories
See a trend	datetime + numeric	line	mixing time zones / missing dates
Understand spread	numeric	histogram	using mean only on skewed data
Compare distributions	categorical + numeric	histogram facets / box	box plots with tiny groups
Find relationships	numeric + numeric	scatter	overplotting without aggregation
Show composition	categorical + categorical	stacked bar	pie charts for many categories

Micro-exercise (6 minutes): choose the chart

For each question, pick the best default chart:

“Which countries generate the most revenue?”
“How does revenue change month-to-month?”
“Is order amount skewed or roughly normal?”
“Do refunds have lower order amounts than paid orders?”

Checkpoint: answer with 4 chart choices (one each).

Solution

Sorted bar (top countries by revenue)
Line (monthly revenue)
Histogram (amount distribution; consider log scale if heavy tail)
Histogram facets or box by status_clean (plus show group sizes)

The 5 most common “EDA chart mistakes”

Bars not sorted (hard to read)
Missing units/time window in titles
Comparing averages without showing n
Dual y-axes (often misleading)
Pie charts for more than ~3–4 categories

Note

Most chart quality comes from: sorting, labeling, and choosing the right level of aggregation.
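To avoid the "averages without n" mistake, build the table with the group size next to the average. A minimal sketch on made-up data, assuming a `status_clean` column with values such as `paid` and `refund`:

```python
import pandas as pd

# Toy data: four paid orders and a single refund (values are made up).
toy = pd.DataFrame(
    {
        "status_clean": ["paid", "paid", "paid", "paid", "refund"],
        "amount": [20.0, 30.0, 25.0, 45.0, 90.0],
    }
)

by_status = (
    toy.groupby("status_clean")["amount"]
    # "count" is the number of non-missing amounts, i.e. the n behind each mean.
    .agg(n="count", mean="mean", median="median")
    .reset_index()
    .sort_values("n", ascending=False)
)
print(by_status)
```

Here the refund mean looks three times higher than the paid mean, but it rests on n = 1, which is exactly what the `n` column is there to expose.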

EDA structure (what your notebook should look like)

A job-ready EDA notebook answers 3–6 questions:

For each question:

define metric + filters
compute a table
chart the table (or chart the raw distribution)
write 2–3 sentences: what it means + 1 caveat

Quick Check

Question: Why do we export figures to reports/figures/ instead of leaving them only inside the notebook?

Answer: figures are stable artifacts you can share, review, and reference from the written summary (and diff in Git if needed).

Asr break

20 minutes

When you return: we’ll learn the Plotly patterns we’ll use for every chart today.

Session 2

Plotly Express patterns + exporting figures (HTML/PNG)

Session 2 objectives

By the end of this session, you can:

build the core chart types with plotly.express
customize chart anatomy (titles, labels, axes)
export figures to .html (offline-friendly) and optionally .png
use long/tidy data when plotting grouped metrics

Plotly mental model (2 minutes)

plotly.express → fast chart creation (px.bar, px.line, …)
fig.update_layout(...) → titles, margins, legend, theme
fig.update_xaxes/yaxes(...) → axis labels, tick formatting
Export:
- fig.write_html(...) ✅ works offline (recommended)
- fig.write_image(...) ✅ requires kaleido (optional)

Install note: PNG export is optional

If you want .png exports:

uv add kaleido

If you can’t install it (offline), export .html and keep going.

Canonical Plotly patterns (you will reuse these)

import plotly.express as px

# bar (categories)
fig = px.bar(df, x="country", y="revenue", title="Revenue by country")

# line (trend)
fig = px.line(df, x="month", y="revenue", title="Monthly revenue")

# histogram (distribution)
fig = px.histogram(df, x="amount", nbins=50, title="Amount distribution")

# scatter (relationship)
fig = px.scatter(df, x="quantity", y="amount", title="Amount vs quantity")

Sorting categories (non-negotiable for readability)

Plotly Express draws categories in the order they appear in the DataFrame, so an unsorted table gives an unsorted chart.

Do this:

d = df.sort_values("revenue", ascending=False)
fig = px.bar(d, x="country", y="revenue", title="Revenue by country (sorted)")

Make charts self-contained (labels + units)

fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue (USD)")

Tip

If you copy a chart into a slide with no context, it should still make sense.

Export figures (offline-first)

Prefer HTML exports with embedded JS:

from pathlib import Path

out = Path("reports/figures")
out.mkdir(parents=True, exist_ok=True)

# include_plotlyjs=True embeds plotly.js in the file, so it opens without internet access
fig.write_html(out / "day4_revenue_by_country.html", include_plotlyjs=True)

Optional PNG (if Kaleido installed):

fig.write_image(out / "day4_revenue_by_country.png", scale=2)

Micro-exercise (8 minutes): fix a bad chart

You have a bar chart that:

is unsorted
has no axis labels
has a vague title

Write the three fixes (about five lines of code).

Checkpoint: you sorted the data, set a specific title with update_layout, and added axis labels.

Solution

d = df.sort_values("revenue", ascending=False)
fig = px.bar(d, x="country", y="revenue", title="Revenue by country (Jan–Dec 2025)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue (USD)")

Tidy data reminder (for plotting)

If you want multiple metrics on the same chart or facet:

long format is usually easiest
use melt() to go wide → long

Example:

long = wide.melt(
    id_vars=["month"],
    value_vars=["revenue", "n_orders"],
    var_name="metric",
    value_name="value",
)
fig = px.line(long, x="month", y="value", color="metric", title="Metrics over time")

Maghrib break

20 minutes

When you return: we’ll add “comparison thinking” + bootstrap intervals so our conclusions are honest.

Session 3

Comparison thinking + bootstrap uncertainty + writing the summary

Session 3 objectives

By the end of this session, you can:

compute rates/ratios correctly (with denominators)
report effect sizes (absolute + relative)
build a bootstrap interval for a difference in means or rates
write a short, credible reports/summary.md

Comparison thinking (the EDA upgrade)

When you compare groups, always include:

n (sample size)
absolute difference (e.g., +$2.10)
relative difference (e.g., +8%)
a caveat if groups are tiny or biased

Warning

Averages without n are how analysts accidentally lie.
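The checklist above fits in one small function. This is a sketch with an assumed name (`compare_means`) and made-up inputs; it reports both group sizes next to the absolute and relative difference.

```python
import pandas as pd


def compare_means(a: pd.Series, b: pd.Series) -> dict:
    """Return n, means, absolute and relative difference (a minus b)."""
    a = pd.to_numeric(a, errors="coerce").dropna()
    b = pd.to_numeric(b, errors="coerce").dropna()
    mean_a, mean_b = float(a.mean()), float(b.mean())
    abs_diff = mean_a - mean_b
    # Relative difference is measured against group b (the baseline).
    rel_diff = abs_diff / mean_b if mean_b != 0 else float("nan")
    return {
        "n_a": int(len(a)),
        "n_b": int(len(b)),
        "mean_a": mean_a,
        "mean_b": mean_b,
        "abs_diff": abs_diff,
        "rel_diff": rel_diff,
    }


# Toy amounts for two groups (values are made up).
cmp = compare_means(pd.Series([28.0, 30.0, 32.0]), pd.Series([24.0, 25.0, 26.0]))
print(cmp)
```

With three rows per group, the caveat line in your write-up matters as much as the +20%.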

Rates/ratios: always show numerator + denominator

Example: refund rate by country

numerator: n_refund
denominator: n_total
rate: n_refund / n_total

Your report should show all three.
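A sketch of that three-column table, assuming refunds are marked as `status_clean == "refund"` (adjust the label to whatever your cleaned data uses; the rows below are made up):

```python
import pandas as pd

# Toy orders (values are made up).
toy = pd.DataFrame(
    {
        "country": ["SA", "SA", "SA", "SA", "AE", "AE"],
        "status_clean": ["paid", "refund", "paid", "paid", "refund", "paid"],
    }
)

rates = (
    toy.assign(is_refund=toy["status_clean"].eq("refund"))
    .groupby("country", dropna=False)
    .agg(
        n_refund=("is_refund", "sum"),   # numerator
        n_total=("is_refund", "count"),  # denominator
    )
    .reset_index()
)
rates["refund_rate"] = rates["n_refund"] / rates["n_total"]
print(rates)
```

AE shows the higher rate here (0.50 vs 0.25), but on two orders; keeping `n_total` in the table makes that visible.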

Bootstrap intervals (practical uncertainty)

Bootstrap answers: “If I re-sampled my data, how much could this estimate vary?”

Use it when:

comparing means/medians between groups
comparing rates (refund rate, conversion rate)
you want a rough uncertainty range without heavy theory

Keep it reproducible:

fix a random seed

Minimal bootstrap template (difference in means)

import numpy as np
import pandas as pd

def bootstrap_diff_means(a: pd.Series, b: pd.Series, *, n_boot: int = 2000, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    a = pd.to_numeric(a, errors="coerce").dropna().to_numpy()
    b = pd.to_numeric(b, errors="coerce").dropna().to_numpy()
    assert len(a) > 0 and len(b) > 0

    diffs = []
    for _ in range(n_boot):
        sa = rng.choice(a, size=len(a), replace=True)
        sb = rng.choice(b, size=len(b), replace=True)
        diffs.append(sa.mean() - sb.mean())

    diffs = np.array(diffs)
    return {
        "diff_mean": float(a.mean() - b.mean()),
        "ci_low": float(np.quantile(diffs, 0.025)),
        "ci_high": float(np.quantile(diffs, 0.975)),
        "n_a": int(len(a)),
        "n_b": int(len(b)),
    }
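For a difference in rates, the same percentile idea applies to boolean flags. The sketch below is an assumed companion (`bootstrap_diff_rates` is not part of the template above). It uses a shortcut: resampling n true/false values with replacement gives a Binomial(n, observed rate) count of successes, so each resample can be drawn directly.

```python
import numpy as np


def bootstrap_diff_rates(a, b, *, n_boot: int = 2000, seed: int = 0) -> dict:
    """Percentile bootstrap CI for rate(a) - rate(b); a and b are boolean flags."""
    rng = np.random.default_rng(seed)
    # Drop missing values before calling: NaN would be cast to True here.
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    if a.size == 0 or b.size == 0:
        raise ValueError("both groups need at least one row")

    rate_a, rate_b = a.mean(), b.mean()
    # Successes in a with-replacement resample of size n follow Binomial(n, rate).
    boot_a = rng.binomial(a.size, rate_a, size=n_boot) / a.size
    boot_b = rng.binomial(b.size, rate_b, size=n_boot) / b.size
    diffs = boot_a - boot_b

    return {
        "diff_rate": float(rate_a - rate_b),
        "ci_low": float(np.quantile(diffs, 0.025)),
        "ci_high": float(np.quantile(diffs, 0.975)),
        "n_a": int(a.size),
        "n_b": int(b.size),
    }


# Toy flags: 30 refunds out of 200 vs 20 refunds out of 200 (made-up counts).
group_a = [True] * 30 + [False] * 170
group_b = [True] * 20 + [False] * 180
rate_res = bootstrap_diff_rates(group_a, group_b, n_boot=2000, seed=0)
print(rate_res)
```

Passing 0/1 values to `bootstrap_diff_means` should give a comparable interval; the binomial form only avoids the Python loop.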

Micro-exercise (7 minutes): interpret a bootstrap result

You compute:

diff_mean = +3.2
95% CI = [-0.4, +6.9]

Write one honest sentence.

Checkpoint: mention uncertainty and that zero is inside the CI.

Solution

“Group A’s mean is about 3.2 higher than Group B’s, but the bootstrap interval includes 0, so the difference is uncertain with this data (it may be small or not real).”

Writing `reports/summary.md` (template)

Your summary should include:

Key findings (3–6 bullets with numbers)
Definitions (what filters/metrics mean)
Data quality caveats (missingness, join coverage, outliers)
Next questions (what you’d check next)

Tip

If your summary can’t be read in under 2 minutes, it’s too long.

Isha break

20 minutes

When you return: we’ll build the Day 4 notebook and export your figures + summary.

Hands-on

Build: notebooks/day4_eda.ipynb → figures + summary (Plotly)

Hands-on success criteria (today)

By the end, you should have:

data/processed/orders_features.parquet (created once, reused)
notebooks/day4_eda.ipynb that runs top-to-bottom
reports/figures/ with 3–6 exports (.html required)
reports/summary.md including at least one bootstrap CI
at least 2 commits pushed to GitHub

Project layout (Day 4 focus)

data/
  processed/
    orders_enriched.parquet
    orders_features.parquet      # NEW today
notebooks/
  day4_eda.ipynb                 # NEW today
reports/
  figures/
    day4_*.html                  # NEW today (required)
    day4_*.png                   # optional (kaleido)
  summary.md                     # NEW today
scripts/
  run_day4_make_features.py      # NEW today (small, idempotent)

Task 0 — Create the notebook folder (2 minutes)

From repo root:

mkdir -p notebooks

(Windows PowerShell)

New-Item -ItemType Directory -Force notebooks | Out-Null

Task 1 — Build `orders_features.parquet` (20 minutes)

Create scripts/run_day4_make_features.py that:

reads data/processed/orders_enriched.parquet
parses created_at (UTC)
adds time parts (month, dow, hour)
flags outliers + creates amount_winsor + adds missing-value flags
writes data/processed/orders_features.parquet (idempotent)

Checkpoint: the file exists and reloading shows > 0 rows.

Task 1 — Solution: `scripts/run_day4_make_features.py`

from __future__ import annotations

from pathlib import Path

from bootcamp_data.io import read_parquet, write_parquet
from bootcamp_data.checks import require_columns, assert_non_empty
from bootcamp_data.transforms import parse_datetime, add_time_parts, winsorize, add_outlier_flag, add_missing_flags


def main() -> None:
    root = Path(__file__).resolve().parents[1]
    in_path = root / "data" / "processed" / "orders_enriched.parquet"
    out_path = root / "data" / "processed" / "orders_features.parquet"

    df = read_parquet(in_path)
    assert_non_empty(df, df_name="orders_enriched")
    require_columns(df, ["order_id", "user_id", "amount", "created_at", "country", "status_clean"], df_name="orders_enriched")

    d = df.copy()
    d = parse_datetime(d, "created_at", utc=True)
    d = add_time_parts(d, "created_at")
    d["amount_winsor"] = winsorize(d["amount"])
    d = add_outlier_flag(d, "amount", k=1.5)
    d = add_missing_flags(d, ["amount", "created_at", "country", "status_clean"])

    write_parquet(d, out_path)
    print("wrote:", out_path, "rows:", len(d))


if __name__ == "__main__":
    main()
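The Day 3 helpers imported above are not shown in this deck. If you need a reminder of what such helpers typically do, here is one plausible shape. The `_sketch` names, the 1%/99% quantile caps, and the `<col>_is_outlier` column name are all assumptions, not the actual `bootcamp_data.transforms` code.

```python
import pandas as pd


def winsorize_sketch(s: pd.Series, lo: float = 0.01, hi: float = 0.99) -> pd.Series:
    """Cap values at the lo/hi quantiles (assumed defaults)."""
    low, high = s.quantile(lo), s.quantile(hi)
    return s.clip(lower=low, upper=high)


def add_outlier_flag_sketch(df: pd.DataFrame, col: str, *, k: float = 1.5) -> pd.DataFrame:
    """Flag rows outside [Q1 - k*IQR, Q3 + k*IQR]; rows are kept, not dropped."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    is_out = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return df.assign(**{f"{col}_is_outlier": is_out})


# Toy amounts with one extreme value (made up).
toy = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
flagged = add_outlier_flag_sketch(toy, "amount", k=1.5)
flagged["amount_winsor"] = winsorize_sketch(flagged["amount"])
print(flagged)
```

Both helpers return new objects and leave the raw `amount` untouched, which is what lets the summary report totals on raw values.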

Task 1 — Run it (2 minutes)

macOS/Linux

uv run python -m scripts.run_day4_make_features

Windows PowerShell

uv run python -m scripts.run_day4_make_features

Task 2 — Create `notebooks/day4_eda.ipynb` (10 minutes)

In VS Code:

Create a new notebook: notebooks/day4_eda.ipynb
Select the interpreter from your uv environment
Add a title cell + an “inputs” cell

Note

If your notebook kernel can’t start, install ipykernel:

uv add ipykernel

Task 2 — Notebook outline (copy this structure)

Your notebook should have these sections:

Load
Audit (shape, dtypes, missingness, outlier rate)
Questions (3–6):
- table → chart → interpretation (+ caveat)
Bootstrap comparison (1 example)
Export figures + write reports/summary.md
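For the Audit section, a compact per-column table usually covers dtypes and missingness in one view. A minimal sketch (the function name `audit` and the toy rows are assumptions):

```python
import pandas as pd


def audit(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, missing percentage."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": df.isna().mean().mul(100).round(1),
        }
    ).sort_values("pct_missing", ascending=False)


# Toy frame with some gaps (values are made up).
toy = pd.DataFrame(
    {
        "amount": [10.0, None, 30.0, 40.0],
        "country": ["SA", "AE", None, None],
    }
)
print("shape:", toy.shape)
report = audit(toy)
print(report)
```

The outlier rate is then the mean of whichever boolean outlier column your features file contains.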

Task 3 — Add reusable helpers (copy into notebook)

from pathlib import Path
import pandas as pd
import plotly.express as px

ROOT = Path("..")  # notebooks/ → repo root (assumes the kernel's working directory is notebooks/)
FIGS = ROOT / "reports" / "figures"
FIGS.mkdir(parents=True, exist_ok=True)

def save_html(fig, name: str) -> Path:
    p = FIGS / f"{name}.html"
    fig.write_html(p, include_plotlyjs=True)  # True embeds plotly.js (offline-friendly)
    return p

Task 4 — Load features (copy into notebook)

df = pd.read_parquet(ROOT / "data" / "processed" / "orders_features.parquet")
df.head()

Checkpoint: you see rows, and created_at is datetime-like.

Task 5 — Your 3 “required charts” (45 minutes)

Make at least these three:

Top countries by revenue (sorted bar)
Monthly revenue trend (line)
Amount distribution (histogram; consider using amount_winsor)

Export each with save_html(fig, "day4_<name>").

Task 5 — Example: revenue by country (bar)

by_country = (
    df.groupby("country", dropna=False)["amount"]
      .sum(min_count=1)
      .rename("revenue")
      .reset_index()
      .sort_values("revenue", ascending=False)
      .head(15)
)

fig = px.bar(by_country, x="country", y="revenue", title="Revenue by country (top 15)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue (USD)")
save_html(fig, "day4_revenue_by_country")

Task 6 — Bootstrap: pick one comparison (25 minutes)

Choose one:

mean amount in top country vs 2nd country
mean amount for paid vs refund
refund rate in top country vs all others

Compute a bootstrap interval and include it as a bullet in your summary.

Task 6 — Example: bootstrap mean difference (two countries)

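import numpy as np  # used by bootstrap_diff_means; define that Session 3 template in an earlier cell

# by_country was grouped with dropna=False, so skip a missing country before taking the top 2
top2 = by_country["country"].dropna().head(2).tolist()
a = df.loc[df["country"].eq(top2[0]), "amount"]
b = df.loc[df["country"].eq(top2[1]), "amount"]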

res = bootstrap_diff_means(a, b, n_boot=2000, seed=0)
res

Interpretation rule: if the CI includes 0, the data are compatible with no difference, so report the result as uncertain.

Task 7 — Write `reports/summary.md` from the notebook (20 minutes)

Minimum content:

3–6 bullet findings (with numbers)
1 bootstrap result bullet (diff + CI + n)
caveats (missingness, join coverage, outliers)
links to exported figures

Task 7 — Example summary writer (copy into notebook)

summary = f"""# Project Summary (Day 4)

## Key findings
- Top revenue country: **{by_country.iloc[0]['country']}** (revenue={by_country.iloc[0]['revenue']:.2f})
- (Add 2–4 more bullets from your tables/charts.)
- Bootstrap mean diff ({top2[0]} - {top2[1]}): **{res['diff_mean']:.2f}** (95% CI [{res['ci_low']:.2f}, {res['ci_high']:.2f}], n={res['n_a']} vs {res['n_b']})

## Figures
- [Revenue by country](figures/day4_revenue_by_country.html)
- (Add your other figure links.)

## Caveats / assumptions
- Totals use raw `amount`; `amount_winsor` is for visualization only.
- Outliers are flagged, not removed.
- Date parsing errors become missing timestamps.
"""

out = ROOT / "reports" / "summary.md"
out.write_text(summary, encoding="utf-8")
print("wrote:", out)

Task 8 — Git checkpoint (5 minutes)

Commit your Day 4 deliverables.

git add -A
git commit -m "w2d4: plotly EDA notebook + figures + summary"
git push

Debug playbook (Day 4)

If you get stuck:

Confirm you’re in the repo root (paths matter)
Rebuild inputs: run Day 2 again (scripts.run_day2_validate_and_report)
If Plotly import fails: uv add plotly
If notebook kernel fails: uv add ipykernel
If a chart looks wrong: check the table first (row counts + sorting)
If bootstrap looks unstable: print group sizes; avoid tiny groups

Exit Ticket

In 1–2 sentences:

How do you decide between a bar chart, line chart, histogram, and scatter plot? (Answer using: question type + variable types.)

Day 4 assignment (end of day)

Due: before Day 5 starts (2025-12-25)

Notebook: notebooks/day4_eda.ipynb runs top-to-bottom
Export at least 3 Plotly figures to reports/figures/*.html
Write reports/summary.md including one bootstrap CI
Push at least 2 commits (one mid-way, one final)

Tip

If your repo ignores data/processed/, that’s fine—your scripts must recreate it reliably.

Thank You!
