Data Work (ETL + EDA)

AI Professionals Bootcamp | Week 2

2025-12-24

Day 4: EDA with Plotly — Chart Choice + Uncertainty

Goal: ship a clean EDA notebook + exported figures + a short written summary (offline‑first, reproducible).

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Chart choice (data‑to‑viz thinking) + notebook workflow rules
  • Asr Prayer (20m)
  • Session 2 (60m): Plotly Express patterns + exporting figures (HTML/PNG)
  • Maghrib Prayer (20m)
  • Session 3 (60m): Comparison thinking + bootstrap uncertainty + writing the summary
  • Isha Prayer (20m)
  • Hands-on (120m): Build notebooks/day4_eda.ipynb + export figures + write reports/summary.md

Learning Objectives

By the end of today, you can:

  • choose an appropriate visualization based on question + data types (data-to-viz style)
  • use Plotly Express as your one plotting library (and customize labels/layout)
  • export interactive figures to reports/figures/*.html (offline-friendly)
  • answer 3–6 concrete questions with tables + charts + short interpretations
  • compute a practical uncertainty estimate using a bootstrap interval
  • ship a readable reports/summary.md with findings + definitions + caveats

Warm-up (5 minutes): verify inputs exist

Day 4 EDA reads from processed data (not raw).

From repo root:

macOS/Linux

uv run python -m scripts.run_day2_validate_and_report
uv run python -c "from pathlib import Path; print(Path('data/processed/orders_enriched.parquet').exists())"

Windows PowerShell

uv run python -m scripts.run_day2_validate_and_report
uv run python -c "from pathlib import Path; print(Path('data/processed/orders_enriched.parquet').exists())"

Checkpoint: prints True.

Warm-up (5 minutes): verify Plotly is available

macOS/Linux

uv run python -c "import plotly; print('plotly ok', plotly.__version__)"

Windows PowerShell

uv run python -c "import plotly; print('plotly ok', plotly.__version__)"

Tip

If this fails, install Plotly now (required today):

uv add plotly

Quick review: Week 2 so far

  • Day 1: scaffold + typed I/O + offline-first data layout
  • Day 2: checks + cleaning + safe join → orders_enriched.parquet
  • Day 3: datetimes + outliers (flag/cap), EDA helpers (tables)

Note

Today is the “EDA handoff”: the notebook answers questions and the repo contains exported figures + a written summary.

Today’s end state (what you will commit)

  • notebooks/day4_eda.ipynb (reads from data/processed/)
  • reports/figures/ with 3–6 Plotly exports (.html required; .png optional)
  • reports/summary.md (findings + definitions + caveats)
  • data/processed/orders_features.parquet (analysis-ready; built once, reused)

Session 1

Chart choice (data‑to‑viz thinking) + notebook workflow rules

Session 1 objectives

By the end of this session, you can:

  • map a question → a chart type
  • recognize your variable types (categorical / numeric / datetime)
  • avoid the most common chart mistakes (pie, unsorted bars, missing n, dual axes)
  • explain “why notebook, not script” for EDA (and what belongs where)

Notebook workflow rules (bootcamp standard)

Notebooks are for exploration + explanation.

Rules:

  • Notebook reads from data/processed/ (never data/raw/)
  • ETL belongs in scripts/modules (idempotent), not hidden notebook cells
  • Every chart has:
    • a real title (time window + metric + segment)
    • axis labels with units
    • enough context to not mislead

Warning

If your notebook can’t be re-run top-to-bottom without manual fixes, it’s not a deliverable.

Variable types (fast mental model)

  • Categorical: country, status, product_category
  • Numeric: amount, quantity, rating
  • Datetime: created_at, event_time, month
  • Boolean: is_refund, is_outlier

You’ll choose charts based on question type + variable types.
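
A minimal sketch of reading variable types off a DataFrame, assuming pandas dtypes are a good enough proxy. The helper name variable_kind and the toy columns are illustrative, not part of the bootcamp package.

```python
import pandas as pd
from pandas.api import types as pdt


def variable_kind(s: pd.Series) -> str:
    """Map a column to one of the four kinds used in this deck."""
    # Check bool first: pandas also treats bool columns as numeric.
    if pdt.is_bool_dtype(s):
        return "boolean"
    if pdt.is_datetime64_any_dtype(s):
        return "datetime"
    if pdt.is_numeric_dtype(s):
        return "numeric"
    return "categorical"


# Toy data (made up for illustration).
toy = pd.DataFrame(
    {
        "country": ["SA", "AE", "SA"],
        "amount": [10.0, 25.5, 7.25],
        "created_at": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-02-01"], utc=True),
        "is_refund": [False, True, False],
    }
)
kinds = {col: variable_kind(toy[col]) for col in toy.columns}
print(kinds)
```

Dtypes are only a starting point: an integer ID column is "numeric" to pandas but behaves like a category in a chart.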

Chart choice: question → chart (default moves)

Use the simplest chart that answers the question:

  • Compare categories (cat → value) → sorted bar / dot plot
  • Trend over time (time → value) → line
  • Distribution (numeric) → histogram (and maybe box for quick compare)
  • Relationship (num vs num) → scatter
  • Composition (parts of whole across a dimension) → stacked bar / stacked area
  • Rates vs volume → use two charts (avoid dual axes)

Tip

Your default should be “table → bar/line/scatter/hist”. Make pies rare.

Chart choice cheat sheet (data-to-viz style)

You want to…          | Data types                | Best default           | Common trap
Compare categories    | categorical + numeric     | bar (sorted)           | unsorted bars, too many categories
See a trend           | datetime + numeric        | line                   | mixing time zones / missing dates
Understand spread     | numeric                   | histogram              | using mean only on skewed data
Compare distributions | categorical + numeric     | histogram facets / box | box plots with tiny groups
Find relationships    | numeric + numeric         | scatter                | overplotting without aggregation
Show composition      | categorical + categorical | stacked bar            | pie charts for many categories
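
The cheat sheet can be kept next to your notebook as a small lookup. This is only a sketch: the dictionary keys and the function name default_chart are hypothetical labels chosen for this example.

```python
# Hypothetical lookup mirroring the cheat sheet: question type -> default chart.
DEFAULT_CHART = {
    "compare categories": "sorted bar",
    "trend over time": "line",
    "distribution": "histogram",
    "compare distributions": "histogram facets or box",
    "relationship": "scatter",
    "composition": "stacked bar",
}


def default_chart(question_type: str) -> str:
    """Return the default chart, or a reminder to start from a table."""
    return DEFAULT_CHART.get(question_type.strip().lower(), "start with a table")


print(default_chart("Trend over time"))
```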

Micro-exercise (6 minutes): choose the chart

For each question, pick the best default chart:

  1. “Which countries generate the most revenue?”
  2. “How does revenue change month-to-month?”
  3. “Is order amount skewed or roughly normal?”
  4. “Do refunds have lower order amounts than paid orders?”

Checkpoint: answer with 4 chart choices (one each).

Solution

  1. Sorted bar (top countries by revenue)
  2. Line (monthly revenue)
  3. Histogram (amount distribution; consider log scale if heavy tail)
  4. Histogram facets or box by status_clean (plus show group sizes)

The 5 most common “EDA chart mistakes”

  1. Bars not sorted (hard to read)
  2. Missing units/time window in titles
  3. Comparing averages without showing n
  4. Dual y-axes (often misleading)
  5. Pie charts for more than ~3–4 categories

Note

Most chart quality comes from: sorting, labeling, and choosing the right level of aggregation.

EDA structure (what your notebook should look like)

A job-ready EDA notebook answers 3–6 questions:

For each question:

  1. define metric + filters
  2. compute a table
  3. chart the table (or chart the raw distribution)
  4. write 2–3 sentences: what it means + 1 caveat

Quick Check

Question: Why do we export figures to reports/figures/ instead of leaving them only inside the notebook?

Answer: figures are stable artifacts you can share, review, and reference from the written summary (and diff in Git if needed).

Asr break

20 minutes

When you return: we’ll learn the Plotly patterns we’ll use for every chart today.

Session 2

Plotly Express patterns + exporting figures (HTML/PNG)

Session 2 objectives

By the end of this session, you can:

  • build the core chart types with plotly.express
  • customize chart anatomy (titles, labels, axes)
  • export figures to .html (offline-friendly) and optionally .png
  • use long/tidy data when plotting grouped metrics

Plotly mental model (2 minutes)

  • plotly.express → fast chart creation (px.bar, px.line, …)
  • fig.update_layout(...) → titles, margins, legend, theme
  • fig.update_xaxes/yaxes(...) → axis labels, tick formatting
  • Export:
    • fig.write_html(...) ✅ works offline (recommended)
    • fig.write_image(...) ✅ requires kaleido (optional)

Install note: PNG export is optional

If you want .png exports:

uv add kaleido

If you can’t install it (offline), export .html and keep going.

Canonical Plotly patterns (you will reuse these)

import plotly.express as px

# bar (categories)
fig = px.bar(df, x="country", y="revenue", title="Revenue by country")

# line (trend)
fig = px.line(df, x="month", y="revenue", title="Monthly revenue")

# histogram (distribution)
fig = px.histogram(df, x="amount", nbins=50, title="Amount distribution")

# scatter (relationship)
fig = px.scatter(df, x="quantity", y="amount", title="Amount vs quantity")

Sorting categories (non-negotiable for readability)

Plotly draws categories in the order they appear in the data; it does not sort them by value for you.

Do this:

d = df.sort_values("revenue", ascending=False)
fig = px.bar(d, x="country", y="revenue", title="Revenue by country (sorted)")

Make charts self-contained (labels + units)

fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue (USD)")

Tip

If you copy a chart into a slide with no context, it should still make sense.

Export figures (offline-first)

Prefer HTML exports with embedded JS:

from pathlib import Path

out = Path("reports/figures")
out.mkdir(parents=True, exist_ok=True)

# include_plotlyjs=True (the documented value) embeds plotly.js, so the file opens offline
fig.write_html(out / "day4_revenue_by_country.html", include_plotlyjs=True)

Optional PNG (if Kaleido installed):

fig.write_image(out / "day4_revenue_by_country.png", scale=2)

Micro-exercise (8 minutes): fix a bad chart

You have a bar chart that:

  • is unsorted
  • has no axis labels
  • has a vague title

Write the code for the 3 fixes.

Checkpoint: you wrote: sort + a specific title + axis labels.

Solution

d = df.sort_values("revenue", ascending=False)
fig = px.bar(d, x="country", y="revenue", title="Revenue by country (Jan–Dec 2025)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue (USD)")

Tidy data reminder (for plotting)

If you want multiple metrics on the same chart or facet:

  • long format is usually easiest
  • use melt() to go wide → long

Example:

long = wide.melt(
    id_vars=["month"],
    value_vars=["revenue", "n_orders"],
    var_name="metric",
    value_name="value",
)
fig = px.line(long, x="month", y="value", color="metric", title="Metrics over time")
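
To see what melt() does to the shape, here is a small sketch on made-up numbers (the wide table and its values are illustrative only).

```python
import pandas as pd

# Wide: one row per month, one column per metric.
wide = pd.DataFrame(
    {
        "month": ["2025-01", "2025-02", "2025-03"],
        "revenue": [1200.0, 1500.0, 900.0],
        "n_orders": [40, 48, 31],
    }
)

# Long: one row per (month, metric) pair.
long = wide.melt(
    id_vars=["month"],
    value_vars=["revenue", "n_orders"],
    var_name="metric",
    value_name="value",
)
print(long)
```

Because revenue and n_orders live on very different scales, a shared y-axis can flatten the smaller metric. Passing facet_row="metric" to px.line and then calling fig.update_yaxes(matches=None) gives each metric its own axis.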

Maghrib break

20 minutes

When you return: we’ll add “comparison thinking” + bootstrap intervals so our conclusions are honest.

Session 3

Comparison thinking + bootstrap uncertainty + writing the summary

Session 3 objectives

By the end of this session, you can:

  • compute rates/ratios correctly (with denominators)
  • report effect sizes (absolute + relative)
  • build a bootstrap interval for a difference in means or rates
  • write a short, credible reports/summary.md

Comparison thinking (the EDA upgrade)

When you compare groups, always include:

  • n (sample size)
  • absolute difference (e.g., +$2.10)
  • relative difference (e.g., +8%)
  • a caveat if groups are tiny or biased
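
A minimal sketch of the checklist above as one helper. The function name compare_groups and the toy amounts are assumptions made for this example; the relative difference uses group b as the baseline.

```python
import pandas as pd


def compare_groups(df: pd.DataFrame, group_col: str, value_col: str, a: str, b: str) -> dict:
    """Return n, means, absolute and relative difference (a minus b, relative to b)."""
    stats = df.groupby(group_col)[value_col].agg(n="count", mean="mean")
    mean_a = float(stats.loc[a, "mean"])
    mean_b = float(stats.loc[b, "mean"])
    return {
        "n_a": int(stats.loc[a, "n"]),
        "n_b": int(stats.loc[b, "n"]),
        "mean_a": mean_a,
        "mean_b": mean_b,
        "abs_diff": mean_a - mean_b,
        "rel_diff": (mean_a - mean_b) / mean_b,
    }


# Toy data (made up).
toy = pd.DataFrame(
    {
        "status_clean": ["paid", "paid", "paid", "refund", "refund"],
        "amount": [30.0, 20.0, 40.0, 25.0, 15.0],
    }
)
res = compare_groups(toy, "status_clean", "amount", "paid", "refund")
print(res)
```

With groups this small (n=3 vs n=2) the numbers are only there to show the mechanics; in a real report this is exactly where the caveat belongs.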

Warning

Averages without n are how analysts accidentally lie.

Rates/ratios: always show numerator + denominator

Example: refund rate by country

  • numerator: n_refund
  • denominator: n_total
  • rate: n_refund / n_total

Your report should show all three.
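
A sketch of that table in pandas, assuming refunds are marked by status_clean == "refund" (the exact label in your data may differ). The toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["SA", "SA", "SA", "SA", "AE", "AE"],
        "status_clean": ["paid", "refund", "paid", "paid", "refund", "paid"],
    }
)

rates = (
    toy.assign(is_refund=toy["status_clean"].eq("refund"))  # assumed refund label
    .groupby("country")
    .agg(n_refund=("is_refund", "sum"), n_total=("is_refund", "size"))
    .assign(refund_rate=lambda t: t["n_refund"] / t["n_total"])
    .reset_index()
    .sort_values("refund_rate", ascending=False)
)
print(rates)
```

Keeping n_refund and n_total in the same table lets a reader see at a glance when a high rate comes from a tiny denominator.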

Bootstrap intervals (practical uncertainty)

Bootstrap answers: “If I resampled my data (with replacement) many times, how much would this estimate vary?”

Use it when:

  • comparing means/medians between groups
  • comparing rates (refund rate, conversion rate)
  • you want a rough uncertainty range without heavy theory

Keep it reproducible:

  • fix a random seed

Minimal bootstrap template (difference in means)

import numpy as np
import pandas as pd

def bootstrap_diff_means(a: pd.Series, b: pd.Series, *, n_boot: int = 2000, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    a = pd.to_numeric(a, errors="coerce").dropna().to_numpy()
    b = pd.to_numeric(b, errors="coerce").dropna().to_numpy()
    assert len(a) > 0 and len(b) > 0

    diffs = []
    for _ in range(n_boot):
        sa = rng.choice(a, size=len(a), replace=True)
        sb = rng.choice(b, size=len(b), replace=True)
        diffs.append(sa.mean() - sb.mean())

    diffs = np.array(diffs)
    return {
        "diff_mean": float(a.mean() - b.mean()),
        "ci_low": float(np.quantile(diffs, 0.025)),
        "ci_high": float(np.quantile(diffs, 0.975)),
        "n_a": int(len(a)),
        "n_b": int(len(b)),
    }
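
The same idea can be sketched for a difference in rates, assuming each group is a 0/1 array (1 = refund, for example). The name bootstrap_diff_rates and the toy groups are assumptions of this sketch, and it uses a percentile interval like the template above.

```python
import numpy as np


def bootstrap_diff_rates(a, b, *, n_boot: int = 2000, seed: int = 0) -> dict:
    """Percentile bootstrap interval for rate(a) - rate(b) on 0/1 data."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the result reproducible
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both groups need at least one row")

    # Draw all resamples at once: one row of indices per bootstrap replicate.
    ia = rng.integers(0, len(a), size=(n_boot, len(a)))
    ib = rng.integers(0, len(b), size=(n_boot, len(b)))
    diffs = a[ia].mean(axis=1) - b[ib].mean(axis=1)

    ci_low, ci_high = np.quantile(diffs, [0.025, 0.975])
    return {
        "diff_rate": float(a.mean() - b.mean()),
        "ci_low": float(ci_low),
        "ci_high": float(ci_high),
        "n_a": int(len(a)),
        "n_b": int(len(b)),
    }


# Toy groups (made up): 30% vs 10% refund rate, 100 orders each.
group_a = np.array([1] * 30 + [0] * 70)
group_b = np.array([1] * 10 + [0] * 90)
res_rates = bootstrap_diff_rates(group_a, group_b, seed=0)
print(res_rates)
```

Drawing every resample at once is fast for small groups but uses n_boot × n memory; for very large groups, fall back to a loop or lower n_boot.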

Micro-exercise (7 minutes): interpret a bootstrap result

You compute:

  • diff_mean = +3.2
  • 95% CI = [-0.4, +6.9]

Write one honest sentence.

Checkpoint: mention uncertainty and that zero is inside the CI.

Solution

“Group A’s mean is about 3.2 higher than Group B’s, but the 95% bootstrap interval [-0.4, +6.9] includes 0, so with this data the difference is uncertain (it may be small or absent).”
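
If you report several comparisons, a tiny formatter keeps the wording consistent. This is a sketch only: describe_diff and its phrasing are hypothetical, and the check is simply whether 0 lies inside the interval.

```python
def describe_diff(diff: float, ci_low: float, ci_high: float) -> str:
    """Turn a bootstrap result into one cautious sentence."""
    if ci_low <= 0 <= ci_high:
        verdict = "the interval includes 0, so the difference is uncertain with this data"
    else:
        verdict = "the interval excludes 0, so the direction of the difference looks stable"
    return (
        f"Difference in means: {diff:+.1f} "
        f"(95% bootstrap CI [{ci_low:+.1f}, {ci_high:+.1f}]); {verdict}."
    )


sentence = describe_diff(3.2, -0.4, 6.9)
print(sentence)
```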

Writing reports/summary.md (template)

Your summary should include:

  1. Key findings (3–6 bullets with numbers)
  2. Definitions (what filters/metrics mean)
  3. Data quality caveats (missingness, join coverage, outliers)
  4. Next questions (what you’d check next)

Tip

If your summary can’t be read in under 2 minutes, it’s too long.

Isha break

20 minutes

When you return: we’ll build the Day 4 notebook and export your figures + summary.

Hands-on

Build: notebooks/day4_eda.ipynb → figures + summary (Plotly)

Hands-on success criteria (today)

By the end, you should have:

  • data/processed/orders_features.parquet (created once, reused)
  • notebooks/day4_eda.ipynb that runs top-to-bottom
  • reports/figures/ with 3–6 exports (.html required)
  • reports/summary.md including at least one bootstrap CI
  • at least 2 commits pushed to GitHub

Project layout (Day 4 focus)

data/
  processed/
    orders_enriched.parquet
    orders_features.parquet      # NEW today
notebooks/
  day4_eda.ipynb                 # NEW today
reports/
  figures/
    day4_*.html                  # NEW today (required)
    day4_*.png                   # optional (kaleido)
  summary.md                     # NEW today
scripts/
  run_day4_make_features.py      # NEW today (small, idempotent)

Task 0 — Create the notebook folder (2 minutes)

From repo root:

mkdir -p notebooks

(Windows PowerShell)

New-Item -ItemType Directory -Force notebooks | Out-Null

Task 1 — Build orders_features.parquet (20 minutes)

Create scripts/run_day4_make_features.py that:

  1. reads data/processed/orders_enriched.parquet
  2. parses created_at (UTC)
  3. adds time parts (month, dow, hour)
  4. flags outliers + creates amount_winsor
  5. adds missing-value flags for key columns
  6. writes data/processed/orders_features.parquet (idempotent)

Checkpoint: the file exists and reloading shows > 0 rows.

Task 1 — Solution: scripts/run_day4_make_features.py

from __future__ import annotations

from pathlib import Path
import pandas as pd

from bootcamp_data.io import read_parquet, write_parquet
from bootcamp_data.checks import require_columns, assert_non_empty
from bootcamp_data.transforms import parse_datetime, add_time_parts, winsorize, add_outlier_flag, add_missing_flags


def main() -> None:
    root = Path(__file__).resolve().parents[1]
    in_path = root / "data" / "processed" / "orders_enriched.parquet"
    out_path = root / "data" / "processed" / "orders_features.parquet"

    df = read_parquet(in_path)
    assert_non_empty(df, df_name="orders_enriched")
    require_columns(df, ["order_id", "user_id", "amount", "created_at", "country", "status_clean"], df_name="orders_enriched")

    d = df.copy()
    d = parse_datetime(d, "created_at", utc=True)
    d = add_time_parts(d, "created_at")
    d["amount_winsor"] = winsorize(d["amount"])
    d = add_outlier_flag(d, "amount", k=1.5)
    d = add_missing_flags(d, ["amount", "created_at", "country", "status_clean"])

    write_parquet(d, out_path)
    print("wrote:", out_path, "rows:", len(d))


if __name__ == "__main__":
    main()
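
The helpers imported from bootcamp_data.transforms come from earlier days and are not shown here. As a rough sketch of what such helpers might do, the stand-ins below are assumptions throughout: the default quantiles, the output column names (including amount__is_outlier), and the toy data are all invented for illustration and may differ from your package.

```python
import pandas as pd


def add_time_parts(df: pd.DataFrame, ts_col: str) -> pd.DataFrame:
    """Assumed behavior: add month (YYYY-MM), day of week, and hour columns."""
    ts = df[ts_col]
    return df.assign(month=ts.dt.strftime("%Y-%m"), dow=ts.dt.day_name(), hour=ts.dt.hour)


def winsorize(s: pd.Series, lo: float = 0.01, hi: float = 0.99) -> pd.Series:
    """Assumed behavior: cap values at the lo/hi quantiles."""
    return s.clip(lower=s.quantile(lo), upper=s.quantile(hi))


def add_outlier_flag(df: pd.DataFrame, col: str, *, k: float = 1.5) -> pd.DataFrame:
    """Assumed behavior: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return df.assign(**{f"{col}__is_outlier": outside})


# Toy data (made up): one obviously extreme amount.
toy = pd.DataFrame(
    {
        "created_at": pd.to_datetime(["2025-01-05 09:30"] * 5, utc=True),
        "amount": [10.0, 12.0, 11.0, 13.0, 500.0],
    }
)
feat = add_time_parts(toy, "created_at")
feat["amount_winsor"] = winsorize(feat["amount"])
feat = add_outlier_flag(feat, "amount", k=1.5)
print(feat)
```

Flagging keeps the extreme row visible in the data, while amount_winsor gives charts a capped column to plot.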

Task 1 — Run it (2 minutes)

macOS/Linux

uv run python -m scripts.run_day4_make_features

Windows PowerShell

uv run python -m scripts.run_day4_make_features

Task 2 — Create notebooks/day4_eda.ipynb (10 minutes)

In VS Code:

  1. Create a new notebook: notebooks/day4_eda.ipynb
  2. Select the interpreter from your uv environment
  3. Add a title cell + an “inputs” cell

Note

If your notebook kernel can’t start, install ipykernel:

uv add ipykernel

Task 2 — Notebook outline (copy this structure)

Your notebook should have these sections:

  1. Load
  2. Audit (shape, dtypes, missingness, outlier rate)
  3. Questions (3–6):
    • table → chart → interpretation (+ caveat)
  4. Bootstrap comparison (1 example)
  5. Export figures + write reports/summary.md

Task 3 — Add reusable helpers (copy into notebook)

from pathlib import Path
import pandas as pd
import plotly.express as px

ROOT = Path("..")  # notebooks/ → repo root
FIGS = ROOT / "reports" / "figures"
FIGS.mkdir(parents=True, exist_ok=True)

def save_html(fig, name: str) -> Path:
    p = FIGS / f"{name}.html"
    fig.write_html(p, include_plotlyjs=True)  # True embeds plotly.js so the file opens offline
    return p

Task 4 — Load features (copy into notebook)

df = pd.read_parquet(ROOT / "data" / "processed" / "orders_features.parquet")
df.head()

Checkpoint: you see rows, and created_at is datetime-like.
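
For the Audit section of the notebook, a minimal sketch could look like this. The helper name audit and the toy frame are assumptions; in the notebook you would call it on the loaded df instead.

```python
import pandas as pd


def audit(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, missing percentage."""
    report = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(1),
        }
    )
    return report.sort_values("n_missing", ascending=False)


# Toy frame (made up) standing in for the loaded features.
toy = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "amount": [10.0, None, 7.5, 20.0],
        "country": ["SA", "AE", "SA", "EG"],
    }
)
print("shape:", toy.shape)
report = audit(toy)
print(report)
```

If your features file carries a boolean outlier flag, its mean gives the outlier rate directly, which covers the last item of the audit.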

Task 5 — Your 3 “required charts” (45 minutes)

Make at least these three:

  1. Top countries by revenue (sorted bar)
  2. Monthly revenue trend (line)
  3. Amount distribution (histogram; consider using amount_winsor)

Export each with save_html(fig, "day4_<name>").

Task 5 — Example: revenue by country (bar)

by_country = (
    df.groupby("country", dropna=False)["amount"]
      .sum(min_count=1)
      .rename("revenue")
      .reset_index()
      .sort_values("revenue", ascending=False)
      .head(15)
)

fig = px.bar(by_country, x="country", y="revenue", title="Revenue by country (top 15)")
fig.update_layout(title={"x": 0.02})
fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Revenue (USD)")
save_html(fig, "day4_revenue_by_country")
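
The monthly revenue table for the second required chart can be sketched the same way. This example derives the month from created_at on toy data; if your features file already has a month column, group by that instead.

```python
import pandas as pd

# Toy orders (made up).
toy = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2025-01-03", "2025-01-20", "2025-02-11", "2025-03-02", "2025-03-30"], utc=True
        ),
        "amount": [10.0, 15.0, 7.5, 20.0, 5.0],
    }
)

by_month = (
    toy.dropna(subset=["created_at"])  # rows with unparsed dates cannot be placed on a timeline
    .assign(month=lambda t: t["created_at"].dt.strftime("%Y-%m"))
    .groupby("month")
    .agg(revenue=("amount", "sum"), n_orders=("amount", "size"))
    .reset_index()
    .sort_values("month")
)
print(by_month)
```

From here, px.line(by_month, x="month", y="revenue", ...) plus save_html(fig, "day4_monthly_revenue") follows the bar example. Keeping n_orders in the table makes it easy to spot months with very few orders.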

Task 6 — Bootstrap: pick one comparison (25 minutes)

Choose one:

  • mean amount in top country vs 2nd country
  • mean amount for paid vs refund
  • refund rate in top country vs all others

Compute a bootstrap interval and include it as a bullet in your summary.

Task 6 — Example: bootstrap mean difference (two countries)

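import numpy as np  # used inside bootstrap_diff_means

# Requires bootstrap_diff_means (Session 3 template) to be defined in an earlier cell.
# dropna() skips a missing-country group, which by_country keeps (dropna=False).
top2 = by_country["country"].dropna().head(2).tolist()
a = df.loc[df["country"].eq(top2[0]), "amount"]
b = df.loc[df["country"].eq(top2[1]), "amount"]

res = bootstrap_diff_means(a, b, n_boot=2000, seed=0)
res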

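Interpretation rule: if the CI includes 0, the data are also consistent with no difference, so report the estimate as uncertain rather than as a finding.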

Task 7 — Write reports/summary.md from the notebook (20 minutes)

Minimum content:

  • 3–6 bullet findings (with numbers)
  • 1 bootstrap result bullet (diff + CI + n)
  • caveats (missingness, join coverage, outliers)
  • links to exported figures

Task 7 — Example summary writer (copy into notebook)

summary = f"""# Project Summary (Day 4)

## Key findings
- Top revenue country: **{by_country.iloc[0]['country']}** (revenue={by_country.iloc[0]['revenue']:.2f})
- (Add 2–4 more bullets from your tables/charts.)
- Bootstrap mean diff ({top2[0]} - {top2[1]}): **{res['diff_mean']:.2f}** (95% CI [{res['ci_low']:.2f}, {res['ci_high']:.2f}], n={res['n_a']} vs {res['n_b']})

## Figures
- [Revenue by country](figures/day4_revenue_by_country.html)
- (Add your other figure links.)

## Caveats / assumptions
- Totals use raw `amount`; `amount_winsor` is for visualization only.
- Outliers are flagged, not removed.
- Date parsing errors become missing timestamps.
"""

out = ROOT / "reports" / "summary.md"
out.write_text(summary, encoding="utf-8")
print("wrote:", out)

Task 8 — Git checkpoint (5 minutes)

Commit your Day 4 deliverables.

git add -A
git commit -m "w2d4: plotly EDA notebook + figures + summary"
git push

Debug playbook (Day 4)

If you get stuck:

  1. Confirm you’re in the repo root (paths matter)
  2. Rebuild inputs: run Day 2 again (scripts.run_day2_validate_and_report)
  3. If Plotly import fails: uv add plotly
  4. If notebook kernel fails: uv add ipykernel
  5. If a chart looks wrong: check the table first (row counts + sorting)
  6. If bootstrap looks unstable: print group sizes; avoid tiny groups
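
The first few checks can be scripted. The sketch below is a hypothetical preflight helper (not part of the bootcamp package) that reports missing input files and missing modules, using only the standard library.

```python
from importlib.util import find_spec
from pathlib import Path


def preflight(
    root: Path,
    required_files: tuple = ("data/processed/orders_enriched.parquet",),
    required_modules: tuple = ("pandas", "plotly"),
) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for rel in required_files:
        if not (Path(root) / rel).exists():
            problems.append(f"missing file: {rel}")
    for mod in required_modules:
        if find_spec(mod) is None:  # None means the module cannot be imported
            problems.append(f"missing module: {mod} (try: uv add {mod})")
    return problems


# Run from the repo root; prints an empty list when everything is in place.
print(preflight(Path(".")))
```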

Exit Ticket

In 1–2 sentences:

How do you decide between a bar chart, line chart, histogram, and scatter plot? (Answer using: question type + variable types.)

Day 4 assignment (end of day)

Due: before Day 5 starts (2025-12-25)

  1. Notebook: notebooks/day4_eda.ipynb runs top-to-bottom
  2. Export at least 3 Plotly figures to reports/figures/*.html
  3. Write reports/summary.md including one bootstrap CI
  4. Push at least 2 commits (one mid-way, one final)

Tip

If your repo ignores data/processed/, that’s fine—your scripts must recreate it reliably.

Thank You!