Data Work (ETL + EDA)

AI Professionals Bootcamp | Week 2

2025-12-23

Day 3: Transform → Feature Table → EDA Tables (Tidy data + time + outliers)

Goal: turn Day 2’s processed outputs into an analysis-ready feature table (datetimes + time parts + outlier flags) and a rerunnable EDA tables report you’ll use for Plotly charts on Day 4.

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Datetimes that don’t lie (parse → measure → time parts)
  • Asr Prayer (20m)
  • Session 2 (60m): Outliers + robust summaries (flag, cap for viz, don’t delete)
  • Maghrib Prayer (20m)
  • Session 3 (60m): Tidy data + reshape (wide ↔︎ long with melt / pivot_table)
  • Isha Prayer (20m)
  • Hands-on (120m): Build data/processed/orders_features.parquet + reports/day3_eda_tables.md (+ long/wide samples for tomorrow)

Learning Objectives

By the end of today, you can:

  • run Day 2 and verify inputs before analysis
  • parse a datetime safely with pd.to_datetime(..., errors="coerce", utc=True) and measure invalids
  • derive time parts (month, dow, hour) for grouping
  • detect outliers with percentiles / IQR and choose a safe policy (flag + optional cap for viz)
  • produce robust numeric summaries (mean vs median, quantiles) and interpret skew
  • reshape tables safely: wide ↔︎ long with melt() / pivot_table()
  • write data/processed/orders_features.parquet + reports/day3_eda_tables.md from a script you can rerun

Bridge: Week 1 habits that make Week 2 work

  • Use uv to stay reproducible: uv add ... and uv run ...

  • Run entrypoints as modules from repo root (imports stay stable)

  • Write artifacts you can commit/review:

    • Parquet in data/processed/
    • Markdown in reports/

Quick review: what you already built (Day 1–2)

Today we build on these artifacts:

  • data/processed/orders_clean.parquet
  • data/processed/orders_enriched.parquet
  • reports/day2_summary.md

Note

Week 2 is offline-first: we do not fetch anything from the internet today.

Warm-up (5 minutes)

Run Day 2 and confirm the outputs exist.

macOS/Linux

uv run python -m scripts.run_day2_validate_and_report
ls -la data/processed | sed -n '1,40p'
ls -la reports | sed -n '1,40p'

Windows PowerShell

uv run python -m scripts.run_day2_validate_and_report
Get-ChildItem data\processed | Select-Object -First 20
Get-ChildItem reports | Select-Object -First 20

Checkpoint: these files exist:

  • data/processed/orders_clean.parquet
  • data/processed/orders_enriched.parquet
  • reports/day2_summary.md

If warm-up fails: 3 fast fixes

  1. You’re not in repo root

    • You should see folders like bootcamp_data/, scripts/, data/, reports/
  2. You ran by file path instead of as a module

    • Use: uv run python -m scripts.run_day2_validate_and_report
    • Not: python scripts/run_day2_validate_and_report.py
  3. Your environment is missing deps

    • uv add pandas pyarrow httpx

Warning

Do not “fix imports” by editing sys.path or setting PYTHONPATH.

Session 1

Datetimes that don’t lie

Session 1 objectives

By the end of this session, you can:

  • explain why datetime parsing is a correctness problem
  • parse timestamps safely with errors="coerce" and utc=True
  • measure invalid timestamps (after parsing)
  • add time parts for grouping (month / day-of-week / hour)

Why datetimes break analytics (silently)

If time is wrong, you get wrong:

  • trends and seasonality
  • “before vs after” comparisons
  • “most recent” sorting
  • cohorts and retention

Datetime bugs often don’t crash. They quietly lie.

New habit: parse → measure → proceed

When you parse time:

  1. Parse with a safe rule
  2. Count how many values became missing
  3. Decide what to do (fix, drop, flag)

Tip

“I parsed it” is not enough. You need the missing-after-parse count.

pd.to_datetime safe defaults

Good Week 2 default:

  • errors="coerce" → invalid strings become missing values (measurable)
  • utc=True → timestamps become UTC-aware (avoids mixing time zones)

Warning

If your raw timestamps have mixed formats (e.g., 03/04/2025), you need a plan. Don’t guess.
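If you do know the format, say it explicitly instead of letting pandas guess. A minimal sketch (the example strings here are made up, not from our dataset):

```python
import pandas as pd

# Hypothetical column where every value follows one known day-first format.
raw = pd.Series(["03/04/2025", "15/04/2025", "oops"])

# Being explicit beats guessing: state the format, coerce failures, measure.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce", utc=True)
print("n_invalid:", int(parsed.isna().sum()))  # → n_invalid: 1
```

With `format="%d/%m/%Y"`, `03/04/2025` is unambiguously April 3, and anything that does not match becomes a countable missing value.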

Micro-exercise: parse created_at and count invalids (6 minutes)

  1. Load data/processed/orders_enriched.parquet

  2. Parse created_at with errors="coerce" and utc=True

  3. Print:

    • dtype of created_at
    • number of missing timestamps after parsing

Checkpoint: dtype is datetime-like and you can print n_missing_created_at.

Solution (example)

import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")
dt = pd.to_datetime(df["created_at"], errors="coerce", utc=True)
df = df.assign(created_at=dt)

print("dtype:", df["created_at"].dtype)
print("n_missing_created_at:", int(df["created_at"].isna().sum()))

Time parts you’ll actually use

Time parts turn a timestamp into group keys:

  • month (trend)
  • dow (day-of-week)
  • hour (hour-of-day)

Note

.dt only works after parsing to a datetime type.

Micro-exercise: add time parts + group (7 minutes)

  1. Start from the parsed created_at

  2. Add:

    • month = created_at.dt.to_period("M").astype("string")
    • dow = created_at.dt.day_name()
    • hour = created_at.dt.hour
  3. Compute orders_per_month (count of rows per month)

Checkpoint: you can print a small month table sorted by month.

Solution (example)

orders = (
    df
    .assign(
        month=df["created_at"].dt.to_period("M").astype("string"),
        dow=df["created_at"].dt.day_name(),
        hour=df["created_at"].dt.hour,
    )
)

orders_per_month = (
    orders.groupby("month", dropna=False)
          .size()
          .reset_index(name="n_orders")
          .sort_values("month")
)
print(orders_per_month)

New pattern: .pipe() makes transforms readable

.pipe(fn, ...) lets you apply a function in a clean top-to-bottom flow.

orders = (
    df
    .pipe(parse_created_at)
    .pipe(add_time_parts)
)

We’ll use this in hands-on when we move parsing logic into bootcamp_data/transforms.py.

Micro-exercise: narrate a pipeline (3 minutes)

In one sentence each, explain what these steps do:

  1. parse
  2. add time parts
  3. group and summarize

Checkpoint: you can explain the pipeline without reading code line-by-line.

Solution: a good narration

  • Parse: convert time-like strings into real timestamps and measure invalids
  • Time parts: create group keys like month/day/hour
  • Summarize: aggregate counts/totals by those group keys

Session 1 recap

  • Datetimes are a correctness problem, not a formatting problem
  • Default: errors="coerce" + measure missing-after-parse
  • Add time parts to create useful group keys
  • Use .pipe() to keep transforms readable (df → df)

Asr break

20 minutes

When you return: we’ll handle outliers without corrupting the story.

Session 2

Outliers + robust summaries

Session 2 objectives

By the end of this session, you can:

  • explain why outliers distort averages and charts
  • compute percentile summaries (p50/p90/p99)
  • compute IQR bounds and flag outliers
  • pick a safe policy: flag + (optional) cap for viz

Outliers: not “delete”, but “decide”

Outliers might be:

  • data entry mistakes
  • genuine rare events (VIP purchase)
  • fraud / abuse

Rule for Week 2: don’t delete silently. Flag first.

Percentiles = quick reality check

Useful first look for a numeric column:

  • p50 (median) = “typical”
  • p90 = high but not extreme
  • p99 = top 1% threshold

If p99 is wildly bigger than p50, you may have a heavy tail or errors.

Micro-exercise: compute p50/p90/p99 for amount (6 minutes)

  1. Load orders_enriched.parquet
  2. Drop missing amount
  3. Print p50/p90/p99

Checkpoint: you can explain what p99 means in one sentence.

Solution (example)

import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")
s = df["amount"].dropna()
print(s.quantile([0.5, 0.9, 0.99]))

Robust summary > single average

If data is skewed (common in money), the mean can be misleading.

For numeric EDA, always include:

  • n (non-missing)
  • mean and median
  • p25 / p75 (IQR) and a “tail” percentile (p90 or p95)
  • min / max (sanity)

Tip

If mean ≫ median, your “average” is being pulled by a small number of large values.

Micro-exercise: build a numeric summary table for amount (7 minutes)

Create a small summary table with rows:

n, mean, median, p25, p75, p90, min, max

Checkpoint: you can explain “median vs mean” in one sentence.

Solution (example)

import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")
s = pd.to_numeric(df["amount"], errors="coerce")

summary = pd.DataFrame(
    {
        "metric": ["n", "mean", "median", "p25", "p75", "p90", "min", "max"],
        "value": [
            int(s.notna().sum()),
            float(s.mean()),
            float(s.median()),
            float(s.quantile(0.25)),
            float(s.quantile(0.75)),
            float(s.quantile(0.90)),
            float(s.min()),
            float(s.max()),
        ],
    }
)
print(summary)

IQR bounds = simple outlier flagging

IQR method:

  • Q1 = 25th percentile
  • Q3 = 75th percentile
  • IQR = Q3 − Q1
  • bounds = [Q1 − 1.5×IQR, Q3 + 1.5×IQR]

We use bounds to flag unusual values.

Micro-exercise: add an outlier flag + compute the rate (7 minutes)

  1. Compute IQR bounds for amount
  2. Create amount__is_outlier
  3. Print outlier rate (mean of the boolean)

Checkpoint: you can print outlier_rate as a number between 0 and 1.

Solution (example)

import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")
x = df["amount"].dropna()
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lo, hi = float(q1 - 1.5 * iqr), float(q3 + 1.5 * iqr)

df = df.assign(amount__is_outlier=(df["amount"] < lo) | (df["amount"] > hi))
print("bounds:", lo, hi)
print("outlier_rate:", float(df["amount__is_outlier"].mean()))

Capping for charts: winsorized values

Sometimes you want readable charts without deleting rows.

Winsorization (simple version):

  • cap values to [p1, p99]
  • keep all rows

Warning

Use capped values for visualization. Use raw values for final totals unless you have a business rule.

Micro-exercise: compare top countries (raw vs winsor) (8 minutes)

  1. Create amount_winsor by capping to p1/p99

  2. Compute top 5 countries by revenue using:

    • raw amount
    • amount_winsor
  3. Compare: do the top 5 change?

Checkpoint: you can answer: “Does capping change the story?”

Solution (example)

import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")
s = df["amount"].dropna()
a, b = s.quantile(0.01), s.quantile(0.99)
df = df.assign(amount_winsor=df["amount"].clip(lower=a, upper=b))

raw = (
    df.groupby("country", dropna=False)["amount"]
      .sum()
      .sort_values(ascending=False)
      .head(5)
)
cap = (
    df.groupby("country", dropna=False)["amount_winsor"]
      .sum()
      .sort_values(ascending=False)
      .head(5)
)

print("raw top 5:\n", raw)
print("\ncapped top 5:\n", cap)

Session 2 recap

  • Outliers are a decision: flag first
  • Percentiles tell you “typical vs extreme”
  • IQR gives a simple outlier rule
  • Winsorization helps charts without deleting rows

Maghrib break

20 minutes

When you return: we’ll reshape tables between wide and long so Day 4’s charts and reports come easily.

Session 3

Tidy data + reshape (wide ↔︎ long with melt/pivot)

Session 3 objectives

By the end of this session, you can:

  • explain “tidy data” in one minute (variables / observations / units)
  • build a monthly summary table with multiple metrics (n_orders, revenue, AOV)
  • reshape wide → long with melt() (plotting-friendly)
  • reshape long → wide with pivot_table() (report-friendly)
  • avoid accidental aggregation when reshaping

Tidy data mental model (the 10-second version)

A dataset is “tidy” when:

  • each variable is a column
  • each observation is a row
  • each observational unit is its own table

Why you care: groupby + Plotly become much easier.
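A tiny before/after sketch (the countries and counts are invented for illustration):

```python
import pandas as pd

# Hypothetical untidy table: one column per year, so "year" is trapped in headers.
untidy = pd.DataFrame({"country": ["SA", "AE"], "2024": [10, 7], "2025": [12, 9]})

# Tidy version: year becomes a variable (its own column), each row one observation.
tidy = untidy.melt(id_vars=["country"], var_name="year", value_name="n_orders")
print(tidy)
```

Grouping by `year` or plotting `year` on the x-axis is now a one-liner, because `year` is a real column instead of a header.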

Long vs wide: when to use which

Long (tidy):

  • best for plotting (Plotly) and flexible grouping
  • columns like: month, metric, value

Wide:

  • best for human-readable tables / CSV exports
  • columns like: month, n_orders, revenue, aov

Rule of thumb: store features as normal tables; reshape for viz/reporting.

Example: monthly metrics table (wide)

We want one row per month:

  • n_orders
  • revenue
  • aov (mean order amount)

This is perfect for a Markdown report.

Micro-exercise: build a monthly metrics table (7 minutes)

  1. Start from your transformed DataFrame with a month column

  2. Group by month

  3. Compute:

    • n_orders
    • revenue
    • aov (mean amount)

Checkpoint: you can print a month table sorted by month.

Solution (example)

monthly = (
    df.groupby("month", dropna=False)
      .agg(
          n_orders=("order_id", "size"),
          revenue=("amount", "sum"),
          aov=("amount", "mean"),
      )
      .reset_index()
      .sort_values("month")
)
print(monthly.head(12))

melt() = wide → long (plotting-friendly)

melt turns metric columns into two columns:

  • metric (a category)
  • value (a numeric value)

This is the format Plotly loves: x = month, y = value, color = metric.

Micro-exercise: melt monthly metrics to long (6 minutes)

  1. Start from monthly (wide)

  2. Convert to long with:

    • id_vars=["month"]
    • value_vars=["n_orders","revenue","aov"]

Checkpoint: your long table has columns: month, metric, value.

Solution (example)

monthly_long = monthly.melt(
    id_vars=["month"],
    value_vars=["n_orders", "revenue", "aov"],
    var_name="metric",
    value_name="value",
)
print(monthly_long.head(10))

pivot_table() = long → wide (report-friendly)

pivot_table is how you go back to wide.

Use it when:

  • you want a “spreadsheet-like” table
  • you are sure the pivot keys uniquely identify rows (or you choose an explicit aggregation)

Micro-exercise: pivot long → wide (5 minutes)

  1. Start from monthly_long
  2. Create a wide version again
  3. Confirm it has one row per month

Checkpoint: row count equals number of distinct months.

Solution (example)

monthly_wide = (
    monthly_long.pivot_table(
        index=["month"],
        columns="metric",
        values="value",
        aggfunc="first",  # safe because (month, metric) is unique here
    )
    .reset_index()
)
print(monthly_wide.head(12))

Pitfall: “pivot” can hide accidental aggregation

If your keys are not unique, pivot_table silently aggregates the duplicates (default aggfunc="mean"). No error, no warning.

Safer habits:

  • aggregate first (groupby → tidy summary), then pivot
  • verify uniqueness (or expected row counts)
  • choose aggfunc intentionally (never rely on defaults)
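You can see the pitfall in a few lines. A minimal sketch with invented numbers and a deliberately duplicated key:

```python
import pandas as pd

# Hypothetical long table where ("2025-01", "revenue") appears TWICE.
long_df = pd.DataFrame(
    {
        "month": ["2025-01", "2025-01", "2025-02"],
        "metric": ["revenue", "revenue", "revenue"],
        "value": [100.0, 50.0, 80.0],
    }
)

# Habit: verify key uniqueness before pivoting.
n_dupes = int(long_df.duplicated(subset=["month", "metric"]).sum())
print("duplicate keys:", n_dupes)  # → 1

# pivot_table does not complain: it quietly averages the duplicates.
wide = long_df.pivot_table(index="month", columns="metric", values="value")
print(wide)  # 2025-01 revenue becomes 75.0, which is neither original value
```

If the duplicate count is nonzero, aggregate first with groupby so the aggregation is a choice you made, not a default you inherited.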

End-of-day artifacts (to use tomorrow)

Today we will write:

  • data/processed/orders_features.parquet (analysis-ready features)
  • reports/day3_eda_tables.md (tables + caveats)
  • reports/day3_monthly_metrics_long.csv (plot-ready long form)
  • optional: reports/day3_monthly_metrics_wide.csv (wide form for review)

Session 3 recap

  • Long = plotting-friendly, wide = report-friendly
  • melt() and pivot_table() are the bridge
  • Tomorrow’s charts become easy when today’s tables are tidy

Isha break

20 minutes

When you return: we’ll build orders_features.parquet + day3_eda_tables.md + tidy reshape artifacts end-to-end.

Hands-on

Build: feature table + tidy reshape artifacts + EDA tables report

Hands-on success criteria (today)

By the end, you should have:

  • data/processed/orders_features.parquet (NEW)
  • bootcamp_data/eda.py (updated helpers)
  • updated bootcamp_data/transforms.py with datetime + outlier helpers
  • scripts/run_day3_features_and_tables.py (NEW runner)
  • reports/day3_eda_tables.md
  • reports/day3_monthly_metrics_long.csv (plot-ready long form)
  • optional: reports/day3_monthly_metrics_wide.csv (wide form for review)

Project layout (after today)

bootcamp_data/
  config.py
  checks.py
  transforms.py
  eda.py
scripts/
  run_day1_load.py
  run_day2_validate_and_report.py
  run_day3_features_and_tables.py     # NEW today
data/processed/
  orders_enriched.parquet
  orders_features.parquet            # NEW today
reports/
  day2_summary.md
  day3_eda_tables.md                 # NEW today
  day3_monthly_metrics_long.csv      # NEW today
  day3_monthly_metrics_wide.csv      # optional

Task 0 — Confirm Day 2 artifacts (5 minutes)

From repo root, confirm these exist:

  • data/processed/orders_enriched.parquet
  • reports/day2_summary.md

macOS/Linux

ls -la data/processed | sed -n '1,40p'
ls -la reports | sed -n '1,40p'

Windows PowerShell

Get-ChildItem data\processed | Select-Object -First 20
Get-ChildItem reports | Select-Object -First 20

Task 1 — Upgrade bootcamp_data/eda.py (25 minutes)

In bootcamp_data/eda.py, implement:

  • md_table(df, max_rows=10) → DataFrame → Markdown table string
  • top_k_table(df, group_col, value_col, k=10) → grouped totals table
  • describe_numeric(df, col) → robust numeric summary (mean/median/quantiles)
  • summary_metrics(df) → top-line metrics (rows, revenue, missing timestamps, match rate, outlier rate)
  • write_markdown_report(path, title, sections) → write a Markdown report from sections

Checkpoint:

uv run python -c "from bootcamp_data.eda import describe_numeric, top_k_table; print('eda ok')"

Solution — bootcamp_data/eda.py (complete)

from __future__ import annotations

from pathlib import Path

import pandas as pd


def md_table(df: pd.DataFrame, max_rows: int = 10) -> str:
    view = df.head(max_rows).copy().astype("string").fillna("")
    cols = list(view.columns)
    header = "| " + " | ".join(cols) + " |"
    sep = "| " + " | ".join(["---"] * len(cols)) + " |"
    rows = ["| " + " | ".join(row) + " |" for row in view.itertuples(index=False, name=None)]
    tail = "" if len(df) <= max_rows else f"\n\n_Showing first {max_rows} of {len(df)} rows._\n"
    return "\n".join([header, sep, *rows]) + tail


def top_k_table(df: pd.DataFrame, group_col: str, value_col: str, k: int = 10) -> pd.DataFrame:
    return (
        df.groupby(group_col, dropna=False)[value_col]
          .agg(n="size", total="sum")
          .reset_index()
          .sort_values("total", ascending=False)
          .head(k)
    )


def describe_numeric(df: pd.DataFrame, col: str) -> pd.DataFrame:
    s = pd.to_numeric(df[col], errors="coerce")
    return pd.DataFrame(
        {
            "metric": ["n", "mean", "median", "p25", "p75", "p90", "min", "max"],
            "value": [
                int(s.notna().sum()),
                float(s.mean()),
                float(s.median()),
                float(s.quantile(0.25)),
                float(s.quantile(0.75)),
                float(s.quantile(0.90)),
                float(s.min()),
                float(s.max()),
            ],
        }
    )


def summary_metrics(df: pd.DataFrame) -> pd.DataFrame:
    rows = len(df)
    revenue = df["amount"].sum(min_count=1) if "amount" in df.columns else None
    missing_ts = df["created_at"].isna().sum() if "created_at" in df.columns else None

    match_rate = None
    if "user_found" in df.columns:
        match_rate = float(df["user_found"].mean())
    elif "country" in df.columns:
        match_rate = 1.0 - float(df["country"].isna().mean())

    outlier_rate = float(df["amount__is_outlier"].mean()) if "amount__is_outlier" in df.columns else None

    out = pd.DataFrame(
        {
            "metric": ["rows", "revenue_sum", "missing_created_at", "user_match_rate", "outlier_rate"],
            "value": [rows, revenue, missing_ts, match_rate, outlier_rate],
        }
    )
    return out.dropna(subset=["value"])


def write_markdown_report(path: Path, title: str, sections: list[tuple[str, str]]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    parts = [f"# {title}\n"]
    for h, body in sections:
        parts.append(f"## {h}\n\n{body}\n")
    path.write_text("\n".join(parts), encoding="utf-8")

Task 2 — Add datetime + outlier transforms (20 minutes)

In bootcamp_data/transforms.py, add:

  • parse_datetime(df, col, utc=True)
  • add_time_parts(df, ts_col) → month/dow/hour
  • iqr_bounds(s, k=1.5)
  • winsorize(s, lo=0.01, hi=0.99)
  • add_outlier_flag(df, col, k=1.5)

Checkpoint:

uv run python -c "from bootcamp_data.transforms import add_time_parts, winsorize; print('transforms ok')"

Solution — transforms (datetime + outliers)

import pandas as pd


def parse_datetime(df: pd.DataFrame, col: str, *, utc: bool = True) -> pd.DataFrame:
    dt = pd.to_datetime(df[col], errors="coerce", utc=utc)
    return df.assign(**{col: dt})


def add_time_parts(df: pd.DataFrame, ts_col: str) -> pd.DataFrame:
    ts = df[ts_col]
    return df.assign(
        month=ts.dt.to_period("M").astype("string"),
        dow=ts.dt.day_name(),
        hour=ts.dt.hour,
    )


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    x = s.dropna()
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def winsorize(s: pd.Series, lo: float = 0.01, hi: float = 0.99) -> pd.Series:
    x = s.dropna()
    a, b = x.quantile(lo), x.quantile(hi)
    return s.clip(lower=a, upper=b)


def add_outlier_flag(df: pd.DataFrame, col: str, *, k: float = 1.5) -> pd.DataFrame:
    lo, hi = iqr_bounds(df[col], k=k)
    return df.assign(**{f"{col}__is_outlier": (df[col] < lo) | (df[col] > hi)})

Task 3 — Create scripts/run_day3_features_and_tables.py (40 minutes)

Create a new runner that:

  1. reads data/processed/orders_enriched.parquet

  2. verifies required columns

  3. parses created_at and adds time parts

  4. adds amount_winsor + amount__is_outlier

  5. writes data/processed/orders_features.parquet

  6. builds a monthly metrics table (wide) and a melted long form table

  7. writes:

    • reports/day3_eda_tables.md
    • reports/day3_monthly_metrics_long.csv
    • optional: reports/day3_monthly_metrics_wide.csv

Hint — run your runner as a module

This keeps imports stable (no sys.path hacks).

macOS/Linux

uv run python -m scripts.run_day3_features_and_tables

Windows PowerShell

uv run python -m scripts.run_day3_features_and_tables

Solution — runner (step 1: load + verify)

Put this in scripts/run_day3_features_and_tables.py:

from __future__ import annotations

import logging
from pathlib import Path

import pandas as pd

from bootcamp_data.checks import assert_non_empty, require_columns
from bootcamp_data.eda import describe_numeric, md_table, summary_metrics, top_k_table, write_markdown_report
from bootcamp_data.transforms import add_outlier_flag, add_time_parts, parse_datetime, winsorize

log = logging.getLogger(__name__)


def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
    root = Path(__file__).resolve().parents[1]

    in_path = root / "data" / "processed" / "orders_enriched.parquet"
    df = pd.read_parquet(in_path)

    require_columns(
        df,
        ["order_id", "user_id", "amount", "status_clean", "created_at", "country"],
        df_name="orders_enriched",
    )
    assert_non_empty(df, df_name="orders_enriched")

Solution — runner (step 2: build features + save)

Continue inside main():

    features = (
        df.pipe(parse_datetime, col="created_at", utc=True)
          .pipe(add_time_parts, ts_col="created_at")
          .assign(amount_winsor=lambda d: winsorize(d["amount"]))
          .pipe(add_outlier_flag, col="amount", k=1.5)
    )

    out_features = root / "data" / "processed" / "orders_features.parquet"
    out_features.parent.mkdir(parents=True, exist_ok=True)
    features.to_parquet(out_features, index=False)
    log.info("wrote: %s (%s rows)", out_features, len(features))

Solution — runner (step 3: monthly tables + reshape samples)

Continue inside main():

    monthly = (
        features.groupby("month", dropna=False)
                .agg(
                    n_orders=("order_id", "size"),
                    revenue=("amount", "sum"),
                    aov=("amount", "mean"),
                )
                .reset_index()
                .sort_values("month")
    )

    monthly_long = monthly.melt(
        id_vars=["month"],
        value_vars=["n_orders", "revenue", "aov"],
        var_name="metric",
        value_name="value",
    )

    monthly_wide = (
        monthly_long
        .pivot_table(index=["month"], columns="metric", values="value", aggfunc="first")
        .reset_index()
    )

    reports = root / "reports"
    reports.mkdir(parents=True, exist_ok=True)
    monthly_long.to_csv(reports / "day3_monthly_metrics_long.csv", index=False)
    monthly_wide.to_csv(reports / "day3_monthly_metrics_wide.csv", index=False)

Solution — runner (step 4: write the Markdown tables report)

Continue inside main():

    metrics = summary_metrics(features)
    amount_desc = describe_numeric(features, "amount")
    by_month = monthly.head(24)  # show first ~2 years if present
    by_country = top_k_table(features, "country", "amount", k=10)
    by_status = top_k_table(features, "status_clean", "amount", k=10)

    caveats = (
        "- `created_at` was parsed with `errors='coerce'`; invalid timestamps become missing.\n"
        "- Outliers are **flagged** (`amount__is_outlier`). For charts we may cap values (`amount_winsor`).\n"
        "- Join coverage may be < 100% (see match rate)."
    )

    out_report = reports / "day3_eda_tables.md"
    sections = [
        ("Top-line metrics", md_table(metrics)),
        ("Amount summary (robust stats)", md_table(amount_desc)),
        ("Monthly metrics (wide)", md_table(by_month, max_rows=24)),
        ("Revenue by country (top 10)", md_table(by_country)),
        ("Revenue by status_clean (top 10)", md_table(by_status)),
        ("Data caveats", caveats),
        ("Files written", "- `data/processed/orders_features.parquet`\n- `reports/day3_monthly_metrics_long.csv`"),
    ]
    write_markdown_report(out_report, title="Day 3 — Feature table + EDA tables", sections=sections)
    log.info("wrote: %s", out_report)


if __name__ == "__main__":
    main()

Task 4 — Run + verify artifacts (10 minutes)

macOS/Linux

uv run python -m scripts.run_day3_features_and_tables
ls -la data/processed | sed -n '1,60p'
ls -la reports | sed -n '1,60p'

Windows PowerShell

uv run python -m scripts.run_day3_features_and_tables
Get-ChildItem data\processed | Select-Object -First 30
Get-ChildItem reports | Select-Object -First 30

Checkpoint: these files exist:

  • data/processed/orders_features.parquet
  • reports/day3_eda_tables.md
  • reports/day3_monthly_metrics_long.csv

Git checkpoint (5 minutes)

  • git status
  • commit with message: "w2d3: features + tidy tables"
  • push

Solution — git commands

git add -A
git commit -m "w2d3: features + tidy tables"
git push

Debug playbook (Day 3)

When results look wrong:

  1. Datetimes: created_at dtype + missing-after-parse count
  2. Time parts: check month/dow/hour exist and look reasonable
  3. Outliers: print p50/p99 + outlier rate
  4. Join coverage: check user_found (if present) or % missing country
  5. Reshape sanity: monthly_long should have n_months × 3 rows
  6. Reproducibility: rerun the script from a fresh terminal
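Check 5 can be a one-line assertion. A sketch using a stand-in table shaped like the runner’s monthly_long (the numbers are placeholders):

```python
import pandas as pd

# Hypothetical table shaped like the runner's monthly_long output.
monthly_long = pd.DataFrame(
    {
        "month": ["2025-01"] * 3 + ["2025-02"] * 3,
        "metric": ["n_orders", "revenue", "aov"] * 2,
        "value": [10, 100.0, 10.0, 8, 96.0, 12.0],
    }
)

# One row per (month, metric) pair means row count must be n_months × 3.
n_months = monthly_long["month"].nunique()
assert len(monthly_long) == n_months * 3, "unexpected row count after melt"
print("reshape sanity ok:", n_months, "months")
```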

Stretch goals (optional)

If you finish early:

  • Add a second grouped table:

    • orders by dow (day-of-week)
    • revenue by hour
  • Save a second long-form table for tomorrow:

    • reports/day3_by_dow_long.csv
  • Add a small run metadata JSON:

    • reports/day3_run_meta.json (rows, missing timestamps, match rate, outlier rate)
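If you attempt the metadata stretch goal, it only takes the standard library. A sketch with placeholder values; in the runner you would compute these from `features`:

```python
import json
from pathlib import Path

# Placeholder values; compute the real ones from the features DataFrame.
meta = {
    "rows": 1234,
    "n_missing_created_at": 3,
    "user_match_rate": 0.97,
    "outlier_rate": 0.02,
}

out = Path("reports/day3_run_meta.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(meta, indent=2), encoding="utf-8")
print("wrote:", out)
```

A JSON file like this is easy to diff in code review, so a drift in match rate or outlier rate between runs is visible at a glance.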

Exit Ticket

In 1–2 sentences:

When would you use melt() vs pivot_table()?

What to do after class (Day 3 assignment)

Due: before Day 4 starts

  1. Ensure this runs from repo root:

    • uv run python -m scripts.run_day3_features_and_tables
  2. Commit + push your changes

  3. Confirm these artifacts exist:

    • data/processed/orders_features.parquet
    • reports/day3_eda_tables.md
    • reports/day3_monthly_metrics_long.csv

Deliverable: GitHub repo link + screenshot of reports/day3_eda_tables.md preview.

Tip

Tomorrow (Day 4) you should only need to read orders_features.parquet and plot.

Thank You!