Data Work (ETL + EDA)

AI Professionals Bootcamp | Week 2

2025-12-25

Day 5: Ship a trustworthy data product (rebuild + metadata + SQL lens)

Goal: ship Week 2 as something a teammate can run in one command and trust.

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Quality gates + “what does trustworthy mean?”
  • Asr Prayer (20m)
  • Session 2 (60m): One‑command rebuild + run metadata (Typer CLI)
  • Maghrib Prayer (20m)
  • Session 3 (60m): DuckDB SQL lens + handoff docs + ship checklist
  • Isha Prayer (20m)
  • Hands‑on (120m): Implement contract + rebuild CLI + run meta + DuckDB queries + handoff

Learning Objectives

By the end of today, you can:

  • add a quality gate that fails fast with actionable messages
  • define a lightweight data contract for orders_features.parquet
  • build a one‑command rebuild using Typer (builds on Week 1)
  • write run metadata (data/processed/_run_meta.json) and a schema summary
  • use DuckDB to query local Parquet (SQL as an analytics lens)
  • write a clear handoff: how to run, what files matter, what caveats exist

Warm‑up (5 minutes)

Rerun Day 4 and confirm outputs exist.

macOS/Linux

uv run python -m scripts.run_day4_publish --top-k 5 --figures
ls -la reports
ls -la data/processed
ls -la reports/figures

Windows PowerShell

uv run python -m scripts.run_day4_publish --top-k 5 --figures
dir reports
dir data\processed
dir reports\figures

Checkpoint: these exist:

  • reports/day4_publishable_report.md
  • reports/day4_data_dictionary.md
  • data/processed/orders_features.parquet

Note

If reports/figures/ is empty, that’s okay. Figures are optional and offline‑friendly.

Quick review: Week 2 pipeline so far

  • Day 2: verify + clean + join → orders_enriched.parquet + reports/day2_summary.md
  • Day 3: trusted EDA tables → reports/day3_eda.md (+ optional figures)
  • Day 4: publishable report + feature table + data dictionary

Today: add guardrails + audit trail so a teammate can rebuild and trust the artifacts.

Offline‑first: everything runs locally from data/raw/ + data/processed/.

Session 1

Quality gates + “what does trustworthy mean?”

Session 1 objectives

By the end of this session, you can:

  • explain a “quality gate” in one sentence
  • define a practical data contract for a table
  • add dtype + range checks without being brittle
  • decide where the contract runs (pipeline + rebuild)

Trustworthy data product = 3 things

  1. Repeatable: rerun produces the same outputs (idempotent)
  2. Checked: assumptions are enforced (quality gates)
  3. Auditable: you can answer “what ran?” (run metadata)

Today we implement (2) + (3).

Why a “quality gate”?

Without a gate:

  • your report can silently change when data changes
  • joins can quietly drop coverage (or explode rows)
  • dtypes can drift (IDs become numbers, leading zeros disappear)

With a gate:

  • the pipeline fails fast with a message that tells you what to fix

Data contract (practical definition)

A data contract is a small, explicit list of promises about a table:

  • required columns exist
  • key rules hold (e.g., unique order_id)
  • dtypes are safe (IDs are strings)
  • basic ranges make sense (non‑negative amounts, hour in 0–23)

Tip

Contracts are just composed checks. No new framework.

Contract in code (one function)

Mental model

  • One function per “important table”
  • Uses bootcamp_data.checks inside
  • Raises clear errors
  • Returns nothing on success

Today we contract‑check orders_features.parquet.

Shape

def validate_orders_features(df):
    # 1) columns + non-empty
    # 2) keys
    # 3) dtypes
    # 4) ranges
    return None  # success

Micro‑exercise: design the contract (6 minutes)

In pairs, write a minimal contract for orders_features:

  1. 5 required columns you will enforce
  2. 1 key rule you will enforce
  3. 2 range rules you will enforce

Checkpoint: every rule can be checked automatically in code.

Solution: a minimal contract (example)

A reasonable minimum set:

  • Required columns: order_id, user_id, amount, created_at, hour

  • Key rule: order_id is unique and non‑missing

  • Range rules:

    • amount >= 0
    • hour is in [0, 23] (after parsing)

Note

Your contract can be stricter later. Start minimal, then tighten.
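If you are curious what a range check like `bootcamp_data.checks.assert_in_range` might look like inside, here is a minimal sketch. The name and signature mirror how the helper is used later today, but the real implementation in your `checks.py` may differ.

```python
import pandas as pd

# Hypothetical sketch of a range check in the spirit of
# bootcamp_data.checks.assert_in_range; your real helper may differ.
def assert_in_range(s: pd.Series, lo=None, hi=None, name: str = "series") -> None:
    if lo is not None and (s < lo).any():
        raise ValueError(f"{name}: found values below {lo}")
    if hi is not None and (s > hi).any():
        raise ValueError(f"{name}: found values above {hi}")

assert_in_range(pd.Series([0, 12, 23]), lo=0, hi=23, name="hour")  # passes silently
```

The point: a contract rule is just a function that raises with a named, actionable message.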

Dtype checks that won’t fight you

Use pandas.api.types instead of comparing dtype strings.

from pandas.api.types import is_numeric_dtype, is_string_dtype

is_string_dtype(df["order_id"])  # True when IDs are safe
is_numeric_dtype(df["amount"])   # True for Float64 / Int64 / etc.

Warning

If order_id becomes an int, you may permanently lose leading zeros.
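You can see the pitfall, and the fix, in a few lines. This is a self-contained illustration with an inline CSV; the `dtype=` argument is the standard pandas way to pin ID columns at read time.

```python
import pandas as pd
from io import StringIO

csv_text = "order_id,amount\n00042,10.0\n"

bad = pd.read_csv(StringIO(csv_text))                              # order_id inferred as int
good = pd.read_csv(StringIO(csv_text), dtype={"order_id": "string"})

print(bad["order_id"].iloc[0])    # 42  (leading zeros gone)
print(good["order_id"].iloc[0])   # 00042
```

Once the zeros are gone from a saved artifact, no downstream check can bring them back, which is why the contract enforces string-like IDs.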

Quick Check

Question: Where should the contract run?

Answer: at least before publishing/handoff, and again in the one‑command rebuild.

Session 1 recap

  • A quality gate makes your pipeline trustworthy
  • A contract is a small set of explicit promises about a table
  • Use pandas.api.types for dtype checks
  • Start minimal → tighten over time

Asr break

20 minutes

When you return: we’ll build a one‑command rebuild and write run metadata.

Session 2

One‑command rebuild + run metadata (Typer CLI)

Session 2 objectives

By the end of this session, you can:

  • explain why “one command” matters for teams
  • build a small Typer CLI that orchestrates Day 2 → Day 4
  • verify output files (exists + non‑empty)
  • write run metadata and a schema summary

Why “one command”?

A teammate should not have to remember 6 steps.

One command gives you:

  • reproducibility (same steps, same order)
  • less “it works on my laptop”
  • a natural place to run the contract gate
  • a natural place to write run metadata

Why Typer (today)?

Week 1 already taught you Typer CLIs.

We use Typer today because:

  • --help is automatic and pleasant
  • options are typed (int, bool, optional strings)
  • it’s the same mental model as Week 1: commands call reusable functions

Tip

We are not building a fancy CLI. Just one command: rebuild.

CLI shape (minimal)

import typer

app = typer.Typer(add_completion=False)

@app.command()
def rebuild(top_k: int = 10, figures: bool = False):
    ...

if __name__ == "__main__":
    app()

Run it like:

uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures

Orchestration pattern (simple + cross‑platform)

We will call your existing scripts using subprocess.run.

Key habits:

  • use sys.executable (works inside uv run)
  • run modules with -m scripts... (import‑safe)
  • set cwd=ROOT (paths are correct)

Verification: files exist + non‑empty

After running scripts, verify the artifacts.

from pathlib import Path

def require_file(path: Path) -> None:
    if not path.exists():
        raise FileNotFoundError(f"Missing: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Empty file: {path}")

New concept: run metadata (audit trail)

Run metadata answers:

  • what inputs were used?
  • what outputs were written?
  • how many rows were produced?
  • when did it run?
  • (optional) what git commit produced these results?

Why it matters: when numbers change, you can debug what changed.

Minimal run metadata JSON (template)

Write to: data/processed/_run_meta.json

{
  "started_utc": "2025-12-25T09:15:00Z",
  "git_commit": "abc123...",
  "args": {"top_k": 10, "figures": false},
  "rows": {"orders_enriched": 12034, "orders_features": 12034},
  "outputs": [
    "data/processed/orders_features.parquet",
    "reports/day4_publishable_report.md"
  ]
}

Tip

Keep it small and truthful. You can always add fields later.
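Writing the metadata is just a dict plus `json.dumps`. A minimal sketch follows; the row counts and args are illustrative placeholders, the field names mirror the fuller Task 2 solution, and the demo path stands in for `data/processed/_run_meta.json`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Placeholder values for illustration; the real rebuild fills these in.
meta = {
    "started_utc": datetime.now(timezone.utc).isoformat(),
    "args": {"top_k": 10, "figures": False},
    "rows": {"orders_features": 12034},
    "outputs": ["data/processed/orders_features.parquet"],
}

meta_path = Path("_run_meta_demo.json")  # real target: data/processed/_run_meta.json
meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
```

Because it is plain JSON, a teammate (or a script) can diff two runs to see exactly what changed.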

Schema summary (quick, useful)

Write a small schema/missingness table for the handoff.

Target path: reports/schema_summary.csv

Columns:

  • column
  • dtype
  • n_missing
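All three columns come straight from the DataFrame. A sketch with a toy frame standing in for `orders_features` (the Task 2 solution uses the same construction):

```python
import pandas as pd

# Toy frame standing in for orders_features.
df = pd.DataFrame({"order_id": ["a", "b"], "amount": [10.0, None]})

schema = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "n_missing": [int(df[c].isna().sum()) for c in df.columns],
})
print(schema)
```

Sorting by `n_missing` descending puts the most suspicious columns at the top of the handoff.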

Session 2 recap

  • One‑command rebuild = team‑friendly reproducibility
  • Typer gives you a clean, typed CLI interface
  • Verify key artifacts (exists + non‑empty)
  • Write run metadata + schema summary so runs are auditable

Maghrib break

20 minutes

When you return: we’ll use DuckDB to query your Parquet and then finish the handoff.

Session 3

DuckDB SQL lens + handoff docs + ship checklist

Session 3 objectives

By the end of this session, you can:

  • run 2–3 useful analytics queries using DuckDB on Parquet
  • explain when SQL is simpler than pandas
  • write a handoff doc that enables “clone → run → trust”
  • finish a ship checklist (rebuild + artifacts + caveats)

Why DuckDB?

DuckDB is:

  • zero‑setup SQL analytics on local files (CSV/Parquet)
  • fast and convenient for groupby/join style questions
  • a bridge to “real” SQL environments later

Note

We use DuckDB as an analytics lens, not as the source of truth for transforms.

DuckDB pattern: query Parquet directly

import duckdb

sql = '''
SELECT
  country,
  COUNT(*) AS n,
  SUM(amount) AS revenue
FROM read_parquet('data/processed/orders_features.parquet')
GROUP BY 1
ORDER BY revenue DESC
LIMIT 10
'''
df = duckdb.query(sql).df()
print(df.head())

Micro‑exercise: write a time trend query (6 minutes)

Write a SQL query that returns:

  • month
  • n_orders
  • revenue

Sorted by month ascending.

Checkpoint: your query uses GROUP BY.

Solution (example)

SELECT
  month,
  COUNT(*) AS n_orders,
  SUM(amount) AS revenue
FROM read_parquet('data/processed/orders_features.parquet')
GROUP BY 1
ORDER BY month ASC

What should DuckDB output?

Keep it simple:

  • write query results to reports/ as CSV
  • link the CSV in your handoff

Example:

  • reports/day5_duckdb_top_countries.csv
  • reports/day5_duckdb_monthly_revenue.csv

Handoff doc template: reports/handoff.md

# Week 2 Handoff (Data Work)

## Quick start (5 minutes)
1) uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures
2) Open reports/day4_publishable_report.md

## Key artifacts
- data/processed/orders_features.parquet
- reports/day4_publishable_report.md
- reports/day4_data_dictionary.md
- data/processed/_run_meta.json
- reports/schema_summary.csv

## Optional SQL artifacts (DuckDB)
- reports/day5_duckdb_top_countries.csv
- reports/day5_duckdb_monthly_revenue.csv

## Data quality caveats
- Missingness: ...
- Join coverage: ...
- Outliers/winsorization: ...

## Troubleshooting
- Run from repo root
- Rerun rebuild
- If date filters empty the data, adjust --start-date/--end-date

Week 2 ship checklist

Before you call it “done”:

  • ✅ uv run python -m bootcamp_data.cli rebuild succeeds from repo root
  • ✅ orders_features.parquet exists and the contract check passes
  • ✅ run metadata exists: data/processed/_run_meta.json
  • ✅ schema summary exists: reports/schema_summary.csv
  • ✅ Day 4 report is readable (metrics + tables + caveats)
  • ✅ data dictionary exists and matches the feature table
  • ✅ reports/handoff.md tells a teammate exactly what to do

Session 3 recap

  • DuckDB is a useful SQL lens on local Parquet
  • Handoff docs make your work usable by others
  • Caveats are not an embarrassment; they are professionalism

Isha break

20 minutes

When you return: we’ll implement the contract + CLI rebuild + run metadata + DuckDB outputs and commit.

Hands‑on

Build: contract gate + one‑command rebuild + run metadata + DuckDB outputs + handoff

Hands‑on success criteria (today)

By the end, you should have:

  • bootcamp_data/contract.py (contract for orders_features)
  • bootcamp_data/cli.py (Typer command: rebuild)
  • data/processed/_run_meta.json (run metadata written by rebuild)
  • reports/schema_summary.csv (schema + missingness summary)
  • scripts/run_day5_duckdb_queries.py (writes SQL outputs to reports/)
  • reports/handoff.md (quick start + artifacts + caveats)
  • at least 1 commit pushed to GitHub

Project layout (what you’ll have)

bootcamp_data/
  __init__.py
  checks.py
  contract.py          # NEW
  cli.py               # NEW (Typer)
  io.py
  transforms.py
scripts/
  __init__.py
  run_day2_validate_and_report.py
  run_day3_eda_and_figures.py
  run_day4_publish.py
  run_day5_duckdb_queries.py   # NEW
reports/
  day4_publishable_report.md
  day4_data_dictionary.md
  schema_summary.csv           # NEW
  handoff.md                   # NEW
  day5_duckdb_top_countries.csv        # NEW (from DuckDB)
  day5_duckdb_monthly_revenue.csv      # NEW (from DuckDB)
  figures/
data/
  raw/
  processed/
    orders_features.parquet
    _run_meta.json             # NEW

Task 0 — Confirm starting point (10 minutes)

  • Rerun Day 4 once
  • Make sure you are in the repo root

Checkpoint: Day 4 outputs exist (Warm‑up checkpoint).

Task 1 — Add bootcamp_data/contract.py (25 minutes)

  1. Create bootcamp_data/contract.py
  2. Implement validate_orders_features(df)
  3. Use existing checks: columns, non‑empty, unique key, ranges
  4. Add 2 dtype checks (IDs string‑like, amount numeric)

Checkpoint: calling the function raises no error for your current orders_features.parquet.

Solution (Task 1): bootcamp_data/contract.py

from __future__ import annotations

import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

from .checks import assert_in_range, assert_non_empty, assert_unique_key, require_columns

REQUIRED = [
    "order_id", "user_id", "amount", "quantity", "created_at",
    "status_clean", "month", "dow", "hour",
]

def validate_orders_features(df: pd.DataFrame, *, df_name: str = "orders_features") -> None:
    require_columns(df, REQUIRED, df_name=df_name)
    assert_non_empty(df, df_name=df_name)
    assert_unique_key(df, "order_id", allow_na=False, df_name=df_name)

    if not is_string_dtype(df["order_id"]):
        raise ValueError(f"{df_name}: order_id must be string-like (got {df['order_id'].dtype})")
    if not is_string_dtype(df["user_id"]):
        raise ValueError(f"{df_name}: user_id must be string-like (got {df['user_id'].dtype})")
    if not is_numeric_dtype(df["amount"]):
        raise ValueError(f"{df_name}: amount must be numeric (got {df['amount'].dtype})")

    assert_in_range(df["amount"].dropna(), lo=0, name=f"{df_name}.amount")
    assert_in_range(df["quantity"].dropna(), lo=0, name=f"{df_name}.quantity")
    assert_in_range(df["hour"].dropna(), lo=0, hi=23, name=f"{df_name}.hour")

Task 1 checkpoint (run it) (5 minutes)

macOS/Linux

uv run python -c "import pandas as pd; from bootcamp_data.contract import validate_orders_features; df=pd.read_parquet('data/processed/orders_features.parquet'); validate_orders_features(df); print('contract OK')"

Windows PowerShell

uv run python -c "import pandas as pd; from bootcamp_data.contract import validate_orders_features; df=pd.read_parquet('data/processed/orders_features.parquet'); validate_orders_features(df); print('contract OK')"

Checkpoint: you see contract OK.

Task 2 — Create bootcamp_data/cli.py (Typer) (40 minutes)

Note

If you see ModuleNotFoundError: typer, install it once with: uv add typer.

  1. Create bootcamp_data/cli.py

  2. Add a rebuild command that runs Day 2 → Day 4 using subprocess.run

  3. Verify key artifacts exist and are non‑empty

  4. Load orders_features.parquet and run validate_orders_features(df)

  5. Write:

    • data/processed/_run_meta.json
    • reports/schema_summary.csv

Checkpoint: uv run python -m bootcamp_data.cli rebuild completes successfully.

Solution (Task 2): bootcamp_data/cli.py (1/2)

from __future__ import annotations

import json
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
import typer

from .contract import validate_orders_features

app = typer.Typer(add_completion=False)
ROOT = Path(__file__).resolve().parents[1]

def run_mod(mod: str, args: list[str] | None = None) -> None:
    cmd = [sys.executable, "-m", mod] + (args or [])
    subprocess.run(cmd, cwd=ROOT, check=True)

def require_file(path: Path) -> None:
    if not path.exists():
        raise FileNotFoundError(f"Missing: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Empty file: {path}")

def safe_git_commit() -> str | None:
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=ROOT, text=True).strip()
    except Exception:
        return None

Solution (Task 2): bootcamp_data/cli.py (2/2)

@app.command()
def rebuild(
    top_k: int = typer.Option(10, help="Top-k categories to show in Day 4 report."),
    figures: bool = typer.Option(False, help="Attempt to export figures (optional)."),
    start_date: str | None = typer.Option(None, help="Optional YYYY-MM-DD filter for Day 4."),
    end_date: str | None = typer.Option(None, help="Optional YYYY-MM-DD filter for Day 4."),
) -> None:
    t0 = time.time()
    started = datetime.now(timezone.utc).isoformat()

    run_mod("scripts.run_day2_validate_and_report")
    run_mod("scripts.run_day3_eda_and_figures")

    day4_args = ["--top-k", str(top_k)]
    if start_date:
        day4_args += ["--start-date", start_date]
    if end_date:
        day4_args += ["--end-date", end_date]
    if figures:
        day4_args += ["--figures"]
    run_mod("scripts.run_day4_publish", day4_args)

    features_path = ROOT / "data/processed/orders_features.parquet"
    report_path = ROOT / "reports/day4_publishable_report.md"
    dict_path = ROOT / "reports/day4_data_dictionary.md"

    require_file(features_path)
    require_file(report_path)
    require_file(dict_path)

    df = pd.read_parquet(features_path)
    validate_orders_features(df)

    # schema summary
    schema = pd.DataFrame({
        "column": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "n_missing": [int(df[c].isna().sum()) for c in df.columns],
    }).sort_values("n_missing", ascending=False)
    schema_path = ROOT / "reports/schema_summary.csv"
    schema_path.parent.mkdir(parents=True, exist_ok=True)
    schema.to_csv(schema_path, index=False)

    meta = {
        "started_utc": started,
        "finished_utc": datetime.now(timezone.utc).isoformat(),
        "duration_s": round(time.time() - t0, 2),
        "git_commit": safe_git_commit(),
        "args": {"top_k": top_k, "figures": figures, "start_date": start_date, "end_date": end_date},
        "rows": {"orders_features": int(len(df))},
        "outputs": [str(features_path), str(report_path), str(dict_path), str(schema_path)],
    }
    meta_path = ROOT / "data/processed/_run_meta.json"
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

    typer.echo("✅ rebuild OK")

Task 2 checkpoint (run it) (5 minutes)

macOS/Linux

uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures

Windows PowerShell

uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures

Checkpoint: you see ✅ rebuild OK.

Task 3 — Add DuckDB queries (25 minutes)

  1. Add DuckDB dependency (if needed): uv add duckdb
  2. Create scripts/run_day5_duckdb_queries.py
  3. Run 2 queries on orders_features.parquet and write CSV outputs to reports/

Checkpoint: the CSVs exist in reports/.

Note

If you can’t install DuckDB due to connectivity, keep the script and run it later. The SQL practice is still valuable.

Solution (Task 3): scripts/run_day5_duckdb_queries.py

from __future__ import annotations

from pathlib import Path
import duckdb

ROOT = Path(__file__).resolve().parents[1]

FEATURES = ROOT / "data/processed/orders_features.parquet"
OUT1 = ROOT / "reports/day5_duckdb_top_countries.csv"
OUT2 = ROOT / "reports/day5_duckdb_monthly_revenue.csv"

def main() -> None:
    OUT1.parent.mkdir(parents=True, exist_ok=True)
    src = FEATURES.as_posix()  # forward slashes keep the SQL path valid on Windows too

    q1 = f'''
    SELECT country, COUNT(*) AS n, SUM(amount) AS revenue
    FROM read_parquet('{src}')
    GROUP BY 1
    ORDER BY revenue DESC
    LIMIT 10
    '''
    df1 = duckdb.query(q1).df()
    df1.to_csv(OUT1, index=False)

    q2 = f'''
    SELECT month, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM read_parquet('{src}')
    GROUP BY 1
    ORDER BY month ASC
    '''
    df2 = duckdb.query(q2).df()
    df2.to_csv(OUT2, index=False)

    print("wrote:", OUT1)
    print("wrote:", OUT2)

if __name__ == "__main__":
    main()

Task 3 checkpoint (run it) (5 minutes)

macOS/Linux

uv run python -m scripts.run_day5_duckdb_queries
ls -la reports/day5_duckdb_*.csv

Windows PowerShell

uv run python -m scripts.run_day5_duckdb_queries
dir reports\day5_duckdb_*.csv

Task 4 — Write reports/handoff.md (20 minutes)

  1. Create reports/handoff.md
  2. Add: quick start, key artifacts, optional SQL artifacts, caveats, troubleshooting

Checkpoint: a teammate can follow it without asking you questions.

Git checkpoint (2 minutes)

  • git status
  • Commit with message: "day5: contract + cli rebuild + run meta + duckdb + handoff"
  • git push

Checkpoint: your GitHub repo shows the new commit.

Debug playbook

When something breaks:

  1. Confirm repo root (run from the folder with data/ + scripts/ + bootcamp_data/)

  2. Re-run: uv run python -m bootcamp_data.cli rebuild

  3. Read the error message: which check failed? which column?

  4. Verify inputs:

    • data/raw/ files exist
    • processed artifacts are not stale (rebuild overwrites)
  5. If using date filters: make sure they don’t filter everything out

Thank You!