Data Work (ETL + EDA)

AI Professionals Bootcamp | Week 2

2025-12-25

Day 5: Ship a trustworthy data product (rebuild + metadata + SQL lens)

Goal: ship Week 2 as something a teammate can run in one command and trust.

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Quality gates + “what does trustworthy mean?”
  • Asr Prayer (20m)
  • Session 2 (60m): One‑command rebuild + run metadata (Typer CLI)
  • Maghrib Prayer (20m)
  • Session 3 (60m): DuckDB SQL lens + handoff docs + ship checklist
  • Isha Prayer (20m)
  • Hands‑on (120m): Implement contract + rebuild CLI + run meta + DuckDB queries + handoff

Learning Objectives

By the end of today, you can:

  • add a quality gate that fails fast with actionable messages
  • define a lightweight data contract for orders_features.parquet
  • build a one‑command rebuild using Typer (builds on Week 1)
  • write run metadata (data/processed/_run_meta.json) and a schema summary
  • use DuckDB to query local Parquet (SQL as an analytics lens)
  • write a clear handoff: how to run, what files matter, what caveats exist

Warm‑up (5 minutes)

Rerun Day 4 and confirm outputs exist.

macOS/Linux

uv run python -m scripts.run_day4_publish --top-k 5 --figures
ls -la reports
ls -la data/processed
ls -la reports/figures

Windows PowerShell

uv run python -m scripts.run_day4_publish --top-k 5 --figures
dir reports
dir data\processed
dir reports\figures

Checkpoint: these exist:

  • reports/day4_publishable_report.md
  • reports/day4_data_dictionary.md
  • data/processed/orders_features.parquet

Note

If reports/figures/ is empty, that’s okay. Figures are optional and offline‑friendly.

Quick review: Week 2 pipeline so far

  • Day 2: verify + clean + join → orders_enriched.parquet + reports/day2_summary.md
  • Day 3: trusted EDA tables → reports/day3_eda.md (+ optional figures)
  • Day 4: publishable report + feature table + data dictionary

Today: add guardrails + audit trail so a teammate can rebuild and trust the artifacts.

Offline‑first: everything runs locally from data/raw/ + data/processed/.

Session 1

Quality gates + “what does trustworthy mean?”

Session 1 objectives

By the end of this session, you can:

  • explain a “quality gate” in one sentence
  • define a practical data contract for a table
  • add dtype + range checks without being brittle
  • decide where the contract runs (pipeline + rebuild)

Trustworthy data product = 3 things

  1. Repeatable: rerun produces the same outputs (idempotent)
  2. Checked: assumptions are enforced (quality gates)
  3. Auditable: you can answer “what ran?” (run metadata)

Today we implement (2) + (3).

Why a “quality gate”?

Without a gate:

  • your report can silently change when data changes
  • joins can quietly drop coverage (or explode rows)
  • dtypes can drift (IDs become numbers, leading zeros disappear)

With a gate:

  • the pipeline fails fast with a message that tells you what to fix

Data contract (practical definition)

A data contract is a small, explicit list of promises about a table:

  • required columns exist
  • key rules hold (e.g., unique order_id)
  • dtypes are safe (IDs are strings)
  • basic ranges make sense (non‑negative amounts, hour in 0–23)

Tip

Contracts are just composed checks. No new framework.

Contract in code (one function)

Mental model

  • One function per “important table”
  • Uses bootcamp_data.checks inside
  • Raises clear errors
  • Returns nothing on success

Today we contract‑check orders_features.parquet.

Shape

def validate_orders_features(df):
    # 1) columns + non-empty
    # 2) keys
    # 3) dtypes
    # 4) ranges
    return None  # success

Micro‑exercise: design the contract (6 minutes)

In pairs, write a minimal contract for orders_features:

  1. 5 required columns you will enforce
  2. 1 key rule you will enforce
  3. 2 range rules you will enforce

Checkpoint: every rule can be checked automatically in code.

Solution: a minimal contract (example)

A reasonable minimum set:

  • Required columns: order_id, user_id, amount, created_at, hour

  • Key rule: order_id is unique and non‑missing

  • Range rules:

    • amount >= 0
    • hour is in [0, 23] (after parsing)

Note

Your contract can be stricter later. Start minimal, then tighten.
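If you are curious what a range check like `bootcamp_data.checks.assert_in_range` might look like inside, here is a minimal sketch. The name and signature mirror how the helper is used later today, but the real implementation in your `checks.py` may differ.

```python
import pandas as pd

# Hypothetical sketch of a range check in the spirit of
# bootcamp_data.checks.assert_in_range; your real helper may differ.
def assert_in_range(s: pd.Series, lo=None, hi=None, name: str = "series") -> None:
    if lo is not None and (s < lo).any():
        raise ValueError(f"{name}: found values below {lo}")
    if hi is not None and (s > hi).any():
        raise ValueError(f"{name}: found values above {hi}")

assert_in_range(pd.Series([0, 12, 23]), lo=0, hi=23, name="hour")  # passes silently
```

The point: a contract rule is just a function that raises with a named, actionable message.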

Dtype checks that won’t fight you

Use pandas.api.types instead of comparing dtype strings.

from pandas.api.types import is_numeric_dtype, is_string_dtype

is_string_dtype(df["order_id"])  # True when IDs are safe
is_numeric_dtype(df["amount"])   # True for Float64 / Int64 / etc.

Warning

If order_id becomes an int, you may permanently lose leading zeros.
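You can see the pitfall, and the fix, in a few lines. This is a self-contained illustration with an inline CSV; the `dtype=` argument is the standard pandas way to pin ID columns at read time.

```python
import pandas as pd
from io import StringIO

csv_text = "order_id,amount\n00042,10.0\n"

bad = pd.read_csv(StringIO(csv_text))                              # order_id inferred as int
good = pd.read_csv(StringIO(csv_text), dtype={"order_id": "string"})

print(bad["order_id"].iloc[0])    # 42  (leading zeros gone)
print(good["order_id"].iloc[0])   # 00042
```

Once the zeros are gone from a saved artifact, no downstream check can bring them back, which is why the contract enforces string-like IDs.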

Quick Check

Question: Where should the contract run?

Answer: at least before publishing/handoff, and again in the one‑command rebuild.

Session 1 recap

  • A quality gate makes your pipeline trustworthy
  • A contract is a small set of explicit promises about a table
  • Use pandas.api.types for dtype checks
  • Start minimal → tighten over time

Asr break

20 minutes

When you return: we’ll build a one‑command rebuild and write run metadata.

Session 2

One‑command rebuild + run metadata (Typer CLI)

Session 2 objectives

By the end of this session, you can:

  • explain why “one command” matters for teams
  • build a small Typer CLI that orchestrates Day 2 → Day 4
  • verify output files (exists + non‑empty)
  • write run metadata and a schema summary

Why “one command”?

A teammate should not have to remember 6 steps.

One command gives you:

  • reproducibility (same steps, same order)
  • less “it works on my laptop”
  • a natural place to run the contract gate
  • a natural place to write run metadata

Why Typer (today)?

Week 1 already taught you Typer CLIs.

We use Typer today because:

  • --help is automatic and pleasant
  • options are typed (int, bool, optional strings)
  • it’s the same mental model as Week 1: commands call reusable functions

Tip

We are not building a fancy CLI. Just one command: rebuild.

CLI shape (minimal)

import typer

app = typer.Typer(add_completion=False)

@app.command()
def rebuild(top_k: int = 10, figures: bool = False):
    ...

if __name__ == "__main__":
    app()

Run it like:

uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures

Orchestration pattern (simple + cross‑platform)

We will call your existing scripts using subprocess.run.

Key habits:

  • use sys.executable (works inside uv run)
  • run modules with -m scripts... (import‑safe)
  • set cwd=ROOT (paths are correct)

Verification: files exist + non‑empty

After running scripts, verify the artifacts.

from pathlib import Path

def require_file(path: Path) -> None:
    if not path.exists():
        raise FileNotFoundError(f"Missing: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Empty file: {path}")

New concept: run metadata (audit trail)

Run metadata answers:

  • what inputs were used?
  • what outputs were written?
  • how many rows were produced?
  • when did it run?
  • (optional) what git commit produced these results?

Why it matters: when numbers change, you can debug what changed.

Minimal run metadata JSON (template)

Write to: data/processed/_run_meta.json

{
  "started_utc": "2025-12-25T09:15:00Z",
  "git_commit": "abc123...",
  "args": {"top_k": 10, "figures": false},
  "rows": {"orders_enriched": 12034, "orders_features": 12034},
  "outputs": [
    "data/processed/orders_features.parquet",
    "reports/day4_publishable_report.md"
  ]
}

Tip

Keep it small and truthful. You can always add fields later.
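Writing the metadata is just a dict plus `json.dumps`. A minimal sketch follows; the row counts and args are illustrative placeholders, the field names mirror the fuller Task 2 solution, and the demo path stands in for `data/processed/_run_meta.json`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Placeholder values for illustration; the real rebuild fills these in.
meta = {
    "started_utc": datetime.now(timezone.utc).isoformat(),
    "args": {"top_k": 10, "figures": False},
    "rows": {"orders_features": 12034},
    "outputs": ["data/processed/orders_features.parquet"],
}

meta_path = Path("_run_meta_demo.json")  # real target: data/processed/_run_meta.json
meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
```

Because it is plain JSON, a teammate (or a script) can diff two runs to see exactly what changed.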

Schema summary (quick, useful)

Write a small schema/missingness table for the handoff.

Target path: reports/schema_summary.csv

Columns:

  • column
  • dtype
  • n_missing
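All three columns come straight from the DataFrame. A sketch with a toy frame standing in for `orders_features` (the Task 2 solution uses the same construction):

```python
import pandas as pd

# Toy frame standing in for orders_features.
df = pd.DataFrame({"order_id": ["a", "b"], "amount": [10.0, None]})

schema = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "n_missing": [int(df[c].isna().sum()) for c in df.columns],
})
print(schema)
```

Sorting by `n_missing` descending puts the most suspicious columns at the top of the handoff.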

Session 2 recap

  • One‑command rebuild = team‑friendly reproducibility
  • Typer gives you a clean, typed CLI interface
  • Verify key artifacts (exists + non‑empty)
  • Write run metadata + schema summary so runs are auditable

Maghrib break

20 minutes

When you return: we’ll use DuckDB to query your Parquet and then finish the handoff.

Session 3

DuckDB SQL lens + handoff docs + ship checklist

Session 3 objectives

By the end of this session, you can:

  • run 2–3 useful analytics queries using DuckDB on Parquet
  • explain when SQL is simpler than pandas
  • write a handoff doc that enables “clone → run → trust”
  • finish a ship checklist (rebuild + artifacts + caveats)

Why DuckDB?

DuckDB is:

  • zero‑setup SQL analytics on local files (CSV/Parquet)
  • fast and convenient for groupby/join style questions
  • a bridge to “real” SQL environments later

Note

We use DuckDB as an analytics lens, not as the source of truth for transforms.

DuckDB pattern: query Parquet directly

import duckdb

sql = '''
SELECT
  country,
  COUNT(*) AS n,
  SUM(amount) AS revenue
FROM read_parquet('data/processed/orders_features.parquet')
GROUP BY 1
ORDER BY revenue DESC
LIMIT 10
'''
df = duckdb.query(sql).df()
print(df.head())

Micro‑exercise: write a time trend query (6 minutes)

Write a SQL query that returns:

  • month
  • n_orders
  • revenue

Sorted by month ascending.

Checkpoint: your query uses GROUP BY.

Solution (example)

SELECT
  month,
  COUNT(*) AS n_orders,
  SUM(amount) AS revenue
FROM read_parquet('data/processed/orders_features.parquet')
GROUP BY 1
ORDER BY month ASC

What should DuckDB output?

Keep it simple:

  • write query results to reports/ as CSV
  • link the CSV in your handoff

Example:

  • reports/day5_duckdb_top_countries.csv
  • reports/day5_duckdb_monthly_revenue.csv

Handoff doc template: reports/handoff.md

# Week 2 Handoff (Data Work)

## Quick start (5 minutes)
1) uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures
2) Open reports/day4_publishable_report.md

## Key artifacts
- data/processed/orders_features.parquet
- reports/day4_publishable_report.md
- reports/day4_data_dictionary.md
- data/processed/_run_meta.json
- reports/schema_summary.csv

## Optional SQL artifacts (DuckDB)
- reports/day5_duckdb_top_countries.csv
- reports/day5_duckdb_monthly_revenue.csv

## Data quality caveats
- Missingness: ...
- Join coverage: ...
- Outliers/winsorization: ...

## Troubleshooting
- Run from repo root
- Rerun rebuild
- If date filters empty the data, adjust --start-date/--end-date

Week 2 ship checklist

Before you call it “done”:

  • ✅ uv run python -m bootcamp_data.cli rebuild succeeds from repo root
  • ✅ orders_features.parquet exists and the contract check passes
  • ✅ run metadata exists: data/processed/_run_meta.json
  • ✅ schema summary exists: reports/schema_summary.csv
  • ✅ Day 4 report is readable (metrics + tables + caveats)
  • ✅ data dictionary exists and matches the feature table
  • ✅ reports/handoff.md tells a teammate exactly what to do

Session 3 recap

  • DuckDB is a useful SQL lens on local Parquet
  • Handoff docs make your work usable by others
  • Caveats are not an embarrassment; they are professionalism

Isha break

20 minutes

When you return: we’ll implement the contract + CLI rebuild + run metadata + DuckDB outputs and commit.

Hands‑on

Build: contract gate + one‑command rebuild + run metadata + DuckDB outputs + handoff

Hands‑on success criteria (today)

By the end, you should have:

  • bootcamp_data/contract.py (contract for orders_features)
  • bootcamp_data/cli.py (Typer command: rebuild)
  • data/processed/_run_meta.json (run metadata written by rebuild)
  • reports/schema_summary.csv (schema + missingness summary)
  • scripts/run_day5_duckdb_queries.py (writes SQL outputs to reports/)
  • reports/handoff.md (quick start + artifacts + caveats)
  • at least 1 commit pushed to GitHub

Project layout (what you’ll have)

bootcamp_data/
  __init__.py
  checks.py
  contract.py          # NEW
  cli.py               # NEW (Typer)
  io.py
  transforms.py
scripts/
  __init__.py
  run_day2_validate_and_report.py
  run_day3_eda_and_figures.py
  run_day4_publish.py
  run_day5_duckdb_queries.py   # NEW
reports/
  day4_publishable_report.md
  day4_data_dictionary.md
  schema_summary.csv           # NEW
  handoff.md                   # NEW
  day5_duckdb_top_countries.csv        # NEW (from DuckDB)
  day5_duckdb_monthly_revenue.csv      # NEW (from DuckDB)
  figures/
data/
  raw/
  processed/
    orders_features.parquet
    _run_meta.json             # NEW

Task 0 — Confirm starting point (10 minutes)

  • Rerun Day 4 once
  • Make sure you are in the repo root

Checkpoint: Day 4 outputs exist (Warm‑up checkpoint).

Task 1 — Add bootcamp_data/contract.py (25 minutes)

  1. Create bootcamp_data/contract.py
  2. Implement validate_orders_features(df)
  3. Use existing checks: columns, non‑empty, unique key, ranges
  4. Add 2 dtype checks (IDs string‑like, amount numeric)

Checkpoint: calling the function raises no error for your current orders_features.parquet.

Solution (Task 1): bootcamp_data/contract.py

from __future__ import annotations

import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

from .checks import assert_in_range, assert_non_empty, assert_unique_key, require_columns

REQUIRED = [
    "order_id", "user_id", "amount", "quantity", "created_at",
    "status_clean", "month", "dow", "hour",
]

def validate_orders_features(df: pd.DataFrame, *, df_name: str = "orders_features") -> None:
    require_columns(df, REQUIRED, df_name=df_name)
    assert_non_empty(df, df_name=df_name)
    assert_unique_key(df, "order_id", allow_na=False, df_name=df_name)

    if not is_string_dtype(df["order_id"]):
        raise ValueError(f"{df_name}: order_id must be string-like (got {df['order_id'].dtype})")
    if not is_string_dtype(df["user_id"]):
        raise ValueError(f"{df_name}: user_id must be string-like (got {df['user_id'].dtype})")
    if not is_numeric_dtype(df["amount"]):
        raise ValueError(f"{df_name}: amount must be numeric (got {df['amount'].dtype})")

    assert_in_range(df["amount"].dropna(), lo=0, name=f"{df_name}.amount")
    assert_in_range(df["quantity"].dropna(), lo=0, name=f"{df_name}.quantity")
    assert_in_range(df["hour"].dropna(), lo=0, hi=23, name=f"{df_name}.hour")

Task 1 checkpoint (run it) (5 minutes)

macOS/Linux

uv run python -c "import pandas as pd; from bootcamp_data.contract import validate_orders_features; df=pd.read_parquet('data/processed/orders_features.parquet'); validate_orders_features(df); print('contract OK')"

Windows PowerShell

uv run python -c "import pandas as pd; from bootcamp_data.contract import validate_orders_features; df=pd.read_parquet('data/processed/orders_features.parquet'); validate_orders_features(df); print('contract OK')"

Checkpoint: you see contract OK.

Task 2 — Create bootcamp_data/cli.py (Typer) (40 minutes)

Note

If you see ModuleNotFoundError: typer, install it once with: uv add typer.

  1. Create bootcamp_data/cli.py

  2. Add a rebuild command that runs Day 2 → Day 4 using subprocess.run

  3. Verify key artifacts exist and are non‑empty

  4. Load orders_features.parquet and run validate_orders_features(df)

  5. Write:

    • data/processed/_run_meta.json
    • reports/schema_summary.csv

Checkpoint: uv run python -m bootcamp_data.cli rebuild completes successfully.

Solution (Task 2): bootcamp_data/cli.py (1/2)

from __future__ import annotations

import json
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
import typer

from .contract import validate_orders_features

app = typer.Typer(add_completion=False)
ROOT = Path(__file__).resolve().parents[1]

def run_mod(mod: str, args: list[str] | None = None) -> None:
    cmd = [sys.executable, "-m", mod] + (args or [])
    subprocess.run(cmd, cwd=ROOT, check=True)

def require_file(path: Path) -> None:
    if not path.exists():
        raise FileNotFoundError(f"Missing: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Empty file: {path}")

def safe_git_commit() -> str | None:
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=ROOT, text=True).strip()
    except Exception:
        return None

Solution (Task 2): bootcamp_data/cli.py (2/2)

@app.command()
def rebuild(
    top_k: int = typer.Option(10, help="Top-k categories to show in Day 4 report."),
    figures: bool = typer.Option(False, help="Attempt to export figures (optional)."),
    start_date: str | None = typer.Option(None, help="Optional YYYY-MM-DD filter for Day 4."),
    end_date: str | None = typer.Option(None, help="Optional YYYY-MM-DD filter for Day 4."),
) -> None:
    t0 = time.time()
    started = datetime.now(timezone.utc).isoformat()

    run_mod("scripts.run_day2_validate_and_report")
    run_mod("scripts.run_day3_eda_and_figures")

    day4_args = ["--top-k", str(top_k)]
    if start_date:
        day4_args += ["--start-date", start_date]
    if end_date:
        day4_args += ["--end-date", end_date]
    if figures:
        day4_args += ["--figures"]
    run_mod("scripts.run_day4_publish", day4_args)

    features_path = ROOT / "data/processed/orders_features.parquet"
    report_path = ROOT / "reports/day4_publishable_report.md"
    dict_path = ROOT / "reports/day4_data_dictionary.md"

    require_file(features_path)
    require_file(report_path)
    require_file(dict_path)

    df = pd.read_parquet(features_path)
    validate_orders_features(df)

    # schema summary
    schema = pd.DataFrame({
        "column": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "n_missing": [int(df[c].isna().sum()) for c in df.columns],
    }).sort_values("n_missing", ascending=False)
    schema_path = ROOT / "reports/schema_summary.csv"
    schema_path.parent.mkdir(parents=True, exist_ok=True)
    schema.to_csv(schema_path, index=False)

    meta = {
        "started_utc": started,
        "finished_utc": datetime.now(timezone.utc).isoformat(),
        "duration_s": round(time.time() - t0, 2),
        "git_commit": safe_git_commit(),
        "args": {"top_k": top_k, "figures": figures, "start_date": start_date, "end_date": end_date},
        "rows": {"orders_features": int(len(df))},
        "outputs": [str(features_path), str(report_path), str(dict_path), str(schema_path)],
    }
    meta_path = ROOT / "data/processed/_run_meta.json"
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

    typer.echo("✅ rebuild OK")

Task 2 checkpoint (run it) (5 minutes)

macOS/Linux

uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures

Windows PowerShell

uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures

Checkpoint: you see ✅ rebuild OK.

Task 3 — Add DuckDB queries (25 minutes)

  1. Add DuckDB dependency (if needed): uv add duckdb
  2. Create scripts/run_day5_duckdb_queries.py
  3. Run 2 queries on orders_features.parquet and write CSV outputs to reports/

Checkpoint: the CSVs exist in reports/.

Note

If you can’t install DuckDB due to connectivity, keep the script and run it later. The SQL practice is still valuable.

Solution (Task 3): scripts/run_day5_duckdb_queries.py

from __future__ import annotations

from pathlib import Path
import duckdb

ROOT = Path(__file__).resolve().parents[1]

FEATURES = ROOT / "data/processed/orders_features.parquet"
OUT1 = ROOT / "reports/day5_duckdb_top_countries.csv"
OUT2 = ROOT / "reports/day5_duckdb_monthly_revenue.csv"

def main() -> None:
    OUT1.parent.mkdir(parents=True, exist_ok=True)
    src = FEATURES.as_posix()  # forward slashes keep the SQL path valid on Windows too

    q1 = f'''
    SELECT country, COUNT(*) AS n, SUM(amount) AS revenue
    FROM read_parquet('{src}')
    GROUP BY 1
    ORDER BY revenue DESC
    LIMIT 10
    '''
    df1 = duckdb.query(q1).df()
    df1.to_csv(OUT1, index=False)

    q2 = f'''
    SELECT month, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM read_parquet('{src}')
    GROUP BY 1
    ORDER BY month ASC
    '''
    df2 = duckdb.query(q2).df()
    df2.to_csv(OUT2, index=False)

    print("wrote:", OUT1)
    print("wrote:", OUT2)

if __name__ == "__main__":
    main()

Task 3 checkpoint (run it) (5 minutes)

macOS/Linux

uv run python -m scripts.run_day5_duckdb_queries
ls -la reports/day5_duckdb_*.csv

Windows PowerShell

uv run python -m scripts.run_day5_duckdb_queries
dir reports\day5_duckdb_*.csv

Task 4 — Write reports/handoff.md (20 minutes)

  1. Create reports/handoff.md
  2. Add: quick start, key artifacts, optional SQL artifacts, caveats, troubleshooting

Checkpoint: a teammate can follow it without asking you questions.

Git checkpoint (2 minutes)

  • git status
  • Commit with message: "day5: contract + cli rebuild + run meta + duckdb + handoff"
  • git push

Checkpoint: your GitHub repo shows the new commit.

Debug playbook

When something breaks:

  1. Confirm repo root (run from the folder with data/ + scripts/ + bootcamp_data/)

  2. Re-run: uv run python -m bootcamp_data.cli rebuild

  3. Read the error message: which check failed? which column?

  4. Verify inputs:

    • data/raw/ files exist
    • processed artifacts are not stale (rebuild overwrites)
  5. If using date filters: make sure they don’t filter everything out

Thank You!