AI Professionals Bootcamp | Week 2
2025-12-25
Goal: ship Week 2 as something a teammate can run in one command and trust.
Bootcamp • SDAIA Academy
By the end of today, you can:
- Rebuild `orders_features.parquet` with one command
- Write run metadata (`data/processed/_run_meta.json`) and a schema summary

Warm‑up: rerun Day 4 and confirm outputs exist.
macOS/Linux
Windows PowerShell
Checkpoint: these exist:
- `reports/day4_publishable_report.md`
- `reports/day4_data_dictionary.md`
- `data/processed/orders_features.parquet`

Note
If reports/figures/ is empty, that’s okay. Figures are optional and offline‑friendly.
- `orders_enriched.parquet` + `reports/day2_summary.md`
- `reports/day3_eda.md` (+ optional figures)

Today: add guardrails + an audit trail so a teammate can rebuild and trust the artifacts.
Offline‑first: everything runs locally from data/raw/ + data/processed/.
Quality gates + “what does trustworthy mean?”
By the end of this session, you can:
Today we implement (2) + (3).
Without a gate:
With a gate:
A data contract is a small, explicit list of promises about a table:
- required columns and dtypes
- a unique, non‑missing key (e.g. `order_id`)
- allowed value ranges

Tip
Contracts are just composed checks. No new framework.
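A minimal sketch of that idea (the helper names here are illustrative, not the bootcamp's `checks.py` API):

```python
import pandas as pd

def require_columns(df: pd.DataFrame, cols: list[str]) -> None:
    # One small check: the promised columns are present
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

def validate_orders(df: pd.DataFrame) -> None:
    # A "contract" is just small checks composed in sequence
    require_columns(df, ["order_id", "amount"])
    if df["order_id"].duplicated().any():
        raise ValueError("order_id must be unique")
    if (df["amount"].dropna() < 0).any():
        raise ValueError("amount must be >= 0")

# Passes silently on data that keeps its promises:
validate_orders(pd.DataFrame({"order_id": ["a", "b"], "amount": [1.0, 2.0]}))
```

Each rule is a plain `if`/`raise`; the contract is only the order in which you run them.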
In pairs, write a minimal contract for orders_features:
Checkpoint: every rule can be checked automatically in code.
A reasonable minimum set:
Required columns: order_id, user_id, amount, created_at, hour
Key rule: order_id is unique and non‑missing
Range rules:
- `amount >= 0`
- `hour` is in `[0, 23]` (after parsing)

Note
Your contract can be stricter later. Start minimal, then tighten.
Use pandas.api.types instead of comparing dtype strings.
Warning
If order_id becomes an int, you may permanently lose leading zeros.
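A quick illustration on toy data, using `pandas.api.types` for the dtype check:

```python
import pandas as pd
from pandas.api.types import is_string_dtype

ids = pd.Series(["00123", "00456"], name="order_id")
assert is_string_dtype(ids)  # dtype-safe check, no comparing dtype strings

# Casting to int silently drops the leading zeros -- this is the hazard:
as_int = ids.astype(int)
print(as_int.tolist())  # [123, 456] -- "00123" is gone for good
```

Once the zeros are dropped, no downstream step can recover them, which is why the contract pins `order_id` to a string dtype.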
Question: Where should the contract run?
Answer: at least before publishing/handoff, and again in the one‑command rebuild.
- `pandas.api.types` for dtype checks

When you return: we'll build a one‑command rebuild and write run metadata.
One‑command rebuild + run metadata (Typer CLI)
By the end of this session, you can:
A teammate should not have to remember 6 steps.
One command gives you:
Week 1 already taught you Typer CLIs.
We use Typer today because:
- `--help` is automatic and pleasant
- options are typed (`int`, `bool`, optional strings)

Tip
We are not building a fancy CLI. Just one command: rebuild.
Run it like:

`uv run python -m bootcamp_data.cli rebuild`
We will call your existing scripts using subprocess.run.
Key habits:
- `sys.executable` (works inside `uv run`)
- `-m scripts...` (import‑safe)
- `cwd=ROOT` (paths are correct)

After running scripts, verify the artifacts.
Run metadata answers:
Why it matters: when numbers change, you can debug what changed.
Write to: data/processed/_run_meta.json
Tip
Keep it small and truthful. You can always add fields later.
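A minimal sketch of such a record (these field names are suggestions; the full `rebuild` command in the build session writes a richer version):

```python
import json
import sys
from datetime import datetime, timezone

# Small and truthful: when it ran, what produced it, what shaped it
meta = {
    "started_utc": datetime.now(timezone.utc).isoformat(),
    "python": sys.version.split()[0],  # interpreter that produced the artifacts
    "args": {"top_k": 10},             # the knobs that shaped this run
}
payload = json.dumps(meta, indent=2)
# The rebuild command would write this to data/processed/_run_meta.json
print(payload)
```

When a number changes next week, this record tells you whether the code, the arguments, or the run time differed.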
Write a small schema/missingness table for the handoff.
Target path: reports/schema_summary.csv
Columns:
- `column`
- `dtype`
- `n_missing`

When you return: we'll use DuckDB to query your Parquet and then finish the handoff.
DuckDB SQL lens + handoff docs + ship checklist
By the end of this session, you can:
DuckDB is:
- an in‑process SQL engine built for analytics (no server to manage)
- able to query Parquet and CSV files directly with plain SQL
- a good fit for offline‑first work
Note
We use DuckDB as an analytics lens, not as the source of truth for transforms.
Write a SQL query that returns:
- `month`
- `n_orders`
- `revenue`

Sorted by `month` ascending.
Checkpoint: your query uses GROUP BY.
Keep it simple:
- write outputs to `reports/` as CSV

Example:
- `reports/day5_duckdb_top_countries.csv`
- `reports/day5_duckdb_monthly_revenue.csv`

`reports/handoff.md` template:

```markdown
# Week 2 Handoff (Data Work)

## Quick start (5 minutes)

1) uv run python -m bootcamp_data.cli rebuild --top-k 10 --figures
2) Open reports/day4_publishable_report.md

## Key artifacts

- data/processed/orders_features.parquet
- reports/day4_publishable_report.md
- reports/day4_data_dictionary.md
- data/processed/_run_meta.json
- reports/schema_summary.csv

## Optional SQL artifacts (DuckDB)

- reports/day5_duckdb_top_countries.csv
- reports/day5_duckdb_monthly_revenue.csv

## Data quality caveats

- Missingness: ...
- Join coverage: ...
- Outliers/winsorization: ...

## Troubleshooting

- Run from repo root
- Rerun rebuild
- If date filters empty the data, adjust --start-date/--end-date
```

Before you call it "done":
- `uv run python -m bootcamp_data.cli rebuild` succeeds from repo root
- `orders_features.parquet` exists and the contract check passes
- `data/processed/_run_meta.json` exists
- `reports/schema_summary.csv` exists
- `reports/handoff.md` tells a teammate exactly what to do

When you return: we'll implement the contract + CLI rebuild + run metadata + DuckDB outputs and commit.
Build: contract gate + one‑command rebuild + run metadata + DuckDB outputs + handoff
By the end, you should have:
- `bootcamp_data/contract.py` (contract for `orders_features`)
- `bootcamp_data/cli.py` (Typer command: `rebuild`)
- `data/processed/_run_meta.json` (run metadata written by `rebuild`)
- `reports/schema_summary.csv` (schema + missingness summary)
- `scripts/run_day5_duckdb_queries.py` (writes SQL outputs to `reports/`)
- `reports/handoff.md` (quick start + artifacts + caveats)

```text
bootcamp_data/
    __init__.py
    checks.py
    contract.py                      # NEW
    cli.py                           # NEW (Typer)
    io.py
    transforms.py
scripts/
    __init__.py
    run_day2_validate_and_report.py
    run_day3_eda_and_figures.py
    run_day4_publish.py
    run_day5_duckdb_queries.py       # NEW
reports/
    day4_publishable_report.md
    day4_data_dictionary.md
    schema_summary.csv               # NEW
    handoff.md                       # NEW
    day5_duckdb_top_countries.csv    # NEW (from DuckDB)
    day5_duckdb_monthly_revenue.csv  # NEW (from DuckDB)
    figures/
data/
    raw/
    processed/
        orders_features.parquet
        _run_meta.json               # NEW
```

Checkpoint: Day 4 outputs exist (Warm‑up checkpoint).
`bootcamp_data/contract.py` (25 minutes)

- Create `bootcamp_data/contract.py`
- Implement `validate_orders_features(df)`

Checkpoint: calling the function raises no error for your current `orders_features.parquet`.
`bootcamp_data/contract.py`:

```python
from __future__ import annotations

import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

from .checks import assert_in_range, assert_non_empty, assert_unique_key, require_columns

REQUIRED = [
    "order_id", "user_id", "amount", "quantity", "created_at",
    "status_clean", "month", "dow", "hour",
]


def validate_orders_features(df: pd.DataFrame, *, df_name: str = "orders_features") -> None:
    require_columns(df, REQUIRED, df_name=df_name)
    assert_non_empty(df, df_name=df_name)
    assert_unique_key(df, "order_id", allow_na=False, df_name=df_name)
    if not is_string_dtype(df["order_id"]):
        raise ValueError(f"{df_name}: order_id must be string-like (got {df['order_id'].dtype})")
    if not is_string_dtype(df["user_id"]):
        raise ValueError(f"{df_name}: user_id must be string-like (got {df['user_id'].dtype})")
    if not is_numeric_dtype(df["amount"]):
        raise ValueError(f"{df_name}: amount must be numeric (got {df['amount'].dtype})")
    assert_in_range(df["amount"].dropna(), lo=0, name=f"{df_name}.amount")
    assert_in_range(df["quantity"].dropna(), lo=0, name=f"{df_name}.quantity")
    assert_in_range(df["hour"].dropna(), lo=0, hi=23, name=f"{df_name}.hour")
```

macOS/Linux
Windows PowerShell
Checkpoint: you see contract OK.
`bootcamp_data/cli.py` (Typer) (40 minutes)

Note
If you see `ModuleNotFoundError: typer`, install it once with `uv add typer`.
Create bootcamp_data/cli.py
Add a rebuild command that runs Day 2 → Day 4 using subprocess.run
Verify key artifacts exist and are non‑empty
Load orders_features.parquet and run validate_orders_features(df)
Write:
- `data/processed/_run_meta.json`
- `reports/schema_summary.csv`

Checkpoint: `uv run python -m bootcamp_data.cli rebuild` completes successfully.
`bootcamp_data/cli.py` (1/2):

```python
from __future__ import annotations

import json
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
import typer

from .contract import validate_orders_features

app = typer.Typer(add_completion=False)
ROOT = Path(__file__).resolve().parents[1]


def run_mod(mod: str, args: list[str] | None = None) -> None:
    # sys.executable + -m + cwd=ROOT: the three subprocess habits from Part 2
    cmd = [sys.executable, "-m", mod] + (args or [])
    subprocess.run(cmd, cwd=ROOT, check=True)


def require_file(path: Path) -> None:
    if not path.exists():
        raise FileNotFoundError(f"Missing: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Empty file: {path}")


def safe_git_commit() -> str | None:
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=ROOT, text=True).strip()
    except Exception:
        return None
```

`bootcamp_data/cli.py` (2/2):

```python
@app.command()
def rebuild(
    top_k: int = typer.Option(10, help="Top-k categories to show in Day 4 report."),
    figures: bool = typer.Option(False, help="Attempt to export figures (optional)."),
    start_date: str | None = typer.Option(None, help="Optional YYYY-MM-DD filter for Day 4."),
    end_date: str | None = typer.Option(None, help="Optional YYYY-MM-DD filter for Day 4."),
) -> None:
    t0 = time.time()
    started = datetime.now(timezone.utc).isoformat()

    run_mod("scripts.run_day2_validate_and_report")
    run_mod("scripts.run_day3_eda_and_figures")
    day4_args = ["--top-k", str(top_k)]
    if start_date:
        day4_args += ["--start-date", start_date]
    if end_date:
        day4_args += ["--end-date", end_date]
    if figures:
        day4_args += ["--figures"]
    run_mod("scripts.run_day4_publish", day4_args)

    features_path = ROOT / "data/processed/orders_features.parquet"
    report_path = ROOT / "reports/day4_publishable_report.md"
    dict_path = ROOT / "reports/day4_data_dictionary.md"
    require_file(features_path)
    require_file(report_path)
    require_file(dict_path)

    df = pd.read_parquet(features_path)
    validate_orders_features(df)

    # schema summary
    schema = pd.DataFrame({
        "column": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "n_missing": [int(df[c].isna().sum()) for c in df.columns],
    }).sort_values("n_missing", ascending=False)
    schema_path = ROOT / "reports/schema_summary.csv"
    schema_path.parent.mkdir(parents=True, exist_ok=True)
    schema.to_csv(schema_path, index=False)

    meta = {
        "started_utc": started,
        "finished_utc": datetime.now(timezone.utc).isoformat(),
        "duration_s": round(time.time() - t0, 2),
        "git_commit": safe_git_commit(),
        "args": {"top_k": top_k, "figures": figures, "start_date": start_date, "end_date": end_date},
        "rows": {"orders_features": int(len(df))},
        "outputs": [str(features_path), str(report_path), str(dict_path), str(schema_path)],
    }
    meta_path = ROOT / "data/processed/_run_meta.json"
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    typer.echo("✅ rebuild OK")


if __name__ == "__main__":
    # Required so `python -m bootcamp_data.cli rebuild` actually invokes Typer
    app()
```

macOS/Linux
Windows PowerShell
Checkpoint: you see ✅ rebuild OK.
- Install DuckDB: `uv add duckdb`
- Create `scripts/run_day5_duckdb_queries.py`
- Query `orders_features.parquet` and write CSV outputs to `reports/`

Checkpoint: the CSVs exist in `reports/`.
Note
If you can’t install DuckDB due to connectivity, keep the script and run it later. The SQL practice is still valuable.
`scripts/run_day5_duckdb_queries.py`:

```python
from __future__ import annotations

from pathlib import Path

import duckdb

ROOT = Path(__file__).resolve().parents[1]
FEATURES = ROOT / "data/processed/orders_features.parquet"
OUT1 = ROOT / "reports/day5_duckdb_top_countries.csv"
OUT2 = ROOT / "reports/day5_duckdb_monthly_revenue.csv"


def main() -> None:
    OUT1.parent.mkdir(parents=True, exist_ok=True)
    q1 = """
    SELECT country, COUNT(*) AS n, SUM(amount) AS revenue
    FROM read_parquet(?)
    GROUP BY 1
    ORDER BY revenue DESC
    LIMIT 10
    """
    # Bind the parquet path as a `?` parameter via execute()
    df1 = duckdb.execute(q1, [str(FEATURES)]).df()
    df1.to_csv(OUT1, index=False)

    q2 = """
    SELECT month, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM read_parquet(?)
    GROUP BY 1
    ORDER BY month ASC
    """
    df2 = duckdb.execute(q2, [str(FEATURES)]).df()
    df2.to_csv(OUT2, index=False)
    print("wrote:", OUT1)
    print("wrote:", OUT2)


if __name__ == "__main__":
    main()
```

macOS/Linux
Windows PowerShell
`reports/handoff.md` (20 minutes)

- Write `reports/handoff.md` (use the template from earlier today)

Checkpoint: a teammate can follow it without asking you questions.
- `git status`
- Commit with message `"day5: contract + cli rebuild + run meta + duckdb + handoff"`
- `git push`

Checkpoint: your GitHub repo shows the new commit.
When something breaks:
Confirm repo root (run from the folder with data/ + scripts/ + bootcamp_data/)
Re-run: uv run python -m bootcamp_data.cli rebuild
Read the error message: which check failed? which column?
Verify inputs:
- `data/raw/` files exist
- If using date filters: make sure they don't filter everything out