Machine Learning

AI Professionals Bootcamp | Week 3

2026-01-01

GenAI policy reminder (submission day)

You may use Generative AI only for clarifying questions.

  • ✅ Allowed: definitions, “what does this error mean?”, reading docs, concept explanations
  • ❌ Not allowed: generating your solution code, debugging by copy‑paste, writing full functions

Warning

Today is a submission day. Your repo must reflect your skill.

Announcements / admin

What you submit today (minimum ✅):

  • updated reports/model_card.md
  • updated reports/eval_summary.md
  • uv run pytest passes
  • uv run ruff check . passes (install dev extra if needed)
  • pushed to GitHub (public)

Note

Capstone teams + project ideas are finalized by end of Week 5 (Jan 15, 2026).

Day 5: Write-up + submit

Goal: turn your working ML system into a reviewable, shippable repo.

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Model card = data contract + honest story
  • Asr Prayer (20m)
  • Session 2 (60m): Turn artifacts into a clear evaluation summary
  • Maghrib Prayer (20m)
  • Session 3 (60m): Final quality gate (tests, ruff, reproducibility)
  • Isha Prayer (20m)
  • Hands-on (120m): Fill reports, run checks, push to GitHub

Learning Objectives

By the end of today, you can:

  • Explain your baseline system to a teammate in 90 seconds
  • Fill model_card.md with: problem, data contract, split, metrics, limitations
  • Fill eval_summary.md with: baseline vs model, key caveats, recommendation
  • Run the full quality gate: ruff + pytest + end-to-end CLI demo
  • Submit a repo that someone else can run from the README

Warm-up (5 minutes)

Confirm end-to-end still works on your latest run.

macOS/Linux

uv run ml-baseline make-sample-data
uv run ml-baseline train --target is_high_value
run_id=$(cat models/registry/latest.txt)
holdout=$(ls models/runs/$run_id/tables/holdout_input.* | head -n 1)
uv run ml-baseline predict --run latest --input "$holdout" --output outputs/preds.csv

Windows PowerShell

uv run ml-baseline make-sample-data
uv run ml-baseline train --target is_high_value
$run_id = Get-Content models/registry/latest.txt
$holdout = (Get-ChildItem "models/runs/$run_id/tables" -Filter "holdout_input.*" | Select-Object -First 1).FullName
uv run ml-baseline predict --run latest --input $holdout --output outputs/preds.csv

Checkpoint: outputs/preds.csv exists.

The Week 3 finish line (minimum ✅)

Your system is “done” when:

  • ml-baseline train saves a run folder with model + schema + metrics
  • ml-baseline predict produces a predictions file on new input
  • reports answer: what, how well, what can go wrong, how to run

Tip

A working model with no documentation is hard to trust.

Session 1

Model card = data contract + honest story

Session 1 objectives

  • Understand what a model card is (and why teams use it)
  • Update your model_card.md to match your actual run artifacts
  • Write limitations in a way that protects you (and your future users)

What is a model card?

A model card is a short document that answers:

  • What is this model for?
  • What data does it expect? (contract)
  • How was it evaluated?
  • What are the limitations?
  • How do I run it?

Note

Think: “If I leave the company, can a teammate understand and rerun this?”

Model card structure (minimum ✅)

Keep it simple and scannable:

  1. Prediction task (target + unit of analysis)
  2. Data contract (required features, optional IDs, forbidden columns)
  3. Training recipe (split, baseline, model family)
  4. Results (baseline vs model on holdout)
  5. Limitations + failure modes
  6. How to run (train + predict commands)

Where the truth lives (don’t guess)

Use your artifacts to fill the model card:

  • data contract: schema/input_schema.json
  • run identity: run_meta.json + models/registry/latest.txt
  • metrics: metrics/baseline_holdout.json + metrics/holdout_metrics.json
  • examples: tables/holdout_predictions.*

Tip

Copy numbers from artifacts. Do not “estimate” your metrics.

Micro-exercise: find 3 facts (6 minutes)

From your latest run folder, find:

  1. your run_id
  2. one baseline metric value
  3. one model holdout metric value

Checkpoint: you can point to the exact file for each fact.

Solution (where to look)

  1. models/registry/latest.txt (or run_meta.json)
  2. models/runs/<run_id>/metrics/baseline_holdout.json
  3. models/runs/<run_id>/metrics/holdout_metrics.json
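Those lookups can also be scripted so your report numbers always come from the artifacts. A minimal sketch, assuming the course repo layout above and that both metrics files are flat `{metric_name: value}` JSON dicts (key names vary by project, so adjust to yours):

```python
import json
from pathlib import Path


def compare_holdout_metrics(repo_root: str = ".") -> dict:
    """Pair baseline vs model holdout metrics for the latest run.

    Assumes metrics files are flat {metric_name: value} dicts; only
    metrics present in BOTH files are returned so the comparison is fair.
    """
    root = Path(repo_root)
    run_id = (root / "models/registry/latest.txt").read_text().strip()
    metrics_dir = root / "models/runs" / run_id / "metrics"
    baseline = json.loads((metrics_dir / "baseline_holdout.json").read_text())
    model = json.loads((metrics_dir / "holdout_metrics.json").read_text())
    return {name: (baseline[name], model[name]) for name in baseline if name in model}


# Example (run from repo root):
# for name, (b, m) in compare_holdout_metrics().items():
#     print(f"{name}: baseline={b:.3f} model={m:.3f}")
```

Copying printed values into the model card beats retyping them from memory.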

Limitations (minimum ✅)

Write 3–5 bullets that are true now:

  • data coverage limits (missing segments, small dataset)
  • label quality assumptions (how was y created?)
  • leakage risks you checked (and what you might have missed)
  • what “good performance” does not guarantee (business caveats)

Warning

Avoid vague limits like “model may be biased.” Be specific: “performance may drop for ___ because ___.”

Session 1 recap

  • The model card is your trust document
  • Use artifacts as the source of truth
  • Clear limitations are a professional strength

Asr break

20 minutes

When you return: open reports/model_card.md and your latest run folder side-by-side.

Session 2

Turn artifacts into a clear evaluation summary

Session 2 objectives

  • Write an evaluation summary that compares baseline vs model fairly
  • Translate metrics into “what it means” (1–2 sentences)
  • Do light error analysis using holdout_predictions.*

What is eval_summary.md?

It is a short decision memo:

  • What you trained (so we know what we’re judging)
  • Results (baseline vs model)
  • What went wrong / where it fails
  • Whether you’d ship it as a baseline

Keep it short. The goal is clarity, not word count.

Metrics: what to report (minimum ✅)

Choose 1–2 “primary” metrics.

Classification (common):

  • F1 or recall/precision (pick based on the decision)
  • ROC-AUC is OK as a secondary metric

Regression (common):

  • MAE as primary

Tip

Report the baseline metric next to the model metric.

Micro-exercise: interpret one metric (6 minutes)

Open your holdout_metrics.json and answer:

  1. What is your primary metric?
  2. Is “higher is better” or “lower is better”?
  3. Write one sentence: “This means ___.”

Checkpoint: your sentence is understandable by a non-ML teammate.

Solution (example sentences)

  • F1: “On unseen holdout rows, the model balances precision and recall with F1 = ___.”
  • Recall: “The model catches ___% of positives on holdout, which matters because missing positives is costly.”
  • MAE: “On holdout, predictions are off by about ___ units on average.”
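The sentences above fall out directly from the metric definitions. A quick pure-Python sketch with made-up toy labels (not your real holdout) showing where the recall and MAE numbers come from:

```python
# Toy classification holdout (made-up values, for illustration only)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Recall = positives caught / all actual positives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)
print(f"The model catches {recall:.0%} of positives on holdout.")

# Toy regression holdout: MAE = mean absolute error
y = [10.0, 12.0, 8.0]
pred = [11.0, 9.0, 8.5]
mae = sum(abs(a - b) for a, b in zip(y, pred)) / len(y)
print(f"Predictions are off by about {mae:.1f} units on average.")
```

Plug your real artifact numbers into the same sentence shapes.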

Error analysis (minimum ✅)

Use holdout_predictions.* to find:

  • 3–5 worst mistakes (false positives / false negatives)
  • a pattern (segment, feature range, missing values)

Note

You don’t need perfect analysis. You need evidence you looked.

Optional ⭐: confidence intervals

If your holdout_metrics.json includes a CI field (example: roc_auc_ci or mae_ci):

  • report it as a range
  • explain uncertainty in one sentence

Optional does not block submission.

Session 2 recap

  • eval_summary.md is a decision memo
  • Always compare baseline vs model on the same holdout
  • A little error analysis beats none

Maghrib break

20 minutes

When you return: run python -m json.tool on your metrics files and copy numbers into your report.

Session 3

Final quality gate (tests, ruff, reproducibility)

Session 3 objectives

  • Run the “quality gate” commands and understand failures
  • Avoid the most common submission mistakes
  • Make your repo easy to grade (and easy to demo)

The quality gate (minimum ✅)

From repo root:

  1. uv run pytest
  2. uv run ruff check .
  3. (optional) uv run ruff format .

Note

If ruff is missing: uv sync --extra dev.

Common failure modes (and fixes)

  • Import errors: you edited paths/imports → check src/ package structure
  • Non-determinism: missing seed usage → set seeds in splits/training
  • Schema errors: predict fails on missing/extra columns → validate against input_schema.json
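The schema check in the last bullet can be done by hand before calling predict. A minimal sketch, assuming your `schema/input_schema.json` stores required feature names under a `"required"` key (a hypothetical field name; match your project's actual schema layout):

```python
import json
from pathlib import Path


def check_columns(input_columns, schema_path="schema/input_schema.json"):
    """Compare incoming columns against the saved data contract.

    Returns which required columns are missing and which extras were
    supplied. Assumes a {"required": [...]} schema shape (adjust to yours).
    """
    schema = json.loads(Path(schema_path).read_text())
    required = set(schema["required"])
    cols = set(input_columns)
    return {"missing": sorted(required - cols), "extra": sorted(cols - required)}
```

If `missing` is non-empty, fix the input file rather than the schema.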

Tip

When tests fail, read the assertion message first. It usually tells you what artifact is missing.

Micro-exercise: “make it gradeable” (5 minutes)

Answer in one sentence each:

  1. What command trains your model?
  2. What command runs prediction?
  3. Where do graders find your written work?

Checkpoint: you can point to the exact file paths.

Solution

  1. uv run ml-baseline train --target <your_target>
  2. uv run ml-baseline predict --run latest --input <file> --output outputs/preds.csv
  3. reports/model_card.md and reports/eval_summary.md

Session 3 recap

  • Passing tests + clean linting is part of “shipping”
  • Make your repo easy to run from the README
  • Don’t commit generated artifacts

Isha break

20 minutes

When you return: start Hands-on Task 1 and don’t stop until the checklist is green.

Hands-on

Fill reports + run checks + submit

Hands-on success criteria (today)

Minimum ✅:

  • reports/model_card.md complete (no blanks)
  • reports/eval_summary.md complete (baseline vs model)
  • uv run pytest passes
  • uv run ruff check . passes
  • pushed to GitHub (public)

Optional ⭐:

  • add confidence intervals to the write-up (if available)
  • add 1 small error slice (metrics by a segment column)

Project layout (what you touch today)

week3-ml-baseline-system/
  reports/                 # you edit these
  src/                     # you edit only if tests fail
  tests/                   # run to verify
  models/runs/             # generated (don’t commit)
  outputs/                 # generated (don’t commit)
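One way to enforce the "generated (don't commit)" comments above is a .gitignore entry; a minimal sketch using the same paths:

```
# Generated artifacts -- never commit these
models/runs/
outputs/
```

With this in place, git status stays clean after training and prediction runs.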

Task 1 — Locate your latest run (10 minutes)

  • Read models/registry/latest.txt
  • Open the run folder under models/runs/<run_id>/
  • Identify:
    • baseline_holdout.json
    • holdout_metrics.json

Checkpoint: you can open both JSON files.

Solution (Task 1)

macOS/Linux

run_id=$(cat models/registry/latest.txt)
ls models/runs/$run_id/metrics
python -m json.tool models/runs/$run_id/metrics/baseline_holdout.json
python -m json.tool models/runs/$run_id/metrics/holdout_metrics.json

Windows PowerShell

$run_id = Get-Content models/registry/latest.txt
ls "models/runs/$run_id/metrics"
python -m json.tool "models/runs/$run_id/metrics/baseline_holdout.json"
python -m json.tool "models/runs/$run_id/metrics/holdout_metrics.json"

Task 2 — Fill reports/eval_summary.md (25 minutes)

  • Describe what you trained (model family + preprocessing)
  • Paste baseline vs model metrics (holdout)
  • Do light error analysis:
    • open tables/holdout_predictions.*
    • identify 2–3 mistakes and a pattern
  • Finish with a recommendation: ship or not?

Checkpoint: your summary includes numbers and a recommendation.

Hint: quick error analysis without fancy tools

  • Classification:
    • sort by score and look at wrong predictions
    • find false positives (pred=1, y=0) and false negatives (pred=0, y=1)
  • Regression:
    • compute absolute error: abs(pred - y) and sort

You can do this in a quick notebook, or export a small CSV sample and inspect it.
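The triage above needs no fancy tools. A minimal pure-Python sketch over prediction rows loaded as dicts; the column names `y_true`, `y_pred`, and `score` are illustrative, so match them to whatever your holdout_predictions file actually contains:

```python
def worst_mistakes(rows, k=3):
    """Classification: return the k most confident wrong predictions.

    A mistake far from score 0.5 (a confident FP or FN) is more
    interesting than a borderline one, so sort by confidence.
    """
    wrong = [r for r in rows if r["y_pred"] != r["y_true"]]
    return sorted(wrong, key=lambda r: abs(r["score"] - 0.5), reverse=True)[:k]


def worst_abs_errors(rows, k=3):
    """Regression: return the k rows with the largest absolute error."""
    return sorted(rows, key=lambda r: abs(r["y_pred"] - r["y_true"]), reverse=True)[:k]
```

Eyeball the returned rows for a shared segment, feature range, or missing-value pattern.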

Solution (Task 2: what “complete” looks like)

In reports/eval_summary.md you have:

  • Baseline metrics: e.g., F1 = ___
  • Holdout metrics: e.g., F1 = ___, ROC-AUC = ___
  • Worst cases: 3 bullets
  • Next fixes: 2 bullets
  • Recommendation: 1–2 sentences

Task 3 — Fill reports/model_card.md (30 minutes)

  • Ensure the data contract is explicit:
    • target, unit of analysis
    • required features + optional IDs
    • forbidden columns (target + leakage)
  • Add the run_id you’re reporting
  • Add “how to run” commands (train + predict)

Checkpoint: a teammate can run your commands without asking you questions.

Model card template (final)

# Model Card — Week 3 Baseline

## 1) What is the prediction?
- **Target (y):** `__________`
- **Unit of analysis:** one row = __________
- **Decision supported:** __________

## 2) Data contract (inference)
- **ID passthrough columns:** __________
- **Required feature columns (X):** __________
- **Forbidden columns:** `__________` (target + leakage)

## 3) Training recipe
- **Split strategy:** random holdout (test_size=___, seed=___)
- **Baseline:** Dummy (most_frequent / mean)
- **Model family:** __________ (pipeline: preprocessing + estimator)

## 4) Results (holdout)
- **Baseline:** __________
- **Model:** __________

## 5) Limitations + failure modes
- 
- 
- 

## 6) How to run
```bash
uv run ml-baseline train --target __________
uv run ml-baseline predict --run latest --input <file> --output outputs/preds.csv
```

Task 4 — Run the quality gate (15 minutes)

  • Install dev tools if needed: uv sync --extra dev
  • Run:
    • uv run ruff check .
    • uv run pytest

Checkpoint: both commands exit with code 0.

Solution (Task 4)

macOS/Linux

uv sync --extra dev
uv run ruff check .
uv run pytest

Windows PowerShell

uv sync --extra dev
uv run ruff check .
uv run pytest

Tip

If you used ruff format ., re-run tests after formatting.

Task 5 — Git checkpoint + submit (10 minutes)

  • git status
  • Commit your report updates
  • Push to GitHub

Checkpoint: your latest commit is visible online.

Solution (Task 5)

git status
git add reports/model_card.md reports/eval_summary.md
git commit -m "w3d5: finalize reports + submission"
git push

Warning

Do not commit models/runs/ or outputs/.

Debug playbook (submission blockers)

If something fails:

  • ruff not found → uv sync --extra dev
  • tests fail → read assertion; it usually names the missing file
  • predict errors → compare input columns to schema/input_schema.json
  • path issues → confirm you run from repo root (where pyproject.toml is)

Stretch goals (optional)

⭐ If your minimum is done:

  • add 1 “slice” metric (e.g., performance by country)
  • try --threshold-strategy max_f1 (classification) and report the chosen threshold
  • add a small Plotly confusion matrix or error histogram (optional extra)
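A slice metric is just your metric computed per segment. A minimal pure-Python sketch for per-segment accuracy; the `country`, `y_true`, and `y_pred` column names are illustrative, so match them to your predictions file:

```python
from collections import defaultdict


def accuracy_by_slice(rows, slice_col="country"):
    """Accuracy per segment value (e.g., per country).

    Assumes each row dict carries y_true/y_pred plus the slice column.
    Small slices give noisy numbers, so report slice sizes too.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        totals[r[slice_col]] += 1
        hits[r[slice_col]] += int(r["y_pred"] == r["y_true"])
    return {seg: hits[seg] / totals[seg] for seg in totals}
```

A segment whose score sits far below the overall metric is a strong candidate for the limitations section of your model card.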

Exit Ticket

In 1–2 sentences:

  1. What is the single most important limitation of your model?
  2. If a teammate runs predict with a missing column, what should happen?
  3. Would you ship this baseline today? Why/why not?

What to do after class (Day 5 submission)

Due: today (Jan 1, 2026)

  1. Push your final repo to GitHub (public)
  2. Verify these files are present and updated:
    • reports/model_card.md
    • reports/eval_summary.md
  3. Verify:
    • uv run pytest
    • uv run ruff check .

Deliverable: GitHub repo link.

Tip

Before you submit, clone your repo into a new folder and run the quickstart commands. If it works there, it will grade well.

Thank You!