Machine Learning

AI Professionals Bootcamp | Week 3

2025-12-29

Announcements / admin

  • Today you create your first reproducible training run (baseline + saved model)
  • Don’t commit generated artifacts:
    • models/runs/, data/processed/, outputs/ are gitignored on purpose
  • Keep your reports/model_card.md open (we will update the evaluation plan section)

Note

If you get stuck: read the error, then ask for clarification — not code.

Day 2: Split + baseline + train

Goal: run ml-baseline train to create a versioned run folder and save baseline metrics + a trained model.

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Splits that don’t lie (holdout + stratify)
  • Asr Prayer (20m)
  • Session 2 (60m): Baselines + metrics (beat the dummy)
  • Maghrib Prayer (20m)
  • Session 3 (60m): train command anatomy (run_id + artifacts)
  • Isha Prayer (20m)
  • Hands-on (120m): Run training, inspect artifacts, update model card, push a commit

Learning Objectives

By the end of today, you can:

  • Explain what a holdout split is (and why it exists)
  • Choose a split strategy at a high level: random / time / group
  • Explain why a Dummy baseline is mandatory
  • Identify an evaluation metric that matches a decision (precision/recall vs accuracy)
  • Run uv run ml-baseline train --target ... and find saved artifacts
  • Update your model card with a clear evaluation plan

Warm-up (5 minutes)

Run yesterday’s work and confirm it still works.

uv run ml-baseline --help
uv run ml-baseline make-sample-data
ls data/processed
uv run pytest

Checkpoint: you see features.csv or features.parquet, and tests pass.

From your model card → your training config

Yesterday you wrote a dataset contract:

  • Target (y): what you predict
  • Unit of analysis: what one row represents
  • ID passthrough: how you join predictions back
  • Forbidden columns: target + obvious leakage

Tip

Today you will use those decisions to split and train consistently.

Where today fits in the Week 3 loop

Define → Split → Baseline → Train → Save → Predict → Report

Today: Split + Baseline + Train (your first run folder). Tomorrow: we’ll go deeper on evaluation artifacts and the input schema.

End-of-day demo (what should work)

Train

uv run ml-baseline train --target is_high_value

What should happen:

  • a new folder appears in models/runs/<run_id>/
  • models/registry/latest.txt updates to the newest run
  • baseline metrics are saved to the run folder

Session 1

Splits that don’t lie (holdout + stratify)

Session 1 objectives

  • Understand splitting as a simulation of production
  • Know when random split is acceptable (minimum requirement)
  • Understand stratification for binary classification

Why do we split?

A split answers:

“If we train on some rows, how well do we do on new rows?”

Without a split, you only know:

  • “how well do we predict what we already saw?” (not useful)

Train vs holdout (final exam)

  • Train split: the model learns patterns
  • Holdout split: the model takes a “final exam” on unseen data

Tip

Treat the holdout set like production: don’t “study” it.

Random split (minimum requirement)

Random split is OK when:

  • each row is mostly independent
  • you are not forecasting over time
  • you do not have repeated entities (or you already have 1 row per entity)

Our sample dataset has 1 row per user_id, so random split is fine.

Stratification (keep class balance)

If your target is imbalanced (example: 5% positives):

  • plain random split can create weird class balance by accident
  • stratification keeps train/holdout positive rates similar

Note

Stratify only makes sense for classification (not regression).

Micro-exercise: choose the split (6 minutes)

Pick the best split strategy:

  1. Predict next month demand using last 24 months of data
  2. Predict fraud on transactions (each user has many rows)
  3. Predict churn where each user appears once (one row per user)

Checkpoint: you can justify each choice in 1 sentence.

Solution: choose the split

  1. Time split (future demand = time matters)
  2. Group split (avoid user leakage across splits)
  3. Random split (likely i.i.d. and one row per user)

Group leakage (common “accidental cheating”)

If the same entity appears in train and holdout:

  • the model can “recognize” the entity
  • metrics inflate without real learning

Typical group columns: user_id, customer_id, device_id, patient_id
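A group-aware split can be sketched with scikit-learn's GroupShuffleSplit, which keeps every group entirely in train or entirely in holdout. This is an illustrative helper — the group_split name and signature are assumptions, not the ml-baseline implementation:

```python
# Sketch: group-aware split so no entity (e.g. user_id) appears
# in both train and holdout. Avoids the "recognize the entity" leak.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def group_split(df: pd.DataFrame, *, group_col: str, test_size: float, seed: int):
    """Split so every group lands entirely in train or entirely in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return (
        df.iloc[train_idx].reset_index(drop=True),
        df.iloc[test_idx].reset_index(drop=True),
    )
```

After this split, the intersection of group IDs between the two frames is empty — that is exactly the leakage check worth asserting in a test.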

Code pattern: random stratified split

from sklearn.model_selection import train_test_split

def random_split(df, *, target, test_size, seed, stratify):
    y = df[target]
    strat = y if stratify else None
    train, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=strat
    )
    return train.reset_index(drop=True), test.reset_index(drop=True)

Minimum requirement: stratify for binary classification when possible.
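A quick sanity check of stratification on toy data — with 10% positives, both splits keep the same positive rate (the column name is_high_value matches the sample dataset; the data itself is made up):

```python
# Toy check: stratify keeps train/holdout positive rates similar
# even for an imbalanced target (10% positives here).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100), "is_high_value": [1] * 10 + [0] * 90})
train, test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["is_high_value"]
)
print(train["is_high_value"].mean(), test["is_high_value"].mean())  # both 0.10
```

Drop the stratify argument and re-run a few times with different seeds to see the holdout positive rate drift — that drift is the "weird class balance by accident" problem.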

Quick Check

Question: What problem does stratification prevent?

Answer: train and holdout accidentally having very different class balance.

Session 1 recap

  • Splitting simulates “new data”
  • Holdout is a final exam
  • Random stratified split is the minimum for binary classification

Asr break

20 minutes

When you return: be ready to explain why “beat the dummy” is required.

Session 2

Baselines + metrics (beat the dummy)

Session 2 objectives

  • Define a baseline as a floor
  • See why accuracy can lie (imbalanced targets)
  • Pick a primary metric that matches a decision

What is a baseline?

A baseline is the simplest thing that could work:

  • classification: predict the most common class
  • regression: predict the average target value

Tip

If your model can’t beat the baseline, something is wrong (or there is no signal).

If you can’t beat the dummy…

That usually means one of these:

  • your features don’t contain signal
  • your target is noisy or poorly defined
  • your split is unrealistic (leakage or mismatch)

A baseline result is a diagnostic, not an insult.

Metrics are about the decision

Before you choose a metric, ask:

  • which mistake is worse?
    • false positives (predict “yes” when it’s “no”)
    • false negatives (miss a real “yes”)
  • do you need a ranking score or a hard decision?

Note

Today we focus on decision metrics: accuracy / precision / recall / F1.

Accuracy trap (imbalanced data)

Suppose:

  • 1000 samples
  • 50 positives (5%)
  • a model predicts “negative” for everyone

What is accuracy?

Micro-exercise: compute the accuracy (3 minutes)

  1. Correct negatives = ?
  2. Total samples = ?
  3. Accuracy = ?

Checkpoint: you have a number between 0 and 1.

Solution: why accuracy can lie

  • Correct negatives = 950
  • Total = 1000
  • Accuracy = 950 / 1000 = 0.95

Warning

95% accuracy can still mean “your model never finds positives.”
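You can verify the numbers from this example directly with scikit-learn's metric functions:

```python
# The accuracy trap, computed: 1000 samples, 50 positives,
# and a "model" that always predicts the negative class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.zeros(1000, dtype=int)  # always predict "negative"

print(accuracy_score(y_true, y_pred))  # 0.95 — looks great
print(recall_score(y_true, y_pred))    # 0.0  — finds zero positives
```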

Minimum metrics (what we report this week)

Classification (binary):

  • accuracy
  • precision
  • recall
  • F1

Regression:

  • MAE (Mean Absolute Error)

Tip

Pick one primary metric and write it in your model card today.
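For regression, the same "floor" idea pairs DummyRegressor with MAE. This is an illustrative sketch on synthetic data, not the ml-baseline code:

```python
# Regression baseline sketch: always predict the training mean,
# then score with MAE on the holdout. Data here is synthetic.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X_train, X_test = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
y_train, y_test = rng.normal(loc=100, size=80), rng.normal(loc=100, size=20)

dummy = DummyRegressor(strategy="mean")  # constant prediction: mean of y_train
dummy.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, dummy.predict(X_test))
print(f"baseline MAE: {baseline_mae:.2f}")  # your model must beat this floor
```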

Optional metrics (if you finish early)

  • ROC AUC / PR AUC: useful when you care about ranking quality
  • RMSE: penalizes large errors more than MAE
  • R²: “how much variance explained” (often confusing for beginners)

Optional ≠ unimportant. It’s just not required to ship a baseline this week.

Dummy baseline in scikit-learn (pattern)

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

proba = dummy.predict_proba(X_test)
y_score = proba[:, 1] if proba.shape[1] > 1 else proba[:, 0]
# classification_metrics is the project's helper (src/ml_baseline/metrics.py)
baseline = classification_metrics(y_test, y_score, threshold=0.5)

Fit on the train split, evaluate on the holdout split.

Micro-exercise: interpret a baseline (5 minutes)

You see:

  • baseline accuracy = 0.95
  • baseline recall = 0.00
  1. What is the baseline doing?
  2. Why is this baseline “good” and “bad” at the same time?
  3. Which metric should you focus on next?

Checkpoint: you can answer in 3 short sentences.

Solution: interpret the baseline

  1. It predicts the majority class (“negative”) almost always.
  2. Accuracy looks great because positives are rare, but it never finds positives.
  3. Focus on recall/precision/F1 (or change the problem/threshold).

Session 2 recap

  • Baselines create a performance floor
  • Accuracy can be misleading on imbalanced targets
  • Choose a metric that matches your decision

Maghrib break

20 minutes

When you return: we will connect the concepts to the train command + artifacts.

Session 3

train command anatomy (run_id + artifacts)

Session 3 objectives

  • Understand what ml-baseline train does at a high level
  • Know what a run folder is and why it exists
  • Identify the minimum artifacts you need today

What ml-baseline train does (in 8 steps)

  1. Load data/processed/features.*
  2. Separate X and y (drop target + IDs from X)
  3. Split into train + holdout
  4. Fit a dummy baseline and record metrics
  5. Fit a simple scikit-learn Pipeline (baseline model)
  6. Save a run folder under models/runs/<run_id>/
  7. Write models/registry/latest.txt
  8. Print where things were saved
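Step 5's "simple scikit-learn Pipeline" could look roughly like the sketch below. The column names and model choice are assumptions for illustration, not necessarily what ml-baseline uses:

```python
# Sketch of a baseline classification pipeline: impute + scale numeric
# columns, one-hot encode categoricals, then logistic regression.
# Feature names ("age", "total_spend", "country") are assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "total_spend"]
categorical_cols = ["country"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]),
     numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])
```

Because preprocessing lives inside the Pipeline, the saved model.joblib carries it along — predictions on raw feature rows work without re-running preprocessing by hand.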

Run folder mental model

Each run folder is a snapshot:

  • the model you trained
  • the data contract you assumed
  • the metrics you got
  • the config you used (seed, split, target, …)

Tip

A run folder is how you make training reproducible and reviewable.
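The snapshot idea above can be sketched in a few lines. File names mirror the artifacts you will inspect today, but the save_run helper and the exact run_id format are assumptions, not the ml-baseline implementation:

```python
# Sketch: write one versioned run folder + update the registry pointer.
# Layout mirrors models/runs/<run_id>/ as described in the slides.
import json
from datetime import datetime, timezone
from pathlib import Path

import joblib

def save_run(model, metrics: dict, *, seed: int, root: str = "models") -> Path:
    run_id = f"{datetime.now(timezone.utc):%Y-%m-%dT%H-%M-%SZ}__seed{seed}"
    run_dir = Path(root) / "runs" / run_id
    (run_dir / "metrics").mkdir(parents=True, exist_ok=True)
    (run_dir / "model").mkdir(parents=True, exist_ok=True)
    (run_dir / "metrics" / "baseline_holdout.json").write_text(
        json.dumps(metrics, indent=2)
    )
    joblib.dump(model, run_dir / "model" / "model.joblib")
    # point the registry at the newest run
    registry = Path(root) / "registry"
    registry.mkdir(parents=True, exist_ok=True)
    (registry / "latest.txt").write_text(run_id)
    return run_dir
```

Note how latest.txt is just a pointer: older runs stay on disk untouched, which is what makes runs comparable and reviewable.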

Minimum artifacts to understand today

Inside models/runs/<run_id>/ focus on:

  • metrics/baseline_holdout.json ✅ (your floor)
  • model/model.joblib ✅ (your saved model)

You may see extra files (schema, tables, holdout metrics). We’ll explain them more on Day 3.

Micro-exercise: find the artifacts (4 minutes)

Match the question → the file:

  1. “What is the baseline score?”
  2. “Which run is the newest?”
  3. “Where is the saved model object?”

Checkpoint: you can name the file path for each.

Solution: artifact locations

  1. models/runs/<run_id>/metrics/baseline_holdout.json
  2. models/registry/latest.txt
  3. models/runs/<run_id>/model/model.joblib

Determinism (why we care about seeds)

  • Same seed → same split → fair comparisons across runs
  • Different seed → different split → metrics can change a little

Note

We don’t need “perfect” metrics today — we need repeatable training.
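You can see seed determinism directly: the same random_state reproduces the exact same split, which is what makes metrics comparable across runs.

```python
# Same seed → identical split, every time.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100)
a_train, a_test = train_test_split(X, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.2, random_state=42)

print((a_test == b_test).all())  # True — identical holdout rows
```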

Session 3 recap

  • train turns your feature table into a versioned run
  • Baseline metrics + saved model are the minimum outcomes today
  • Seeds make comparisons meaningful

Isha break

20 minutes

When you return: start Hands-on Task 1 immediately.

Hands-on

Run training + inspect baseline + update model card

Hands-on success criteria (today)

Minimum ✅

  • uv run ml-baseline train --target ... runs successfully
  • a new run folder exists under models/runs/<run_id>/
  • metrics/baseline_holdout.json exists
  • model/model.joblib exists
  • you updated reports/model_card.md (evaluation plan section)
  • 1+ commit pushed to GitHub

Optional ⭐

  • run training with a different seed and compare metrics
  • try --split-strategy time or --split-strategy group on your own dataset (if you have one)

Project touch points (Day 2)

src/ml_baseline/
  train.py       # orchestration: load → split → baseline → fit → save
  splits.py      # random/time/group split helpers
  metrics.py     # metric helpers
  pipeline.py    # preprocessing + model (baseline pipeline)
models/
  runs/<run_id>/
  registry/latest.txt
reports/
  model_card.md

Task 1 — Run training once (15 minutes)

  1. Ensure sample data exists
  2. Run training for the sample target
  3. Copy the printed run folder path
uv run ml-baseline make-sample-data
uv run ml-baseline train --target is_high_value

Checkpoint: terminal prints Saved run: .../models/runs/<run_id>.

Solution — expected output shape

You should see something like:

  • Saved run: .../models/runs/2025-12-29T...Z__classification__seed42

Note

Your run_id will be different. That’s good: each run is versioned.

Task 2 — Inspect the run folder (15 minutes)

List the run folders, open latest.txt, and confirm the minimum artifacts exist.

macOS/Linux

ls models/runs
cat models/registry/latest.txt
ls models/runs/$(cat models/registry/latest.txt)

Windows PowerShell

ls models/runs
type models/registry/latest.txt
ls models/runs/$(Get-Content models/registry/latest.txt)

Checkpoint: you can see metrics/ and model/ inside the run.

Solution — run folder checklist

models/runs/<run_id>/
  metrics/baseline_holdout.json   ✅
  model/model.joblib              ✅
  ...

Ignore the “…” for now. We’ll use more artifacts tomorrow.

Task 3 — Read baseline metrics (15 minutes)

  1. Open the baseline JSON file
  2. Answer:
    • what is the baseline’s accuracy?
    • what is the baseline’s recall?
    • what does that imply about your class balance?

Task 3 — Read baseline metrics (15 minutes)

macOS/Linux

LATEST_ID=$(cat models/registry/latest.txt)

python -m json.tool \
    models/runs/$LATEST_ID/metrics/baseline_holdout.json

Windows PowerShell

$LatestID = Get-Content models/registry/latest.txt

python -m json.tool `
    models/runs/$LatestID/metrics/baseline_holdout.json

Checkpoint: you can explain the baseline in 1–2 sentences.

Solution — what to look for

  • If accuracy is high and recall is low, the target might be imbalanced.
  • Your next job is to beat this baseline on the holdout (not on training).

Tip

Write your baseline result (1–2 numbers) into your eval_summary.md later this week.

Task 4 — Update your model card (10 minutes)

Open reports/model_card.md and update:

  • Split strategy: random holdout
  • Test size + seed (whatever you used)
  • Primary metric (pick one)

Checkpoint: your model card evaluation plan has no blanks.

Task 5 — Quality gates (10 minutes)

uv run pytest
uv run ruff check .
uv run ruff format --check .

Checkpoint: all commands exit with code 0.

Git checkpoint (2 minutes)

  • git status
  • commit message: "w3d2: train + baseline"
  • push to GitHub

Checkpoint: your commit appears on GitHub.

Solution — Git commands

git status
git add reports/model_card.md
git add -A
git commit -m "w3d2: train + baseline"
git push

Debug playbook (common Day 2 errors)

  • Missing target column → check --target matches your file’s column name
  • “file not found” → confirm data/processed/features.* exists
  • “could not convert string to float” → you forgot categorical handling (pipeline issue)
  • Parquet errors → install optional dependency: uv sync --extra parquet

Warning

Make one change at a time. Re-run train. Don’t “randomly tweak”.

Stretch goals (optional)

⭐ If you finish early:

  • Run with a different seed: --seed 7 and compare baseline metrics
  • If your own dataset has a time column, try --split-strategy time
  • (Strong students) add cross-validation on the training split (don’t remove holdout)

Exit Ticket

In 1–2 sentences each:

  1. Why do we need a holdout split?
  2. What is the purpose of a Dummy baseline?
  3. Which metric did you pick as your primary metric, and why?

What to do after class (Day 2 assignment)

Due: before Day 3 (Dec 30, 2025)

  1. Make sure this command runs:
    • uv run ml-baseline train --target <your_target>
  2. Confirm you can find:
    • metrics/baseline_holdout.json
    • model/model.joblib
  3. Update reports/model_card.md evaluation plan section
  4. Commit + push

Deliverable: GitHub repo link + the run_id from your latest run.

Tip

Do not commit models/runs/. Commit code + reports.

Thank You!