AI Professionals Bootcamp | Week 3
2025-12-28
You may use Generative AI only for clarifying questions.
Warning
This week is a graded repo. If GenAI writes your training/predict code, you will not build the skill.
Note
Capstone teams + project ideas are finalized by end of Week 5 (Jan 15, 2026).
Goal: understand the supervised learning loop and write a clear dataset contract for the Week 3 project.
Bootcamp • SDAIA Academy
By the end of today, you can:
- Run `ml-baseline --help` and `ml-baseline make-sample-data`
- Fill `reports/model_card.md` with a draft dataset contract

In pairs, answer:
Checkpoint: you can say “one row = …”, “ID = …”, “path = data/processed/...”.
By Friday, your repo can:
- `data/processed/features.<csv|parquet>`
- `models/runs/<run_id>/outputs/preds.csv`
- `reports/model_card.md` + `reports/eval_summary.md`

Today we only build the foundation: ML basics + dataset contract + “hello CLI”.
Train
Predict
Note
Don’t worry about the options yet. This week is: define → split → baseline → train → evaluate → save → predict → report.
ML in one picture (supervised learning basics)
You have examples where the answer is known.
Tip
In this bootcamp, “ML” means: “turn a table into a reliable prediction pipeline.”
Today: we focus on understanding the pieces (not the math).
Remember
Example table
| user_id | country | n_orders | total_amount | is_high_value |
|---|---|---|---|---|
| u001 | US | 8 | 92.0 | 1 |
| u002 | GB | 2 | 18.5 | 0 |
| u003 | CA | 5 | 63.2 | 0 |
Note
In the sample data, is_high_value is classification (0 or 1).
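The table above can be carved into the three pieces you will name all week: the target (y), the ID passthrough, and the features (X). A minimal sketch using pandas (the column names follow the example table; nothing here is the bootcamp's actual code):

```python
import pandas as pd

# The example table from above, as a DataFrame.
df = pd.DataFrame({
    "user_id": ["u001", "u002", "u003"],
    "country": ["US", "GB", "CA"],
    "n_orders": [8, 2, 5],
    "total_amount": [92.0, 18.5, 63.2],
    "is_high_value": [1, 0, 0],
})

TARGET = "is_high_value"   # y: what we predict
ID_COLS = ["user_id"]      # passthrough only, never trained on

y = df[TARGET]
ids = df[ID_COLS]
X = df.drop(columns=ID_COLS + [TARGET])  # features only

print(list(X.columns))  # ['country', 'n_orders', 'total_amount']
```

Note that `user_id` is excluded from X on purpose: unique IDs let a model memorize rows instead of learning patterns.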
During training:
During inference (real use):
Warning
If your inference file contains the target column, you are probably leaking information.
Look at the table below.
| user_id | country | n_orders | total_amount | is_high_value |
|---|---|---|---|---|
| u010 | US | 6 | 71.0 | 0 |
Checkpoint: you can answer in 3 short phrases: “y = …”, “X = …”, “ID = …”.
- **y:** `is_high_value`
- **ID:** `user_id`
- **X:** `country`, `n_orders`, `total_amount`

(Some projects may include more ID columns, like `customer_id`, `order_id`, etc.)
Question: At inference time, do we have the target column?
Answer: No. Inference input should be X only (plus optional ID columns).
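One way to enforce this rule is a small guard that rejects inference input still carrying the target. This is a hypothetical helper (not part of `ml-baseline`), sketched with pandas:

```python
import pandas as pd

TARGET = "is_high_value"  # the target column from the example table

def check_inference_input(df: pd.DataFrame) -> pd.DataFrame:
    """Refuse inference input that still contains the target column."""
    if TARGET in df.columns:
        raise ValueError(f"inference input must not contain {TARGET!r} (possible leakage)")
    return df

# The inference row from the table above, with the target removed.
row = pd.DataFrame({"user_id": ["u010"], "country": ["US"],
                    "n_orders": [6], "total_amount": [71.0]})
check_inference_input(row)  # passes: X plus an ID column, no target
```

If someone hands you a "test" file that includes `is_high_value`, this guard fails loudly instead of silently producing inflated metrics.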
When you return: be ready to explain your unit of analysis in 1 sentence.
From feature table → dataset contract (X/y, IDs, leakage)
A feature table is a dataset designed for modeling.
It lives in `data/processed/`.

Tip
Week 3 starts from the feature table. We do not start from raw data.
The unit of analysis answers:
“What does one row represent?”
Examples:
Note
If you change the unit of analysis, you changed the problem.
Today you’ll write these down. Later you’ll enforce them in a schema file.
ID passthrough columns are useful because:
Warning
IDs are often unique. If you train on them directly, models can “memorize” instead of learning.
Leakage means your features contain information you would not have at prediction time.
Tip
Leakage can make metrics look amazing… and then fail in real life.
Which feature is leakage if you’re predicting “will the user be high value next month?”
A. `n_orders_last_30_days`
B. `total_spend_next_30_days`
C. `country`

Now answer: why?
Checkpoint: you can explain your choice in 1 sentence.
B is leakage because it uses information from the future (next 30 days).
In reports/model_card.md, write:
Question: Why do we keep ID columns if we don’t train on them?
Answer: So we can join predictions back, debug errors, and report results per entity.
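The "join predictions back" step is just a merge on the ID column. A sketch with pandas and made-up predictions (the `pred` column is illustrative):

```python
import pandas as pd

# Original rows, keyed by the ID passthrough column.
ids = pd.DataFrame({"user_id": ["u001", "u002", "u003"],
                    "country": ["US", "GB", "CA"]})

# Hypothetical model output, keyed by the same ID.
preds = pd.DataFrame({"user_id": ["u001", "u002", "u003"],
                      "pred": [1, 0, 0]})

# The ID column lets us attach each prediction to its entity.
report = ids.merge(preds, on="user_id", how="left")
print(report)
```

Without `user_id` in the predictions file, you would have no reliable way to say *which* user each prediction belongs to.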
When you return: open the Week 3 repo and run ml-baseline --help.
Repo + CLI orientation (what you will run all week)
Why we use a CLI:
Tip
If ml-baseline --help works, your project is already more shippable.
macOS/Linux
Windows PowerShell
If pytest is missing later: uv sync --extra dev
- Run `uv run ml-baseline --help`
- Find the `train` subcommand in the output

Checkpoint: you can point to `make-sample-data` in the help output.
- Subcommands (`make-sample-data`, `train`, `predict`) are the actions
- Options (`--target`, `--seed`, …) are the “knobs” you’ll use later

Raise your hand when:
- `uv run ml-baseline --help` works
- `data/processed/features.*` exists after `make-sample-data`
- you can explain the flow `data/` → `models/` → `reports/`

When you return: start Hands-on Task 1 (setup) immediately.
Generate sample data + draft your dataset contract + Git push
Minimum ✅
- `uv run ml-baseline --help` works
- `uv run ml-baseline make-sample-data` writes `data/processed/features.*`
- `reports/model_card.md` exists and is filled (draft)

Optional ⭐
- With `pyarrow` installed, you can work with Parquet (`.parquet`) too

Setup:
1. `cd week3-ml-baseline-system/`
2. `uv sync`
3. `uv run ml-baseline --help`

Checkpoint: the help text prints and exit code is 0.
Warning
If a command fails, don’t panic—read the error carefully.
Use `uv` (Week 1 skill).

macOS/Linux
Windows PowerShell
- Run `uv run ml-baseline make-sample-data`
- It writes `data/processed/features.csv` (or `features.parquet`)

Checkpoint: you can open the file and see columns.
macOS/Linux
Windows PowerShell
If your file is features.parquet, that’s fine (it means Parquet is supported on your machine).
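Your own scripts can stay format-agnostic with a tiny loader that prefers Parquet when it exists and falls back to CSV. This is a hypothetical helper (the repo may already provide its own loading code); the demo uses a temporary directory standing in for `data/processed/`:

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_features(processed_dir) -> pd.DataFrame:
    """Load the feature table, preferring Parquet when it exists."""
    p = Path(processed_dir)
    parquet_path = p / "features.parquet"
    if parquet_path.exists():
        return pd.read_parquet(parquet_path)  # needs pyarrow (or fastparquet)
    return pd.read_csv(p / "features.csv")

# Demo: write a one-row CSV, then load it through the helper.
with tempfile.TemporaryDirectory() as d:
    pd.DataFrame({"user_id": ["u001"], "n_orders": [8]}).to_csv(
        Path(d) / "features.csv", index=False)
    df = load_features(d)

print(df.shape)  # (1, 2)
```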
Inspect `features.*`.

Checkpoint: you can say: “y = …, ID = …, X = …”.
For the sample feature table:
- **Target (y):** `is_high_value`
- **ID:** `user_id`
- **Features (X):** `country`, `n_orders`, `avg_amount`, `total_amount`

Record these in `reports/model_card.md`.

Checkpoint: your model card has no blank placeholders.
# Model Card — Week 3 (Draft)
## 1) What is the prediction?
- **Target (y):** `__________`
- **Unit of analysis:** one row = __________
- **Decision supported:** __________
## 2) Data contract (inference)
- **ID passthrough columns:** __________
- **Required feature columns (X):** __________
- **Forbidden columns:** target + leakage fields
## 3) Evaluation plan (fill on Day 2–3)
- Split strategy: __________
- Primary metric: __________

Check your work with `git status`, then commit and push.

Checkpoint: your new commit is visible online.
Tip
Small commits are a superpower. Don’t wait until the end of the week.
Warning
Do not ask GenAI to write your solution code. Ask it to explain concepts or errors.
When stuck:
- Re-run `--help` and confirm you’re using the repo environment (`uv run ...`)
- Check the project dependencies (`pyproject.toml`)

⭐ If you finish early:
- Run `uv sync --extra parquet`
- Regenerate `features.parquet` and open it

In 1–2 sentences each:
Due: before Day 2 (Dec 29, 2025)
- `uv run ml-baseline --help`
- `uv run ml-baseline make-sample-data`
- `reports/model_card.md` draft

Deliverable: GitHub repo link with your Day 1 commit(s).
Tip
Add 1–2 commits with clear messages. Don’t wait until the last minute.