AI Professionals Bootcamp | Week 3
2025-12-30
- `predict` will use your input schema
- `models/runs/`, `data/processed/`, `outputs/` should stay gitignored

Note
Keep reports/eval_summary.md open — you’ll update it with baseline vs model results.
Goal: compute and save holdout metrics, a holdout predictions table, and an input schema for reliable prediction.
Bootcamp • SDAIA Academy
By the end of today, you can:
- Read `holdout_metrics.json` and say what it means in plain English
- Use `holdout_predictions.csv` to find false positives / false negatives
- Save `holdout_input.csv` (inference-shaped input)
- Generate `schema/input_schema.json` from training data
- Update `reports/eval_summary.md` with baseline vs model comparison

Run yesterday’s training and find your latest run folder.
macOS/Linux
Windows PowerShell
Checkpoint: your latest run exists at `models/runs/<run_id>/`.
Today: Evaluate + Save artifacts.
Tomorrow: Predict CLI uses today’s schema + holdout_input.
Holdout metrics you can trust
Training performance can look great even when the model is useless.
Warning
If your holdout split is wrong (leakage / duplicates / time mix-up), your metrics will lie.
- Baseline: a “no-skill” prediction that sets the floor. File: `metrics/baseline_holdout.json`
- Model: the trained pipeline, which must beat the baseline. File: `metrics/holdout_metrics.json`
Tip
Same split. Same holdout. Same metric. Only then can you claim improvement.
For binary classification:
F1 combines precision + recall (useful, but pick one primary metric first).
For regression:
RMSE and R² exist — but MAE is the easiest to interpret first.
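The metrics above can be computed directly with scikit-learn. A small sketch on toy data (the values here are illustrative, not from the bootcamp dataset):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, mean_absolute_error

# Binary classification: compare hard predictions to ground truth
y_true = [1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1]
precision = precision_score(y_true, y_pred)  # of predicted positives, how many were right
recall = recall_score(y_true, y_pred)        # of actual positives, how many we caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Regression: MAE is the average absolute error, in target units
y_true_reg = [10.0, 20.0, 30.0]
y_pred_reg = [12.0, 18.0, 33.0]
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # (2 + 2 + 3) / 3
```

MAE being “in target units” is exactly why it is easy to interpret: an MAE of 2.33 on spend means you are off by about 2.33 currency units on average.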
Question: If false positives are very expensive, which metric do you usually care about more?
Answer: Precision (you want “predicted positive” to be trustworthy).
Scenario: we predict high-value customers to give them an expensive benefit.
Checkpoint: you can explain your choice to your partner.
Example fields (yours may include more):
Tip
You don’t have to “optimize everything.”
Report your primary metric + 1 supporting metric.
When you return: be ready to explain your primary metric choice.
Save the rows (holdout predictions + holdout input)
- `holdout_predictions.*` (the “debug table”)
- `holdout_input.*` (inference-shaped input for Day 4)

A single number can’t tell you:
Tip
Saving a predictions table makes debugging possible.
Classification:

- optional IDs (passthrough)
- `score` (probability-like)
- `prediction` (0/1 after threshold)
- the true target column (for evaluation only)

Regression:

- optional IDs (passthrough)
- `prediction`
- the true target column (for evaluation only)
`holdout_predictions.csv` (classification)

| user_id | score | prediction | is_high_value |
|---|---|---|---|
| U_001 | 0.91 | 1 | 1 |
| U_002 | 0.73 | 1 | 0 |
| U_003 | 0.12 | 0 | 0 |
| U_004 | 0.48 | 0 | 1 |
| U_005 | 0.66 | 1 | 1 |
U_002 is a false positive. U_004 is a false negative.
Using the table:
Checkpoint: you can point to 1 FP and 1 FN row.
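The same FP/FN lookup can be scripted with pandas. A sketch assuming the column names from the example table above:

```python
import pandas as pd

# Toy version of holdout_predictions.csv from the example above
df = pd.DataFrame({
    "user_id": ["U_001", "U_002", "U_003", "U_004", "U_005"],
    "score": [0.91, 0.73, 0.12, 0.48, 0.66],
    "prediction": [1, 1, 0, 0, 1],
    "is_high_value": [1, 0, 0, 1, 1],
})

# False positives: predicted 1, truth 0
false_positives = df[(df["prediction"] == 1) & (df["is_high_value"] == 0)]
# False negatives: predicted 0, truth 1
false_negatives = df[(df["prediction"] == 0) & (df["is_high_value"] == 1)]
```

Sorting `false_positives` by `score` (descending) surfaces the most confidently wrong rows, which are usually the most informative to inspect.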
- `prediction=1` and target `0` → U_002
- `prediction=0` and target `1` → U_004

`holdout_input.*`

`holdout_input` is the holdout set without the target.
We save it because:
- we can run `ml-baseline predict` on it

Warning
Never include the target column inside holdout_input.
`run_meta.json` (one file to explain the run)

This is a small “receipt” for your run:
You don’t need to overthink it. You just need the habit: every run is explainable.
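A minimal sketch of writing that receipt; the exact fields are up to you, and the `run_id`, metric, and split settings below are illustrative:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative run "receipt": enough to explain the run later
run_meta = {
    "run_id": "2025-12-30_1530",   # illustrative run id
    "task": "classification",
    "primary_metric": "precision",
    "test_size": 0.2,
    "random_state": 42,
    "created_at": datetime.now(timezone.utc).isoformat(),
}

run_dir = Path("models/runs") / run_meta["run_id"]
run_dir.mkdir(parents=True, exist_ok=True)
(run_dir / "run_meta.json").write_text(
    json.dumps(run_meta, indent=2) + "\n", encoding="utf-8"
)
```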
Question: Why do we save holdout_input separately?
Answer: It’s inference-shaped input we can reuse to test prediction (and later check for skew).
- `holdout_predictions` enables debugging
- `holdout_input` enables reliable prediction testing
- `run_meta.json` explains the run

When you return: open your latest run folder and inspect `tables/`.
Input schema contract (for Day 4 predict)
`schema/input_schema.json`

Real-life problem:
- mismatched column names (e.g., `age_years` vs `age`)

Without a contract, prediction becomes “guess and pray”.
Tip
Your schema is a machine-checkable version of your dataset contract from Day 1.
`input_schema.json` minimum fields:

- `required_feature_columns` (exact list, ordered)
- `feature_dtypes` (basic numeric vs text handling)
- `optional_id_columns` (passthrough if present)
- `forbidden_columns` (usually the target)

The actual file can include dtype hints too — that’s OK.
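One way to derive those fields from training data; a sketch where the helper name, the `id_cols` default, and the toy columns are assumptions for illustration:

```python
import pandas as pd

def build_input_schema(X_train: pd.DataFrame, target: str, id_cols=("user_id",)) -> dict:
    """Derive a minimal input schema from the training features.

    Field names follow the minimum fields listed above; id_cols is an
    illustrative default for passthrough ID columns.
    """
    return {
        "required_feature_columns": list(X_train.columns),
        "feature_dtypes": {
            col: ("numeric" if pd.api.types.is_numeric_dtype(X_train[col]) else "text")
            for col in X_train.columns
        },
        "optional_id_columns": list(id_cols),
        "forbidden_columns": [target],
    }

# Toy training features, matching the example field names used today
X_train = pd.DataFrame({
    "age": [25, 40],
    "country": ["SA", "AE"],
    "avg_spend_30d": [10.5, 99.0],
})
schema = build_input_schema(X_train, target="is_high_value")
```

Deriving the schema from `X_train` (rather than typing it by hand) keeps the contract in sync with what the pipeline was actually fitted on.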
Open your reports/model_card.md and write down:
Checkpoint: you can say: “My inference input must include ___ and must not include ___.”
- Target: `is_high_value`
- ID: `user_id`
- Features: `age`, `country`, `avg_spend_30d`, …

Two good failures:
Warning
Fail fast with a clear error message. Silent fixes create silent bugs.
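A sketch of that fail-fast check, assuming the schema fields listed earlier (`required_feature_columns`, `forbidden_columns`):

```python
import pandas as pd

def validate_input(df: pd.DataFrame, schema: dict) -> None:
    """Raise a clear error instead of silently 'fixing' the input."""
    missing = [c for c in schema["required_feature_columns"] if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required feature columns: {missing}")
    forbidden = [c for c in schema["forbidden_columns"] if c in df.columns]
    if forbidden:
        raise ValueError(f"Forbidden columns present (target leakage?): {forbidden}")
```

Running this at the top of `predict` turns “guess and pray” into an immediate, explainable failure.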
Question: Is it okay if the target column appears in the input to predict?
Answer: No. It is a forbidden column for inference.
When you return: start Hands-on Task 1 immediately.
Implement/verify evaluation artifacts + update eval summary
After `ml-baseline train`, your latest run has:
- `metrics/baseline_holdout.json`
- `metrics/holdout_metrics.json`
- `tables/holdout_predictions.<csv|parquet>`
- `tables/holdout_input.<csv|parquet>`
- `schema/input_schema.json`
- `run_meta.json`
- `reports/eval_summary.md` with baseline vs model comparison

Optional ⭐
- `max_f1` (classification only)
- an extra section in `eval_summary.md` (e.g., performance by country)
- `metrics/` and `tables/`

macOS/Linux
Windows PowerShell
Checkpoint: you can see `holdout_metrics.json`, or you know it’s missing and needs implementation.
Extensions may be .parquet instead of .csv depending on your setup.
If holdout_metrics.json is missing:
- In `src/ml_baseline/train.py`, after `pipe.fit(...)`, predict on `X_test`
- Write the metrics to `run_dir/metrics/holdout_metrics.json`

Checkpoint: the file exists and contains your primary metric.
```python
# after: pipe.fit(X_train, y_train)
# assumes numpy as np and json are already imported at the top of train.py
if cfg.task == "classification":
    proba = pipe.predict_proba(X_test)
    y_score = proba[:, 1] if proba.shape[1] > 1 else proba[:, 0]
    y_true = np.asarray(y_test).astype(int)
    metrics = classification_metrics(y_true, y_score, threshold=0.5)
else:
    y_pred = pipe.predict(X_test)
    y_true = np.asarray(y_test).astype(float)
    metrics = regression_metrics(y_true, y_pred)

(run_dir / "metrics" / "holdout_metrics.json").write_text(
    json.dumps(metrics, indent=2) + "\n", encoding="utf-8"
)
```

`holdout_predictions` (15 minutes)

- Save the predictions table to `run_dir/tables/holdout_predictions.<ext>`

Checkpoint: you can open the file and see `prediction` (and `score` if classification).
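A sketch of assembling and saving that table. The names `run_dir` and the 0.5 threshold mirror the snippet above; the score and target values here are toy stand-ins for real model output:

```python
import pandas as pd
from pathlib import Path

# Toy stand-ins: in real code, score comes from pipe.predict_proba(X_test)[:, 1]
# and y_test is the true target for the holdout split.
score = [0.91, 0.12]
prediction = [int(s >= 0.5) for s in score]  # apply the 0.5 threshold
y_test = [1, 0]

table = pd.DataFrame({
    "score": score,
    "prediction": prediction,
    "is_high_value": y_test,  # true target, kept for evaluation only
})

run_dir = Path("models/runs/example_run")  # illustrative path
(run_dir / "tables").mkdir(parents=True, exist_ok=True)
table.to_csv(run_dir / "tables" / "holdout_predictions.csv", index=False)
```

If you have an ID column like `user_id`, pass it through into the table so FP/FN rows are traceable back to real records.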
`holdout_input` (10 minutes)

- Save `X_test` (features-only) to `run_dir/tables/holdout_input.<ext>`

Checkpoint: `holdout_input` contains no target column.
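Since `X_test` is already features-only, this task is a single write plus a guard. A sketch with an illustrative target name and toy features:

```python
import pandas as pd
from pathlib import Path

target_col = "is_high_value"  # illustrative target name
X_test = pd.DataFrame({"age": [25, 61], "avg_spend_30d": [10.5, 99.0]})

# Guard: make target leakage impossible to miss
assert target_col not in X_test.columns, f"target {target_col!r} leaked into holdout_input"

run_dir = Path("models/runs/example_run")  # illustrative path
(run_dir / "tables").mkdir(parents=True, exist_ok=True)
X_test.to_csv(run_dir / "tables" / "holdout_input.csv", index=False)
```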
- Write `run_dir/schema/input_schema.json`

Checkpoint: the schema lists required features + the forbidden target.
`reports/eval_summary.md` (15 minutes)

Fill in:

- the baseline result (from `baseline_holdout.json`)
- the model result (from `holdout_metrics.json`)

Checkpoint: your eval summary compares baseline vs model using the same primary metric.
| Run | Primary metric | Baseline | Model |
|---|---|---|---|
| `<run_id>` | `<precision / recall / MAE>` | `<value>` | `<value>` |

Tip
Keep it honest. A baseline that is “not great” is still valuable if it’s reproducible and explainable.
- Check `git status`
- Commit: "w3d3: save holdout metrics + artifacts"

Checkpoint: your repo shows the new commit online.
- `data/processed/features.<csv|parquet>`
- `models/registry/latest.txt`
- `X_train.shape`, `X_test.shape`
- `X_train.columns.tolist()[:10]`

Warning
Do not “fix” by deleting random files. Fix by understanding the contract.
- tune the threshold (e.g., `max_f1`) and record the chosen threshold
- update `eval_summary.md`

In 1–2 sentences each:
- Why save `holdout_predictions` instead of only saving metrics?
- Why save `input_schema.json`?

Due: before Day 4 (Dec 31, 2025)
Verify `train` creates today’s artifacts on your dataset:

- `holdout_metrics.json`
- `holdout_predictions.*`
- `holdout_input.*`
- `schema/input_schema.json`

Using `holdout_predictions`, identify:
Update `reports/eval_summary.md`:
Deliverable: GitHub repo link + latest run_id.
Tip
Tomorrow you will run predict on holdout_input. If holdout_input or schema is wrong, Day 4 will hurt.