AI Professionals Bootcamp | Week 2
2025-12-22
Goal: add data checks + safe joins so your EDA won’t lie.
Bootcamp • SDAIA Academy
checks.py + run_day2_validate_and_report.py

By the end of today, you can:
- create *_clean columns and dedupe using a business key rule
- use pd.merge(..., validate=...), detect join explosions with row-count checks, and measure match rate
- write report artifacts to reports/

Warm-up: run Day 1 and confirm Parquet outputs still look typed.
macOS/Linux
Windows PowerShell
Checkpoint: data/processed/orders.parquet exists and user_id is string.
Import-safe project layout (packages + module execution)
Typed processed outputs in data/processed/ (Parquet preserves dtypes)
Canonical workflow:
Note
Day 1 mostly covered Load + "typed I/O". Today we start Verify + Clean in earnest.
A Day 2 pipeline that runs end-to-end from repo root:
bootcamp_data/checks.py (small, testable checks)
extra helpers in bootcamp_data/transforms.py (flags, normalization, dedupe)
scripts/run_day2_validate_and_report.py (wire it together)
artifacts:
- data/processed/orders_clean.parquet
- data/processed/orders_enriched.parquet
- reports/day2_missingness_orders.csv
- reports/day2_summary.md (row counts + duplicates + join match rate + a missingness snapshot)

uv quick review (so commands are consistent)

Run inside your project environment:

uv run <command>

Install a dependency (if we decide we need one):

uv add <package>

Tip
We’ll stick to uv run ... so we don’t depend on “did you activate your venv?”
If you see:
Do this checklist (in order):
Are you in repo root? (folder that contains bootcamp_data/ and scripts/)
Are you running as a module?
uv run python -m scripts.run_day1_load
uv run python -m scripts.run_day2_validate_and_report

Does bootcamp_data/__init__.py exist?
Are you using the right Python?
macOS/Linux
Windows PowerShell
Raise your hand when:
uv run python -m ...

Verify — turn assumptions into checks
By the end of this session, you can:
explain “fail fast” and why it saves time
write checks for:
Bad data rarely crashes your code.
Instead it produces:
Turn assumptions into checks that stop the pipeline with a clear message.
Tip
A pipeline that crashes early is annoying. A pipeline that silently lies is dangerous.
Keep your pipeline debuggable:
- bootcamp_data/io.py → read/write
- bootcamp_data/transforms.py → transforms (df -> df)
- bootcamp_data/checks.py → validations (fail fast)
- scripts/...py → thin entrypoints (wires everything together)

If a required column is missing, nothing else matters.
- load data/processed/orders.parquet
- call require_columns with one fake column ("NOT_A_COL")

Checkpoint: you get a clear ValueError that names NOT_A_COL.
import pandas as pd
orders = pd.read_parquet("data/processed/orders.parquet")
def require_columns(df, cols, *, df_name="df"):
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"{df_name}: missing columns {missing}")

require_columns(orders, ["order_id", "user_id", "NOT_A_COL"], df_name="orders")

A surprising number of pipelines produce 0 rows.
Common reasons:
You wrote:
Give two reasons paid might be empty even if there are paid orders.
Checkpoint: you can name 2 realistic reasons.
"Paid", "PAID", "paid")"paid ")Today we’ll normalize categories and make joins + groupbys safer.
Keys connect tables. If keys are wrong, joins lie.
Fast question:
- load data/processed/orders.parquet
- count duplicated order_id values

Checkpoint: you can answer: "duplicates = 0" or "duplicates > 0".
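One way to count key duplicates, shown on a toy frame standing in for orders.parquet:

```python
import pandas as pd

# toy stand-in for orders.parquet; "o2" appears twice
orders = pd.DataFrame({"order_id": ["o1", "o2", "o2", "o3"]})

# duplicated() marks every repeat after the first occurrence
n_dup = int(orders["order_id"].duplicated().sum())
print("duplicates:", n_dup)  # duplicates: 1
```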
Even after parsing, numbers can be invalid.
Examples:
- amount >= 0
- quantity >= 0

Load orders.parquet and count rows with amount < 0 (ignore missing).

Checkpoint: you can report n_negative.
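A minimal sketch of the negative-amount count (toy data, not the real file). Note that NaN comparisons evaluate to False, so missing values are ignored automatically:

```python
import pandas as pd

# toy amounts including one negative and one missing value
orders = pd.DataFrame({"amount": [10.0, -5.0, None, 30.0]})

# (NaN < 0) is False, so the missing row does not count
n_negative = int((orders["amount"] < 0).sum())
print("n_negative:", n_negative)  # n_negative: 1
```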
Checks turn assumptions into clear, early failures
Start with high ROI:
Keep checks separate from transforms and I/O
When you return: we’ll quantify missingness and clean without deleting the story.
Clean — missingness + categories + duplicates
By the end of this session, you can:
- explain why a blanket .dropna() is risky
- add missing flags and build *_clean columns

Missing values can mean:
Your move: measure first, then decide.
- load data/processed/orders.parquet
- run missingness_report(orders)

Checkpoint: you can name the most-missing column.
A simple rule for analytics/EDA:
keep the rows, and add missing flags (*_isna)

Tip
Flags let you analyze “missingness patterns” later.
Add flags for:
- amount
- quantity

Checkpoint: your DataFrame has amount__isna and quantity__isna.
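A sketch of the flag columns on toy data, using the `__isna` suffix convention from this checkpoint:

```python
import pandas as pd

# toy orders with one gap in each column
orders = pd.DataFrame({"amount": [10.0, None, 30.0], "quantity": [1.0, 2.0, None]})

# add one boolean flag column per tracked column
for c in ["amount", "quantity"]:
    orders[f"{c}__isna"] = orders[c].isna()

print(int(orders["amount__isna"].sum()), int(orders["quantity__isna"].sum()))  # 1 1
```

The original values stay in place, so later you can group by the flags and ask whether missingness is random or systematic.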
Category columns (like status) can create fake groups:
"Paid", "PAID", " paid " become 3 categories"refunded" vs "refund") split the storyBest practice:
keep the raw column, and add a *_clean column you use for analysis

Note
re.compile(r"\s+") matches one-or-more whitespace, so multiple spaces/tabs collapse to one.
Normalize status (6 minutes)

- inspect orders["status"].value_counts(dropna=False).head(10)
- build status_norm (trim + casefold + whitespace collapse)

Checkpoint: your "clean" counts look more consistent than the raw counts.
import re
_ws = re.compile(r"\s+")
print(orders["status"].value_counts(dropna=False).head(10))
status_norm = (
orders["status"].astype("string").str.strip().str.casefold().str.replace(_ws, " ", regex=True)
)
mapping = {"refunded": "refund"}
status_clean = status_norm.map(lambda x: mapping.get(x, x))
print(status_clean.value_counts(dropna=False).head(10))

Not all duplicates are exact row duplicates.
First choose a business key, then choose a keep rule.
Common keep rules:
- if order_id duplicates exist, create a deduped DataFrame keyed on order_id

Checkpoint: deduped["order_id"].duplicated().sum() == 0.
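A minimal keep-last dedupe on toy data (the business key here is order_id; the rows are illustrative):

```python
import pandas as pd

# "o2" appears twice; keep="last" keeps the most recent row (amount 25)
orders = pd.DataFrame({"order_id": ["o1", "o2", "o2"], "amount": [10, 20, 25]})

deduped = orders.drop_duplicates(subset=["order_id"], keep="last").reset_index(drop=True)

print(int(deduped["order_id"].duplicated().sum()))  # 0
print(deduped.loc[deduped["order_id"] == "o2", "amount"].item())  # 25
```

keep="last" assumes the rows arrive in a meaningful order (e.g. by update time); sort first if they don't.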
- use *_clean before groupby

When you return: we'll join orders to users safely and write our first Day 2 report.
Join + report — safe merges + first EDA summary
By the end of this session, you can:
- use pd.merge(..., validate=...) and add a simple row-count sanity check
- measure match rate with indicator=True

A join usually doesn't crash. It just changes your row count.
Watch for:
If orders.user_id is a string but users.user_id is an int, you can get:
- no matches, so country (or other user fields) come back missing

Tip
Enforce key dtypes early, then merge.
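A small sketch of that tip with toy frames (hypothetical data): the key dtypes disagree (string vs int), so we cast both sides to pandas' string dtype before merging.

```python
import pandas as pd

# toy frames where the key dtypes disagree: strings on the left, ints on the right
orders = pd.DataFrame({"order_id": ["o1", "o2"], "user_id": ["1", "2"]})
users = pd.DataFrame({"user_id": [1, 2], "country": ["SA", "AE"]})

# enforce the same key dtype on both sides, then merge
orders["user_id"] = orders["user_id"].astype("string")
users["user_id"] = users["user_id"].astype("string")

out = orders.merge(users, on="user_id", how="left")
print(int(out["country"].isna().sum()))  # 0
```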
pd.merge(..., validate=...) (your safety belt)

validate= tells pandas what join cardinality you expect:
"one_to_one""one_to_many""many_to_one""many_to_many" (no protection)For orders → users:
So we expect: validate="many_to_one".
Pick validate (4 minutes)

You are joining:
- orders (many rows per user_id)
- users (one row per user_id)

What should you use?
A) one_to_one
B) many_to_one
C) many_to_many

Checkpoint: choose A/B/C and explain in 1 sentence.
B) many_to_one — many orders map to one user.
indicator=True

Use indicator=True to see whether each row matched:
After the merge, compute:
- n_total
- n_matched (_merge == "both")
- match_rate = n_matched / n_total

Checkpoint: you can print match_rate as a %.
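The three quantities above, computed end-to-end on toy frames (illustrative data; one order has no matching user):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"], "user_id": ["u1", "u2", "u9"]})
users = pd.DataFrame({"user_id": ["u1", "u2"], "country": ["SA", "AE"]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
out = orders.merge(users, on="user_id", how="left", indicator=True)

n_total = len(out)
n_matched = int((out["_merge"] == "both").sum())
print(f"match rate: {n_matched / n_total:.1%}")  # match rate: 66.7%
```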
If users.user_id is not unique, your merge can multiply rows.
Two ways to catch it:
- check uniqueness up front (assert_unique_key(users, "user_id"))
- pass validate="many_to_one" (it will raise)

Question: If pd.merge(..., validate="many_to_one") raises, what's the most likely cause?
Answer: the right table (users) has duplicate user_id values.
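To make that failure concrete, a minimal sketch (toy frames, not the bootcamp data): when the right side has a duplicated key, pandas raises pandas.errors.MergeError instead of silently multiplying rows.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": ["o1", "o2"], "user_id": ["u1", "u1"]})
# duplicate user_id on the right side breaks the many_to_one expectation
users = pd.DataFrame({"user_id": ["u1", "u1"], "country": ["SA", "SA"]})

try:
    orders.merge(users, on="user_id", how="left", validate="many_to_one")
except pd.errors.MergeError as e:
    print("merge refused:", e)
```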
A join can change your row count without errors.
Bad outcome: your aggregates (revenue, counts) become wrong.
import pandas as pd
orders = pd.DataFrame(
{"order_id": ["o1", "o2"], "user_id": ["u1", "u1"], "amount": [10, 20]}
)
# BUG: users should be unique by user_id, but it's not
users = pd.DataFrame(
{"user_id": ["u1", "u1"], "country": ["SA", "SA"]}
)
out = orders.merge(users, on="user_id", how="left") # no validate!
print("rows:", len(orders), "→", len(out))
print("revenue before:", orders["amount"].sum())
print("revenue after :", out["amount"].sum()) # inflatedFix: enforce uniqueness and use validate=... + row-count checks.
Before the merge:
During the merge:
validate=..., suffixes=..., indicator=True

After the merge:
A report is an artifact you can:
Tip
If it matters, write it to reports/.
Use Week 1 tools (f-strings + str.join) to render tables without extra libraries.
Create a “Top 5 countries by revenue” table.
- group by country, sum amount
- render with md_table(...)

Checkpoint: you have a Markdown table string you can print.
- validate= so merges fail when cardinality assumptions break
- indicator=True to measure match coverage

When you return: we'll implement Day 2 as committed code + artifacts.
Build: validations + cleaning + join + a Day 2 report
By the end, you should have:
bootcamp_data/checks.py with reusable validations
updates to bootcamp_data/transforms.py (flags, normalization, dedupe)
a runnable entrypoint: uv run python -m scripts.run_day2_validate_and_report
artifacts:
- data/processed/orders_clean.parquet
- data/processed/orders_enriched.parquet
- reports/day2_missingness_orders.csv
- reports/day2_summary.md

at least one commit pushed to GitHub
Warning
Do not ask GenAI to write your solution code. Ask it to explain concepts or errors.
bootcamp_data/checks.py (20 minutes)

Create a new module with these functions:
- require_columns(df, cols, *, df_name="df")
- assert_non_empty(df, *, df_name="df")
- missingness_report(df) -> DataFrame (counts + %)
- assert_unique_key(df, key, *, allow_na=False, df_name="df")
- assert_in_range(s, lo=None, hi=None, name="value")

Checkpoint: you can import everything.
macOS/Linux
Windows PowerShell
Tip
Prefer small functions + clear error messages.
If you can’t explain the check in one sentence, it’s too complicated (for now).
bootcamp_data/checks.py (part 1)

import pandas as pd

def require_columns(df: pd.DataFrame, cols: list[str], *, df_name: str = "df") -> None:
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"{df_name}: missing columns {missing}")

def assert_non_empty(df: pd.DataFrame, *, df_name: str = "df") -> None:
    if len(df) == 0:
        raise ValueError(f"{df_name}: 0 rows")

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    n = len(df)
    rep = df.isna().sum().rename("n_missing").to_frame()
    rep["p_missing"] = rep["n_missing"] / (n if n else 1)
    return rep.sort_values("p_missing", ascending=False)

bootcamp_data/checks.py (part 2)

import pandas as pd
def assert_unique_key(
    df: pd.DataFrame,
    key: str,
    *,
    allow_na: bool = False,
    df_name: str = "df",
) -> None:
    s = df[key]
    if (not allow_na) and (not s.notna().all()):
        raise ValueError(f"{df_name}.{key}: contains NA")
    dup = s.dropna().duplicated(keep=False)
    if dup.any():
        raise ValueError(f"{df_name}.{key}: not unique")

def assert_in_range(s: pd.Series, lo=None, hi=None, name: str = "value") -> None:
    x = s.dropna()
    if lo is not None and not (x >= lo).all():
        raise ValueError(f"{name}: below {lo}")
    if hi is not None and not (x <= hi).all():
        raise ValueError(f"{name}: above {hi}")

bootcamp_data/transforms.py (20 minutes)

Append these helper functions below your existing enforce_schema:
- add_missing_flags(df, cols) -> df
- normalize_text(series) -> series
- apply_mapping(series, mapping) -> series
- dedupe_keep_last(df, key_cols) -> df

Checkpoint: you can import them.
macOS/Linux
Windows PowerShell
bootcamp_data/transforms.py

import re
import pandas as pd
_ws = re.compile(r"\s+")
def add_missing_flags(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    out = df.copy()
    for c in cols:
        out[f"{c}__isna"] = out[c].isna()
    return out

def normalize_text(s: pd.Series) -> pd.Series:
    return s.astype("string").str.strip().str.casefold().str.replace(_ws, " ", regex=True)

def apply_mapping(s: pd.Series, mapping: dict[str, str]) -> pd.Series:
    return s.map(lambda x: mapping.get(x, x))

def dedupe_keep_last(df: pd.DataFrame, key_cols: list[str]) -> pd.DataFrame:
    return df.drop_duplicates(subset=key_cols, keep="last").reset_index(drop=True)

scripts/run_day2_validate_and_report.py (40 minutes)

Write a Day 2 entrypoint that:
loads raw CSVs (orders + users)
runs fast checks (required columns + non-empty)
enforces schema for orders
writes a missingness CSV report in reports/
cleans orders:
- builds status_clean (normalize + mapping)
- adds missing flags (amount, quantity)
- dedupes on order_id (keep last)

runs post-clean checks (unique order_id, non-negative amount/quantity)
joins to users with validate="many_to_one" + indicator=True
writes:
- orders_clean.parquet
- orders_enriched.parquet
- reports/day2_missingness_orders.csv
- reports/day2_summary.md

Checkpoint: script runs from repo root:
macOS/Linux
Windows PowerShell
validate= + indicator=

Warning
Do not “fix” data by deleting everything. Measure first. Keep your actions explicit.
scripts/run_day2_validate_and_report.py (1/8): imports + setup

import logging
from pathlib import Path
from bootcamp_data.config import make_paths
from bootcamp_data.io import read_orders_csv, read_users_csv, write_parquet
from bootcamp_data.checks import (
require_columns,
assert_non_empty,
missingness_report,
assert_unique_key,
assert_in_range,
)
from bootcamp_data.transforms import (
enforce_schema,
add_missing_flags,
normalize_text,
apply_mapping,
dedupe_keep_last,
)
log = logging.getLogger(__name__)
ROOT = Path(__file__).resolve().parents[1]

scripts/run_day2_validate_and_report.py (2/8): helpers + main()

def md_table(headers, rows) -> str:
    lines = []
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("| " + " | ".join(["---"] * len(headers)) + " |")
    for row in rows:
        lines.append("| " + " | ".join(str(x) for x in row) + " |")
    return "\n".join(lines)

def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
    p = make_paths(ROOT)
    reports_dir = ROOT / "reports"
    reports_dir.mkdir(parents=True, exist_ok=True)
    # (See the next slides for the body of `main()`.)

if __name__ == "__main__":
    main()

scripts/run_day2_validate_and_report.py (3/8): load + fast checks

    log.info("Load raw inputs")
    orders_raw = read_orders_csv(p.raw / "orders.csv")
    users = read_users_csv(p.raw / "users.csv")
    require_columns(
        orders_raw,
        ["order_id", "user_id", "amount", "quantity", "created_at", "status"],
        df_name="orders_raw",
    )
    require_columns(users, ["user_id", "country", "signup_date"], df_name="users")
    assert_non_empty(orders_raw, df_name="orders_raw")
    assert_non_empty(users, df_name="users")

scripts/run_day2_validate_and_report.py (4/8): schema + missingness artifact

    # (This slide's code was lost in export; it should roughly do the following,
    # since later slides use `orders`, `rep`, and `rep_path`:)
    orders = enforce_schema(orders_raw)
    rep = missingness_report(orders)
    rep_path = reports_dir / "day2_missingness_orders.csv"
    rep.to_csv(rep_path)

scripts/run_day2_validate_and_report.py (5/8): clean + post-checks

    status_norm = normalize_text(orders["status"])
mapping = {"paid": "paid", "refund": "refund", "refunded": "refund"}
orders_clean = (
orders.assign(status_clean=apply_mapping(status_norm, mapping))
.pipe(add_missing_flags, cols=["amount", "quantity"])
.pipe(dedupe_keep_last, key_cols=["order_id"])
)
assert_unique_key(orders_clean, "order_id", df_name="orders_clean")
assert_in_range(orders_clean["amount"], lo=0, name="amount")
assert_in_range(orders_clean["quantity"], lo=0, name="quantity")scripts/run_day2_validate_and_report.py (6/8): safe join + metrics assert_unique_key(users, "user_id", df_name="users")
    orders_enriched = orders_clean.merge(
        users[["user_id", "country"]],
        on="user_id",
        how="left",
        validate="many_to_one",
        indicator=True,
    )
    orders_enriched["user_found"] = orders_enriched["_merge"] == "both"
    orders_enriched = orders_enriched.drop(columns=["_merge"])
    n_total = len(orders_enriched)
    n_matched = int(orders_enriched["user_found"].sum())
    total_revenue = orders_clean["amount"].sum(min_count=1)

scripts/run_day2_validate_and_report.py (7/8): build the report

    # quick missingness snapshot (top 10 columns)
    top_missing = (
        rep.head(10)
        .reset_index()
        .rename(columns={"index": "column"})
    )
    missing_rows = [
        (r["column"], int(r["n_missing"]), f"{r['p_missing']:.1%}")
        for _, r in top_missing.iterrows()
    ]
    # duplicates snapshot (before dedupe)
    n_dup_rows = int(orders["order_id"].duplicated(keep=False).sum())
    n_removed_by_dedupe = len(orders) - len(orders_clean)
    # revenue by country (matched users only)
    by_country = (
        orders_enriched.loc[orders_enriched["user_found"]]
        .groupby("country", dropna=False)["amount"]
        .sum(min_count=1)
        .sort_values(ascending=False)
        .head(10)
    )
    country_rows = [
        (c if c == c else "(missing)", f"{v:.2f}" if v == v else "NA")
        for c, v in by_country.items()
    ]
    out_path = reports_dir / "day2_summary.md"
    report = []
    report.append("# Day 2 summary")
    report.append("")
    report.append("## Data checks")
    report.append(f"- orders_raw rows: **{len(orders_raw)}**")
    report.append(f"- orders (typed) rows: **{len(orders)}**")
    report.append(f"- orders_clean rows: **{len(orders_clean)}**")
    report.append(f"- orders_enriched rows: **{len(orders_enriched)}**")
    report.append(f"- duplicate `order_id` rows (pre-dedupe): **{n_dup_rows}**")
    report.append(f"- rows removed by dedupe: **{n_removed_by_dedupe}**")
    report.append(f"- join match rate: **{n_matched / n_total:.1%}** ({n_matched}/{n_total})")
    report.append(f"- missingness report: `{rep_path}`")
    if total_revenue == total_revenue:
        report.append(f"- total revenue (clean): **{float(total_revenue):.2f}**")
    else:
        report.append("- total revenue (clean): **NA**")
    report.append("")
    report.append("## Missingness snapshot (top 10 columns)")
    report.append(md_table(["column", "n_missing", "p_missing"], missing_rows))
    report.append("")
    report.append("## Top countries by revenue (matched users only)")
    report.append(md_table(["country", "revenue"], country_rows))
    report.append("")
    out_path.write_text("\n".join(report), encoding="utf-8")

scripts/run_day2_validate_and_report.py (8/8): write outputs

    # (This slide's code was lost in export; it should roughly write the two
    # Parquet artifacts with the imported helper, e.g.:)
    write_parquet(orders_clean, p.processed / "orders_clean.parquet")
    write_parquet(orders_enriched, p.processed / "orders_enriched.parquet")

Run Day 2 and verify outputs exist.
macOS/Linux
uv run python -m scripts.run_day2_validate_and_report
uv run python -c "import pandas as pd; df=pd.read_parquet('data/processed/orders_enriched.parquet'); print(df[['status_clean','amount__isna','quantity__isna','user_found']].head())"
uv run python -c "import pandas as pd; print(pd.read_csv('reports/day2_missingness_orders.csv').head())"Windows PowerShell
uv run python -m scripts.run_day2_validate_and_report
uv run python -c "import pandas as pd; df=pd.read_parquet('data/processed/orders_enriched.parquet'); print(df[['status_clean','amount__isna','quantity__isna','user_found']].head())"
uv run python -c "import pandas as pd; print(pd.read_csv('reports/day2_missingness_orders.csv').head())"Checkpoint: you have:
- data/processed/orders_clean.parquet
- data/processed/orders_enriched.parquet
- reports/day2_missingness_orders.csv
- reports/day2_summary.md

Commit and push: check git status, then commit with message "w2d2: validate + clean + join + report".

Checkpoint: repo shows your new commit online.
macOS/Linux
Windows PowerShell
When something fails:
Confirm you’re in repo root (ls should show bootcamp_data/ and scripts/)
Re-run the exact command: uv run python -m scripts.run_day2_validate_and_report
Print key facts:
df.shape, df.dtypes, df.head()

If merge fails:
- check dtypes and uniqueness of the right-side key (users.user_id)

If a check fails:
Tip
Most bugs are “my assumption about format was wrong.” Make the assumption explicit, then measure.
If you finish early:
- add a status snapshot (status_clean counts)
- add a missingness snapshot (missingness_report)
- add a user_found == False count to the report (unmatched users)
- write reports/day2_checks.md (what checks ran + what they asserted)

In 1–2 sentences:
What does validate="many_to_one" protect you from, and what problem does indicator=True help you measure?
Due: before Day 3 starts
Make sure these commands work from a fresh terminal (repo root):
uv run python -m scripts.run_day1_load
uv run python -m scripts.run_day2_validate_and_report

Push your changes to GitHub
Confirm these artifacts exist in your repo:
- bootcamp_data/checks.py
- data/processed/orders_clean.parquet
- data/processed/orders_enriched.parquet
- reports/day2_missingness_orders.csv
- reports/day2_summary.md (row counts + duplicates + missingness snapshot + join match rate)

Add one dataset-specific check (an invariant you believe should always hold) and include the result in reports/day2_summary.md.
Examples: status_clean is in an allowed set; a high-percentile cap is reasonable; key columns aren't missing above X%; etc.

Deliverable: GitHub repo link + screenshot of reports/day2_summary.md opened.

Tip
Add 1–2 commits with clear messages. Don’t wait until the end of the week.