AI Professionals Bootcamp | Week 2
2025-12-23
Goal: turn Day 2’s processed outputs into an analysis-ready feature table (datetimes + time parts + outlier flags) and a rerunnable EDA tables report you’ll use for Plotly charts on Day 4.
Bootcamp • SDAIA Academy
Reshape with `melt` / `pivot_table`.
Outputs: `data/processed/orders_features.parquet` + `reports/day3_eda_tables.md` (+ long/wide samples for tomorrow).

By the end of today, you can:

- Parse datetimes with `pd.to_datetime(..., errors="coerce", utc=True)` and measure invalids
- Add time parts (`month`, `dow`, `hour`) for grouping
- Reshape with `melt()` / `pivot_table()`
- Write `data/processed/orders_features.parquet` + `reports/day3_eda_tables.md` from a script you can rerun
- Use `uv` to stay reproducible: `uv add ...` and `uv run ...`
Run entrypoints as modules from repo root (imports stay stable)
Write artifacts you can commit/review:
- `data/processed/`
- `reports/`

Today we build on these artifacts:

- `data/processed/orders_clean.parquet`
- `data/processed/orders_enriched.parquet`
- `reports/day2_summary.md`

Note
Week 2 is offline-first: we do not fetch anything from the internet today.
Run Day 2 and confirm the outputs exist.
macOS/Linux
Windows PowerShell
Checkpoint: these files exist:
- `data/processed/orders_clean.parquet`
- `data/processed/orders_enriched.parquet`
- `reports/day2_summary.md`

If something is off, check these common causes:

You're not in repo root
- Repo root is the folder containing `bootcamp_data/`, `scripts/`, `data/`, `reports/`

You ran by file path instead of as a module
- Use `uv run python -m scripts.run_day2_validate_and_report`, not `python scripts/run_day2_validate_and_report.py`

Your environment is missing deps
- `uv add pandas pyarrow httpx`

Warning
Do not “fix imports” by editing sys.path or setting PYTHONPATH.
Datetimes that don’t lie
By the end of this session, you can:
- Parse timestamps safely with `errors="coerce"` and `utc=True`

If time is wrong, your trends, groupings, and comparisons are wrong too.
Datetime bugs often don’t crash. They quietly lie.
When you parse time:
Tip
“I parsed it” is not enough. You need the missing-after-parse count.
`pd.to_datetime` safe defaults

Good Week 2 default:

- `errors="coerce"` → invalid strings become missing values (measurable)
- `utc=True` → timestamps become UTC-aware (avoids mixing time zones)

Warning
If your raw timestamps have mixed formats (e.g., 03/04/2025), you need a plan. Don’t guess.
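One way to have a plan is to declare the format explicitly. A minimal sketch (the day-first format here is an assumption for illustration):

```python
import pandas as pd

# Day-first strings like "03/04/2025" are ambiguous; state the format explicitly
# instead of guessing. Invalid strings still become NaT, which we can count.
raw = pd.Series(["03/04/2025", "29/04/2025", "not a date"])

parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce", utc=True)
n_invalid = int(parsed.isna().sum())

print(parsed.iloc[0])  # 3 April 2025, because we declared day-first
print("n_invalid:", n_invalid)
```

If the same column mixes several formats, parse each format separately and combine, rather than letting a single guess silently swap days and months.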
Hands-on: parse `created_at` and count invalids (6 minutes)

Load `data/processed/orders_enriched.parquet`
Parse created_at with errors="coerce" and utc=True
Print:
- the dtype of `created_at`
- the missing-after-parse count

Checkpoint: dtype is datetime-like and you can print `n_missing_created_at`.
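A minimal sketch of this step on a tiny inline frame (in the exercise you would load `data/processed/orders_enriched.parquet` instead):

```python
import pandas as pd

# Inline stand-in for orders_enriched: one good timestamp, one bad, one missing.
df = pd.DataFrame({"created_at": ["2025-01-05T10:00:00Z", "bad-timestamp", None]})

# Safe defaults: invalid strings become NaT, timestamps become UTC-aware.
df = df.assign(created_at=pd.to_datetime(df["created_at"], errors="coerce", utc=True))
n_missing_created_at = int(df["created_at"].isna().sum())

print(df["created_at"].dtype)  # datetime64[ns, UTC]
print(n_missing_created_at)    # 2: the bad string and the None
```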
Time parts turn a timestamp into group keys:
- `month` (trend)
- `dow` (day-of-week)
- `hour` (hour-of-day)

Note
.dt only works after parsing to a datetime type.
Start from the parsed created_at
Add:
- `month = created_at.dt.to_period("M").astype("string")`
- `dow = created_at.dt.day_name()`
- `hour = created_at.dt.hour`

Compute `orders_per_month` (count of rows per month)
Checkpoint: you can print a small month table sorted by month.
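The steps above can be sketched on inline sample data (in the exercise you start from your parsed `created_at`):

```python
import pandas as pd

# Already-parsed, UTC-aware timestamps standing in for created_at.
ts = pd.to_datetime(
    ["2025-01-05T10:00:00Z", "2025-01-20T23:30:00Z", "2025-02-01T08:15:00Z"], utc=True
)
df = pd.DataFrame({"created_at": ts})

# .dt accessors only work after parsing to a datetime type.
df = df.assign(
    month=df["created_at"].dt.to_period("M").astype("string"),
    dow=df["created_at"].dt.day_name(),
    hour=df["created_at"].dt.hour,
)

orders_per_month = df.groupby("month").size().rename("n_orders").reset_index()
print(orders_per_month.sort_values("month"))
```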
`.pipe()` makes transforms readable

`.pipe(fn, ...)` lets you apply a function in a clean top-to-bottom flow.
We’ll use this in hands-on when we move parsing logic into bootcamp_data/transforms.py.
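A minimal sketch of the `.pipe()` flow, using two simplified df → df helpers as stand-ins for what will live in `bootcamp_data/transforms.py`:

```python
import pandas as pd

# df -> df functions chain cleanly with .pipe().
def parse_datetime(df: pd.DataFrame, col: str) -> pd.DataFrame:
    return df.assign(**{col: pd.to_datetime(df[col], errors="coerce", utc=True)})

def add_hour(df: pd.DataFrame, ts_col: str) -> pd.DataFrame:
    return df.assign(hour=df[ts_col].dt.hour)

df = pd.DataFrame({"created_at": ["2025-03-01T09:00:00Z", "oops"]})

# Reads top-to-bottom: parse first, then derive features from the parsed column.
out = df.pipe(parse_datetime, col="created_at").pipe(add_hour, ts_col="created_at")
print(out)
```

The chain reads like a recipe: each step takes a DataFrame and returns a DataFrame, so you can reorder or remove steps without rewiring variables.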
In one sentence each, explain what these steps do:
Checkpoint: you can explain the pipeline without reading code line-by-line.
- Parse with `errors="coerce"` + measure missing-after-parse
- Use `.pipe()` to keep transforms readable (df → df)

When you return: we'll handle outliers without corrupting the story.
Outliers + robust summaries
By the end of this session, you can:
Outliers might be:

- real but extreme values (a genuinely big order)
- data errors (typos, unit mistakes)
Rule for Week 2: don’t delete silently. Flag first.
Useful first look for a numeric column:
If p99 is wildly bigger than p50, you may have a heavy tail or errors.
Hands-on: first look at `amount` (6 minutes)

- Load `orders_enriched.parquet`
- Compute quantiles of `amount`

Checkpoint: you can explain what p99 means in one sentence.
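A quick sketch of the quantile first look on an inline skewed sample (in the exercise you use the `amount` column instead):

```python
import pandas as pd

# A skewed sample: quantiles expose the heavy tail that a mean would hide.
s = pd.Series([10, 12, 11, 9, 13, 10, 11, 500])

p50, p90, p99 = s.quantile(0.50), s.quantile(0.90), s.quantile(0.99)
print(f"p50={p50:.1f} p90={p90:.1f} p99={p99:.1f}")  # p99 >> p50: heavy tail or errors
```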
If data is skewed (common in money), the mean can be misleading.
For numeric EDA, always include the median and quantiles (p25/p75/p90), not just the mean.
Tip
If mean ≫ median, your “average” is being pulled by a small number of large values.
Hands-on: robust summary of `amount` (7 minutes)

Create a small summary table with rows:
n, mean, median, p25, p75, p90, min, max
Checkpoint: you can explain “median vs mean” in one sentence.
```python
import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")
s = pd.to_numeric(df["amount"], errors="coerce")

summary = pd.DataFrame(
    {
        "metric": ["n", "mean", "median", "p25", "p75", "p90", "min", "max"],
        "value": [
            int(s.notna().sum()),
            float(s.mean()),
            float(s.median()),
            float(s.quantile(0.25)),
            float(s.quantile(0.75)),
            float(s.quantile(0.90)),
            float(s.min()),
            float(s.max()),
        ],
    }
)
print(summary)
```

IQR method:
Flag values outside `[Q1 − 1.5×IQR, Q3 + 1.5×IQR]`.

We use bounds to flag unusual values.
- Compute IQR bounds for `amount`
- Add a boolean `amount__is_outlier` column

Checkpoint: you can print `outlier_rate` as a number between 0 and 1.
```python
import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")

x = df["amount"].dropna()
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lo, hi = float(q1 - 1.5 * iqr), float(q3 + 1.5 * iqr)

df = df.assign(amount__is_outlier=(df["amount"] < lo) | (df["amount"] > hi))
print("bounds:", lo, hi)
print("outlier_rate:", float(df["amount__is_outlier"].mean()))
```

Sometimes you want readable charts without deleting rows.
Winsorization (simple version): cap values to the `[p1, p99]` range.

Warning
Use capped values for visualization. Use raw values for final totals unless you have a business rule.
Create amount_winsor by capping to p1/p99
Compute top 5 countries by revenue using:
- `amount` (raw)
- `amount_winsor` (capped)

Compare: do the top 5 change?
Checkpoint: you can answer: “Does capping change the story?”
```python
import pandas as pd

df = pd.read_parquet("data/processed/orders_enriched.parquet")

a, b = df["amount"].dropna().quantile(0.01), df["amount"].dropna().quantile(0.99)
df = df.assign(amount_winsor=df["amount"].clip(lower=a, upper=b))

raw = (
    df.groupby("country", dropna=False)["amount"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
cap = (
    df.groupby("country", dropna=False)["amount_winsor"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
print("raw top 5:\n", raw)
print("\ncapped top 5:\n", cap)
```

When you return: we'll turn these ideas into a rerunnable report (and optional figures).
Tidy data + reshape (wide ↔︎ long with melt/pivot)
By the end of this session, you can:
- Reshape wide → long with `melt()` (plotting-friendly)
- Reshape long → wide with `pivot_table()` (report-friendly)

A dataset is "tidy" when:

- each variable is a column
- each observation is a row
Why you care: groupby + Plotly become much easier.
Long (tidy):
- columns: `month`, `metric`, `value`

Wide:

- columns: `month`, `n_orders`, `revenue`, `aov`

Rule of thumb: store features as normal tables; reshape for viz/reporting.
We want one row per month:
- `n_orders`
- `revenue`
- `aov` (mean order amount)

This is perfect for a Markdown report.
Start from your transformed DataFrame with a month column
Group by month
Compute:
- `n_orders`
- `revenue`
- `aov` (mean `amount`)

Checkpoint: you can print a month table sorted by month.
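A minimal sketch of the monthly aggregation on inline data (in the exercise you start from your transformed DataFrame):

```python
import pandas as pd

# Tiny stand-in for the feature table: one row per order.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "month": ["2025-01", "2025-01", "2025-02", "2025-02"],
    "amount": [10.0, 30.0, 20.0, 40.0],
})

# Named aggregations give report-ready column names in one pass.
monthly = (
    df.groupby("month")
    .agg(
        n_orders=("order_id", "size"),
        revenue=("amount", "sum"),
        aov=("amount", "mean"),
    )
    .reset_index()
    .sort_values("month")
)
print(monthly)
```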
`melt()` = wide → long (plotting-friendly)

`melt` turns metric columns into two columns:

- `metric` (a category)
- `value` (a numeric value)

This is the format Plotly loves: x = month, y = value, color = metric.
Start from monthly (wide)
Convert to long with:
- `id_vars=["month"]`
- `value_vars=["n_orders", "revenue", "aov"]`

Checkpoint: your long table has columns: `month`, `metric`, `value`.
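The melt step can be sketched on an inline wide table (in the exercise you start from your own `monthly`):

```python
import pandas as pd

# Wide monthly metrics, one column per metric.
monthly = pd.DataFrame({
    "month": ["2025-01", "2025-02"],
    "n_orders": [2, 2],
    "revenue": [40.0, 60.0],
    "aov": [20.0, 30.0],
})

# Wide -> long: metric names become a column Plotly can map to color.
monthly_long = monthly.melt(
    id_vars=["month"],
    value_vars=["n_orders", "revenue", "aov"],
    var_name="metric",
    value_name="value",
)
print(monthly_long)
```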
`pivot_table()` = long → wide (report-friendly)

`pivot_table` is how you go back to wide.
Use it when you need a wide table for a report.

Hands-on: pivot `monthly_long` back to wide

Checkpoint: row count equals number of distinct months.
If your keys are not unique, pivot_table will aggregate without yelling.
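A guarded pivot can be sketched like this on inline data (the uniqueness check is the point, not the numbers):

```python
import pandas as pd

# Long monthly metrics: one (month, metric) pair per row.
monthly_long = pd.DataFrame({
    "month": ["2025-01", "2025-01", "2025-02", "2025-02"],
    "metric": ["revenue", "aov", "revenue", "aov"],
    "value": [40.0, 20.0, 60.0, 30.0],
})

# Check key uniqueness up front so silent aggregation can't hide duplicates.
assert not monthly_long.duplicated(["month", "metric"]).any(), "pivot keys must be unique"

# Name aggfunc explicitly instead of relying on the default (mean).
wide = (
    monthly_long.pivot_table(index="month", columns="metric", values="value", aggfunc="first")
    .reset_index()
)
print(wide)  # one row per distinct month
```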
Safer habits:

- verify keys are unique before pivoting
- set `aggfunc` intentionally (never rely on defaults)

Today we will write:
- `data/processed/orders_features.parquet` (analysis-ready features)
- `reports/day3_eda_tables.md` (tables + caveats)
- `reports/day3_monthly_metrics_long.csv` (plot-ready long form)
- `reports/day3_monthly_metrics_wide.csv` (wide form for review)

`melt()` and `pivot_table()` are the bridge between forms.

When you return: we'll build `orders_features.parquet` + `day3_eda_tables.md` + tidy reshape artifacts end-to-end.
Build: feature table + tidy reshape artifacts + EDA tables report
By the end, you should have:
- `data/processed/orders_features.parquet` (NEW)
- `bootcamp_data/eda.py` (updated helpers)
- `bootcamp_data/transforms.py` with datetime + outlier helpers
- `scripts/run_day3_features_and_tables.py` (NEW runner)
- `reports/day3_eda_tables.md`
- `reports/day3_monthly_metrics_long.csv` (plot-ready long form)
- `reports/day3_monthly_metrics_wide.csv` (wide form for review)

Repo layout:

```text
bootcamp_data/
  config.py
  checks.py
  transforms.py
  eda.py
scripts/
  run_day1_load.py
  run_day2_validate_and_report.py
  run_day3_features_and_tables.py   # NEW today
data/processed/
  orders_enriched.parquet
  orders_features.parquet           # NEW today
reports/
  day2_summary.md
  day3_eda_tables.md                # NEW today
  day3_monthly_metrics_long.csv     # NEW today
  day3_monthly_metrics_wide.csv     # optional
```

From repo root, confirm these exist:
- `data/processed/orders_enriched.parquet`
- `reports/day2_summary.md`

macOS/Linux
Windows PowerShell
Build `bootcamp_data/eda.py` (25 minutes)

In `bootcamp_data/eda.py`, implement:

- `md_table(df, max_rows=10)` → DataFrame → Markdown table string
- `top_k_table(df, group_col, value_col, k=10)` → grouped totals table
- `describe_numeric(df, col)` → robust numeric summary (mean/median/quantiles)
- `summary_metrics(df)` → top-line metrics (rows, revenue, missing timestamps, match rate, outlier rate)
- `write_markdown_report(path, title, sections)` → write a Markdown report from sections

Checkpoint:
`bootcamp_data/eda.py` (complete)

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd


def md_table(df: pd.DataFrame, max_rows: int = 10) -> str:
    view = df.head(max_rows).copy().astype("string").fillna("")
    cols = list(view.columns)
    header = "| " + " | ".join(cols) + " |"
    sep = "| " + " | ".join(["---"] * len(cols)) + " |"
    rows = ["| " + " | ".join(row) + " |" for row in view.itertuples(index=False, name=None)]
    tail = "" if len(df) <= max_rows else f"\n\n::: {{.muted}}Showing first {max_rows} of {len(df)} rows.:::\n"
    return "\n".join([header, sep, *rows]) + tail


def top_k_table(df: pd.DataFrame, group_col: str, value_col: str, k: int = 10) -> pd.DataFrame:
    return (
        df.groupby(group_col, dropna=False)[value_col]
        .agg(n="size", total="sum")
        .reset_index()
        .sort_values("total", ascending=False)
        .head(k)
    )


def describe_numeric(df: pd.DataFrame, col: str) -> pd.DataFrame:
    s = pd.to_numeric(df[col], errors="coerce")
    return pd.DataFrame(
        {
            "metric": ["n", "mean", "median", "p25", "p75", "p90", "min", "max"],
            "value": [
                int(s.notna().sum()),
                float(s.mean()),
                float(s.median()),
                float(s.quantile(0.25)),
                float(s.quantile(0.75)),
                float(s.quantile(0.90)),
                float(s.min()),
                float(s.max()),
            ],
        }
    )


def summary_metrics(df: pd.DataFrame) -> pd.DataFrame:
    rows = len(df)
    revenue = df["amount"].sum(min_count=1) if "amount" in df.columns else None
    missing_ts = df["created_at"].isna().sum() if "created_at" in df.columns else None
    match_rate = None
    if "user_found" in df.columns:
        match_rate = float(df["user_found"].mean())
    elif "country" in df.columns:
        match_rate = 1.0 - float(df["country"].isna().mean())
    outlier_rate = float(df["amount__is_outlier"].mean()) if "amount__is_outlier" in df.columns else None
    out = pd.DataFrame(
        {
            "metric": ["rows", "revenue_sum", "missing_created_at", "user_match_rate", "outlier_rate"],
            "value": [rows, revenue, missing_ts, match_rate, outlier_rate],
        }
    )
    return out.dropna(subset=["value"])


def write_markdown_report(path: Path, title: str, sections: list[tuple[str, str]]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    parts = [f"# {title}\n"]
    for h, body in sections:
        parts.append(f"## {h}\n\n{body}\n")
    path.write_text("\n".join(parts), encoding="utf-8")
```

In `bootcamp_data/transforms.py`, add:
- `parse_datetime(df, col, utc=True)`
- `add_time_parts(df, ts_col)` → month/dow/hour
- `iqr_bounds(s, k=1.5)`
- `winsorize(s, lo=0.01, hi=0.99)`
- `add_outlier_flag(df, col, k=1.5)`

Checkpoint:
```python
import pandas as pd


def parse_datetime(df: pd.DataFrame, col: str, *, utc: bool = True) -> pd.DataFrame:
    dt = pd.to_datetime(df[col], errors="coerce", utc=utc)
    return df.assign(**{col: dt})


def add_time_parts(df: pd.DataFrame, ts_col: str) -> pd.DataFrame:
    ts = df[ts_col]
    return df.assign(
        month=ts.dt.to_period("M").astype("string"),
        dow=ts.dt.day_name(),
        hour=ts.dt.hour,
    )


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    x = s.dropna()
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def winsorize(s: pd.Series, lo: float = 0.01, hi: float = 0.99) -> pd.Series:
    x = s.dropna()
    a, b = x.quantile(lo), x.quantile(hi)
    return s.clip(lower=a, upper=b)


def add_outlier_flag(df: pd.DataFrame, col: str, *, k: float = 1.5) -> pd.DataFrame:
    lo, hi = iqr_bounds(df[col], k=k)
    return df.assign(**{f"{col}__is_outlier": (df[col] < lo) | (df[col] > hi)})
```

Build `scripts/run_day3_features_and_tables.py` (40 minutes)

Create a new runner that:
reads data/processed/orders_enriched.parquet
verifies required columns
parses created_at and adds time parts
adds amount_winsor + amount__is_outlier
writes data/processed/orders_features.parquet
builds a monthly metrics table (wide) and a melted long form table
writes:
- `reports/day3_eda_tables.md`
- `reports/day3_monthly_metrics_long.csv`
- `reports/day3_monthly_metrics_wide.csv`

This keeps imports stable (no sys.path hacks).
macOS/Linux
Windows PowerShell
Put this in scripts/run_day3_features_and_tables.py:
```python
from __future__ import annotations

import logging
from pathlib import Path

import pandas as pd

from bootcamp_data.checks import assert_non_empty, require_columns
from bootcamp_data.eda import describe_numeric, md_table, summary_metrics, top_k_table, write_markdown_report
from bootcamp_data.transforms import add_outlier_flag, add_time_parts, parse_datetime, winsorize

log = logging.getLogger(__name__)


def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
    root = Path(__file__).resolve().parents[1]
    in_path = root / "data" / "processed" / "orders_enriched.parquet"
    df = pd.read_parquet(in_path)
    require_columns(
        df,
        ["order_id", "user_id", "amount", "status_clean", "created_at", "country"],
        df_name="orders_enriched",
    )
    assert_non_empty(df, df_name="orders_enriched")
```

Continue inside `main()`:
```python
    features = (
        df.pipe(parse_datetime, col="created_at", utc=True)
        .pipe(add_time_parts, ts_col="created_at")
        .assign(amount_winsor=lambda d: winsorize(d["amount"]))
        .pipe(add_outlier_flag, col="amount", k=1.5)
    )
    out_features = root / "data" / "processed" / "orders_features.parquet"
    out_features.parent.mkdir(parents=True, exist_ok=True)
    features.to_parquet(out_features, index=False)
    log.info("wrote: %s (%s rows)", out_features, len(features))
```

Continue inside `main()`:
```python
    monthly = (
        features.groupby("month", dropna=False)
        .agg(
            n_orders=("order_id", "size"),
            revenue=("amount", "sum"),
            aov=("amount", "mean"),
        )
        .reset_index()
        .sort_values("month")
    )
    monthly_long = monthly.melt(
        id_vars=["month"],
        value_vars=["n_orders", "revenue", "aov"],
        var_name="metric",
        value_name="value",
    )
    monthly_wide = (
        monthly_long.pivot_table(index=["month"], columns="metric", values="value", aggfunc="first")
        .reset_index()
    )
    reports = root / "reports"
    reports.mkdir(parents=True, exist_ok=True)
    monthly_long.to_csv(reports / "day3_monthly_metrics_long.csv", index=False)
    monthly_wide.to_csv(reports / "day3_monthly_metrics_wide.csv", index=False)
```

Continue inside `main()`:
```python
    metrics = summary_metrics(features)
    amount_desc = describe_numeric(features, "amount")
    by_month = monthly.head(24)  # show first ~2 years if present
    by_country = top_k_table(features, "country", "amount", k=10)
    by_status = top_k_table(features, "status_clean", "amount", k=10)
    caveats = (
        "- `created_at` was parsed with `errors='coerce'`; invalid timestamps become missing.\n"
        "- Outliers are **flagged** (`amount__is_outlier`). For charts we may cap values (`amount_winsor`).\n"
        "- Join coverage may be < 100% (see match rate)."
    )
    out_report = reports / "day3_eda_tables.md"
    sections = [
        ("Top-line metrics", md_table(metrics)),
        ("Amount summary (robust stats)", md_table(amount_desc)),
        ("Monthly metrics (wide)", md_table(by_month, max_rows=24)),
        ("Revenue by country (top 10)", md_table(by_country)),
        ("Revenue by status_clean (top 10)", md_table(by_status)),
        ("Data caveats", caveats),
        ("Files written", "- `data/processed/orders_features.parquet`\n- `reports/day3_monthly_metrics_long.csv`"),
    ]
    write_markdown_report(out_report, title="Day 3 — Feature table + EDA tables", sections=sections)
    log.info("wrote: %s", out_report)


if __name__ == "__main__":
    main()
```

macOS/Linux
Windows PowerShell
Checkpoint: these files exist:
- `data/processed/orders_features.parquet`
- `reports/day3_eda_tables.md`
- `reports/day3_monthly_metrics_long.csv`

Check `git status`, then commit with a message like `"w2d3: features + tidy tables"`.

When results look wrong:
- `created_at` dtype + missing-after-parse count
- `month`/`dow`/`hour` exist and look reasonable
- `user_found` (if present) or % missing `country`
- `monthly_long` should have n_months × 3 rows

If you finish early:
Add a second grouped table:
- `dow` (day-of-week)
- `hour`

Save a second long-form table for tomorrow:

- `reports/day3_by_dow_long.csv`

Add a small run metadata JSON:

- `reports/day3_run_meta.json` (rows, missing timestamps, match rate, outlier rate)

In 1–2 sentences:
When would you use melt() vs pivot_table()?
Due: before Day 4 starts
Ensure this runs from repo root:
- `uv run python -m scripts.run_day3_features_and_tables`

Commit + push your changes
Confirm these artifacts exist:
- `data/processed/orders_features.parquet`
- `reports/day3_eda_tables.md`
- `reports/day3_monthly_metrics_long.csv`

Deliverable: GitHub repo link + screenshot of `reports/day3_eda_tables.md` preview.
Tip
Tomorrow (Day 4) you should only need to read orders_features.parquet and plot.