AI Professionals Bootcamp | Week 2
2025-12-21
This week is Data Work (ETL + EDA).
Week 1 gave you: CLI + packages + uv + Git. Week 2 adds: pandas + schema discipline + offline-first ETL.
Goal: build a clean repo scaffold and produce your first processed Parquet output (typed, reproducible).
Bootcamp • SDAIA Academy
By the end of today, you can:
- diagnose and avoid `ModuleNotFoundError` (run entrypoints as modules)
- write processed outputs to `data/processed/`
- enforce column types with `enforce_schema`

Sanity-check your toolchain (you will use these every day).
Checkpoint: you can run all three commands with no errors.
uv daily workflow

Three commands you will use all week:
- `uv venv` → create `.venv/`
- `uv pip install ...` → install dependencies into the env
- `uv run <command>` → run using the env (no “wrong python” accidents)
You learned uv in Week 1. Today we’ll use it as our default toolchain.
Project: Offline‑First ETL + EDA Mini Analytics Pipeline
You will ship:
- `data/processed/*.parquet` outputs that are safe to overwrite
- `reports/summary.md` (findings + caveats)

For the rest of Week 2, you’ll build a full ETL + EDA project on one real dataset.
You may use:
- HuggingFace Datasets (`load_dataset(...)` → snapshot to disk)
- direct URL download (`httpx` + cache)

Dataset rubric (so exercises work smoothly):

- tabular, with ID-like columns (e.g. `user_id`, `product_id`, `store_id`)

Note
In class we’ll use a tiny toy dataset (orders.csv + users.csv) to validate patterns quickly.
For your weekly project, pick a real dataset and repeat the same workflow.
Warning
Don’t commit raw data or API tokens to GitHub. Commit scripts + metadata so others can reproduce the download.
By the end of Day 1, your repo should contain:
a standard folder layout (data/, bootcamp_data/, scripts/, reports/)
bootcamp_data/config.py, bootcamp_data/io.py, bootcamp_data/transforms.py
a runnable entrypoint: python -m scripts.run_day1_load
at least one processed output in data/processed/ (Parquet preferred)
a reproducibility breadcrumb for your project dataset:
- `data/raw/_source_meta.json` (where your data came from)
- `scripts/download_data.py` (how to download it again)
- at least one commit pushed to GitHub
- a raw snapshot in `data/raw/`

Rule: avoid tool sprawl. Keep it small and shippable.
Offline-first mindset + project layout (no import pain)
By the end of this session, you can:
- explain the roles of `raw/`, `cache/`, `processed/`, `external/`
- build paths with `pathlib.Path`

You will re-run your ETL many times:
If your pipeline depends on the internet, you lose time.
Offline-first projects separate data by role:
data/raw/
data/cache/
data/processed/
data/external/
Put each file into the correct folder:
- `orders.csv` you received from a teammate
- `users_api_page_1.json` downloaded from an endpoint
- `orders_clean.parquet` generated by your ETL
- `country_codes.xlsx` you manually downloaded as reference

Checkpoint: you can justify your choices in 1 sentence each.
- `orders.csv` → `data/raw/`
- `users_api_page_1.json` → `data/cache/`
- `orders_clean.parquet` → `data/processed/`
- `country_codes.xlsx` → `data/external/`

Question: Why should `data/raw/` be “immutable”?
Answer: if you edit raw data, you lose the ability to reproduce results and debug changes.
Idempotent: re-running the pipeline produces the same processed outputs (given same inputs + config).
Good pattern:
- overwrite `data/processed/orders.parquet` every run

Bad pattern:

- append to `data/processed/orders.csv` every run

You run this twice:
Checkpoint: your fixed code is safe to run 20 times.
Note
If you truly need incremental data, do it intentionally (partitioned folders, dedupe keys). Default for this bootcamp: overwrite processed outputs.
Question: What is the most common symptom of a non‑idempotent pipeline?
Answer: row counts grow every run even when inputs didn’t change.
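The overwrite-vs-append contrast can be shown with plain files; a minimal sketch (file names are made up):

```python
import tempfile
from pathlib import Path

def run_overwrite(out: Path) -> None:
    # Idempotent: the output is rebuilt from scratch on every run
    out.write_text("r1\nr2\nr3\n", encoding="utf-8")

def run_append(out: Path) -> None:
    # Not idempotent: every run adds the same rows again
    with out.open("a", encoding="utf-8") as f:
        f.write("r1\nr2\nr3\n")

with tempfile.TemporaryDirectory() as tmp:
    good, bad = Path(tmp) / "good.csv", Path(tmp) / "bad.csv"
    for _ in range(2):  # run the "pipeline" twice with identical inputs
        run_overwrite(good)
        run_append(bad)
    good_rows = len(good.read_text().splitlines())
    bad_rows = len(bad.read_text().splitlines())

print(good_rows)  # 3 — stable row count
print(bad_rows)   # 6 — rows grew although inputs didn't change
```

Running the loop more times changes nothing for the overwrite version; the append version keeps growing.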
A very common Day 1 error:
ModuleNotFoundError: No module named 'bootcamp_data'
This usually happens when you run a file by path (Python changes where it searches).
Rule: run entrypoints as modules from the repo root.
- Do: `python -m scripts.run_day1_load`
- Avoid: `python scripts/run_day1_load.py` (easy to break imports)

Warning
Avoid “fixing” imports by editing runtime paths in your code. If imports fail: change how you run, or fix the package structure.
Your repo looks like this:
You want to run the loader.
Which command is correct?
A) `python scripts/run_day1_load.py`
B) `python -m scripts.run_day1_load`

Checkpoint: you can explain why one works and the other often fails.
Correct: B) python -m scripts.run_day1_load
Reason: running as a module keeps the repo root on Python’s import search path.
If everyone uses different paths:
We fix this with a consistent layout + a central paths config.
Question: What folder should notebooks read from?
Answer: data/processed/ (not data/raw/).
Hardcoding strings like "../data/raw/orders.csv" breaks when:
Use pathlib.Path + a central config.py.
@dataclass (why we use it)

A dataclass is a lightweight way to make a “data container” class.

- writes `__init__` for you (less boilerplate)
- gives you a readable `repr` (nice for debugging)
- `frozen=True` makes it immutable (prevents accidental edits)

Today we use it to store project paths in one place.
Paths + make_paths (pattern)

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )
```

make_paths (6 minutes)

Create bootcamp_data/config.py with:
- a `Paths` dataclass
- a `make_paths(root: Path)` function
- `raw`/`cache`/`processed`/`external` paths under `root/data/`

Checkpoint: in a Python REPL, `make_paths(Path.cwd()).processed` prints a real folder path.
bootcamp_data/config.py (example)

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )
```

Question: Why use Path objects instead of plain strings?
Answer: Path handles OS differences and safe joins (root / "data" / "raw").
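A few lines showing what `Path` buys you over raw strings:

```python
from pathlib import Path

root = Path("project")
raw = root / "data" / "raw"   # safe joins on every OS; no manual "/" vs "\" handling
csv_path = raw / "orders.csv"

print(raw.as_posix())    # project/data/raw
print(csv_path.suffix)   # .csv
print(csv_path.stem)     # orders
```

`Path` objects also give you `.exists()`, `.mkdir()`, `.read_text()`, and friends without extra imports.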
- run entrypoints with `python -m ...`
- build paths with `pathlib.Path` in `config.py`

When you return: open your repo and locate `data/raw`, `data/processed`, `bootcamp_data/`, and `scripts/`.
Data sources + caching patterns
By the end of this session, you can:
- snapshot data into `data/raw/` and record source metadata

You’ll see data in a few common “shapes”:
| Source | Typical format | Where it goes | Offline-first rule |
|---|---|---|---|
| Kaggle dataset | ZIP → CSV | `data/cache/` (zip) → `data/raw/` (snapshot) | pin a snapshot + record metadata |
| HuggingFace Datasets | Arrow → pandas | `data/raw/` (snapshot) | load once → save locally |
| Direct URL download | CSV/JSON/ZIP | `data/cache/` → `data/raw/` | cache the bytes + record the URL |
| API extraction | JSON pages | `data/cache/` | never “depend on live calls” |
| Teammate drop | CSV/Parquet | `data/raw/` | treat as immutable input |
We care about repeatability, not “where it came from”. Your repo should run even when Wi‑Fi is bad.
Kaggle is great for realistic tabular datasets, but it requires a one‑time API token setup.
One-time setup (per machine):
1. Create a Kaggle account (if you don’t have one)
2. In Kaggle → Account → create a new API token → download kaggle.json
3. Move it to:
   - macOS/Linux: `~/.kaggle/kaggle.json`
   - Windows: `C:\Users\<you>\.kaggle\kaggle.json`

Warning
Never commit kaggle.json to your repo. It contains credentials.
Install the CLI (if needed):
Download a dataset ZIP into data/cache/ and unzip a snapshot into data/raw/:
Different Kaggle datasets unzip into different filenames. Your job is to decide the snapshot folder name and record it in metadata.
HuggingFace datasets load into memory (or stream), then you save a local snapshot.
Install:
Minimal snapshot example (save as Parquet):
Note
Offline-first rule: once you have a local snapshot, your ETL reads from disk, not from HuggingFace.
This is the most universal option: if you have a URL, you can cache it.
```python
from pathlib import Path

import httpx

def download_bytes(url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with httpx.Client(timeout=60.0, follow_redirects=True) as client:
        r = client.get(url)
        r.raise_for_status()
        out_path.write_bytes(r.content)

download_bytes(
    url="<paste-a-csv-or-zip-url>",
    out_path=Path("data/cache/downloads/source_file.zip"),
)
```

Then unzip/copy into `data/raw/<snapshot_name>/` and treat it as immutable.
Every raw snapshot should have a tiny metadata record so someone else can reproduce it.
Recommended location (simple): data/raw/_source_meta.json
Example:
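A minimal record might look like this (the keys mirror the download script later in this deck; the values are placeholders, not a real dataset):

```json
{
  "dataset_name": "my_week2_project",
  "source": "url",
  "dataset_id_or_url": "<paste-url>",
  "downloaded_at_utc": "2025-12-21T09:00:00+00:00",
  "raw_snapshot_folder": "data/raw/my_week2_project/",
  "files": ["source_file.csv"],
  "notes": "snapshot treated as immutable"
}
```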
We’ll expand metadata later (row counts, schema summary, git commit). Today: just make sure the origin is documented.
scripts/download_data.py (project dataset)

Keep dataset acquisition repeatable by putting it in one script.
Minimum for Day 1: support one of (URL / HuggingFace / Kaggle).
Stretch: support 2–3 sources in the same file.
Tiny URL downloader (works everywhere):
```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path

import httpx

def download(url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with httpx.Client(timeout=60.0, follow_redirects=True) as client:
        r = client.get(url)
        r.raise_for_status()
        out_path.write_bytes(r.content)

def write_meta(meta_path: Path, meta: dict) -> None:
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

def main() -> None:
    url = "<paste-url>"
    out_path = Path("data/cache/downloads/source_file.csv")
    download(url, out_path)
    write_meta(
        Path("data/raw/_source_meta.json"),
        {
            "dataset_name": "my_week2_project",
            "source": "url",
            "dataset_id_or_url": url,
            "downloaded_at_utc": datetime.now(timezone.utc).isoformat(),
            "raw_snapshot_folder": "data/raw/my_week2_project/",
            "files": [],
            "notes": "downloaded bytes cached; raw snapshot extracted separately",
        },
    )

if __name__ == "__main__":
    main()
```

Note
This script is your “repro button” — it doesn’t need to be fancy. It just needs to be honest and rerunnable.
Two common failure modes:
Your job: make extraction reproducible.
Before you trust extracted data:
Store this next to the cached file (same name + .meta.json).
You downloaded: data/cache/users_page_1.json
Checkpoint: your filename ends with .meta.json and your keys support reproducibility.
filename: data/cache/users_page_1.meta.json
keys (example):
- `timestamp_utc`
- `endpoint`
- `params`
- `status_code`

Question: If you don’t store params, what breaks later?
Answer: you can’t reproduce which slice of data you extracted (numbers won’t match).
Internet is unreliable.
Caching means:
- fetched responses are saved under `data/cache/`

Operational rule: if a cached copy exists, read it; only fetch when the cache is missing (or expired).
This prevents “live calls” from breaking your workflow.
httpx in 60 seconds

`httpx` is a Python HTTP client (like `requests`).

Today’s code is a pattern. We don’t depend on live internet during labs.
fetch_json_cached (offline-first)

```python
from pathlib import Path
import json
import time

import httpx

def fetch_json_cached(url: str, cache_path: Path, *, ttl_s: int | None = None) -> dict:
    """Offline-first JSON fetch with optional TTL."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Offline-first default: if cache exists, use it (unless TTL says it's too old)
    if cache_path.exists():
        age_s = time.time() - cache_path.stat().st_mtime
        if ttl_s is None or age_s < ttl_s:
            return json.loads(cache_path.read_text(encoding="utf-8"))
    # Otherwise: fetch and overwrite cache
    with httpx.Client(timeout=20.0) as client:
        r = client.get(url)
        r.raise_for_status()
        data = r.json()
    cache_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return data
```

Assume cache_path already exists.
What happens?
1. `ttl_s=None`
2. `ttl_s=3600` and cache age is 10 minutes
3. `ttl_s=3600` and cache age is 3 days

Checkpoint: you can answer all 3 without running code.
1. `ttl_s=None` → read from cache (offline-first default)
2. `ttl_s=3600` and age 10m → read from cache
3. `ttl_s=3600` and age 3d → re-download, then overwrite cache

Question: When is `ttl_s=None` a good default?
Answer: when you want the pipeline to run offline and you refresh cache manually.
In Week 1 you used csv.DictReader (rows = dicts).
In Week 2 we mostly use pandas (tables):
- `pd.read_csv(...)` → DataFrame
- `df.head()` → preview
- `df.shape` → row/col count
- `df.dtypes` → types (important!)

Bridge (manual ↔︎ pandas):
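The bridge is easiest to see on one tiny table, read both ways (the sample data is made up):

```python
import csv
import io

import pandas as pd

raw = "user_id,amount\n0007,12.5\n0042,3.0\n"

# Week 1 style: csv.DictReader yields one dict per row; every value is a str
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # {'user_id': '0007', 'amount': '12.5'}

# Week 2 style: pandas builds a table and guesses types unless told otherwise
df = pd.read_csv(io.StringIO(raw), dtype={"user_id": "string"})
print(df.shape)               # (2, 2)
print(df["user_id"].iloc[0])  # 0007 — leading zero kept thanks to dtype
```

Without the `dtype` hint, pandas would parse `user_id` as `int64` and `0007` would become `7`.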
Tip
pandas is powerful, but it can guess types wrong. That’s why we add guardrails.
CSV is common, but it is fragile:
- separators differ (`;` vs `,`)
- decimal marks differ (`1,23` vs `1.23`)
- missing-value markers differ (`NA`, `null`, empty)

read_csv guardrails (high ROI options)

When reading CSV, consider:
- `dtype=` (especially IDs)
- `na_values=` (custom missing markers)
- `encoding=`
- `sep=`
- `decimal=`

read_csv (6 minutes)

Scenario:
- the separator is `;`
- amounts use a decimal comma (`12,5`)
- IDs like `0007` must keep leading zeros

Write the `pd.read_csv(...)` call.
Checkpoint: your call includes sep, decimal, and dtype for IDs.
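One call matching the scenario above, shown on an inline sample (the column names `order_id`, `user_id`, `amount` are assumptions; in your repo the first argument is a Path into `data/raw/`):

```python
import io

import pandas as pd

raw = "order_id;user_id;amount\nA1;0007;12,5\nA2;0042;NA\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                  # semicolon-separated file
    decimal=",",              # "12,5" parses as 12.5
    dtype={"order_id": "string", "user_id": "string"},  # keep leading zeros
    na_values=["", "NA", "null"],
)
print(df["user_id"].tolist())  # ['0007', '0042']
print(df["amount"].tolist())   # [12.5, nan]
```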
pd.read_csv (example)

(guardrails used: `dtype`, `sep`, `decimal`, `na_values`)

When you return: open your editor and be ready to write io.py and transforms.py.
pandas I/O + schema basics
By the end of this session, you can:
- centralize reads/writes in `io.py`
- write an `enforce_schema(df)` transform

When you load any dataset, do this first:
- `df.head()` → sanity check values
- `df.shape` → row/col count
- `df.dtypes` → confirm types (especially IDs)

This takes 10 seconds and prevents 2 hours of confusion.
If pandas guesses wrong, you might lose information without an error.
Classic example:
- `00123` becomes `123` (leading zeros lost forever)

Operational rules:
IDs are strings unless you truly compute on them
use pandas nullable dtypes when missing values exist:
`"string"`, `"Int64"`, `"Float64"`, `"boolean"`

Risk
- `user_id` becomes `int64`
- `0007` becomes `7`

You have orders.csv with columns:
order_id, user_id, amount, quantity, created_at, statusWrite:
- a `dtype={...}` mapping for IDs
- the dtype to use for `quantity` when missing values exist

Checkpoint: your mapping keeps leading zeros.
Rule: if quantity is “integer but can be missing”, use "Int64" after parsing.
Question: If quantity is an integer but has missing values, what dtype should you use?
Answer: "Int64" (nullable integer), not plain int64.
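A quick look at the difference on toy values:

```python
import pandas as pd

# Plain int64 can't represent missing values, so pandas falls back to float64:
s = pd.to_numeric(pd.Series(["3", None]), errors="coerce")
print(s.dtype)     # float64

# Nullable Int64 keeps integer semantics and shows the gap as <NA>:
q = s.astype("Int64")
print(q.dtype)     # Int64
print(q.tolist())  # [3, <NA>]
```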
Parquet is a columnar file format that preserves column dtypes on round-trip (unlike CSV) and is compressed and fast to load.
Note
Parquet requires an engine. We use pyarrow this week.
io.py

If every notebook reads data differently:
Centralized I/O makes team work consistent.
In bootcamp_data/io.py:
- `read_orders_csv(path) -> DataFrame`
- `read_users_csv(path) -> DataFrame`
- `write_parquet(df, path) -> None`
- `read_parquet(path) -> DataFrame`

io.py pattern

```python
from pathlib import Path

import pandas as pd

NA = ["", "NA", "N/A", "null", "None"]

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
    )

def write_parquet(df: pd.DataFrame, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)
```

In your read_orders_csv(...), add one extra guardrail:
Choose one:
- `sep=";"`
- `encoding="utf-8"`
- `decimal=","`

Checkpoint: you can explain why your chosen guardrail matters.
Even with dtype=..., you often need:
- numeric coercion (`amount`, `quantity`)
- text normalization (`status` casing)

We enforce types after loading.
enforce_schema(df) (minimal)

```python
import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )
```

Complete this safely:
Checkpoint: invalid values become missing (not crashes).
Question: What does errors="coerce" do?
Answer: invalid values become missing (NaN) instead of raising an exception.
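Seen on toy values:

```python
import pandas as pd

# errors="coerce" turns unparseable values into NaN instead of raising ValueError
s = pd.to_numeric(pd.Series(["19.99", "oops", ""]), errors="coerce")
print(s.tolist())  # [19.99, nan, nan]
```

With the default `errors="raise"`, the same call would fail on `"oops"`.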
- centralized reads/writes in `io.py`
- `enforce_schema` as a small, testable correctness step

Day 2 we’ll turn assumptions into checks, like:
- value ranges (e.g. `amount >= 0`)

Why it matters: catching bad data early prevents join disasters and wasted debugging.
When you return: we will scaffold the repo and write our first processed Parquet file.
In data pipelines, you want evidence during every run:
Minimal pattern:
Tip
Start with logging. Use notebooks for exploration, not for pipelines.
Build: scaffold repo + typed I/O + first processed output
By the end, you should have:
- `config.py`, `io.py`, `transforms.py` inside `bootcamp_data/`
- raw data in `data/raw/` (toy dataset is fine for in-class)
- `data/raw/_source_meta.json` documenting where your project dataset comes from
- a working `python -m scripts.run_day1_load`
- a processed output (e.g. `data/processed/orders.parquet`)

Warning
Do not ask GenAI to write your solution code. Ask it to explain concepts or errors.
- the folder layout (`data/`, `bootcamp_data/`, `scripts/`, `reports/`)
- a `README.md`
- `__init__.py` files (so folders can be imported as packages)

Checkpoint: `git status` shows your new files.
macOS/Linux
Windows PowerShell
uv environment + install deps (15 minutes)

- create the environment with `uv`
- install `pandas`, `pyarrow`, `httpx`
- optional: `datasets` (HuggingFace), `kaggle` (Kaggle CLI)
- freeze into `requirements.txt`

Checkpoint: `python -c "import pandas; import pyarrow; import httpx"` runs with no error.
Most Day 1 issues come from:
Quick checks:
- `which python` (macOS/Linux)
- `Get-Command python` (Windows PowerShell)

uv venv + deps

macOS/Linux
Windows PowerShell
Warning
If pyarrow install fails, ask the instructor. Fallback: you can write CSV today, but Parquet is strongly preferred for Week 2.
Track A (in-class, recommended): create the tiny toy dataset below:
- `data/raw/orders.csv`
- `data/raw/users.csv`

This lets you validate your pipeline quickly.
Track B (weekly project): download a real dataset (Kaggle / HuggingFace / URL) into a snapshot folder like:
data/raw/<your_dataset_name>/

…and record provenance in:

data/raw/_source_meta.json

Checkpoint: you have some raw data to run through today’s loader, and you know where your project dataset will come from.
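If you want toy files that exercise the schema used in class (string IDs with leading zeros, a bad amount, a missing quantity, messy status casing), here is a sketch that writes them — the values are invented, adjust freely:

```python
import csv
from pathlib import Path

raw_dir = Path("data/raw")
raw_dir.mkdir(parents=True, exist_ok=True)

# Deliberately messy rows so enforce_schema has something to clean up
orders = [
    {"order_id": "A1", "user_id": "0007", "amount": "12.5", "quantity": "2",
     "created_at": "2025-01-05", "status": " Paid "},
    {"order_id": "A2", "user_id": "0042", "amount": "oops", "quantity": "",
     "created_at": "2025-01-06", "status": "REFUNDED"},
]
users = [
    {"user_id": "0007", "country": "SA"},
    {"user_id": "0042", "country": "AE"},
]

for name, rows in [("orders.csv", orders), ("users.csv", users)]:
    with (raw_dir / name).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```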
data/raw/orders.csv (toy dataset)

data/raw/users.csv (toy dataset)

data/raw/_source_meta.json (5 minutes)

Create a tiny provenance record for your raw snapshot.
Even if you’re using the toy dataset today, practice the habit — you’ll keep this file for your weekly project.
Example (edit values to match your situation):
Tomorrow we’ll generate additional metadata automatically (row counts, schema summary).
config.py (12 minutes)

Create bootcamp_data/config.py:

- a `Paths` dataclass
- `make_paths(root: Path) -> Paths`

Checkpoint: a tiny snippet prints a valid processed path.
bootcamp_data/config.py

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )
```

io.py (18 minutes)

Create bootcamp_data/io.py with:
- `read_orders_csv(path: Path) -> pd.DataFrame`
- `read_users_csv(path: Path) -> pd.DataFrame`
- `write_parquet(df, path)`
- `read_parquet(path)`

Checkpoint: you can import these functions without errors.
bootcamp_data/io.py

```python
from pathlib import Path

import pandas as pd

NA = ["", "NA", "N/A", "null", "None"]

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

def read_users_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

def write_parquet(df: pd.DataFrame, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)

def read_parquet(path: Path) -> pd.DataFrame:
    return pd.read_parquet(path)
```

transforms.py with enforce_schema (15 minutes)

Create bootcamp_data/transforms.py:
- `enforce_schema(df) -> df`
- parse `amount` and `quantity` using `pd.to_numeric(..., errors="coerce")`
- normalize `status` to lowercase

Checkpoint: enforce_schema returns Float64 / Int64 dtypes and cleaned status.
bootcamp_data/transforms.py (minimal)

```python
import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )
```

Create scripts/run_day1_load.py:
- compute `ROOT` (repo root)
- build paths with `make_paths(ROOT)`
- read raw CSVs, apply `enforce_schema` to orders
- write Parquet outputs to `data/processed/`

Checkpoint: `python -m scripts.run_day1_load` creates `data/processed/orders.parquet`.
In scripts/run_day1_load.py:
This works even if you run commands from different folders.
scripts/run_day1_load.py

```python
from pathlib import Path
import logging

from bootcamp_data.config import make_paths
from bootcamp_data.io import read_orders_csv, read_users_csv, write_parquet
from bootcamp_data.transforms import enforce_schema

log = logging.getLogger(__name__)

def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
    ROOT = Path(__file__).resolve().parents[1]
    p = make_paths(ROOT)
    orders = enforce_schema(read_orders_csv(p.raw / "orders.csv"))
    users = read_users_csv(p.raw / "users.csv")
    log.info("Loaded rows: orders=%s users=%s", len(orders), len(users))
    log.info("Orders dtypes:\n%s", orders.dtypes)
    write_parquet(orders, p.processed / "orders.parquet")
    write_parquet(users, p.processed / "users.parquet")
    log.info("Wrote processed files to: %s", p.processed)

if __name__ == "__main__":
    main()
```

Tip
Run it as a module from the repo root:
python -m scripts.run_day1_load

Run the loader and verify:
- row counts look right
- dtypes match the schema (string IDs, Float64/Int64 numerics)
- invalid `amount` values became missing

Checkpoint: you can load Parquet back into pandas and see expected dtypes.
macOS/Linux
Windows PowerShell
- check `git status`
- commit with a message like "w2d1: scaffold + typed io + first processed parquet"
- push to GitHub

Checkpoint: you can see your commit online.
If you already have a remote, skip the git remote add step.
When stuck:
- print `ROOT`, paths, row counts, `df.dtypes`

Common Day 1 fixes:
- run `python -m ...` from the repo root
- activate the right `uv` environment
- confirm `pyarrow` is installed

If you finish early:
- write `scripts/download_data.py` for your project dataset (Kaggle / HF / URL)
- write `data/processed/_run_meta.json` with row counts + output paths
- add a `README.md` section: “How to run Day 1”
- add a sanity check: `assert str(orders["user_id"].dtype) == "string"`

In 1–2 sentences:
Why do we keep IDs as strings and prefer Parquet for processed outputs?
Due: before Day 2 starts
Today’s work becomes the foundation of your Week 2 data project.
Pick your Week 2 project dataset (Kaggle / HuggingFace / URL).
Create / update data/raw/_source_meta.json with:
- `source`
- `dataset_id_or_url`
- `downloaded_at_utc`
- `raw_snapshot_folder`
- `files`

Make the download reproducible:
- `scripts/download_data.py` (support at least one of: Kaggle, HF, URL)

Ensure `python -m scripts.run_day1_load` works from a fresh terminal.
Confirm you can create at least one processed Parquet file in data/processed/.
Push at least one commit to GitHub.
Deliverable: GitHub repo link + evidence of:
- `data/raw/_source_meta.json` (contents)
- at least one file in `data/processed/`

Tip
Commit early. Commit often. Future you will thank you.