Data Work (ETL + EDA)

AI Professionals Bootcamp | Week 2

2025-12-21

Bootcamp calendar

  • Week 1: Python & Tooling
  • Week 2: Data Work (ETL + EDA)
  • Week 3: Machine Learning
  • Week 4: Deep Learning & Computer Vision
  • Week 5: LLM-based NLP
  • Week 6: Building AI Apps
  • Week 7: Agentic AI & Practical MLOps
  • Week 8: Capstone Sprint + Job Readiness

Week 2 focus

This week is Data Work (ETL + EDA).

  • Internet is not reliable → we work offline-first
  • We ship a reproducible pipeline (raw → processed → notebook)
  • GitHub is daily (small commits, clear messages)

Week 1 gave you: CLI + packages + uv + Git. Week 2 adds: pandas + schema discipline + offline-first ETL.

Day 1: Foundations for an Offline‑First Data Workflow

Goal: build a clean repo scaffold and produce your first processed Parquet output (typed, reproducible).

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Offline-first mindset + project layout
  • Asr Prayer (20m)
  • Session 2 (60m): Data sources + caching patterns
  • Maghrib Prayer (20m)
  • Session 3 (60m): pandas I/O + schema basics (IDs, missing values, Parquet)
  • Isha Prayer (20m)
  • Hands-on (120m): Scaffold repo + load raw → processed Parquet

Learning Objectives

By the end of today, you can:

  • explain raw vs cache vs processed (and why we separate them)
  • scaffold a repo with a standard data project layout
  • run project code the right way (avoid ModuleNotFoundError)
  • load CSV in pandas with explicit dtypes (IDs as strings)
  • write Parquet outputs to data/processed/
  • implement a minimal schema enforcement step (enforce_schema)
  • push a working Day 1 baseline to GitHub

Warm-up (5 minutes)

Sanity-check your toolchain (you will use these every day).

uv --version
git --version
python -V

Checkpoint: you can run all three commands with no errors.

Quick review: uv daily workflow

Three commands you will use all week:

  1. uv venv → create .venv/
  2. uv pip install ... → install dependencies into the env
  3. uv run <command> → run using the env (no “wrong python” accidents)

Example:

uv run python -c "import pandas as pd; print(pd.__version__)"

You learned uv in Week 1. Today we’ll use it as our default toolchain.

This week’s project (high-level)

Project: Offline‑First ETL + EDA Mini Analytics Pipeline

You will ship:

  • ETL code that runs end‑to‑end (load → verify → clean → transform → write)
  • data/processed/*.parquet outputs that are safe to overwrite
  • an EDA notebook that reads only processed data
  • a short reports/summary.md (findings + caveats)

Choose your Week 2 dataset (today)

For the rest of Week 2, you’ll build a full ETL + EDA project on one real dataset.

You may use:

  • Kaggle (download + unzip a snapshot)
  • HuggingFace Datasets (load_dataset(...) → snapshot to disk)
  • Direct URL / API (download with httpx + cache)
  • A teammate-provided file (CSV/JSON/Parquet)

Dataset rubric (so exercises work smoothly):

  • 1 categorical column (e.g., country, status, category)
  • 1 numeric column (e.g., amount, quantity, rating)
  • 1 datetime/timestamp column (e.g., created_at)
  • ≥ 1,000 rows recommended (distributions + outliers are visible)
  • Bonus: a join key (e.g., user_id, product_id, store_id)

Note

In class we’ll use a tiny toy dataset (orders.csv + users.csv) to validate patterns quickly.

For your weekly project, pick a real dataset and repeat the same workflow.

Warning

Don’t commit raw data or API tokens to GitHub. Commit scripts + metadata so others can reproduce the download.

End-state (by end of today)

By the end of Day 1, your repo should contain:

  • a standard folder layout (data/, bootcamp_data/, scripts/, reports/)

  • bootcamp_data/config.py, bootcamp_data/io.py, bootcamp_data/transforms.py

  • a runnable entrypoint: python -m scripts.run_day1_load

  • at least one processed output in data/processed/ (Parquet preferred)

  • a reproducibility breadcrumb for your project dataset:

    • data/raw/_source_meta.json (where your data came from)
    • optionally: scripts/download_data.py (how to download it again)
  • at least one commit pushed to GitHub

Tool stack this week (minimal + high ROI)

  • pandas — load/clean/join/reshape/EDA (DataFrame)
  • pyarrow + Parquet — fast, typed processed files
  • httpx — extraction (but cached)
  • logging — evidence during runs (row counts, dtypes)
  • Plotly (Day 4) — one plotting library
  • DuckDB (Day 5) — local SQL on files
  • (Optional) datasets / kaggle — download catalog datasets (HuggingFace / Kaggle) and snapshot to data/raw/

Rule: avoid tool sprawl. Keep it small and shippable.

Canonical workflow (repeat every project)

  1. Load (raw/cache)
  2. Verify (schema, dtypes, keys, row counts)
  3. Clean (missingness, duplicates, normalization)
  4. Transform (joins, reshape, features)
  5. Analyze (tables + comparisons)
  6. Visualize (Plotly, export figures)
  7. Conclude (written summary + caveats)

Session 1

Offline-first mindset + project layout (no import pain)

Session 1 objectives

By the end of this session, you can:

  • define offline-first for data work
  • explain the purpose of raw/, cache/, processed/, external/
  • describe “raw immutable” and “processed idempotent”
  • run code using modules (so imports work without hacks)
  • centralize project paths using pathlib.Path

Context: why “offline-first” is worth it

You will re-run your ETL many times:

  • debugging
  • adding a cleaning rule
  • fixing a dtype bug
  • adding a feature column

If your pipeline depends on the internet, you lose time.

Concept: raw vs cache vs processed vs external

Offline-first projects separate data by role:

  • raw: original snapshots (never edited)
  • cache: downloads/API responses (safe to delete)
  • processed: clean, typed outputs (safe to recreate / overwrite)
  • external: reference drops (manual downloads you want to keep)

Example: folder roles (mental model)

data/raw/

  • immutable inputs
  • “source of truth”
  • never edited

data/cache/

  • API responses
  • intermediate downloads
  • safe to delete

data/processed/

  • clean + typed
  • analysis-ready
  • safe to overwrite

data/external/

  • manual reference
  • lookup tables
  • rarely changes

Micro-exercise: classify these files (5 minutes)

Put each file into the correct folder:

  1. orders.csv you received from a teammate
  2. users_api_page_1.json downloaded from an endpoint
  3. orders_clean.parquet generated by your ETL
  4. country_codes.xlsx you manually downloaded as reference

Checkpoint: you can justify your choices in 1 sentence each.

Solution: classification

  1. orders.csv → data/raw/
  2. users_api_page_1.json → data/cache/
  3. orders_clean.parquet → data/processed/
  4. country_codes.xlsx → data/external/

Quick Check

Question: Why should data/raw/ be “immutable”?

Answer: if you edit raw data, you lose the ability to reproduce results and debug changes.

Concept: idempotent processed outputs

Idempotent: re-running the pipeline produces the same processed outputs (given same inputs + config).

Good pattern:

  • overwrite data/processed/orders.parquet every run

Bad pattern:

  • append to data/processed/orders.csv every run

Micro-exercise: spot the idempotency bug (6 minutes)

You run this twice:

df.to_csv("data/processed/orders.csv", mode="a", index=False, header=False)

  1. What goes wrong on run #2?
  2. Rewrite it to be idempotent.
  3. Bonus: write Parquet instead of CSV.

Checkpoint: your fixed code is safe to run 20 times.

Solution: idempotent write (example)

# safest default: overwrite
df.to_parquet("data/processed/orders.parquet", index=False)

Note

If you truly need incremental data, do it intentionally (partitioned folders, dedupe keys). Default for this bootcamp: overwrite processed outputs.
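If you do choose the incremental route, a minimal sketch of the dedupe-then-overwrite idea (the `order_id` key and sample values are illustrative, not from the toy dataset):

```python
import pandas as pd

# hypothetical incremental merge: combine old + new rows, dedupe on the
# join key, then overwrite one processed file -- the end result is still
# idempotent given the same inputs
old = pd.DataFrame({"order_id": ["A1", "A2"], "amount": [10.0, 20.0]})
new = pd.DataFrame({"order_id": ["A2", "A3"], "amount": [20.0, 30.0]})

merged = (
    pd.concat([old, new])
    .drop_duplicates(subset=["order_id"], keep="last")
    .reset_index(drop=True)
)
print(len(merged))  # 3 unique orders
```

The key point: deduplication by key makes re-runs converge to the same output instead of growing it.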

Quick Check

Question: What is the most common symptom of a non‑idempotent pipeline?

Answer: row counts grow every run even when inputs didn’t change.

Context: imports fail when you run code “the wrong way”

A very common Day 1 error:

ModuleNotFoundError: No module named 'bootcamp_data'

This usually happens when you run a file by path (Python changes where it searches).

Concept: our “no pain” run rule

Rule: run entrypoints as modules from the repo root.

  • Do: python -m scripts.run_day1_load
  • Avoid: python scripts/run_day1_load.py (easy to break imports)

Warning

Avoid “fixing” imports by editing runtime paths in your code. If imports fail: change how you run, or fix the package structure.

Micro-exercise: choose the right command (4 minutes)

Your repo looks like this:

week2-data-work/
  bootcamp_data/
  scripts/

You want to run the loader.

Which command is correct?

  1. python scripts/run_day1_load.py
  2. python -m scripts.run_day1_load

Checkpoint: you can explain why one works and the other often fails.

Solution: choose the right command

Correct: option 2, python -m scripts.run_day1_load

Reason: running as a module keeps the repo root on Python’s import search path.

Context: project layout prevents “path chaos”

If everyone uses different paths:

  • notebooks break
  • scripts break
  • teammates can’t run your repo

We fix this with a consistent layout + a central paths config.

Example: repo tree (Day 1 target)

week2-data-work/
  data/{raw,cache,processed,external}/
  notebooks/
  reports/figures/
  scripts/
    __init__.py
    run_day1_load.py
  bootcamp_data/
    __init__.py
    config.py
    io.py
    transforms.py

Quick Check

Question: What folder should notebooks read from?

Answer: data/processed/ (not data/raw/).

Concept: one source of truth for paths

Hardcoding strings like "../data/raw/orders.csv" breaks when:

  • you run from a different folder
  • someone uses Windows paths
  • files move

Use pathlib.Path + a central config.py.

Quick review: @dataclass (why we use it)

A dataclass is a lightweight way to make a “data container” class.

  • auto-generates __init__ for you (less boilerplate)
  • gives a readable repr (nice for debugging)
  • frozen=True makes it immutable (prevents accidental edits)

Today we use it to store project paths in one place.

Example: Paths + make_paths (pattern)

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )

Micro-exercise: implement make_paths (6 minutes)

Create bootcamp_data/config.py with:

  1. a Paths dataclass
  2. a make_paths(root: Path) function
  3. raw/cache/processed/external paths under root/data/

Checkpoint: in a Python REPL, make_paths(Path.cwd()).processed prints a real folder path.

Solution: bootcamp_data/config.py (example)

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )

Quick Check

Question: Why use Path objects instead of plain strings?

Answer: Path handles OS differences and safe joins (root / "data" / "raw").
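A tiny demonstration of the safe-join point (folder names are the ones from this repo layout):

```python
from pathlib import Path

# Path joins work the same on every OS; no manual "/" or "\\" handling
root = Path("week2-data-work")
raw = root / "data" / "raw"
print(raw.as_posix())  # week2-data-work/data/raw
```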

Session 1 recap

  • Offline-first = your project runs without internet
  • Separate data by role: raw / cache / processed / external
  • Processed outputs should be idempotent (safe reruns)
  • Run entrypoints as modules: python -m ...
  • Centralize paths with pathlib.Path in config.py

Asr break

20 minutes

When you return: open your repo and locate data/raw, data/processed, bootcamp_data/, and scripts/.

Session 2

Data sources + caching patterns

Session 2 objectives

By the end of this session, you can:

  • identify common data sources (CSV/JSON/API) and catalog sources (Kaggle, HuggingFace)
  • download a dataset via Kaggle, HuggingFace, or a direct URL
  • write a raw snapshot to data/raw/ and record source metadata
  • implement an offline-first cache read/write pattern (reuse cache when present)
  • explain why CSV needs guardrails (dtype, separators, missing markers)

Data sources you can use (and how we treat them)

You’ll see data in a few common “shapes”:

| Source | Typical format | Where it goes | Offline-first rule |
|---|---|---|---|
| Kaggle dataset | ZIP → CSV | data/cache/ (zip) → data/raw/ (snapshot) | pin a snapshot + record metadata |
| HuggingFace Datasets | Arrow → pandas | data/raw/ (snapshot) | load once → save locally |
| Direct URL download | CSV/JSON/ZIP | data/cache/ → data/raw/ | cache the bytes + record the URL |
| API extraction | JSON pages | data/cache/ | never “depend on live calls” |
| Teammate drop | CSV/Parquet | data/raw/ | treat as immutable input |

We care about repeatability, not “where it came from”. Your repo should run even when Wi‑Fi is bad.

Path A — Kaggle (CLI) snapshot download

Kaggle is great for realistic tabular datasets, but it requires a one‑time API token setup.

One-time setup (per machine):

  1. Create a Kaggle account (if you don’t have one)
  2. In Kaggle → Account → create a new API token → download kaggle.json
  3. Move it to:
     • macOS/Linux: ~/.kaggle/kaggle.json
     • Windows: C:\Users\<you>\.kaggle\kaggle.json
  4. (macOS/Linux) fix permissions:

chmod 600 ~/.kaggle/kaggle.json

Warning

Never commit kaggle.json to your repo. It contains credentials.

Kaggle download → cache → raw (example commands)

Install the CLI (if needed):

uv pip install kaggle

Download a dataset ZIP into data/cache/ and unzip a snapshot into data/raw/:

# from repo root
DATASET_ID="<kaggle-owner>/<kaggle-dataset-slug>"

mkdir -p data/cache/kaggle data/raw/kaggle
kaggle datasets download -d "$DATASET_ID" -p data/cache/kaggle --unzip
# move/copy the extracted files into a snapshot folder you will not edit

Different Kaggle datasets unzip into different filenames. Your job is to decide the snapshot folder name and record it in metadata.

Path B — HuggingFace Datasets (Python) snapshot

HuggingFace datasets load into memory (or stream), then you save a local snapshot.

Install:

uv pip install datasets

Minimal snapshot example (save as Parquet):

from pathlib import Path
from datasets import load_dataset

out_dir = Path("data/raw/hf_my_dataset")
out_dir.mkdir(parents=True, exist_ok=True)

ds = load_dataset("<dataset_name>", split="train")  # change name/split
df = ds.to_pandas()
df.to_parquet(out_dir / "train.parquet", index=False)

Note

Offline-first rule: once you have a local snapshot, your ETL reads from disk, not from HuggingFace.

Path C — Direct URL download (CSV/ZIP) snapshot

This is the most universal option: if you have a URL, you can cache it.

from pathlib import Path
import httpx

def download_bytes(url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with httpx.Client(timeout=60.0, follow_redirects=True) as client:
        r = client.get(url)
        r.raise_for_status()
        out_path.write_bytes(r.content)

download_bytes(
    url="<paste-a-csv-or-zip-url>",
    out_path=Path("data/cache/downloads/source_file.zip"),
)

Then unzip/copy into data/raw/<snapshot_name>/ and treat it as immutable.
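That unzip step can also live in Python, so the whole acquisition is scriptable. A sketch using the standard library (the function name and paths are illustrative):

```python
from pathlib import Path
import zipfile

def extract_snapshot(zip_path: Path, snapshot_dir: Path) -> None:
    """Unzip a cached download into a raw snapshot folder (then treat it as immutable)."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(snapshot_dir)
```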

Raw snapshot metadata (required)

Every raw snapshot should have a tiny metadata record so someone else can reproduce it.

Recommended location (simple): data/raw/_source_meta.json

Example:

{
  "dataset_name": "my_week2_project",
  "source": "kaggle | huggingface | url | teammate",
  "dataset_id_or_url": "…",
  "downloaded_at_utc": "2025-12-21T10:15:00Z",
  "raw_snapshot_folder": "data/raw/my_week2_project/",
  "files": ["train.parquet"],
  "notes": "any caveats (sampling, filters, auth)"
}

We’ll expand metadata later (row counts, schema summary, git commit). Today: just make sure the origin is documented.

Starter: scripts/download_data.py (project dataset)

Keep dataset acquisition repeatable by putting it in one script.

Minimum for Day 1: support one of (URL / HuggingFace / Kaggle).
Stretch: support 2–3 sources in the same file.

Tiny URL downloader (works everywhere):

from __future__ import annotations
from pathlib import Path
import json
from datetime import datetime, timezone
import httpx

def download(url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with httpx.Client(timeout=60.0, follow_redirects=True) as client:
        r = client.get(url)
        r.raise_for_status()
        out_path.write_bytes(r.content)

def write_meta(meta_path: Path, meta: dict) -> None:
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

def main() -> None:
    url = "<paste-url>"
    out_path = Path("data/cache/downloads/source_file.csv")
    download(url, out_path)

    write_meta(
        Path("data/raw/_source_meta.json"),
        {
            "dataset_name": "my_week2_project",
            "source": "url",
            "dataset_id_or_url": url,
            "downloaded_at_utc": datetime.now(timezone.utc).isoformat(),
            "raw_snapshot_folder": "data/raw/my_week2_project/",
            "files": [],
            "notes": "downloaded bytes cached; raw snapshot extracted separately"
        },
    )

if __name__ == "__main__":
    main()

Note

This script is your “repro button” — it doesn’t need to be fancy. It just needs to be honest and rerunnable.

Context: extraction is where “silent drift” starts

Two common failure modes:

  • upstream data changes → your results change
  • extraction fails partially → you analyze incomplete data

Your job: make extraction reproducible.

Concept: minimal extraction checklist

Before you trust extracted data:

  • did you get all pages (pagination)?
  • did you record params/time window?
  • did you store a snapshot (cache)?
  • did you validate row count / file size?

Example: minimal extraction metadata (JSON)

{
  "timestamp_utc": "2025-12-21T09:15:00Z",
  "source": "api",
  "endpoint": "/v1/users",
  "params": {"page": 1, "per_page": 100},
  "status_code": 200
}

Store this next to the cached file (same name + .meta.json).
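One way to keep the naming convention consistent is a tiny helper (the function name is illustrative):

```python
from pathlib import Path

def meta_path_for(cache_path: Path) -> Path:
    """Derive the sidecar metadata path: users_page_1.json -> users_page_1.meta.json."""
    return cache_path.with_suffix(".meta.json")

print(meta_path_for(Path("data/cache/users_page_1.json")).as_posix())
# data/cache/users_page_1.meta.json
```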

Micro-exercise: design the metadata file (5 minutes)

You downloaded: data/cache/users_page_1.json

  1. What should the metadata filename be?
  2. List 4 keys you will store.

Checkpoint: your filename ends with .meta.json and your keys support reproducibility.

Solution: metadata naming + keys

  • filename: data/cache/users_page_1.meta.json

  • keys (example):

    • timestamp_utc
    • endpoint
    • params
    • status_code

Quick Check

Question: If you don’t store params, what breaks later?

Answer: you can’t reproduce which slice of data you extracted (numbers won’t match).

Context: caching keeps you productive

Internet is unreliable.

Caching means:

  • first run: download → save to data/cache/
  • next runs: read from cache (fast, offline)

Concept: offline-first caching policy

Operational rule:

  • if cache exists → use it
  • only re-download when you choose (manual delete or TTL refresh)

This prevents “live calls” from breaking your workflow.

Quick review: httpx in 60 seconds

  • httpx is a Python HTTP client (like requests)
  • you call an endpoint and get JSON back
  • you must cache the response for reproducibility

Today’s code is a pattern. We don’t depend on live internet during labs.

Example: fetch_json_cached (offline-first)

from pathlib import Path
import json
import time
import httpx

def fetch_json_cached(url: str, cache_path: Path, *, ttl_s: int | None = None) -> dict:
    """Offline-first JSON fetch with optional TTL."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)

    # Offline-first default: if cache exists, use it (unless TTL says it's too old)
    if cache_path.exists():
        age_s = time.time() - cache_path.stat().st_mtime
        if ttl_s is None or age_s < ttl_s:
            return json.loads(cache_path.read_text(encoding="utf-8"))

    # Otherwise: fetch and overwrite cache
    with httpx.Client(timeout=20.0) as client:
        r = client.get(url)
        r.raise_for_status()
        data = r.json()

    cache_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return data

Micro-exercise: predict caching behavior (4 minutes)

Assume cache_path already exists.

What happens?

  1. ttl_s=None
  2. ttl_s=3600 and cache age is 10 minutes
  3. ttl_s=3600 and cache age is 3 days

Checkpoint: you can answer all 3 without running code.

Solution: predict caching behavior

  1. ttl_s=None → read from cache (offline-first default)
  2. ttl_s=3600 and age 10m → read from cache
  3. ttl_s=3600 and age 3d → re-download then overwrite cache

Quick Check

Question: When is ttl_s=None a good default?

Answer: when you want the pipeline to run offline and you refresh cache manually.

Quick review: pandas = “tables in Python”

In Week 1 you used csv.DictReader (rows = dicts).

In Week 2 we mostly use pandas (tables):

  • pd.read_csv(...) → DataFrame
  • df.head() → preview
  • df.shape → row/col count
  • df.dtypes → types (important!)

Bridge (manual ↔︎ pandas):

import pandas as pd

# list[dict] → DataFrame
rows = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
df = pd.DataFrame(rows)

# DataFrame → list[dict] (records)
records = df.to_dict(orient="records")

Tip

pandas is powerful, but it can guess types wrong. That’s why we add guardrails.

Context: CSV is a “lowest common denominator”

CSV is common, but it is fragile:

  • encoding surprises
  • separators differ (; vs ,)
  • decimal separators differ (1,23 vs 1.23)
  • missing markers vary (NA, null, empty)

Concept: read_csv guardrails (high ROI options)

When reading CSV, consider:

  • dtype= (especially IDs)
  • na_values= (custom missing markers)
  • encoding=
  • sep=
  • decimal=

Micro-exercise: write a guarded read_csv (6 minutes)

Scenario:

  • separator is ;
  • decimals use comma: 12,50
  • IDs like 0007 must keep leading zeros

Write the pd.read_csv(...) call.

Checkpoint: your call includes sep, decimal, and dtype for IDs.

Solution: guarded pd.read_csv (example)

import pandas as pd

df = pd.read_csv(
    path,
    sep=";",
    decimal=",",
    dtype={"user_id": "string", "order_id": "string"},
    na_values=["", "NA", "null"],
)

Session 2 recap

  • Extraction must be reproducible (cache + metadata)
  • Offline-first caching: reuse cache when present
  • CSV needs guardrails (dtype, sep, decimal, na_values)
  • Next: schema discipline + Parquet outputs

Maghrib break

20 minutes

When you return: open your editor and be ready to write io.py and transforms.py.

Session 3

pandas I/O + schema basics

Session 3 objectives

By the end of this session, you can:

  • explain why pandas dtype inference can be dangerous
  • treat IDs as strings (avoid silent corruption)
  • write and read Parquet with pandas
  • centralize I/O in io.py
  • implement a minimal enforce_schema(df) transform

pandas muscle memory (3 moves)

When you load any dataset, do this first:

  • df.head() → sanity check values
  • df.shape → row/col count
  • df.dtypes → confirm types (especially IDs)

This takes 10 seconds and prevents 2 hours of confusion.
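The three moves in one runnable snippet (toy values, not the class dataset):

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["0007", "0008"], "amount": [12.5, 8.0]})

print(df.head())    # sanity-check a few values
print(df.shape)     # (2, 2) -> rows, columns
print(df.dtypes)    # confirm types: user_id should NOT be numeric
```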

Context: pandas inference can silently corrupt meaning

If pandas guesses wrong, you might lose information without an error.

Classic example:

  • ID 00123 becomes 123 (leading zeros lost forever)

Concept: treat IDs as strings

Operational rules:

  • IDs are strings unless you truly compute on them

  • use pandas nullable dtypes when missing values exist:

    • "string", "Int64", "Float64", "boolean"

Example: the leading-zero bug

import pandas as pd

df = pd.read_csv("orders.csv")  # risky default
print(df.dtypes)

Risk

  • user_id becomes int64
  • 0007 becomes 7
  • joins fail later
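You can reproduce the bug (and the fix) without any file on disk, using an in-memory CSV:

```python
from io import StringIO
import pandas as pd

csv_text = "user_id,amount\n0007,12.5\n"

bad = pd.read_csv(StringIO(csv_text))                                # user_id inferred as int64
good = pd.read_csv(StringIO(csv_text), dtype={"user_id": "string"})  # leading zeros preserved

print(bad["user_id"].iloc[0])   # 7
print(good["user_id"].iloc[0])  # 0007
```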

Micro-exercise: choose dtypes (6 minutes)

You have orders.csv with columns:

  • order_id, user_id, amount, quantity, created_at, status

Write:

  1. a dtype={...} mapping for IDs
  2. a 1-sentence rule for quantity when missing values exist

Checkpoint: your mapping keeps leading zeros.

Solution: dtype mapping + rule

dtype = {
    "order_id": "string",
    "user_id": "string",
}

Rule: if quantity is “integer but can be missing”, use "Int64" after parsing.

Quick Check

Question: If quantity is an integer but has missing values, what dtype should you use?

Answer: "Int64" (nullable integer), not plain int64.

Concept: why Parquet for processed outputs

Parquet is a columnar file format that:

  • preserves dtypes (critical for processed data)
  • is smaller than CSV (compression)
  • loads faster for analytics workflows

Example: write + read Parquet

df.to_parquet("data/processed/orders.parquet", index=False)
df2 = pd.read_parquet("data/processed/orders.parquet")

Note

Parquet requires an engine. We use pyarrow this week.

Context: centralize I/O in io.py

If every notebook reads data differently:

  • missing values differ
  • dtypes differ
  • results differ

Centralized I/O makes team work consistent.

Concept: four core I/O helpers

In bootcamp_data/io.py:

  • read_orders_csv(path) -> DataFrame
  • read_users_csv(path) -> DataFrame
  • write_parquet(df, path) -> None
  • read_parquet(path) -> DataFrame

Example: io.py pattern

from pathlib import Path
import pandas as pd

NA = ["", "NA", "N/A", "null", "None"]

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
    )

def write_parquet(df: pd.DataFrame, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)

Micro-exercise: add one more guardrail (5 minutes)

In your read_orders_csv(...), add one extra guardrail:

Choose one:

  • sep=";"
  • encoding="utf-8"
  • decimal=","

Checkpoint: you can explain why your chosen guardrail matters.

Solution: example guardrail

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

Concept: schema enforcement is a correctness step

Even with dtype=..., you often need:

  • numeric parsing (amount, quantity)
  • normalization (status casing)
  • consistent missing values

We enforce types after loading.

Example: enforce_schema(df) (minimal)

import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )

Micro-exercise: make parsing “non-crashy” (5 minutes)

Complete this safely:

amount = pd.to_numeric(df["amount"], errors=____).astype("Float64")
quantity = pd.to_numeric(df["quantity"], errors=____).astype("Int64")

Checkpoint: invalid values become missing (not crashes).

Solution: parsing with coercion

amount = pd.to_numeric(df["amount"], errors="coerce").astype("Float64")
quantity = pd.to_numeric(df["quantity"], errors="coerce").astype("Int64")

Quick Check

Question: What does errors="coerce" do?

Answer: invalid values become missing (NaN) instead of raising an exception.
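You can see the coercion in one line (toy values):

```python
import pandas as pd

# "not_a_number" cannot be parsed, so coercion turns it into NaN
s = pd.to_numeric(pd.Series(["12.50", "not_a_number"]), errors="coerce")
print(int(s.isna().sum()))  # 1 -> the invalid value became missing, no exception raised
```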

Session 3 recap

  • pandas inference can silently break IDs → keep IDs as strings
  • Parquet preserves dtypes and loads fast → use it for processed data
  • centralize loading/writing in io.py
  • use enforce_schema as a small, testable correctness step

Tomorrow: “Verify” becomes code (fail fast)

Day 2 we’ll turn assumptions into checks, like:

  • required columns
  • non-empty datasets
  • unique keys (before joins)
  • missingness report (per column)
  • simple range checks (e.g., amount >= 0)

Why it matters: catching bad data early prevents join disasters and wasted debugging.
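As a preview, those checks can be sketched as one fail-fast function (names and the exact checks are illustrative; we build the real version on Day 2):

```python
import pandas as pd

def verify_orders(df: pd.DataFrame) -> None:
    """Fail fast on bad data: required columns, non-empty, non-negative amounts."""
    required = {"order_id", "user_id", "amount"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if df.empty:
        raise ValueError("dataset is empty")
    if (pd.to_numeric(df["amount"], errors="coerce").dropna() < 0).any():
        raise ValueError("negative amounts found")
```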

Isha break

20 minutes

When you return: we will scaffold the repo and write our first processed Parquet file.

Quick review: logging beats “mystery debugging”

In data pipelines, you want evidence during every run:

  • row counts
  • dtypes
  • output paths

Minimal pattern:

import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger(__name__)
log.info("rows=%s", len(df))

Tip

Start with logging. Use notebooks for exploration, not for pipelines.

Hands-on

Build: scaffold repo + typed I/O + first processed output

Hands-on success criteria (today)

By the end, you should have:

  • a repo with the standard folder layout
  • config.py, io.py, transforms.py inside bootcamp_data/
  • raw inputs in data/raw/ (toy dataset is fine for in-class)
  • data/raw/_source_meta.json documenting where your project dataset comes from
  • a runnable module: python -m scripts.run_day1_load
  • at least one processed Parquet created by your code (e.g., data/processed/orders.parquet)
  • at least one commit pushed to GitHub

Project layout (target)

week2-data-work/
  data/
    raw/            # immutable inputs
    cache/          # API responses (optional today)
    processed/      # your Parquet outputs
    external/       # reference drops (optional)
  reports/figures/
  scripts/
    __init__.py
    run_day1_load.py
  bootcamp_data/
    __init__.py
    config.py
    io.py
    transforms.py
  README.md
  requirements.txt

Vibe coding (safe version)

  1. Write the plan in 5 bullets (no code yet)
  2. Implement the smallest piece
  3. Run → break → read error → fix
  4. Commit
  5. Repeat

Warning

Do not ask GenAI to write your solution code. Ask it to explain concepts or errors.

Task 1 — Create folders + initialize git (10 minutes)

  • Create the repo folders (data/, bootcamp_data/, scripts/, reports/)
  • Initialize git
  • Create an empty README.md
  • Add __init__.py files (so folders can be imported as packages)

Checkpoint: git status shows your new files.

Solution — folders + git

macOS/Linux

mkdir -p data/{raw,cache,processed,external}
mkdir -p reports/figures scripts bootcamp_data notebooks
touch README.md bootcamp_data/__init__.py scripts/__init__.py
git init

Windows PowerShell

mkdir data, reports, scripts, bootcamp_data, notebooks
mkdir data\raw, data\cache, data\processed, data\external
mkdir reports\figures
ni README.md -ItemType File
ni bootcamp_data\__init__.py -ItemType File
ni scripts\__init__.py -ItemType File
git init

Task 2 — Create a uv environment + install deps (15 minutes)

  • Create a virtual environment with uv
  • Activate it
  • Install core deps: pandas, pyarrow, httpx
  • Optional (for dataset downloading today): datasets (HuggingFace), kaggle (Kaggle CLI)
  • Freeze requirements.txt

Checkpoint: python -c "import pandas; import pyarrow; import httpx" runs with no error.

Hint — common environment mistakes

Most Day 1 issues come from:

  • the wrong Python interpreter
  • forgetting to activate the environment
  • installing packages globally by accident

Quick checks:

  • which python (macOS/Linux)
  • Get-Command python (Windows PowerShell)

Solution — uv venv + deps

macOS/Linux

uv venv
source .venv/bin/activate

# core (required)
uv pip install pandas pyarrow httpx

# optional (only if you plan to use these sources today)
# uv pip install datasets kaggle

uv pip freeze > requirements.txt
python -c "import pandas; import pyarrow; import httpx; print('ok')"

Windows PowerShell

uv venv
.\.venv\Scripts\Activate.ps1

# core (required)
uv pip install pandas pyarrow httpx

# optional (only if you plan to use these sources today)
# uv pip install datasets kaggle

uv pip freeze > requirements.txt
python -c "import pandas; import pyarrow; import httpx; print('ok')"

Warning

If pyarrow install fails, ask the instructor. Fallback: you can write CSV today, but Parquet is strongly preferred for Week 2.
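A quick preflight you can run before class to find out early whether Parquet output will work. This is a sketch; it only checks that pyarrow imports:

```python
# Preflight: is Parquet support available in this environment?
try:
    import pyarrow
    has_parquet = True
    print("pyarrow", pyarrow.__version__, "- Parquet output OK")
except ImportError:
    has_parquet = False
    print("pyarrow missing - write CSV today, install pyarrow later")
```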

Task 3 — Acquire raw data (two tracks) (10–25 minutes)

Track A (in-class, recommended): create the tiny toy dataset below:

  • data/raw/orders.csv
  • data/raw/users.csv

This lets you validate your pipeline quickly.

Track B (weekly project): download a real dataset (Kaggle / HuggingFace / URL) into a snapshot folder like:

  • data/raw/<your_dataset_name>/

…and record provenance in:

  • data/raw/_source_meta.json

Checkpoint: you have some raw data to run through today’s loader, and you know where your project dataset will come from.

Track A solution — data/raw/orders.csv (toy dataset)

order_id,user_id,amount,quantity,created_at,status
A0001,0001,12.50,1,2025-12-01T10:05:00Z,Paid
A0002,0002,8.00,2,2025-12-01T11:10:00Z,paid
A0003,0003,not_a_number,1,2025-12-02T09:00:00Z,Refund
A0004,0001,25.00,,2025-12-03T14:30:00Z,PAID
A0005,0004,100.00,1,not_a_date,paid

Track A solution — data/raw/users.csv (toy dataset)

user_id,country,signup_date
0001,SA,2025-11-15
0002,SA,2025-11-20
0003,AE,2025-11-22
0004,SA,2025-11-25

Task 3.5 — Add data/raw/_source_meta.json (5 minutes)

Create a tiny provenance record for your raw snapshot.

Even if you’re using the toy dataset today, practice the habit — you’ll keep this file for your weekly project.

Example (edit values to match your situation):

{
  "dataset_name": "toy_orders_users",
  "source": "in_class_toy",
  "dataset_id_or_url": "n/a",
  "downloaded_at_utc": "2025-12-21T00:00:00Z",
  "raw_snapshot_folder": "data/raw/",
  "files": ["orders.csv", "users.csv"],
  "notes": "Toy dataset used for Week 2 Day 1 scaffolding."
}

Tomorrow we’ll generate additional metadata automatically (row counts, schema summary).
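If you prefer to create the record from code, here is a minimal sketch. The helper name `write_source_meta` is hypothetical (not part of the bootcamp modules); the demo writes into a temporary folder so it runs anywhere:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def write_source_meta(raw_dir: Path, meta: dict) -> Path:
    # Hypothetical helper: stamp the record with a UTC timestamp and
    # write it next to the raw snapshot.
    record = {**meta, "downloaded_at_utc": datetime.now(timezone.utc).isoformat()}
    out = raw_dir / "_source_meta.json"
    out.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out

# Demo against a temporary folder; in the repo you would pass data/raw/.
with tempfile.TemporaryDirectory() as tmp:
    path = write_source_meta(
        Path(tmp),
        {"dataset_name": "toy_orders_users", "source": "in_class_toy"},
    )
    loaded = json.loads(path.read_text(encoding="utf-8"))
    print(loaded["dataset_name"], loaded["downloaded_at_utc"])
```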

Task 4 — Implement config.py (12 minutes)

Create bootcamp_data/config.py:

  • Paths dataclass
  • make_paths(root: Path) -> Paths

Checkpoint: a tiny snippet prints a valid processed path.

Solution — bootcamp_data/config.py

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )
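For the checkpoint ("a tiny snippet prints a valid processed path"), here is a standalone sketch. It inlines the same Paths/make_paths definitions so it runs anywhere; in the repo you would import them from bootcamp_data.config instead, and the root "/tmp/my_project" is just a placeholder:

```python
from dataclasses import dataclass
from pathlib import Path

# Inline copies of the repo's Paths/make_paths so this sketch is standalone.
@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(root=root, raw=data / "raw", cache=data / "cache",
                 processed=data / "processed", external=data / "external")

p = make_paths(Path("/tmp/my_project"))  # placeholder root for the demo
print(p.processed / "orders.parquet")
```

Because the dataclass is frozen, nothing can silently reassign a path later in the pipeline.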

Task 5 — Implement io.py (18 minutes)

Create bootcamp_data/io.py with:

  • read_orders_csv(path: Path) -> pd.DataFrame
  • read_users_csv(path: Path) -> pd.DataFrame
  • write_parquet(df, path)
  • read_parquet(path)

Checkpoint: you can import these functions without errors.

Solution — bootcamp_data/io.py

from pathlib import Path
import pandas as pd

NA = ["", "NA", "N/A", "null", "None"]

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

def read_users_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

def write_parquet(df: pd.DataFrame, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)

def read_parquet(path: Path) -> pd.DataFrame:
    return pd.read_parquet(path)
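To see why the explicit dtype map matters, a small standalone round trip (same read_csv arguments as the reader above, applied to an inline CSV in a temp folder): without dtype="string", pandas would parse 0001 as the integer 1 and the leading zeros are gone.

```python
import tempfile
from pathlib import Path
import pandas as pd

csv = "order_id,user_id,amount\nA0001,0001,12.50\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "orders.csv"
    path.write_text(csv, encoding="utf-8")
    # Explicit dtypes keep IDs as strings (leading zeros survive).
    df = pd.read_csv(path, dtype={"order_id": "string", "user_id": "string"})

print(df["user_id"].iloc[0])  # "0001", not 1
```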

Task 6 — Add transforms.py with enforce_schema (15 minutes)

Create bootcamp_data/transforms.py:

  • implement enforce_schema(df) -> df
  • convert amount and quantity using pd.to_numeric(..., errors="coerce")
  • normalize status to lowercase

Checkpoint: enforce_schema returns Float64 / Int64 dtypes and cleaned status.

Solution — bootcamp_data/transforms.py (minimal)

import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )
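You can see the coercion behaviour on a single toy row (the A0003 order, where amount is "not_a_number"). This sketch inlines the same enforce_schema logic so it runs standalone; in the repo you would import it from bootcamp_data.transforms:

```python
import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as bootcamp_data/transforms.py, inlined for the demo.
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )

raw = pd.DataFrame({
    "order_id": ["A0003"], "user_id": ["0003"],
    "amount": ["not_a_number"], "quantity": ["1"], "status": ["Refund"],
})
clean = enforce_schema(raw)
print(clean.dtypes)
```

errors="coerce" turns the unparseable amount into a missing value instead of raising, and the nullable Float64/Int64 dtypes can hold that missing value without silently converting the column to object.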

Task 7 — Write the Day 1 loader module (20 minutes)

Create scripts/run_day1_load.py:

  • compute ROOT (repo root)
  • build paths with make_paths(ROOT)
  • read raw CSVs
  • apply enforce_schema to orders
  • write Parquet outputs to data/processed/
  • print or log evidence (row counts + dtypes)

Checkpoint: python -m scripts.run_day1_load creates data/processed/orders.parquet.

Hint — compute the project root robustly

In scripts/run_day1_load.py:

from pathlib import Path
ROOT = Path(__file__).resolve().parents[1]

This works even if you run commands from different folders.

Solution — scripts/run_day1_load.py

from pathlib import Path
import logging

from bootcamp_data.config import make_paths
from bootcamp_data.io import read_orders_csv, read_users_csv, write_parquet
from bootcamp_data.transforms import enforce_schema

log = logging.getLogger(__name__)

def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

    ROOT = Path(__file__).resolve().parents[1]
    p = make_paths(ROOT)

    orders = enforce_schema(read_orders_csv(p.raw / "orders.csv"))
    users = read_users_csv(p.raw / "users.csv")

    log.info("Loaded rows: orders=%s users=%s", len(orders), len(users))
    log.info("Orders dtypes:\n%s", orders.dtypes)

    write_parquet(orders, p.processed / "orders.parquet")
    write_parquet(users, p.processed / "users.parquet")

    log.info("Wrote processed files to: %s", p.processed)

if __name__ == "__main__":
    main()

Tip

Run it as a module from the repo root:

  • python -m scripts.run_day1_load

Task 8 — Run + verify outputs (10 minutes)

Run the loader and verify:

  • processed files exist
  • IDs stayed as strings
  • invalid amount became missing

Checkpoint: you can load Parquet back into pandas and see expected dtypes.

Solution — run + verify

macOS/Linux

uv run python -m scripts.run_day1_load
uv run python -c "import pandas as pd; df=pd.read_parquet('data/processed/orders.parquet'); print(df.dtypes); print(df.head())"

Windows PowerShell

uv run python -m scripts.run_day1_load
uv run python -c "import pandas as pd; df=pd.read_parquet('data/processed/orders.parquet'); print(df.dtypes); print(df.head())"

Git checkpoint (5 minutes)

  • git status
  • commit with message: "w2d1: scaffold + typed io + first processed parquet"
  • push to GitHub

Checkpoint: you can see your commit online.

Solution — git commands

git add -A
git commit -m "w2d1: scaffold + typed io + first processed parquet"
git branch -M main
git remote add origin <YOUR_REPO_URL>
git push -u origin main

If you already have a remote, skip the git remote add step.

Debug playbook

When stuck:

  1. Read the full error (don’t guess)
  2. Identify: file + line number
  3. Print/log: ROOT, paths, row counts, df.dtypes
  4. Fix the smallest thing
  5. Re-run

Common Day 1 fixes:

  • If imports fail: run with python -m ... from repo root
  • If modules are missing: install deps in the activated uv environment
  • If Parquet write fails: confirm pyarrow installed

Stretch goals (optional)

If you finish early:

  • create scripts/download_data.py for your project dataset (Kaggle / HF / URL)
  • write data/processed/_run_meta.json with row counts + output paths
  • add a README.md section: “How to run Day 1”
  • add one assertion: assert str(orders["user_id"].dtype) == "string"
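For the _run_meta.json stretch goal, a minimal sketch of what such a record could look like. The helper name `write_run_meta` is hypothetical; the demo targets a temporary folder, so the outputs list is empty here, while in the repo it would list your Parquet files under data/processed/:

```python
import json
import tempfile
from pathlib import Path

def write_run_meta(processed_dir: Path, row_counts: dict) -> Path:
    # Hypothetical helper: record row counts and output paths per run.
    meta = {
        "row_counts": row_counts,
        "outputs": sorted(str(p) for p in processed_dir.glob("*.parquet")),
    }
    out = processed_dir / "_run_meta.json"
    out.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return out

with tempfile.TemporaryDirectory() as tmp:
    meta_path = write_run_meta(Path(tmp), {"orders": 5, "users": 4})
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    print(meta)
```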

Exit Ticket

In 1–2 sentences:

Why do we keep IDs as strings and prefer Parquet for processed outputs?

What to do after class (Day 1 assignment)

Due: before Day 2 starts

Today’s work becomes the foundation of your Week 2 data project.

  1. Pick your Week 2 project dataset (Kaggle / HuggingFace / URL).

  2. Create / update data/raw/_source_meta.json with:

    • source
    • dataset_id_or_url
    • downloaded_at_utc
    • raw_snapshot_folder
    • files
  3. Make the download reproducible:

    • Preferred: add scripts/download_data.py (support at least one of: Kaggle, HF, URL)
    • Minimum: add a README section with exact commands you used
  4. Ensure python -m scripts.run_day1_load works from a fresh terminal.

  5. Confirm you can create at least one processed Parquet file in data/processed/.

  6. Push at least one commit to GitHub.

Deliverable: GitHub repo link + evidence of:

  • data/raw/_source_meta.json (contents)
  • a processed Parquet file in data/processed/
  • the commands/script used to acquire the project dataset

Tip

Commit early. Commit often. Future you will thank you.

Thank You!