Data Work (ETL + EDA)

AI Professionals Bootcamp | Week 2

2025-12-21

Bootcamp calendar

  • Week 1: Python & Tooling
  • Week 2: Data Work (ETL + EDA)
  • Week 3: Machine Learning
  • Week 4: Deep Learning & Computer Vision
  • Week 5: LLM-based NLP
  • Week 6: Building AI Apps
  • Week 7: Agentic AI & Practical MLOps
  • Week 8: Capstone Sprint + Job Readiness

Week 2 focus

This week is Data Work (ETL + EDA).

  • Internet is not reliable → we work offline-first
  • We ship a reproducible pipeline (raw → processed → notebook)
  • GitHub is daily (small commits, clear messages)

Week 1 gave you: CLI + packages + uv + Git. Week 2 adds: pandas + schema discipline + offline-first ETL.

Day 1: Foundations for an Offline‑First Data Workflow

Goal: build a clean repo scaffold and produce your first processed Parquet output (typed, reproducible).

Bootcamp • SDAIA Academy

Today’s Flow

  • Session 1 (60m): Offline-first mindset + project layout
  • Asr Prayer (20m)
  • Session 2 (60m): Data sources + caching patterns
  • Maghrib Prayer (20m)
  • Session 3 (60m): pandas I/O + schema basics (IDs, missing values, Parquet)
  • Isha Prayer (20m)
  • Hands-on (120m): Scaffold repo + load raw → processed Parquet

Learning Objectives

By the end of today, you can:

  • explain raw vs cache vs processed (and why we separate them)
  • scaffold a repo with a standard data project layout
  • run project code the right way (avoid ModuleNotFoundError)
  • load CSV in pandas with explicit dtypes (IDs as strings)
  • write Parquet outputs to data/processed/
  • implement a minimal schema enforcement step (enforce_schema)
  • push a working Day 1 baseline to GitHub

Warm-up (5 minutes)

Sanity-check your toolchain (you will use these every day).

uv --version
git --version
python -V

Checkpoint: you can run all three commands with no errors.

Quick review: uv daily workflow

Three commands you will use all week:

  1. uv venv → create .venv/
  2. uv pip install ... → install dependencies into the env
  3. uv run <command> → run using the env (no “wrong python” accidents)

Example:

uv run python -c "import pandas as pd; print(pd.__version__)"

You learned uv in Week 1. Today we’ll use it as our default toolchain.

This week’s project (high-level)

Project: Offline‑First ETL + EDA Mini Analytics Pipeline

You will ship:

  • ETL code that runs end‑to‑end (load → verify → clean → transform → write)
  • data/processed/*.parquet outputs that are safe to overwrite
  • an EDA notebook that reads only processed data
  • a short reports/summary.md (findings + caveats)

Choose your Week 2 dataset (today)

For the rest of Week 2, you’ll build a full ETL + EDA project on one real dataset.

You may use:

  • Kaggle (download + unzip a snapshot)
  • HuggingFace Datasets (load_dataset(...) → snapshot to disk)
  • Direct URL / API (download with httpx + cache)
  • A teammate-provided file (CSV/JSON/Parquet)

Dataset rubric (so exercises work smoothly):

  • 1 categorical column (e.g., country, status, category)
  • 1 numeric column (e.g., amount, quantity, rating)
  • 1 datetime/timestamp column (e.g., created_at)
  • ≥ 1,000 rows recommended (distributions + outliers are visible)
  • Bonus: a join key (e.g., user_id, product_id, store_id)

Note

In class we’ll use a tiny toy dataset (orders.csv + users.csv) to validate patterns quickly.

For your weekly project, pick a real dataset and repeat the same workflow.

Warning

Don’t commit raw data or API tokens to GitHub. Commit scripts + metadata so others can reproduce the download.

End-state (by end of today)

By the end of Day 1, your repo should contain:

  • a standard folder layout (data/, bootcamp_data/, scripts/, reports/)

  • bootcamp_data/config.py, bootcamp_data/io.py, bootcamp_data/transforms.py

  • a runnable entrypoint: python -m scripts.run_day1_load

  • at least one processed output in data/processed/ (Parquet preferred)

  • a reproducibility breadcrumb for your project dataset:

    • data/raw/_source_meta.json (where your data came from)
    • optionally: scripts/download_data.py (how to download it again)
  • at least one commit pushed to GitHub

Tool stack this week (minimal + high ROI)

  • pandas — load/clean/join/reshape/EDA (DataFrame)
  • pyarrow + Parquet — fast, typed processed files
  • httpx — extraction (but cached)
  • logging — evidence during runs (row counts, dtypes)
  • Plotly (Day 4) — one plotting library
  • DuckDB (Day 5) — local SQL on files
  • (Optional) datasets / kaggle — download catalog datasets (HuggingFace / Kaggle) and snapshot to data/raw/

Rule: avoid tool sprawl. Keep it small and shippable.

Canonical workflow (repeat every project)

  1. Load (raw/cache)
  2. Verify (schema, dtypes, keys, row counts)
  3. Clean (missingness, duplicates, normalization)
  4. Transform (joins, reshape, features)
  5. Analyze (tables + comparisons)
  6. Visualize (Plotly, export figures)
  7. Conclude (written summary + caveats)

Session 1

Offline-first mindset + project layout (no import pain)

Session 1 objectives

By the end of this session, you can:

  • define offline-first for data work
  • explain the purpose of raw/, cache/, processed/, external/
  • describe “raw immutable” and “processed idempotent”
  • run code using modules (so imports work without hacks)
  • centralize project paths using pathlib.Path

Context: why “offline-first” is worth it

You will re-run your ETL many times:

  • debugging
  • adding a cleaning rule
  • fixing a dtype bug
  • adding a feature column

If your pipeline depends on the internet, you lose time.

Concept: raw vs cache vs processed vs external

Offline-first projects separate data by role:

  • raw: original snapshots (never edited)
  • cache: downloads/API responses (safe to delete)
  • processed: clean, typed outputs (safe to recreate / overwrite)
  • external: reference drops (manual downloads you want to keep)

Example: folder roles (mental model)

data/raw/

  • immutable inputs
  • “source of truth”
  • never edited

data/cache/

  • API responses
  • intermediate downloads
  • safe to delete

data/processed/

  • clean + typed
  • analysis-ready
  • safe to overwrite

data/external/

  • manual reference
  • lookup tables
  • rarely changes

Micro-exercise: classify these files (5 minutes)

Put each file into the correct folder:

  1. orders.csv you received from a teammate
  2. users_api_page_1.json downloaded from an endpoint
  3. orders_clean.parquet generated by your ETL
  4. country_codes.xlsx you manually downloaded as reference

Checkpoint: you can justify your choices in 1 sentence each.

Solution: classification

  1. orders.csv → data/raw/
  2. users_api_page_1.json → data/cache/
  3. orders_clean.parquet → data/processed/
  4. country_codes.xlsx → data/external/

Quick Check

Question: Why should data/raw/ be “immutable”?

Answer: if you edit raw data, you lose the ability to reproduce results and debug changes.

Concept: idempotent processed outputs

Idempotent: re-running the pipeline produces the same processed outputs (given same inputs + config).

Good pattern:

  • overwrite data/processed/orders.parquet every run

Bad pattern:

  • append to data/processed/orders.csv every run

Micro-exercise: spot the idempotency bug (6 minutes)

You run this twice:

df.to_csv("data/processed/orders.csv", mode="a", index=False, header=False)

  1. What goes wrong on run #2?
  2. Rewrite it to be idempotent.
  3. Bonus: write Parquet instead of CSV.

Checkpoint: your fixed code is safe to run 20 times.

Solution: idempotent write (example)

# safest default: overwrite
df.to_parquet("data/processed/orders.parquet", index=False)

Note

If you truly need incremental data, do it intentionally (partitioned folders, dedupe keys). Default for this bootcamp: overwrite processed outputs.
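If you do choose the incremental route, a minimal sketch of the dedupe-then-overwrite idea (the `order_id` key and sample values are illustrative, not from the toy dataset):

```python
import pandas as pd

# hypothetical incremental merge: combine old + new rows, dedupe on the
# join key, then overwrite one processed file -- the end result is still
# idempotent given the same inputs
old = pd.DataFrame({"order_id": ["A1", "A2"], "amount": [10.0, 20.0]})
new = pd.DataFrame({"order_id": ["A2", "A3"], "amount": [20.0, 30.0]})

merged = (
    pd.concat([old, new])
    .drop_duplicates(subset=["order_id"], keep="last")
    .reset_index(drop=True)
)
print(len(merged))  # 3 unique orders
```

The key point: deduplication by key makes re-runs converge to the same output instead of growing it.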

Quick Check

Question: What is the most common symptom of a non‑idempotent pipeline?

Answer: row counts grow every run even when inputs didn’t change.

Context: imports fail when you run code “the wrong way”

A very common Day 1 error:

ModuleNotFoundError: No module named 'bootcamp_data'

This usually happens when you run a file by path (Python changes where it searches).

Concept: our “no pain” run rule

Rule: run entrypoints as modules from the repo root.

  • Do: python -m scripts.run_day1_load
  • Avoid: python scripts/run_day1_load.py (easy to break imports)

Warning

Avoid “fixing” imports by editing runtime paths in your code. If imports fail: change how you run, or fix the package structure.

Micro-exercise: choose the right command (4 minutes)

Your repo looks like this:

week2-data-work/
  bootcamp_data/
  scripts/

You want to run the loader.

Which command is correct?

  1. python scripts/run_day1_load.py
  2. python -m scripts.run_day1_load

Checkpoint: you can explain why one works and the other often fails.

Solution: choose the right command

Correct: option 2, python -m scripts.run_day1_load

Reason: running as a module keeps the repo root on Python’s import search path.

Context: project layout prevents “path chaos”

If everyone uses different paths:

  • notebooks break
  • scripts break
  • teammates can’t run your repo

We fix this with a consistent layout + a central paths config.

Example: repo tree (Day 1 target)

week2-data-work/
  data/{raw,cache,processed,external}/
  notebooks/
  reports/figures/
  scripts/
    __init__.py
    run_day1_load.py
  bootcamp_data/
    __init__.py
    config.py
    io.py
    transforms.py

Quick Check

Question: What folder should notebooks read from?

Answer: data/processed/ (not data/raw/).

Concept: one source of truth for paths

Hardcoding strings like "../data/raw/orders.csv" breaks when:

  • you run from a different folder
  • someone uses Windows paths
  • files move

Use pathlib.Path + a central config.py.

Quick review: @dataclass (why we use it)

A dataclass is a lightweight way to make a “data container” class.

  • auto-generates __init__ for you (less boilerplate)
  • gives a readable repr (nice for debugging)
  • frozen=True makes it immutable (prevents accidental edits)

Today we use it to store project paths in one place.

Example: Paths + make_paths (pattern)

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )

Micro-exercise: implement make_paths (6 minutes)

Create bootcamp_data/config.py with:

  1. a Paths dataclass
  2. a make_paths(root: Path) function
  3. raw/cache/processed/external paths under root/data/

Checkpoint: in a Python REPL, make_paths(Path.cwd()).processed prints a real folder path.

Solution: bootcamp_data/config.py (example)

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )

Quick Check

Question: Why use Path objects instead of plain strings?

Answer: Path handles OS differences and safe joins (root / "data" / "raw").
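A tiny demonstration of the safe-join point (folder names are the ones from this repo layout):

```python
from pathlib import Path

# Path joins work the same on every OS; no manual "/" or "\\" handling
root = Path("week2-data-work")
raw = root / "data" / "raw"
print(raw.as_posix())  # week2-data-work/data/raw
```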

Session 1 recap

  • Offline-first = your project runs without internet
  • Separate data by role: raw / cache / processed / external
  • Processed outputs should be idempotent (safe reruns)
  • Run entrypoints as modules: python -m ...
  • Centralize paths with pathlib.Path in config.py

Asr break

20 minutes

When you return: open your repo and locate data/raw, data/processed, bootcamp_data/, and scripts/.

Session 2

Data sources + caching patterns

Session 2 objectives

By the end of this session, you can:

  • identify common data sources (CSV/JSON/API) and catalog sources (Kaggle, HuggingFace)
  • download a dataset via Kaggle, HuggingFace, or a direct URL
  • write a raw snapshot to data/raw/ and record source metadata
  • implement an offline-first cache read/write pattern (reuse cache when present)
  • explain why CSV needs guardrails (dtype, separators, missing markers)

Data sources you can use (and how we treat them)

You’ll see data in a few common “shapes”:

| Source | Typical format | Where it goes | Offline-first rule |
|---|---|---|---|
| Kaggle dataset | ZIP → CSV | data/cache/ (zip) → data/raw/ (snapshot) | pin a snapshot + record metadata |
| HuggingFace Datasets | Arrow → pandas | data/raw/ (snapshot) | load once → save locally |
| Direct URL download | CSV/JSON/ZIP | data/cache/ → data/raw/ | cache the bytes + record the URL |
| API extraction | JSON pages | data/cache/ | never “depend on live calls” |
| Teammate drop | CSV/Parquet | data/raw/ | treat as immutable input |

We care about repeatability, not “where it came from”. Your repo should run even when Wi‑Fi is bad.

Path A — Kaggle (CLI) snapshot download

Kaggle is great for realistic tabular datasets, but it requires a one‑time API token setup.

One-time setup (per machine):

  1. Create a Kaggle account (if you don’t have one)
  2. In Kaggle → Account → create a new API token → download kaggle.json
  3. Move it to:
     • macOS/Linux: ~/.kaggle/kaggle.json
     • Windows: C:\Users\<you>\.kaggle\kaggle.json
  4. (macOS/Linux) fix permissions:

chmod 600 ~/.kaggle/kaggle.json

Warning

Never commit kaggle.json to your repo. It contains credentials.

Kaggle download → cache → raw (example commands)

Install the CLI (if needed):

uv pip install kaggle

Download a dataset ZIP into data/cache/ and unzip a snapshot into data/raw/:

# from repo root
DATASET_ID="<kaggle-owner>/<kaggle-dataset-slug>"

mkdir -p data/cache/kaggle data/raw/kaggle
kaggle datasets download -d "$DATASET_ID" -p data/cache/kaggle --unzip
# move/copy the extracted files into a snapshot folder you will not edit

Different Kaggle datasets unzip into different filenames. Your job is to decide the snapshot folder name and record it in metadata.

Path B — HuggingFace Datasets (Python) snapshot

HuggingFace datasets load into memory (or stream), then you save a local snapshot.

Install:

uv pip install datasets

Minimal snapshot example (save as Parquet):

from pathlib import Path
from datasets import load_dataset

out_dir = Path("data/raw/hf_my_dataset")
out_dir.mkdir(parents=True, exist_ok=True)

ds = load_dataset("<dataset_name>", split="train")  # change name/split
df = ds.to_pandas()
df.to_parquet(out_dir / "train.parquet", index=False)

Note

Offline-first rule: once you have a local snapshot, your ETL reads from disk, not from HuggingFace.

Path C — Direct URL download (CSV/ZIP) snapshot

This is the most universal option: if you have a URL, you can cache it.

from pathlib import Path
import httpx

def download_bytes(url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with httpx.Client(timeout=60.0, follow_redirects=True) as client:
        r = client.get(url)
        r.raise_for_status()
        out_path.write_bytes(r.content)

download_bytes(
    url="<paste-a-csv-or-zip-url>",
    out_path=Path("data/cache/downloads/source_file.zip"),
)

Then unzip/copy into data/raw/<snapshot_name>/ and treat it as immutable.
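That unzip step can also live in Python, so the whole acquisition is scriptable. A sketch using the standard library (the function name and paths are illustrative):

```python
from pathlib import Path
import zipfile

def extract_snapshot(zip_path: Path, snapshot_dir: Path) -> None:
    """Unzip a cached download into a raw snapshot folder (then treat it as immutable)."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(snapshot_dir)
```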

Raw snapshot metadata (required)

Every raw snapshot should have a tiny metadata record so someone else can reproduce it.

Recommended location (simple): data/raw/_source_meta.json

Example:

{
  "dataset_name": "my_week2_project",
  "source": "kaggle | huggingface | url | teammate",
  "dataset_id_or_url": "…",
  "downloaded_at_utc": "2025-12-21T10:15:00Z",
  "raw_snapshot_folder": "data/raw/my_week2_project/",
  "files": ["train.parquet"],
  "notes": "any caveats (sampling, filters, auth)"
}

We’ll expand metadata later (row counts, schema summary, git commit). Today: just make sure the origin is documented.

Starter: scripts/download_data.py (project dataset)

Keep dataset acquisition repeatable by putting it in one script.

Minimum for Day 1: support one of (URL / HuggingFace / Kaggle).
Stretch: support 2–3 sources in the same file.

Tiny URL downloader (works everywhere):

from __future__ import annotations
from pathlib import Path
import json
from datetime import datetime, timezone
import httpx

def download(url: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with httpx.Client(timeout=60.0, follow_redirects=True) as client:
        r = client.get(url)
        r.raise_for_status()
        out_path.write_bytes(r.content)

def write_meta(meta_path: Path, meta: dict) -> None:
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

def main() -> None:
    url = "<paste-url>"
    out_path = Path("data/cache/downloads/source_file.csv")
    download(url, out_path)

    write_meta(
        Path("data/raw/_source_meta.json"),
        {
            "dataset_name": "my_week2_project",
            "source": "url",
            "dataset_id_or_url": url,
            "downloaded_at_utc": datetime.now(timezone.utc).isoformat(),
            "raw_snapshot_folder": "data/raw/my_week2_project/",
            "files": [],
            "notes": "downloaded bytes cached; raw snapshot extracted separately"
        },
    )

if __name__ == "__main__":
    main()

Note

This script is your “repro button” — it doesn’t need to be fancy. It just needs to be honest and rerunnable.

Context: extraction is where “silent drift” starts

Two common failure modes:

  • upstream data changes → your results change
  • extraction fails partially → you analyze incomplete data

Your job: make extraction reproducible.

Concept: minimal extraction checklist

Before you trust extracted data:

  • did you get all pages (pagination)?
  • did you record params/time window?
  • did you store a snapshot (cache)?
  • did you validate row count / file size?

Example: minimal extraction metadata (JSON)

{
  "timestamp_utc": "2025-12-21T09:15:00Z",
  "source": "api",
  "endpoint": "/v1/users",
  "params": {"page": 1, "per_page": 100},
  "status_code": 200
}

Store this next to the cached file (same name + .meta.json).
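One way to keep the naming convention consistent is a tiny helper (the function name is illustrative):

```python
from pathlib import Path

def meta_path_for(cache_path: Path) -> Path:
    """Derive the sidecar metadata path: users_page_1.json -> users_page_1.meta.json."""
    return cache_path.with_suffix(".meta.json")

print(meta_path_for(Path("data/cache/users_page_1.json")).as_posix())
# data/cache/users_page_1.meta.json
```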

Micro-exercise: design the metadata file (5 minutes)

You downloaded: data/cache/users_page_1.json

  1. What should the metadata filename be?
  2. List 4 keys you will store.

Checkpoint: your filename ends with .meta.json and your keys support reproducibility.

Solution: metadata naming + keys

  • filename: data/cache/users_page_1.meta.json

  • keys (example):

    • timestamp_utc
    • endpoint
    • params
    • status_code

Quick Check

Question: If you don’t store params, what breaks later?

Answer: you can’t reproduce which slice of data you extracted (numbers won’t match).

Context: caching keeps you productive

Internet is unreliable.

Caching means:

  • first run: download → save to data/cache/
  • next runs: read from cache (fast, offline)

Concept: offline-first caching policy

Operational rule:

  • if cache exists → use it
  • only re-download when you choose (manual delete or TTL refresh)

This prevents “live calls” from breaking your workflow.

Quick review: httpx in 60 seconds

  • httpx is a Python HTTP client (like requests)
  • you call an endpoint and get JSON back
  • you must cache the response for reproducibility

Today’s code is a pattern. We don’t depend on live internet during labs.

Example: fetch_json_cached (offline-first)

from pathlib import Path
import json
import time
import httpx

def fetch_json_cached(url: str, cache_path: Path, *, ttl_s: int | None = None) -> dict:
    """Offline-first JSON fetch with optional TTL."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)

    # Offline-first default: if cache exists, use it (unless TTL says it's too old)
    if cache_path.exists():
        age_s = time.time() - cache_path.stat().st_mtime
        if ttl_s is None or age_s < ttl_s:
            return json.loads(cache_path.read_text(encoding="utf-8"))

    # Otherwise: fetch and overwrite cache
    with httpx.Client(timeout=20.0) as client:
        r = client.get(url)
        r.raise_for_status()
        data = r.json()

    cache_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return data

Micro-exercise: predict caching behavior (4 minutes)

Assume cache_path already exists.

What happens?

  1. ttl_s=None
  2. ttl_s=3600 and cache age is 10 minutes
  3. ttl_s=3600 and cache age is 3 days

Checkpoint: you can answer all 3 without running code.

Solution: predict caching behavior

  1. ttl_s=None → read from cache (offline-first default)
  2. ttl_s=3600 and age 10m → read from cache
  3. ttl_s=3600 and age 3d → re-download then overwrite cache

Quick Check

Question: When is ttl_s=None a good default?

Answer: when you want the pipeline to run offline and you refresh cache manually.

Quick review: pandas = “tables in Python”

In Week 1 you used csv.DictReader (rows = dicts).

In Week 2 we mostly use pandas (tables):

  • pd.read_csv(...) → DataFrame
  • df.head() → preview
  • df.shape → row/col count
  • df.dtypes → types (important!)

Bridge (manual ↔︎ pandas):

import pandas as pd

# list[dict] → DataFrame
rows = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
df = pd.DataFrame(rows)

# DataFrame → list[dict] (records)
records = df.to_dict(orient="records")

Tip

pandas is powerful, but it can guess types wrong. That’s why we add guardrails.

Context: CSV is a “lowest common denominator”

CSV is common, but it is fragile:

  • encoding surprises
  • separators differ (; vs ,)
  • decimal separators differ (1,23 vs 1.23)
  • missing markers vary (NA, null, empty)

Concept: read_csv guardrails (high ROI options)

When reading CSV, consider:

  • dtype= (especially IDs)
  • na_values= (custom missing markers)
  • encoding=
  • sep=
  • decimal=

Micro-exercise: write a guarded read_csv (6 minutes)

Scenario:

  • separator is ;
  • decimals use comma: 12,50
  • IDs like 0007 must keep leading zeros

Write the pd.read_csv(...) call.

Checkpoint: your call includes sep, decimal, and dtype for IDs.

Solution: guarded pd.read_csv (example)

import pandas as pd

df = pd.read_csv(
    path,
    sep=";",
    decimal=",",
    dtype={"user_id": "string", "order_id": "string"},
    na_values=["", "NA", "null"],
)

Session 2 recap

  • Extraction must be reproducible (cache + metadata)
  • Offline-first caching: reuse cache when present
  • CSV needs guardrails (dtype, sep, decimal, na_values)
  • Next: schema discipline + Parquet outputs

Maghrib break

20 minutes

When you return: open your editor and be ready to write io.py and transforms.py.

Session 3

pandas I/O + schema basics

Session 3 objectives

By the end of this session, you can:

  • explain why pandas dtype inference can be dangerous
  • treat IDs as strings (avoid silent corruption)
  • write and read Parquet with pandas
  • centralize I/O in io.py
  • implement a minimal enforce_schema(df) transform

pandas muscle memory (3 moves)

When you load any dataset, do this first:

  • df.head() → sanity check values
  • df.shape → row/col count
  • df.dtypes → confirm types (especially IDs)

This takes 10 seconds and prevents 2 hours of confusion.
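The three moves in one runnable snippet (toy values, not the class dataset):

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["0007", "0008"], "amount": [12.5, 8.0]})

print(df.head())    # sanity-check a few values
print(df.shape)     # (2, 2) -> rows, columns
print(df.dtypes)    # confirm types: user_id should NOT be numeric
```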

Context: pandas inference can silently corrupt meaning

If pandas guesses wrong, you might lose information without an error.

Classic example:

  • ID 00123 becomes 123 (leading zeros lost forever)

Concept: treat IDs as strings

Operational rules:

  • IDs are strings unless you truly compute on them

  • use pandas nullable dtypes when missing values exist:

    • "string", "Int64", "Float64", "boolean"

Example: the leading-zero bug

import pandas as pd

df = pd.read_csv("orders.csv")  # risky default
print(df.dtypes)

Risk

  • user_id becomes int64
  • 0007 becomes 7
  • joins fail later
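You can reproduce the bug (and the fix) without any file on disk, using an in-memory CSV:

```python
from io import StringIO
import pandas as pd

csv_text = "user_id,amount\n0007,12.5\n"

bad = pd.read_csv(StringIO(csv_text))                                # user_id inferred as int64
good = pd.read_csv(StringIO(csv_text), dtype={"user_id": "string"})  # leading zeros preserved

print(bad["user_id"].iloc[0])   # 7
print(good["user_id"].iloc[0])  # 0007
```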

Micro-exercise: choose dtypes (6 minutes)

You have orders.csv with columns:

  • order_id, user_id, amount, quantity, created_at, status

Write:

  1. a dtype={...} mapping for IDs
  2. a 1-sentence rule for quantity when missing values exist

Checkpoint: your mapping keeps leading zeros.

Solution: dtype mapping + rule

dtype = {
    "order_id": "string",
    "user_id": "string",
}

Rule: if quantity is “integer but can be missing”, use "Int64" after parsing.

Quick Check

Question: If quantity is an integer but has missing values, what dtype should you use?

Answer: "Int64" (nullable integer), not plain int64.

Concept: why Parquet for processed outputs

Parquet is a columnar file format that:

  • preserves dtypes (critical for processed data)
  • is smaller than CSV (compression)
  • loads faster for analytics workflows

Example: write + read Parquet

df.to_parquet("data/processed/orders.parquet", index=False)
df2 = pd.read_parquet("data/processed/orders.parquet")

Note

Parquet requires an engine. We use pyarrow this week.

Context: centralize I/O in io.py

If every notebook reads data differently:

  • missing values differ
  • dtypes differ
  • results differ

Centralized I/O makes team work consistent.

Concept: four core I/O helpers

In bootcamp_data/io.py:

  • read_orders_csv(path) -> DataFrame
  • read_users_csv(path) -> DataFrame
  • write_parquet(df, path) -> None
  • read_parquet(path) -> DataFrame

Example: io.py pattern

from pathlib import Path
import pandas as pd

NA = ["", "NA", "N/A", "null", "None"]

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
    )

def write_parquet(df: pd.DataFrame, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)

Micro-exercise: add one more guardrail (5 minutes)

In your read_orders_csv(...), add one extra guardrail:

Choose one:

  • sep=";"
  • encoding="utf-8"
  • decimal=","

Checkpoint: you can explain why your chosen guardrail matters.

Solution: example guardrail

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

Concept: schema enforcement is a correctness step

Even with dtype=..., you often need:

  • numeric parsing (amount, quantity)
  • normalization (status casing)
  • consistent missing values

We enforce types after loading.

Example: enforce_schema(df) (minimal)

import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )

Micro-exercise: make parsing “non-crashy” (5 minutes)

Complete this safely:

amount = pd.to_numeric(df["amount"], errors=____).astype("Float64")
quantity = pd.to_numeric(df["quantity"], errors=____).astype("Int64")

Checkpoint: invalid values become missing (not crashes).

Solution: parsing with coercion

amount = pd.to_numeric(df["amount"], errors="coerce").astype("Float64")
quantity = pd.to_numeric(df["quantity"], errors="coerce").astype("Int64")

Quick Check

Question: What does errors="coerce" do?

Answer: invalid values become missing (NaN) instead of raising an exception.
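You can see the coercion in one line (toy values):

```python
import pandas as pd

# "not_a_number" cannot be parsed, so coercion turns it into NaN
s = pd.to_numeric(pd.Series(["12.50", "not_a_number"]), errors="coerce")
print(int(s.isna().sum()))  # 1 -> the invalid value became missing, no exception raised
```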

Session 3 recap

  • pandas inference can silently break IDs → keep IDs as strings
  • Parquet preserves dtypes and loads fast → use it for processed data
  • centralize loading/writing in io.py
  • use enforce_schema as a small, testable correctness step

Tomorrow: “Verify” becomes code (fail fast)

Day 2 we’ll turn assumptions into checks, like:

  • required columns
  • non-empty datasets
  • unique keys (before joins)
  • missingness report (per column)
  • simple range checks (e.g., amount >= 0)

Why it matters: catching bad data early prevents join disasters and wasted debugging.
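As a preview, those checks can be sketched as one fail-fast function (names and the exact checks are illustrative; we build the real version on Day 2):

```python
import pandas as pd

def verify_orders(df: pd.DataFrame) -> None:
    """Fail fast on bad data: required columns, non-empty, non-negative amounts."""
    required = {"order_id", "user_id", "amount"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if df.empty:
        raise ValueError("dataset is empty")
    if (pd.to_numeric(df["amount"], errors="coerce").dropna() < 0).any():
        raise ValueError("negative amounts found")
```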

Isha break

20 minutes

When you return: we will scaffold the repo and write our first processed Parquet file.

Quick review: logging beats “mystery debugging”

In data pipelines, you want evidence during every run:

  • row counts
  • dtypes
  • output paths

Minimal pattern:

import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger(__name__)
log.info("rows=%s", len(df))

Tip

Start with logging. Use notebooks for exploration, not for pipelines.

Hands-on

Build: scaffold repo + typed I/O + first processed output

Hands-on success criteria (today)

By the end, you should have:

  • a repo with the standard folder layout
  • config.py, io.py, transforms.py inside bootcamp_data/
  • raw inputs in data/raw/ (toy dataset is fine for in-class)
  • data/raw/_source_meta.json documenting where your project dataset comes from
  • a runnable module: python -m scripts.run_day1_load
  • at least one processed Parquet created by your code (e.g., data/processed/orders.parquet)
  • at least one commit pushed to GitHub

Project layout (target)

week2-data-work/
  data/
    raw/            # immutable inputs
    cache/          # API responses (optional today)
    processed/      # your Parquet outputs
    external/       # reference drops (optional)
  reports/figures/
  scripts/
    __init__.py
    run_day1_load.py
  bootcamp_data/
    __init__.py
    config.py
    io.py
    transforms.py
  README.md
  requirements.txt

Vibe coding (safe version)

  1. Write the plan in 5 bullets (no code yet)
  2. Implement the smallest piece
  3. Run → break → read error → fix
  4. Commit
  5. Repeat

Warning

Do not ask GenAI to write your solution code. Ask it to explain concepts or errors.

Task 1 — Create folders + initialize git (10 minutes)

  • Create the repo folders (data/, bootcamp_data/, scripts/, reports/)
  • Initialize git
  • Create an empty README.md
  • Add __init__.py files (so folders can be imported as packages)

Checkpoint: git status shows your new files.

Solution — folders + git

macOS/Linux

mkdir -p data/{raw,cache,processed,external}
mkdir -p reports/figures scripts bootcamp_data notebooks
touch README.md bootcamp_data/__init__.py scripts/__init__.py
git init

Windows PowerShell

mkdir data, reports, scripts, bootcamp_data, notebooks
mkdir data\raw, data\cache, data\processed, data\external
mkdir reports\figures
ni README.md -ItemType File
ni bootcamp_data\__init__.py -ItemType File
ni scripts\__init__.py -ItemType File
git init

Task 2 — Create a uv environment + install deps (15 minutes)

  • Create a virtual environment with uv
  • Activate it
  • Install core deps: pandas, pyarrow, httpx
  • Optional (for dataset downloading today): datasets (HuggingFace), kaggle (Kaggle CLI)
  • Freeze requirements.txt

Checkpoint: python -c "import pandas; import pyarrow; import httpx" runs with no error.

Hint — common environment mistakes

Most Day 1 issues come from:

  • the wrong Python interpreter
  • forgetting to activate the environment
  • installing packages globally by accident

Quick checks:

  • which python (macOS/Linux)
  • Get-Command python (Windows PowerShell)

Solution — uv venv + deps

macOS/Linux

uv venv
source .venv/bin/activate

# core (required)
uv pip install pandas pyarrow httpx

# optional (only if you plan to use these sources today)
# uv pip install datasets kaggle

uv pip freeze > requirements.txt
python -c "import pandas; import pyarrow; import httpx; print('ok')"

Windows PowerShell

uv venv
.\.venv\Scripts\Activate.ps1

# core (required)
uv pip install pandas pyarrow httpx

# optional (only if you plan to use these sources today)
# uv pip install datasets kaggle

uv pip freeze > requirements.txt
python -c "import pandas; import pyarrow; import httpx; print('ok')"

Warning

If pyarrow install fails, ask the instructor. Fallback: you can write CSV today, but Parquet is strongly preferred for Week 2.
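A quick preflight you can run before class to find out early whether Parquet output will work. This is a sketch; it only checks that pyarrow imports:

```python
# Preflight: is Parquet support available in this environment?
try:
    import pyarrow
    has_parquet = True
    print("pyarrow", pyarrow.__version__, "- Parquet output OK")
except ImportError:
    has_parquet = False
    print("pyarrow missing - write CSV today, install pyarrow later")
```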

Task 3 — Acquire raw data (two tracks) (10–25 minutes)

Track A (in-class, recommended): create the tiny toy dataset below:

  • data/raw/orders.csv
  • data/raw/users.csv

This lets you validate your pipeline quickly.

Track B (weekly project): download a real dataset (Kaggle / HuggingFace / URL) into a snapshot folder like:

  • data/raw/<your_dataset_name>/

…and record provenance in:

  • data/raw/_source_meta.json

Checkpoint: you have some raw data to run through today’s loader, and you know where your project dataset will come from.

Track A solution — data/raw/orders.csv (toy dataset)

order_id,user_id,amount,quantity,created_at,status
A0001,0001,12.50,1,2025-12-01T10:05:00Z,Paid
A0002,0002,8.00,2,2025-12-01T11:10:00Z,paid
A0003,0003,not_a_number,1,2025-12-02T09:00:00Z,Refund
A0004,0001,25.00,,2025-12-03T14:30:00Z,PAID
A0005,0004,100.00,1,not_a_date,paid

Track A solution — data/raw/users.csv (toy dataset)

user_id,country,signup_date
0001,SA,2025-11-15
0002,SA,2025-11-20
0003,AE,2025-11-22
0004,SA,2025-11-25

Task 3.5 — Add data/raw/_source_meta.json (5 minutes)

Create a tiny provenance record for your raw snapshot.

Even if you’re using the toy dataset today, practice the habit — you’ll keep this file for your weekly project.

Example (edit values to match your situation):

{
  "dataset_name": "toy_orders_users",
  "source": "in_class_toy",
  "dataset_id_or_url": "n/a",
  "downloaded_at_utc": "2025-12-21T00:00:00Z",
  "raw_snapshot_folder": "data/raw/",
  "files": ["orders.csv", "users.csv"],
  "notes": "Toy dataset used for Week 2 Day 1 scaffolding."
}

Tomorrow we’ll generate additional metadata automatically (row counts, schema summary).
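If you prefer to create the record from code, here is a minimal sketch. The helper name `write_source_meta` is hypothetical (not part of the bootcamp modules); the demo writes into a temporary folder so it runs anywhere:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def write_source_meta(raw_dir: Path, meta: dict) -> Path:
    # Hypothetical helper: stamp the record with a UTC timestamp and
    # write it next to the raw snapshot.
    record = {**meta, "downloaded_at_utc": datetime.now(timezone.utc).isoformat()}
    out = raw_dir / "_source_meta.json"
    out.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out

# Demo against a temporary folder; in the repo you would pass data/raw/.
with tempfile.TemporaryDirectory() as tmp:
    path = write_source_meta(
        Path(tmp),
        {"dataset_name": "toy_orders_users", "source": "in_class_toy"},
    )
    loaded = json.loads(path.read_text(encoding="utf-8"))
    print(loaded["dataset_name"], loaded["downloaded_at_utc"])
```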

Task 4 — Implement config.py (12 minutes)

Create bootcamp_data/config.py:

  • Paths dataclass
  • make_paths(root: Path) -> Paths

Checkpoint: a tiny snippet prints a valid processed path.

Solution — bootcamp_data/config.py

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(
        root=root,
        raw=data / "raw",
        cache=data / "cache",
        processed=data / "processed",
        external=data / "external",
    )
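For the checkpoint ("a tiny snippet prints a valid processed path"), here is a standalone sketch. It inlines the same Paths/make_paths definitions so it runs anywhere; in the repo you would import them from bootcamp_data.config instead, and the root "/tmp/my_project" is just a placeholder:

```python
from dataclasses import dataclass
from pathlib import Path

# Inline copies of the repo's Paths/make_paths so this sketch is standalone.
@dataclass(frozen=True)
class Paths:
    root: Path
    raw: Path
    cache: Path
    processed: Path
    external: Path

def make_paths(root: Path) -> Paths:
    data = root / "data"
    return Paths(root=root, raw=data / "raw", cache=data / "cache",
                 processed=data / "processed", external=data / "external")

p = make_paths(Path("/tmp/my_project"))  # placeholder root for the demo
print(p.processed / "orders.parquet")
```

Because the dataclass is frozen, nothing can silently reassign a path later in the pipeline.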

Task 5 — Implement io.py (18 minutes)

Create bootcamp_data/io.py with:

  • read_orders_csv(path: Path) -> pd.DataFrame
  • read_users_csv(path: Path) -> pd.DataFrame
  • write_parquet(df, path)
  • read_parquet(path)

Checkpoint: you can import these functions without errors.

Solution — bootcamp_data/io.py

from pathlib import Path
import pandas as pd

NA = ["", "NA", "N/A", "null", "None"]

def read_orders_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

def read_users_csv(path: Path) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"user_id": "string"},
        na_values=NA,
        keep_default_na=True,
        encoding="utf-8",
    )

def write_parquet(df: pd.DataFrame, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)

def read_parquet(path: Path) -> pd.DataFrame:
    return pd.read_parquet(path)
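To see why the explicit dtype map matters, a small standalone round trip (same read_csv arguments as the reader above, applied to an inline CSV in a temp folder): without dtype="string", pandas would parse 0001 as the integer 1 and the leading zeros are gone.

```python
import tempfile
from pathlib import Path
import pandas as pd

csv = "order_id,user_id,amount\nA0001,0001,12.50\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "orders.csv"
    path.write_text(csv, encoding="utf-8")
    # Explicit dtypes keep IDs as strings (leading zeros survive).
    df = pd.read_csv(path, dtype={"order_id": "string", "user_id": "string"})

print(df["user_id"].iloc[0])  # "0001", not 1
```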

Task 6 — Add transforms.py with enforce_schema (15 minutes)

Create bootcamp_data/transforms.py:

  • implement enforce_schema(df) -> df
  • convert amount and quantity using pd.to_numeric(..., errors="coerce")
  • normalize status to lowercase

Checkpoint: enforce_schema returns Float64 / Int64 dtypes and cleaned status.

Solution — bootcamp_data/transforms.py (minimal)

import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )
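You can see the coercion behaviour on a single toy row (the A0003 order, where amount is "not_a_number"). This sketch inlines the same enforce_schema logic so it runs standalone; in the repo you would import it from bootcamp_data.transforms:

```python
import pandas as pd

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as bootcamp_data/transforms.py, inlined for the demo.
    return df.assign(
        order_id=df["order_id"].astype("string"),
        user_id=df["user_id"].astype("string"),
        amount=pd.to_numeric(df["amount"], errors="coerce").astype("Float64"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce").astype("Int64"),
        status=df["status"].astype("string").str.strip().str.lower(),
    )

raw = pd.DataFrame({
    "order_id": ["A0003"], "user_id": ["0003"],
    "amount": ["not_a_number"], "quantity": ["1"], "status": ["Refund"],
})
clean = enforce_schema(raw)
print(clean.dtypes)
```

errors="coerce" turns the unparseable amount into a missing value instead of raising, and the nullable Float64/Int64 dtypes can hold that missing value without silently converting the column to object.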

Task 7 — Write the Day 1 loader module (20 minutes)

Create scripts/run_day1_load.py:

  • compute ROOT (repo root)
  • build paths with make_paths(ROOT)
  • read raw CSVs
  • apply enforce_schema to orders
  • write Parquet outputs to data/processed/
  • print or log evidence (row counts + dtypes)

Checkpoint: python -m scripts.run_day1_load creates data/processed/orders.parquet.

Hint — compute the project root robustly

In scripts/run_day1_load.py:

from pathlib import Path
ROOT = Path(__file__).resolve().parents[1]

This works even if you run commands from different folders.

Solution — scripts/run_day1_load.py

from pathlib import Path
import logging

from bootcamp_data.config import make_paths
from bootcamp_data.io import read_orders_csv, read_users_csv, write_parquet
from bootcamp_data.transforms import enforce_schema

log = logging.getLogger(__name__)

def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

    ROOT = Path(__file__).resolve().parents[1]
    p = make_paths(ROOT)

    orders = enforce_schema(read_orders_csv(p.raw / "orders.csv"))
    users = read_users_csv(p.raw / "users.csv")

    log.info("Loaded rows: orders=%s users=%s", len(orders), len(users))
    log.info("Orders dtypes:\n%s", orders.dtypes)

    write_parquet(orders, p.processed / "orders.parquet")
    write_parquet(users, p.processed / "users.parquet")

    log.info("Wrote processed files to: %s", p.processed)

if __name__ == "__main__":
    main()

Tip

Run it as a module from the repo root:

  • python -m scripts.run_day1_load

Task 8 — Run + verify outputs (10 minutes)

Run the loader and verify:

  • processed files exist
  • IDs stayed as strings
  • invalid amount became missing

Checkpoint: you can load Parquet back into pandas and see expected dtypes.

Solution — run + verify

macOS/Linux

uv run python -m scripts.run_day1_load
uv run python -c "import pandas as pd; df=pd.read_parquet('data/processed/orders.parquet'); print(df.dtypes); print(df.head())"

Windows PowerShell

uv run python -m scripts.run_day1_load
uv run python -c "import pandas as pd; df=pd.read_parquet('data/processed/orders.parquet'); print(df.dtypes); print(df.head())"

Git checkpoint (5 minutes)

  • git status
  • commit with message: "w2d1: scaffold + typed io + first processed parquet"
  • push to GitHub

Checkpoint: you can see your commit online.

Solution — git commands

git add -A
git commit -m "w2d1: scaffold + typed io + first processed parquet"
git branch -M main
git remote add origin <YOUR_REPO_URL>
git push -u origin main

If you already have a remote, skip the git remote add step.

Debug playbook

When stuck:

  1. Read the full error (don’t guess)
  2. Identify: file + line number
  3. Print/log: ROOT, paths, row counts, df.dtypes
  4. Fix the smallest thing
  5. Re-run

Common Day 1 fixes:

  • If imports fail: run with python -m ... from repo root
  • If modules are missing: install deps in the activated uv environment
  • If Parquet write fails: confirm pyarrow installed

Stretch goals (optional)

If you finish early:

  • create scripts/download_data.py for your project dataset (Kaggle / HF / URL)
  • write data/processed/_run_meta.json with row counts + output paths
  • add a README.md section: “How to run Day 1”
  • add one assertion: assert str(orders["user_id"].dtype) == "string"
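For the _run_meta.json stretch goal, a minimal sketch of what such a record could look like. The helper name `write_run_meta` is hypothetical; the demo targets a temporary folder, so the outputs list is empty here, while in the repo it would list your Parquet files under data/processed/:

```python
import json
import tempfile
from pathlib import Path

def write_run_meta(processed_dir: Path, row_counts: dict) -> Path:
    # Hypothetical helper: record row counts and output paths per run.
    meta = {
        "row_counts": row_counts,
        "outputs": sorted(str(p) for p in processed_dir.glob("*.parquet")),
    }
    out = processed_dir / "_run_meta.json"
    out.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return out

with tempfile.TemporaryDirectory() as tmp:
    meta_path = write_run_meta(Path(tmp), {"orders": 5, "users": 4})
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    print(meta)
```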

Exit Ticket

In 1–2 sentences:

Why do we keep IDs as strings and prefer Parquet for processed outputs?

What to do after class (Day 1 assignment)

Due: before Day 2 starts

Today’s work becomes the foundation of your Week 2 data project.

  1. Pick your Week 2 project dataset (Kaggle / HuggingFace / URL).

  2. Create / update data/raw/_source_meta.json with:

    • source
    • dataset_id_or_url
    • downloaded_at_utc
    • raw_snapshot_folder
    • files
  3. Make the download reproducible:

    • Preferred: add scripts/download_data.py (support at least one of: Kaggle, HF, URL)
    • Minimum: add a README section with exact commands you used
  4. Ensure python -m scripts.run_day1_load works from a fresh terminal.

  5. Confirm you can create at least one processed Parquet file in data/processed/.

  6. Push at least one commit to GitHub.

Deliverable: GitHub repo link + evidence of:

  • data/raw/_source_meta.json (contents)
  • a processed Parquet file in data/processed/
  • the commands/script used to acquire the project dataset

Tip

Commit early. Commit often. Future you will thank you.

Thank You!