AI Professionals Bootcamp | Week 1
2025-12-16
Goal: Turn your profiler into a clean Python package and expose it as a real CLI.
Bootcamp • SDAIA Academy
By the end of today, you can:
- explain imports via sys.path and PYTHONPATH
- use the built-in modules os, sys, time, shutil
- build a CLI with a real --help

Warm-up: run your Day 2 project (whatever layout you have right now).
If you already have a src/ folder (src-layout):
If you do not have a src/ folder yet (flat-layout):
Run your Day 2 project (whatever layout you have right now).
Windows PowerShell (src-layout):
Windows PowerShell (flat-layout):
Checkpoint: outputs/report.json and outputs/report.md are updated.
You already have:
- type inference (number vs text)

Today you will add:
- a clean package layout (src/csv_profiler/...)

Modules + packages + built-in modules
- a module is a single .py file, e.g. profiling.py
- a package is a folder like csv_profiler/ with __init__.py

Why packages?
When you write import something, Python searches for something in this order: built-in modules first, then every folder listed in sys.path (script folder, PYTHONPATH entries, installed packages).

Debug tool:
Create debug_paths.py:
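A minimal sketch of that script; it just prints every folder Python will search for imports:

```python
# debug_paths.py: show every folder Python searches when importing
import sys

for folder in sys.path:
    print(folder)
```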
Run:
python debug_paths.py
Question: Do you see your project root? Do you see .../.venv/...?
PYTHONPATH (adds folders to Python's import search).

Tip
You don’t need to memorize many environment variables. Today we only care about PYTHONPATH.
PYTHONPATH: add folders to the import search. If your code lives in src/, add it to the path:
Tip
This is the simplest way to use a src/ layout before we finalize packaging.
Good defaults
import csv
import json
from pathlib import Path

Why?
Also okay (be explicit)
import numpy as np (common convention)
import utilities.arithmetic.units as convert

Avoid:
from module import * (hides names)

__name__ == "__main__": run vs import
A file can be:
- run directly as a script (then __name__ == "__main__")
- imported by another module (then __name__ is the module's name)
Pattern:
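A sketch of the standard pattern (the names mymodule/main are illustrative):

```python
# mymodule.py: main() runs only when the file is executed directly,
# not when it is imported by another module
def main() -> None:
    print("running as a script")


if __name__ == "__main__":
    main()
```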
Running with -m. Instead of:
python src/csv_profiler/cli.py

Prefer:
python -m csv_profiler.cli

Why? Running a file directly only puts that file's folder on sys.path; running with -m executes it as part of the package, so imports like csv_profiler.io resolve correctly.
Planned modules:
- io.py → read the CSV into list[dict[str, str]]
- profiling.py → missing/type/unique stats
- render.py → Markdown report
- cli.py → Typer commands
System & OS
- os → environment variables, current directory
- sys → argv, stdin/out, import path
- time → measure runtime, timestamps
- shutil → file operations + check if tools exist

Tip
These are “glue” modules that make your Python code behave like a real tool.
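A quick tour of the four modules (the output depends on your machine; `shutil.which` prints None if git is not on PATH):

```python
import os
import shutil
import sys
import time

print(os.getcwd())          # current working directory
print(sys.argv[0])          # how this script was invoked
t0 = time.perf_counter()
total = sum(range(100_000))
print(f"{(time.perf_counter() - t0) * 1000:.2f} ms")  # elapsed time
print(shutil.which("git"))  # full path to git, or None if not installed
```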
os: environment + current folder. Use cases:
- read settings such as OUTPUT_DIR from the environment

sys: argv + exit codes
time: measure how long profiling takes (later we report it as timing_ms)
shutil: find tools + move files. Useful later:
- check that git is installed before Day 5 tasks

Exercise: create src/csv_profiler/strings.py:
Then import it in main.py (or another file):
Checkpoint: prints my-report-01
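A sketch of what strings.py might contain; the exact rules are assumptions inferred from the checkpoint (lowercase, runs of spaces/underscores become a hyphen, other punctuation dropped):

```python
# src/csv_profiler/strings.py: assumed slugify behavior
import re


def slugify(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[\s_]+", "-", text)     # whitespace/underscores -> hyphen
    text = re.sub(r"[^a-z0-9-]", "", text)  # drop everything else
    return re.sub(r"-+", "-", text).strip("-")


print(slugify("My Report 01"))  # my-report-01
```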
Rules of thumb:
- put helpers like slugify in small, focused modules
- always add __init__.py in a package folder
- never name a module csv.py or json.py (shadows built-ins!)

Warning
Never name your file the same as a standard library module. Example: don’t create time.py.
Recap:
- a module is a single .py file; a package is a folder of modules
- imports follow sys.path → debug it!
- use PYTHONPATH=src (for now) to support the src/ layout
- glue modules: os, sys, time, shutil

20 minutes
OOP essentials (only what you need)
Use classes when you want:
Don’t force OOP when:
You define a class (class Person: ...) and create objects from it (p = Person(...)). A class can contain:
- attributes (data: name, age)
- methods (behavior: greet())

Key idea:
- __init__ runs when you create the object
- self is the object being created/used

Tip
A method is just a function that lives inside a class. It always receives self as the first parameter.
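A minimal sketch of those ideas (the greeting text is illustrative):

```python
class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name  # attribute (data)
        self.age = age

    def greet(self) -> str:  # method (behavior); receives self first
        return f"Hi, I'm {self.name} and I'm {self.age}."


p = Person("Ali", 25)
print(p.greet())  # Hi, I'm Ali and I'm 25.
```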
__repr__
If you print an object without __repr__, you usually see something like:
<__main__.Person object at 0x...>
Add this:
Now:
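The missing snippet here is presumably a __repr__ method; both the addition and its effect, sketched together:

```python
class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age

    def __repr__(self) -> str:
        # !r shows the value's repr, so strings keep their quotes
        return f"Person(name={self.name!r}, age={self.age})"


print(Person("Sara", 30))  # Person(name='Sara', age=30)
```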
Tip
To get the repr output as a str value for any object, use the built-in function repr(). You can also embed it in f-strings with the !r format specifier, as in the example above.
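For example:

```python
s = "hi"
print(repr(s))         # 'hi' -- the quoted repr, as a str
print(f"value={s!r}")  # value='hi' -- !r applies repr() inside f-strings
```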
Sometimes you want an attribute that is computed from other data.
We want: “age must be between 0 and 200”.
```python
class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age  # calls the setter

    @property
    def age(self) -> int:
        return self._age

    @age.setter
    def age(self, value: int) -> None:
        if value < 0 or value > 200:
            raise ValueError("age must be between 0 and 200")
        self._age = value
```

Tip
We store the real value in _age. By convention, a leading _ means “internal use”.
Exercise: the Person class (6 minutes). Checkpoint: you get a clear error (ValueError).
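Exercising the validating setter from the slide above (the class is repeated so the snippet runs on its own):

```python
class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age  # goes through the setter below

    @property
    def age(self) -> int:
        return self._age

    @age.setter
    def age(self, value: int) -> None:
        if value < 0 or value > 200:
            raise ValueError("age must be between 0 and 200")
        self._age = value


p = Person("Sara", 30)
p.age = 45  # fine
try:
    p.age = 500  # out of range
except ValueError as e:
    print(e)  # age must be between 0 and 200
```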
```python
class Employee(Person):
    def __init__(self, name: str, age: int, salary: float) -> None:
        super().__init__(name, age)
        self.salary = salary


class Student(Person):
    def __init__(self, name: str, age: int, grades: list[float]) -> None:
        super().__init__(name, age)
        self.grades = grades

    @property
    def average(self) -> float:
        if not self.grades:
            return 0.0
        return sum(self.grades) / len(self.grades)
```

Why careful?
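A quick check of the subclass (using a simplified Person without the age validation, for brevity):

```python
class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age


class Student(Person):
    def __init__(self, name: str, age: int, grades: list[float]) -> None:
        super().__init__(name, age)  # reuse the parent's __init__
        self.grades = grades

    @property
    def average(self) -> float:
        if not self.grades:
            return 0.0
        return sum(self.grades) / len(self.grades)


s = Student("Lina", 20, [90.0, 80.0])
print(s.name, s.average)  # Lina 85.0
```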
Key idea:
Ask whether the data needs behavior of its own (e.g., “what's my .count()?”).

Option A (fine): keep using dicts
Option B (cleaner): use a small class
Today: we’ll implement one small class to practice.
Exercise: ColumnProfile (10 minutes). Create src/csv_profiler/models.py:
Checkpoint: missing_pct returns a number between 0 and 100.
ColumnProfile

```python
class ColumnProfile:
    def __init__(self, name: str, inferred_type: str, total: int, missing: int, unique: int):
        self.name = name
        self.inferred_type = inferred_type
        self.total = total
        self.missing = missing
        self.unique = unique

    @property
    def missing_pct(self) -> float:
        return 0.0 if self.total == 0 else 100.0 * self.missing / self.total

    def to_dict(self) -> dict[str, str | int | float]:
        return {
            "name": self.name,
            "type": self.inferred_type,
            "total": self.total,
            "missing": self.missing,
            "missing_pct": self.missing_pct,
            "unique": self.unique,
        }

    def __repr__(self) -> str:
        return (
            f"ColumnProfile(name={self.name!r}, type={self.inferred_type!r}, "
            f"missing={self.missing}, total={self.total}, unique={self.unique})"
        )
```

Instead of building a dict per column:
When exporting JSON:
Use properties to compute values derived from other attributes (e.g., first_name) or to validate updates (age).

20 minutes
Typer CLI from type hints
Goal: a profile command for your project. A CLI makes your project easier to run, automate, and share.
Inside your project environment, install Typer:
pip install typer

Quick check:
python -c "import typer; print(typer.__version__)"
name: str is a type hint (also called an annotation). Example:
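For instance (the function name is illustrative):

```python
def describe(name: str, age: int) -> str:  # hints on parameters and return value
    return f"{name} is {age}"


print(describe("Sara", 30))  # Sara is 30
```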
Today we’ll mostly use: str, int, float, and Path.
What does @something mean? The @ symbol applies a decorator. Two decorators you'll see today:
- @property (makes a method act like an attribute)
- @app.command() (registers a function as a CLI command)

Run:
Your CLI will grow commands (profile, validate, version), invoked like:
profile data/sample.csv --out-dir outputs

Tip
In Typer, Python type hints become CLI parsing.
Path objects for file paths
Instead of passing file paths as plain strings, we often use Path objects.
Why use Path?
Why use Path?
- handles separators across OSes (\ vs /)
- helpful methods: .exists(), .mkdir(), .read_text(), .write_text()

Use pathlib.Path for file-path parameters. Inside the function you work with real Path objects, and each parameter still gets a --help description. Why? Good --help text is how users discover your tool.
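A few Path operations in action (writing into a temporary folder so the snippet is safe to run anywhere):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "outputs"
    out.mkdir(parents=True, exist_ok=True)  # create folder(s); no error if present
    report = out / "report.md"              # the / operator joins path parts portably
    report.write_text("# Report\n", encoding="utf-8")
    print(report.exists())                                 # True
    print(report.read_text(encoding="utf-8").strip())      # # Report
```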
What should your CLI do if the input file doesn’t exist?
Preferred: C
Run:
Exercise: the profile command (10 minutes). Create src/csv_profiler/cli.py with:
- app = typer.Typer()
- a profile command:
  - argument: input_path
  - options: --out-dir, --report-name (default report)

Checkpoint: --help shows your options.
```python
from pathlib import Path

import typer

app = typer.Typer()


@app.command(help="Profile a CSV file and write JSON + Markdown")
def profile(
    input_path: Path = typer.Argument(..., help="Input CSV file"),
    out_dir: Path = typer.Option(Path("outputs"), "--out-dir", help="Output folder"),
    report_name: str = typer.Option("report", "--report-name", help="Base name for outputs"),
):
    # implementation comes in hands-on
    typer.echo(f"Input: {input_path}")
    typer.echo(f"Out: {out_dir}")
    typer.echo(f"Name: {report_name}")


if __name__ == "__main__":
    app()
```

Run it with -m. From your project root:
Try:
--help

20 minutes
CSV Profiler — Part 3 (Package + CLI)
Goal: Run one command that generates:
- outputs/<name>.json
- outputs/<name>.md

You need:
Deliverable: CLI works on data/sample.csv.
By the end, you can run:
And you get:
- outputs/report.json
- outputs/report.md

Create empty modules:
- io.py
- profiling.py
- render.py
- cli.py

Checkpoint: the folder tree matches the target structure.
Tip
Windows users: if you don’t have touch, create files from VS Code.
csv.DictReader (2 minutes)
csv.DictReader reads a CSV file and gives you one dictionary per row. Example (prints the first row dict):
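The example can be sketched without a real file by reading from an in-memory string:

```python
import csv
import io

text = "name,age\nSara,30\nAli,25\n"
reader = csv.DictReader(io.StringIO(text))  # one dict per row, keyed by the header
print(next(reader))  # {'name': 'Sara', 'age': '30'}
```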
Exercise: io.py (15 minutes). Create src/csv_profiler/io.py:
- read_csv_rows(path: Path) -> list[dict[str, str]]
- use csv.DictReader

Checkpoint: you can import and call it from a scratch script.
read_csv_rows

```python
import csv
from pathlib import Path


def read_csv_rows(path: Path) -> list[dict[str, str]]:
    """Read a CSV file and return a list of row dictionaries."""
    if not path.exists():
        raise FileNotFoundError(f"CSV not found: {path}")
    with path.open("r", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    if not rows:
        raise ValueError("CSV has no data rows")
    return rows
```

profiling.py (25 minutes). In src/csv_profiler/profiling.py:
- helpers: is_missing, try_float, infer_type
- profile_rows(rows: list[dict[str, str]]) -> dict

Report keys (minimum): n_rows, n_cols, columns (list)

Checkpoint: profile_rows(rows) returns a JSON-serializable dict.
```python
def is_missing(value: str | None) -> bool:
    if value is None:
        return True
    cleaned = value.strip().casefold()
    return cleaned in {"", "na", "n/a", "null", "none", "nan"}


def try_float(value: str) -> float | None:
    try:
        return float(value)
    except ValueError:
        return None


def infer_type(values: list[str]) -> str:
    usable = [v for v in values if not is_missing(v)]
    if not usable:
        return "text"
    for v in usable:
        if try_float(v) is None:
            return "text"
    return "number"
```

set() for unique values: a set keeps only unique items (duplicates are removed). We'll use len(set(...)) to count unique non-missing values in a column.
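For example:

```python
values = ["a", "b", "a", "", "b"]
usable = [v for v in values if v]  # drop the missing (empty) value
print(len(set(usable)))  # 2 unique values: {'a', 'b'}
```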
profile_rows (baseline)

```python
def profile_rows(rows: list[dict[str, str]]) -> dict:
    n_rows, columns = len(rows), list(rows[0].keys())
    col_profiles = []
    for col in columns:
        values = [r.get(col, "") for r in rows]
        usable = [v for v in values if not is_missing(v)]
        missing = len(values) - len(usable)
        inferred = infer_type(values)
        unique = len(set(usable))
        profile = {
            "name": col,
            "type": inferred,
            "missing": missing,
            "missing_pct": 100.0 * missing / n_rows if n_rows else 0.0,
            "unique": unique,
        }
        if inferred == "number":
            nums = [try_float(v) for v in usable]
            nums = [x for x in nums if x is not None]
            if nums:
                profile.update({"min": min(nums), "max": max(nums), "mean": sum(nums) / len(nums)})
        col_profiles.append(profile)
    return {"n_rows": n_rows, "n_cols": len(columns), "columns": col_profiles}
```

render.py (20 minutes). Create src/csv_profiler/render.py:
- render_markdown(report: dict) -> str

Checkpoint: render_markdown(report) returns a multi-line Markdown string.
render_markdown (simple)

```python
from datetime import datetime


def render_markdown(report: dict) -> str:
    lines: list[str] = []
    lines.append("# CSV Profiling Report\n")
    lines.append(f"Generated: {datetime.now().isoformat(timespec='seconds')}\n")
    lines.append("## Summary\n")
    lines.append(f"- Rows: **{report['n_rows']}**")
    lines.append(f"- Columns: **{report['n_cols']}**\n")
    lines.append("## Columns\n")
    lines.append("| name | type | missing | missing_pct | unique |")
    lines.append("|---|---:|---:|---:|---:|")
    lines.extend([
        f"| {c['name']} | {c['type']} | {c['missing']} | {c['missing_pct']:.1f}% | {c['unique']} |"
        for c in report["columns"]
    ])
    lines.append("\n## Notes\n")
    lines.append("- Missing values are: `''`, `na`, `n/a`, `null`, `none`, `nan` (case-insensitive)")
    return "\n".join(lines)
```

cli.py (30 minutes). In src/csv_profiler/cli.py:
Build the profile command:
- call read_csv_rows()
- call profile_rows()
- call render_markdown()
- write into out_dir:
  - <report_name>.json
  - <report_name>.md

Checkpoint: running the command creates both files.
cli.py (working version)

```python
import json
import time
from pathlib import Path

import typer

from csv_profiler.io import read_csv_rows
from csv_profiler.profiling import profile_rows
from csv_profiler.render import render_markdown

app = typer.Typer()


@app.command(help="Profile a CSV file and write JSON + Markdown")
def profile(
    input_path: Path = typer.Argument(..., help="Input CSV file"),
    out_dir: Path = typer.Option(Path("outputs"), "--out-dir", help="Output folder"),
    report_name: str = typer.Option("report", "--report-name", help="Base name for outputs"),
    preview: bool = typer.Option(False, "--preview", help="Print a short summary"),
):
    ...  # (see next slide for this implementation)


if __name__ == "__main__":
    app()
```

cli.py (working version): the body of profile:

```python
    try:
        t0 = time.perf_counter_ns()
        rows = read_csv_rows(input_path)
        report = profile_rows(rows)
        t1 = time.perf_counter_ns()
        report["timing_ms"] = (t1 - t0) / 1_000_000

        out_dir.mkdir(parents=True, exist_ok=True)

        json_path = out_dir / f"{report_name}.json"
        json_path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
        typer.secho(f"Wrote {json_path}", fg=typer.colors.GREEN)

        md_path = out_dir / f"{report_name}.md"
        md_path.write_text(render_markdown(report), encoding="utf-8")
        typer.secho(f"Wrote {md_path}", fg=typer.colors.GREEN)

        if preview:
            typer.echo(f"Rows: {report['n_rows']} | Cols: {report['n_cols']} | {report['timing_ms']:.2f}ms")
    except Exception as e:
        typer.secho(f"Error: {e}", fg=typer.colors.RED)
        raise typer.Exit(code=1)
```

Run:
Then open:
- outputs/report.json
- outputs/report.md

Checkpoint: timing_ms exists in the JSON, and the Markdown table lists all columns.
If you see ModuleNotFoundError: csv_profiler:
- check PYTHONPATH=src
- check that src/csv_profiler/__init__.py exists

If you see encoding errors:
- try encoding="utf-8-sig" for reading

Stretch goals:
- change the --out-dir default to a new folder per run: outputs/2025-12-16_1930/
- add a --fail-on-missing-pct 30 option
- add a version command

You now have: a reusable library (io, profiling, render) plus a CLI on top.
Tomorrow: Streamlit GUI will reuse the same library.
In 1–2 sentences:
What caused your biggest slowdown today: imports, refactoring, or CLI wiring?
Due: before Day 4 starts (Wed, 17 Dec 2025)
Make --help look professional:
- add --delimiter (even if you keep , as the default)

Deliverable: updated project folder with working CLI.
Tip
Keep your changes small and commit-worthy. Even before Day 5, practicing commits helps.