AI Professionals Bootcamp | Week 1
2025-12-15
Goal: Turn yesterday’s script into clean functions and produce a richer profiling report for any CSV.
Bootcamp • SDAIA Academy
pathlib + csv + json

By the end of today, you can:
- define functions with clear parameters (`*args`, `**kwargs`, keyword-only)
- use core built-ins: `enumerate`, `zip`, `sorted`, `any`, `all`, `join()`
- use `pathlib.Path` for safe paths
- read CSVs with `csv.DictReader` and write JSON with `json.dumps()`
- read and write type hints (`name: str`, `-> float | None`)
- use `lambda` for tiny one-off helper functions (mostly with `key=`)

Open your Day 1 project and run it.
Target command (Unix/macOS):
Target command (Windows PowerShell):
Checkpoint: outputs/report.json and outputs/report.md are created.
If you see ModuleNotFoundError: No module named 'csv_profiler', Python can’t find your project package.
Layout A (recommended this week): package folder next to main.py
Run:
Layout B (src layout): package folder inside src/
Run with PYTHONPATH=src:
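The concrete commands were lost in the export; assuming your entry point is `main.py`, the src-layout run would look something like:

```shell
# Unix/macOS
PYTHONPATH=src python main.py

# Windows PowerShell
$env:PYTHONPATH = "src"; python main.py
```

Setting `PYTHONPATH=src` tells Python to also look inside `src/` when resolving imports.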
Tip
We’ll formalize packaging later. For now, pick one layout and keep moving.
You already have (Day 1):
- `rows`: a list of dictionaries (each row looks like `{"age": "19", "name": "Aisha"}`)
- `report.json` and `report.md`

Today you will add:
- type inference: number vs text
- numeric stats: min, max, mean, unique

You don’t need to be a Python expert — but we will use these basics repeatedly:
- lists: `values = []`, `values.append(x)`
- dicts: `row["age"]` and safe access `row.get("age", "")`
- `for` loops: repeat work for each item in a list
- `None`: means “no value” (we’ll use it for “couldn’t parse”)
- imports: `import csv`, `from pathlib import Path`

Functions: your unit of thinking
A function lets you name a piece of logic so you can:
- `def` creates the function
- `(...)` holds the parameters
- `return` sends a value back to the caller

Type hints tell readers (and tooling) what you expect.
- `name: str` means “I expect a string”
- `-> str` describes the return value

Tip
Hints are helpful when a function can return different things, like float | None (“either a float or missing”).
Add type hints (no body changes):
Definition
A function’s side effect is any change to the system’s state during its execution apart from its return value.
Pure-ish (good for profiling):
Side effect (still useful, but keep controlled):
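The two code snippets were lost in the export; a plausible pair (function names are illustrative):

```python
from pathlib import Path

# Pure-ish: same input -> same output, touches nothing outside itself
def count_missing(values: list[str]) -> int:
    return sum(1 for v in values if v.strip() == "")

# Side effect: changes the filesystem; keep calls like this in few places
def save_text(path: Path, text: str) -> None:
    path.write_text(text, encoding="utf-8")
```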
Warning
Try to keep profiling logic mostly “return values”. Keep I/O (reading/writing files) in a small number of places.
Calls:
What is printed?
Answer: 11 then 15
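The quiz code itself was lost in the export; one plausible definition consistent with the stated answer, assuming the quiz was about default parameter values:

```python
def add(a: int, b: int = 1) -> int:
    return a + b

print(add(10))     # b falls back to the default -> 11
print(add(10, 5))  # 15
```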
- `*args`: “many positional arguments”
- `**kwargs`: “many keyword arguments”

Why care?
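A tiny sketch of both (the function name is illustrative):

```python
def describe(*args, **kwargs) -> str:
    # args is a tuple of positionals, kwargs a dict of keyword arguments
    return f"{len(args)} positional, {len(kwargs)} keyword"

print(describe(1, 2, 3, name="Aisha"))  # 3 positional, 1 keyword
```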
- `len`, `sum`, `min`, `max`, `all`, `any`
- `sorted`, `reversed`, `enumerate`, `zip`
- `range`, `iter`, `next`

`enumerate`: index + value

Instead of:
Use:
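The before/after snippets were lost in the export; a minimal reconstruction of the usual pair:

```python
names = ["Aisha", "Omar", "Lina"]

# Instead of manual index bookkeeping:
for i in range(len(names)):
    print(i, names[i])

# Use enumerate to get index + value together:
for i, name in enumerate(names):
    print(i, name)
```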
- `zip`: walk multiple lists together
- `sorted`: keep original list unchanged

Also useful with a key:
`sorted(..., key=..., reverse=True)`: custom ordering

Sometimes you’re sorting pairs like `("value", count)` and you want to sort by the count.
A “pair” here is a 2-item tuple (think “small fixed list”).
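A short sketch of sorting pairs by their count:

```python
pairs = [("yes", 3), ("no", 7), ("na", 1)]

# kv is one ("value", count) tuple; kv[1] is the count
by_count = sorted(pairs, key=lambda kv: kv[1], reverse=True)
print(by_count)  # [('no', 7), ('yes', 3), ('na', 1)]
```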
lambda: tiny functions (mostly for `key=`)

Sometimes you need a “throwaway” helper for a single call.
- syntax: `lambda <inputs>: <expression>`
- for anything longer or reusable, use `def`

sorted (3 minutes)

Sort words by length, longest first:
Checkpoint: ['bootcamp', 'python', 'data', 'ai']
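A solution sketch, assuming a word list that matches the checkpoint:

```python
words = ["python", "ai", "data", "bootcamp"]
print(sorted(words, key=lambda w: len(w), reverse=True))
# ['bootcamp', 'python', 'data', 'ai']
```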
Tip
Your key= function gets one word at a time and returns the sort key.
These work:
But for readability, prefer:
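Both snippets were lost in the export; a plausible pair, assuming the point is that a lambda works but a built-in key function is clearer:

```python
words = ["python", "ai", "data", "bootcamp"]

sorted(words, key=lambda w: len(w))  # works
sorted(words, key=len)               # same result, easier to read
```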
`strip()` + `casefold()`

CSV values arrive as text, and they often include extra spaces or inconsistent capitalization.
We’ll use this idea in is_missing(...).
Write a function:
Treat these as missing (case-insensitive):
- `""`, `"na"`, `"n/a"`
- `"null"`, `"none"`, `"nan"`

Checkpoint: `is_missing(" NA ")` is `True`.
is_missing

Tip
Use casefold() (stronger than lower()) for case-insensitive checks.
try/except: safe parsing (you’ll use this a lot)

CSV values arrive as strings. Converting to a number can fail (example: `"abc"`).
- `try:` runs first
- on a `ValueError`, it jumps to `except:`
- return `None` to mean “couldn’t parse”

Write a safe parser:
Checkpoint: try_float("3.14") → 3.14, try_float("abc") → None
try_float

For a list of strings:
- all usable values parse as floats → "number"
- otherwise → "text"

If a column has values: `["1", "2", "3", "x"]`
number or text?

Answer: text (because `"x"` is not numeric)
You will see these patterns later, but you don’t need them for today’s assignment:
- recursion
- `nonlocal`
- generators (`yield`)
- decorators (`@something`)
- `global` (usually avoid)

Warning
If you don’t need a “fancy tool”, don’t use it yet. Write the simplest working code.
A function that calls itself needs:

- a base case (when to stop)
- progress toward it (each call works on a smaller problem)
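A minimal sketch of both requirements:

```python
def countdown(n: int) -> list[int]:
    if n <= 0:                     # base case: stop
        return []
    return [n] + countdown(n - 1)  # progress: n shrinks each call

print(countdown(3))  # [3, 2, 1]
```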
`nonlocal` (state without globals)

A nested function can “remember” variables from the outer function.
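A small sketch of a closure that keeps state with `nonlocal`:

```python
def make_counter():
    count = 0
    def bump() -> int:
        nonlocal count  # write to the enclosing function's variable
        count += 1
        return count
    return bump

bump = make_counter()
print(bump(), bump())  # 1 2 -- state survives between calls, no global
```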
A decorator takes a function and returns a new function.
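A minimal sketch (the decorator name is illustrative):

```python
def shout(func):
    # takes a function, returns a new function that post-processes its result
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name: str) -> str:
    return f"hello {name}"

print(greet("aisha"))  # HELLO AISHA
```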
`global` (exists, but avoid it)

Why avoid?
Bad:
Better:
Warning
A default value is evaluated and created only once, when the function is defined, not on each call.
Why mention it?
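This is the classic mutable-default pitfall; a sketch of the bad and better versions (function names are illustrative):

```python
def collect_bad(x, bucket=[]):       # ONE list, created when `def` runs
    bucket.append(x)
    return bucket

def collect_good(x, bucket=None):    # fresh list on every call
    if bucket is None:
        bucket = []
    bucket.append(x)
    return bucket

collect_bad(1)
print(collect_bad(2))   # [1, 2] -- the default list is shared!
collect_good(1)
print(collect_good(2))  # [2]
```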
Recap:

- functions with clear parameters (`*args`, `**kwargs`, keyword-only)
- built-ins (`enumerate`, `zip`, `sorted`) to write less code
- helpers: `is_missing`, `try_float`, `infer_type`

20 minutes
Strings → Markdown reports
Your strategy:
- build `lines: list[str]` (a list of strings)
- join once at the end: `text = "\n".join(lines) + "\n"`

Notes:
- `lines.append("...")` adds one line
- `lines.extend(other_lines)` adds many lines (when a helper returns a list of lines)
This avoids painful string concatenation.
Example output:
- Rows: 1,234,567
- Missing: 3.5%

Format specs:
- `:,` → thousands separators
- `.2f` → 2 decimals
- `.1%` → percent with 1 decimal
- `>10` / `<10` → align width

Example:
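The example snippet was lost in the export; a sketch covering all four format specs:

```python
rows = 1_234_567
missing_pct = 0.035
mean = 3.14159

print(f"Rows: {rows:,}")              # Rows: 1,234,567
print(f"Missing: {missing_pct:.1%}")  # Missing: 3.5%
print(f"Mean: {mean:.2f}")            # Mean: 3.14
print(f"|{'name':>10}|")              # right-aligned in a width of 10
```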
Python ships with lots of useful modules (CSV, JSON, dates…).
Two common styles:
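The two styles were lost in the export; presumably the usual pair:

```python
# Style 1: import the module, use dotted names
import csv
reader_cls = csv.DictReader

# Style 2: import specific names directly
from pathlib import Path
p = Path("outputs")
```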
Create a function that returns a markdown header block:
Must include:
- the title line `# CSV Profiling Report`
- a timestamp (`datetime.now().isoformat(timespec="seconds")`)

md_header

Rules:
- a `---` separator (optional but nice)

Write a function to render the table header:
It should return:
md_table_header

Tip
`type` is a built-in function in Python (not a reserved keyword, but shadowing it hides the built-in). When you need it as a variable name, a common convention is to append an underscore (e.g., `type_`) to avoid the conflict while keeping the name meaningful.
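The expected output of the two helpers was lost in the export; a sketch, assuming the table uses the five columns that `md_col_row` fills in later (name, type, missing, missing %, unique):

```python
from datetime import datetime

def md_header(source: str) -> list[str]:
    return [
        "# CSV Profiling Report",
        f"- Source: `{source}`",
        f"- Generated: {datetime.now().isoformat(timespec='seconds')}",
        "",
        "---",
        "",
    ]

def md_table_header() -> list[str]:
    # assumed columns; match whatever your md_col_row emits
    return [
        "| column | type | missing | missing % | unique |",
        "| --- | --- | --- | --- | --- |",
    ]
```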
Write:
md_bullets

When working with CSVs:
- `strip()` to remove whitespace
- `casefold()` for robust lowercasing
- `replace()` to normalize text
- `split()` to break things apart
- `join()` to assemble output

What is the output?
Answer: na
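The quiz snippet was lost in the export; one expression consistent with the stated answer:

```python
print(" NA ".strip().casefold())  # na
```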
Recap:

- build `lines`, then `"\n".join(lines)`
- markdown helpers (`md_header`, `md_table_header`, `md_col_row`)

20 minutes
Files + pathlib + formats
- `pathlib.Path` instead of fragile string paths
- `csv.DictReader`
- safe file handling (`mkdir`, `encoding`, `newline`)

`pathlib.Path` is your default

Why?
Why do we use mkdir(parents=True, exist_ok=True)?
A. It deletes old folders
B. It creates folders if missing, and doesn’t crash if they exist
C. It makes the file smaller
Answer: B
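A minimal sketch of the pattern:

```python
from pathlib import Path

out = Path("outputs")
out.mkdir(parents=True, exist_ok=True)  # creates if missing, no crash if it exists
(out / "report.json").write_text("{}", encoding="utf-8")
```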
with blocks: open files safely

When you open a file, you should close it. The `with ... as ...:` pattern closes it automatically (even if something errors).
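A short sketch of the pattern:

```python
from pathlib import Path

path = Path("notes.txt")

with path.open("w", encoding="utf-8") as f:
    f.write("hello\n")
# the file is closed here, even if something inside the block raised

with path.open(encoding="utf-8") as f:
    print(f.read())  # hello
```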
DictReader

Notes:
- `newline=""` is recommended for CSV files

Write a helper to get column names from rows:
Rules:
- `rows` is empty → return `[]`

get_columns

Keep it simple and stable:
Tip
A clear schema makes your CLI and Streamlit app easy later.
- `base64`: text encoding of bytes (useful for transport)
- `pickle`: Python-only serialization

Warning
Never unpickle data from someone you don’t trust. Pickle can execute code during loading.
Recap:

- `Path` for paths + folder creation
- `csv.DictReader`
- `json.dumps(..., indent=2)`

20 minutes
Start the project: CSV Profiler (Part 2)
By the end of the day:
- infer each column’s type: number vs text
- per-column stats: count, missing, unique, min, max, mean
- `outputs/report.json`
- `outputs/report.md` (with a summary + a table + per-column details)

If you already know Git:
If you don’t know Git yet: skip this. We’ll cover Git later this week.
In csv_profiler/profile.py (or src/csv_profiler/profile.py if you’re using the src layout), add:
- `is_missing(value)`
- `try_float(value)`
- `infer_type(values)`

Checkpoint: you can call these from a Python REPL and they behave correctly.
```python
MISSING = {"", "na", "n/a", "null", "none", "nan"}

def is_missing(value: str | None) -> bool:
    if value is None:
        return True
    return value.strip().casefold() in MISSING

def try_float(value: str) -> float | None:
    try:
        return float(value)
    except ValueError:
        return None

def infer_type(values: list[str]) -> str:
    usable = [v for v in values if not is_missing(v)]
    if not usable:
        return "text"
    for v in usable:
        if try_float(v) is None:
            return "text"
    return "number"
```

Add a helper:
Rules:
- use `row.get(col, "")` (a missing key becomes `""`)

column_values

Implement:
Requirements:
- return `count`, `missing`, `unique`, `min`, `max`, `mean`
- `usable = [v for v in values if not is_missing(v)]`
- `nums = [try_float(v) for v in usable]`
- if any value parses to `None` → treat the column as text elsewhere (don’t call this function)
- `count = len(nums)`
- `unique = len(set(nums))`
- `min(nums)`, `max(nums)`, `sum(nums)/count`

Tip
set(nums) removes duplicates. So len(set(nums)) is “how many distinct values?”
numeric_stats (example)

```python
def numeric_stats(values: list[str]) -> dict:
    usable = [v for v in values if not is_missing(v)]
    missing = len(values) - len(usable)
    nums: list[float] = []
    for v in usable:
        x = try_float(v)
        if x is None:
            raise ValueError(f"Non-numeric value found: `{v}`")
        nums.append(x)
    count = len(nums)
    unique = len(set(nums))
    return {
        "count": count,
        "missing": missing,
        "unique": unique,
        "min": min(nums) if nums else None,
        "max": max(nums) if nums else None,
        "mean": (sum(nums) / count) if count else None,
    }
```

Implement:
Requirements:
- return `count`, `missing`, `unique`
- `top`: the `top_k` most common values with counts

You can implement counts with a dict:
Then sort by count descending:
text_stats (example)

```python
def text_stats(values: list[str], top_k: int = 5) -> dict:
    usable = [v for v in values if not is_missing(v)]
    missing = len(values) - len(usable)
    counts: dict[str, int] = {}
    for v in usable:
        counts[v] = counts.get(v, 0) + 1
    top_items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    top = [{"value": v, "count": c} for v, c in top_items]
    return {
        "count": len(usable),
        "missing": missing,
        "unique": len(counts),
        "top": top,
    }
```

basic_profile (20 minutes)

Update `basic_profile(rows)` so it returns:
- `source` (path optional for now)
- `summary`: rows, columns
- `columns`: a dict keyed by column name with:
  - `type` (number/text)
  - the stats from `numeric_stats` or `text_stats`

Checkpoint: `report["columns"]["age"]["type"] == "number"` (for your sample data).
basic_profile (example)

```python
def basic_profile(rows: list[dict[str, str]]) -> dict:
    cols = get_columns(rows)
    report = {
        "summary": {
            "rows": len(rows),
            "columns": len(cols),
            "column_names": cols,
        },
        "columns": {},
    }
    for col in cols:
        values = column_values(rows, col)
        type_ = infer_type(values)
        if type_ == "number":
            stats = numeric_stats(values)
        else:
            stats = text_stats(values)
        report["columns"][col] = {"type": type_, **stats}
    return report
```

Tip
`**stats` unpacks a dict, copying all its items into a new dict. The same can be done for lists (or any iterable) with a single asterisk (e.g., `["Majid", *names, "Sami"]`).
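A quick illustration of both unpacking forms:

```python
stats = {"count": 3, "missing": 1}
col = {"type": "number", **stats}
print(col)  # {'type': 'number', 'count': 3, 'missing': 1}

names = ["Aisha", "Omar"]
padded = ["Majid", *names, "Sami"]
print(padded)  # ['Majid', 'Aisha', 'Omar', 'Sami']
```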
write_markdown (25 minutes)

In `csv_profiler/render.py` (or `src/csv_profiler/render.py` if you’re using the src layout), update `write_markdown(report, path)` to include:
- the header block (`md_header(...)`)
- a summary section
- the columns table
- per-column details

write_markdown (example, simple)

```python
from pathlib import Path

def write_markdown(report: dict, path: str | Path) -> None:
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    rows = report["summary"]["rows"]
    lines: list[str] = []
    lines.extend(md_header("data/sample.csv"))
    lines.append("## Summary")
    lines.append(f"- Rows: {rows:,}")
    lines.append(f"- Columns: {report['summary']['columns']:,}")
    lines.append("")
    lines.append("## Columns (table)")
    lines.extend(md_table_header())
    for name, col in report["columns"].items():
        missing_pct = (col["missing"] / rows) if rows else 0.0
        lines.append(md_col_row(name, col["type"], col["missing"], missing_pct, col["unique"]))
    lines.append("")
    lines.append("## Column details")
    for name, col in report["columns"].items():
        lines.append(f"### `{name}` ({col['type']})")
        if col["type"] == "number":
            lines.append(f"- min: {col['min']}")
            lines.append(f"- max: {col['max']}")
            lines.append(f"- mean: {col['mean']}")
        else:
            top = col.get("top", [])
            if not top:
                lines.append("- (no non-missing values)")
            else:
                lines.append("- top values:")
                for item in top:
                    lines.append(f"  - `{item['value']}`: {item['count']}")
        lines.append("")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Tip
Keep it “simple but correct” today. We’ll polish the report formatting later.
Run:
- if `csv_profiler/` is next to `main.py`: run from the project root
- if you have a `src/` folder: run with `PYTHONPATH=src`
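The run commands were lost in the export; assuming the entry point is `main.py`, they would look like:

```shell
# Layout A: csv_profiler/ next to main.py
python main.py

# Layout B: src/ layout
PYTHONPATH=src python main.py
```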
Checkpoint:

- `outputs/report.json` has types and stats
- `outputs/report.md` has a table and details
Is `" "` (a single space) being treated as missing?

Pick one:
- `median` (hint: sort and pick middle)
- a `--input` / `--output` argument using argparse (Typer comes tomorrow)

In 1–2 sentences:
What made your profiler “better” today compared to Day 1?
Due: before Day 3 starts (Tue, 16 Dec 2025)
- `data/sample.csv` (one numeric, one text column)
- `.2f` where relevant

Deliverable: a zip or folder with your updated `csv-profiler/`.