AI Professionals Bootcamp | Week 1
2025-12-17
Goal: Build a GUI for your CSV Profiler that can load data, preview results, and export JSON + Markdown.
Bootcamp • SDAIA Academy
Today you'll use:

- your csv_profiler package
- httpx for loading CSVs from URLs + better error UX (stretch)

By the end of today, you can:

- build a Streamlit GUI that loads a CSV (upload, or URL via httpx as a stretch)
- profile it with profile_rows() and preview it with render_markdown()
- save reports to outputs/ on disk (local run)
- run everything with the right PYTHONPATH

Your code is split into:

- src/csv_profiler/ → your package (profiling + rendering logic)
- app.py → your Streamlit script (UI only)

Because the package is inside src/, we run commands with PYTHONPATH=src so Python can import it.
macOS/Linux
Windows PowerShell
Tip
If your project does not have a src/ folder (flat layout), you can usually omit PYTHONPATH=src.
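To see why PYTHONPATH=src works: entries in PYTHONPATH are prepended to Python's import search path (sys.path) at startup. Here is a runnable simulation of that mechanism (the package name and folder are made up for illustration):

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temp folder, standing in for src/csv_profiler
pkg_dir = Path(tempfile.mkdtemp()) / "demopkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text('GREETING = "hello"\n')

# PYTHONPATH=<folder> effectively does this before your script runs:
sys.path.insert(0, str(pkg_dir.parent))

import demopkg  # now importable, just like csv_profiler with PYTHONPATH=src
assert demopkg.GREETING == "hello"
```

The same idea is why the import fails without PYTHONPATH: src/ is simply not on sys.path.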
Run your Day 3 CLI to confirm your profiler still works.
macOS/Linux
Windows PowerShell
Checkpoint: you still get:

- outputs/report.json
- outputs/report.md

--preview just prints a small preview so you know the CLI is reading the file correctly.
You already have:
- src/csv_profiler/ (the package)
- a CLI: python -m csv_profiler.cli profile ...

Today you add:

- app.py (Streamlit GUI)

Tomorrow you add:

- git + GitHub submission (deadline tomorrow 11:59pm)

Streamlit basics (how it thinks)

- run with uv
- input widgets: st.file_uploader, st.button, st.checkbox
- display helpers: st.write, st.table, st.json

Streamlit lets you build a web app with only Python.
You write:

- one Python script (app.py)

You get:

- a local web app (http://localhost:8501)

Every user interaction triggers a rerun:
Implications:

- your script reruns top-to-bottom on every interaction
- use st.session_state to remember results across reruns

From your project folder, install Streamlit:

uv pip install streamlit

Then run:

PYTHONPATH=src streamlit run app.py
Tip
If PYTHONPATH=src feels annoying, you can write a small run script later. For today, just use it.
Create app.py in the project root:
You should see a page with a title.
Question: If you edit app.py and save… what happens?
A. The page updates automatically
B. You must restart Streamlit
C. It updates only if you refresh the browser
Answer: Usually A (Streamlit hot-reloads on save), but sometimes you'll need to refresh the browser.
Most Streamlit apps use a sidebar for inputs:
Rule of thumb: Inputs in sidebar, results in main area.
When we parse a CSV, we'll store it as:

- rows → a list
- each item in rows → a dict (one CSV row)

Example:

Accessing a value from a dict (by key):
A list can be “sliced” to take a smaller piece:
Also, list items are accessed by index (starting from 0):
We’ll use slicing for previews, and sometimes rows[0] to look at the first row.
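Putting those three ideas together in one runnable sketch (the column names and values are invented for illustration):

```python
# rows: a list of dicts, one dict per CSV row (keys come from the header)
rows = [
    {"name": "Sara", "city": "Riyadh"},
    {"name": "Omar", "city": "Jeddah"},
    {"name": "Lina", "city": "Dammam"},
]

first = rows[0]                    # index access: the first row
assert first["city"] == "Riyadh"   # dict access by key

preview = rows[:2]                 # slice: a smaller list for previews
assert len(preview) == 2
assert preview[-1]["name"] == "Omar"
```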
If you have Python objects:
Streamlit can still display them:
st.file_uploader

This widget returns an uploaded file object (or None):
- uploaded.getvalue() returns bytes (raw file data)
- csv.DictReader(...) expects text (a string)

So we decode bytes → text with .decode(...):
"utf-8-sig" helps with CSVs exported from Excel.
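A tiny runnable check of why "utf-8-sig" matters (the byte string simulates an Excel export with a BOM):

```python
# Excel-exported CSVs often start with a UTF-8 BOM (bytes EF BB BF)
raw = b"\xef\xbb\xbfname,age\nSara,30\n"   # simulated uploaded.getvalue()

assert raw.decode("utf-8").startswith("\ufeff")    # plain utf-8 keeps the BOM
assert raw.decode("utf-8-sig").startswith("name")  # utf-8-sig strips it
```

If the BOM survives, your first column name becomes "\ufeffname" and key lookups silently fail.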
We'll:

- wrap the decoded text in a file-like object (StringIO)
- use csv.DictReader to get one dictionary per row

Checkpoint: rows is a list[dict[str, str]].
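The whole pipeline in one self-contained sketch (the sample data is invented):

```python
import csv
from io import StringIO

# bytes (like uploaded.getvalue()) -> text -> file-like object -> dict rows
data = b"name,city\nSara,Riyadh\nOmar,Jeddah\n"
text = data.decode("utf-8-sig")
rows = list(csv.DictReader(StringIO(text)))

assert rows == [
    {"name": "Sara", "city": "Riyadh"},
    {"name": "Omar", "city": "Jeddah"},
]
```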
If you get decoding errors:

- try "utf-8-sig" (common with Excel exports)

Question: Did it rerun? How do you know?
Create app.py that:
Checkpoint: App runs and shows UI.
Run:
Add:

- st.file_uploader(..., type=["csv"])

Checkpoint: Uploading a CSV shows a preview.
```python
import csv
from io import StringIO

import streamlit as st

st.set_page_config(page_title="CSV Profiler", layout="wide")
st.title("CSV Profiler")

uploaded = st.file_uploader("Upload a CSV", type=["csv"])
show_preview = st.checkbox("Show preview", value=True)

if uploaded is not None:
    text = uploaded.getvalue().decode("utf-8-sig")
    rows = list(csv.DictReader(StringIO(text)))
    st.write("Filename:", uploaded.name)
    st.write("Rows loaded:", len(rows))
    if show_preview:
        st.write(rows[:5])
else:
    st.info("Upload a CSV to begin.")
```

st.columns + metric

st.columns(n) creates side-by-side containers you can write into.
We’ll use this to make summaries easier to read.
After loading rows:

- show n_rows
- show n_cols

Checkpoint: numbers match what you expect.

Recap:

- st.sidebar is great for inputs
- parse uploads with csv + StringIO

20 minutes
Connect Streamlit → your csv_profiler package
Bad (hard to test):
- profiling logic written directly in app.py

Good (reusable):

- csv_profiler/ handles reading/profiling/rendering
- app.py only handles inputs + display

Today we'll reuse:

- csv_profiler.profiling.profile_rows
- csv_profiler.render.render_markdown

Because your package lives in src/, you must run with:

PYTHONPATH=src streamlit run app.py

Windows PowerShell:

$env:PYTHONPATH = "src"; streamlit run app.py
Warning
If you see ModuleNotFoundError: csv_profiler, it’s almost always PYTHONPATH.
Profiling can be expensive.
Pattern:

- compute the report once, on a button click
- keep the result in st.session_state

Show:
Streamlit pattern: use an expander to hide “too much detail”:
The with ...: block means: “put the UI elements inside this expander.”
profile_rows() (15 minutes)

In app.py:

- parse the upload into rows
- on button click, store report in session state

Checkpoint: Find n_rows and n_cols in report.
```python
from csv_profiler.profiling import profile_rows

if uploaded is not None:
    text = uploaded.getvalue().decode("utf-8-sig")
    rows = list(csv.DictReader(StringIO(text)))
    if st.button("Generate report"):
        st.session_state["report"] = profile_rows(rows)

report = st.session_state.get("report")
if report is not None:
    st.write("Rows:", report["n_rows"])
    st.write("Cols:", report["n_cols"])
```

Import and use:
Show Markdown preview:

st.markdown(render_markdown(report))

Checkpoint: You see headings + a columns table.
Tip
If it looks too long, put the preview in an expander.
Students often confuse “download” vs “save”.
For a local app, do both:
Download buttons:
When report exists:

- build json_text with json.dumps(..., indent=2, ensure_ascii=False)
- build md_text with render_markdown(report)

Checkpoint: you can download both files.
```python
import json

from csv_profiler.render import render_markdown

if report is not None:
    json_text = json.dumps(report, indent=2, ensure_ascii=False)
    md_text = render_markdown(report)

    l, r = st.columns(2)
    l.download_button("Get JSON", data=json_text, file_name="report.json")
    r.download_button("Get Markdown", data=md_text, file_name="report.md")
```

Use pathlib.Path:
Then show success with st.success(...).
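A minimal sketch of the save step outside Streamlit (the report dict here is a stand-in; your real report comes from profile_rows):

```python
import json
from pathlib import Path

report = {"n_rows": 2, "n_cols": 3}          # stand-in report
out_dir = Path("outputs")
out_dir.mkdir(parents=True, exist_ok=True)   # no error if it already exists

path = out_dir / "report.json"
path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")

# Round-trip check: what we saved is what we can load back
assert json.loads(path.read_text(encoding="utf-8")) == report
```

parents=True and exist_ok=True make the mkdir call safe to run on every rerun.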
Question: Why is “save to disk” sometimes a bad idea in a deployed web app?
Answer: The server may be shared, ephemeral, or read-only. For this bootcamp, you run locally, so it’s fine.
Recap:

- remember results across reruns with st.session_state

20 minutes
Optional: httpx + better error UX
- httpx.get() to fetch CSVs from the web
- better error UX (st.error, st.warning, st.stop)

Use cases:
Rule: Always validate the URL and handle failures.
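One way to sketch that validation with the standard library (the helper name is made up for illustration):

```python
from urllib.parse import urlparse

def looks_like_http_url(url: str) -> bool:
    """Cheap sanity check before fetching: http(s) scheme plus a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

assert looks_like_http_url("https://example.com/data.csv")
assert not looks_like_http_url("ftp://example.com/data.csv")
assert not looks_like_http_url("not a url")
```

This only catches obviously malformed input; network failures still need try/except around the actual request.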
Then in Python:
What this gives you:
Same trick as upload:
Checkpoint: rows is a list of dictionaries.
Use:
- st.error("...") for blocking problems
- st.warning("...") for non-blocking issues
- st.stop() to stop the current run cleanly

Example:
try/except (catch failures)

When we do a network request, it can fail:
We don’t want the whole app to crash, so we catch the error and show a friendly message:
e is the error object. str(e) turns it into a readable message.
In the sidebar:

- a checkbox to enable URL mode and a text input for the URL
- use httpx.get() to fetch

Checkpoint: A valid URL produces a report.
```python
import csv
from io import StringIO

import httpx
import streamlit as st

use_url = st.sidebar.checkbox("Load from URL", value=False)
url = ""
if use_url:
    url = st.sidebar.text_input("CSV URL", placeholder="https://.../data.csv")

if use_url:
    if url == "":
        st.warning("Paste a URL to load a CSV.")
        st.stop()
    try:
        r = httpx.get(url, timeout=10.0)
        r.raise_for_status()
        text = r.text
        rows = list(csv.DictReader(StringIO(text)))
    except Exception as e:
        st.error("Failed to load URL: " + str(e))
        st.stop()
```

If profiling takes time, cache the result.
Then call cached_profile(rows).
The line starting with @ is a decorator: it changes how the function runs (here: remembers results). You don’t need to write your own decorators today.
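Rather than depend on Streamlit here, the "remember results" idea can be shown with plain Python using the standard library's functools.lru_cache (note: unlike Streamlit's caching, lru_cache requires hashable arguments, so it can't take a list of rows directly — this is only an illustration of what a caching decorator does):

```python
import functools

calls = {"n": 0}

@functools.lru_cache(maxsize=None)   # decorator: wraps the function with a cache
def slow_square(x: int) -> int:
    calls["n"] += 1                  # count how often the body actually runs
    return x * x

assert slow_square(4) == 16
assert slow_square(4) == 16          # second call is served from the cache
assert calls["n"] == 1               # the body ran only once
```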
Warning
Caching can hide bugs when you change code. Use it only after your logic works.
Question: Which of these is worth caching?

A. HTTP requests
B. Disk reads
C. Profiling computation
D. All of the above
Answer: D. At minimum: HTTP requests.
- httpx.get(url, timeout=...) + raise_for_status() is a good baseline

20 minutes
CSV Profiler — Part 4: Streamlit App
By the end of the lab, your project has:
- app.py (Streamlit GUI)
- reports saved under outputs/

You should be able to demo in 60 seconds.
Your Streamlit app must:
- run with uv (same environment as the CLI)

Commands you should be able to run:
app.py scaffold (10 minutes)

app.py should contain:

- st.set_page_config(...)

Checkpoint: App starts and looks clean.
Add:

- uploaded = st.file_uploader(...)
- parse into rows (a list of dictionaries)

Checkpoint: preview shows reasonable values.
```python
import csv
from io import StringIO

import streamlit as st

uploaded = st.file_uploader("Upload a CSV", type=["csv"])
show_preview = st.sidebar.checkbox("Show preview", value=True)

if uploaded is not None:
    text = uploaded.getvalue().decode("utf-8-sig")
    rows = list(csv.DictReader(StringIO(text)))
    if show_preview:
        st.subheader("Preview")
        st.write(rows[:5])
else:
    st.info("Upload a CSV to begin.")
```

When rows exist:
- call profile_rows(rows)
- store the result in st.session_state["report"]

Checkpoint: report summary displays rows/cols.
```python
from csv_profiler.profiling import profile_rows

# set rows = None near the top of app.py so this check never raises NameError
if rows is not None and len(rows) > 0:
    if st.button("Generate report"):
        st.session_state["report"] = profile_rows(rows)

report = st.session_state.get("report")
if report is not None:
    cols = st.columns(2)
    cols[0].metric("Rows", report["n_rows"])
    cols[1].metric("Columns", report["n_cols"])
```

Display:
- report["columns"] in a readable format
- render_markdown(report) (prefer an expander)

Checkpoint: Markdown contains the table.
Add exports:

- download buttons
- save to outputs/

UI suggestion:

- sidebar text input for report_name (default: report)

Checkpoint: report.json & report.md in outputs/.
```python
import json
from pathlib import Path

from csv_profiler.render import render_markdown

if report is not None:
    report_name = st.sidebar.text_input("Report name", value="report")
    json_file = report_name + ".json"
    json_text = json.dumps(report, indent=2, ensure_ascii=False)
    md_file = report_name + ".md"
    md_text = render_markdown(report)

    c1, c2 = st.columns(2)
    c1.download_button("Download JSON", data=json_text, file_name=json_file)
    c2.download_button("Download Markdown", data=md_text, file_name=md_file)

    if st.button("Save to outputs/"):
        out_dir = Path("outputs")
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / json_file).write_text(json_text, encoding="utf-8")
        (out_dir / md_file).write_text(md_text, encoding="utf-8")
        st.success("Saved outputs/" + json_file + " and outputs/" + md_file)
```

Add friendly errors:
- when loading or parsing fails → st.error(...)
- when rows exists but the first row has no columns (no headers detected) → st.warning(...)

Checkpoint: app never crashes with a Python traceback for these cases.
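The "no headers detected" check can be tested outside Streamlit. One sketch (the helper name is made up): csv.DictReader exposes fieldnames, which is None when the input is completely empty.

```python
import csv
from io import StringIO

def parse_rows(text: str):
    """Return (rows, fieldnames); fieldnames is None for an empty upload."""
    reader = csv.DictReader(StringIO(text))
    rows = list(reader)
    return rows, reader.fieldnames

rows, fields = parse_rows("name,city\nSara,Riyadh\n")
assert fields == ["name", "city"] and len(rows) == 1

rows, fields = parse_rows("")    # empty file: no header line at all
assert fields is None and rows == []
```

In the app, a None or empty fieldnames is the signal for your st.warning branch.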
Run:
Verify:
Problem: ModuleNotFoundError: csv_profiler
Fix: run with PYTHONPATH=src (and be in the project root)
Problem: Streamlit command not found
Fix: uv pip install streamlit
Problem: Upload works, but profiling fails
Fix: print a sample row with st.write(rows[0]) to inspect keys/values
Stretch ideas:

- sort columns by missing_pct and show the top 5
- highlight number columns
- show report["timing_ms"] if present (or measure it in the app)

You now have:
Tomorrow:
In 1–2 sentences:
What part of Streamlit felt most “different” from normal Python scripts?
Due: before Day 5 starts (Thu, 18 Dec 2025)
- polish the layout with st.expander, st.tabs, or st.columns

Deliverable: updated project folder (ready to be committed + pushed tomorrow).