AI Pros Bootcamp – Module 2 - Session 2: Loss Functions and Optimizers

The Training Sequence

Three lines that make learning happen:

Measure - How wrong are we?
Diagnose - What caused the error?
Update - How do we fix it?

As you move through this professional certificate, you’ll see these three lines come up again and again and again during training. They’re simple to write, but a lot happens behind the scenes. PyTorch will take care of the complex math for you, but in this session, you’ll take a quick look under the hood to understand what each line really does and why the order matters.

These three lines will work together in this sequence: Measure, Diagnose, Update. First, you’ll measure how wrong your predictions are, and that’s your loss - this one number that sums up all of your mistakes. Then, backward diagnoses the problem. It examines how each weight in your model contributed to the error, which weights made things worse, and by how much. Finally, optimizer.step updates the weights. It uses those diagnostic scores to adjust each parameter. The weights with the bigger problems will get bigger corrections.

Step 1: Measuring Loss

Loss function: compares predictions to true answers

Higher number = more wrong

Goal: minimize loss

Measuring Error for Regression Tasks

For regression tasks (predicting numbers): temperature, price, distance

Average error = \(\frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)\)

Prediction (minutes)	Actual (minutes)	Error
6	4	\(6 - 4 = 2\)
3	5	\(3 - 5 = -2\)

Problem: Average error = 0 (mistakes cancel out!)

Back in the delivery company example, you used the model to predict delivery time based on distance. Now, let’s say your model makes two predictions. The first prediction: the model predicted 6 minutes, but the actual time was 4. For the second prediction, say it predicted 3 minutes, but the actual time was 5.

To figure out how far off you were, you subtract the target from the prediction, and that gives you the error. It’s like measuring the distance between your guess and reality. So how wrong was your model overall? Well, loss here is the average error across all of those predictions. Think of it like a report card for your model. The lower the score, the closer your predictions are to reality on average.

But if you average the raw errors, in this case +2 and -2, divided by two because there are two of them, you get zero. It looks like perfect performance, even though both predictions were actually wrong.
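The cancellation can be checked in a few lines. This is a minimal sketch in plain Python using the two delivery predictions above; the variable names are illustrative and not taken from the course code.

```python
# Predictions and actual delivery times (minutes) from the example above
predictions = [6, 3]
actuals = [4, 5]

# Raw error for each prediction: prediction minus target
errors = [p - a for p, a in zip(predictions, actuals)]

# Averaging raw errors lets positive and negative mistakes cancel
average_error = sum(errors) / len(errors)

print(errors)         # [2, -2]
print(average_error)  # 0.0
```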

Mean Squared Error Loss

Squaring:

Gets rid of minus signs (all mistakes count)
Makes bigger mistakes matter more

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 \]
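As a rough illustration of both effects of squaring, the sketch below computes MSE by hand in plain Python. The helper name `mse` and the second comparison (errors of 2 and 2 versus 1 and 3) are made up for this example.

```python
def mse(predictions, actuals):
    """Mean squared error: average of squared (prediction - target)."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared) / len(squared)

# Delivery example: errors of +2 and -2 no longer cancel
delivery_mse = mse([6, 3], [4, 5])
print(delivery_mse)  # 4.0

# Hypothetical comparison: same total absolute error (4 minutes),
# but one larger miss is penalized more than two medium ones
even_mse = mse([2, 2], [0, 0])    # errors 2 and 2
uneven_mse = mse([1, 3], [0, 0])  # errors 1 and 3
print(even_mse, uneven_mse)  # 4.0 5.0
```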

Cross-Entropy Loss

For classification tasks (predicting categories): digit, animal, word

Model outputs: confidence scores (probabilities) for each class

All scores sum to 100%

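\[ \text{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c) \]

Here \(C\) is the number of classes, \(y_c\) is 1 for the true class and 0 otherwise, and \(\hat{y}_c\) is the predicted probability for class \(c\).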

How Cross-Entropy Works

Punishes overconfident wrong answers

95% sure it’s a 7, but it’s actually a 3 → very high loss
55% sure it’s a 7, but it’s actually a 3 → smaller loss

Goal: Confident about right answers, unsure about wrong ones
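To put rough numbers on the two cases above, here is a minimal sketch in plain Python. It assumes, purely for illustration, that all the probability not given to "7" lands on the true class "3"; the slides do not say how the remaining probability is spread. The helper name `cross_entropy` is hypothetical.

```python
import math

def cross_entropy(probabilities, true_class):
    """Cross-entropy for one example with a one-hot target:
    only the probability assigned to the true class contributes."""
    return -math.log(probabilities[true_class])

# Assumption for illustration: the remaining probability sits on the true class 3
overconfident = {7: 0.95, 3: 0.05}
hesitant = {7: 0.55, 3: 0.45}

loss_overconfident = cross_entropy(overconfident, true_class=3)
loss_hesitant = cross_entropy(hesitant, true_class=3)

print(round(loss_overconfident, 3))  # 2.996
print(round(loss_hesitant, 3))       # 0.799
```

The more confident the wrong answer, the less probability is left for the true class, and the loss grows quickly as that probability approaches zero.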

Step 2: Diagnosing the Problem

Backward calculates gradients

Gradients = diagnostic scores for each parameter

Positive gradient → increasing weight makes loss worse
Negative gradient → increasing weight helps
Large gradient → big influence
Small gradient → barely mattered

In the last video, you saw how loss functions measure how right or how wrong your predictions are, and this is the first critical step in every training loop. Now let’s take a look at the next two steps.

After measuring how wrong your predictions are with the loss function, the next step is to figure out what caused that loss, and that’s what Backward does. Think back to how neural networks work. Each neuron takes inputs, multiplies each by a weight, adds them up with a bias, and then passes that through an activation function.

Even in a small neural network, you’re potentially dealing with thousands of weights and biases. Consider a simple example: a network with 784 inputs, 128 hidden neurons, and 10 output classes. It has over 100,000 weights (784 × 128 = 100,352) between the inputs and the hidden layer, plus 128 biases for the hidden layer. There are 1,280 weights between the hidden layer and the output, and 10 biases for the final output layer. That’s a total of 101,770 trainable parameters.
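The parameter count can be reproduced with simple arithmetic. The sketch below uses plain Python and illustrative variable names.

```python
inputs, hidden, outputs = 784, 128, 10

hidden_weights = inputs * hidden    # 100,352 weights into the hidden layer
hidden_biases = hidden              # 128 biases, one per hidden neuron
output_weights = hidden * outputs   # 1,280 weights into the output layer
output_biases = outputs             # 10 biases, one per output class

total = hidden_weights + hidden_biases + output_weights + output_biases
print(total)  # 101770
```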

So Backward acts a little bit like a detective. It looks at every single weight and bias and asks: “How much did you contribute to the loss?” These diagnostic scores are called gradients. They tell you not just who contributed to the error, but how much and in what direction.

What Backward Does NOT Do

Backward does NOT update weights

It only calculates gradients

Updates happen later with optimizer.step()
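A small PyTorch sketch can make this concrete. The one-weight "model" below (prediction = w × 2, target 4) is a toy set up for this example, not something from the course.

```python
import torch

# One weight, tracked by autograd
w = torch.tensor(1.0, requires_grad=True)

# Toy model: prediction = w * 2, target = 4, squared-error loss
prediction = w * 2.0
loss = (prediction - 4.0) ** 2

loss.backward()  # Diagnose: fills in w.grad

print(w.item())       # 1.0 -> the weight itself is unchanged
print(w.grad.item())  # -8.0 -> negative: increasing w would lower the loss
```

The weight only changes once an optimizer step (or a manual update) uses that gradient.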

The Gradient Descent Analogy

Plot: x-axis: weight, y-axis: loss

Goal: minimize loss (reach bottom of valley)

Gradient: tells you the slope (which way is downhill?)

Go downhill → lower loss

Zooming in on a single parameter taking a step against the gradient

Multiple parameters

If we have two parameters (\(\theta_0\) and \(\theta_1\)):

Stochastic Gradient Descent (SGD)

Strategy:

Negative gradient (\(\frac{\partial \text{loss}}{\partial w_0}\)) → increase weight (\(w_0\))
Positive gradient (\(\frac{\partial \text{loss}}{\partial w_1}\)) → decrease weight (\(w_1\))
Big gradient (\(\frac{\partial \text{loss}}{\partial w_0}\)) → big change (\(w_0\))
Small gradient (\(\frac{\partial \text{loss}}{\partial w_1}\)) → small change (\(w_1\))

Scales updates with learning rate: \(\text{step size} = \text{learning rate} \times \text{gradient}\)
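The strategy above can be sketched as a single update rule: new weight = weight − learning rate × gradient. The following plain-Python example uses made-up weights and gradients, and the helper name `sgd_step` is hypothetical.

```python
def sgd_step(weights, gradients, learning_rate):
    """Move each weight against its gradient, scaled by the learning rate."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Hypothetical values: w0 has a large negative gradient,
# w1 has a small positive gradient
weights = [1.0, 1.0]
gradients = [-4.0, 0.5]

new_weights = sgd_step(weights, gradients, learning_rate=0.1)
print([round(w, 2) for w in new_weights])  # [1.4, 0.95]
```

Here w0 increases by a large amount and w1 decreases by a small amount, matching the four rules in the list.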

Learning Rate Matters
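One way to see the effect is to run gradient descent on a toy loss with different learning rates. The sketch below assumes loss(w) = w², chosen only because its gradient (2w) and its minimum (w = 0) are easy to check; the three learning rates are arbitrary.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Gradient descent on the toy loss(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient
    return w

too_small = descend(0.001)   # barely moves away from the start
reasonable = descend(0.1)    # ends close to the minimum at 0
too_large = descend(1.1)     # overshoots further each step and diverges

print(abs(too_small), abs(reasonable), abs(too_large))
```

After 20 steps, the smallest rate is still near the starting point, the middle rate is close to zero, and the largest rate ends up further from the minimum than where it began.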

Adam Optimizer

Adapts learning rate for each weight individually

Like having an assistant:

Knows which weights need big adjustments
Knows which need fine-tuning

Popular first choice: reliable, flexible, often faster
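To give a feel for what "adapts the learning rate for each weight" means, here is a simplified single-weight sketch of the standard Adam update in plain Python. It is not PyTorch's implementation; the helper name `adam_step` and the two example gradients are hypothetical, and the default hyperparameters shown are the commonly used ones.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight. Returns (new_w, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_w, m, v

# Hypothetical gradients on very different scales
lr = 0.001
sgd_changes, adam_changes = [], []
for grad in (100.0, 0.01):
    sgd_changes.append(lr * grad)  # plain SGD: change scales with the gradient
    new_w, _, _ = adam_step(w=0.0, grad=grad, m=0.0, v=0.0, t=1, lr=lr)
    adam_changes.append(abs(new_w))

print(sgd_changes)
print(adam_changes)
```

On this first step, plain SGD would change the two weights by amounts that differ by a factor of 10,000, while Adam's changes are both close to the learning rate, because each gradient is divided by its own running scale.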

Optimizers’ loss curves

Why `zero_grad()`?

Every backward() call adds to existing gradients

Without zero_grad(): gradients accumulate incorrectly

Result: training breaks

Now that you understand gradients as diagnostic scores for each parameter, zero_grad begins to make a lot more sense. Every time you call backward, PyTorch adds new gradients to whatever is already there. So if you don’t call zero_grad, you’re not just diagnosing parameters for this batch - you’re actually accumulating the diagnoses from every batch. And the gradients will keep accumulating incorrectly until your training breaks.

So that’s why you call optimizer.zero_grad() at the start of every iteration of the training loop. Now you might wonder: well, why does PyTorch accumulate the gradients in the first place? It turns out that this behavior is really useful for advanced use cases like gradient accumulation or certain custom training schedules. But for most projects, including everything you’ll do in this course, you’re going to want to clear those gradients every time.
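The accumulation is easy to observe directly. The following PyTorch sketch uses a single toy weight and the expression 3 × w (whose gradient with respect to w is always 3); both are made up for this illustration.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes with no zero_grad in between: gradients add up
(3.0 * w).backward()
(3.0 * w).backward()
accumulated = w.grad.item()
print(accumulated)  # 6.0 (3.0 + 3.0)

# Clearing first gives the gradient for this pass only
optimizer.zero_grad()
(3.0 * w).backward()
fresh = w.grad.item()
print(fresh)  # 3.0
```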

Complete Training Loop

for inputs, targets in dataloader:  # each batch yields (inputs, targets)
    optimizer.zero_grad()      # Clear gradients
    outputs = model(inputs)     # Forward pass
    loss = loss_fn(outputs, targets)  # Measure
    loss.backward()            # Diagnose
    optimizer.step()           # Update

Measure → Diagnose → Update
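For a version of the loop that runs end to end, here is a minimal sketch on synthetic data. Everything outside the five loop lines is an assumption made for this example: the made-up relationship time = 2 × distance + 1, the single linear layer, the learning rate, and the number of epochs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data (made up for illustration): time = 2 * distance + 1
distances = torch.linspace(0, 1, 32).unsqueeze(1)
times = 2 * distances + 1
dataloader = DataLoader(TensorDataset(distances, times), batch_size=8)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(50):
    total = 0.0
    for inputs, targets in dataloader:
        optimizer.zero_grad()             # Clear gradients
        outputs = model(inputs)           # Forward pass
        loss = loss_fn(outputs, targets)  # Measure
        loss.backward()                   # Diagnose
        optimizer.step()                  # Update
        total += loss.item()
    epoch_losses.append(total / len(dataloader))

# Average loss in the first epoch vs. the last epoch
print(epoch_losses[0], epoch_losses[-1])
```

If training is working, the average loss in the last epoch should be well below the first.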

What’s Next?

In Session 3: Device Management and Image Classification Setup, you’ll learn:

Running on GPUs vs CPUs
Moving models and data to devices
Setting up MNIST data pipeline
Building your first image classifier architecture