Classification: Logistic Regression

import pandas as pd
import numpy as np

For regression, we plotted the independent variable on the x-axis and the dependent (target) variable on the y-axis.

In classification problems, we usually start with:

Let’s make up some data (X, y):

from sklearn.datasets import make_blobs

centers = [
    [5, -5],
    [5, 5]
]

X, y = make_blobs(
    n_samples=100,
    centers=centers,
    random_state=40
)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))

scatter = ax.scatter(
    x=X[:, 0],
    y=X[:, 1],
    c=y, # color
    edgecolor="black"
)

ax.set(
    title="Synthetic Dataset",
    xlabel="$X_1$",
    ylabel="$X_2$"
)
_ = ax.legend(*scatter.legend_elements(), title="Classes")

Decision Boundary

We need to draw a line that separates the two classes. We can describe it with the following function, where the line itself is the set of points with \(h(x) = 0\):

\[ h(x) = w_1 x_1 + w_2 x_2 + b \]

With this in mind, the graph will have both axes for the independent variables (\(x_1\) and \(x_2\)) without showing the target \(y\).

Changing the parameters has the following effect:

  • Rotation: \(w_1\) and \(w_2\)
  • Shifting: \(b\)
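To make the rotation and shifting concrete, the sketch below rewrites the boundary \(h(x) = 0\) as \(x_2 = -(w_1 / w_2) x_1 - b / w_2\) and reports its slope and intercept. The helper names `boundary_line` and `h` and all parameter values are illustrative assumptions, and the rearrangement assumes \(w_2 \neq 0\).

```python
import numpy as np


def boundary_line(w1: float, w2: float, b: float) -> tuple[float, float]:
    """Return (slope, intercept) of the line where h(x) = 0, assuming w2 != 0."""
    return -w1 / w2, -b / w2


def h(x: np.ndarray, w1: float, w2: float, b: float) -> np.ndarray:
    """Linear score h(x) = w1*x1 + w2*x2 + b for each row of x."""
    return w1 * x[:, 0] + w2 * x[:, 1] + b


# Illustrative values, not fitted to any data
print(boundary_line(w1=1.0, w2=2.0, b=-4.0))  # baseline
print(boundary_line(w1=3.0, w2=2.0, b=-4.0))  # changing w1 changes the slope (rotation)
print(boundary_line(w1=1.0, w2=2.0, b=-8.0))  # changing b changes only the intercept (shift)

# Points on the boundary score zero; points on either side get opposite signs
slope, intercept = boundary_line(1.0, 2.0, -4.0)
on_line = np.array([[0.0, intercept], [2.0, slope * 2.0 + intercept]])
print(h(on_line, 1.0, 2.0, -4.0))
print(h(np.array([[0.0, 5.0], [0.0, -5.0]]), 1.0, 2.0, -4.0))
```

The sign of \(h(x)\) tells us which side of the line a point falls on, which is what later becomes the class decision.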

Figure: Decision Boundary

Fitting the parameters: \(w_1, w_2, b\) sets the Decision Boundary between two classes: \(y=0, y=1\).

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
LogisticRegression()

Let’s look at the learned parameters:

print("w1: ", model.coef_[0][0].round(2))
print("w2: ", model.coef_[0][1].round(2))
print("b : ", model.intercept_[0].round(2))
w1:  -0.04
w2:  1.28
b :  0.34

Now let’s plot the decision boundary:

from sklearn.inspection import DecisionBoundaryDisplay

fig, ax = plt.subplots(figsize=(6, 4))

DecisionBoundaryDisplay.from_estimator(
    estimator=model,
    X=X,
    ax=ax,
    response_method="predict_proba",
    alpha=0.5,
    xlabel="$X_1$",
    ylabel="$X_2$"
)

scatter = ax.scatter(
    x=X[:, 0],
    y=X[:, 1],
    c=y,
    edgecolor="k"
)

Notice the gradual shift in color. That’s because LogisticRegression models a smooth probability rather than a hard threshold. Along the middle of the transition, both classes have an equal probability of 0.50.

The Sigmoid Function

This is accomplished using a logistic function (also known as sigmoid function), which looks like this:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-z))


z = np.linspace(-6, 6, 400)
sigma = sigmoid(z)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(z, sigma, label=r"$\sigma(z)$")
ax.axhline(0.5, color="gray", linestyle="--", linewidth=1)
ax.axvline(0.0, color="gray", linestyle="--", linewidth=1)
ax.set(
    title="Logistic (Sigmoid) Function",
    xlabel=r"$z$",
    ylabel=r"$\sigma(z)$",
)
ax.legend()
ax.grid(True)
plt.show()

Demo: Sigmoid

Play with the Desmos demo and notice how:

  • Adjusting the intercept \(w_0\) (the role played by \(b\) above) shifts the curve left or right
  • Updating \(w_1\) affects the slope of the S-curve.
    • Higher absolute values make the curve steeper
    • While lower values make it more gradual
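The same effects can be checked numerically. The sketch below assumes the one-feature form \(p = \sigma(w_1 x + w_0)\), which is our assumption about how the demo is parameterized; the helper names are illustrative. Under that assumption the curve crosses 0.5 at \(x = -w_0 / w_1\), and its slope at that point is \(w_1 / 4\).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def midpoint(w0: float, w1: float) -> float:
    """x at which sigmoid(w1 * x + w0) equals 0.5 (assumes w1 != 0)."""
    return -w0 / w1


def slope_at_midpoint(w0: float, w1: float, eps: float = 1e-6) -> float:
    """Central-difference estimate of the curve's slope at its midpoint."""
    x0 = midpoint(w0, w1)
    upper = sigmoid(w1 * (x0 + eps) + w0)
    lower = sigmoid(w1 * (x0 - eps) + w0)
    return float((upper - lower) / (2 * eps))


# Hypothetical parameter pairs: vary w0 first, then w1
for w0, w1 in [(2.0, 1.0), (-2.0, 1.0), (2.0, 4.0), (2.0, 0.5)]:
    print(
        f"w0={w0:+.1f}, w1={w1:.1f}: "
        f"midpoint={midpoint(w0, w1):+.2f}, slope={slope_at_midpoint(w0, w1):.3f}"
    )
```

Note that when \(w_0 \neq 0\), changing \(w_1\) also moves the midpoint, since the midpoint is \(-w_0 / w_1\).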

Decision Threshold

The suffix “Regression” often causes confusion, but it is technically accurate. Logistic regression is fundamentally a regression algorithm because it estimates a continuous numerical value: a probability between 0 and 1.

\[ 0 \le p \le 1 \]

It is also known in the literature as:

  • logit regression
  • maximum-entropy classification (MaxEnt)
  • log-linear classifier

It only becomes a classifier when we apply a decision threshold to that probability. For example, a standard rule is:

  • if p >= 0.5: y = 1
  • if p < 0.5: y = 0
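As a minimal sketch, the rule above is a single comparison in NumPy. The helper `apply_threshold`, the probabilities, and the alternative threshold of 0.9 are hypothetical values chosen for illustration.

```python
import numpy as np


def apply_threshold(p: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert probabilities of class 1 into hard 0/1 labels."""
    return (p >= threshold).astype(int)


# Hypothetical predicted probabilities of class 1
p = np.array([0.05, 0.49, 0.50, 0.51, 0.95])

print(apply_threshold(p))                 # standard rule: p >= 0.5 -> 1
print(apply_threshold(p, threshold=0.9))  # a stricter rule labels fewer points as 1
```

The probabilities themselves do not change; only the cut-off that turns them into labels does.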

Figure: Sigmoid Function with Decision Bound

We simply plug our decision boundary equation into the sigmoid function. Read the left-hand side as “the probability that \(y=1\) given \(x\) (the data)”:

\[ p(y=1|x) = \frac{1}{1 + e^{-(h(x))}} = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + b)}} \]
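To see the formula in action, the sketch below plugs in the rounded parameters printed earlier for the two-blob example (\(w_1 = -0.04\), \(w_2 = 1.28\), \(b = 0.34\)). Because those values are rounded, the results will only approximate what `model.predict_proba` returns; the helper name `predict_proba_class1` is illustrative.

```python
import numpy as np

# Rounded parameters printed earlier for the two-blob example
w = np.array([-0.04, 1.28])
b = 0.34


def predict_proba_class1(x: np.ndarray) -> np.ndarray:
    """P(y=1 | x) = sigmoid(h(x)) with h(x) = w1*x1 + w2*x2 + b."""
    h = x @ w + b
    return 1.0 / (1.0 + np.exp(-h))


# The two blob centers, plus the point where the boundary crosses x1 = 5
x2_on_boundary = -(w[0] * 5.0 + b) / w[1]
points = np.array([[5.0, -5.0], [5.0, 5.0], [5.0, x2_on_boundary]])

probabilities = predict_proba_class1(points)
for point, p in zip(points, probabilities):
    print(point.round(3), round(float(p), 4))
```

The center of the lower blob gets a probability close to 0, the center of the upper blob close to 1, and the point on the boundary gets 0.5.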

We can look at it in 3D, where the height corresponds to the probability of \(y\) being 1 given \(x_1\) and \(x_2\):

# 3D Plot: decision boundary as probability surface using meshgrid
import numpy as np

# Create meshgrid over the feature space (same range as the 2D plot)
x1_min, x1_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
x2_min, x2_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 80), np.linspace(x2_min, x2_max, 80))

# Stack grid points and get P(y=1) from the logistic regression model
X_grid = np.c_[xx1.ravel(), xx2.ravel()]
Z = model.predict_proba(X_grid)[:, 1]  # probability of class 1
Z = Z.reshape(xx1.shape)

# 3D figure: x1, x2 on the base, probability on the z-axis
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(xx1, xx2, Z, alpha=0.6, cmap="RdYlBu_r")
ax.scatter(X[:, 0], X[:, 1], y, c=y, edgecolor="k", s=50)
ax.set(xlabel="$X_1$", ylabel="$X_2$", zlabel="$P(y=1|x)$", title="Logistic regression decision boundary (3D)")
plt.tight_layout()
plt.show()

How It Learns: Log Loss

Unlike standard linear regression, which minimizes the Mean Squared Error (the average squared distance between predicted and actual values), logistic regression minimizes a cost function called Log Loss (or Binary Cross-Entropy).

Binary Cross Entropy

\[BCE = - \frac{1}{m} \sum_{i=1}^m \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]\]

Where:

  • \(BCE\): The total log loss calculated across the entire dataset.
  • \(m\): The total number of observations.
  • \(y_i\): The true, discrete binary label for the \(i\)-th observation (strictly 0 or 1).
  • \(\hat{y}_i\): The model’s predicted probability that the \(i\)-th observation belongs to class 1 (a continuous value bounded strictly between 0 and 1).
  • \(\log\): The natural logarithm (base \(e\)).
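The formula translates almost directly into NumPy. This is a minimal sketch rather than a reference implementation: the function name, the example labels and probabilities, and the clipping constant `eps` (which keeps \(\log(0)\) from occurring) are our own assumptions.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps: float = 1e-15) -> float:
    """Mean binary cross-entropy (log loss) using the natural logarithm."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs; eps is an illustrative choice
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-losses.mean())


# Hypothetical labels and predicted probabilities of class 1
y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

print(binary_cross_entropy(y_true, mostly_right))  # small loss
print(binary_cross_entropy(y_true, mostly_wrong))  # much larger loss
```

Confident predictions on the wrong side are penalized far more heavily than confident predictions on the right side are rewarded.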

Figure: BCE Loss Curve

Categorical Cross-Entropy

When classifying more than two mutually exclusive categories, the binary log loss generalizes into categorical cross-entropy.

\[ CCE = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_{i,k} \log(\hat{y}_{i,k}) \]

Where:

  • \(K\): The total number of discrete classes.
  • \(y_{i,k}\): A binary indicator (strictly 0 or 1) denoting whether class \(k\) is the correct ground-truth classification for observation \(i\).
  • \(\hat{y}_{i,k}\): The model’s predicted probability that observation \(i\) belongs to class \(k\).
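A short sketch of the same formula in NumPy follows. It assumes one-hot encoded labels and rows of probabilities that sum to 1; the function name, example values, and clipping constant are illustrative. The second half checks that with \(K = 2\) the result matches the binary cross-entropy.

```python
import numpy as np


def categorical_cross_entropy(y_onehot, y_prob, eps: float = 1e-15) -> float:
    """Mean categorical cross-entropy for one-hot labels and row-wise probabilities."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    # Clip from below so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    return float(-(y_onehot * np.log(y_prob)).sum(axis=1).mean())


# Three observations, K = 3 classes (hypothetical values)
y_onehot = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]
three_class = categorical_cross_entropy(y_onehot, y_prob)
print(three_class)

# With K = 2, the formula matches binary cross-entropy
y = np.array([1, 0, 1])
p = np.array([0.9, 0.1, 0.8])
two_class = categorical_cross_entropy(np.c_[1 - y, y], np.c_[1 - p, p])
binary = float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())
print(two_class, binary)
```

Only the predicted probability of the true class contributes to each observation's loss, because every other \(y_{i,k}\) is zero.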

Beyond Binary Classification

To extend the algorithm beyond binary classification and into multi-class classification, there are two approaches:

  1. Multinomial: Learns all classes jointly; using the CCE loss (recommended in practice)
  2. One vs Rest: One binary classifier per class; using multiple BCE losses (used here to demonstrate the sigmoid curve in 3D)
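The sketch below contrasts how the two approaches could turn \(K\) linear scores into class probabilities. It rests on assumptions not stated above: the multinomial model is taken to use the softmax function, and the one-vs-rest variant applies one sigmoid per class and then renormalizes each row so the probabilities sum to 1. The scores are hypothetical, and in practice the two approaches also learn different scores, so this illustrates only the final step.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Multinomial: normalize all class scores jointly."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def one_vs_rest(scores: np.ndarray) -> np.ndarray:
    """One-vs-rest: an independent sigmoid per class, then renormalize each row."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return p / p.sum(axis=1, keepdims=True)


# Hypothetical linear scores for 2 observations and 3 classes
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.0]])

print(softmax(scores).round(3))
print(one_vs_rest(scores).round(3))
```

For the same scores both mappings pick the same winning class, but they assign different probabilities to it.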

We demonstrate the difference by going through the following steps:

  1. Generate a 3-class dataset
  2. Train multinomial and one-vs-rest logistic regression
  3. Plot decision boundaries (regions from predict)
  4. Plot lines (where p(class) = 0.5)

1. Dataset

Synthetic 3-class data with sklearn.datasets.make_blobs:

import numpy as np
from sklearn.datasets import make_blobs

centers = [
    [-5, 0],
    [0, 1.5],
    [5, -1]
]

X, y = make_blobs(
    n_samples=1_000,
    centers=centers,
    random_state=40
)

transformation = [
    [0.4, 0.2],
    [-0.4, 1.2]
]
X = np.dot(X, transformation)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))

scatter = ax.scatter(
    X[:, 0], # x-axis
    X[:, 1], # y-axis
    c=y,     # color by class
    edgecolor="black"
)
ax.set(
    title="Synthetic Dataset",
    xlabel="Feature 1",
    ylabel="Feature 2"
)
_ = ax.legend(*scatter.legend_elements(), title="Classes")

2. Train both classifiers

Multinomial and one-vs-rest logistic regression on the same data.

from sklearn.linear_model import LogisticRegression

logistic_regression_multinomial = LogisticRegression()
logistic_regression_multinomial.fit(X, y)
accuracy_multinomial = logistic_regression_multinomial.score(X, y)
print(accuracy_multinomial)
0.995

from sklearn.multiclass import OneVsRestClassifier

logistic_regression_ovr = OneVsRestClassifier(LogisticRegression())
logistic_regression_ovr.fit(X, y)
accuracy_ovr = logistic_regression_ovr.score(X, y)
print(accuracy_ovr)
0.976
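For a classifier, `score` returns the mean accuracy on the data it is given, so the two numbers above are accuracies on the training data rather than on held-out data. A minimal sketch of that computation, with a hypothetical helper and made-up labels:

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Hypothetical labels for a 3-class problem
y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred = [0, 1, 2, 1, 1, 0, 1, 2]
print(accuracy(y_true, y_pred))  # 7 of 8 predictions are correct
```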

3. Decision boundaries

Regions from predict for each model.

from sklearn.inspection import DecisionBoundaryDisplay

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)

for model, title, ax in [
    (
        logistic_regression_multinomial,
        f"Multinomial Logistic Regression\n(Accuracy: {accuracy_multinomial:.3f})",
        ax1,
    ),
    (
        logistic_regression_ovr,
        f"One-vs-Rest Logistic Regression\n(Accuracy: {accuracy_ovr:.3f})",
        ax2,
    ),
]:
    DecisionBoundaryDisplay.from_estimator(
        model,
        X,
        ax=ax,
        response_method="predict",  # color each region by the predicted class
        alpha=0.8,
    )
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    legend = ax.legend(*scatter.legend_elements(), title="Classes")
    ax.add_artist(legend)
    ax.set_title(title)

It is helpful to see the one-vs-rest model in 3D. We’ll draw 3 separate plots, one per class, where the region of the positive class is lifted up (because its probability is highest there):

# 3D Plots: decision boundaries as probability surfaces for one-vs-rest logistic regression (each class)
import numpy as np

# Create meshgrid over the feature space (same range as the 2D plot)
x1_min, x1_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
x2_min, x2_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 80), np.linspace(x2_min, x2_max, 80))

# Stack grid points for predictions
X_grid = np.c_[xx1.ravel(), xx2.ravel()]

# Make three 3D plots: probability surface for each of three classes
fig = plt.figure(figsize=(18, 5))

classes = np.unique(y)
for i, class_idx in enumerate(classes):
    Z = logistic_regression_ovr.predict_proba(X_grid)[:, class_idx]
    Z = Z.reshape(xx1.shape)
    ax = fig.add_subplot(1, 3, i + 1, projection="3d")
    # Plot the probability surface for class class_idx
    surf = ax.plot_surface(xx1, xx2, Z, alpha=0.6, cmap="RdYlBu_r")
    # Scatter the training data with its true class as color
    ax.scatter(X[:, 0], X[:, 1], (y == class_idx).astype(float), c=y, edgecolor="k", s=50)
    ax.set(
        xlabel="$X_1$",
        ylabel="$X_2$",
        zlabel=f"$P(y={class_idx}|x)$",
        title=f"One-vs-Rest Logistic Regression (Class {class_idx} vs Rest)"
    )
    ax.view_init(elev=30, azim=-120)  # Optional: adjust view

plt.tight_layout()
plt.show()