Logistic Regression from Scratch

Goal: Implement binary logistic regression with NumPy — sigmoid, cross-entropy loss, gradient descent. Then extend to multi-class.

Prerequisites: Logistic Regression, Loss Functions, Probability Basics, 01 - Linear Regression from Scratch


Core Idea

Linear regression outputs any real number. For classification, we need probabilities (0 to 1). The sigmoid function does this:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The model:

$$P(y = 1 \mid x) = \sigma(w^T x + b)$$


Sigmoid and Its Gradient

import numpy as np
import matplotlib.pyplot as plt
 
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # clip for numerical stability
 
# Plot
z = np.linspace(-6, 6, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, color="gray", linestyle="--", alpha=0.5)
plt.axvline(0, color="gray", linestyle="--", alpha=0.5)
plt.xlabel("z"); plt.ylabel("σ(z)")
plt.title("Sigmoid function")
plt.show()

Key property: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ — the gradient depends on the output itself.
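A quick numerical sanity check of this identity — comparing $\sigma(z)(1 - \sigma(z))$ against a central finite difference (a small sketch that re-declares `sigmoid` so it runs standalone):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

z = np.linspace(-4, 4, 9)
analytic = sigmoid(z) * (1 - sigmoid(z))               # σ(z)(1 − σ(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
print(np.max(np.abs(analytic - numeric)))              # tiny — the two agree
```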


Binary Cross-Entropy Loss

For true label $y \in \{0, 1\}$ and predicted probability $p$:

$$\mathcal{L}(y, p) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$$

When $y = 1$: loss $= -\log p$ — penalizes low confidence in the positive class. When $y = 0$: loss $= -\log(1 - p)$ — penalizes high confidence in the positive class.

def binary_cross_entropy(y, p):
    eps = 1e-15  # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
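As a sanity check, confident correct predictions should give a loss near 0 and confident wrong ones a large loss (a sketch that re-declares the function so it runs standalone):

```python
import numpy as np

def binary_cross_entropy(y, p):
    eps = 1e-15  # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
good = binary_cross_entropy(y, np.array([0.99, 0.01, 0.99, 0.01]))  # ≈ -log(0.99) ≈ 0.01
bad = binary_cross_entropy(y, np.array([0.01, 0.99, 0.01, 0.99]))   # ≈ -log(0.01) ≈ 4.6
print(good, bad)
```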

Generate Data

from sklearn.datasets import make_classification
 
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42, n_clusters_per_class=1)
 
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", s=15, alpha=0.7)
plt.xlabel("x1"); plt.ylabel("x2")
plt.title("Binary classification data")
plt.show()

Full Implementation

class LogisticRegressionScratch:
    def __init__(self, lr=0.1):
        self.lr = lr
 
    def fit(self, X, y, n_epochs=1000):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        self.losses = []
 
        for epoch in range(n_epochs):
            # Forward pass
            z = X @ self.w + self.b
            p = sigmoid(z)
 
            # Loss
            loss = binary_cross_entropy(y, p)
            self.losses.append(loss)
 
            # Gradients (derive these — it's a clean result)
            error = p - y  # shape (n,)
            dw = (1 / n) * X.T @ error    # shape (d,)
            db = (1 / n) * np.sum(error)   # scalar
 
            # Update
            self.w -= self.lr * dw
            self.b -= self.lr * db
 
        return self
 
    def predict_proba(self, X):
        return sigmoid(X @ self.w + self.b)
 
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

Why p - y is the gradient

The gradient of BCE with respect to $w$ simplifies beautifully:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{n} X^T (p - y), \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_i (p_i - y_i)$$

This is the same form as linear regression’s gradient — the chain rule through sigmoid cancels the denominator in the log.
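You can verify the derivation numerically: compare the analytic gradient $\frac{1}{n} X^T (p - y)$ against central finite differences of the loss (a self-contained sketch with random data and weights):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def bce(y, p):
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.1
n = len(y)

p = sigmoid(X @ w + b)
analytic = X.T @ (p - y) / n  # the claimed gradient

# central finite differences, one weight component at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[j] += h
    wm[j] -= h
    numeric[j] = (bce(y, sigmoid(X @ wp + b)) - bce(y, sigmoid(X @ wm + b))) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # tiny — the derivation checks out
```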


Train and Evaluate

model = LogisticRegressionScratch(lr=0.1)
model.fit(X, y, n_epochs=500)
 
# Accuracy
preds = model.predict(X)
acc = np.mean(preds == y)
print(f"Training accuracy: {acc:.4f}")
 
# Loss curve
plt.plot(model.losses)
plt.xlabel("Epoch"); plt.ylabel("BCE Loss")
plt.title("Training convergence")
plt.show()

Visualize Decision Boundary

def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    probs = model.predict_proba(grid).reshape(xx.shape)
 
    plt.contourf(xx, yy, probs, levels=50, cmap="RdBu", alpha=0.6)
    plt.colorbar(label="P(y=1)")
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", s=15, edgecolors="k", linewidth=0.3)
    plt.title("Decision boundary")
    plt.show()
 
plot_decision_boundary(model, X, y)

The decision boundary is a straight line where $p = 0.5$, i.e., $w^T x + b = 0$.
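Because the boundary satisfies $w_1 x_1 + w_2 x_2 + b = 0$, you can read its slope and intercept straight off the learned parameters. A small sketch with hypothetical weights (stand-ins for a trained `model.w` and `model.b`):

```python
import numpy as np

# hypothetical learned parameters, for illustration only
w = np.array([1.5, -2.0])
b = 0.5

# solve w1*x1 + w2*x2 + b = 0 for x2
slope = -w[0] / w[1]
intercept = -b / w[1]

# any point on this line should give z = 0, i.e. sigmoid(z) = 0.5
x1 = 3.0
x2 = slope * x1 + intercept
z = w @ np.array([x1, x2]) + b
print(z)  # 0 up to floating-point error
```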


Regularization (L2)

Add $\frac{\lambda}{2n} \|w\|^2$ to the loss. The gradient gets an extra term $\frac{\lambda}{n} w$:

class LogisticRegressionL2(LogisticRegressionScratch):
    def __init__(self, lr=0.1, lam=0.01):
        super().__init__(lr)
        self.lam = lam
 
    def fit(self, X, y, n_epochs=1000):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        self.losses = []
 
        for epoch in range(n_epochs):
            z = X @ self.w + self.b
            p = sigmoid(z)
 
            loss = binary_cross_entropy(y, p) + (self.lam / (2 * n)) * np.sum(self.w ** 2)
            self.losses.append(loss)
 
            error = p - y
            dw = (1 / n) * X.T @ error + (self.lam / n) * self.w
            db = (1 / n) * np.sum(error)
 
            self.w -= self.lr * dw
            self.b -= self.lr * db
 
        return self
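The visible effect of the penalty is weight shrinkage: training with a large $\lambda$ should produce a smaller $\|w\|$ than training without it. A self-contained sketch (a minimal re-implementation of the fit loop above, on synthetic separable data):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def fit_l2(X, y, lr=0.1, lam=0.0, n_epochs=500):
    # condensed version of the L2 fit loop above
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        error = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ error / n + (lam / n) * w)
        b -= lr * error.mean()
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable labels

w_plain, _ = fit_l2(X, y, lam=0.0)
w_reg, _ = fit_l2(X, y, lam=50.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))  # penalized norm is smaller
```

On separable data the unregularized weights keep growing (the loss can always be lowered by scaling $w$ up), which is exactly the case where the penalty matters most.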

Verify Against sklearn

from sklearn.linear_model import LogisticRegression
 
sk_model = LogisticRegression(C=1e10)  # C=1e10 ≈ no regularization
sk_model.fit(X, y)
 
print(f"Ours:    w={model.w}, b={model.b:.4f}")
print(f"sklearn: w={sk_model.coef_[0]}, b={sk_model.intercept_[0]:.4f}")
print(f"sklearn accuracy: {sk_model.score(X, y):.4f}")

Exercises

  1. Multi-class: Implement one-vs-rest classification. Train binary classifiers, predict the class with highest probability.

  2. Softmax regression: Replace sigmoid with softmax for multi-class: $p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$. Use categorical cross-entropy loss.

  3. Precision and recall: Vary the threshold from 0 to 1 and plot precision vs recall. At what threshold is F1 maximized?

  4. Polynomial features: The boundary is linear. Generate data with make_moons(noise=0.2) and show logistic regression fails. Then add polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$) and show it works.


Next: 03 - Decision Tree from Scratch — a non-linear classifier that doesn’t need feature engineering.