Logistic Regression from Scratch

Goal: Implement binary logistic regression with NumPy — sigmoid, cross-entropy loss, gradient descent. Then extend to multi-class.

Prerequisites: Logistic Regression, Loss Functions, Probability Basics, 01 - Linear Regression from Scratch


Core Idea

Linear regression outputs any real number. For classification, we need probabilities (0 to 1). The sigmoid function does this:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The model:

$$P(y = 1 \mid x) = \sigma(w^T x + b)$$


Sigmoid and Its Gradient

import numpy as np
import matplotlib.pyplot as plt
 
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # clip for numerical stability
 
# Plot
z = np.linspace(-6, 6, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, color="gray", linestyle="--", alpha=0.5)
plt.axvline(0, color="gray", linestyle="--", alpha=0.5)
plt.xlabel("z"); plt.ylabel("σ(z)")
plt.title("Sigmoid function")
plt.show()

Key property: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ — the gradient depends on the output itself.
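A quick numerical sanity check of this identity — comparing $\sigma(z)(1 - \sigma(z))$ against a central finite difference (a small sketch that re-declares `sigmoid` so it runs standalone):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

z = np.linspace(-4, 4, 9)
analytic = sigmoid(z) * (1 - sigmoid(z))               # σ(z)(1 − σ(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
print(np.max(np.abs(analytic - numeric)))              # tiny — the two agree
```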


Binary Cross-Entropy Loss

For true label $y \in \{0, 1\}$ and predicted probability $p$:

$$\mathcal{L}(y, p) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$$

When $y = 1$: loss $= -\log p$ — penalizes low confidence in the positive class. When $y = 0$: loss $= -\log(1 - p)$ — penalizes high confidence in the positive class.

def binary_cross_entropy(y, p):
    eps = 1e-15  # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
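As a sanity check, confident correct predictions should give a loss near 0 and confident wrong ones a large loss (a sketch that re-declares the function so it runs standalone):

```python
import numpy as np

def binary_cross_entropy(y, p):
    eps = 1e-15  # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
good = binary_cross_entropy(y, np.array([0.99, 0.01, 0.99, 0.01]))  # ≈ -log(0.99) ≈ 0.01
bad = binary_cross_entropy(y, np.array([0.01, 0.99, 0.01, 0.99]))   # ≈ -log(0.01) ≈ 4.6
print(good, bad)
```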

Generate Data

from sklearn.datasets import make_classification
 
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42, n_clusters_per_class=1)
 
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", s=15, alpha=0.7)
plt.xlabel("x1"); plt.ylabel("x2")
plt.title("Binary classification data")
plt.show()

Full Implementation

class LogisticRegressionScratch:
    def __init__(self, lr=0.1):
        self.lr = lr
 
    def fit(self, X, y, n_epochs=1000):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        self.losses = []
 
        for epoch in range(n_epochs):
            # Forward pass
            z = X @ self.w + self.b
            p = sigmoid(z)
 
            # Loss
            loss = binary_cross_entropy(y, p)
            self.losses.append(loss)
 
            # Gradients (derive these — it's a clean result)
            error = p - y  # shape (n,)
            dw = (1 / n) * X.T @ error    # shape (d,)
            db = (1 / n) * np.sum(error)   # scalar
 
            # Update
            self.w -= self.lr * dw
            self.b -= self.lr * db
 
        return self
 
    def predict_proba(self, X):
        return sigmoid(X @ self.w + self.b)
 
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

Why p - y is the gradient

The gradient of BCE with respect to $w$ simplifies beautifully:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{n} X^T (p - y), \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_i (p_i - y_i)$$

This is the same form as linear regression’s gradient — the chain rule through sigmoid cancels the denominator in the log.
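You can verify the derivation numerically: compare the analytic gradient $\frac{1}{n} X^T (p - y)$ against central finite differences of the loss (a self-contained sketch with random data and weights):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def bce(y, p):
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.1
n = len(y)

p = sigmoid(X @ w + b)
analytic = X.T @ (p - y) / n  # the claimed gradient

# central finite differences, one weight component at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[j] += h
    wm[j] -= h
    numeric[j] = (bce(y, sigmoid(X @ wp + b)) - bce(y, sigmoid(X @ wm + b))) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # tiny — the derivation checks out
```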


Train and Evaluate

model = LogisticRegressionScratch(lr=0.1)
model.fit(X, y, n_epochs=500)
 
# Accuracy
preds = model.predict(X)
acc = np.mean(preds == y)
print(f"Training accuracy: {acc:.4f}")
 
# Loss curve
plt.plot(model.losses)
plt.xlabel("Epoch"); plt.ylabel("BCE Loss")
plt.title("Training convergence")
plt.show()

Visualize Decision Boundary

def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    probs = model.predict_proba(grid).reshape(xx.shape)
 
    plt.contourf(xx, yy, probs, levels=50, cmap="RdBu", alpha=0.6)
    plt.colorbar(label="P(y=1)")
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", s=15, edgecolors="k", linewidth=0.3)
    plt.title("Decision boundary")
    plt.show()
 
plot_decision_boundary(model, X, y)

The decision boundary is a straight line where $p = 0.5$, i.e., $w^T x + b = 0$.
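Because the boundary satisfies $w_1 x_1 + w_2 x_2 + b = 0$, you can read its slope and intercept straight off the learned parameters. A small sketch with hypothetical weights (stand-ins for a trained `model.w` and `model.b`):

```python
import numpy as np

# hypothetical learned parameters, for illustration only
w = np.array([1.5, -2.0])
b = 0.5

# solve w1*x1 + w2*x2 + b = 0 for x2
slope = -w[0] / w[1]
intercept = -b / w[1]

# any point on this line should give z = 0, i.e. sigmoid(z) = 0.5
x1 = 3.0
x2 = slope * x1 + intercept
z = w @ np.array([x1, x2]) + b
print(z)  # 0 up to floating-point error
```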


Regularization (L2)

Add $\frac{\lambda}{2n} \|w\|^2$ to the loss. The gradient gets an extra term $\frac{\lambda}{n} w$:

class LogisticRegressionL2(LogisticRegressionScratch):
    def __init__(self, lr=0.1, lam=0.01):
        super().__init__(lr)
        self.lam = lam
 
    def fit(self, X, y, n_epochs=1000):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        self.losses = []
 
        for epoch in range(n_epochs):
            z = X @ self.w + self.b
            p = sigmoid(z)
 
            loss = binary_cross_entropy(y, p) + (self.lam / (2 * n)) * np.sum(self.w ** 2)
            self.losses.append(loss)
 
            error = p - y
            dw = (1 / n) * X.T @ error + (self.lam / n) * self.w
            db = (1 / n) * np.sum(error)
 
            self.w -= self.lr * dw
            self.b -= self.lr * db
 
        return self
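The visible effect of the penalty is weight shrinkage: training with a large $\lambda$ should produce a smaller $\|w\|$ than training without it. A self-contained sketch (a minimal re-implementation of the fit loop above, on synthetic separable data):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def fit_l2(X, y, lr=0.1, lam=0.0, n_epochs=500):
    # condensed version of the L2 fit loop above
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        error = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ error / n + (lam / n) * w)
        b -= lr * error.mean()
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable labels

w_plain, _ = fit_l2(X, y, lam=0.0)
w_reg, _ = fit_l2(X, y, lam=50.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))  # penalized norm is smaller
```

On separable data the unregularized weights keep growing (the loss can always be lowered by scaling $w$ up), which is exactly the case where the penalty matters most.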

Verify Against sklearn

from sklearn.linear_model import LogisticRegression
 
sk_model = LogisticRegression(C=1e10)  # C=1e10 ≈ no regularization
sk_model.fit(X, y)
 
print(f"Ours:    w={model.w}, b={model.b:.4f}")
print(f"sklearn: w={sk_model.coef_[0]}, b={sk_model.intercept_[0]:.4f}")
print(f"sklearn accuracy: {sk_model.score(X, y):.4f}")

Exercises

  1. Multi-class: Implement one-vs-rest classification. Train binary classifiers, predict the class with highest probability.

  2. Softmax regression: Replace sigmoid with softmax for multi-class: $p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$. Use categorical cross-entropy loss.

  3. Precision and recall: Vary the threshold from 0 to 1 and plot precision vs recall. At what threshold is F1 maximized?

  4. Polynomial features: The boundary is linear. Generate data with make_moons(noise=0.2) and show logistic regression fails. Then add polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$) and show it works.


Next: 03 - Decision Tree from Scratch — a non-linear classifier that doesn’t need feature engineering.