Linear Regression from Scratch

Goal: Implement linear regression two ways (normal equation + gradient descent) using only NumPy. Compare with sklearn to verify correctness.

Prerequisites: Vectors and Matrices, Gradient Descent, Linear Regression, Loss Functions


The Problem

Given data points (x₁, y₁), …, (xₙ, yₙ), find weights w and bias b that minimize:

MSE(w, b) = (1/n) Σᵢ (ŷᵢ − yᵢ)²

where ŷᵢ = w·xᵢ + b.


Generate Synthetic Data

import numpy as np
import matplotlib.pyplot as plt
 
np.random.seed(42)
n = 100
X = 2 * np.random.rand(n, 1)           # feature: shape (100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(n) * 0.5  # true: y = 4 + 3x + noise
 
plt.scatter(X[:, 0], y, s=10)
plt.xlabel("x"); plt.ylabel("y")
plt.title("Synthetic data: y = 4 + 3x + noise")
plt.show()

Method 1: Normal Equation

Closed-form solution:

θ = (X_bᵀ X_b)⁻¹ X_bᵀ y

where X_b is X with a column of ones prepended for the bias, so θ = [b, w]ᵀ.

# Add bias column (column of ones)
X_b = np.c_[np.ones(n), X]  # shape (100, 2)
 
# Normal equation
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(f"Intercept: {theta[0]:.4f}, Slope: {theta[1]:.4f}")
# Expected: ~4.0, ~3.0
 
# Predict
x_line = np.array([[0], [2]])
x_line_b = np.c_[np.ones(2), x_line]
y_pred = x_line_b @ theta
 
plt.scatter(X[:, 0], y, s=10)
plt.plot(x_line, y_pred, "r-", linewidth=2)
plt.title(f"Normal equation: y = {theta[0]:.2f} + {theta[1]:.2f}x")
plt.show()

Why it works

X_bᵀ X_b is the Gram matrix; its inverse exists when the features aren't collinear. The fitted values X_b θ = X_b (X_bᵀ X_b)⁻¹ X_bᵀ y are the projection of the target onto the column space of X_b.
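A quick numerical check of the projection claim (regenerating the same data so the snippet runs standalone): the residual y − X_b θ must be orthogonal to every column of X_b.

```python
import numpy as np

# Same synthetic data as above, regenerated so this runs standalone
np.random.seed(42)
n = 100
X = 2 * np.random.rand(n, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(n) * 0.5

X_b = np.c_[np.ones(n), X]
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# If X_b @ theta is the projection of y onto the column space of X_b,
# the residual is orthogonal to every column of X_b
residual = y - X_b @ theta
print(X_b.T @ residual)  # ~[0, 0] up to floating-point error
```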

When it breaks

  • Large n or many features d: forming and inverting X_bᵀ X_b costs roughly O(nd² + d³), which gets expensive
  • Singular X_bᵀ X_b (e.g., collinear features): use np.linalg.pinv (pseudoinverse) instead
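A minimal sketch of the pseudoinverse fallback, using a deliberately duplicated feature so that X_bᵀ X_b is singular and np.linalg.inv would fail:

```python
import numpy as np

np.random.seed(0)
n = 50
x = np.random.rand(n)
y = 4 + 3 * x + np.random.randn(n) * 0.5

# Duplicate the feature: X_b.T @ X_b is now singular
X_b = np.c_[np.ones(n), x, x]

# Swap inv -> pinv in the normal equation
theta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
print(theta)  # minimum-norm solution: the slope is split evenly across the duplicated columns
```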

Method 2: Gradient Descent

Iteratively update parameters in the direction that reduces MSE.

Gradients

For θ = [b, w]ᵀ with the bias column included in X_b, the gradient of the MSE is:

∇θ MSE(θ) = (2/n) X_bᵀ (X_b θ − y)
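As a sanity check, the analytic gradient can be compared against central finite differences (a standalone sketch on data generated like the synthetic set above):

```python
import numpy as np

np.random.seed(42)
n = 100
X_b = np.c_[np.ones(n), 2 * np.random.rand(n)]
y = 4 + 3 * X_b[:, 1] + np.random.randn(n) * 0.5

def mse(theta):
    return np.mean((X_b @ theta - y) ** 2)

theta = np.random.randn(2)

# Analytic gradient: (2/n) X_b^T (X_b theta - y)
analytic = (2 / n) * X_b.T @ (X_b @ theta - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (mse(theta + eps * e) - mse(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.max(np.abs(analytic - numeric)))  # tiny: the two gradients agree
```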

Implementation

def linear_regression_gd(X_b, y, lr=0.1, n_epochs=1000):
    n = len(y)
    theta = np.random.randn(X_b.shape[1])  # random init
    losses = []
 
    for epoch in range(n_epochs):
        y_pred = X_b @ theta
        error = y_pred - y
        loss = np.mean(error ** 2)
        losses.append(loss)
 
        gradient = (2 / n) * X_b.T @ error
        theta -= lr * gradient
 
    return theta, losses
 
theta_gd, losses = linear_regression_gd(X_b, y, lr=0.1, n_epochs=200)
print(f"GD result — Intercept: {theta_gd[0]:.4f}, Slope: {theta_gd[1]:.4f}")

Visualize convergence

plt.plot(losses)
plt.xlabel("Epoch"); plt.ylabel("MSE")
plt.title("Gradient descent convergence")
plt.yscale("log")
plt.show()

Compare with sklearn

from sklearn.linear_model import LinearRegression
 
model = LinearRegression()
model.fit(X, y)
print(f"sklearn — Intercept: {model.intercept_:.4f}, Slope: {model.coef_[0]:.4f}")

All three should give nearly identical results (~4.0 intercept, ~3.0 slope).
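To make the comparison concrete, all three estimates can be checked with np.allclose (a standalone sketch that recomputes each one; the GD run uses more epochs than above so it converges tightly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(42)
n = 100
X = 2 * np.random.rand(n, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(n) * 0.5
X_b = np.c_[np.ones(n), X]

# Normal equation
theta_ne = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Gradient descent, run long enough to converge tightly
theta_gd = np.zeros(2)
for _ in range(2000):
    theta_gd -= 0.1 * (2 / n) * X_b.T @ (X_b @ theta_gd - y)

# sklearn
model = LinearRegression().fit(X, y)
theta_sk = np.r_[model.intercept_, model.coef_]

print(np.allclose(theta_ne, theta_gd, atol=1e-4))  # True
print(np.allclose(theta_ne, theta_sk, atol=1e-6))  # True
```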


Effect of Learning Rate

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, lr in zip(axes, [0.001, 0.1, 1.5]):
    _, losses = linear_regression_gd(X_b, y, lr=lr, n_epochs=200)
    ax.plot(losses)
    ax.set_title(f"lr = {lr}")
    ax.set_xlabel("Epoch"); ax.set_ylabel("MSE")
    ax.set_yscale("log")
plt.tight_layout()
plt.show()
  • Too small (0.001): converges slowly
  • Just right (0.1): fast convergence
  • Too large (1.5): diverges or oscillates
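These regimes can also be confirmed without plots. A small standalone sketch runs the same GD update and reports the final MSE; with lr = 1.5 the iterates overflow to inf/nan, so overflow warnings are suppressed:

```python
import numpy as np

np.random.seed(42)
n = 100
X_b = np.c_[np.ones(n), 2 * np.random.rand(n)]
y = 4 + 3 * X_b[:, 1] + np.random.randn(n) * 0.5

def final_loss(lr, n_epochs=200):
    theta = np.zeros(2)
    # Suppress overflow/invalid warnings: large lr makes theta blow up
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_epochs):
            theta -= lr * (2 / n) * X_b.T @ (X_b @ theta - y)
        return np.mean((X_b @ theta - y) ** 2)

for lr in [0.001, 0.1, 1.5]:
    print(f"lr={lr}: final MSE = {final_loss(lr):.4g}")
```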

Multivariate Extension

Same code works for multiple features — that’s the beauty of vectorized math.

# 3 features
np.random.seed(42)
X_multi = np.random.randn(200, 3)
true_weights = np.array([2.0, -1.5, 0.5])
y_multi = X_multi @ true_weights + 3.0 + np.random.randn(200) * 0.3
 
X_multi_b = np.c_[np.ones(200), X_multi]
theta_multi, _ = linear_regression_gd(X_multi_b, y_multi, lr=0.05, n_epochs=500)
print(f"Recovered: bias={theta_multi[0]:.2f}, weights={theta_multi[1:]}")
# Expected: ~[3.0, 2.0, -1.5, 0.5]

Exercises

  1. Ridge regression: Add an L2 penalty λ‖w‖² to the loss. Modify the gradient and compare how different λ values affect the learned weights.

  2. Stochastic gradient descent: Instead of computing the gradient over all samples, randomly pick one sample per step. Compare convergence speed and noise.

  3. Feature scaling: Generate features with very different scales (e.g., one 0-1, another 0-1000). Observe what happens to GD without scaling. Then add standardization and try again.

  4. Learning rate schedule: Implement a decaying learning rate (for example lr_t = lr_0 / (1 + k·t)) and compare convergence with a fixed lr.


Next: 02 - Logistic Regression from Scratch — extend this to classification with the sigmoid function.