Linear Regression from Scratch
Goal: Implement linear regression two ways (normal equation + gradient descent) using only NumPy. Compare with sklearn to verify correctness.
Prerequisites: Vectors and Matrices, Gradient Descent, Linear Regression, Loss Functions
The Problem
Given data points $(x_1, y_1), \dots, (x_n, y_n)$, find weight $w$ and bias $b$ that minimize:

$$\text{MSE}(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where $\hat{y}_i = w x_i + b$.
Generate Synthetic Data
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
n = 100
X = 2 * np.random.rand(n, 1) # feature: shape (100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(n) * 0.5 # true: y = 4 + 3x + noise
plt.scatter(X[:, 0], y, s=10)
plt.xlabel("x"); plt.ylabel("y")
plt.title("Synthetic data: y = 4 + 3x + noise")
plt.show()

Method 1: Normal Equation
Closed-form solution (with the bias folded into $\theta = (b, w)$ by prepending a column of ones to form $X_b$):

$$\theta = (X_b^\top X_b)^{-1} X_b^\top y$$
# Add bias column (column of ones)
X_b = np.c_[np.ones(n), X] # shape (100, 2)
# Normal equation
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(f"Intercept: {theta[0]:.4f}, Slope: {theta[1]:.4f}")
# Expected: ~4.0, ~3.0
# Predict
x_line = np.array([[0], [2]])
x_line_b = np.c_[np.ones(2), x_line]
y_pred = x_line_b @ theta
plt.scatter(X[:, 0], y, s=10)
plt.plot(x_line, y_pred, "r-", linewidth=2)
plt.title(f"Normal equation: y = {theta[0]:.2f} + {theta[1]:.2f}x")
plt.show()

Why it works
$X_b^\top X_b$ is the Gram matrix; its inverse exists when the features aren't collinear. Multiplying $y$ by $X_b (X_b^\top X_b)^{-1} X_b^\top$ projects the target onto the column space of $X_b$, so the fitted values are the closest point in that subspace to $y$.
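This projection property can be checked numerically: at the least-squares solution, the residual is orthogonal to every column of $X_b$. A standalone sketch (it regenerates the same synthetic data so it runs on its own):

```python
import numpy as np

# Same synthetic data as above, regenerated so this snippet is self-contained
np.random.seed(42)
n = 100
X = 2 * np.random.rand(n, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(n) * 0.5
X_b = np.c_[np.ones(n), X]

theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
residual = y - X_b @ theta

# If X_b @ theta is the projection of y onto col(X_b), the residual
# must be orthogonal to every column of X_b (zero up to float error)
print(X_b.T @ residual)
```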
When it breaks
- Large $n$ or many features: forming and inverting $X_b^\top X_b$ costs $O(nd^2 + d^3)$, which gets expensive
- Singular $X_b^\top X_b$ (e.g., collinear features): use np.linalg.pinv (pseudoinverse) instead
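A quick sketch of why `pinv` helps: duplicate a feature so that $X_b^\top X_b$ is singular, then solve with the pseudoinverse. The data here is made up for illustration:

```python
import numpy as np

np.random.seed(0)
n = 50
x = np.random.rand(n)
y = 4 + 3 * x + np.random.randn(n) * 0.1

# Duplicate the feature on purpose: X_b^T X_b is now singular,
# so the plain inverse is not usable
X_b = np.c_[np.ones(n), x, x]

# pinv returns the minimum-norm least-squares solution even here;
# the true slope of 3 gets split across the two identical columns
theta = np.linalg.pinv(X_b) @ y
print(theta)
```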
Method 2: Gradient Descent
Iteratively update parameters in the direction that reduces MSE.
Gradients
For the parameter vector $\theta$ (bias included via the ones column of $X_b$):

$$\nabla_\theta \text{MSE} = \frac{2}{n} X_b^\top (X_b \theta - y)$$
Implementation
def linear_regression_gd(X_b, y, lr=0.1, n_epochs=1000):
    n = len(y)
    theta = np.random.randn(X_b.shape[1])  # random init
    losses = []
    for epoch in range(n_epochs):
        y_pred = X_b @ theta
        error = y_pred - y
        loss = np.mean(error ** 2)
        losses.append(loss)
        gradient = (2 / n) * X_b.T @ error
        theta -= lr * gradient
    return theta, losses
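Before trusting the analytic gradient, it can be compared against central finite differences. A sketch; `mse` and `numeric_grad` are hypothetical helpers, not part of the main implementation:

```python
import numpy as np

def mse(X_b, y, theta):
    return np.mean((X_b @ theta - y) ** 2)

def numeric_grad(X_b, y, theta, eps=1e-6):
    # Central finite differences, one parameter at a time
    g = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (mse(X_b, y, theta + step) - mse(X_b, y, theta - step)) / (2 * eps)
    return g

np.random.seed(1)
X_b = np.c_[np.ones(30), 2 * np.random.rand(30)]
y = 4 + 3 * X_b[:, 1] + np.random.randn(30) * 0.5
theta = np.random.randn(2)

# Analytic gradient from the formula above vs. the numerical estimate
analytic = (2 / len(y)) * X_b.T @ (X_b @ theta - y)
print(np.max(np.abs(analytic - numeric_grad(X_b, y, theta))))
```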
theta_gd, losses = linear_regression_gd(X_b, y, lr=0.1, n_epochs=200)
print(f"GD result — Intercept: {theta_gd[0]:.4f}, Slope: {theta_gd[1]:.4f}")

Visualize convergence
plt.plot(losses)
plt.xlabel("Epoch"); plt.ylabel("MSE")
plt.title("Gradient descent convergence")
plt.yscale("log")
plt.show()

Compare with sklearn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
print(f"sklearn — Intercept: {model.intercept_:.4f}, Slope: {model.coef_[0]:.4f}")

All three should give nearly identical results (~4.0 intercept, ~3.0 slope).
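The agreement between the two from-scratch methods can also be asserted programmatically. A self-contained sketch that regenerates the data and reruns both solvers (names here are local to the snippet):

```python
import numpy as np

np.random.seed(42)
n = 100
X = 2 * np.random.rand(n, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(n) * 0.5
X_b = np.c_[np.ones(n), X]

# Normal equation
theta_ne = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Gradient descent, same update rule as above, run to convergence
theta_gd = np.zeros(2)
for _ in range(2000):
    theta_gd -= 0.1 * (2 / n) * X_b.T @ (X_b @ theta_gd - y)

print(np.allclose(theta_ne, theta_gd, atol=1e-4))  # True
```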
Effect of Learning Rate
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, lr in zip(axes, [0.001, 0.1, 1.5]):
    _, losses = linear_regression_gd(X_b, y, lr=lr, n_epochs=200)
    ax.plot(losses)
    ax.set_title(f"lr = {lr}")
    ax.set_xlabel("Epoch"); ax.set_ylabel("MSE")
    ax.set_yscale("log")
plt.tight_layout()
plt.show()

- Too small (0.001): converges slowly
- Just right (0.1): fast convergence
- Too large (1.5): diverges or oscillates
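These regimes are not arbitrary: for a quadratic loss like MSE, plain gradient descent is stable only when lr $< 2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the (constant) Hessian $\frac{2}{n} X_b^\top X_b$. A sketch computing that bound for this dataset:

```python
import numpy as np

np.random.seed(42)
n = 100
X = 2 * np.random.rand(n, 1)
X_b = np.c_[np.ones(n), X]

# The MSE is quadratic in theta, so its Hessian is constant: (2/n) X_b^T X_b
H = (2 / n) * X_b.T @ X_b
lam_max = np.linalg.eigvalsh(H).max()
print(f"max stable lr = 2 / lam_max ~ {2 / lam_max:.2f}")
```

For this data the bound should land between 0.1 and 1.5, which is consistent with 0.1 converging and 1.5 diverging in the plots.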
Multivariate Extension
Same code works for multiple features — that’s the beauty of vectorized math.
# 3 features
np.random.seed(42)
X_multi = np.random.randn(200, 3)
true_weights = np.array([2.0, -1.5, 0.5])
y_multi = X_multi @ true_weights + 3.0 + np.random.randn(200) * 0.3
X_multi_b = np.c_[np.ones(200), X_multi]
theta_multi, _ = linear_regression_gd(X_multi_b, y_multi, lr=0.05, n_epochs=500)
print(f"Recovered: bias={theta_multi[0]:.2f}, weights={theta_multi[1:]}")
# Expected: ~[3.0, 2.0, -1.5, 0.5]

Exercises
- Ridge regression: Add an L2 penalty $\lambda \lVert w \rVert^2$ to the loss. Modify the gradient and compare how different $\lambda$ values affect the learned weights.
- Stochastic gradient descent: Instead of computing the gradient over all samples, randomly pick one sample per step. Compare convergence speed and noise.
- Feature scaling: Generate features with very different scales (e.g., one 0-1, another 0-1000). Observe what happens to GD without scaling. Then add standardization and try again.
- Learning rate schedule: Implement a decaying learning rate (e.g., $\eta_t = \eta_0 / (1 + kt)$) and compare its convergence with a fixed lr.
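As a starting point for the ridge exercise (a sketch, not a reference solution; `ridge_gd` and `lam` are names chosen here), the only change to the update is an extra penalty term in the gradient:

```python
import numpy as np

def ridge_gd(X_b, y, lam=0.1, lr=0.1, n_epochs=1000):
    """GD with an L2 penalty lam * ||w||^2 on the weights.
    By convention the bias (column 0) is left unpenalized."""
    n = len(y)
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_epochs):
        grad = (2 / n) * X_b.T @ (X_b @ theta - y)
        penalty = 2 * lam * theta
        penalty[0] = 0.0  # do not shrink the bias
        theta -= lr * (grad + penalty)
    return theta

# Demo on made-up data: larger lam shrinks the weights toward zero
np.random.seed(0)
X_b_demo = np.c_[np.ones(200), np.random.randn(200, 2)]
y_demo = X_b_demo @ np.array([1.0, 2.0, -1.0]) + np.random.randn(200) * 0.1
theta_plain = ridge_gd(X_b_demo, y_demo, lam=0.0)
theta_ridge = ridge_gd(X_b_demo, y_demo, lam=1.0)
print(np.linalg.norm(theta_plain[1:]), np.linalg.norm(theta_ridge[1:]))
```

With `lam=0.0` this reduces to plain gradient descent; increasing `lam` should shrink the weight norm while leaving the bias mostly intact.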
Next: 02 - Logistic Regression from Scratch — extend this to classification with the sigmoid function.