Partial Derivatives
What
When a function has multiple inputs, a partial derivative measures change with respect to one input while holding others constant.
f(x, y) = x² + 3xy
∂f/∂x = 2x + 3y (treat y as constant)
∂f/∂y = 3x (treat x as constant)
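These partials can be sanity-checked numerically with central finite differences — a minimal sketch using the same f(x, y) = x² + 3xy (point (2, 1) chosen arbitrarily for illustration):

```python
def f(x, y):
    return x**2 + 3*x*y

h = 1e-6
x, y = 2.0, 1.0

# ∂f/∂x ≈ (f(x+h, y) - f(x-h, y)) / 2h; analytic value is 2x + 3y = 7
df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)

# ∂f/∂y ≈ (f(x, y+h) - f(x, y-h)) / 2h; analytic value is 3x = 6
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
```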
Why it matters
A loss function depends on thousands/millions of weights. The partial derivative with respect to each weight tells you how to adjust that specific weight to reduce the loss.
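That adjustment is the gradient-descent update. A toy sketch with a single weight and an assumed quadratic loss (the loss, learning rate, and step count here are all illustrative choices):

```python
# loss(w) = (w - 3)**2 has its minimum at w = 3, and dloss/dw = 2*(w - 3)
w = 0.0
lr = 0.1  # learning rate (assumed value for illustration)
for _ in range(100):
    grad = 2 * (w - 3)  # the partial derivative tells us which way to move
    w -= lr * grad      # step against the gradient to reduce the loss
# w has converged close to the minimum at 3
```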
Key ideas
- Same rules as regular derivatives — just ignore the other variables
- Collect all partial derivatives into a vector → that’s the Gradient
- Jacobian: matrix of all partial derivatives for vector-valued functions
- Chain rule for composition: if z = f(g(x, y)), then ∂z/∂x = (df/dg) · (∂g/∂x) — this is how backprop works through layers
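The chain-rule bullet can be checked numerically — a sketch with the hypothetical choices f(u) = u² and g(x, y) = x + 2y, so ∂z/∂x = f′(g) · ∂g/∂x = 2·g(x, y) · 1:

```python
def g(x, y):
    return x + 2*y

def f(u):
    return u**2

x, y = 1.0, 1.0

# Analytic chain rule: dz/dx = f'(g) * dg/dx = 2 * g(x, y) * 1 = 6 at (1, 1)
analytic = 2 * g(x, y) * 1

# Finite-difference check of the same quantity
h = 1e-6
numeric = (f(g(x + h, y)) - f(g(x - h, y))) / (2 * h)
```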
More examples
Sigmoid: σ(z) = 1 / (1 + e^(-z))
dσ/dz = σ(z) · (1 - σ(z))
This neat form is why sigmoid was popular early on — the derivative is trivial to compute from the output itself.
Softmax: softmax(zᵢ) = e^(zᵢ) / Σ e^(zⱼ). The partial derivatives form a Jacobian:
∂softmax_i/∂z_j = softmax_i · (δᵢⱼ - softmax_j)
where δᵢⱼ = 1 if i=j, else 0. In practice, you rarely hand-code this — autograd handles it.
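Still, the Jacobian formula is easy to verify against finite differences — a sketch (the input vector is an arbitrary choice):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
s = softmax(z)

# Jacobian from the formula J[i, j] = s_i * (delta_ij - s_j),
# written compactly as diag(s) - s sᵀ
J = np.diag(s) - np.outer(s, s)

# Finite-difference check of column j (derivatives of all outputs w.r.t. z_j)
h = 1e-6
j = 0
e_j = np.eye(3)[j]
col = (softmax(z + h * e_j) - softmax(z - h * e_j)) / (2 * h)
```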
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = 2.0
s = sigmoid(z)
dsigmoid = s * (1 - s)  # ≈ 0.105 — derivative is small when the output is saturated
```
Higher-order
The Hessian is the matrix of second partial derivatives:
H[i,j] = ∂²f / (∂xᵢ ∂xⱼ)
| Property | Meaning |
|---|---|
| Hessian positive definite | Local minimum |
| Hessian negative definite | Local maximum |
| Hessian indefinite | Saddle point |
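Applying this to the running example f(x, y) = x² + 3xy: the second partials are ∂²f/∂x² = 2, ∂²f/∂x∂y = ∂²f/∂y∂x = 3, ∂²f/∂y² = 0, and the eigenvalues of the resulting Hessian reveal a saddle point:

```python
import numpy as np

# Hessian of f(x, y) = x**2 + 3*x*y (constant, since f is quadratic)
H = np.array([[2.0, 3.0],
              [3.0, 0.0]])

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order
eigvals = np.linalg.eigvalsh(H)
# One negative and one positive eigenvalue → indefinite → saddle point
```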
The Hessian is n × n for n parameters — far too large to store for neural nets with millions of parameters. That’s why second-order optimizers (L-BFGS, natural gradient) work with approximations rather than computing it fully.