Partial Derivatives

What

When a function has multiple inputs, a partial derivative measures change with respect to one input while holding others constant.

f(x, y) = x² + 3xy
∂f/∂x = 2x + 3y    (treat y as constant)
∂f/∂y = 3x          (treat x as constant)
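A quick way to sanity-check partials like these is a central-difference approximation. A minimal sketch using the same f(x, y) = x² + 3xy (the sample point (1, 2) and step size are arbitrary choices):

```python
import numpy as np

def f(x, y):
    return x**2 + 3*x*y

# Central differences: nudge one input, hold the other constant
x, y, h = 1.0, 2.0, 1e-6
df_dx = (f(x + h, y) - f(x - h, y)) / (2*h)   # ≈ 2x + 3y = 8
df_dy = (f(x, y + h) - f(x, y - h)) / (2*h)   # ≈ 3x = 3
```

This is exactly what gradient-checking utilities do to verify hand-derived or backprop gradients.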

Why it matters

A neural network's loss function depends on thousands or millions of weights. The partial derivative with respect to each weight tells you how to adjust that specific weight to reduce the loss.

Key ideas

  • Same rules as regular derivatives — just ignore the other variables
  • Collect all partial derivatives into a vector → that’s the Gradient
  • Jacobian: matrix of all partial derivatives for vector-valued functions
  • Chain rule for composition: if z = f(g(x, y)), then ∂z/∂x = (df/dg) · (∂g/∂x) — this is how backprop works through layers
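The first and last bullets can be sketched together, reusing f(x, y) = x² + 3xy from above (the helper name grad_f and the inner function g(x, y) = xy are illustrative choices, not from the original):

```python
import numpy as np

# Collect both partials of f(x, y) = x² + 3xy into one vector: the gradient
def grad_f(x, y):
    return np.array([2*x + 3*y, 3*x])   # [∂f/∂x, ∂f/∂y]

# Chain rule through a composition: z = f(g(x, y)) with g(x, y) = x·y, f(g) = g²
# ∂z/∂x = (df/dg) · (∂g/∂x) = 2g · y
x, y = 2.0, 3.0
g = x * y
dz_dx = 2*g * y   # matches differentiating z = x²y² directly: 2xy² = 36
```

Backprop is this same chain-rule multiplication applied layer by layer, with the intermediate value g cached on the forward pass.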

More examples

Sigmoid: σ(z) = 1 / (1 + e^(-z))

dσ/dz = σ(z) · (1 - σ(z))

This neat form is why sigmoid was popular early on — the derivative is trivial to compute from the output itself.

Softmax: softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ). Each output depends on every input through the shared denominator, so the partial derivatives form a Jacobian:

∂softmaxᵢ/∂zⱼ = softmaxᵢ · (δᵢⱼ - softmaxⱼ)

where δᵢⱼ = 1 if i=j, else 0. In practice, you rarely hand-code this — autograd handles it.
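A direct translation of the formula above, written with diag/outer rather than an explicit double loop (the test vector z is an arbitrary choice):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # shift by max(z) for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    # J[i, j] = sᵢ · (δᵢⱼ - sⱼ)  →  diag(s) - outer(s, s)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(z)
# Each row sums to 0: nudging any zⱼ can't change the fact that the
# probabilities sum to 1
```

The zero row sums are a useful sanity check when implementing this by hand.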

import numpy as np
 
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
 
z = 2.0
s = sigmoid(z)
dsigmoid = s * (1 - s)  # 0.105 — derivative is small when output is saturated

Higher-order

The Hessian is the matrix of second partial derivatives:

H[i,j] = ∂²f / (∂xᵢ ∂xⱼ)
Property                     Meaning
Hessian positive definite    Local minimum
Hessian negative definite    Local maximum
Hessian indefinite           Saddle point

The Hessian is n × n for n parameters — far too large to store for neural nets with millions of parameters. That’s why second-order optimizers (L-BFGS, natural gradient) approximate it rather than compute it in full.
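For a small function the Hessian can be written out and classified directly. A sketch using f(x, y) = x² + 3xy from the top of the section, with eigenvalue signs standing in for the definiteness tests in the table:

```python
import numpy as np

# Second partials of f(x, y) = x² + 3xy:
# ∂²f/∂x² = 2, ∂²f/∂x∂y = ∂²f/∂y∂x = 3, ∂²f/∂y² = 0
H = np.array([[2.0, 3.0],
              [3.0, 0.0]])

# Definiteness via eigenvalue signs (eigvalsh: symmetric matrices,
# eigenvalues returned in ascending order)
eigvals = np.linalg.eigvalsh(H)
# Mixed signs → indefinite → the critical point is a saddle point
```

Note the Hessian is symmetric here (mixed partials agree), which holds whenever the second partials are continuous.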