Derivatives

What

The derivative of f(x) measures how much f changes when x changes by a tiny amount. It’s the slope of the function at a point.

f(x) = x²  →  f'(x) = 2x

At x=3, the slope is 6: a small increase in x increases f by about 6× that amount.
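
That claim is easy to sanity-check with a centered finite difference; a minimal sketch (the step size h is an arbitrary small choice):

```python
# Sketch: check f'(3) = 6 for f(x) = x**2 with a centered finite difference.
def f(x):
    return x ** 2

h = 1e-6                                   # tiny step
slope = (f(3 + h) - f(3 - h)) / (2 * h)    # numerical estimate of f'(3)
print(slope)                               # close to 6.0
```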

Why it matters

  • Loss functions: the derivative tells you which direction to adjust weights to reduce error
  • Gradient descent: take the derivative → step in the opposite direction → repeat
  • Learning rate: how big a step to take along the derivative

Training a model = finding the input (weights) that minimizes the output (loss) by following derivatives.
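
The whole recipe fits in a few lines; a toy sketch minimizing f(x) = x², where the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on f(x) = x**2, whose derivative is f'(x) = 2x.
x = 5.0       # initial "weight"
lr = 0.1      # learning rate: how big a step to take
for _ in range(100):
    grad = 2 * x         # derivative at the current point
    x = x - lr * grad    # step in the opposite direction
print(x)                 # close to 0.0, the minimizer
```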

Key ideas

  • Power rule: d/dx(xⁿ) = n·xⁿ⁻¹
  • Sum rule: d/dx(f + g) = f’ + g’
  • Product rule: d/dx(f·g) = f’·g + f·g’
  • Quotient rule: d/dx(f/g) = (f’·g - f·g’) / g²
  • Chain rule preview: d/dx(f(g(x))) = f’(g(x)) · g’(x) — see Chain Rule for the full story
  • Local minimum: where f’(x) = 0 and f”(x) > 0
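
Each rule can be checked numerically against a finite difference; a sketch for the product rule, using f(x) = x² and g(x) = sin(x) at an arbitrarily chosen point:

```python
import math

# Check the product rule d/dx(f*g) = f'*g + f*g'
# for f(x) = x**2 and g(x) = sin(x) at x = 1.3.
def product(x):
    return x ** 2 * math.sin(x)

x, h = 1.3, 1e-6
numeric = (product(x + h) - product(x - h)) / (2 * h)     # finite difference
analytic = 2 * x * math.sin(x) + x ** 2 * math.cos(x)     # f'*g + f*g'
print(abs(numeric - analytic))                            # tiny: the rule holds
```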

MSE loss derivative — a practical example

Mean squared error for one sample: L = (y - y_hat)². Take the derivative w.r.t. the prediction:

dL/d(y_hat) = -2(y - y_hat)

This tells you: if your prediction is too low (y > y_hat), the gradient is negative, so gradient descent pushes y_hat up. Exactly what you want.

y = 5.0
y_hat = 3.0
grad = -2 * (y - y_hat)   # -4.0 — negative means "increase y_hat"
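
Continuing the example, one gradient descent step (with an assumed learning rate of 0.1) moves the prediction toward the target:

```python
# One gradient descent step on the prediction itself.
y, y_hat, lr = 5.0, 3.0, 0.1
grad = -2 * (y - y_hat)      # -4.0
y_hat = y_hat - lr * grad    # 3.0 - 0.1 * (-4.0) = 3.4, closer to y = 5.0
print(y_hat)
```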

Second derivatives

The second derivative f”(x) tells you how the slope itself is changing — it measures curvature (concavity).

  • f”(x) > 0: the function curves upward (concave up); a critical point here is a minimum
  • f”(x) < 0: the function curves downward (concave down); a critical point here is a maximum
  • f”(x) = 0: the test is inconclusive; if the curvature also changes sign there, it’s an inflection point
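
The second-derivative test can be tried numerically; a sketch using f(x) = x³ - 3x, whose critical points (f'(x) = 3x² - 3 = 0) sit at x = ±1:

```python
# Classify critical points of f(x) = x**3 - 3x by the sign of f''.
def f(x):
    return x ** 3 - 3 * x

def second_derivative(g, x, h=1e-4):
    # centered second difference: (g(x+h) - 2g(x) + g(x-h)) / h**2
    return (g(x + h) - 2 * g(x) + g(x - h)) / h ** 2

for x in (-1.0, 1.0):
    kind = "minimum" if second_derivative(f, x) > 0 else "maximum"
    print(x, kind)    # x = -1 is a maximum, x = 1 is a minimum
```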

In optimization, second derivatives tell you whether a critical point (f’=0) is actually a minimum. In higher dimensions, the matrix of all second derivatives is the Hessian — see Partial Derivatives.

Second-order methods like Newton’s method use f” to pick better step sizes, but they’re expensive for large models, so most deep learning sticks to first-order (just the gradient).
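
A toy sketch of the one-dimensional case, stepping x ← x - f'(x)/f''(x) for f(x) = e^x - 2x, whose minimum is at x = ln 2 (the starting point and iteration count are arbitrary):

```python
import math

# Newton's method for minimization: curvature picks the step size.
# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x).
x = 0.0
for _ in range(6):
    grad = math.exp(x) - 2    # f'(x)
    curv = math.exp(x)        # f''(x)
    x = x - grad / curv       # Newton step
print(x)                      # close to ln 2, about 0.6931
```

Note how few iterations it needs compared with plain gradient descent; the cost is computing (and in many dimensions, inverting) the curvature, which is what makes these methods expensive for large models.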