Expectation and Variance

What

  • Expectation (mean) E[X]: the average value you’d get over many trials
  • Variance Var(X): how spread out values are around the mean
  • Standard deviation σ = √Var(X): variance in the original units
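
A quick numpy sketch of all three quantities (the sample values are made up for illustration):

```python
import numpy as np

# Toy sample of observed values
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()   # estimate of E[X]
var = x.var()     # E[(X - E[X])^2], population variance (ddof=0)
std = x.std()     # sqrt of variance, back in the original units

print(mean, var, std)  # 5.0 4.0 2.0
```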

Why it matters

  • Bias-variance tradeoff: core concept in model selection
  • Batch normalization: normalizes using running mean and variance
  • Loss functions: MSE = expectation of squared errors
  • Feature scaling: subtract mean, divide by std dev → zero mean, unit variance (standard normal if X was normal to begin with)
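
The scaling recipe from the last bullet, sketched in numpy (feature values are invented for illustration):

```python
import numpy as np

features = np.array([10.0, 12.0, 14.0, 18.0, 26.0])

# Standardize: subtract the mean, divide by the standard deviation
standardized = (features - features.mean()) / features.std()

print(standardized.mean())  # ~0.0
print(standardized.std())   # ~1.0
```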

Key ideas

E[X] = Σ xᵢ × P(xᵢ)                  # discrete
E[X] = ∫ x × f(x) dx                  # continuous (f is the density)
Var(X) = E[(X - E[X])²] = E[X²] - E[X]²
  • High variance (of a model's predictions) → small changes in the training data swing the predictions → overfitting risk
  • Low variance → predictions are stable, but if it comes with high bias the model may underfit
  • Law of large numbers: as you draw more samples, the sample mean converges to E[X]. This is why larger batches give more stable gradient estimates.
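
The law of large numbers in action, as a small simulation (a fair die has E[X] = 3.5; the values and seed here are just for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
true_mean = 3.5                 # E[X] for a fair six-sided die

# Sample means approach E[X] as the sample (batch) grows
for n in (10, 1_000, 100_000):
    sample_mean = rng.integers(1, 7, size=n).mean()
    print(f"n={n:>6}: sample mean = {sample_mean:.3f}")
```

The same effect is why a gradient averaged over a larger batch is a less noisy estimate of the true expected gradient.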

Properties

Linearity of expectation is unreasonably useful — it holds even for dependent variables:

E[aX + bY] = a·E[X] + b·E[Y]          # always true
Var(aX) = a² · Var(X)
Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X,Y)
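
A numerical check of both identities, deliberately using dependent variables (the coefficients and distributions are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 100_000)
y = 0.5 * x + rng.normal(0.0, 1.0, 100_000)  # y depends on x

a, b = 2.0, 3.0
# Linearity of expectation holds even though x and y are dependent
lhs = np.mean(a * x + b * y)
rhs = a * x.mean() + b * y.mean()
print(lhs, rhs)

# Variance of a sum needs the covariance cross-term
var_sum = np.var(x + y)
var_parts = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(var_sum, var_parts)
```

Note `bias=True` makes `np.cov` use the population normalization (divide by n), matching numpy's default for `np.var`, so the identity holds exactly up to floating-point error.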

Covariance and correlation

Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]·E[Y]
Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y)    # normalized to [-1, 1]
  • Cov > 0: X and Y tend to increase together
  • Cov = 0: no linear relationship (but could still be nonlinearly dependent)
  • Correlation = 1: perfect positive linear relationship
import numpy as np

rng = np.random.default_rng(0)               # seeded so the output is reproducible
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 2 * x + rng.normal(0, 0.5, 5)            # noisy linear relationship
print(f"Cov: {np.cov(x, y)[0,1]:.2f}")       # sample covariance (ddof=1)
print(f"Corr: {np.corrcoef(x, y)[0,1]:.2f}") # close to 1: strong linear relationship

Connection to neural nets

Batch normalization computes E[X] and Var(X) over each mini-batch, then normalizes: (x - mean) / sqrt(var + eps). During inference, it uses running averages accumulated during training. This is why batch stats matter — they’re literally expectation and variance, estimated from data.
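
A minimal sketch of that normalization step (real batch norm layers also learn a per-feature scale γ and shift β, omitted here; `batch_norm` is a hypothetical helper name, not a library API):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis: (x - mean) / sqrt(var + eps)."""
    mean = x.mean(axis=0)  # E[X] per feature, estimated from the mini-batch
    var = x.var(axis=0)    # Var(X) per feature
    return (x - mean) / np.sqrt(var + eps)

# Fake mini-batch: 32 examples, 4 features, deliberately off-center
batch = np.random.default_rng(1).normal(5.0, 2.0, size=(32, 4))
out = batch_norm(batch)
# After normalization: per-feature mean ~0, per-feature variance ~1
print(out.mean(axis=0), out.var(axis=0))
```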