Expectation and Variance
What
- Expectation (mean) E[X]: the average value you’d get over many trials
- Variance Var(X): how spread out values are around the mean
- Standard deviation σ = √Var(X): variance in the original units
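All three quantities are one-liners in NumPy; a quick illustrative check (the sample values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = x.mean()  # sample estimate of E[X]
var = x.var()    # population variance (ddof=0): mean of squared deviations
std = x.std()    # sqrt of variance, back in the original units
print(mean, var, std)  # → 5.0 4.0 2.0
```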
Why it matters
- Bias-variance tradeoff: core concept in model selection
- Batch normalization: normalizes using running mean and variance
- Loss functions: MSE = expectation of squared errors
- Feature scaling: subtract mean, divide by std dev → zero mean, unit variance (standard normal only if X was normal to begin with)
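The feature-scaling bullet can be sketched directly; `features` here is synthetic data, not any particular dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=10.0, scale=3.0, size=1000)

# subtract mean, divide by std dev → mean ≈ 0, std ≈ 1
scaled = (features - features.mean()) / features.std()
print(scaled.mean(), scaled.std())
```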
Key ideas
E[X] = Σ xᵢ × P(xᵢ) # discrete
E[X] = ∫ x × f(x) dx # continuous (f is the density)
Var(X) = E[(X - E[X])²] = E[X²] - E[X]²
- High variance → data is spread out → estimates computed from it are noisier
- In the bias-variance sense: a high-variance model's predictions swing with the training data (overfitting); a low-variance, high-bias model is too rigid and might underfit
- Law of large numbers: as you draw more samples, the sample mean converges to E[X]. This is why larger batches give more stable gradient estimates.
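A minimal check of the discrete formulas and the law of large numbers, using a fair die (E[X] = 3.5, Var(X) = 35/12):

```python
import numpy as np

# discrete formulas: E[X] = Σ x·P(x), Var(X) = E[X²] - E[X]²
vals = np.arange(1, 7)
p = np.ones(6) / 6
ex = (vals * p).sum()               # 3.5
var = (vals**2 * p).sum() - ex**2   # 35/12 ≈ 2.917

# law of large numbers: bigger samples → sample mean closer to E[X]
rng = np.random.default_rng(42)
small = rng.integers(1, 7, size=10).mean()
large = rng.integers(1, 7, size=100_000).mean()
print(ex, var, small, large)  # `large` sits much closer to 3.5
```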
Properties
Linearity of expectation is unreasonably useful — it holds even for dependent variables:
E[aX + bY] = a·E[X] + b·E[Y] # always true
Var(aX) = a² · Var(X)
Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X,Y)
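These identities hold exactly for sample statistics too, as long as the same ddof=0 convention is used throughout; a sketch with deliberately dependent x and y:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # y depends on x

# linearity of expectation: holds despite the dependence
a, b = 3.0, -2.0
lhs = (a * x + b * y).mean()
rhs = a * x.mean() + b * y.mean()

# Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y)
cov = np.cov(x, y, ddof=0)[0, 1]
print(np.var(x + y), np.var(x) + np.var(y) + 2 * cov)
```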
Covariance and correlation
Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]·E[Y]
Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y) # normalized to [-1, 1]
- Cov > 0: X and Y tend to increase together
- Cov = 0: no linear relationship (but could still be nonlinearly dependent)
- Correlation = 1: perfect positive linear relationship
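The Cov = 0 caveat is easy to see with a symmetric example: here y is a deterministic function of x, yet the covariance vanishes:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2                    # perfectly dependent on x, but not linearly
print(np.cov(x, y)[0, 1])   # ≈ 0: covariance misses the relationship
```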
```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 2 * x + np.random.normal(0, 0.5, 5)  # roughly linear in x, plus noise
print(f"Cov: {np.cov(x, y)[0,1]:.2f}")
print(f"Corr: {np.corrcoef(x, y)[0,1]:.2f}")
```
Connection to neural nets
Batch normalization computes E[X] and Var(X) over each mini-batch, then normalizes: (x - mean) / sqrt(var + eps). During inference, it uses running averages accumulated during training. This is why batch stats matter — they’re literally expectation and variance, estimated from data.
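A minimal training-mode sketch of that normalization step (the learnable gamma/beta scale-and-shift and the running-average bookkeeping are omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature using the mini-batch's own mean and variance."""
    mean = x.mean(axis=0)   # E[X] per feature, estimated from the batch
    var = x.var(axis=0)     # Var(X) per feature
    return (x - mean) / np.sqrt(var + eps)

batch = np.random.default_rng(0).normal(5.0, 2.0, size=(32, 4))
normed = batch_norm(batch)
print(normed.mean(axis=0), normed.std(axis=0))  # ≈ 0 and ≈ 1 per feature
```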
Links
- Distributions
- Probability Basics
- Bias-Variance Tradeoff
- Feature Scaling
- Eigenvalues and Eigenvectors — PCA uses the covariance matrix