Cross-Entropy and KL Divergence
What
Cross-Entropy
Measures the average number of bits needed to encode data from distribution P using a code optimized for distribution Q:
H(P, Q) = -Σ P(xᵢ) × log Q(xᵢ)
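The definition above can be sketched directly in a few lines of plain Python. The function name `cross_entropy` and the example distributions are illustrative choices, not from the original notes; natural log gives nats (swap in `math.log2` for bits):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -Σ P(xᵢ) log Q(xᵢ); terms with P(xᵢ) = 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution P
q = [0.25, 0.25, 0.5]   # coding distribution Q
print(cross_entropy(p, q))  # cost of encoding P with a code built for Q
print(cross_entropy(p, p))  # H(P, P) = H(P), the best achievable cost
```

Note that `cross_entropy(p, p)` recovers the plain entropy H(P), which is always the minimum over choices of Q.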
KL Divergence
Measures the extra bits incurred when data from P is encoded with a code optimized for Q. It is often described as the "distance" from Q to P, though it is not a true metric:
KL(P || Q) = H(P, Q) - H(P) = Σ P(xᵢ) × log(P(xᵢ) / Q(xᵢ))
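The identity KL(P || Q) = H(P, Q) - H(P) can be checked numerically with a small sketch (function names and the sample distributions are illustrative):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P || Q) = Σ P(xᵢ) log(P(xᵢ) / Q(xᵢ))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
# The two ways of computing KL agree up to floating-point error:
assert abs(kl(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
```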
Why it matters
Cross-entropy is THE classification loss function.
For binary classification:
Loss = -[y × log(ŷ) + (1-y) × log(1-ŷ)]
For multi-class:
Loss = -Σ yᵢ × log(ŷᵢ)
Where y is the true distribution (one-hot) and ŷ is the model’s predicted probabilities.
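Both loss formulas above can be written out directly. This is a minimal sketch, not a production implementation; the `eps` clamp (an illustrative choice) guards against log(0) when the model predicts exactly 0 or 1:

```python
import math

def binary_ce(y, y_hat, eps=1e-12):
    """-[y log(ŷ) + (1-y) log(1-ŷ)], with ŷ clamped away from 0 and 1."""
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_ce(y, y_hat, eps=1e-12):
    """-Σ yᵢ log(ŷᵢ) for a one-hot y and predicted probabilities ŷ."""
    return -sum(yi * math.log(max(yhi, eps)) for yi, yhi in zip(y, y_hat))

print(binary_ce(1, 0.9))                          # small: confident and correct
print(binary_ce(1, 0.1))                          # large: confident and wrong
print(categorical_ce([0, 1, 0], [0.1, 0.8, 0.1]))  # only the true class's log counts
```

With a one-hot y, the multi-class sum collapses to -log(ŷ) at the true class, which is why confident wrong predictions are punished so heavily.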
Key ideas
- Minimizing cross-entropy = minimizing KL(P||Q) (since H(P) is fixed by the data) = maximizing likelihood = making Q match P
- KL divergence ≥ 0, equals 0 only when P = Q
- KL divergence is NOT symmetric: KL(P||Q) ≠ KL(Q||P)
- Used in VAEs (KL regularizer on the latent distribution), knowledge distillation (matching teacher and student output distributions), and GANs (via the closely related Jensen-Shannon divergence)
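The asymmetry noted above is easy to see numerically. A small sketch with an illustrative pair of distributions (the `kl` helper repeats the definition from earlier):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]   # a peaked distribution
q = [0.5, 0.5]   # a uniform distribution
forward = kl(p, q)   # KL(P || Q)
reverse = kl(q, p)   # KL(Q || P)
print(forward, reverse)       # the two directions give different values
assert forward != reverse     # KL is not symmetric
```

This asymmetry matters in practice: minimizing KL(P||Q) versus KL(Q||P) pushes Q toward covering all of P's mass versus concentrating on one of its modes.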