Cross-Entropy and KL Divergence

What

Cross-Entropy

Measures the average number of bits needed to encode data from distribution P using a code optimized for distribution Q (bits when the log is base 2; nats for the natural log):

H(P, Q) = -Σ P(xᵢ) × log Q(xᵢ)
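A minimal sketch of this formula in plain Python (the distributions and helper name are illustrative, not from the source). Using log base 2 gives the answer in bits; terms with P(xᵢ) = 0 are skipped by convention since 0 × log 0 = 0:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) * log2 Q(x), in bits. Skips terms where P(x) = 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # true distribution P
q = [0.25, 0.5, 0.25]  # coding distribution Q

print(cross_entropy(p, p))  # 1.5  -> equals H(P), the best achievable
print(cross_entropy(p, q))  # 1.75 -> extra 0.25 bits from using the wrong code
```

Note that H(P, P) = H(P): the cross-entropy is minimized when the code matches the true distribution.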

KL Divergence

The extra information needed when encoding P with a code optimized for Q — often read as the “distance” from Q to P, though it is not a true metric (it is asymmetric and violates the triangle inequality):

KL(P || Q) = H(P, Q) - H(P) = Σ P(xᵢ) × log(P(xᵢ) / Q(xᵢ))
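A quick numerical check of the identity KL(P || Q) = H(P, Q) − H(P), using small hand-picked distributions (the function names are illustrative):

```python
import math

def entropy(p):
    """H(P) = -sum P(x) * log2 P(x), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) * log2 Q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P || Q) = sum P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

# Both routes give the same answer: the direct sum and H(P, Q) - H(P).
print(kl(p, q))                          # 0.25
print(cross_entropy(p, q) - entropy(p))  # 0.25
```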

Why it matters

Cross-entropy is the standard loss function for classification.

For binary classification:

Loss = -[y × log(ŷ) + (1-y) × log(1-ŷ)]

For multi-class:

Loss = -Σ yᵢ × log(ŷᵢ)

Where y is the true distribution (one-hot) and ŷ is the model’s predicted probabilities.
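Both losses can be sketched directly from the formulas above (plain Python, illustrative inputs; a real implementation would clip ŷ away from 0 and 1 for numerical stability):

```python
import math

def binary_ce(y, y_hat):
    """Loss = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)], natural log."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def multiclass_ce(y, y_hat):
    """Loss = -sum y_i * log(y_hat_i). With one-hot y, only the true-class term survives."""
    return -sum(yi * math.log(yhi) for yi, yhi in zip(y, y_hat) if yi > 0)

# Confident and correct -> small loss; confident and wrong -> large loss.
print(binary_ce(1, 0.9))  # ~0.105
print(binary_ce(1, 0.1))  # ~2.303

# One-hot target, class 1 is true; loss reduces to -log(0.8).
print(multiclass_ce([0, 1, 0], [0.1, 0.8, 0.1]))  # ~0.223
```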

Key ideas

  • Minimizing cross-entropy = maximizing likelihood = making Q match P
  • KL divergence ≥ 0, equals 0 only when P = Q
  • KL divergence is NOT symmetric: KL(P||Q) ≠ KL(Q||P)
  • Used in VAEs (KL divergence regularizer), knowledge distillation, GANs
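The non-negativity and asymmetry points above are easy to verify numerically (a small sketch with made-up distributions):

```python
import math

def kl(p, q):
    """KL(P || Q) = sum P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

print(kl(p, q))  # ~0.531 bits
print(kl(q, p))  # ~0.737 bits -> different value: KL is not symmetric
print(kl(p, p))  # 0.0        -> zero only when the distributions match
```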