Cross-Entropy and KL Divergence

What

Cross-Entropy

Measures the average number of bits needed to encode data from distribution P using a code optimized for distribution Q (bits when the log is base 2; nats for the natural log):

H(P, Q) = -Σ P(xᵢ) × log Q(xᵢ)
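A minimal sketch of this formula in plain Python (the distributions and helper name are illustrative, not from the source). Using log base 2 gives the answer in bits; terms with P(xᵢ) = 0 are skipped by convention since 0 × log 0 = 0:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) * log2 Q(x), in bits. Skips terms where P(x) = 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # true distribution P
q = [0.25, 0.5, 0.25]  # coding distribution Q

print(cross_entropy(p, p))  # 1.5  -> equals H(P), the best achievable
print(cross_entropy(p, q))  # 1.75 -> extra 0.25 bits from using the wrong code
```

Note that H(P, P) = H(P): the cross-entropy is minimized when the code matches the true distribution.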

KL Divergence

The extra information needed when encoding P with a code optimized for Q — often read as the “distance” from Q to P, though it is not a true metric (it is asymmetric and violates the triangle inequality):

KL(P || Q) = H(P, Q) - H(P) = Σ P(xᵢ) × log(P(xᵢ) / Q(xᵢ))
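A quick numerical check of the identity KL(P || Q) = H(P, Q) − H(P), using small hand-picked distributions (the function names are illustrative):

```python
import math

def entropy(p):
    """H(P) = -sum P(x) * log2 P(x), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) * log2 Q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P || Q) = sum P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

# Both routes give the same answer: the direct sum and H(P, Q) - H(P).
print(kl(p, q))                          # 0.25
print(cross_entropy(p, q) - entropy(p))  # 0.25
```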

Why it matters

Cross-entropy is the standard loss function for classification.

For binary classification:

Loss = -[y × log(ŷ) + (1-y) × log(1-ŷ)]

For multi-class:

Loss = -Σ yᵢ × log(ŷᵢ)

Where y is the true distribution (one-hot) and ŷ is the model’s predicted probabilities.
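Both losses can be sketched directly from the formulas above (plain Python, illustrative inputs; a real implementation would clip ŷ away from 0 and 1 for numerical stability):

```python
import math

def binary_ce(y, y_hat):
    """Loss = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)], natural log."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def multiclass_ce(y, y_hat):
    """Loss = -sum y_i * log(y_hat_i). With one-hot y, only the true-class term survives."""
    return -sum(yi * math.log(yhi) for yi, yhi in zip(y, y_hat) if yi > 0)

# Confident and correct -> small loss; confident and wrong -> large loss.
print(binary_ce(1, 0.9))  # ~0.105
print(binary_ce(1, 0.1))  # ~2.303

# One-hot target, class 1 is true; loss reduces to -log(0.8).
print(multiclass_ce([0, 1, 0], [0.1, 0.8, 0.1]))  # ~0.223
```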

Key ideas

  • Minimizing cross-entropy = maximizing likelihood = making Q match P
  • KL divergence ≥ 0, equals 0 only when P = Q
  • KL divergence is NOT symmetric: KL(P||Q) ≠ KL(Q||P)
  • Used in VAEs (KL divergence regularizer), knowledge distillation, GANs
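The non-negativity and asymmetry points above are easy to verify numerically (a small sketch with made-up distributions):

```python
import math

def kl(p, q):
    """KL(P || Q) = sum P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

print(kl(p, q))  # ~0.531 bits
print(kl(q, p))  # ~0.737 bits -> different value: KL is not symmetric
print(kl(p, p))  # 0.0        -> zero only when the distributions match
```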