Entropy

What

Measures the average “surprise” or uncertainty in a probability distribution.

H(X) = -Σ P(xᵢ) × log₂ P(xᵢ)
  • Fair coin: H = 1 bit (maximum uncertainty for 2 outcomes)
  • Biased coin (99% heads): H ≈ 0.08 bits (very predictable)
  • Uniform over 8 outcomes: H = 3 bits
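
The three examples above can be reproduced with a small helper (a minimal sketch; the `entropy` function name is just for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))             # fair coin -> 1.0
print(entropy([0.99, 0.01]))           # biased coin -> ~0.0808
print(entropy([1/8] * 8))              # uniform over 8 -> 3.0
```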

Why it matters

  • Decision trees: split on the feature that reduces entropy most (information gain)
  • Cross-entropy loss: the standard classification loss
  • Information theory: compression, coding, communication
  • Language models: perplexity = 2^(cross-entropy), with cross-entropy measured in bits (use e^H for nats) — measures how “surprised” the model is
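
The perplexity relation can be sanity-checked on a single toy token probability (the 0.25 here is purely illustrative):

```python
import numpy as np

# If the model assigns the true token probability 0.25, the per-token
# cross-entropy is -log2(0.25) = 2 bits, so perplexity is 2^2 = 4:
# the model is as "surprised" as a uniform guess among 4 tokens.
p_true = 0.25
cross_entropy_bits = -np.log2(p_true)   # 2.0
perplexity = 2 ** cross_entropy_bits    # 4.0
print(cross_entropy_bits, perplexity)
```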

Key ideas

  • High entropy = uniform = uncertain = hard to predict
  • Low entropy = peaked = confident = easy to predict
  • Maximum entropy = uniform distribution
  • Information gain = entropy before split - weighted entropy after split

Decision tree splitting in detail

A decision tree picks the feature and threshold that maximizes information gain at each node:

IG(S, feature) = H(S) - Σᵥ (|Sᵥ|/|S|) × H(Sᵥ)

where S is the current set, Sᵥ are the subsets after splitting. The feature with highest IG wins. Repeat recursively until a stopping criterion (max depth, min samples, etc.).
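
A minimal sketch of one such split evaluation, on a hypothetical 10-sample node with a binary feature (the data values are made up for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def entropy_of_labels(labels):
    """Entropy of the empirical class distribution at a node."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts / counts.sum())

# Toy node: 10 samples, balanced labels, one binary feature
labels  = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
feature = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])

ig = entropy_of_labels(labels)            # H(S) = 1.0 bit (balanced)
for v in np.unique(feature):
    mask = feature == v
    # subtract |S_v|/|S| * H(S_v) for each child
    ig -= mask.mean() * entropy_of_labels(labels[mask])
print(ig)                                 # -> ~0.278
```

Each split is mostly-pure (4 of 5 samples share a label), so the feature recovers about 0.28 of the parent's 1 bit of uncertainty.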

Quantity             Formula                           Meaning
Joint entropy        H(X,Y) = -Σ P(x,y) log₂ P(x,y)    Uncertainty of X and Y together
Conditional entropy  H(Y|X) = H(X,Y) - H(X)            Uncertainty in Y after observing X
Mutual information   I(X;Y) = H(X) + H(Y) - H(X,Y)     How much knowing X tells you about Y

Key relationships:

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X,Y) <= H(X) + H(Y)            # equality iff X, Y independent
I(X;Y) >= 0                       # always non-negative
I(X;Y) = 0  iff X, Y independent
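
These relationships can be verified numerically on a small joint distribution (the 2×2 table below is an arbitrary correlated example, not from the text):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Correlated 2x2 joint distribution P(X, Y); marginals are uniform
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)                   # marginal of X
py = pxy.sum(axis=0)                   # marginal of Y

h_xy = H(pxy)                          # joint entropy, ~1.722 bits
h_y_given_x = h_xy - H(px)             # H(Y|X) via the chain rule, ~0.722
mi = H(px) + H(py) - h_xy              # mutual information, ~0.278

assert h_xy <= H(px) + H(py) + 1e-12   # subadditivity
assert mi >= -1e-12                    # non-negativity
print(h_y_given_x, mi)
```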

Mutual information is a powerful feature selection tool — it captures nonlinear dependencies that correlation misses. sklearn.feature_selection.mutual_info_classif computes it for you.

from sklearn.feature_selection import mutual_info_classif
import numpy as np

rng = np.random.default_rng(0)               # seed for reproducibility
X = rng.standard_normal((1000, 5))           # 5 candidate features
y = (X[:, 0] + X[:, 2] > 0).astype(int)      # label depends only on features 0 and 2
mi = mutual_info_classif(X, y, random_state=42)
# Features 0 and 2 should get the highest MI scores; 1, 3, 4 near zero