Entropy
What
Measures the average “surprise” or uncertainty in a probability distribution.
H(X) = -Σ P(xᵢ) × log₂ P(xᵢ)
- Fair coin: H = 1 bit (maximum uncertainty for 2 outcomes)
- Biased coin (99% heads): H ≈ 0.08 bits (very predictable)
- Uniform over 8 outcomes: H = 3 bits
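The three bullet examples above can be verified with a few lines of NumPy; the helper name `entropy` is just for illustration:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))    # fair coin -> 1.0
print(entropy([0.99, 0.01]))  # biased coin -> ~0.081
print(entropy([1/8] * 8))     # uniform over 8 outcomes -> 3.0
```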
Why it matters
- Decision trees: split on the feature that reduces entropy most (information gain)
- Cross-entropy loss: the standard classification loss
- Information theory: compression, coding, communication
- Language models: perplexity = 2^(cross-entropy) — measures how “surprised” the model is
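The perplexity relationship is easy to see on a toy example. This sketch assumes base-2 logs throughout, and the probabilities are made up for illustration:

```python
import numpy as np

# Probability the model assigned to the true token at each of 4 positions.
true_token_probs = np.array([0.5, 0.25, 0.125, 0.125])

# Cross-entropy in bits: average negative log2-probability of the true tokens.
cross_entropy = -np.mean(np.log2(true_token_probs))  # 2.25 bits

# Perplexity = 2^(cross-entropy): the effective number of equally likely choices
# the model is "hesitating" between at each step.
perplexity = 2 ** cross_entropy  # ~4.76
print(cross_entropy, perplexity)
```

A model that always assigned probability 1 to the true token would have cross-entropy 0 and perplexity 1; a model that guesses uniformly over a vocabulary of size V has perplexity V.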
Key ideas
- High entropy = uniform = uncertain = hard to predict
- Low entropy = peaked = confident = easy to predict
- Maximum entropy = uniform distribution
  (H = log₂ n bits for n outcomes)
- Information gain = entropy before split - weighted entropy after split
Decision tree splitting in detail
A decision tree picks the feature and threshold that maximizes information gain at each node:
IG(S, feature) = H(S) - Σ (|Sᵥ|/|S|) × H(Sᵥ)
where S is the set of samples at the current node and Sᵥ are the subsets produced by the split. The feature (and threshold) with the highest IG wins. Repeat recursively until a stopping criterion is met (max depth, min samples per leaf, etc.).
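The IG formula above can be sketched for a binary split; the helper names are illustrative, not from any particular library:

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """IG = H(parent) - weighted average of the child entropies."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Toy node: 8 samples, and the split separates the classes perfectly.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print(information_gain(parent, left, right))  # 1.0: pure children, maximal gain
```

A useless split (children with the same class mix as the parent) would score an IG of 0.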
Related quantities
| Quantity | Formula | Meaning |
|---|---|---|
| Joint entropy | H(X,Y) = -Σ P(x,y) log P(x,y) | Uncertainty of X and Y together |
| Conditional entropy | H(Y|X) = H(X,Y) - H(X) | Uncertainty in Y after observing X |
| Mutual information | I(X;Y) = H(X) + H(Y) - H(X,Y) | How much knowing X tells you about Y |
Key relationships:
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X,Y) <= H(X) + H(Y) # equality iff X, Y independent
I(X;Y) >= 0 # always non-negative
I(X;Y) = 0 iff X, Y independent
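These identities can be checked numerically on a small made-up joint distribution (the 2×2 table below is chosen arbitrarily to make X and Y correlated):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint distribution P(X, Y) over 2x2 outcomes; X and Y are correlated.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)  # marginal P(X)
py = pxy.sum(axis=0)  # marginal P(Y)

Hxy, Hx, Hy = H(pxy), H(px), H(py)
I = Hx + Hy - Hxy          # mutual information
H_y_given_x = Hxy - Hx     # conditional entropy H(Y|X)

print(I > 0)                          # True: X, Y are dependent
print(Hxy <= Hx + Hy)                 # True: subadditivity
print(abs(Hxy - (Hx + H_y_given_x)))  # ~0: the chain rule holds
```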
Mutual information is a powerful feature selection tool — it captures nonlinear dependencies that correlation misses. sklearn.feature_selection.mutual_info_classif computes it for you.
```python
from sklearn.feature_selection import mutual_info_classif
import numpy as np

rng = np.random.default_rng(42)          # fixed seed for reproducibility
X = rng.standard_normal((1000, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # label depends only on features 0 and 2

mi = mutual_info_classif(X, y, random_state=42)
print(mi)  # features 0 and 2 should get the highest MI scores
```