Entropy

What

Measures the average “surprise” or uncertainty in a probability distribution.

H(X) = -Σ P(xᵢ) × log₂ P(xᵢ)
  • Fair coin: H = 1 bit (maximum uncertainty for 2 outcomes)
  • Biased coin (99% heads): H ≈ 0.08 bits (very predictable)
  • Uniform over 8 outcomes: H = 3 bits
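
The three examples above can be reproduced with a small helper (a minimal sketch; the `entropy` function name is just for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))             # fair coin -> 1.0
print(entropy([0.99, 0.01]))           # biased coin -> ~0.0808
print(entropy([1/8] * 8))              # uniform over 8 -> 3.0
```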

Why it matters

  • Decision trees: split on the feature that reduces entropy most (information gain)
  • Cross-entropy loss: the standard classification loss
  • Information theory: compression, coding, communication
  • Language models: perplexity = 2^(cross-entropy), with cross-entropy measured in bits (use e^H for nats) — measures how “surprised” the model is
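
The perplexity relation can be sanity-checked on a single toy token probability (the 0.25 here is purely illustrative):

```python
import numpy as np

# If the model assigns the true token probability 0.25, the per-token
# cross-entropy is -log2(0.25) = 2 bits, so perplexity is 2^2 = 4:
# the model is as "surprised" as a uniform guess among 4 tokens.
p_true = 0.25
cross_entropy_bits = -np.log2(p_true)   # 2.0
perplexity = 2 ** cross_entropy_bits    # 4.0
print(cross_entropy_bits, perplexity)
```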

Key ideas

  • High entropy = uniform = uncertain = hard to predict
  • Low entropy = peaked = confident = easy to predict
  • Maximum entropy = uniform distribution
  • Information gain = entropy before split - weighted entropy after split

Decision tree splitting in detail

A decision tree picks the feature and threshold that maximizes information gain at each node:

IG(S, feature) = H(S) - Σᵥ (|Sᵥ|/|S|) × H(Sᵥ)

where S is the current set, Sᵥ are the subsets after splitting. The feature with highest IG wins. Repeat recursively until a stopping criterion (max depth, min samples, etc.).
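
A minimal sketch of one such split evaluation, on a hypothetical 10-sample node with a binary feature (the data values are made up for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def entropy_of_labels(labels):
    """Entropy of the empirical class distribution at a node."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts / counts.sum())

# Toy node: 10 samples, balanced labels, one binary feature
labels  = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
feature = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])

ig = entropy_of_labels(labels)            # H(S) = 1.0 bit (balanced)
for v in np.unique(feature):
    mask = feature == v
    # subtract |S_v|/|S| * H(S_v) for each child
    ig -= mask.mean() * entropy_of_labels(labels[mask])
print(ig)                                 # -> ~0.278
```

Each split is mostly-pure (4 of 5 samples share a label), so the feature recovers about 0.28 of the parent's 1 bit of uncertainty.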

Quantity             Formula                           Meaning
Joint entropy        H(X,Y) = -Σ P(x,y) log₂ P(x,y)    Uncertainty of X and Y together
Conditional entropy  H(Y|X) = H(X,Y) - H(X)            Uncertainty in Y after observing X
Mutual information   I(X;Y) = H(X) + H(Y) - H(X,Y)     How much knowing X tells you about Y

Key relationships:

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X,Y) <= H(X) + H(Y)            # equality iff X, Y independent
I(X;Y) >= 0                       # always non-negative
I(X;Y) = 0  iff X, Y independent
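
These relationships can be verified numerically on a small joint distribution (the 2×2 table below is an arbitrary correlated example, not from the text):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Correlated 2x2 joint distribution P(X, Y); marginals are uniform
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)                   # marginal of X
py = pxy.sum(axis=0)                   # marginal of Y

h_xy = H(pxy)                          # joint entropy, ~1.722 bits
h_y_given_x = h_xy - H(px)             # H(Y|X) via the chain rule, ~0.722
mi = H(px) + H(py) - h_xy              # mutual information, ~0.278

assert h_xy <= H(px) + H(py) + 1e-12   # subadditivity
assert mi >= -1e-12                    # non-negativity
print(h_y_given_x, mi)
```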

Mutual information is a powerful feature selection tool — it captures nonlinear dependencies that correlation misses. sklearn.feature_selection.mutual_info_classif computes it for you.

from sklearn.feature_selection import mutual_info_classif
import numpy as np

rng = np.random.default_rng(0)               # seed for reproducibility
X = rng.standard_normal((1000, 5))           # 5 candidate features
y = (X[:, 0] + X[:, 2] > 0).astype(int)      # label depends only on features 0 and 2
mi = mutual_info_classif(X, y, random_state=42)
# Features 0 and 2 should get the highest MI scores; 1, 3, 4 near zero