Maximum Likelihood Estimation
What
Find the parameters that make the observed data most probable.
Given data D and model parameters θ:
θ* = argmax_θ P(D | θ)
In practice, maximize the log-likelihood instead (the log is monotonic, so the argmax is unchanged and the math is easier):
θ* = argmax_θ Σᵢ log P(xᵢ | θ)
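A minimal sketch of the definition for a Bernoulli (coin-flip) model, using a grid search as a stand-in for argmax; the data values are made up for illustration:

```python
import math

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

def log_likelihood(theta, xs):
    # Σᵢ log P(xᵢ | θ) under a Bernoulli(θ) model.
    return sum(math.log(theta if x == 1 else 1 - theta) for x in xs)

# Grid search over θ in (0, 1) as a stand-in for argmax.
grid = [i / 1000 for i in range(1, 1000)]
theta_star = max(grid, key=lambda t: log_likelihood(t, data))

# The closed-form Bernoulli MLE is the sample mean — they agree.
print(theta_star)             # 0.7
print(sum(data) / len(data))  # 0.7
```

The grid search recovers the textbook closed-form answer (fraction of heads), which is a useful sanity check whenever a closed form exists.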
Why it matters
MLE is why loss functions look the way they do:
- MSE loss = MLE assuming Gaussian noise
- Cross-entropy loss = MLE for classification (Bernoulli/categorical)
- Negative log-likelihood = the general MLE loss
When you minimize cross-entropy, you’re doing MLE.
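To make the cross-entropy/MLE identity concrete, here is a small numerical check (predicted probabilities and labels are made-up values): binary cross-entropy computed the usual way equals the negative mean log-likelihood under a Bernoulli model.

```python
import math

# Made-up predicted probabilities and binary labels.
probs  = [0.9, 0.2, 0.8, 0.6]
labels = [1,   0,   1,   0]

# Binary cross-entropy, averaged over examples.
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for p, y in zip(probs, labels)) / len(labels)

# Negative mean log-likelihood under a Bernoulli model.
nll = -sum(math.log(p if y == 1 else 1 - p)
           for p, y in zip(probs, labels)) / len(labels)

print(abs(bce - nll) < 1e-12)  # True: the two quantities are identical
```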
Key ideas
- Likelihood: P(D | θ) read as a function of θ, not a probability distribution over θ
- Log trick: products become sums → numerically stable, easier derivatives
- MLE can overfit: it maximizes fit to training data with no penalty for complexity
- MAP (Maximum A Posteriori): MLE + a prior on parameters → equivalent to regularization (e.g. a Gaussian prior gives an L2 penalty)
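A sketch of the MAP-as-regularization idea for the simplest case, estimating a Gaussian mean μ with a zero-mean Gaussian prior. The negative log-posterior is Σᵢ (xᵢ − μ)²/(2σ²) + μ²/(2τ²) + const, i.e. squared-error loss plus an L2 penalty on μ. All numbers below are made up for illustration:

```python
# Made-up observations and assumed variances.
data   = [2.1, 1.9, 2.3, 2.0]
sigma2 = 1.0   # assumed noise variance σ²
tau2   = 0.5   # assumed prior variance τ² (smaller = stronger regularization)

n = len(data)
mle     = sum(data) / n                         # plain MLE: the sample mean
map_est = sum(data) / (n + sigma2 / tau2)       # MAP: shrunk toward the prior mean (0)

print(mle, map_est)  # the MAP estimate sits between the MLE and 0
```

The `sigma2 / tau2` term plays the role of a regularization strength: as the prior gets tighter (τ² → 0), the estimate is pulled harder toward 0; as it gets flatter (τ² → ∞), MAP collapses back to MLE.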