Maximum Likelihood Estimation

What

Find the parameters that make the observed data most probable.

Given data D and model parameters θ:

θ* = argmax_θ P(D | θ)

In practice, maximize log-likelihood (same answer, easier math):

θ* = argmax_θ Σᵢ log P(xᵢ | θ)
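A minimal sketch of this in action: estimating the mean of a Gaussian (known σ = 1) by maximizing the summed log-likelihood over a grid of candidates. The function and variable names here (`log_likelihood`, `mu_hat`) are illustrative, not standard API; the grid search stands in for whatever optimizer you would actually use. The maximizer lands on the sample mean, which is the known closed-form MLE for this model.

```python
import math
import random

def log_likelihood(data, mu, sigma=1.0):
    # Σᵢ log N(xᵢ | mu, sigma²) — the quantity MLE maximizes
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(1000)]  # true mean = 2.0

# Crude grid search over candidate means; a real optimizer would do the same job.
candidates = [i / 100 for i in range(0, 400)]
mu_hat = max(candidates, key=lambda mu: log_likelihood(data, mu))

sample_mean = sum(data) / len(data)
print(mu_hat, sample_mean)  # both close to 2.0, and to each other
```

The grid argmax matches the sample mean to within the grid spacing, illustrating that maximizing Σ log P(xᵢ | θ) recovers the analytic estimator.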

Why it matters

MLE is why loss functions look the way they do:

  • MSE loss = MLE assuming Gaussian noise
  • Cross-entropy loss = MLE for classification (Bernoulli/categorical)
  • Negative log-likelihood = the general MLE loss

When you minimize cross-entropy, you’re doing MLE.
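That equivalence can be checked numerically in a few lines: for binary labels, the average cross-entropy loss and the average Bernoulli negative log-likelihood are the same number. The labels and predicted probabilities below are made-up toy values.

```python
import math

y_true = [1, 0, 1, 1]          # observed binary labels (toy data)
p_pred = [0.9, 0.2, 0.7, 0.6]  # model's predicted P(y=1) for each example

# Cross-entropy, written as a loss function:
cross_entropy = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(y_true, p_pred)
) / len(y_true)

# Negative log-likelihood of the same Bernoulli model:
# each example contributes -log P(observed label | p)
nll = -sum(
    math.log(p if y == 1 else 1 - p)
    for y, p in zip(y_true, p_pred)
) / len(y_true)

print(cross_entropy, nll)  # identical
```

The two expressions are algebraically the same term by term, so minimizing one minimizes the other.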

Key ideas

  • Likelihood: probability of data given parameters — NOT probability of parameters
  • Log trick: products become sums → numerically stable, easier derivatives
  • MLE can overfit: it maximizes fit to training data with no penalty for complexity
  • MAP (Maximum A Posteriori): MLE + a prior on parameters → equivalent to regularization (a Gaussian prior gives an L2 penalty)
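The MAP-as-regularization point has a clean one-dimensional sketch. Assume data drawn from N(mu, 1) and a Gaussian prior mu ~ N(0, 1/lam); then the MAP objective is Σᵢ (xᵢ − mu)²/2 + lam·mu²/2, i.e. the MLE loss plus an L2 penalty, and its minimizer has a closed form. Here `lam` is an assumed prior-precision hyperparameter, not something from the notes above.

```python
def map_estimate(data, lam):
    # argmin over mu of  Σᵢ (xᵢ - mu)²/2 + lam·mu²/2
    # Setting the derivative to zero gives mu = Σᵢ xᵢ / (n + lam):
    # the sample mean, shrunk toward the prior mean of 0.
    return sum(data) / (len(data) + lam)

data = [2.1, 1.9, 2.2, 1.8]          # toy observations, sample mean 2.0
print(map_estimate(data, 0.0))        # lam=0: no prior → recovers the MLE, 2.0
print(map_estimate(data, 4.0))        # stronger prior shrinks the estimate to 1.0
```

With lam = 0 the prior vanishes and MAP reduces to MLE; increasing lam pulls the estimate toward the prior mean, exactly the behavior of L2 regularization / weight decay.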