Maximum Likelihood Estimation

What

Find the parameters that make the observed data most probable.

Given data D and model parameters θ:

θ* = argmax_θ P(D | θ)

In practice, maximize log-likelihood (same answer, easier math):

θ* = argmax_θ Σᵢ log P(xᵢ | θ)
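A minimal sketch of this in action: estimating the mean of a Gaussian (known σ = 1) by maximizing the summed log-likelihood over a grid of candidates. The function and variable names here (`log_likelihood`, `mu_hat`) are illustrative, not standard API; the grid search stands in for whatever optimizer you would actually use. The maximizer lands on the sample mean, which is the known closed-form MLE for this model.

```python
import math
import random

def log_likelihood(data, mu, sigma=1.0):
    # Σᵢ log N(xᵢ | mu, sigma²) — the quantity MLE maximizes
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(1000)]  # true mean = 2.0

# Crude grid search over candidate means; a real optimizer would do the same job.
candidates = [i / 100 for i in range(0, 400)]
mu_hat = max(candidates, key=lambda mu: log_likelihood(data, mu))

sample_mean = sum(data) / len(data)
print(mu_hat, sample_mean)  # both close to 2.0, and to each other
```

The grid argmax matches the sample mean to within the grid spacing, illustrating that maximizing Σ log P(xᵢ | θ) recovers the analytic estimator.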

Why it matters

MLE is why loss functions look the way they do:

  • MSE loss = MLE assuming Gaussian noise
  • Cross-entropy loss = MLE for classification (Bernoulli/categorical)
  • Negative log-likelihood = the general MLE loss

When you minimize cross-entropy, you’re doing MLE.
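That equivalence can be checked numerically in a few lines: for binary labels, the average cross-entropy loss and the average Bernoulli negative log-likelihood are the same number. The labels and predicted probabilities below are made-up toy values.

```python
import math

y_true = [1, 0, 1, 1]          # observed binary labels (toy data)
p_pred = [0.9, 0.2, 0.7, 0.6]  # model's predicted P(y=1) for each example

# Cross-entropy, written as a loss function:
cross_entropy = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(y_true, p_pred)
) / len(y_true)

# Negative log-likelihood of the same Bernoulli model:
# each example contributes -log P(observed label | p)
nll = -sum(
    math.log(p if y == 1 else 1 - p)
    for y, p in zip(y_true, p_pred)
) / len(y_true)

print(cross_entropy, nll)  # identical
```

The two expressions are algebraically the same term by term, so minimizing one minimizes the other.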

Key ideas

  • Likelihood: probability of data given parameters — NOT probability of parameters
  • Log trick: products become sums → numerically stable, easier derivatives
  • MLE can overfit: it maximizes fit to training data with no penalty for complexity
  • MAP (Maximum A Posteriori): MLE + a prior on parameters → equivalent to regularization (a Gaussian prior gives an L2 penalty)
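The MAP-as-regularization point has a clean one-dimensional sketch. Assume data drawn from N(mu, 1) and a Gaussian prior mu ~ N(0, 1/lam); then the MAP objective is Σᵢ (xᵢ − mu)²/2 + lam·mu²/2, i.e. the MLE loss plus an L2 penalty, and its minimizer has a closed form. Here `lam` is an assumed prior-precision hyperparameter, not something from the notes above.

```python
def map_estimate(data, lam):
    # argmin over mu of  Σᵢ (xᵢ - mu)²/2 + lam·mu²/2
    # Setting the derivative to zero gives mu = Σᵢ xᵢ / (n + lam):
    # the sample mean, shrunk toward the prior mean of 0.
    return sum(data) / (len(data) + lam)

data = [2.1, 1.9, 2.2, 1.8]          # toy observations, sample mean 2.0
print(map_estimate(data, 0.0))        # lam=0: no prior → recovers the MLE, 2.0
print(map_estimate(data, 4.0))        # stronger prior shrinks the estimate to 1.0
```

With lam = 0 the prior vanishes and MAP reduces to MLE; increasing lam pulls the estimate toward the prior mean, exactly the behavior of L2 regularization / weight decay.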