Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. (2022)
Why It Matters
RLHF alignment pipeline: supervised fine-tuning (SFT), then a reward model trained on human preference comparisons, then PPO. Human labelers preferred outputs from the 1.3B InstructGPT over the 175B GPT-3. Blueprint behind ChatGPT.
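The reward-model stage trains on pairwise comparisons: the loss pushes the score of the preferred response above the rejected one via a log-sigmoid of their difference. A minimal sketch of that per-pair loss (function and argument names are illustrative, not from the paper's code):

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Per-comparison reward-model loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    reward_chosen: scalar score the reward model gives the preferred response.
    reward_rejected: scalar score for the rejected response.
    """
    diff = reward_chosen - reward_rejected
    # Numerically plain sigmoid; a real implementation would use a
    # stable log-sigmoid (e.g. torch.nn.functional.logsigmoid).
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

When the two scores tie, the loss is log 2; it shrinks as the margin between chosen and rejected grows, which is what drives the reward model to separate preferred outputs.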
Key Ideas
- Start with a pretrained language model, supervised-fine-tune it on demonstrations, then optimize against a reward model learned from human preferences.
- Use PPO as the policy optimization step so the model shifts toward outputs people prefer rather than only next-token likelihood.
- Show that post-training can substantially improve instruction following and perceived helpfulness.
- Frame alignment as an engineering problem of objective design and feedback collection rather than only scaling pretraining.
Notes
- RLHF became the standard post-training pattern for chat-style assistants.
- The paper matters because it made human preference optimization a mainstream part of model development.