Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. (2022)
Why It Matters
RLHF alignment pipeline: supervised fine-tuning (SFT), then a reward model trained on human preference comparisons, then PPO. Human labelers preferred outputs from the 1.3B InstructGPT over the 175B GPT-3. Blueprint behind ChatGPT.
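The reward-model stage trains on pairwise comparisons: the loss pushes the score of the preferred response above the rejected one via a log-sigmoid of their difference. A minimal sketch of that per-pair loss (function and argument names are illustrative, not from the paper's code):

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Per-comparison reward-model loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    reward_chosen: scalar score the reward model gives the preferred response.
    reward_rejected: scalar score for the rejected response.
    """
    diff = reward_chosen - reward_rejected
    # Numerically plain sigmoid; a real implementation would use a
    # stable log-sigmoid (e.g. torch.nn.functional.logsigmoid).
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

When the two scores tie, the loss is log 2; it shrinks as the margin between chosen and rejected grows, which is what drives the reward model to separate preferred outputs.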
Key Ideas
- Start with a pretrained language model, supervised-fine-tune it on demonstrations, then optimize against a reward model learned from human preferences.
- Use PPO as the policy optimization step so the model shifts toward outputs people prefer rather than only next-token likelihood.
- Show that post-training can substantially improve instruction following and perceived helpfulness.
- Frame alignment as an engineering problem of objective design and feedback collection rather than only scaling pretraining.
Notes
- RLHF became the standard post-training pattern for chat-style assistants.
- The paper matters because it made human preference optimization a mainstream part of model development.