RLHF and Alignment
What
Reinforcement Learning from Human Feedback — the technique that makes LLMs helpful, harmless, and honest. Bridges the gap between “predicts next token” and “follows instructions well.”
The three-step process
1. Supervised Fine-Tuning (SFT)
Train the base model on high-quality instruction-response pairs.
User: "Explain quantum computing simply"
Assistant: [high-quality response written by humans]
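SFT is ordinary next-token training, but the loss is computed only on the response tokens, not the prompt. A minimal sketch (the function name and toy numbers are illustrative, not from any library):

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Supervised fine-tuning loss: average negative log-likelihood
    over response tokens only; prompt tokens are masked out."""
    masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(masked) / len(masked)

logps = [-0.1, -0.2, -1.5, -0.8]   # per-token log-probs from the model
mask  = [0, 0, 1, 1]               # 0 = prompt token, 1 = response token
print(sft_loss(logps, mask))       # → 1.15 (only the last two tokens count)
```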
2. Reward Model Training
Collect human rankings of model outputs. Train a reward model that scores responses.
Response A > Response B > Response C (human ranking)
→ Reward model learns to predict human preferences
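The standard training objective here is a Bradley–Terry pairwise loss: the reward model is penalized when it scores the rejected response higher than the chosen one. A sketch with illustrative scores:

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """Bradley-Terry ranking loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_loss(2.0, 0.5))  # small loss: model agrees with the ranking
print(pairwise_loss(0.5, 2.0))  # large loss: model disagrees
```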
3. RL Optimization (PPO)
Use the reward model as the reward signal. Optimize the LLM’s policy with PPO to maximize the reward.
LLM generates response → Reward model scores it → PPO updates LLM
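Two pieces make this loop stable in practice: a per-token KL penalty keeps the policy close to the SFT reference model, and PPO's clipped surrogate caps how far one update can move the policy. A sketch of both (simplified to scalars; real implementations operate on token batches):

```python
import math

def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Shaped reward used in RLHF-PPO: reward model score minus a KL
    penalty that discourages drifting far from the SFT reference."""
    return rm_score - beta * (logp_policy - logp_ref)

def ppo_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate (to be maximized): the probability ratio is
    clipped to [1-eps, 1+eps] so a single step can't overshoot."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

# a large policy jump (ratio ≈ e) gets clipped down to 1.2 * advantage
print(ppo_objective(1.0, 0.0, advantage=1.0))  # → 1.2
```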
Alternatives to RLHF
| Method | Approach |
|---|---|
| DPO (Direct Preference Optimization) | Skip the reward model, optimize preferences directly |
| GRPO (Group Relative Policy Optimization) | Used by DeepSeek-R1, no critic needed |
| SimPO | Simplified preference optimization variant |
| RLAIF | Use AI feedback instead of human feedback |
| Constitutional AI | Self-critique guided by principles |
DPO and its variants (e.g., SimPO) have largely replaced classical RLHF-PPO for open-source alignment, though frontier labs still use RL-based methods (PPO, GRPO) for their strongest models.
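DPO's trick is that the policy's own log-probabilities, measured against the reference model, act as an implicit reward, so no separate reward model is trained. A sketch of the per-pair loss (toy log-prob values, not real model outputs):

```python
import math

def dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one preference pair (w = chosen, l = rejected):
    -log sigmoid(beta * implicit reward margin), where the implicit
    reward is the policy's log-prob shift relative to the reference."""
    margin = (logp_w - logp_w_ref) - (logp_l - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# policy raised the chosen response and lowered the rejected one → low loss
print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
# policy did the opposite → higher loss
print(dpo_loss(-6.0, -5.0, -5.0, -7.0))
```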
Reasoning via RL
DeepSeek-R1 showed that pure RL (no SFT) can produce emergent chain-of-thought reasoning:
- R1-Zero: GRPO on base model — model discovers reasoning on its own
- R1: adds SFT data + additional RL
- R1-Distill: SFT data from R1 used to fine-tune smaller models (Qwen, Llama)
GRPO (Group Relative Policy Optimization, introduced with DeepSeekMath) eliminates the critic model — it estimates advantages by comparing responses sampled for the same prompt against each other. The DeepSeek-R1 work was published in Nature (2025).
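The group-relative trick is just standardizing rewards within each group of samples, so the group mean plays the role a learned value function would. A minimal sketch:

```python
import math

def grpo_advantages(rewards):
    """GRPO advantage estimate: each response's reward standardized
    within its group — no learned critic/value model required."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

# 4 responses sampled for one prompt, scored by a rule-based reward
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise)
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [1.0, -1.0, -1.0, 1.0]
```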
Key insight: reasoning capability can emerge from RL alone and transfer via distillation.
Key papers
- Training language models to follow instructions with human feedback (Ouyang et al., 2022) — InstructGPT
- Direct Preference Optimization (Rafailov et al., 2023) — DPO