RLHF and Alignment

What

Reinforcement Learning from Human Feedback — the technique that makes LLMs helpful, harmless, and honest. Bridges the gap between “predicts next token” and “follows instructions well.”

The three-step process

1. Supervised Fine-Tuning (SFT)

Train the base model on high-quality instruction-response pairs.

User: "Explain quantum computing simply"
Assistant: [high-quality response written by humans]
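The SFT objective is ordinary next-token cross-entropy, typically masked so that only the response tokens (not the prompt) contribute to the loss. A toy illustration in pure Python; all probabilities here are made up for the example:

```python
import math

def sft_loss(token_probs, loss_mask):
    """Masked next-token cross-entropy.

    token_probs[i]: model's probability for the correct token at position i.
    loss_mask[i]:   1 for response tokens (trained on), 0 for prompt tokens.
    """
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(losses) / len(losses)

# Two prompt tokens (masked out) followed by three response tokens:
probs = [0.10, 0.20, 0.90, 0.80, 0.70]
mask  = [0,    0,    1,    1,    1]
loss = sft_loss(probs, mask)  # averages -log p over response tokens only
```

Note that the low prompt-token probabilities have no effect on the loss; only the response span is fit.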

2. Reward Model Training

Collect human rankings of model outputs. Train a reward model that scores responses.

Response A > Response B > Response C (human ranking)
→ Reward model learns to predict human preferences
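A common way to turn rankings into a training signal is the Bradley-Terry pairwise loss: each ranked pair (chosen, rejected) pushes the reward model to score the preferred response higher. A minimal sketch with made-up scalar scores:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen, r_rejected):
    # Bradley-Terry objective: maximize P(chosen preferred over rejected),
    # which is modeled as sigmoid(r_chosen - r_rejected).
    return -math.log(sigmoid(r_chosen - r_rejected))

# A ranked triple A > B > C expands into three preference pairs.
# Scores below are illustrative reward-model outputs for A, B, C.
pairs = [(2.0, 0.5), (2.0, -1.0), (0.5, -1.0)]  # (A,B), (A,C), (B,C)
loss = sum(pairwise_loss(a, b) for a, b in pairs) / len(pairs)
```

The loss shrinks as the score margin between preferred and dispreferred responses grows, which is exactly the "predict human preferences" behavior described above.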

3. RL Optimization (PPO)

Use the reward model as the reward signal. Optimize the LLM’s policy with PPO to maximize the reward.

LLM generates response → Reward model scores it → PPO updates LLM
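The update typically uses PPO's clipped surrogate objective, with a KL penalty against the frozen SFT model so the policy doesn't drift too far from it. A single-token sketch of this shaping; the values are illustrative, and folding the KL penalty into the reward per token is one common variant:

```python
import math

def ppo_objective(logp_new, logp_old, logp_ref, reward, eps=0.2, beta=0.1):
    """Clipped PPO surrogate for one token, with KL-shaped reward.

    logp_new: log-prob under the policy being updated
    logp_old: log-prob under the policy that sampled the response
    logp_ref: log-prob under the frozen SFT reference model
    """
    kl = logp_new - logp_ref                  # per-token KL estimate
    advantage = reward - beta * kl            # reward shaped by KL penalty
    ratio = math.exp(logp_new - logp_old)     # importance-sampling ratio
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return min(unclipped, clipped)            # pessimistic (clipped) bound

obj = ppo_objective(logp_new=-1.0, logp_old=-1.2, logp_ref=-0.9, reward=1.0)
```

The clipping keeps the ratio within [1-eps, 1+eps], so a single reward-model score cannot push the policy arbitrarily far in one update.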

Alternatives to RLHF

  • DPO (Direct Preference Optimization): skip the reward model, optimize preferences directly
  • GRPO (Group Relative Policy Optimization): used by DeepSeek-R1, no critic needed
  • SimPO: simplified preference-optimization variant
  • RLAIF: use AI feedback instead of human feedback
  • Constitutional AI: self-critique guided by principles

DPO and its variants (e.g., SimPO) have largely replaced classical RLHF-PPO for open-source alignment, while GRPO — a critic-free PPO variant rather than a DPO variant — is widely used for RL-based reasoning training. Frontier labs still use PPO-based methods for their strongest models.
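DPO collapses the reward model and RL loop into a single classification-style loss over preference pairs, using the log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. A minimal sketch of the DPO loss; all log-prob values below are made up:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair.

    pi_w / pi_l:   summed log-probs of the chosen (w) and rejected (l)
                   responses under the policy being trained.
    ref_w / ref_l: same quantities under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

loss = dpo_loss(pi_w=-10.0, pi_l=-15.0, ref_w=-12.0, ref_l=-13.0)
```

No sampling, no reward model, no PPO loop — just a differentiable loss over logged preference pairs, which is why DPO is so much cheaper to run.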

Reasoning via RL

DeepSeek-R1 showed that RL applied directly to a base model, with no SFT stage, can produce emergent chain-of-thought reasoning:

  1. R1-Zero: GRPO on base model — model discovers reasoning on its own
  2. R1: adds SFT data + additional RL
  3. R1-Distill: SFT data from R1 used to fine-tune smaller models (Qwen, Llama)

GRPO (Group Relative Policy Optimization) eliminates the critic model: it compares responses sampled within a group to estimate advantages. The DeepSeek-R1 work was published in Nature in 2025.
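The group-relative trick is simple: sample several responses per prompt, score each, and normalize the rewards within the group to get advantages, so no learned value network is needed. A sketch; the pass/fail rewards are illustrative:

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each response's reward against
    its own group, replacing the learned critic's value baseline."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid divide-by-zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Four responses sampled for one prompt, scored pass (1.0) or fail (0.0):
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get positive advantage and are reinforced; below-average ones are suppressed, all without a critic network to train.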

Key insight: reasoning capability can emerge from RL alone and transfer via distillation.

Key papers

  • Training language models to follow instructions with human feedback (Ouyang et al., 2022) — InstructGPT
  • Direct Preference Optimization (Rafailov et al., 2023) — DPO
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)