RLHF and Alignment
What
Reinforcement Learning from Human Feedback — the technique that makes LLMs helpful, harmless, and honest. Bridges the gap between “predicts next token” and “follows instructions well.”
The three-step process
1. Supervised Fine-Tuning (SFT)
Train the base model on high-quality instruction-response pairs.
User: "Explain quantum computing simply"
Assistant: [high-quality response written by humans]
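SFT is ordinary next-token training, but the loss is computed only on the response tokens, not the prompt. A minimal sketch (the function name and toy numbers are illustrative, not from any library):

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Supervised fine-tuning loss: average negative log-likelihood
    over response tokens only; prompt tokens are masked out."""
    masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(masked) / len(masked)

logps = [-0.1, -0.2, -1.5, -0.8]   # per-token log-probs from the model
mask  = [0, 0, 1, 1]               # 0 = prompt token, 1 = response token
print(sft_loss(logps, mask))       # → 1.15 (only the last two tokens count)
```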
2. Reward Model Training
Collect human rankings of model outputs. Train a reward model that scores responses.
Response A > Response B > Response C (human ranking)
→ Reward model learns to predict human preferences
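The standard training objective here is a Bradley–Terry pairwise loss: the reward model is penalized when it scores the rejected response higher than the chosen one. A sketch with illustrative scores:

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """Bradley-Terry ranking loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_loss(2.0, 0.5))  # small loss: model agrees with the ranking
print(pairwise_loss(0.5, 2.0))  # large loss: model disagrees
```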
3. RL Optimization (PPO)
Use the reward model as the reward signal. Optimize the LLM’s policy with PPO to maximize the reward.
LLM generates response → Reward model scores it → PPO updates LLM
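Two pieces make this loop stable in practice: a per-token KL penalty keeps the policy close to the SFT reference model, and PPO's clipped surrogate caps how far one update can move the policy. A sketch of both (simplified to scalars; real implementations operate on token batches):

```python
import math

def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Shaped reward used in RLHF-PPO: reward model score minus a KL
    penalty that discourages drifting far from the SFT reference."""
    return rm_score - beta * (logp_policy - logp_ref)

def ppo_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate (to be maximized): the probability ratio is
    clipped to [1-eps, 1+eps] so a single step can't overshoot."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

# a large policy jump (ratio ≈ e) gets clipped down to 1.2 * advantage
print(ppo_objective(1.0, 0.0, advantage=1.0))  # → 1.2
```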
Alternatives to RLHF
| Method | Approach |
|---|---|
| DPO (Direct Preference Optimization) | Skip the reward model, optimize preferences directly |
| GRPO (Group Relative Policy Optimization) | Used by DeepSeek-R1, no critic needed |
| SimPO | Simplified preference optimization variant |
| RLAIF | Use AI feedback instead of human feedback |
| Constitutional AI | Self-critique guided by principles |
DPO and its variants (e.g., SimPO) have largely replaced classical RLHF-PPO for open-source alignment, though frontier labs still use RL-based methods (PPO, GRPO) for their strongest models.
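DPO's trick is that the policy's own log-probabilities, measured against the reference model, act as an implicit reward, so no separate reward model is trained. A sketch of the per-pair loss (toy log-prob values, not real model outputs):

```python
import math

def dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one preference pair (w = chosen, l = rejected):
    -log sigmoid(beta * implicit reward margin), where the implicit
    reward is the policy's log-prob shift relative to the reference."""
    margin = (logp_w - logp_w_ref) - (logp_l - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# policy raised the chosen response and lowered the rejected one → low loss
print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
# policy did the opposite → higher loss
print(dpo_loss(-6.0, -5.0, -5.0, -7.0))
```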
Reasoning via RL
DeepSeek-R1 showed that pure RL (no SFT) can produce emergent chain-of-thought reasoning:
- R1-Zero: GRPO on base model — model discovers reasoning on its own
- R1: adds SFT data + additional RL
- R1-Distill: SFT data from R1 used to fine-tune smaller models (Qwen, Llama)
GRPO (Group Relative Policy Optimization, introduced with DeepSeekMath) eliminates the critic model — it estimates advantages by comparing responses sampled for the same prompt against each other. The DeepSeek-R1 work was published in Nature (2025).
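The group-relative trick is just standardizing rewards within each group of samples, so the group mean plays the role a learned value function would. A minimal sketch:

```python
import math

def grpo_advantages(rewards):
    """GRPO advantage estimate: each response's reward standardized
    within its group — no learned critic/value model required."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

# 4 responses sampled for one prompt, scored by a rule-based reward
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise)
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [1.0, -1.0, -1.0, 1.0]
```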
Key insight: reasoning capability can emerge from RL alone and transfer via distillation.
Key papers
- Training language models to follow instructions with human feedback (Ouyang et al., 2022) — InstructGPT
- Direct Preference Optimization (Rafailov et al., 2023) — DPO