Actor-Critic and PPO
What
Actor-critic methods combine two ideas: a policy (the actor, which decides what to do) and a value function (the critic, which evaluates how good the action was). PPO (Proximal Policy Optimization) is the most widely used actor-critic algorithm: it is the default choice for most RL applications, from robotics to RLHF for language models.
This note goes deeper than Actor-Critic Methods, which covers the basics. Here we focus on the full PPO algorithm and its implementation.
Why actor-critic?
Policy Gradient Methods suffer from high variance: the REINFORCE estimator uses the full return G_t as the signal, which varies wildly between episodes.
The key insight: subtract a baseline from the return. If the baseline is V(s) (the expected return from state s), then:
advantage A(s, a) = Q(s, a) - V(s)
= "how much better was this action than average?"
- A > 0: this action was better than expected → increase its probability
- A < 0: this action was worse than expected → decrease its probability
The critic learns V(s), giving us the baseline. The actor uses the advantage to update the policy. Same expected gradient as REINFORCE, much lower variance.
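The "same expected gradient, lower variance" claim can be checked numerically. A minimal sketch with one state and two actions; the policy probability, Q-values, and score function below are made-up numbers for illustration, not from any environment:

```python
import numpy as np

rng = np.random.default_rng(0)

# One state, two actions, fixed policy pi(a=1) = p with logit parameter theta.
p = 0.3                         # pi(a=1); pi(a=0) = 0.7
q = np.array([1.0, 3.0])        # Q(s, a=0), Q(s, a=1)
v = (1 - p) * q[0] + p * q[1]   # V(s) = E[Q(s, a)] under the policy

# Score function d/dtheta log pi(a) for a Bernoulli(sigmoid(theta)) policy
score = np.array([-p, 1 - p])

a = rng.binomial(1, p, size=200_000)    # sample actions from the policy
g_reinforce = score[a] * q[a]           # REINFORCE estimator: score * return
g_advantage = score[a] * (q[a] - v)     # baseline-subtracted: score * advantage

print(np.allclose(g_reinforce.mean(), g_advantage.mean(), atol=1e-2))  # same mean
print(g_advantage.var() < g_reinforce.var())                           # lower variance
```

Both estimators average to the same gradient (the baseline term has zero expectation because E[score] = 0), but the advantage version has much lower variance.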
A2C: Advantage Actor-Critic
The synchronous version of actor-critic.
For each batch of experience:
1. Run policy, collect trajectories: (s_t, a_t, r_t, s_{t+1})
2. Compute advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t) (1-step TD)
3. Update actor: maximize E[log pi(a|s) * A]
4. Update critic: minimize E[(V(s) - target)^2]
A2C uses the 1-step TD advantage. This has low variance but high bias (because V(s) is an approximation).
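The four steps above can be sketched in PyTorch for a single transition. This is illustrative only: the linear networks, dimensions, and coefficients are arbitrary choices, not from any particular library.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, n_actions = 4, 2
actor = nn.Linear(obs_dim, n_actions)   # outputs action logits
critic = nn.Linear(obs_dim, 1)          # outputs V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

s = torch.randn(1, obs_dim)             # stand-ins for a real transition
s_next = torch.randn(1, obs_dim)
r, gamma = 1.0, 0.99

dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()

# 1-step TD advantage: A = r + gamma * V(s') - V(s); detached so the
# advantage acts as a fixed weight on the actor's log-prob gradient.
with torch.no_grad():
    td_target = r + gamma * critic(s_next)
advantage = (td_target - critic(s)).detach()

actor_loss = -(dist.log_prob(a) * advantage).mean()     # maximize log pi * A
critic_loss = (critic(s) - td_target).pow(2).mean()     # regress V toward target
loss = actor_loss + 0.5 * critic_loss
opt.zero_grad()
loss.backward()
opt.step()
```

In a real implementation the update runs over a batch of transitions and the networks are MLPs, but the loss structure is exactly this.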
Generalized Advantage Estimation (GAE)
GAE provides a smooth tradeoff between bias and variance using a parameter lambda (0 to 1).
delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) # TD error at step t
GAE(gamma, lambda):
A_t = delta_t + (gamma * lambda) * delta_{t+1} + (gamma * lambda)^2 * delta_{t+2} + ...
Equivalently (recursive):
A_T = delta_T # last step
A_t = delta_t + gamma * lambda * A_{t+1} # all other steps
| lambda | Behavior | Bias | Variance |
|---|---|---|---|
| 0 | 1-step TD: A = r + gamma*V(s') - V(s) | High | Low |
| 0.5 | Mix of short and long horizon | Medium | Medium |
| 0.95 | Close to Monte Carlo | Low | Medium-high |
| 1 | Full Monte Carlo return | None | High |
In practice: lambda=0.95, gamma=0.99. This works across most environments.
```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute Generalized Advantage Estimation.

    rewards: (T,) rewards at each step.
    values: (T+1,) value estimates (includes bootstrap for last state).
    dones: (T,) episode termination flags.
    Returns: advantages (T,), returns (T,).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]
        advantages[t] = delta + gamma * lam * next_non_terminal * last_gae
        last_gae = advantages[t]
    returns = advantages + values[:T]
    return advantages, returns
```

PPO: Proximal Policy Optimization
PPO is the workhorse of modern RL. Published by OpenAI in 2017, it remains the dominant algorithm through 2026.
The problem PPO solves
Policy gradient methods are sensitive to step size. Too small → slow learning. Too large → catastrophic policy collapse (the policy jumps to a bad region and can’t recover).
TRPO (Trust Region Policy Optimization) solved this with constrained optimization — limit the KL divergence between old and new policies. It works but is complex to implement (conjugate gradients, line search).
PPO achieves nearly the same effect with a simple clipping trick.
The clipped objective
L_CLIP = E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]
where:
r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t) # probability ratio
eps = 0.2 (typically)
A_t = advantage (from GAE)
How it works:
- r_t measures how much the policy changed. If r_t = 1, no change; r_t = 2 means the action is now 2x more likely.
- When A_t > 0 (good action): we want to increase r_t, but the clip prevents r_t from exceeding (1 + eps). The gradient is zero beyond the clip.
- When A_t < 0 (bad action): we want to decrease r_t, but clip prevents r_t from going below (1 - eps).
The clip kills the gradient when the policy tries to change too much, preventing catastrophic updates.
```python
import torch

def ppo_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Compute PPO clipped surrogate loss.

    log_probs_new: log pi_new(a|s) for actions taken.
    log_probs_old: log pi_old(a|s) (from rollout).
    advantages: GAE advantages.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new / pi_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()  # negative because we maximize
```

The full PPO loss
PPO’s loss has three components:
L = L_CLIP + c1 * L_VALUE + c2 * L_ENTROPY
L_CLIP: clipped surrogate (actor loss)
L_VALUE: (V(s) - returns)^2 (critic loss), often also clipped
L_ENTROPY: -E[pi * log(pi)] (entropy bonus, encourages exploration)
```python
import torch
import torch.nn as nn

def ppo_full_loss(log_probs_new, log_probs_old, values, returns,
                  advantages, entropy, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Full PPO loss with value and entropy components."""
    # Policy loss (clipped surrogate)
    ratio = torch.exp(log_probs_new - log_probs_old)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()
    # Value loss
    value_loss = nn.functional.mse_loss(values, returns)
    # Entropy bonus (higher entropy = more exploration)
    entropy_loss = -entropy.mean()
    total = policy_loss + vf_coef * value_loss + ent_coef * entropy_loss
    return total, policy_loss, value_loss, entropy_loss
```

PPO training loop structure
1. Collect N steps of experience using current policy (rollout phase)
2. Compute GAE advantages and returns
3. For K epochs:
a. Split data into minibatches
b. For each minibatch:
- Recompute log_probs and values with current network
- Compute PPO loss
- Gradient step
4. Go to 1
Key: we reuse the same rollout data for K epochs (typically K=4).
This is why PPO is more sample-efficient than vanilla policy gradient.
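The epoch/minibatch schedule above can be sketched as follows. The array sizes are typical values and the loss computation is elided; in a real loop each minibatch index set selects rollout data for a PPO gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_epochs, minibatch_size = 2048, 4, 64

indices = np.arange(n_steps)   # one index per rollout step
n_updates = 0
for epoch in range(n_epochs):
    rng.shuffle(indices)       # fresh shuffle each epoch
    for start in range(0, n_steps, minibatch_size):
        mb = indices[start:start + minibatch_size]
        # recompute log_probs/values for steps in `mb`, compute PPO loss,
        # take one optimizer step (elided)
        n_updates += 1

print(n_updates)  # 4 epochs * (2048 / 64) = 128 gradient steps per rollout
```

Reshuffling each epoch means every rollout step is seen K times, but in different minibatch groupings, which is where PPO's sample reuse comes from.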
PPO hyperparameters
| Hyperparameter | Typical value | What it controls |
|---|---|---|
| clip_eps | 0.2 | How much policy can change per update |
| gamma | 0.99 | Discount factor |
| gae_lambda | 0.95 | GAE bias-variance tradeoff |
| n_steps | 2048 | Steps per rollout |
| n_epochs | 4-10 | Optimization epochs per rollout |
| minibatch_size | 64-256 | Minibatch size for SGD |
| lr | 3e-4 | Learning rate (often with linear decay) |
| vf_coef | 0.5 | Value loss weight |
| ent_coef | 0.01 | Entropy bonus weight |
| max_grad_norm | 0.5 | Gradient clipping |
These defaults (from stable-baselines3) work for most environments. Tune n_steps and lr first if performance is bad.
Using stable-baselines3
For production and experimentation, use the reference implementation.
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Vectorized environment (parallel rollouts for speed)
env = make_vec_env("LunarLander-v2", n_envs=4)
model = PPO(
    "MlpPolicy", env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
)
model.learn(total_timesteps=500_000)

# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```

Why PPO dominates
- Simple: no second-order optimization (TRPO), no replay buffer (SAC/DQN), no target networks
- Stable: clipping prevents catastrophic updates
- General: works for discrete and continuous actions
- Scalable: trivially parallelized across environments
- Proven: used in ChatGPT’s RLHF, OpenAI Five, robotics, game playing
PPO’s main weakness: sample efficiency. It’s on-policy (can’t reuse old data). Off-policy methods (SAC, TD3) can be more sample-efficient for continuous control. But PPO’s simplicity and stability usually win in practice.
Applications
- Robotics: locomotion, manipulation, dexterous hand control
- Game playing: OpenAI Five (Dota 2), StarCraft, board games
- RLHF: training language models to follow instructions (ChatGPT, Claude)
- Autonomous systems: drone navigation, vehicle control
- Resource management: network routing, chip design, scheduling
Self-test questions
- Why does subtracting a baseline from the return reduce variance without changing the expected gradient?
- Walk through what happens to the PPO loss when r_t = 1.5 and A_t > 0. What if A_t < 0?
- Why does PPO reuse the same rollout data for multiple epochs? What’s the risk?
- What is the difference between GAE with lambda=0 and lambda=1?
- Why is PPO preferred over TRPO despite both achieving “trust region” behavior?
Exercises
- Derive the advantage: Starting from the policy gradient theorem, show that subtracting V(s) from Q(s,a) preserves the expected gradient direction.
- Clipping intuition: Plot the PPO objective as a function of r_t for A_t > 0 and A_t < 0 (with eps=0.2). Verify that the clip prevents r_t from moving too far from 1.
- Compare A2C vs PPO: Train both on CartPole-v1 (use stable-baselines3). Plot learning curves. Which converges faster? Which is more stable across random seeds?
Links
- Policy Gradient Methods — REINFORCE, the foundation
- Actor-Critic Methods — actor-critic overview and basics
- Q-Learning and DQN — value-based alternative
- RL Fundamentals — MDP framework, value functions
- Multi-Agent RL — multi-agent PPO (MAPPO)
- Reward Design and Curriculum — what PPO optimizes for
- Tutorial - PPO from Scratch — full implementation
- Language Models — RLHF uses PPO