Actor-Critic and PPO

What

Actor-critic methods combine two ideas: a policy (the actor, which decides what to do) and a value function (the critic, which evaluates how good the action was). PPO (Proximal Policy Optimization) is the most important actor-critic algorithm — it is the default choice for almost every RL application from robotics to RLHF for language models.

This note goes deeper than Actor-Critic Methods, which covers the basics. Here we focus on the full PPO algorithm and its implementation.

Why actor-critic?

Policy Gradient Methods suffer from high variance: the REINFORCE estimator uses the full return G_t as the signal, which varies wildly between episodes.

The key insight: subtract a baseline from the return. If the baseline is V(s) (the expected return from state s), then:

advantage A(s, a) = Q(s, a) - V(s)
                  = "how much better was this action than average?"
  • A > 0: this action was better than expected → increase its probability
  • A < 0: this action was worse than expected → decrease its probability

The critic learns V(s), giving us the baseline. The actor uses the advantage to update the policy. Same expected gradient as REINFORCE, much lower variance.
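The variance reduction can be checked numerically with a toy one-state, two-action example (all numbers here are made up for illustration; `q` plays the role of the true per-action returns):

```python
import numpy as np

rng = np.random.default_rng(0)

# One state, two actions, uniform softmax policy.
probs = np.array([0.5, 0.5])
q = np.array([0.0, 10.0])          # true expected return per action
n = 100_000

actions = rng.choice(2, size=n, p=probs)
returns = q[actions] + rng.normal(0.0, 1.0, size=n)   # noisy episode returns

# Score-function gradient w.r.t. the logit of action 1:
# d log pi(a) / d logit_1 = 1{a=1} - pi(1)
score = (actions == 1).astype(float) - probs[1]

g_plain = score * returns                  # REINFORCE signal: full return G
baseline = probs @ q                       # V(s) = E[G] = 5.0
g_base = score * (returns - baseline)      # signal: advantage G - V(s)

# Same mean (unbiased), much lower variance with the baseline.
print(g_plain.mean(), g_base.mean())
print(g_plain.var(), g_base.var())
```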

A2C: Advantage Actor-Critic

The synchronous version of actor-critic.

For each batch of experience:
  1. Run policy, collect trajectories: (s_t, a_t, r_t, s_{t+1})
  2. Compute advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)  (1-step TD)
  3. Update actor: maximize E[log pi(a|s) * A]
  4. Update critic: minimize E[(V(s) - target)^2]

A2C uses the 1-step TD advantage. This has low variance but high bias (because V(s) is an approximation).
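The four steps above, sketched as a single A2C update in PyTorch. This assumes a discrete action space; the networks and the batch of transitions are placeholders, not part of the original note:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=7e-4)

# Placeholder batch of transitions (s, a, r, s', done).
s = torch.randn(64, obs_dim)
a = torch.randint(0, n_actions, (64,))
r = torch.randn(64)
s_next = torch.randn(64, obs_dim)
done = torch.zeros(64)
gamma = 0.99

# 2. 1-step TD advantage: A = r + gamma * V(s') - V(s)
v = critic(s).squeeze(-1)
with torch.no_grad():
    v_next = critic(s_next).squeeze(-1)
    target = r + gamma * (1 - done) * v_next
advantage = (target - v).detach()

# 3. Actor: maximize E[log pi(a|s) * A]  ->  minimize the negative
log_probs = torch.log_softmax(actor(s), dim=-1)
log_pi_a = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)
actor_loss = -(log_pi_a * advantage).mean()

# 4. Critic: minimize E[(V(s) - target)^2]
critic_loss = (v - target).pow(2).mean()

loss = actor_loss + 0.5 * critic_loss
opt.zero_grad()
loss.backward()
opt.step()
```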

Generalized Advantage Estimation (GAE)

GAE provides a smooth tradeoff between bias and variance using a parameter lambda (0 to 1).

delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)     # TD error at step t

GAE(gamma, lambda):
  A_t = delta_t + (gamma * lambda) * delta_{t+1} + (gamma * lambda)^2 * delta_{t+2} + ...

Equivalently (recursive):
  A_T = delta_T                                    # last step
  A_t = delta_t + gamma * lambda * A_{t+1}         # all other steps
lambda   Behavior                                 Bias     Variance
0        1-step TD: A = r + gamma*V(s') - V(s)    High     Low
0.5      Mix of short and long horizons           Medium   Medium
0.95     Close to Monte Carlo                     Low      Medium-high
1        Full Monte Carlo return                  None     High

In practice: lambda=0.95, gamma=0.99. This works across most environments.

import numpy as np
 
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute Generalized Advantage Estimation.
    rewards: (T,) rewards at each step.
    values: (T+1,) value estimates (includes bootstrap for last state).
    dones: (T,) episode termination flags.
    Returns: advantages (T,), returns (T,).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0
 
    for t in reversed(range(T)):
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]
        advantages[t] = delta + gamma * lam * next_non_terminal * last_gae
        last_gae = advantages[t]
 
    returns = advantages + values[:T]
    return advantages, returns
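A quick self-contained check that the direct sum and the recursion give the same advantages (one episode with made-up rewards and values, no mid-trajectory terminations):

```python
import numpy as np

gamma, lam = 0.99, 0.95
rewards = np.array([1.0, 0.5, -0.2, 2.0])
values = np.array([0.3, 0.1, 0.4, 0.2, 0.0])   # T+1 entries, bootstrap last

T = len(rewards)
deltas = rewards + gamma * values[1:] - values[:-1]   # TD errors

# Direct sum: A_t = sum_k (gamma*lam)^k * delta_{t+k}
direct = np.array([
    sum((gamma * lam) ** k * deltas[t + k] for k in range(T - t))
    for t in range(T)
])

# Recursion: A_T = delta_T, then A_t = delta_t + gamma*lam*A_{t+1}
recursive = np.zeros(T)
recursive[-1] = deltas[-1]
for t in reversed(range(T - 1)):
    recursive[t] = deltas[t] + gamma * lam * recursive[t + 1]

assert np.allclose(direct, recursive)
```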

PPO: Proximal Policy Optimization

PPO is the workhorse of modern RL. Published by OpenAI in 2017, it remains the dominant algorithm through 2026.

The problem PPO solves

Policy gradient methods are sensitive to step size. Too small → slow learning. Too large → catastrophic policy collapse (the policy jumps to a bad region and can’t recover).

TRPO (Trust Region Policy Optimization) solved this with constrained optimization — limit the KL divergence between old and new policies. It works but is complex to implement (conjugate gradients, line search).

PPO achieves nearly the same effect with a simple clipping trick.

The clipped objective

L_CLIP = E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]

where:
  r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t)    # probability ratio
  eps = 0.2 (typically)
  A_t = advantage (from GAE)

How it works:

  • r_t measures how much the policy changed. If r_t = 1, no change. r_t = 2 means the action is now 2x more likely.
  • When A_t > 0 (good action): we want to increase r_t, but clip prevents r_t from exceeding (1 + eps). The gradient is zero beyond the clip.
  • When A_t < 0 (bad action): we want to decrease r_t, but clip prevents r_t from going below (1 - eps).

The clip kills the gradient when the policy tries to change too much, preventing catastrophic updates.

import torch
import torch.nn as nn
 
def ppo_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Compute PPO clipped surrogate loss.
    log_probs_new: log pi_new(a|s) for actions taken.
    log_probs_old: log pi_old(a|s) (from rollout).
    advantages: GAE advantages.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new / pi_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()  # negative because we maximize
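The claim that the gradient vanishes beyond the clip can be verified directly on scalar tensors (toy values, eps = 0.2 as above):

```python
import torch

def clipped_term(log_new, log_old, adv, eps=0.2):
    ratio = torch.exp(log_new - log_old)
    return torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

adv = torch.tensor(1.0)  # a good action: A > 0

# Inside the clip (ratio ~ 1.1 < 1 + eps): gradient flows.
log_new = torch.tensor(0.0953, requires_grad=True)   # exp(0.0953) ~ 1.1
clipped_term(log_new, torch.tensor(0.0), adv).backward()
assert log_new.grad.abs() > 0

# Beyond the clip (ratio ~ 1.5 > 1 + eps): min() picks the clamped
# branch, whose derivative is zero, so the update stops.
log_new = torch.tensor(0.4055, requires_grad=True)   # exp(0.4055) ~ 1.5
clipped_term(log_new, torch.tensor(0.0), adv).backward()
assert log_new.grad.abs() == 0
```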

The full PPO loss

PPO’s loss has three components:

L = L_CLIP + c1 * L_VALUE + c2 * L_ENTROPY

L_CLIP:   clipped surrogate (actor loss)
L_VALUE:  (V(s) - returns)^2 (critic loss), often also clipped
L_ENTROPY: -E[pi * log(pi)] (entropy bonus, encourages exploration)

def ppo_full_loss(log_probs_new, log_probs_old, values, returns,
                  advantages, entropy, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Full PPO loss with value and entropy components."""
    # Policy loss (clipped surrogate)
    ratio = torch.exp(log_probs_new - log_probs_old)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()
 
    # Value loss
    value_loss = nn.functional.mse_loss(values, returns)
 
    # Entropy bonus (higher entropy = more exploration)
    entropy_loss = -entropy.mean()
 
    total = policy_loss + vf_coef * value_loss + ent_coef * entropy_loss
    return total, policy_loss, value_loss, entropy_loss

PPO training loop structure

1. Collect N steps of experience using current policy (rollout phase)
2. Compute GAE advantages and returns
3. For K epochs:
   a. Split data into minibatches
   b. For each minibatch:
      - Recompute log_probs and values with current network
      - Compute PPO loss
      - Gradient step
4. Go to 1

Key: we reuse the same rollout data for K epochs (typically 4-10).
This is why PPO is more sample-efficient than vanilla policy gradient.
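A minimal skeleton of steps 3-4 in PyTorch, with a toy actor-critic and random placeholder data standing in for the rollout and GAE outputs (steps 1-2 are elided):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, n_steps = 4, 2, 256
net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
policy_head = nn.Linear(64, n_actions)
value_head = nn.Linear(64, 1)
params = [*net.parameters(), *policy_head.parameters(), *value_head.parameters()]
opt = torch.optim.Adam(params, lr=3e-4)

# Placeholder rollout: obs, actions, old log-probs, GAE advantages, returns.
obs = torch.randn(n_steps, obs_dim)
actions = torch.randint(0, n_actions, (n_steps,))
old_log_probs = torch.log_softmax(policy_head(net(obs)), -1).gather(
    1, actions.unsqueeze(1)).squeeze(1).detach()
advantages = torch.randn(n_steps)
returns = torch.randn(n_steps)

# 3. K epochs over shuffled minibatches of the same rollout.
for epoch in range(4):
    perm = torch.randperm(n_steps)
    for start in range(0, n_steps, 64):
        idx = perm[start:start + 64]
        # Recompute log_probs and values with the current network.
        h = net(obs[idx])
        dist = torch.distributions.Categorical(logits=policy_head(h))
        log_probs = dist.log_prob(actions[idx])
        values = value_head(h).squeeze(-1)

        # PPO loss (clip_eps=0.2, vf_coef=0.5, ent_coef=0.01).
        ratio = torch.exp(log_probs - old_log_probs[idx])
        adv = advantages[idx]
        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 0.8, 1.2) * adv
        policy_loss = -torch.min(surr1, surr2).mean()
        value_loss = (values - returns[idx]).pow(2).mean()
        entropy = dist.entropy().mean()

        loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(params, 0.5)
        opt.step()
```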

PPO hyperparameters

Hyperparameter   Typical value   What it controls
clip_eps         0.2             How much the policy can change per update
gamma            0.99            Discount factor
gae_lambda       0.95            GAE bias-variance tradeoff
n_steps          2048            Steps per rollout
n_epochs         4-10            Optimization epochs per rollout
minibatch_size   64-256          Minibatch size for SGD
lr               3e-4            Learning rate (often with linear decay)
vf_coef          0.5             Value loss weight
ent_coef         0.01            Entropy bonus weight
max_grad_norm    0.5             Gradient clipping

These defaults (from stable-baselines3) work for most environments. Tune n_steps and lr first if performance is bad.

Using stable-baselines3

For production and experimentation, use the reference implementation.

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
 
# Vectorized environment (parallel rollouts for speed)
env = make_vec_env("LunarLander-v2", n_envs=4)
 
model = PPO(
    "MlpPolicy", env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
)
 
model.learn(total_timesteps=500_000)
 
# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)

Why PPO dominates

  1. Simple: no second-order optimization (TRPO), no replay buffer (SAC/DQN), no target networks
  2. Stable: clipping prevents catastrophic updates
  3. General: works for discrete and continuous actions
  4. Scalable: trivially parallelized across environments
  5. Proven: used in ChatGPT’s RLHF, OpenAI Five, robotics, game playing

PPO’s main weakness: sample efficiency. It’s on-policy (can’t reuse old data). Off-policy methods (SAC, TD3) can be more sample-efficient for continuous control. But PPO’s simplicity and stability usually win in practice.

Applications

  • Robotics: locomotion, manipulation, dexterous hand control
  • Game playing: OpenAI Five (Dota 2), StarCraft, board games
  • RLHF: training language models to follow instructions (ChatGPT, Claude)
  • Autonomous systems: drone navigation, vehicle control
  • Resource management: network routing, chip design, scheduling

Self-test questions

  1. Why does subtracting a baseline from the return reduce variance without changing the expected gradient?
  2. Walk through what happens to the PPO loss when r_t = 1.5 and A_t > 0. What if A_t < 0?
  3. Why does PPO reuse the same rollout data for multiple epochs? What’s the risk?
  4. What is the difference between GAE with lambda=0 and lambda=1?
  5. Why is PPO preferred over TRPO despite both achieving “trust region” behavior?

Exercises

  1. Derive the advantage: Starting from the policy gradient theorem, show that subtracting V(s) from Q(s,a) preserves the expected gradient direction.
  2. Clipping intuition: Plot the PPO objective as a function of r_t for A_t > 0 and A_t < 0 (with eps=0.2). Verify that the clip prevents r_t from moving too far from 1.
  3. Compare A2C vs PPO: Train both on CartPole-v1 (use stable-baselines3). Plot learning curves. Which converges faster? Which is more stable across random seeds?