Actor-Critic and PPO

What

Actor-critic methods combine two ideas: a policy (the actor, which decides what to do) and a value function (the critic, which evaluates how good the action was). PPO (Proximal Policy Optimization) is the most important actor-critic algorithm — it is the default choice for almost every RL application from robotics to RLHF for language models.

This note goes deeper than Actor-Critic Methods, which covers the basics. Here we focus on the full PPO algorithm and its implementation.

Why actor-critic?

Policy Gradient Methods suffer from high variance: the REINFORCE estimator uses the full return G_t as the signal, which varies wildly between episodes.

The key insight: subtract a baseline from the return. If the baseline is V(s) (the expected return from state s), then:

advantage A(s, a) = Q(s, a) - V(s)
                  = "how much better was this action than average?"
  • A > 0: this action was better than expected → increase its probability
  • A < 0: this action was worse than expected → decrease its probability

The critic learns V(s), giving us the baseline. The actor uses the advantage to update the policy. Same expected gradient as REINFORCE, much lower variance.
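The variance reduction can be checked numerically with a toy one-state, two-action example (all numbers here are made up for illustration; `q` plays the role of the true per-action returns):

```python
import numpy as np

rng = np.random.default_rng(0)

# One state, two actions, uniform softmax policy.
probs = np.array([0.5, 0.5])
q = np.array([0.0, 10.0])          # true expected return per action
n = 100_000

actions = rng.choice(2, size=n, p=probs)
returns = q[actions] + rng.normal(0.0, 1.0, size=n)   # noisy episode returns

# Score-function gradient w.r.t. the logit of action 1:
# d log pi(a) / d logit_1 = 1{a=1} - pi(1)
score = (actions == 1).astype(float) - probs[1]

g_plain = score * returns                  # REINFORCE signal: full return G
baseline = probs @ q                       # V(s) = E[G] = 5.0
g_base = score * (returns - baseline)      # signal: advantage G - V(s)

# Same mean (unbiased), much lower variance with the baseline.
print(g_plain.mean(), g_base.mean())
print(g_plain.var(), g_base.var())
```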

A2C: Advantage Actor-Critic

The synchronous version of actor-critic.

For each batch of experience:
  1. Run policy, collect trajectories: (s_t, a_t, r_t, s_{t+1})
  2. Compute advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)  (1-step TD)
  3. Update actor: maximize E[log pi(a|s) * A]
  4. Update critic: minimize E[(V(s) - target)^2]

A2C uses the 1-step TD advantage. This has low variance but high bias (because V(s) is an approximation).
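The four steps above, sketched as a single A2C update in PyTorch. This assumes a discrete action space; the networks and the batch of transitions are placeholders, not part of the original note:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=7e-4)

# Placeholder batch of transitions (s, a, r, s', done).
s = torch.randn(64, obs_dim)
a = torch.randint(0, n_actions, (64,))
r = torch.randn(64)
s_next = torch.randn(64, obs_dim)
done = torch.zeros(64)
gamma = 0.99

# 2. 1-step TD advantage: A = r + gamma * V(s') - V(s)
v = critic(s).squeeze(-1)
with torch.no_grad():
    v_next = critic(s_next).squeeze(-1)
    target = r + gamma * (1 - done) * v_next
advantage = (target - v).detach()

# 3. Actor: maximize E[log pi(a|s) * A]  ->  minimize the negative
log_probs = torch.log_softmax(actor(s), dim=-1)
log_pi_a = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)
actor_loss = -(log_pi_a * advantage).mean()

# 4. Critic: minimize E[(V(s) - target)^2]
critic_loss = (v - target).pow(2).mean()

loss = actor_loss + 0.5 * critic_loss
opt.zero_grad()
loss.backward()
opt.step()
```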

Generalized Advantage Estimation (GAE)

GAE provides a smooth tradeoff between bias and variance using a parameter lambda (0 to 1).

delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)     # TD error at step t

GAE(gamma, lambda):
  A_t = delta_t + (gamma * lambda) * delta_{t+1} + (gamma * lambda)^2 * delta_{t+2} + ...

Equivalently (recursive):
  A_T = delta_T                                    # last step
  A_t = delta_t + gamma * lambda * A_{t+1}         # all other steps
lambda   Behavior                                 Bias     Variance
0        1-step TD: A = r + gamma*V(s') - V(s)    High     Low
0.5      Mix of short and long horizons           Medium   Medium
0.95     Close to Monte Carlo                     Low      Medium-high
1        Full Monte Carlo return                  None     High

In practice: lambda=0.95, gamma=0.99. This works across most environments.

import numpy as np
 
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute Generalized Advantage Estimation.
    rewards: (T,) rewards at each step.
    values: (T+1,) value estimates (includes bootstrap for last state).
    dones: (T,) episode termination flags.
    Returns: advantages (T,), returns (T,).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0
 
    for t in reversed(range(T)):
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * next_non_terminal - values[t]
        advantages[t] = delta + gamma * lam * next_non_terminal * last_gae
        last_gae = advantages[t]
 
    returns = advantages + values[:T]
    return advantages, returns
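A quick self-contained check that the direct sum and the recursion give the same advantages (one episode with made-up rewards and values, no mid-trajectory terminations):

```python
import numpy as np

gamma, lam = 0.99, 0.95
rewards = np.array([1.0, 0.5, -0.2, 2.0])
values = np.array([0.3, 0.1, 0.4, 0.2, 0.0])   # T+1 entries, bootstrap last

T = len(rewards)
deltas = rewards + gamma * values[1:] - values[:-1]   # TD errors

# Direct sum: A_t = sum_k (gamma*lam)^k * delta_{t+k}
direct = np.array([
    sum((gamma * lam) ** k * deltas[t + k] for k in range(T - t))
    for t in range(T)
])

# Recursion: A_T = delta_T, then A_t = delta_t + gamma*lam*A_{t+1}
recursive = np.zeros(T)
recursive[-1] = deltas[-1]
for t in reversed(range(T - 1)):
    recursive[t] = deltas[t] + gamma * lam * recursive[t + 1]

assert np.allclose(direct, recursive)
```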

PPO: Proximal Policy Optimization

PPO is the workhorse of modern RL. Published by OpenAI in 2017, it remains the dominant algorithm through 2026.

The problem PPO solves

Policy gradient methods are sensitive to step size. Too small → slow learning. Too large → catastrophic policy collapse (the policy jumps to a bad region and can’t recover).

TRPO (Trust Region Policy Optimization) solved this with constrained optimization — limit the KL divergence between old and new policies. It works but is complex to implement (conjugate gradients, line search).

PPO achieves nearly the same effect with a simple clipping trick.

The clipped objective

L_CLIP = E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]

where:
  r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t)    # probability ratio
  eps = 0.2 (typically)
  A_t = advantage (from GAE)

How it works:

  • r_t measures how much the policy changed. If r_t = 1, no change. r_t = 2 means the action is now 2x more likely.
  • When A_t > 0 (good action): we want to increase r_t, but clip prevents r_t from exceeding (1 + eps). The gradient is zero beyond the clip.
  • When A_t < 0 (bad action): we want to decrease r_t, but clip prevents r_t from going below (1 - eps).

The clip kills the gradient when the policy tries to change too much, preventing catastrophic updates.

import torch
import torch.nn as nn
 
def ppo_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Compute PPO clipped surrogate loss.
    log_probs_new: log pi_new(a|s) for actions taken.
    log_probs_old: log pi_old(a|s) (from rollout).
    advantages: GAE advantages.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new / pi_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()  # negative because we maximize
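The claim that the gradient vanishes beyond the clip can be verified directly on scalar tensors (toy values, eps = 0.2 as above):

```python
import torch

def clipped_term(log_new, log_old, adv, eps=0.2):
    ratio = torch.exp(log_new - log_old)
    return torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

adv = torch.tensor(1.0)  # a good action: A > 0

# Inside the clip (ratio ~ 1.1 < 1 + eps): gradient flows.
log_new = torch.tensor(0.0953, requires_grad=True)   # exp(0.0953) ~ 1.1
clipped_term(log_new, torch.tensor(0.0), adv).backward()
assert log_new.grad.abs() > 0

# Beyond the clip (ratio ~ 1.5 > 1 + eps): min() picks the clamped
# branch, whose derivative is zero, so the update stops.
log_new = torch.tensor(0.4055, requires_grad=True)   # exp(0.4055) ~ 1.5
clipped_term(log_new, torch.tensor(0.0), adv).backward()
assert log_new.grad.abs() == 0
```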

The full PPO loss

PPO’s loss has three components:

L = L_CLIP + c1 * L_VALUE + c2 * L_ENTROPY

L_CLIP:   clipped surrogate (actor loss)
L_VALUE:  (V(s) - returns)^2 (critic loss), often also clipped
L_ENTROPY: -E[pi * log(pi)] (entropy bonus, encourages exploration)

def ppo_full_loss(log_probs_new, log_probs_old, values, returns,
                  advantages, entropy, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Full PPO loss with value and entropy components."""
    # Policy loss (clipped surrogate)
    ratio = torch.exp(log_probs_new - log_probs_old)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()
 
    # Value loss
    value_loss = nn.functional.mse_loss(values, returns)
 
    # Entropy bonus (higher entropy = more exploration)
    entropy_loss = -entropy.mean()
 
    total = policy_loss + vf_coef * value_loss + ent_coef * entropy_loss
    return total, policy_loss, value_loss, entropy_loss

PPO training loop structure

1. Collect N steps of experience using current policy (rollout phase)
2. Compute GAE advantages and returns
3. For K epochs:
   a. Split data into minibatches
   b. For each minibatch:
      - Recompute log_probs and values with current network
      - Compute PPO loss
      - Gradient step
4. Go to 1

Key: we reuse the same rollout data for K epochs (typically 4-10).
This is why PPO is more sample-efficient than vanilla policy gradient.
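A minimal skeleton of steps 3-4 in PyTorch, with a toy actor-critic and random placeholder data standing in for the rollout and GAE outputs (steps 1-2 are elided):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, n_steps = 4, 2, 256
net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
policy_head = nn.Linear(64, n_actions)
value_head = nn.Linear(64, 1)
params = [*net.parameters(), *policy_head.parameters(), *value_head.parameters()]
opt = torch.optim.Adam(params, lr=3e-4)

# Placeholder rollout: obs, actions, old log-probs, GAE advantages, returns.
obs = torch.randn(n_steps, obs_dim)
actions = torch.randint(0, n_actions, (n_steps,))
old_log_probs = torch.log_softmax(policy_head(net(obs)), -1).gather(
    1, actions.unsqueeze(1)).squeeze(1).detach()
advantages = torch.randn(n_steps)
returns = torch.randn(n_steps)

# 3. K epochs over shuffled minibatches of the same rollout.
for epoch in range(4):
    perm = torch.randperm(n_steps)
    for start in range(0, n_steps, 64):
        idx = perm[start:start + 64]
        # Recompute log_probs and values with the current network.
        h = net(obs[idx])
        dist = torch.distributions.Categorical(logits=policy_head(h))
        log_probs = dist.log_prob(actions[idx])
        values = value_head(h).squeeze(-1)

        # PPO loss (clip_eps=0.2, vf_coef=0.5, ent_coef=0.01).
        ratio = torch.exp(log_probs - old_log_probs[idx])
        adv = advantages[idx]
        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 0.8, 1.2) * adv
        policy_loss = -torch.min(surr1, surr2).mean()
        value_loss = (values - returns[idx]).pow(2).mean()
        entropy = dist.entropy().mean()

        loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(params, 0.5)
        opt.step()
```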

PPO hyperparameters

Hyperparameter   Typical value   What it controls
clip_eps         0.2             How much the policy can change per update
gamma            0.99            Discount factor
gae_lambda       0.95            GAE bias-variance tradeoff
n_steps          2048            Steps per rollout
n_epochs         4-10            Optimization epochs per rollout
minibatch_size   64-256          Minibatch size for SGD
lr               3e-4            Learning rate (often with linear decay)
vf_coef          0.5             Value loss weight
ent_coef         0.01            Entropy bonus weight
max_grad_norm    0.5             Gradient clipping

These defaults (from stable-baselines3) work for most environments. Tune n_steps and lr first if performance is bad.

Using stable-baselines3

For production and experimentation, use the reference implementation.

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
 
# Vectorized environment (parallel rollouts for speed)
env = make_vec_env("LunarLander-v2", n_envs=4)
 
model = PPO(
    "MlpPolicy", env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
)
 
model.learn(total_timesteps=500_000)
 
# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)

Why PPO dominates

  1. Simple: no second-order optimization (TRPO), no replay buffer (SAC/DQN), no target networks
  2. Stable: clipping prevents catastrophic updates
  3. General: works for discrete and continuous actions
  4. Scalable: trivially parallelized across environments
  5. Proven: used in ChatGPT’s RLHF, OpenAI Five, robotics, game playing

PPO’s main weakness: sample efficiency. It’s on-policy (can’t reuse old data). Off-policy methods (SAC, TD3) can be more sample-efficient for continuous control. But PPO’s simplicity and stability usually win in practice.

Applications

  • Robotics: locomotion, manipulation, dexterous hand control
  • Game playing: OpenAI Five (Dota 2), StarCraft, board games
  • RLHF: training language models to follow instructions (ChatGPT, Claude)
  • Autonomous systems: drone navigation, vehicle control
  • Resource management: network routing, chip design, scheduling

Self-test questions

  1. Why does subtracting a baseline from the return reduce variance without changing the expected gradient?
  2. Walk through what happens to the PPO loss when r_t = 1.5 and A_t > 0. What if A_t < 0?
  3. Why does PPO reuse the same rollout data for multiple epochs? What’s the risk?
  4. What is the difference between GAE with lambda=0 and lambda=1?
  5. Why is PPO preferred over TRPO despite both achieving “trust region” behavior?

Exercises

  1. Derive the advantage: Starting from the policy gradient theorem, show that subtracting V(s) from Q(s,a) preserves the expected gradient direction.
  2. Clipping intuition: Plot the PPO objective as a function of r_t for A_t > 0 and A_t < 0 (with eps=0.2). Verify that the clip prevents r_t from moving too far from 1.
  3. Compare A2C vs PPO: Train both on CartPole-v1 (use stable-baselines3). Plot learning curves. Which converges faster? Which is more stable across random seeds?