Actor-Critic Methods

What

Combine policy gradient (actor) with value function (critic):

  • Actor: the policy — decides what to do
  • Critic: value function — evaluates how good the action was

The critic reduces variance of policy gradient updates by providing a baseline.

The advantage function

The core idea: don’t ask “how good was this action?” Ask “how much better was this action than average?”

A(s, a) = Q(s, a) - V(s)
  • Q(s, a): expected return from taking action a in state s
  • V(s): expected return from state s under current policy
  • A(s, a): the advantage — positive means better than average, negative means worse

Why this helps: vanilla policy gradient has high variance because rewards vary a lot. Subtracting the baseline V(s) keeps the same expected gradient but dramatically reduces variance. The critic learns V(s) so you don’t need to estimate it from returns alone.
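The definitions above can be checked with a few lines of arithmetic. This is a toy sketch with made-up Q-values and action probabilities (not from any real environment): V(s) is the policy-weighted average of Q(s, a), and the advantage is what's left after subtracting it.

```python
import numpy as np

q = np.array([1.0, 3.0, 2.0])    # hypothetical Q(s, a) for three actions
pi = np.array([0.2, 0.5, 0.3])   # current policy's action probabilities

v = np.dot(pi, q)                # V(s) = E_{a~pi}[Q(s, a)]
adv = q - v                      # A(s, a) = Q(s, a) - V(s)

print(v)    # V(s), the "average" return from this state
print(adv)  # positive entries: better than average; negative: worse
```

Note that the advantages are zero-mean under the policy (`np.dot(pi, adv)` is 0), which is exactly why subtracting the baseline leaves the expected gradient unchanged.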

Generalized Advantage Estimation (GAE)

GAE balances bias vs variance when estimating the advantage. It uses a parameter lambda (0 to 1):

GAE(lambda=0): A = r + gamma*V(s') - V(s)            # one-step TD: low variance, high bias
GAE(lambda=1): A = full Monte Carlo return - V(s)    # high variance, low bias

In practice, lambda=0.95 and gamma=0.99 work well. GAE is used in PPO by default.
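The estimator interpolates by summing discounted one-step TD residuals backward along the trajectory. A minimal sketch (rewards and values below are invented; assumes `values` carries one extra bootstrap entry V(s_T)):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has len(rewards)+1 entries: V(s_0..s_T), last one bootstrapped."""
    advantages = np.zeros(len(rewards))
    gae_t = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD residual: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # discounted running sum of residuals, decayed by gamma*lambda
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages

adv = gae(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.5, 0.5, 0.0]))
```

With `lam=0.0` each advantage collapses to its own TD residual; with `lam=1.0` it equals the full discounted return minus V(s), matching the two extremes above.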

Key algorithms

Algorithm                                  | Notes
A2C (Advantage Actor-Critic)               | Synchronous, stable
A3C (Asynchronous Advantage Actor-Critic)  | Asynchronous parallel training
PPO (Proximal Policy Optimization)         | Clipped updates, the default choice
SAC (Soft Actor-Critic)                    | Maximum entropy, good for continuous control

Why PPO dominates

PPO’s key insight: limit how much the policy changes per update. The clipped objective:

L = min(r * A, clip(r, 1-eps, 1+eps) * A)

where r = pi_new(a|s) / pi_old(a|s) is the probability ratio and eps is typically 0.2.

If the new policy tries to change too much (r far from 1), the clip kills the gradient. This prevents catastrophic policy updates without the complexity of TRPO’s constrained optimization.
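The per-sample clipped term is easy to verify numerically. A sketch with invented ratios and advantages (this is the objective before averaging over the batch and negating into a loss):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# ratio = 2.0 with positive advantage: the clipped term wins at 1.2*A,
# so pushing the policy further earns nothing -> gradient is zero there
print(ppo_clip_objective(np.array([2.0]), np.array([1.0])))  # [1.2]

# ratio = 0.5 with negative advantage: the min picks the pessimistic -0.8*|A|
print(ppo_clip_objective(np.array([0.5]), np.array([-1.0])))
```

Taking the min rather than just clipping makes the objective a pessimistic lower bound: clipping only ever removes the incentive to move further, never adds one.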

PPO is the default for: robotics, game playing, RLHF for LLMs, and most new RL applications. It’s simple to implement, stable, and works across discrete and continuous action spaces.

Continuous vs discrete action spaces

Aspect        | Discrete                    | Continuous
Actor output  | Probability over N actions  | Mean + std of Gaussian per dimension
Sampling      | Categorical distribution    | Normal distribution
Example       | Atari (left, right, fire)   | Robotics (joint torques)
Typical algo  | PPO, A2C                    | PPO, SAC

SAC is particularly good for continuous control because its entropy bonus encourages exploration across the continuous space.
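The two actor heads from the table can be sketched side by side. All the "network outputs" here (logits, mean, log-std) are invented numbers standing in for what a policy network would emit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete head: logits over N actions -> softmax -> categorical sample
logits = np.array([0.1, 1.5, -0.3])            # e.g. left / right / fire
probs = np.exp(logits) / np.exp(logits).sum()
action = rng.choice(len(probs), p=probs)

# Continuous head: mean + log-std per action dimension -> Gaussian sample
mean = np.array([0.2, -0.1])                   # e.g. two joint torques
log_std = np.array([-1.0, -1.0])               # learned, often state-independent
torque = rng.normal(mean, np.exp(log_std))
```

Parameterizing the log of the standard deviation (rather than std directly) is a common trick: it keeps the std positive without constraints and lets the network shrink exploration smoothly.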

Quick start

Training a PPO agent end to end is a few lines with Stable-Baselines3:

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100_000)