Actor-Critic Methods
What
Combine policy gradient (actor) with value function (critic):
- Actor: the policy — decides what to do
- Critic: value function — evaluates how good the action was
The critic reduces variance of policy gradient updates by providing a baseline.
The advantage function
The core idea: don’t ask “how good was this action?” Ask “how much better was this action than average?”
A(s, a) = Q(s, a) - V(s)
- Q(s, a): expected return from taking action a in state s
- V(s): expected return from state s under the current policy
- A(s, a): the advantage — positive means better than average, negative means worse
Why this helps: vanilla policy gradient has high variance because rewards vary a lot. Subtracting the baseline V(s) keeps the same expected gradient but dramatically reduces variance. The critic learns V(s) so you don’t need to estimate it from returns alone.
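A toy numpy sketch of why the baseline helps (all numbers here are illustrative): when returns vary mostly because *states* differ, subtracting V(s) strips out that state-level spread and leaves only the action-level signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 states whose values V(s) differ wildly,
# plus small per-action noise on top.
v_s = rng.uniform(0.0, 100.0, size=1000)         # critic's V(s) per state
returns = v_s + rng.normal(0.0, 1.0, size=1000)  # sampled Q(s, a) estimates

advantages = returns - v_s  # A(s, a) = Q(s, a) - V(s)

# Raw returns vary mostly because states differ, not because actions differ.
print(returns.std())     # large (dominated by the spread of V(s), ~29 here)
print(advantages.std())  # small (only the action-level noise remains, ~1)
```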
Generalized Advantage Estimation (GAE)
GAE balances bias vs variance when estimating the advantage. It uses a parameter lambda (0 to 1):
```
GAE(lambda=0): A = r + gamma*V(s') - V(s)       # one-step TD: low variance, high bias
GAE(lambda=1): A = Monte Carlo return - V(s)    # high variance, low bias
```
In practice, lambda=0.95 and gamma=0.99 work well. GAE is used in PPO by default.
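The standard way to compute GAE is a single backward pass over the trajectory, accumulating discounted TD errors. A minimal sketch (the function name and toy inputs are mine):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has len(rewards) + 1 entries: it includes the bootstrap V(s_T).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# lam=0 collapses to the one-step TD advantage: r + gamma*V(s') - V(s)
r = [1.0, 1.0, 1.0]
v = [0.5, 0.6, 0.7, 0.0]
print(gae(r, v, lam=0.0))  # equals the per-step TD errors delta_t
```

With `lam=1.0` the same recursion recovers the Monte Carlo advantage, i.e. the discounted return minus V(s).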
Key algorithms
| Algorithm | Notes |
|---|---|
| A2C (Advantage Actor-Critic) | Synchronous, stable |
| A3C | Asynchronous parallel training |
| PPO (Proximal Policy Optimization) | Clipped updates, the default choice |
| SAC (Soft Actor-Critic) | Maximum entropy, good for continuous control |
Why PPO dominates
PPO’s key insight: limit how much the policy changes per update. The clipped objective:
L = min(r * A, clip(r, 1-eps, 1+eps) * A)
where r = pi_new(a|s) / pi_old(a|s) is the probability ratio and eps is typically 0.2.
If the new policy tries to change too much (r far from 1), the clip kills the gradient. This prevents catastrophic policy updates without the complexity of TRPO’s constrained optimization.
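The clipping behavior is easy to see numerically. A quick numpy check of the objective above, over a range of illustrative probability ratios with a positive advantage:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # L = min(r * A, clip(r, 1-eps, 1+eps) * A)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.3, 2.0])
print(ppo_clip_objective(ratios, adv=1.0))
# -> [0.5, 0.9, 1.0, 1.1, 1.2, 1.2]
# For positive advantage, the objective is flat above 1+eps: pushing the
# ratio past 1.2 yields no extra objective, so the gradient there is zero.
# Below 1-eps, min() keeps the unclipped (pessimistic) value.
```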
PPO is the default for: robotics, game playing, RLHF for LLMs, and most new RL applications. It’s simple to implement, stable, and works across discrete and continuous action spaces.
Continuous vs discrete action spaces
| Aspect | Discrete | Continuous |
|---|---|---|
| Actor output | Probability over N actions | Mean + std of Gaussian per dimension |
| Sampling | Categorical distribution | Normal distribution |
| Example | Atari (left, right, fire) | Robotics (joint torques) |
| Typical algo | PPO, A2C | PPO, SAC |
SAC is particularly good for continuous control because its entropy bonus encourages exploration across the continuous space.
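The two actor heads from the table can be sketched in a few lines (the logits, means, and stds below are toy numbers, not learned outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete head: the network outputs logits over N actions,
# and we sample from the resulting categorical distribution.
logits = np.array([2.0, 0.5, -1.0])            # e.g. left / right / fire
probs = np.exp(logits) / np.exp(logits).sum()  # softmax
action = rng.choice(len(probs), p=probs)       # an int in {0, 1, 2}

# Continuous head: the network outputs a mean and (log-)std per action
# dimension, and we sample from the resulting Gaussian.
mean = np.array([0.1, -0.3])                   # e.g. two joint torques
log_std = np.array([-0.5, -0.5])
torque = rng.normal(mean, np.exp(log_std))     # a vector in R^2
```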
A minimal PPO training run with Stable-Baselines3:

```python
from stable_baselines3 import PPO

# Train a PPO agent on CartPole for 100k environment steps
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100_000)
```

Links
- Policy Gradient Methods — the actor’s foundation
- Q-Learning and DQN — value-based alternative
- RL Fundamentals — core RL concepts
- Multi-Armed Bandits — simplest exploration/exploitation problem
- Language Models — RLHF uses PPO