Actor-Critic Methods
What
Combine policy gradient (actor) with value function (critic):
- Actor: the policy — decides what to do
- Critic: value function — evaluates how good the action was
The critic reduces variance of policy gradient updates by providing a baseline.
The advantage function
The core idea: don’t ask “how good was this action?” Ask “how much better was this action than average?”
A(s, a) = Q(s, a) - V(s)
- Q(s, a): expected return from taking action a in state s
- V(s): expected return from state s under the current policy
- A(s, a): the advantage — positive means better than average, negative means worse
Why this helps: vanilla policy gradient has high variance because rewards vary a lot. Subtracting the baseline V(s) keeps the same expected gradient but dramatically reduces variance. The critic learns V(s) so you don’t need to estimate it from returns alone.
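A toy numpy sketch of why the baseline helps (all numbers here are illustrative): when returns vary mostly because *states* differ, subtracting V(s) strips out that state-level spread and leaves only the action-level signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 states whose values V(s) differ wildly,
# plus small per-action noise on top.
v_s = rng.uniform(0.0, 100.0, size=1000)         # critic's V(s) per state
returns = v_s + rng.normal(0.0, 1.0, size=1000)  # sampled Q(s, a) estimates

advantages = returns - v_s  # A(s, a) = Q(s, a) - V(s)

# Raw returns vary mostly because states differ, not because actions differ.
print(returns.std())     # large (dominated by the spread of V(s), ~29 here)
print(advantages.std())  # small (only the action-level noise remains, ~1)
```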
Generalized Advantage Estimation (GAE)
GAE balances bias vs variance when estimating the advantage. It uses a parameter lambda (0 to 1):
```
GAE(lambda=0): A = r + gamma*V(s') - V(s)       # one-step TD: low variance, high bias
GAE(lambda=1): A = Monte Carlo return - V(s)    # high variance, low bias
```
In practice, lambda=0.95 and gamma=0.99 work well. GAE is used in PPO by default.
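The standard way to compute GAE is a single backward pass over the trajectory, accumulating discounted TD errors. A minimal sketch (the function name and toy inputs are mine):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` has len(rewards) + 1 entries: it includes the bootstrap V(s_T).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# lam=0 collapses to the one-step TD advantage: r + gamma*V(s') - V(s)
r = [1.0, 1.0, 1.0]
v = [0.5, 0.6, 0.7, 0.0]
print(gae(r, v, lam=0.0))  # equals the per-step TD errors delta_t
```

With `lam=1.0` the same recursion recovers the Monte Carlo advantage, i.e. the discounted return minus V(s).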
Key algorithms
| Algorithm | Notes |
|---|---|
| A2C (Advantage Actor-Critic) | Synchronous, stable |
| A3C | Asynchronous parallel training |
| PPO (Proximal Policy Optimization) | Clipped updates, the default choice |
| SAC (Soft Actor-Critic) | Maximum entropy, good for continuous control |
Why PPO dominates
PPO’s key insight: limit how much the policy changes per update. The clipped objective:
L = min(r * A, clip(r, 1-eps, 1+eps) * A)
where r = pi_new(a|s) / pi_old(a|s) is the probability ratio and eps is typically 0.2.
If the new policy tries to change too much (r far from 1), the clip kills the gradient. This prevents catastrophic policy updates without the complexity of TRPO’s constrained optimization.
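The clipping behavior is easy to see numerically. A quick numpy check of the objective above, over a range of illustrative probability ratios with a positive advantage:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # L = min(r * A, clip(r, 1-eps, 1+eps) * A)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.3, 2.0])
print(ppo_clip_objective(ratios, adv=1.0))
# -> [0.5, 0.9, 1.0, 1.1, 1.2, 1.2]
# For positive advantage, the objective is flat above 1+eps: pushing the
# ratio past 1.2 yields no extra objective, so the gradient there is zero.
# Below 1-eps, min() keeps the unclipped (pessimistic) value.
```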
PPO is the default for: robotics, game playing, RLHF for LLMs, and most new RL applications. It’s simple to implement, stable, and works across discrete and continuous action spaces.
Continuous vs discrete action spaces
| Aspect | Discrete | Continuous |
|---|---|---|
| Actor output | Probability over N actions | Mean + std of Gaussian per dimension |
| Sampling | Categorical distribution | Normal distribution |
| Example | Atari (left, right, fire) | Robotics (joint torques) |
| Typical algo | PPO, A2C | PPO, SAC |
SAC is particularly good for continuous control because its entropy bonus encourages exploration across the continuous space.
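The two actor heads from the table can be sketched in a few lines (the logits, means, and stds below are toy numbers, not learned outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete head: the network outputs logits over N actions,
# and we sample from the resulting categorical distribution.
logits = np.array([2.0, 0.5, -1.0])            # e.g. left / right / fire
probs = np.exp(logits) / np.exp(logits).sum()  # softmax
action = rng.choice(len(probs), p=probs)       # an int in {0, 1, 2}

# Continuous head: the network outputs a mean and (log-)std per action
# dimension, and we sample from the resulting Gaussian.
mean = np.array([0.1, -0.3])                   # e.g. two joint torques
log_std = np.array([-0.5, -0.5])
torque = rng.normal(mean, np.exp(log_std))     # a vector in R^2
```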
A minimal PPO training run with Stable-Baselines3:

```python
from stable_baselines3 import PPO

# Train a PPO agent on CartPole for 100k environment steps
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=100_000)
```

Links
- Policy Gradient Methods — the actor’s foundation
- Q-Learning and DQN — value-based alternative
- RL Fundamentals — core RL concepts
- Multi-Armed Bandits — simplest exploration/exploitation problem
- Language Models — RLHF uses PPO