Multi-Agent RL
What
Multiple agents learn simultaneously in a shared environment. Each agent has its own observations, actions, and (possibly) its own reward signal. The agents’ behaviors interact — each agent’s optimal strategy depends on what the others are doing.
This is fundamentally harder than single-agent RL because the environment becomes non-stationary from each agent’s perspective: the “environment” includes the other agents, who are also changing their behavior.
Types of interaction
| Type | Relationship | Reward structure | Example |
|---|---|---|---|
| Cooperative | Agents share a goal | Same reward for all | Drone swarm search |
| Competitive | Zero-sum conflict | One gains, another loses | Pursuit-evasion |
| Mixed | Some cooperation, some competition | Independent but overlapping | Traffic intersections |
Most real-world scenarios are mixed. Even in a “cooperative” drone swarm, agents compete for limited resources (airspace, communication bandwidth, charging stations).
Core challenges
Non-stationarity
From agent A’s perspective, the other agents are part of the environment. As they learn and change their policies, the environment dynamics change. Standard RL assumes a stationary MDP — this assumption breaks in MARL.
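A toy illustration (matching pennies, a hypothetical game not from this note): agent A's greedy action flips as the opponent's policy shifts, even though the rules of the game never change — from A's perspective, the "environment" is non-stationary.

```python
import numpy as np

# Matching pennies payoff for agent A: +1 when actions match, -1 otherwise.
payoff_A = np.array([[ 1, -1],
                     [-1,  1]])

def best_response(opponent_policy):
    """Agent A's greedy action against a fixed opponent mixed policy."""
    expected = payoff_A @ opponent_policy  # expected payoff per action of A
    return int(np.argmax(expected))

# Early in training the opponent mostly plays action 0; later it has
# shifted to action 1. Nothing about the game changed -- only the
# opponent's policy -- yet A's optimal action flips.
early = best_response(np.array([0.9, 0.1]))  # -> 0
late  = best_response(np.array([0.1, 0.9]))  # -> 1
print(early, late)
```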
Credit assignment
In cooperative settings with a shared reward: all agents get the same reward, but which agent’s action caused the good (or bad) outcome? With 10 drones and a team reward, each drone needs to figure out its contribution.
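The ambiguity can be made concrete with a leave-one-out counterfactual (sketched here with a hypothetical coverage reward, not from this note): remove agent i, re-evaluate the team reward, and treat the drop as i's marginal contribution.

```python
# Hypothetical shared team reward: number of landmarks currently covered.
def team_reward(agent_positions, landmarks):
    return len(set(landmarks) & set(agent_positions))

landmarks = [(0, 0), (1, 0), (2, 0)]
positions = [(0, 0), (0, 0), (2, 0)]  # agents 0 and 1 crowd the same landmark

base = team_reward(positions, landmarks)  # 2 landmarks covered
for i in range(len(positions)):
    without_i = positions[:i] + positions[i + 1:]
    contribution = base - team_reward(without_i, landmarks)
    print(f"agent {i}: marginal contribution = {contribution}")
# Agents 0 and 1 each get zero marginal credit (the other already covers
# landmark (0, 0)); agent 2 gets credit 1 -- yet all three received the
# same shared reward of 2.
```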
Scalability
N agents, each with individual action space A → the joint action space has |A|^N elements. With 5 agents and 10 actions each, that’s 100,000 joint actions. The space explodes combinatorially.
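The arithmetic behind the explosion:

```python
# Joint action space size |A|^N, for |A| = 10 individual actions.
n_actions = 10
for n_agents in (2, 5, 10):
    print(f"{n_agents} agents -> {n_actions ** n_agents:,} joint actions")
# 2 agents -> 100, 5 agents -> 100,000, 10 agents -> 10,000,000,000
```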
Communication
Should agents communicate? What should they share? Learned communication channels can emerge, but they add complexity and bandwidth requirements.
Approaches
Independent learners
Each agent runs its own RL algorithm (e.g., PPO), treating other agents as part of the environment. Simple and scalable.
```python
import numpy as np

class IndependentAgent:
    """Each agent learns independently with tabular Q-learning."""

    def __init__(self, state_dim, n_actions, lr=0.1, gamma=0.95, epsilon=0.1):
        self.Q = {}  # state -> action values
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions

    def get_q(self, state):
        # Make the state hashable so it can be used as a dict key.
        state = tuple(state) if hasattr(state, '__iter__') else state
        if state not in self.Q:
            self.Q[state] = np.zeros(self.n_actions)
        return self.Q[state]

    def choose_action(self, state):
        # Epsilon-greedy exploration.
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.get_q(state)))

    def update(self, state, action, reward, next_state, done):
        q = self.get_q(state)
        q_next = self.get_q(next_state)
        target = reward + (1 - done) * self.gamma * np.max(q_next)
        q[action] += self.lr * (target - q[action])
```

Problem: non-stationarity makes learning unstable, and single-agent convergence guarantees no longer hold. It often works surprisingly well as a baseline, though.
Centralized Training, Decentralized Execution (CTDE)
The dominant paradigm. During training, agents can share information (other agents’ observations, actions, global state). During execution, each agent acts based only on its own observations.
Training:
- Central critic sees all agents’ observations and actions
- Each agent’s actor sees only its own observation

Execution:
- Each agent runs its own actor independently
- No communication required
Why: during training, using all available information improves learning. At test time, agents may not be able to communicate reliably (radio jamming, latency, agent failure).
QMIX (cooperative teams)
For cooperative settings. Each agent has a local Q-function. The team Q-value is a monotonic mixing of individual Q-values:
```
Q_team(s, a1, a2, ..., aN) = mix(Q1(o1, a1), Q2(o2, a2), ..., QN(oN, aN))

Constraint: ∂Q_team / ∂Qi >= 0  (monotonic)
  → maximizing Q_team ≡ each agent maximizing its own Qi
  → decentralized execution works!
```
The mixing network is a hypernetwork: its weights are conditioned on the global state.
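A minimal sketch of the monotonic mixing idea (simplified: random hypernetwork weights, ReLU instead of QMIX's ELU, and no hypernetwork-generated biases):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim, hidden = 3, 4, 8

# Hypernetwork weights (hypothetical, randomly initialized for the sketch):
# they map the global state to the mixing network's weights.
W1_hyper = rng.normal(size=(state_dim, n_agents * hidden))
W2_hyper = rng.normal(size=(state_dim, hidden))

def mix(agent_qs, state):
    """Monotonic mixing of per-agent Q-values into Q_team.

    abs() keeps the state-conditioned weights non-negative, which
    guarantees dQ_team/dQ_i >= 0 for every agent i.
    """
    w1 = np.abs(state @ W1_hyper).reshape(n_agents, hidden)  # >= 0
    w2 = np.abs(state @ W2_hyper)                            # >= 0
    h = np.maximum(agent_qs @ w1, 0.0)  # ReLU is also non-decreasing
    return float(h @ w2)

state = rng.normal(size=state_dim)
qs = rng.normal(size=n_agents)

# Monotonicity check: raising any single agent's Q never lowers Q_team.
bumped = qs.copy()
bumped[0] += 1.0
assert mix(bumped, state) >= mix(qs, state)
```

Because the mixing is monotonic, the joint argmax of Q_team decomposes into each agent taking the argmax of its own Qi — which is exactly what decentralized execution does.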
MAPPO (Multi-Agent PPO)
Surprisingly, running PPO per agent with a shared, centralized critic works very well. MAPPO is a strong baseline that often matches or beats more complex algorithms.
```python
import torch
import torch.nn as nn

class MAPPOCritic(nn.Module):
    """Centralized critic: takes all agents' observations as input."""

    def __init__(self, total_obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs):
        """all_obs: concatenation of all agents' observations."""
        return self.net(all_obs)

class MAPPOActor(nn.Module):
    """Decentralized actor: each agent sees only its own observation."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits)
```

Key insight: the critic sees everything (CTDE), but each actor sees only its own observation. During execution, only the actors are needed.
Parameter sharing
When agents are homogeneous (same type, same action space), share one policy across all agents. Each agent receives its own observation but uses the same network weights.
Benefits: N times more data for training, faster convergence, smaller model. Works well when agents should behave similarly (drone swarm with identical drones).
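A minimal sketch of the idea: one network, batched over agents. Appending a one-hot agent ID (a common trick, assumed here, not required) lets otherwise identical agents specialize if the task rewards it.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 4, 8, 5

# One policy network shared by every agent in the team.
shared_actor = nn.Sequential(
    nn.Linear(obs_dim + n_agents, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = torch.randn(n_agents, obs_dim)  # one observation per agent
agent_ids = torch.eye(n_agents)       # one-hot agent IDs
logits = shared_actor(torch.cat([obs, agent_ids], dim=-1))
actions = torch.distributions.Categorical(logits=logits).sample()
print(actions.shape)  # one action per agent, from one set of weights
```

Every transition from every agent updates the same weights, which is where the "N times more data" benefit comes from.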
Communication
Agents can learn to communicate through explicit message channels.
Approaches
| Method | How | When |
|---|---|---|
| CommNet | Average of all messages as extra input | Fully connected, small teams |
| TarMAC | Attention-based selective communication | When agents should attend to specific others |
| QMIX + comms | Messages as part of observation | Discrete communication protocols |
```python
import torch
import torch.nn as nn

class CommunicatingAgent(nn.Module):
    """Agent that sends and receives messages."""

    def __init__(self, obs_dim, n_actions, msg_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden),
            nn.ReLU(),
        )
        # Message to send to others
        self.msg_head = nn.Linear(hidden, msg_dim)
        # Action selection
        self.action_head = nn.Linear(hidden, n_actions)

    def forward(self, obs, received_msg):
        """obs: agent's own observation. received_msg: aggregated messages from others."""
        x = torch.cat([obs, received_msg], dim=-1)
        h = self.encoder(x)
        msg_out = self.msg_head(h)     # message to broadcast
        logits = self.action_head(h)   # action selection
        return torch.distributions.Categorical(logits=logits), msg_out
```

Emergent behavior
One of the most fascinating aspects of MARL: behaviors emerge that were never explicitly programmed.
Examples:
- Cooperation: agents learn to coordinate without explicit cooperation reward
- Specialization: in pursuit-evasion, some agents learn to chase while others learn to cut off escape routes
- Deception: in competitive settings, agents learn to feint and misdirect
- Communication protocols: agents develop their own “language” for coordination
PettingZoo: MARL environments
PettingZoo is the standard library for multi-agent environments (the MARL equivalent of Gymnasium).
```python
from pettingzoo.mpe import simple_spread_v3

# Cooperative: N agents must cover N landmarks
env = simple_spread_v3.parallel_env(N=3, max_cycles=100)
observations, infos = env.reset()

# observations is a dict: {agent_name: observation}
for agent, obs in observations.items():
    print(f"{agent}: obs shape = {obs.shape}")

# Step with actions for all agents
actions = {agent: env.action_space(agent).sample() for agent in env.agents}
observations, rewards, terminations, truncations, infos = env.step(actions)

# rewards is a dict: {agent_name: float}
for agent, reward in rewards.items():
    print(f"{agent}: reward = {reward:.3f}")
```

Key PettingZoo environments
| Environment | Type | Description |
|---|---|---|
| simple_spread | Cooperative | N agents cover N landmarks |
| simple_tag | Competitive | Predators chase prey |
| simple_adversary | Mixed | Agent must reach target, adversary interferes |
| waterworld | Cooperative | Agents consume food, avoid poison |
More MARL frameworks
| Framework | Strengths |
|---|---|
| EPyMARL | Easy benchmarking of MARL algorithms (QMIX, MAPPO, etc.) |
| MARLlib | Comprehensive library supporting many algorithms + environments |
| Melting Pot | DeepMind, evaluates social intelligence and cooperation |
Applications
Drone swarms
- Formation control: maintain geometric formation while navigating
- Search and coverage: cooperatively search an area (divide territory)
- Pursuit-evasion: team of drones tracking a moving target
- Relay communication: agents position themselves to relay messages
Defense and security
- EW (Electronic Warfare): jammer vs target as adversarial game. The jammer learns to position/time interference; the target learns evasion. MARL finds equilibrium strategies.
- Swarm vs defense: offensive swarm of drones vs defensive system. Competitive MARL reveals vulnerabilities in both attack and defense strategies.
- Cognitive warfare modeling: multi-agent influence networks. Agents represent actors trying to shape information environment. Competitive MARL reveals manipulation strategies and countermeasures.
Other domains
- Traffic control: traffic signals as cooperative agents, vehicles as self-interested agents
- Multiplayer games: StarCraft (SMAC), Dota 2, hide-and-seek
- Market making: buyers and sellers as competing agents
Self-test questions
- Why does independent learning struggle in multi-agent settings?
- What does CTDE stand for, and why is it the dominant paradigm?
- How does QMIX ensure that decentralized execution is consistent with centralized training?
- Why is parameter sharing effective for homogeneous agents?
- Give an example of emergent behavior in MARL. Why is it surprising?
Exercises
- Independent agents: Train independent Q-learning agents in simple_tag (PettingZoo). Do predators learn to coordinate? Measure capture rate over training.
- CTDE with shared critic: Implement a simple MAPPO setup: shared centralized critic, independent actors. Train on simple_spread. Compare with independent PPO.
- Emergent communication: Add a message channel (1-bit signal) between agents in simple_spread. Train with MAPPO. Analyze: do agents learn to use the communication channel meaningfully? Plot message statistics vs training progress.
Links
- Actor-Critic and PPO — MAPPO builds on PPO
- Policy Gradient Methods — policy optimization basics
- Reward Design and Curriculum — reward design is even harder in MARL
- Tutorial - Multi-Agent Training — hands-on multi-agent training
- Case Study - RL System Design — multi-agent pursuit scenario
- RL Fundamentals — single-agent foundations