Multi-Agent RL

What

Multiple agents learn simultaneously in a shared environment. Each agent has its own observations, actions, and (possibly) its own reward signal. The agents’ behaviors interact — each agent’s optimal strategy depends on what the others are doing.

This is fundamentally harder than single-agent RL because the environment becomes non-stationary from each agent’s perspective: the “environment” includes the other agents, who are also changing their behavior.

Types of interaction

| Type | Relationship | Reward structure | Example |
| --- | --- | --- | --- |
| Cooperative | Agents share a goal | Same reward for all | Drone swarm search |
| Competitive | Zero-sum conflict | One gains, another loses | Pursuit-evasion |
| Mixed | Some cooperation, some competition | Independent but overlapping | Traffic intersections |

Most real-world scenarios are mixed. Even in a “cooperative” drone swarm, agents compete for limited resources (airspace, communication bandwidth, charging stations).

Core challenges

Non-stationarity

From agent A’s perspective, the other agents are part of the environment. As they learn and change their policies, the environment dynamics change. Standard RL assumes a stationary MDP — this assumption breaks in MARL.

Credit assignment

In cooperative settings with a shared reward: all agents get the same reward, but which agent’s action caused the good (or bad) outcome? With 10 drones and a team reward, each drone needs to figure out its contribution.

Scalability

N agents, each with action space A → the joint action space is A^N, of size |A|^N. With 5 agents and 10 actions each, that's 10^5 = 100,000 joint actions. The space explodes combinatorially.
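The blow-up is easy to see by enumerating the joint actions directly (a quick illustration, not something you would do for real agent counts):

```python
from itertools import product

# Joint action space for N agents, each with |A| discrete actions:
# every combination of per-agent choices is one joint action.
n_agents, n_actions = 5, 10
joint = list(product(range(n_actions), repeat=n_agents))

print(len(joint))             # 100000
print(n_actions ** n_agents)  # 100000 — |A|**N, computed directly
```

Adding a sixth agent multiplies the count by another factor of 10, which is why centralized methods that reason over joint actions stop scaling quickly.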

Communication

Should agents communicate? What should they share? Learned communication channels can emerge, but they add complexity and bandwidth requirements.

Approaches

Independent learners

Each agent runs its own RL algorithm (e.g., Q-learning or PPO), treating the other agents as part of the environment. Simple and scalable.

import numpy as np
 
class IndependentAgent:
    """Each agent learns independently with tabular Q-learning,
    treating the other agents as part of the environment."""
 
    def __init__(self, n_actions, lr=0.1, gamma=0.95, epsilon=0.1):
        self.Q = {}  # state -> action values
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions
 
    def get_q(self, state):
        state = tuple(state) if hasattr(state, '__iter__') else state
        if state not in self.Q:
            self.Q[state] = np.zeros(self.n_actions)
        return self.Q[state]
 
    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.get_q(state))
 
    def update(self, state, action, reward, next_state, done):
        q = self.get_q(state)
        q_next = self.get_q(next_state)
        target = reward + (1 - done) * self.gamma * np.max(q_next)
        q[action] += self.lr * (target - q[action])

Problem: non-stationarity makes learning unstable — each agent's learning target keeps shifting as the others update their policies. Even so, independent learning is often a surprisingly strong baseline.
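A minimal self-contained sketch of what independent learning looks like in practice: two agents with independent Q-tables play a toy repeated coordination game (both earn +1 only when their actions match). Each update only uses the agent's own action and the shared reward — the other agent is just part of the environment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coordination game: both agents get +1 when their actions match.
n_actions, lr, epsilon = 2, 0.2, 0.1
Q = [np.zeros(n_actions), np.zeros(n_actions)]  # one Q-table per agent

def act(q):
    # epsilon-greedy over this agent's own Q-values
    return int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(q))

for _ in range(2000):
    a = [act(Q[0]), act(Q[1])]
    r = 1.0 if a[0] == a[1] else 0.0
    for i in range(2):  # each agent updates only its own table
        Q[i][a[i]] += lr * (r - Q[i][a[i]])

# The learners typically settle on the same action — coordination emerges
# even though neither agent models the other.
print(int(np.argmax(Q[0])), int(np.argmax(Q[1])))
```

In this tiny game the non-stationarity is benign, so the agents lock into a coordination equilibrium; in richer environments the same mechanism can oscillate or diverge.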

Centralized Training, Decentralized Execution (CTDE)

The dominant paradigm. During training, agents can share information (other agents’ observations, actions, global state). During execution, each agent acts based only on its own observations.

Training:
  Central critic sees all agents' observations and actions
  Each agent's actor only sees its own observation

Execution:
  Each agent runs its own actor independently
  No communication required

Why: during training, using all available information improves learning. At test time, agents may not be able to communicate reliably (radio jamming, latency, agent failure).

QMIX (cooperative teams)

For cooperative settings. Each agent has a local Q-function. The team Q-value is a monotonic mixing of individual Q-values:

Q_team(s, a1, a2, ..., aN) = mix(Q1(o1, a1), Q2(o2, a2), ..., QN(oN, aN))

Constraint: d(Q_team)/d(Qi) >= 0 for every agent i  (monotonicity)
→ maximizing Q_team is equivalent to each agent greedily maximizing its own Qi
→ decentralized execution works!

The mixing network is a hypernetwork: its weights are conditioned on the global state.
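A minimal sketch of such a mixer (names like `QMixer` and the layer sizes are illustrative, not the reference implementation): hypernetworks map the global state to the mixing weights, and `torch.abs` keeps those weights non-negative, which is what enforces the monotonicity constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Sketch of a QMIX-style mixing network. The hypernetworks condition
    the mixing weights on the global state; abs() makes them non-negative,
    so dQ_team/dQ_i >= 0 holds by construction."""

    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).unsqueeze(1)
        h = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)         # (batch, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).unsqueeze(2)  # (batch, embed, 1)
        b2 = self.hyper_b2(state).unsqueeze(1)
        return (h @ w2 + b2).squeeze(-1)                   # (batch, 1)

mixer = QMixer(n_agents=3, state_dim=8)
qs, state = torch.randn(4, 3), torch.randn(4, 8)
q_team = mixer(qs, state)
print(q_team.shape)  # torch.Size([4, 1])
```

Because every weight applied to the agent Q-values is non-negative and ELU is monotone, raising any single agent's Q-value can never lower Q_team — exactly the property that makes the decentralized argmax consistent with the centralized one.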

MAPPO (Multi-Agent PPO)

Surprisingly, running PPO for each agent's actor while sharing a centralized critic works very well. MAPPO is a strong baseline that often matches or beats more complex algorithms.

import torch
import torch.nn as nn
 
class MAPPOCritic(nn.Module):
    """Centralized critic: takes all agents' observations as input."""
 
    def __init__(self, total_obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
 
    def forward(self, all_obs):
        """all_obs: concatenation of all agents' observations."""
        return self.net(all_obs)
 
class MAPPOActor(nn.Module):
    """Decentralized actor: each agent sees only its own observation."""
 
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
 
    def forward(self, obs):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits)

Key insight: the critic sees everything (CTDE), but each actor only sees its own observation. During execution, only the actors are needed.

Parameter sharing

When agents are homogeneous (same type, same action space), share one policy across all agents. Each agent receives its own observation but uses the same network weights.

Benefits: N times more data for training, faster convergence, smaller model. Works well when agents should behave similarly (drone swarm with identical drones).
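A minimal sketch of the idea: one policy network serves every agent, so the per-agent observations can simply be stacked into a batch and pushed through a single forward pass (layer sizes here are illustrative).

```python
import torch
import torch.nn as nn

# Parameter sharing: a single policy network is used by all agents.
n_agents, obs_dim, n_actions = 4, 6, 5
shared_policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
)

obs = torch.randn(n_agents, obs_dim)  # one row per agent
logits = shared_policy(obs)           # same weights applied to every agent
actions = torch.distributions.Categorical(logits=logits).sample()
print(actions.shape)  # torch.Size([4])
```

A common refinement is to append a one-hot agent ID to each observation, which lets agents with shared weights still develop distinct roles.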

Communication

Agents can learn to communicate through explicit message channels.

Approaches

| Method | How | When |
| --- | --- | --- |
| CommNet | Average of all messages as extra input | Fully connected, small teams |
| TarMAC | Attention-based selective communication | When agents should attend to specific others |
| QMIX + comms | Messages as part of observation | Discrete communication protocols |

class CommunicatingAgent(nn.Module):
    """Agent that sends and receives messages."""
 
    def __init__(self, obs_dim, n_actions, msg_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden),
            nn.ReLU(),
        )
        # Message to send to others
        self.msg_head = nn.Linear(hidden, msg_dim)
        # Action selection
        self.action_head = nn.Linear(hidden, n_actions)
 
    def forward(self, obs, received_msg):
        """obs: agent's own observation. received_msg: aggregated messages from others."""
        x = torch.cat([obs, received_msg], dim=-1)
        h = self.encoder(x)
        msg_out = self.msg_head(h)       # message to broadcast
        logits = self.action_head(h)     # action selection
        return torch.distributions.Categorical(logits=logits), msg_out
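How does `received_msg` get built? In a CommNet-style scheme each agent broadcasts its `msg_out` and receives the mean of the other agents' messages. A self-contained sketch of that aggregation step, using plain tensors in place of the agents' message heads:

```python
import torch

# CommNet-style aggregation: agent i receives the mean of the
# messages broadcast by all agents j != i.
n_agents, msg_dim = 3, 16
outgoing = torch.randn(n_agents, msg_dim)  # one msg_out per agent

totals = outgoing.sum(dim=0, keepdim=True)        # sum over all agents
received = (totals - outgoing) / (n_agents - 1)   # exclude each agent's own message

print(received.shape)  # torch.Size([3, 16])
```

Each row of `received` would then be fed back into the corresponding agent's forward pass as `received_msg` on the next timestep.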

Emergent behavior

One of the most fascinating aspects of MARL: behaviors emerge that were never explicitly programmed.

Examples:

  • Cooperation: agents learn to coordinate without explicit cooperation reward
  • Specialization: in pursuit-evasion, some agents learn to chase while others learn to cut off escape routes
  • Deception: in competitive settings, agents learn to feint and misdirect
  • Communication protocols: agents develop their own “language” for coordination

PettingZoo: MARL environments

PettingZoo is the standard library for multi-agent environments (the MARL equivalent of Gymnasium).

from pettingzoo.mpe import simple_spread_v3
 
# Cooperative: N agents must cover N landmarks
env = simple_spread_v3.parallel_env(N=3, max_cycles=100)
observations, infos = env.reset()
 
# observations is a dict: {agent_name: observation}
for agent, obs in observations.items():
    print(f"{agent}: obs shape = {obs.shape}")
 
# Step with actions for all agents
actions = {agent: env.action_space(agent).sample() for agent in env.agents}
observations, rewards, terminations, truncations, infos = env.step(actions)
 
# rewards is a dict: {agent_name: float}
for agent, reward in rewards.items():
    print(f"{agent}: reward = {reward:.3f}")

Key PettingZoo environments

| Environment | Type | Description |
| --- | --- | --- |
| simple_spread | Cooperative | N agents cover N landmarks |
| simple_tag | Competitive | Predators chase prey |
| simple_adversary | Mixed | Agent must reach target, adversary interferes |
| waterworld | Cooperative | Agents consume food, avoid poison |

More MARL frameworks

| Framework | Strengths |
| --- | --- |
| EPyMARL | Easy benchmarking of MARL algorithms (QMIX, MAPPO, etc.) |
| MARLlib | Comprehensive library supporting many algorithms + environments |
| Melting Pot | From DeepMind; evaluates social intelligence and cooperation |

Applications

Drone swarms

  • Formation control: maintain geometric formation while navigating
  • Search and coverage: cooperatively search an area (divide territory)
  • Pursuit-evasion: team of drones tracking a moving target
  • Relay communication: agents position themselves to relay messages

Defense and security

  • EW (Electronic Warfare): jammer vs target as adversarial game. The jammer learns to position/time interference; the target learns evasion. MARL finds equilibrium strategies.
  • Swarm vs defense: offensive swarm of drones vs defensive system. Competitive MARL reveals vulnerabilities in both attack and defense strategies.
  • Cognitive warfare modeling: multi-agent influence networks. Agents represent actors trying to shape information environment. Competitive MARL reveals manipulation strategies and countermeasures.

Other domains

  • Traffic control: traffic signals as cooperative agents, vehicles as self-interested agents
  • Multiplayer games: StarCraft (SMAC), Dota 2, hide-and-seek
  • Market making: buyers and sellers as competing agents

Self-test questions

  1. Why does independent learning struggle in multi-agent settings?
  2. What does CTDE stand for, and why is it the dominant paradigm?
  3. How does QMIX ensure that decentralized execution is consistent with centralized training?
  4. Why is parameter sharing effective for homogeneous agents?
  5. Give an example of emergent behavior in MARL. Why is it surprising?

Exercises

  1. Independent agents: Train independent Q-learning agents in simple_tag (PettingZoo). Do predators learn to coordinate? Measure capture rate over training.
  2. CTDE with shared critic: Implement a simple MAPPO setup: shared centralized critic, independent actors. Train on simple_spread. Compare with independent PPO.
  3. Emergent communication: Add a message channel (1-bit signal) between agents in simple_spread. Train with MAPPO. Analyze: do agents learn to use the communication channel meaningfully? Plot message statistics vs training progress.