Multi-Agent RL
What
Multiple agents learn simultaneously in a shared environment. Each agent has its own observations, actions, and (possibly) its own reward signal. The agents’ behaviors interact — each agent’s optimal strategy depends on what the others are doing.
This is fundamentally harder than single-agent RL because the environment becomes non-stationary from each agent’s perspective: the “environment” includes the other agents, who are also changing their behavior.
Types of interaction
| Type | Relationship | Reward structure | Example |
|---|---|---|---|
| Cooperative | Agents share a goal | Same reward for all | Drone swarm search |
| Competitive | Zero-sum conflict | One gains, another loses | Pursuit-evasion |
| Mixed | Some cooperation, some competition | Independent but overlapping | Traffic intersections |
Most real-world scenarios are mixed. Even in a “cooperative” drone swarm, agents compete for limited resources (airspace, communication bandwidth, charging stations).
Core challenges
Non-stationarity
From agent A’s perspective, the other agents are part of the environment. As they learn and change their policies, the environment dynamics change. Standard RL assumes a stationary MDP — this assumption breaks in MARL.
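A toy illustration (matching pennies, a hypothetical game not from this note): agent A's greedy action flips as the opponent's policy shifts, even though the rules of the game never change — from A's perspective, the "environment" is non-stationary.

```python
import numpy as np

# Matching pennies payoff for agent A: +1 when actions match, -1 otherwise.
payoff_A = np.array([[ 1, -1],
                     [-1,  1]])

def best_response(opponent_policy):
    """Agent A's greedy action against a fixed opponent mixed policy."""
    expected = payoff_A @ opponent_policy  # expected payoff per action of A
    return int(np.argmax(expected))

# Early in training the opponent mostly plays action 0; later it has
# shifted to action 1. Nothing about the game changed -- only the
# opponent's policy -- yet A's optimal action flips.
early = best_response(np.array([0.9, 0.1]))  # -> 0
late  = best_response(np.array([0.1, 0.9]))  # -> 1
print(early, late)
```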
Credit assignment
In cooperative settings with a shared reward: all agents get the same reward, but which agent’s action caused the good (or bad) outcome? With 10 drones and a team reward, each drone needs to figure out its contribution.
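The ambiguity can be made concrete with a leave-one-out counterfactual (sketched here with a hypothetical coverage reward, not from this note): remove agent i, re-evaluate the team reward, and treat the drop as i's marginal contribution.

```python
# Hypothetical shared team reward: number of landmarks currently covered.
def team_reward(agent_positions, landmarks):
    return len(set(landmarks) & set(agent_positions))

landmarks = [(0, 0), (1, 0), (2, 0)]
positions = [(0, 0), (0, 0), (2, 0)]  # agents 0 and 1 crowd the same landmark

base = team_reward(positions, landmarks)  # 2 landmarks covered
for i in range(len(positions)):
    without_i = positions[:i] + positions[i + 1:]
    contribution = base - team_reward(without_i, landmarks)
    print(f"agent {i}: marginal contribution = {contribution}")
# Agents 0 and 1 each get zero marginal credit (the other already covers
# landmark (0, 0)); agent 2 gets credit 1 -- yet all three received the
# same shared reward of 2.
```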
Scalability
N agents, each with individual action space A → the joint action space has |A|^N elements. With 5 agents and 10 actions each, that’s 100,000 joint actions. The space explodes combinatorially.
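The arithmetic behind the explosion:

```python
# Joint action space size |A|^N, for |A| = 10 individual actions.
n_actions = 10
for n_agents in (2, 5, 10):
    print(f"{n_agents} agents -> {n_actions ** n_agents:,} joint actions")
# 2 agents -> 100, 5 agents -> 100,000, 10 agents -> 10,000,000,000
```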
Communication
Should agents communicate? What should they share? Learned communication channels can emerge, but they add complexity and bandwidth requirements.
Approaches
Independent learners
Each agent runs its own RL algorithm (e.g., PPO), treating other agents as part of the environment. Simple and scalable.
```python
import numpy as np

class IndependentAgent:
    """Each agent learns independently with tabular Q-learning."""

    def __init__(self, state_dim, n_actions, lr=0.1, gamma=0.95, epsilon=0.1):
        self.Q = {}  # state -> action values
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions

    def get_q(self, state):
        # Make the state hashable so it can be used as a dict key.
        state = tuple(state) if hasattr(state, '__iter__') else state
        if state not in self.Q:
            self.Q[state] = np.zeros(self.n_actions)
        return self.Q[state]

    def choose_action(self, state):
        # Epsilon-greedy exploration.
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.get_q(state)))

    def update(self, state, action, reward, next_state, done):
        q = self.get_q(state)
        q_next = self.get_q(next_state)
        target = reward + (1 - done) * self.gamma * np.max(q_next)
        q[action] += self.lr * (target - q[action])
```

Problem: non-stationarity makes learning unstable, and single-agent convergence guarantees no longer hold. It often works surprisingly well as a baseline, though.
Centralized Training, Decentralized Execution (CTDE)
The dominant paradigm. During training, agents can share information (other agents’ observations, actions, global state). During execution, each agent acts based only on its own observations.
Training:
- Central critic sees all agents’ observations and actions
- Each agent’s actor sees only its own observation

Execution:
- Each agent runs its own actor independently
- No communication required
Why: during training, using all available information improves learning. At test time, agents may not be able to communicate reliably (radio jamming, latency, agent failure).
QMIX (cooperative teams)
For cooperative settings. Each agent has a local Q-function. The team Q-value is a monotonic mixing of individual Q-values:
```
Q_team(s, a1, a2, ..., aN) = mix(Q1(o1, a1), Q2(o2, a2), ..., QN(oN, aN))

Constraint: ∂Q_team / ∂Qi >= 0  (monotonic)
  → maximizing Q_team ≡ each agent maximizing its own Qi
  → decentralized execution works!
```
The mixing network is a hypernetwork: its weights are conditioned on the global state.
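A minimal sketch of the monotonic mixing idea (simplified: random hypernetwork weights, ReLU instead of QMIX's ELU, and no hypernetwork-generated biases):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim, hidden = 3, 4, 8

# Hypernetwork weights (hypothetical, randomly initialized for the sketch):
# they map the global state to the mixing network's weights.
W1_hyper = rng.normal(size=(state_dim, n_agents * hidden))
W2_hyper = rng.normal(size=(state_dim, hidden))

def mix(agent_qs, state):
    """Monotonic mixing of per-agent Q-values into Q_team.

    abs() keeps the state-conditioned weights non-negative, which
    guarantees dQ_team/dQ_i >= 0 for every agent i.
    """
    w1 = np.abs(state @ W1_hyper).reshape(n_agents, hidden)  # >= 0
    w2 = np.abs(state @ W2_hyper)                            # >= 0
    h = np.maximum(agent_qs @ w1, 0.0)  # ReLU is also non-decreasing
    return float(h @ w2)

state = rng.normal(size=state_dim)
qs = rng.normal(size=n_agents)

# Monotonicity check: raising any single agent's Q never lowers Q_team.
bumped = qs.copy()
bumped[0] += 1.0
assert mix(bumped, state) >= mix(qs, state)
```

Because the mixing is monotonic, the joint argmax of Q_team decomposes into each agent taking the argmax of its own Qi — which is exactly what decentralized execution does.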
MAPPO (Multi-Agent PPO)
Surprisingly, running PPO per agent with a shared, centralized critic works very well. MAPPO is a strong baseline that often matches or beats more complex algorithms.
```python
import torch
import torch.nn as nn

class MAPPOCritic(nn.Module):
    """Centralized critic: takes all agents' observations as input."""

    def __init__(self, total_obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs):
        """all_obs: concatenation of all agents' observations."""
        return self.net(all_obs)

class MAPPOActor(nn.Module):
    """Decentralized actor: each agent sees only its own observation."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits)
```

Key insight: the critic sees everything (CTDE), but each actor sees only its own observation. During execution, only the actors are needed.
Parameter sharing
When agents are homogeneous (same type, same action space), share one policy across all agents. Each agent receives its own observation but uses the same network weights.
Benefits: N times more data for training, faster convergence, smaller model. Works well when agents should behave similarly (drone swarm with identical drones).
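A minimal sketch of the idea: one network, batched over agents. Appending a one-hot agent ID (a common trick, assumed here, not required) lets otherwise identical agents specialize if the task rewards it.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 4, 8, 5

# One policy network shared by every agent in the team.
shared_actor = nn.Sequential(
    nn.Linear(obs_dim + n_agents, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = torch.randn(n_agents, obs_dim)  # one observation per agent
agent_ids = torch.eye(n_agents)       # one-hot agent IDs
logits = shared_actor(torch.cat([obs, agent_ids], dim=-1))
actions = torch.distributions.Categorical(logits=logits).sample()
print(actions.shape)  # one action per agent, from one set of weights
```

Every transition from every agent updates the same weights, which is where the "N times more data" benefit comes from.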
Communication
Agents can learn to communicate through explicit message channels.
Approaches
| Method | How | When |
|---|---|---|
| CommNet | Average of all messages as extra input | Fully connected, small teams |
| TarMAC | Attention-based selective communication | When agents should attend to specific others |
| QMIX + comms | Messages as part of observation | Discrete communication protocols |
```python
import torch
import torch.nn as nn

class CommunicatingAgent(nn.Module):
    """Agent that sends and receives messages."""

    def __init__(self, obs_dim, n_actions, msg_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden),
            nn.ReLU(),
        )
        # Message to send to others
        self.msg_head = nn.Linear(hidden, msg_dim)
        # Action selection
        self.action_head = nn.Linear(hidden, n_actions)

    def forward(self, obs, received_msg):
        """obs: agent's own observation. received_msg: aggregated messages from others."""
        x = torch.cat([obs, received_msg], dim=-1)
        h = self.encoder(x)
        msg_out = self.msg_head(h)     # message to broadcast
        logits = self.action_head(h)   # action selection
        return torch.distributions.Categorical(logits=logits), msg_out
```

Emergent behavior
One of the most fascinating aspects of MARL: behaviors emerge that were never explicitly programmed.
Examples:
- Cooperation: agents learn to coordinate without explicit cooperation reward
- Specialization: in pursuit-evasion, some agents learn to chase while others learn to cut off escape routes
- Deception: in competitive settings, agents learn to feint and misdirect
- Communication protocols: agents develop their own “language” for coordination
PettingZoo: MARL environments
PettingZoo is the standard library for multi-agent environments (the MARL equivalent of Gymnasium).
```python
from pettingzoo.mpe import simple_spread_v3

# Cooperative: N agents must cover N landmarks
env = simple_spread_v3.parallel_env(N=3, max_cycles=100)
observations, infos = env.reset()

# observations is a dict: {agent_name: observation}
for agent, obs in observations.items():
    print(f"{agent}: obs shape = {obs.shape}")

# Step with actions for all agents
actions = {agent: env.action_space(agent).sample() for agent in env.agents}
observations, rewards, terminations, truncations, infos = env.step(actions)

# rewards is a dict: {agent_name: float}
for agent, reward in rewards.items():
    print(f"{agent}: reward = {reward:.3f}")
```

Key PettingZoo environments
| Environment | Type | Description |
|---|---|---|
| simple_spread | Cooperative | N agents cover N landmarks |
| simple_tag | Competitive | Predators chase prey |
| simple_adversary | Mixed | Agent must reach target, adversary interferes |
| waterworld | Cooperative | Agents consume food, avoid poison |
More MARL frameworks
| Framework | Strengths |
|---|---|
| EPyMARL | Easy benchmarking of MARL algorithms (QMIX, MAPPO, etc.) |
| MARLlib | Comprehensive library supporting many algorithms + environments |
| Melting Pot | DeepMind, evaluates social intelligence and cooperation |
Applications
Drone swarms
- Formation control: maintain geometric formation while navigating
- Search and coverage: cooperatively search an area (divide territory)
- Pursuit-evasion: team of drones tracking a moving target
- Relay communication: agents position themselves to relay messages
Defense and security
- EW (Electronic Warfare): jammer vs target as adversarial game. The jammer learns to position/time interference; the target learns evasion. MARL finds equilibrium strategies.
- Swarm vs defense: offensive swarm of drones vs defensive system. Competitive MARL reveals vulnerabilities in both attack and defense strategies.
- Cognitive warfare modeling: multi-agent influence networks. Agents represent actors trying to shape information environment. Competitive MARL reveals manipulation strategies and countermeasures.
Other domains
- Traffic control: traffic signals as cooperative agents, vehicles as self-interested agents
- Multiplayer games: StarCraft (SMAC), Dota 2, hide-and-seek
- Market making: buyers and sellers as competing agents
Self-test questions
- Why does independent learning struggle in multi-agent settings?
- What does CTDE stand for, and why is it the dominant paradigm?
- How does QMIX ensure that decentralized execution is consistent with centralized training?
- Why is parameter sharing effective for homogeneous agents?
- Give an example of emergent behavior in MARL. Why is it surprising?
Exercises
- Independent agents: Train independent Q-learning agents in simple_tag (PettingZoo). Do predators learn to coordinate? Measure capture rate over training.
- CTDE with shared critic: Implement a simple MAPPO setup: shared centralized critic, independent actors. Train on simple_spread. Compare with independent PPO.
- Emergent communication: Add a message channel (1-bit signal) between agents in simple_spread. Train with MAPPO. Analyze: do agents learn to use the communication channel meaningfully? Plot message statistics vs training progress.
Links
- Actor-Critic and PPO — MAPPO builds on PPO
- Policy Gradient Methods — policy optimization basics
- Reward Design and Curriculum — reward design is even harder in MARL
- Tutorial - Multi-Agent Training — hands-on multi-agent training
- Case Study - RL System Design — multi-agent pursuit scenario
- RL Fundamentals — single-agent foundations