# Multi-Agent Reinforcement Learning

## When to Use This Skill

Invoke this skill when you encounter:

- **Multiple Learners**: 2+ agents learning simultaneously in a shared environment
- **Coordination Problem**: Agents must coordinate to achieve goals
- **Non-Stationarity**: Other agents changing policies during training
- **CTDE Implementation**: Separating centralized training from decentralized execution
- **Value Factorization**: Credit assignment in cooperative multi-agent settings
- **QMIX Algorithm**: Learning cooperative Q-values with value factorization
- **MADDPG**: Multi-agent actor-critic with centralized critics
- **Communication**: Agents learning to communicate to improve coordination
- **Team Reward Ambiguity**: How to split a team reward fairly among agents
- **Cooperative vs Competitive**: Designing the reward structure for a multi-agent problem
- **Non-Stationarity Handling**: Dealing with other agents' policy changes
- **When Multi-Agent RL Is Needed**: Deciding if a problem requires MARL vs single-agent RL

**This skill teaches how to train multiple agents that learn simultaneously and must coordinate.**

Do NOT use this skill for:

- Single-agent RL (use rl-foundations, value-based-methods, policy-gradient-methods)
- Supervised multi-task learning (that's supervised learning)
- Simple parallel independent tasks (use single-agent RL in parallel)
- Pure game theory without learning (use game theory frameworks)

## Core Principle

**Multi-agent RL learns coordinated policies for multiple agents in a shared environment, solving the fundamental problem that environment non-stationarity from other agents' learning breaks standard RL convergence guarantees.**

The core insight: when other agents improve their policies, the environment changes. Value estimates computed under the assumption that other agents play their old policies become wrong once they switch to new ones.

```
Single-Agent RL:
1. Agent learns policy π
2. Environment is fixed
3. Agent value estimates Q(s,a) are stable
4. Algorithm converges to the optimal policy

Multi-Agent RL:
1. Agent 1 learns policy π_1
2. Agent 2 is also learning, changing π_2
3. The environment from Agent 1's perspective is non-stationary
4. Agent 1's value estimates become invalid when Agent 2 improves
5. Standard convergence guarantees are broken
6. Need special algorithms: QMIX, MADDPG, communication

Without addressing non-stationarity, multi-agent learning is unstable.
```

**Without understanding multi-agent problem structure and non-stationarity, you'll implement algorithms that fail to coordinate, suffer credit assignment disasters, or waste effort on agent conflicts instead of collaboration.**
## Part 1: Multi-Agent RL Fundamentals

### Why Multi-Agent RL Differs From Single-Agent

**Standard RL Assumption (Single-Agent)**:

- You have one agent
- Environment dynamics and reward function are fixed
- The agent's actions don't change the environment's structure
- Goal: Learn a policy that maximizes expected return

**Multi-Agent RL Reality**:

- Multiple agents act in a shared environment
- Each agent learns simultaneously
- When Agent 1 improves, Agent 2 sees a changed environment
- Reward depends on all agents' actions: R = R(a_1, a_2, ..., a_n)
- Non-stationarity: other agents' policies change constantly
- Convergence is ill-defined (what is "optimal" when others adapt?)

### Problem Types: Cooperative, Competitive, Mixed

**Cooperative Multi-Agent Problem**:

```
Definition: All agents share the same objective
Reward: R_team(a_1, a_2, ..., a_n) = same for all agents

Example - Robot Team Assembly:
- All robots get the same team reward
- +100 if assembly succeeds
- 0 if assembly fails
- All robots benefit from success equally

Characteristic:
- Agents don't conflict on goals
- Challenge: Credit assignment (who deserves credit?)
- Solution: Value factorization (QMIX, QPLEX)

Key Insight:
Cooperative doesn't mean agents see each other!
- Agents might have partial/no observation of others
- Still must coordinate for team success
- Factorization enables coordination without observation
```

**Competitive Multi-Agent Problem**:

```
Definition: Agents have opposite objectives (zero-sum)
Reward: R_i(a_1, ..., a_n) = -R_j(a_1, ..., a_n) for i≠j

Example - Chess, Poker, Soccer:
- Agent 1 tries to win
- Agent 2 tries to win
- One's gain is the other's loss
- R_1 + R_2 = 0 (zero-sum)

Characteristic:
- Agents are adversarial
- Challenge: Computing the best response to the opponent
- Solution: Nash equilibrium (MADDPG, self-play)

Key Insight:
In competitive games, agents must predict opponent strategies.
- Agent 1 assumes Agent 2 plays a best response
- Agent 2 assumes Agent 1 plays a best response
- Nash equilibrium = mutual best response
- No agent can improve unilaterally
```

**Mixed Multi-Agent Problem**:

```
Definition: Some cooperation, some competition
Reward: R_i(a_1, ..., a_n) contains both shared and individual terms

Example - Team Soccer (3v3):
- Blue team agents cooperate toward the same goal
- But blue vs red is competitive
- Blue agent reward:
    R_i = +10 if blue scores, -10 if red scores (team-based)
        + 1 if blue_i scores a goal (individual bonus)

Characteristic:
- Agents cooperate with teammates
- Agents compete with opponents
- Challenge: Balancing cooperation and competition
- Solution: Hybrid approaches using both cooperative and competitive algorithms

Key Insight:
Mixed scenarios are the most common in practice.
- Robot teams: cooperate internally, compete for resources
- Trading: multiple firms (cooperate via regulations, compete for profit)
- Multiplayer games: team-based (cooperate with allies, compete with enemies)
```
### Non-Stationarity: The Core Challenge

**What is Non-Stationarity?**

```
Stationarity: Environment dynamics P(s'|s,a) and rewards R(s,a) are fixed
Non-Stationarity: Dynamics/rewards change over time

In multi-agent RL, the environment seen by Agent 1 marginalizes over the
other agents' current policies:

    P_t(s' | s, a_1) = Σ_{a_2, ..., a_n} π_2^t(a_2|s) ... π_n^t(a_n|s) · P(s' | s, a_1, ..., a_n)

If another agent's policy changes:

    π_2^t ≠ π_2^{t+1}

then the effective transition dynamics change:

    P_t(s' | s, a_1) ≠ P_{t+1}(s' | s, a_1)

The environment is non-stationary!
```

**Why Non-Stationarity Breaks Standard RL**:

```python
# Single-agent Q-learning assumes:
# - the environment is fixed during learning
# - Q-values converge because the Bellman optimality operator has a fixed point

Q[s, a] += alpha * (r + gamma * max(Q[s_next, :]) - Q[s, a])

# In multi-agent settings with non-stationarity:
# - other agents improve their policies
# - the value of the next state depends on what the other agents will do
# - when other agents improve, those values shift
# - Q-values chase a moving target
# - no convergence guarantee
```

**Impact on Learning**:

```
Scenario: Two agents learning to navigate
Agent 1 learns: "If Agent 2 goes left, I go right"
Agent 1 builds value estimates based on this assumption

Agent 2 improves: "Actually, going right is better"
Now Agent 2 goes right (not left)
Agent 1's assumptions are invalid!
Agent 1's value estimates become wrong
Agent 1 must relearn

Agent 1 tries a new path based on its new estimates
Agent 2 sees Agent 1's change and adapts
Agent 2's estimates become wrong

Result: Chaotic learning, no convergence
```
## Part 2: Centralized Training, Decentralized Execution (CTDE)

### CTDE Paradigm

**Key Idea**: Use centralized information during training, decentralized information during execution.

```
Training Phase (Centralized):
- Trainer observes: o_1, o_2, ..., o_n (all agents' observations)
- Trainer observes: a_1, a_2, ..., a_n (all agents' actions)
- Trainer observes: R_team or R_1, R_2, ... (reward signals)
- Trainer can assign credit fairly
- Trainer can compute global value functions

Execution Phase (Decentralized):
- Agent 1 observes: o_1 only
- Agent 1 executes: π_1(a_1 | o_1)
- Agent 1 doesn't need to see other agents
- Each agent is independent during rollout
- Enables scalability and robustness
```

**Why CTDE Addresses Non-Stationarity**:

```
During training:
- The centralized trainer sees all information
- Can compute a joint value Q(s_1, ..., s_n, a_1, ..., a_n)
- Can factor it: Q_team = f(Q_1, Q_2, ..., Q_n) (QMIX)
- Can compute importance weights: who contributed most?

During execution:
- Decentralized agents only use their own observations
- Policies learned during centralized training work well
- No need for other agents' observations at runtime
- Robust to other agents' changes (the policy doesn't depend on their states)

Result:
- Training leverages global information for stability
- Execution is independent and scalable
- Mitigates non-stationarity: the training targets condition on everyone's actions
```

### CTDE in Practice

**Centralized Information Used in Training**:

```python
# During training, compute a global value function.
# Inputs: observations and actions of ALL agents.
# (combine, centralized_q_network and q_network_1..3 are assumed to be defined elsewhere.)
def compute_value_ctde(obs_1, obs_2, obs_3, act_1, act_2, act_3):
    # See everyone's observations
    global_state = combine(obs_1, obs_2, obs_3)

    # See everyone's actions
    joint_action = (act_1, act_2, act_3)

    # Compute a shared value with all information
    Q_shared = centralized_q_network(global_state, joint_action)

    # Individual Q-values (QMIX-style)
    Q_1 = q_network_1(obs_1, act_1)
    Q_2 = q_network_2(obs_2, act_2)
    Q_3 = q_network_3(obs_3, act_3)

    # Factorization: Q_team ≈ mixing_network(Q_1, Q_2, Q_3)
    # Each agent learns its contribution via the QMIX loss
    return Q_shared, (Q_1, Q_2, Q_3)
```

**Decentralized Execution**:

```python
# During execution, use only the agent's own observation
def execute_policy(agent_id, own_observation):
    # The agent only sees and uses its own observation
    action = policy_network(own_observation)

    # No access to other agents' observations
    # Doesn't need other agents' actions
    # Purely decentralized execution
    return action

# All agents execute in parallel:
# Agent 1: o_1 → a_1 (decentralized)
# Agent 2: o_2 → a_2 (decentralized)
# Agent 3: o_3 → a_3 (decentralized)
# Execution is independent!
```
## Part 3: QMIX - Value Factorization for Cooperative Teams

### QMIX: The Core Insight

**Problem**: In cooperative teams, how do you assign credit fairly?

```
Naive approach: Joint Q-value
Q_team(s, a_1, a_2, ..., a_n) = expected return from the joint action

Problem 1: Still doesn't assign individual credit
If Q_team = 100, how much did Agent 1 contribute?
Agent 1 might think: "I deserve 50%" (overconfident)
But Agent 1 might deserve only 10% (others did more)

Problem 2: Greedy action selection requires an argmax over the joint action
space, which grows exponentially with the number of agents and cannot be
done by each agent on its own.

Result: Agents learn wrong priorities, and execution can't be decentralized
```

**Solution: Value Factorization (QMIX)**

```
Key Assumption: Monotonicity
If improving Agent i's individual value Q_i improves the team value,
then each agent can safely maximize its own Q_i.

Mathematical form (QMIX condition):
    ∂Q_team / ∂Q_i ≥ 0   for every agent i

This yields the Individual-Global-Max (IGM) property:
    argmax over the joint action of Q_team
    = (argmax_{a_1} Q_1, argmax_{a_2} Q_2, ..., argmax_{a_n} Q_n)
so decentralized greedy action selection agrees with the joint optimum.

Concrete Implementation:
    Q_team(s, a_1, ..., a_n) = f(Q_1(o_1, a_1), Q_2(o_2, a_2), ..., Q_n(o_n, a_n); s)

Where:
- Q_i: Individual Q-network for agent i
- f: Monotonic mixing network (non-negative weights ensure monotonicity)

Monotonicity guarantee:
If Q_1 increases, Q_team cannot decrease (because f is monotonic)
```
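Below is a minimal sketch of how that monotonicity can be enforced in practice: a hypernetwork produces the mixing weights from the global state and passes them through an absolute value so they stay non-negative, mirroring the original QMIX mixer (dimensions and names here are illustrative).

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q-values into Q_team with non-negative weights,
    so dQ_team/dQ_i >= 0 (the QMIX monotonicity constraint)."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks: produce mixing weights/biases from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        self.n_agents = n_agents
        self.embed_dim = embed_dim

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        bs = agent_qs.size(0)
        # abs() keeps the mixing weights non-negative -> monotonic mixing
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_team = torch.bmm(hidden, w2) + b2
        return q_team.view(bs, 1)
```

If a module like this replaced `self.mixing_network` in the training sketch below (called as `mixing_network(q_values, state)`), the factorization would be genuinely monotonic rather than only approximately so.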
### QMIX Algorithm

**Architecture**:

```
Per-agent Q-networks (one per agent, typically recurrent):

    Agent 1:  o_1 ──► [LSTM] ──► Q_1(o_1, a_1)
    Agent 2:  o_2 ──► [LSTM] ──► Q_2(o_2, a_2)
    Agent 3:  o_3 ──► [LSTM] ──► Q_3(o_3, a_3)

Monotonic mixing network:

    (Q_1, Q_2, Q_3) ──► [mixing MLP] ──► Q_team

    A hypernetwork generates the mixing weights as a function of the global
    state; the weights are kept non-negative, which enforces monotonicity.

Value outputs: Q_1(o_1, a_1), Q_2(o_2, a_2), Q_3(o_3, a_3)
Mixing: Q_team = mixing_network(Q_1, Q_2, Q_3, state)
```

**QMIX Training**:
```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam


class QMIXAgent:
    def __init__(self, n_agents, state_dim, obs_dim, action_dim, hidden_dim=64):
        self.n_agents = n_agents
        self.action_dim = action_dim

        # Individual Q-networks (one per agent)
        self.q_networks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim + action_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1)  # Q-value for this action
            )
            for _ in range(n_agents)
        ])

        # Mixing network: takes individual Q-values (plus state) and produces the joint Q.
        # NOTE: a plain MLP does not guarantee monotonicity; full QMIX constrains the
        # mixing weights to be non-negative (see the hypernetwork sketch above).
        self.mixing_network = nn.Sequential(
            nn.Linear(n_agents + state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        # Hypernet: generates mixing-network weights as a function of the state
        # (in full QMIX these weights go through abs() to ensure monotonicity)
        self.hypernet = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim * (n_agents + state_dim))
        )

        self.optimizer = Adam(
            list(self.q_networks.parameters()) +
            list(self.mixing_network.parameters()) +
            list(self.hypernet.parameters()),
            lr=5e-4
        )

        self.discount = 0.99
        self.target_update_rate = 0.001
        self.epsilon = 0.05

        # Target networks (soft update)
        self._init_target_networks()

    def _init_target_networks(self):
        """Create target networks for stable learning."""
        self.target_q_networks = copy.deepcopy(self.q_networks)
        self.target_mixing_network = copy.deepcopy(self.mixing_network)

    def compute_individual_q_values(self, observations, actions):
        """
        Compute Q-values for each agent given their observation and action.

        Args:
            observations: list of n_agents observations (each [batch_size, obs_dim])
            actions: list of n_agents actions (each [batch_size, action_dim], one-hot)

        Returns:
            q_values: tensor [batch_size, n_agents]
        """
        q_values = []
        for i, (obs, act) in enumerate(zip(observations, actions)):
            # Concatenate observation and action
            q_input = torch.cat([obs, act], dim=-1)
            q_i = self.q_networks[i](q_input)
            q_values.append(q_i)

        return torch.cat(q_values, dim=-1)  # [batch_size, n_agents]

    def compute_joint_q_value(self, q_values, state):
        """
        Mix individual Q-values into a joint Q-value using the mixing network.

        Args:
            q_values: individual Q-values [batch_size, n_agents]
            state: global state [batch_size, state_dim]

        Returns:
            q_joint: joint Q-value [batch_size, 1]
        """
        # The mixing network learns to combine Q-values conditioned on the state
        # (monotonicity would be enforced by non-negative weight constraints).
        q_joint = self.mixing_network(torch.cat([q_values, state], dim=-1))
        return q_joint

    def train_step(self, batch, state_batch):
        """
        One QMIX training step.

        Batch contains:
            observations: list[n_agents] of [batch_size, obs_dim]
            actions: list[n_agents] of [batch_size, action_dim]
            rewards: [batch_size] (shared team reward)
            next_observations: list[n_agents] of [batch_size, obs_dim]
            dones: [batch_size]
        """
        observations, actions, rewards, next_observations, dones = batch

        # Compute current Q-values
        q_values = self.compute_individual_q_values(observations, actions)
        q_joint = self.compute_joint_q_value(q_values, state_batch)

        # Compute target Q-values
        with torch.no_grad():
            # In a full implementation: greedy next actions evaluated with the
            # TARGET networks; zeros are a placeholder to keep the sketch short.
            next_q_values = self.compute_individual_q_values(
                next_observations,
                [torch.zeros_like(a) for a in actions]  # best actions (simplified)
            )

            # Mix next Q-values (a full implementation uses the target mixer
            # and the next global state here)
            next_q_joint = self.compute_joint_q_value(next_q_values, state_batch)

            # TD target: the team gets a shared reward
            td_target = rewards.unsqueeze(-1) + (
                1 - dones.unsqueeze(-1)
            ) * self.discount * next_q_joint

        # QMIX loss
        qmix_loss = ((q_joint - td_target) ** 2).mean()

        self.optimizer.zero_grad()
        qmix_loss.backward()
        self.optimizer.step()

        # Soft update target networks
        self._soft_update_targets()

        return {'qmix_loss': qmix_loss.item()}

    def _soft_update_targets(self):
        """Soft update target networks toward the main networks."""
        for target, main in zip(self.target_q_networks, self.q_networks):
            for target_param, main_param in zip(target.parameters(), main.parameters()):
                target_param.data.copy_(
                    self.target_update_rate * main_param.data +
                    (1 - self.target_update_rate) * target_param.data
                )
        for target_param, main_param in zip(self.target_mixing_network.parameters(),
                                            self.mixing_network.parameters()):
            target_param.data.copy_(
                self.target_update_rate * main_param.data +
                (1 - self.target_update_rate) * target_param.data
            )

    def select_actions(self, observations):
        """
        Epsilon-greedy action selection (decentralized execution).
        Each agent selects its action independently.
        """
        actions = []
        for i, obs in enumerate(observations):
            with torch.no_grad():
                # Agent i evaluates all possible actions
                best_action = None
                best_q = -float('inf')

                for action in range(self.action_dim):
                    action_onehot = F.one_hot(
                        torch.tensor(action), num_classes=self.action_dim
                    ).float()
                    q_input = torch.cat([obs, action_onehot], dim=-1)
                    q_val = self.q_networks[i](q_input).item()

                    if q_val > best_q:
                        best_q = q_val
                        best_action = action

                # Epsilon-greedy exploration
                if torch.rand(1).item() < self.epsilon:
                    best_action = torch.randint(0, self.action_dim, (1,)).item()

            actions.append(best_action)

        return actions
```
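As a rough sketch of how this class would be driven (the environment API, its return shapes, and the replay-buffer sampling format are assumptions for illustration):

```python
# Hypothetical training loop for the QMIXAgent sketch above.
# `env` and `replay_buffer` follow an assumed API; shapes match the docstrings.
agent = QMIXAgent(n_agents=3, state_dim=24, obs_dim=10, action_dim=5)

for episode in range(1000):
    observations, state = env.reset()
    done = False
    while not done:
        actions = agent.select_actions(observations)           # decentralized
        next_observations, next_state, reward, done = env.step(actions)
        replay_buffer.add((observations, actions, reward, next_observations, done))
        observations, state = next_observations, next_state

    batch, state_batch = replay_buffer.sample(batch_size=32)   # centralized training data
    metrics = agent.train_step(batch, state_batch)
```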
**QMIX Key Concepts**:

1. **Monotonicity**: If an agent improves its action, the team value cannot decrease
2. **Value Factorization**: Q_team = f(Q_1, Q_2, ..., Q_n)
3. **Decentralized Execution**: Each agent uses only its own observation
4. **Centralized Training**: The trainer sees all Q-values and the global state

**When QMIX Works Well**:

- Fully observable or partially observable cooperative teams
- Sparse communication needs
- Fixed team membership
- Shared reward structure

**QMIX Limitations**:

- Assumes monotonicity (not all cooperative games satisfy this)
- Doesn't handle explicit communication
- Doesn't learn agent roles dynamically


## Part 4: MADDPG - Multi-Agent Actor-Critic

### MADDPG: For Competitive and Mixed Scenarios

**Core Idea**: Actor-critic, but with a centralized critic during training.

```
DDPG (single-agent):
- Actor π(a|s) learns the policy
- Critic Q(s,a) estimates value
- Critic trains the actor via the policy gradient

MADDPG (multi-agent):
- Each agent has an actor π_i(a_i|o_i)
- Each agent also has a centralized critic Q_i(s, a_1, ..., a_n) that sees all agents' actions
- During training: use the centralized critic for learning
- During execution: each agent uses only its own actor
```

**MADDPG Algorithm**:
```python
# (Reuses the imports from the QMIX example above: torch, nn, Adam, copy.)

class MADDPGAgent:
    def __init__(self, agent_id, n_agents, obs_dim, action_dim, state_dim, hidden_dim=256):
        self.agent_id = agent_id
        self.n_agents = n_agents
        self.action_dim = action_dim

        # Actor: learns the decentralized policy π_i(a_i|o_i)
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh()  # Continuous actions in [-1, 1]
        )

        # Critic: centralized value Q(s, a_1, ..., a_n)
        # Input: global state + all agents' actions
        self.critic = nn.Sequential(
            nn.Linear(state_dim + n_agents * action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Single value output
        )

        # Target networks for stability
        self.target_actor = copy.deepcopy(self.actor)
        self.target_critic = copy.deepcopy(self.critic)

        self.actor_optimizer = Adam(self.actor.parameters(), lr=1e-4)
        self.critic_optimizer = Adam(self.critic.parameters(), lr=1e-3)

        self.discount = 0.99
        self.tau = 0.01  # Soft update rate

    def train_step(self, batch, other_agents_actors, other_agents_target_actors):
        """
        MADDPG training step.

        Batch contains:
            observations: list[n_agents] of [batch_size, obs_dim]
            actions: list[n_agents] of [batch_size, action_dim]
            rewards: [batch_size] (agent-specific reward!)
            next_observations: list[n_agents] of [batch_size, obs_dim]
            global_state: [batch_size, state_dim]
            next_global_state: [batch_size, state_dim]
            dones: [batch_size]

        other_agents_actors / other_agents_target_actors: mappings from the other
        agents' ids to their (target) actor networks, so this agent's centralized
        critic can evaluate everyone's actions.
        """
        observations, actions, rewards, next_observations, \
            global_state, next_global_state, dones = batch

        agent_reward = rewards  # Agent-specific reward

        # Step 1: Critic update (centralized)
        with torch.no_grad():
            # Compute next actions using target actors
            next_actions = []
            for i, next_obs in enumerate(next_observations):
                if i == self.agent_id:
                    next_a = self.target_actor(next_obs)
                else:
                    # Use the other agents' target actors
                    next_a = other_agents_target_actors[i](next_obs)
                next_actions.append(next_a)

            # Concatenate all next actions
            next_actions_cat = torch.cat(next_actions, dim=-1)

            # Compute next value (centralized critic)
            next_q = self.target_critic(
                torch.cat([next_global_state, next_actions_cat], dim=-1)
            )

            # TD target
            td_target = agent_reward.unsqueeze(-1) + (
                1 - dones.unsqueeze(-1)
            ) * self.discount * next_q

        # Compute current Q-value
        current_actions_cat = torch.cat(actions, dim=-1)
        current_q = self.critic(
            torch.cat([global_state, current_actions_cat], dim=-1)
        )

        # Critic loss
        critic_loss = ((current_q - td_target) ** 2).mean()

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Step 2: Actor update (decentralized policy improvement)
        # The actor only uses its own observation
        policy_actions = []
        for i, obs in enumerate(observations):
            if i == self.agent_id:
                # Use the current actor for this agent
                action_i = self.actor(obs)
            else:
                # Other agents' actions are treated as fixed (no gradient through them)
                action_i = other_agents_actors[i](obs).detach()
            policy_actions.append(action_i)

        # Compute the Q-value under the current policy
        policy_actions_cat = torch.cat(policy_actions, dim=-1)
        policy_q = self.critic(
            torch.cat([global_state, policy_actions_cat], dim=-1)
        )

        # Policy gradient: maximize the Q-value
        actor_loss = -policy_q.mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft update target networks
        self._soft_update_targets()

        return {
            'critic_loss': critic_loss.item(),
            'actor_loss': actor_loss.item(),
            'avg_q_value': current_q.mean().item()
        }

    def _soft_update_targets(self):
        """Soft update target networks toward the main networks."""
        for target_param, main_param in zip(
            self.target_actor.parameters(),
            self.actor.parameters()
        ):
            target_param.data.copy_(
                self.tau * main_param.data + (1 - self.tau) * target_param.data
            )

        for target_param, main_param in zip(
            self.target_critic.parameters(),
            self.critic.parameters()
        ):
            target_param.data.copy_(
                self.tau * main_param.data + (1 - self.tau) * target_param.data
            )

    def select_action(self, observation):
        """Decentralized action selection."""
        with torch.no_grad():
            action = self.actor(observation)
            # Add exploration noise
            action = action + 0.1 * torch.randn_like(action)
            action = torch.clamp(action, -1, 1)
            return action.cpu().numpy()
```
**MADDPG Key Properties**:

1. **Centralized Critic**: Sees all agents' observations and actions
2. **Decentralized Actors**: Each agent uses only its own observation
3. **Agent-Specific Rewards**: Each agent maximizes its own reward
4. **Handles Competitive/Mixed**: Doesn't assume cooperation
5. **Continuous Actions**: Works well with continuous action spaces

**When MADDPG Works Well**:

- Competitive and mixed-motive scenarios
- Continuous action spaces
- Partial observability (agents don't see each other)
- Need for independent agent rewards


## Part 5: Communication in Multi-Agent Systems

### When and Why Communication Helps

**Problem Without Communication**:

```
Agents with partial observability:
Agent 1: sees position p_1, but NOT p_2
Agent 2: sees position p_2, but NOT p_1

Goal: Avoid collision while moving to targets

Without communication:
Agent 1: "I don't know where Agent 2 is"
Agent 2: "I don't know where Agent 1 is"

Both might move toward the same corridor
Collision: the agents had no way to coordinate!

With communication:
Agent 1: broadcasts "I'm moving left"
Agent 2: receives the message, moves right
No collision!
```

**Communication Trade-offs**:

```
Advantages:
- Enables coordination under partial observability
- Can solve some problems impossible without communication
- Explicit intention sharing

Disadvantages:
- Adds complexity: agents must learn what to communicate
- High variance: messages might mislead
- Computational overhead: processing all messages
- Communication bandwidth is limited in real systems

When to use communication:
- Partial observability prevents coordination
- Explicit roles (e.g., one agent is a "scout")
- Limited field of view, agents are out of sight
- Agents benefit from sharing intentions

When NOT to use communication:
- Full observability (agents see everything)
- Simple coordination (value factorization sufficient)
- Communication is unreliable
```

### CommNet: Learning Communication

**Idea**: Agents learn to send and receive messages to improve coordination.

```
Architecture:
1. Each agent processes its own observation: f_i(o_i) → hidden state h_i
2. Agent broadcasts its hidden state as a "message"
3. Agent receives messages from neighbors
4. Agent aggregates messages: Σ_j M(h_j) (mean pooling or attention)
5. Agent processes the aggregated information: policy π(a_i | h_i, aggregated)

Key: Agents learn what information to broadcast in h_i
Receiving agents learn which messages are useful
```

**Simple Communication Example**:
```python
class CommNetAgent(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()

        # Encoding network: observation → hidden message
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)  # Message to broadcast
        )

        # Communication aggregation (simplified attention)
        self.comm_processor = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),  # Own + received
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

        # Policy network
        self.policy = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def compute_message(self, observation):
        """Generate the message to broadcast to other agents."""
        return self.encoder(observation)

    def forward(self, observation, received_messages):
        """
        Process observation + received messages, output action.

        Args:
            observation: [obs_dim]
            received_messages: list of messages from neighbors

        Returns:
            action: [action_dim]
            my_message: [hidden_dim] message broadcast to others
        """
        # Generate own message
        my_message = self.encoder(observation)

        # Aggregate received messages (mean pooling)
        if received_messages:
            others_messages = torch.stack(received_messages).mean(dim=0)
        else:
            others_messages = torch.zeros_like(my_message)

        # Process aggregated communication
        combined = torch.cat([my_message, others_messages], dim=-1)
        hidden = self.comm_processor(combined)

        # Select action
        action = self.policy(hidden)
        return action, my_message
```
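A rough usage sketch of one communication round with three of these agents (the dimensions and two-round protocol are illustrative assumptions):

```python
import torch

# Hypothetical exchange among three CommNetAgent instances (illustrative only).
agents = [CommNetAgent(obs_dim=8, action_dim=4) for _ in range(3)]
observations = [torch.randn(8) for _ in range(3)]

# Round 1: every agent computes a message from its own observation.
messages = [agent.compute_message(obs) for agent, obs in zip(agents, observations)]

# Round 2: each agent acts on its observation plus the other agents' messages.
actions = []
for i, (agent, obs) in enumerate(zip(agents, observations)):
    received = [m for j, m in enumerate(messages) if j != i]
    action, _ = agent(obs, received)
    actions.append(action)
```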
**Communication Pitfall**: Agents can learn to send misleading messages!

```
Without careful design, agents can learn deceptive communication
(this mainly appears when rewards are individual or mixed, not fully shared):

Agent 1 learns: "If I claim I'm heading for resource A, Agent 2 will go to resource B"
Agent 1 broadcasts: "Heading to A", then actually grabs B first
Agent 2 trusts the message, detours toward B, and arrives too late
Agent 1 gets a higher individual reward (its deception worked)

Solution: Design communication carefully
- Fully shared rewards implicitly encourage truthful messages
- Use communication only when beneficial
- Monitor emergent communication protocols
```


## Part 6: Credit Assignment in Cooperative Teams

### Individual Reward vs Team Reward

**Problem**:

```
Scenario: 3-robot assembly team
Team reward: +100 if assembly succeeds, 0 if it fails

Individual Reward Design:
Option 1 - Split equally: each robot gets +33.33
    Problem: Robot 3 (insignificant) gets the same credit as Robot 1 (crucial)

Option 2 - Use agent contribution:
    Robot 1 (held piece): +60
    Robot 2 (guided insertion): +25
    Robot 3 (steadied base): +15
    Problem: How to compute the contributions? (requires complex analysis)

Option 3 - Use value factorization (QMIX):
    Team value = mixing_network(Q_1, Q_2, Q_3)
    Each robot learns its Q-value
    QMIX learns to weight Q-values by importance
    Result: Fair credit assignment via factorization
```

**QMIX Credit Assignment Mechanism**:

```
Training:
    Observe: robot_1 does action a_1, gets q_1
             robot_2 does action a_2, gets q_2
             robot_3 does action a_3, gets q_3
    Team gets reward r_team

Factorize: r_team ≈ mixing_network(q_1, q_2, q_3)
           ≈ w_1 * q_1 + w_2 * q_2 + w_3 * q_3 + bias
           (schematically: the real mixing is nonlinear and state-dependent)

Learn the weights w_i via the mixing network

If Robot 1 is crucial:
    the mixing network learns w_1 > w_2, w_3
    Robot 1 gets larger credit (w_1 * q_1 dominates)

If Robot 3 is redundant:
    the mixing network learns w_3 ≈ 0
    Robot 3 gets little credit

Result: Each robot learns a fair estimate of its contribution
```

**Value Decomposition Pitfall**: Agents can game the factorization!

```
Example: Learned mixing weights w = [0.9, 0.05, 0.05]

Agent 1 learns: "I must maximize q_1 (it has weight 0.9)"
Agent 1 tries: the action that maximizes its own q_1
Problem: q_1 is only a learned estimate; maximizing it myopically
         might not actually help the team!

Solution: Use proper credit assignment metrics
- Shapley values: game-theoretic approach to credit
- Counterfactual reasoning: what would have happened if the agent hadn't acted?
- Implicit credit (QMIX): let the factorization learn it emergently
```
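One concrete form of counterfactual reasoning is the counterfactual baseline used by COMA. A rough sketch, assuming a centralized critic `q_critic(state, joint_action)` and a dict-style joint action, both of which are illustrative:

```python
import torch

def counterfactual_advantage(q_critic, state, joint_action, agent_id, agent_action_probs):
    """COMA-style advantage: how much better was the chosen action than the
    agent's average action, holding everyone else's actions fixed?"""
    n_actions = agent_action_probs.shape[-1]
    chosen_q = q_critic(state, joint_action)

    # Counterfactual baseline: marginalize agent_id's action under its own policy.
    baseline = 0.0
    for a in range(n_actions):
        counterfactual = dict(joint_action)   # copy of the joint action
        counterfactual[agent_id] = a          # swap only this agent's action
        baseline = baseline + agent_action_probs[a] * q_critic(state, counterfactual)

    return chosen_q - baseline
```

A positive advantage means the agent's actual action helped the team more than its typical action would have, which is exactly the per-agent credit signal the factorization is trying to approximate implicitly.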
## Part 7: Common Multi-Agent RL Failure Modes

### Failure Mode 1: Non-Stationarity Instability

**Symptom**: Learning curves are erratic, no convergence.

```python
# Problem scenario: agents take turns doing independent updates
for episode in range(1000):
    # Agent 1 learns
    for t in range(steps):
        a_1 = agent_1.select_action(o_1)
        a_2 = agent_2.select_action(o_2)  # using its old policy!
        r, o_1_next, o_2_next = env.step(a_1, a_2)
        agent_1.update(a_1, r, o_1_next)

    # Agent 2 improves (the environment changes for Agent 1!)
    for t in range(steps):
        a_1 = agent_1.select_action(o_1)  # OLD value estimates
        a_2 = agent_2.select_action(o_2)  # NEW policy (Agent 2 improved)
        r, o_1_next, o_2_next = env.step(a_1, a_2)
        agent_2.update(a_2, r, o_2_next)

# Result: Agent 1's Q-values become invalid when Agent 2 improves.
# Learning is unstable and doesn't converge.
```

**Solution**: Use CTDE or opponent modeling

```python
# CTDE approach:
# During training, use global information to stabilize learning
trainer.observe(o_1, a_1, o_2, a_2, r)
# The trainer sees both agents' actions, so it can compute a stable target

# During execution:
agent_1.execute(o_1)  # decentralized: own observation only
agent_2.execute(o_2)  # decentralized: own observation only
```

### Failure Mode 2: Reward Ambiguity

**Symptom**: Agents don't improve, stuck at local optima.

```python
# Problem: multi-agent team, shared reward
total_reward = 50

# Distribution: who gets what?
# Agent 1 thinks: "I deserve 50" (overconfident)
# Agent 2 thinks: "I deserve 50" (overconfident)
# Agent 3 thinks: "I deserve 50" (overconfident)

# Each agent overestimates its importance
# Each agent learns selfishly (internal conflict)
# Team coordination breaks

# Result: team performance is worse than if the agents coordinated
```

**Solution**: Use value factorization

```python
# QMIX learns a fair decomposition
q_1, q_2, q_3 = compute_individual_values(a_1, a_2, a_3)
team_value = mixing_network(q_1, q_2, q_3)

# The mixing network learns each agent's importance:
# if Agent 2 is crucial, its effective weight grows.
# Training adjusts the mixing based on who actually helped.

# Result: fair credit, agents coordinate
```

### Failure Mode 3: Algorithm-Reward Mismatch

**Symptom**: Learning fails in specific problem types (cooperative/competitive).

```python
# Problem: using QMIX (cooperative) in a competitive setting
# Competitive game (agents have opposite rewards)

# QMIX assumes a shared reward (so monotonic mixing makes sense)
# But in a competitive game:
#   Q_1 high means Agent 1 is winning
#   Q_2 high means Agent 2 is winning (the opposite outcome!)
# Mixing them monotonically into one team value makes no sense
# Convergence fails

# Solution: use MADDPG (handles competitive settings)
# MADDPG doesn't assume monotonicity
# It works with individual rewards
# It handles competition naturally
```
## Part 8: When to Use Multi-Agent RL

### Problem Characteristics for MARL

**Use MARL when**:

```
1. Multiple simultaneous learners
   - Problem has 2+ agents learning
   - NOT just parallel tasks (that's single-agent x N)

2. Shared/interdependent environment
   - Agents' actions affect each other
   - One agent's action impacts other agents' rewards
   - True interaction (not independent MDPs)

3. Coordination is beneficial
   - Agents can improve by coordinating
   - Alternative: agents could act independently (inefficient)

4. Non-trivial communication/credit
   - Agents need to coordinate or assign credit
   - NOT trivial to decompose into independent subproblems
```

**Use Single-Agent RL when**:

```
1. Single learning agent (others are environment)
   - Example: one RL agent vs static rules-based opponents
   - Environment includes other agents, but they're not learning

2. Independent parallel tasks
   - Example: 10 robots, each with own goal, no interaction
   - Use single-agent RL x 10 (faster, simpler)

3. Fully decomposable problems
   - Example: multi-robot path planning (can use single-agent per robot)
   - Problem decomposes into independent subproblems

4. Scalability critical
   - Single-agent RL scales to huge teams
   - MARL harder to scale (centralized training bottleneck)
```

### Decision Tree

```
Problem: Multiple agents learning together?
    NO  → Use single-agent RL
    YES ↓

Problem: Agents' rewards interdependent?
    NO  → Use single-agent RL x N (parallel)
    YES ↓

Problem: Agents must coordinate?
    NO  → Use independent learning (but expect instability)
    YES ↓

Problem structure:
    COOPERATIVE → Use QMIX, MAPPO, QPLEX
    COMPETITIVE → Use MADDPG, self-play
    MIXED       → Use hybrid (cooperative + competitive algorithms)
```


## Part 9: Opponent Modeling in Competitive Settings

### Why Model Opponents?

**Problem Without Opponent Modeling**:

```
Agent 1 (using MADDPG) learns:
    "Move right gives Q=50"

But this assumes Agent 2 plays policy π_2

When Agent 2 improves to π'_2:
    "Move right gives Q=20" (because Agent 2 now blocks that path)

Agent 1's Q-value estimates become stale!
The environment has changed (the opponent improved)
```

**Solution: Opponent Modeling**
```python
class OpponentModelingAgent:
    def __init__(self, agent_id, n_agents, obs_dim, action_dim):
        self.agent_id = agent_id
        self.obs_dim = obs_dim
        self.action_dim = action_dim

        # Own actor and critic (architectures as in MADDPG; assumed defined elsewhere)
        self.actor = self._build_actor(obs_dim, action_dim)
        self.critic = self._build_critic()

        # Models of opponent policies (for the agents we compete against)
        self.opponent_models = {
            i: self._build_opponent_model() for i in range(n_agents) if i != agent_id
        }
        self.opponent_optimizers = {
            i: Adam(model.parameters(), lr=1e-3)
            for i, model in self.opponent_models.items()
        }

    def _build_opponent_model(self):
        """Model what an opponent will do given its observation."""
        return nn.Sequential(
            nn.Linear(self.obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, self.action_dim)
        )

    def train_step_with_opponent_modeling(self, batch):
        """
        Update own policy AND the opponent models.

        Key insight: predict what each opponent will do,
        then plan against those predictions.
        """
        observations, actions, rewards, next_observations = batch

        # Step 1: Update opponent models (supervised)
        # Predict each opponent's action from its observation
        opponent_losses = {}
        for opponent_id, model in self.opponent_models.items():
            predicted_action = model(observations[opponent_id])
            actual_action = actions[opponent_id]
            opponent_loss = ((predicted_action - actual_action) ** 2).mean()

            # Update this opponent's model
            self.opponent_optimizers[opponent_id].zero_grad()
            opponent_loss.backward()
            self.opponent_optimizers[opponent_id].step()
            opponent_losses[opponent_id] = opponent_loss.item()

        # Step 2: Plan against the opponent predictions
        predicted_opponent_actions = {
            i: model(observations[i])
            for i, model in self.opponent_models.items()
        }

        # Use the predictions in the MADDPG update:
        # - the critic sees: own obs + predicted opponent actions
        # - the actor learns: the best response given those predictions
        # (actor/critic update omitted here; see the MADDPG example above)

        return {'opponent_loss': sum(opponent_losses.values()) / len(opponent_losses)}
```
**Opponent Modeling Trade-offs**:

```
Advantages:
- Accounts for opponent improvements (non-stationarity)
- Enables planning ahead
- Reduces brittleness to opponent policy changes

Disadvantages:
- Requires learning opponent models (additional supervision)
- If the opponent model is wrong, the agent learns the wrong policy
- Computational overhead
- Assumes the opponent is predictable

When to use:
- Competitive settings with clear opponents
- Limited number of distinct opponents
- Opponents have consistent strategies

When NOT to use:
- Too many potential opponents
- Opponents are unpredictable
- Cooperative setting (waste of resources)
```


## Part 10: Advanced: Independent Q-Learning (IQL) for Multi-Agent

### IQL in Multi-Agent Settings

**Idea**: Each agent learns a Q-value using only its own rewards and observations.
```python
class IQLMultiAgent:
    def __init__(self, agent_id, obs_dim, action_dim):
        self.agent_id = agent_id
        self.action_dim = action_dim

        # Q-network for this agent only
        self.q_network = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

        self.optimizer = Adam(self.q_network.parameters(), lr=1e-3)

    def train_step(self, batch):
        """
        Independent Q-learning: each agent learns from its own reward only.

        Batch (for this agent): observations [batch, obs_dim],
        actions [batch, action_dim] (one-hot), rewards [batch],
        next_observations [batch, obs_dim].

        Problem: Non-stationarity
        - Other agents improve their policies
        - The environment from this agent's perspective changes
        - Q-values become invalid

        Benefit: Decentralized
        - No centralized training needed
        - Scalable to many agents
        """
        observations, actions, rewards, next_observations = batch
        batch_size = observations.shape[0]

        # Q-value update (standard Q-learning)
        with torch.no_grad():
            # Greedy next action (assume the agent acts greedily)
            next_q_values = []
            for action in range(self.action_dim):
                a_onehot = F.one_hot(
                    torch.full((batch_size,), action), num_classes=self.action_dim
                ).float()
                q_val = self.q_network(torch.cat([next_observations, a_onehot], dim=-1))
                next_q_values.append(q_val)

            max_next_q = torch.max(torch.stack(next_q_values), dim=0)[0]
            td_target = rewards.unsqueeze(-1) + 0.99 * max_next_q

        # Current Q-value
        q_pred = self.q_network(torch.cat([observations, actions], dim=-1))

        # TD loss
        loss = ((q_pred - td_target) ** 2).mean()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return {'loss': loss.item()}
```
**IQL in Multi-Agent: Pros and Cons**:

```
Advantages:
- Fully decentralized (scalable)
- No communication needed
- Simple implementation
- Works with partial observability

Disadvantages:
- Non-stationarity breaks convergence
- Agents chase moving targets (other agents improving)
- No explicit coordination
- Performance often poor without CTDE

Result:
- IQL works but is unstable in true multi-agent settings
- Better to use CTDE (QMIX, MADDPG) for stability
- IQL useful if centralized training impossible
```


## Part 11: Multi-Agent Experience Replay and Batch Sampling

### Challenges of Experience Replay in Multi-Agent

**Problem**:

```
In single-agent RL:
    Experience replay stores (s, a, r, s', d)
    Sample uniformly from the buffer
    Works well (iid samples)

In multi-agent RL:
    Experience replay stores (s, a_1, a_2, ..., a_n, r, s')
    But the other agents are non-stationary!

    A transition (s, a_1, a_2, r, s') is valid only if:
    - assumptions about other agents' policies still hold
    - if other agents improved, those assumptions are invalid

Solution: Prioritized experience replay for multi-agent
- Prioritize transitions where the agent's assumptions are likely correct
- Down-weight transitions from old policies (outdated assumptions)
- Focus on recent transitions (more relevant)
```

**Batch Sampling Strategy**:
```python
from collections import deque

import numpy as np


class MultiAgentReplayBuffer:
    def __init__(self, capacity=100000, n_agents=3):
        self.buffer = deque(maxlen=capacity)
        self.timestamps = deque(maxlen=capacity)
        self.n_agents = n_agents
        self.t = 0  # global insertion counter

    def add(self, transition):
        """Store experience: (observations, actions, rewards, next_observations, dones)."""
        self.buffer.append(transition)
        self.timestamps.append(self.t)
        self.t += 1

    def _compute_priorities(self):
        """Recency-based priority for the multi-agent setting.

        Newer transitions were generated under policies closer to the current
        ones, so their implicit assumptions about the other agents are less
        stale. (TD-error-based priorities are another common option.)
        """
        ages = self.t - np.array(self.timestamps)
        return 0.99 ** ages  # exponential decay with age

    def sample(self, batch_size):
        """Sample a recency-weighted batch (high priority = more likely)."""
        priorities = self._compute_priorities()
        probs = priorities / priorities.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices]
```
## Part 12: 10+ Critical Pitfalls

1. **Treating as independent agents**: Non-stationarity breaks convergence
2. **Giving equal reward to unequal contributors**: Credit assignment fails
3. **Forgetting decentralized execution**: Agents need independent policies
4. **Communicating too much**: High variance, bandwidth waste
5. **Using cooperative algorithm in competitive game**: Convergence fails
6. **Using competitive algorithm in cooperative game**: Agents conflict
7. **Not using CTDE**: Weak coordination, brittle policies
8. **Assuming other agents will converge**: Non-stationarity = moving targets
9. **Value overestimation in team settings**: Similar to offline RL issues
10. **Forgetting opponent modeling**: In competitive settings, must predict others
11. **Communication deception**: Agents learn to mislead for short-term gain
12. **Scalability (too many agents)**: MARL doesn't scale to 100+ agents
13. **Experience replay staleness**: Old transitions assume old opponent policies
14. **Ignoring observability constraints**: Partial obs needs communication or factorization
15. **Reward structure not matching algorithm**: Cooperative/competitive mismatch


## Part 13: 10+ Rationalization Patterns

Users often rationalize MARL mistakes:

1. **"Independent agents should work"**: Doesn't understand non-stationarity
2. **"My algorithm converged to something"**: Might be local optima due to credit ambiguity
3. **"Communication improved rewards"**: Might be learned deception, not coordination
4. **"QMIX should work everywhere"**: Doesn't check problem for monotonicity
5. **"More agents = more parallelism"**: Ignores centralized training bottleneck
6. **"Rewards are subjective anyway"**: Credit assignment is objective (factorization)
7. **"I'll just add more training"**: Non-stationarity can't be fixed by more epochs
8. **"Other agents are fixed"**: But they're learning too (environment is non-stationary)
9. **"Communication bandwidth doesn't matter"**: In real systems, it does
10. **"Nash equilibrium is always stable"**: No, it's just best-response equilibrium


## Part 14: MAPPO - Multi-Agent Proximal Policy Optimization

### When to Use MAPPO

**Cooperative teams with policy gradients**:
```python
class MAPPOAgent:
    def __init__(self, agent_id, obs_dim, action_dim, hidden_dim=256):
        self.agent_id = agent_id

        # Actor: policy for decentralized execution
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

        # Critic: value function. In full MAPPO the critic is centralized and takes
        # the global state during training; here it takes the agent's observation
        # to keep the sketch short.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        self.actor_optimizer = Adam(self.actor.parameters(), lr=3e-4)
        self.critic_optimizer = Adam(self.critic.parameters(), lr=1e-3)

    def train_step_on_batch(self, observations, actions, returns, advantages):
        """
        MAPPO training: advantage actor-critic with a policy-gradient update.

        actions: [batch, 1] long tensor of discrete action indices.

        Key differences from DDPG:
        - Policy gradient (not off-policy value learning)
        - Centralized training (returns/advantages computed with global information)
        - Decentralized execution (the policy uses only its own observation)

        Note: full MAPPO uses the PPO clipped ratio against the behavior policy;
        this sketch uses the plain policy-gradient form (see the clipped objective below).
        """
        # Actor loss (policy gradient)
        action_probs = torch.softmax(self.actor(observations), dim=-1)
        action_log_probs = torch.log(action_probs.gather(-1, actions))

        policy_loss = -(action_log_probs * advantages).mean()

        # Entropy regularization (exploration)
        entropy = -(action_probs * torch.log(action_probs + 1e-8)).sum(dim=-1).mean()
        actor_loss = policy_loss - 0.01 * entropy

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Critic loss (value estimation)
        values = self.critic(observations)
        critic_loss = ((values - returns) ** 2).mean()

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        return {
            'actor_loss': actor_loss.item(),
            'critic_loss': critic_loss.item(),
            'entropy': entropy.item()
        }
```
**MAPPO vs QMIX**:

```
QMIX:
- Value-based (discrete actions)
- Value factorization (credit assignment)
- Works with partial observability

MAPPO:
- Policy gradient-based
- Centralized critic (advantage estimation)
- On-policy (requires recent trajectories)

Use MAPPO when:
- Continuous or large discrete action spaces
- On-policy learning acceptable
- Value factorization not needed (reward structure simple)

Use QMIX when:
- Discrete actions
- Need explicit credit assignment
- Off-policy learning preferred
```
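For reference, the clipped surrogate objective that full MAPPO inherits from PPO looks roughly like this (a sketch; `old_log_probs` come from the policy that collected the data):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss used by PPO/MAPPO (written as a minimization)."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the more pessimistic of the clipped / unclipped objectives.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```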
## Part 15: Self-Play for Competitive Learning

### Self-Play Mechanism

**Problem**: Training competitive agents requires opponents.

```
Naive approach:
- Agent 1 trains vs a fixed opponent
- Problem: the fixed opponent doesn't adapt
- Agent 1 learns to exploit it (brittle against new opponents)

Self-play:
- Agent 1 trains vs historical versions of itself
- Agent 1 improves → creates a stronger opponent
- New Agent 1 trains vs the stronger Agent 1
- Cycle: both improve together
- Result: robust agent that beats all versions of itself
```

**Self-Play Implementation**:
```python
# (Uses numpy as np and copy, as imported in the earlier examples.)

class SelfPlayTrainer:
    def __init__(self, agent_class, env, n_checkpoint_opponents=5):
        self.current_agent = agent_class()
        self.env = env
        self.opponent_pool = []  # Keep historical versions
        self.n_checkpoints = n_checkpoint_opponents

    def train(self, num_episodes):
        """Train with self-play against previous versions."""
        for episode in range(num_episodes):
            # Select opponent: a copy of the current agent or a historical version
            if not self.opponent_pool or np.random.rand() < 0.5:
                opponent = copy.deepcopy(self.current_agent)
            else:
                opponent = self.opponent_pool[np.random.randint(len(self.opponent_pool))]

            # Play an episode: current_agent vs opponent
            trajectory = self._play_episode(self.current_agent, opponent)

            # Train the current agent on the trajectory
            self.current_agent.train_on_trajectory(trajectory)

            # Periodically add the current agent to the opponent pool
            if episode % max(1, num_episodes // self.n_checkpoints) == 0:
                self.opponent_pool.append(copy.deepcopy(self.current_agent))

        return self.current_agent

    def _play_episode(self, agent1, agent2):
        """Play one episode of agent1 vs agent2 and collect experience."""
        trajectory = []
        state = self.env.reset()
        done = False

        while not done:
            obs1 = state['agent1_obs']
            obs2 = state['agent2_obs']

            # Agent 1 action
            action1 = agent1.select_action(obs1)

            # Agent 2 action (opponent)
            action2 = agent2.select_action(obs2)

            # Step the environment
            state, reward, done = self.env.step(action1, action2)

            trajectory.append({
                'obs1': obs1,
                'obs2': obs2,
                'action1': action1,
                'action2': action2,
                'reward1': reward['agent1'],
                'reward2': reward['agent2']
            })

        return trajectory
```
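A rough usage sketch (here `SelfPlayAgent` and `CompetitiveEnv` are hypothetical classes providing `select_action`, `train_on_trajectory`, and the two-player `reset`/`step` API assumed above):

```python
# Hypothetical wiring of the trainer above; class names are placeholders.
env = CompetitiveEnv()
trainer = SelfPlayTrainer(agent_class=SelfPlayAgent, env=env, n_checkpoint_opponents=5)
champion = trainer.train(num_episodes=10_000)
```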
**Self-Play Benefits and Pitfalls**:

```
Benefits:
- Agents automatically improve together
- Robust to different opponent styles
- Emergent complexity (rock-paper-scissors dynamics)

Pitfalls:
- Agents might exploit specific weaknesses (not generalizable)
- Training unstable if pool too small
- Forgetting how to beat weaker opponents (catastrophic forgetting)
- Computational cost (need to evaluate multiple opponents)

Solution: Diverse opponent pool
- Keep varied historical versions
- Mix self-play with evaluation vs fixed benchmark
- Monitor for forgetting (test vs all opponents periodically)
```


## Part 16: Practical Implementation Considerations

### Observation Space Design

**Key consideration**: Partial vs full observability

```python
# Full Observability (not realistic but simplest)
observation = {
    'own_position': agent_pos,
    'all_agent_positions': [pos1, pos2, pos3],  # See everyone!
    'all_agent_velocities': [vel1, vel2, vel3],
    'targets': [target1, target2, target3]
}

# Partial Observability (more realistic, harder)
observation = {
    'own_position': agent_pos,
    'own_velocity': agent_vel,
    'target': own_target,
    'nearby_agents': agents_within_5m,  # Limited field of view
    # Note: don't see agents far away
}

# Consequence: With partial obs, agents must communicate or learn implicitly
# via environmental interaction (e.g., bumping into others)
```

### Reward Structure Design

**Critical for multi-agent learning**:
```python
# Cooperative game: shared reward
team_reward = 100 if goal_reached else 0
# Problem: ambiguous who contributed

# Cooperative game: mixed rewards (shared + individual)
team_reward = 100 if goal_reached else 0
individual_bonus = 5 if agent_i_did_critical_action else 0
total_reward_i = team_reward + individual_bonus  # incentivizes both

# Competitive game: zero-sum
reward_1 = goals_1 - goals_2
reward_2 = goals_2 - goals_1  # opposite

# Competitive game: individual scores
reward_1 = goals_1
reward_2 = goals_2
# Problem: agents don't care about each other (no implicit competition)

# Mixed: cooperation + competition (team sports)
reward_i = (10 if team_wins else 0) \
         + (1 if agent_i_scores else 0) \
         + 0.1 * team_score  # shared team success bonus
```
**Reward Design Pitfall**: Too much individual reward breaks cooperation

```
Example: 3v3 soccer
reward_i = +100 if agent_i scores (individual goal)
         +   5 if agent_i assists (passes to the scorer)
         +   0 if a teammate scores (not rewarded!)

Result:
Agent learns: "Only my goals matter, don't pass to teammates"
Agent hoards the ball, tries solo shots
Team coordination breaks
Lose to a coordinated opponent team

Solution: Include a team reward
reward_i = +100 if team wins
         +  10 if agent_i scores a goal
         +   2 if agent_i assists
```
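A minimal sketch of that fixed reward as code (the event flags are hypothetical inputs supplied by the environment):

```python
def soccer_reward(team_won: bool, scored: bool, assisted: bool) -> float:
    """Blend team success with individual contribution so passing still pays."""
    return (100.0 if team_won else 0.0) \
         + (10.0 if scored else 0.0) \
         + (2.0 if assisted else 0.0)
```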
## Summary: When to Use Multi-Agent RL

**Multi-agent RL is needed when**:

1. Multiple agents learning simultaneously in shared environment
2. Agent interactions cause non-stationarity
3. Coordination or credit assignment is non-trivial
4. Problem structure matches available algorithm (cooperative/competitive)

**Multi-agent RL is NOT needed when**:

1. Single learning agent (others are static)
2. Agents act independently (no true interaction)
3. Problem easily decomposes (use single-agent RL per agent)
4. Scalability to 100+ agents critical (MARL hard to scale)

**Key Algorithms**:

1. **QMIX**: Cooperative, value factorization, decentralized execution
2. **MADDPG**: Competitive/mixed, continuous actions, centralized critic
3. **MAPPO**: Cooperative, policy gradients, centralized training
4. **Self-Play**: Competitive, agents train vs historical versions
5. **Communication**: For partial observability, explicit coordination
6. **CTDE**: Paradigm enabling stable multi-agent learning

**Algorithm Selection Matrix**:

```
                     Cooperative       Competitive      Mixed
Discrete Action      QMIX              Nash-Q           Hybrid
Continuous Action    MAPPO/MADDPG      MADDPG           MADDPG
Partial Obs          +Comm             +Opponent Mod    +Both
Scalable             IQL (unstable)    IQL              IQL (unstable)
```

**Critical Success Factors**:

1. Match algorithm to problem structure (cooperative vs competitive)
2. Design reward to align with desired coordination
3. Use CTDE for stable training
4. Monitor for non-stationarity issues
5. Validate agents work independently during execution

Use this skill to understand multi-agent problem structure and select appropriate algorithms for coordination challenges.