# Multi-Agent Reinforcement Learning

## When to Use This Skill

Invoke this skill when you encounter:

- **Multiple Learners**: 2+ agents learning simultaneously in a shared environment
- **Coordination Problem**: Agents must coordinate to achieve goals
- **Non-Stationarity**: Other agents changing policies during training
- **CTDE Implementation**: Separating centralized training from decentralized execution
- **Value Factorization**: Credit assignment in cooperative multi-agent settings
- **QMIX Algorithm**: Learning cooperative Q-values with value factorization
- **MADDPG**: Multi-agent actor-critic with centralized critics
- **Communication**: Agents learning to communicate to improve coordination
- **Team Reward Ambiguity**: How to split a team reward fairly among agents
- **Cooperative vs Competitive**: Designing the reward structure for a multi-agent problem
- **Non-Stationarity Handling**: Dealing with other agents' policy changes
- **When Multi-Agent RL Is Needed**: Deciding if a problem requires MARL vs single-agent RL

**This skill teaches how to train multiple agents that learn simultaneously and must coordinate.**

Do NOT use this skill for:

- Single-agent RL (use rl-foundations, value-based-methods, policy-gradient-methods)
- Supervised multi-task learning (that's supervised learning)
- Simple parallel independent tasks (use single-agent RL in parallel)
- Pure game theory without learning (use game theory frameworks)

## Core Principle

**Multi-agent RL learns coordinated policies for multiple agents in a shared environment, solving the fundamental problem that environment non-stationarity from other agents' learning breaks standard RL convergence guarantees.**

The core insight: when other agents improve their policies, the environment changes. Value estimates computed under the assumption that other agents play their old policies become wrong once they switch to new ones.

```
Single-Agent RL:
1. Agent learns policy π
2. Environment is fixed
3. Agent value estimates Q(s,a) are stable
4. Algorithm converges to the optimal policy

Multi-Agent RL:
1. Agent 1 learns policy π_1
2. Agent 2 is also learning, changing π_2
3. The environment from Agent 1's perspective is non-stationary
4. Agent 1's value estimates become invalid when Agent 2 improves
5. Standard convergence guarantees are broken
6. Need special algorithms: QMIX, MADDPG, communication

Without addressing non-stationarity, multi-agent learning is unstable.
```

**Without understanding multi-agent problem structure and non-stationarity, you'll implement algorithms that fail to coordinate, suffer credit assignment disasters, or waste effort on agent conflicts instead of collaboration.**
## Part 1: Multi-Agent RL Fundamentals

### Why Multi-Agent RL Differs From Single-Agent

**Standard RL Assumption (Single-Agent)**:

- You have one agent
- Environment dynamics and reward function are fixed
- The agent's actions don't change the environment's structure
- Goal: Learn a policy that maximizes expected return

**Multi-Agent RL Reality**:

- Multiple agents act in a shared environment
- Each agent learns simultaneously
- When Agent 1 improves, Agent 2 sees a changed environment
- Reward depends on all agents' actions: R = R(a_1, a_2, ..., a_n)
- Non-stationarity: other agents' policies change constantly
- Convergence is ill-defined (what is "optimal" when others adapt?)

### Problem Types: Cooperative, Competitive, Mixed

**Cooperative Multi-Agent Problem**:

```
Definition: All agents share the same objective
Reward: R_team(a_1, a_2, ..., a_n) = same for all agents

Example - Robot Team Assembly:
- All robots get the same team reward
- +100 if assembly succeeds
- 0 if assembly fails
- All robots benefit from success equally

Characteristic:
- Agents don't conflict on goals
- Challenge: Credit assignment (who deserves credit?)
- Solution: Value factorization (QMIX, QPLEX)

Key Insight:
Cooperative doesn't mean agents see each other!
- Agents might have partial/no observation of others
- Still must coordinate for team success
- Factorization enables coordination without observation
```

**Competitive Multi-Agent Problem**:

```
Definition: Agents have opposite objectives (zero-sum)
Reward: R_i(a_1, ..., a_n) = -R_j(a_1, ..., a_n) for i≠j

Example - Chess, Poker, Soccer:
- Agent 1 tries to win
- Agent 2 tries to win
- One's gain is the other's loss
- R_1 + R_2 = 0 (zero-sum)

Characteristic:
- Agents are adversarial
- Challenge: Computing the best response to the opponent
- Solution: Nash equilibrium (MADDPG, self-play)

Key Insight:
In competitive games, agents must predict opponent strategies.
- Agent 1 assumes Agent 2 plays a best response
- Agent 2 assumes Agent 1 plays a best response
- Nash equilibrium = mutual best response
- No agent can improve unilaterally
```

**Mixed Multi-Agent Problem**:

```
Definition: Some cooperation, some competition
Reward: R_i(a_1, ..., a_n) contains both shared and individual terms

Example - Team Soccer (3v3):
- Blue team agents cooperate toward the same goal
- But blue vs red is competitive
- Blue agent reward:
    R_i = +10 if blue scores, -10 if red scores (team-based)
        + 1 if blue_i scores a goal (individual bonus)

Characteristic:
- Agents cooperate with teammates
- Agents compete with opponents
- Challenge: Balancing cooperation and competition
- Solution: Hybrid approaches using both cooperative and competitive algorithms

Key Insight:
Mixed scenarios are the most common in practice.
- Robot teams: cooperate internally, compete for resources
- Trading: multiple firms (cooperate via regulations, compete for profit)
- Multiplayer games: team-based (cooperate with allies, compete with enemies)
```
### Non-Stationarity: The Core Challenge

**What is Non-Stationarity?**

```
Stationarity: Environment dynamics P(s'|s,a) and rewards R(s,a) are fixed
Non-Stationarity: Dynamics/rewards change over time

In multi-agent RL, the environment seen by Agent 1 marginalizes over the
other agents' current policies:

    P_t(s' | s, a_1) = Σ_{a_2, ..., a_n} π_2^t(a_2|s) ... π_n^t(a_n|s) · P(s' | s, a_1, ..., a_n)

If another agent's policy changes:

    π_2^t ≠ π_2^{t+1}

then the effective transition dynamics change:

    P_t(s' | s, a_1) ≠ P_{t+1}(s' | s, a_1)

The environment is non-stationary!
```

**Why Non-Stationarity Breaks Standard RL**:

```python
# Single-agent Q-learning assumes:
# - the environment is fixed during learning
# - Q-values converge because the Bellman optimality operator has a fixed point

Q[s, a] += alpha * (r + gamma * max(Q[s_next, :]) - Q[s, a])

# In multi-agent settings with non-stationarity:
# - other agents improve their policies
# - the value of the next state depends on what the other agents will do
# - when other agents improve, those values shift
# - Q-values chase a moving target
# - no convergence guarantee
```

**Impact on Learning**:

```
Scenario: Two agents learning to navigate
Agent 1 learns: "If Agent 2 goes left, I go right"
Agent 1 builds value estimates based on this assumption

Agent 2 improves: "Actually, going right is better"
Now Agent 2 goes right (not left)
Agent 1's assumptions are invalid!
Agent 1's value estimates become wrong
Agent 1 must relearn

Agent 1 tries a new path based on its new estimates
Agent 2 sees Agent 1's change and adapts
Agent 2's estimates become wrong

Result: Chaotic learning, no convergence
```
## Part 2: Centralized Training, Decentralized Execution (CTDE)

### CTDE Paradigm

**Key Idea**: Use centralized information during training, decentralized information during execution.

```
Training Phase (Centralized):
- Trainer observes: o_1, o_2, ..., o_n (all agents' observations)
- Trainer observes: a_1, a_2, ..., a_n (all agents' actions)
- Trainer observes: R_team or R_1, R_2, ... (reward signals)
- Trainer can assign credit fairly
- Trainer can compute global value functions

Execution Phase (Decentralized):
- Agent 1 observes: o_1 only
- Agent 1 executes: π_1(a_1 | o_1)
- Agent 1 doesn't need to see other agents
- Each agent is independent during rollout
- Enables scalability and robustness
```

**Why CTDE Addresses Non-Stationarity**:

```
During training:
- The centralized trainer sees all information
- Can compute a joint value Q(s_1, ..., s_n, a_1, ..., a_n)
- Can factor it: Q_team = f(Q_1, Q_2, ..., Q_n) (QMIX)
- Can compute importance weights: who contributed most?

During execution:
- Decentralized agents only use their own observations
- Policies learned during centralized training work well
- No need for other agents' observations at runtime
- Robust to other agents' changes (the policy doesn't depend on their states)

Result:
- Training leverages global information for stability
- Execution is independent and scalable
- Mitigates non-stationarity: the training targets condition on everyone's actions
```

### CTDE in Practice

**Centralized Information Used in Training**:

```python
# During training, compute a global value function.
# Inputs: observations and actions of ALL agents.
# (combine, centralized_q_network and q_network_1..3 are assumed to be defined elsewhere.)
def compute_value_ctde(obs_1, obs_2, obs_3, act_1, act_2, act_3):
    # See everyone's observations
    global_state = combine(obs_1, obs_2, obs_3)

    # See everyone's actions
    joint_action = (act_1, act_2, act_3)

    # Compute a shared value with all information
    Q_shared = centralized_q_network(global_state, joint_action)

    # Individual Q-values (QMIX-style)
    Q_1 = q_network_1(obs_1, act_1)
    Q_2 = q_network_2(obs_2, act_2)
    Q_3 = q_network_3(obs_3, act_3)

    # Factorization: Q_team ≈ mixing_network(Q_1, Q_2, Q_3)
    # Each agent learns its contribution via the QMIX loss
    return Q_shared, (Q_1, Q_2, Q_3)
```

**Decentralized Execution**:

```python
# During execution, use only the agent's own observation
def execute_policy(agent_id, own_observation):
    # The agent only sees and uses its own observation
    action = policy_network(own_observation)

    # No access to other agents' observations
    # Doesn't need other agents' actions
    # Purely decentralized execution
    return action

# All agents execute in parallel:
# Agent 1: o_1 → a_1 (decentralized)
# Agent 2: o_2 → a_2 (decentralized)
# Agent 3: o_3 → a_3 (decentralized)
# Execution is independent!
```
## Part 3: QMIX - Value Factorization for Cooperative Teams

### QMIX: The Core Insight

**Problem**: In cooperative teams, how do you assign credit fairly?

```
Naive approach: Joint Q-value
Q_team(s, a_1, a_2, ..., a_n) = expected return from the joint action

Problem 1: Still doesn't assign individual credit
If Q_team = 100, how much did Agent 1 contribute?
Agent 1 might think: "I deserve 50%" (overconfident)
But Agent 1 might deserve only 10% (others did more)

Problem 2: Greedy action selection requires an argmax over the joint action
space, which grows exponentially with the number of agents and cannot be
done by each agent on its own.

Result: Agents learn wrong priorities, and execution can't be decentralized
```

**Solution: Value Factorization (QMIX)**

```
Key Assumption: Monotonicity
If improving Agent i's individual value Q_i improves the team value,
then each agent can safely maximize its own Q_i.

Mathematical form (QMIX condition):
    ∂Q_team / ∂Q_i ≥ 0   for every agent i

This yields the Individual-Global-Max (IGM) property:
    argmax over the joint action of Q_team
    = (argmax_{a_1} Q_1, argmax_{a_2} Q_2, ..., argmax_{a_n} Q_n)
so decentralized greedy action selection agrees with the joint optimum.

Concrete Implementation:
    Q_team(s, a_1, ..., a_n) = f(Q_1(o_1, a_1), Q_2(o_2, a_2), ..., Q_n(o_n, a_n); s)

Where:
- Q_i: Individual Q-network for agent i
- f: Monotonic mixing network (non-negative weights ensure monotonicity)

Monotonicity guarantee:
If Q_1 increases, Q_team cannot decrease (because f is monotonic)
```
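Below is a minimal sketch of how that monotonicity can be enforced in practice: a hypernetwork produces the mixing weights from the global state and passes them through an absolute value so they stay non-negative, mirroring the original QMIX mixer (dimensions and names here are illustrative).

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q-values into Q_team with non-negative weights,
    so dQ_team/dQ_i >= 0 (the QMIX monotonicity constraint)."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks: produce mixing weights/biases from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        self.n_agents = n_agents
        self.embed_dim = embed_dim

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        bs = agent_qs.size(0)
        # abs() keeps the mixing weights non-negative -> monotonic mixing
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_team = torch.bmm(hidden, w2) + b2
        return q_team.view(bs, 1)
```

If a module like this replaced `self.mixing_network` in the training sketch below (called as `mixing_network(q_values, state)`), the factorization would be genuinely monotonic rather than only approximately so.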
### QMIX Algorithm

**Architecture**:

```
Per-agent Q-networks (one per agent, typically recurrent):

    Agent 1:  o_1 ──► [LSTM] ──► Q_1(o_1, a_1)
    Agent 2:  o_2 ──► [LSTM] ──► Q_2(o_2, a_2)
    Agent 3:  o_3 ──► [LSTM] ──► Q_3(o_3, a_3)

Monotonic mixing network:

    (Q_1, Q_2, Q_3) ──► [mixing MLP] ──► Q_team

    A hypernetwork generates the mixing weights as a function of the global
    state; the weights are kept non-negative, which enforces monotonicity.

Value outputs: Q_1(o_1, a_1), Q_2(o_2, a_2), Q_3(o_3, a_3)
Mixing: Q_team = mixing_network(Q_1, Q_2, Q_3, state)
```

**QMIX Training**:
```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam


class QMIXAgent:
    def __init__(self, n_agents, state_dim, obs_dim, action_dim, hidden_dim=64):
        self.n_agents = n_agents
        self.action_dim = action_dim

        # Individual Q-networks (one per agent)
        self.q_networks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim + action_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1)  # Q-value for this action
            )
            for _ in range(n_agents)
        ])

        # Mixing network: takes individual Q-values (plus state) and produces the joint Q.
        # NOTE: a plain MLP does not guarantee monotonicity; full QMIX constrains the
        # mixing weights to be non-negative (see the hypernetwork sketch above).
        self.mixing_network = nn.Sequential(
            nn.Linear(n_agents + state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        # Hypernet: generates mixing-network weights as a function of the state
        # (in full QMIX these weights go through abs() to ensure monotonicity)
        self.hypernet = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim * (n_agents + state_dim))
        )

        self.optimizer = Adam(
            list(self.q_networks.parameters()) +
            list(self.mixing_network.parameters()) +
            list(self.hypernet.parameters()),
            lr=5e-4
        )

        self.discount = 0.99
        self.target_update_rate = 0.001
        self.epsilon = 0.05

        # Target networks (soft update)
        self._init_target_networks()

    def _init_target_networks(self):
        """Create target networks for stable learning."""
        self.target_q_networks = copy.deepcopy(self.q_networks)
        self.target_mixing_network = copy.deepcopy(self.mixing_network)

    def compute_individual_q_values(self, observations, actions):
        """
        Compute Q-values for each agent given their observation and action.

        Args:
            observations: list of n_agents observations (each [batch_size, obs_dim])
            actions: list of n_agents actions (each [batch_size, action_dim], one-hot)

        Returns:
            q_values: tensor [batch_size, n_agents]
        """
        q_values = []
        for i, (obs, act) in enumerate(zip(observations, actions)):
            # Concatenate observation and action
            q_input = torch.cat([obs, act], dim=-1)
            q_i = self.q_networks[i](q_input)
            q_values.append(q_i)

        return torch.cat(q_values, dim=-1)  # [batch_size, n_agents]

    def compute_joint_q_value(self, q_values, state):
        """
        Mix individual Q-values into a joint Q-value using the mixing network.

        Args:
            q_values: individual Q-values [batch_size, n_agents]
            state: global state [batch_size, state_dim]

        Returns:
            q_joint: joint Q-value [batch_size, 1]
        """
        # The mixing network learns to combine Q-values conditioned on the state
        # (monotonicity would be enforced by non-negative weight constraints).
        q_joint = self.mixing_network(torch.cat([q_values, state], dim=-1))
        return q_joint

    def train_step(self, batch, state_batch):
        """
        One QMIX training step.

        Batch contains:
            observations: list[n_agents] of [batch_size, obs_dim]
            actions: list[n_agents] of [batch_size, action_dim]
            rewards: [batch_size] (shared team reward)
            next_observations: list[n_agents] of [batch_size, obs_dim]
            dones: [batch_size]
        """
        observations, actions, rewards, next_observations, dones = batch

        # Compute current Q-values
        q_values = self.compute_individual_q_values(observations, actions)
        q_joint = self.compute_joint_q_value(q_values, state_batch)

        # Compute target Q-values
        with torch.no_grad():
            # In a full implementation: greedy next actions evaluated with the
            # TARGET networks; zeros are a placeholder to keep the sketch short.
            next_q_values = self.compute_individual_q_values(
                next_observations,
                [torch.zeros_like(a) for a in actions]  # best actions (simplified)
            )

            # Mix next Q-values (a full implementation uses the target mixer
            # and the next global state here)
            next_q_joint = self.compute_joint_q_value(next_q_values, state_batch)

            # TD target: the team gets a shared reward
            td_target = rewards.unsqueeze(-1) + (
                1 - dones.unsqueeze(-1)
            ) * self.discount * next_q_joint

        # QMIX loss
        qmix_loss = ((q_joint - td_target) ** 2).mean()

        self.optimizer.zero_grad()
        qmix_loss.backward()
        self.optimizer.step()

        # Soft update target networks
        self._soft_update_targets()

        return {'qmix_loss': qmix_loss.item()}

    def _soft_update_targets(self):
        """Soft update target networks toward the main networks."""
        for target, main in zip(self.target_q_networks, self.q_networks):
            for target_param, main_param in zip(target.parameters(), main.parameters()):
                target_param.data.copy_(
                    self.target_update_rate * main_param.data +
                    (1 - self.target_update_rate) * target_param.data
                )
        for target_param, main_param in zip(self.target_mixing_network.parameters(),
                                            self.mixing_network.parameters()):
            target_param.data.copy_(
                self.target_update_rate * main_param.data +
                (1 - self.target_update_rate) * target_param.data
            )

    def select_actions(self, observations):
        """
        Epsilon-greedy action selection (decentralized execution).
        Each agent selects its action independently.
        """
        actions = []
        for i, obs in enumerate(observations):
            with torch.no_grad():
                # Agent i evaluates all possible actions
                best_action = None
                best_q = -float('inf')

                for action in range(self.action_dim):
                    action_onehot = F.one_hot(
                        torch.tensor(action), num_classes=self.action_dim
                    ).float()
                    q_input = torch.cat([obs, action_onehot], dim=-1)
                    q_val = self.q_networks[i](q_input).item()

                    if q_val > best_q:
                        best_q = q_val
                        best_action = action

                # Epsilon-greedy exploration
                if torch.rand(1).item() < self.epsilon:
                    best_action = torch.randint(0, self.action_dim, (1,)).item()

            actions.append(best_action)

        return actions
```
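As a rough sketch of how this class would be driven (the environment API, its return shapes, and the replay-buffer sampling format are assumptions for illustration):

```python
# Hypothetical training loop for the QMIXAgent sketch above.
# `env` and `replay_buffer` follow an assumed API; shapes match the docstrings.
agent = QMIXAgent(n_agents=3, state_dim=24, obs_dim=10, action_dim=5)

for episode in range(1000):
    observations, state = env.reset()
    done = False
    while not done:
        actions = agent.select_actions(observations)           # decentralized
        next_observations, next_state, reward, done = env.step(actions)
        replay_buffer.add((observations, actions, reward, next_observations, done))
        observations, state = next_observations, next_state

    batch, state_batch = replay_buffer.sample(batch_size=32)   # centralized training data
    metrics = agent.train_step(batch, state_batch)
```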
**QMIX Key Concepts**:

1. **Monotonicity**: If an agent improves its action, the team value cannot decrease
2. **Value Factorization**: Q_team = f(Q_1, Q_2, ..., Q_n)
3. **Decentralized Execution**: Each agent uses only its own observation
4. **Centralized Training**: The trainer sees all Q-values and the global state

**When QMIX Works Well**:

- Fully observable or partially observable cooperative teams
- Sparse communication needs
- Fixed team membership
- Shared reward structure

**QMIX Limitations**:

- Assumes monotonicity (not all cooperative games satisfy this)
- Doesn't handle explicit communication
- Doesn't learn agent roles dynamically


## Part 4: MADDPG - Multi-Agent Actor-Critic

### MADDPG: For Competitive and Mixed Scenarios

**Core Idea**: Actor-critic, but with a centralized critic during training.

```
DDPG (single-agent):
- Actor π(a|s) learns the policy
- Critic Q(s,a) estimates value
- Critic trains the actor via the policy gradient

MADDPG (multi-agent):
- Each agent has an actor π_i(a_i|o_i)
- Each agent also has a centralized critic Q_i(s, a_1, ..., a_n) that sees all agents' actions
- During training: use the centralized critic for learning
- During execution: each agent uses only its own actor
```

**MADDPG Algorithm**:
```python
# (Reuses the imports from the QMIX example above: torch, nn, Adam, copy.)

class MADDPGAgent:
    def __init__(self, agent_id, n_agents, obs_dim, action_dim, state_dim, hidden_dim=256):
        self.agent_id = agent_id
        self.n_agents = n_agents
        self.action_dim = action_dim

        # Actor: learns the decentralized policy π_i(a_i|o_i)
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh()  # Continuous actions in [-1, 1]
        )

        # Critic: centralized value Q(s, a_1, ..., a_n)
        # Input: global state + all agents' actions
        self.critic = nn.Sequential(
            nn.Linear(state_dim + n_agents * action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Single value output
        )

        # Target networks for stability
        self.target_actor = copy.deepcopy(self.actor)
        self.target_critic = copy.deepcopy(self.critic)

        self.actor_optimizer = Adam(self.actor.parameters(), lr=1e-4)
        self.critic_optimizer = Adam(self.critic.parameters(), lr=1e-3)

        self.discount = 0.99
        self.tau = 0.01  # Soft update rate

    def train_step(self, batch, other_agents_actors, other_agents_target_actors):
        """
        MADDPG training step.

        Batch contains:
            observations: list[n_agents] of [batch_size, obs_dim]
            actions: list[n_agents] of [batch_size, action_dim]
            rewards: [batch_size] (agent-specific reward!)
            next_observations: list[n_agents] of [batch_size, obs_dim]
            global_state: [batch_size, state_dim]
            next_global_state: [batch_size, state_dim]
            dones: [batch_size]

        other_agents_actors / other_agents_target_actors: mappings from the other
        agents' ids to their (target) actor networks, so this agent's centralized
        critic can evaluate everyone's actions.
        """
        observations, actions, rewards, next_observations, \
            global_state, next_global_state, dones = batch

        agent_reward = rewards  # Agent-specific reward

        # Step 1: Critic update (centralized)
        with torch.no_grad():
            # Compute next actions using target actors
            next_actions = []
            for i, next_obs in enumerate(next_observations):
                if i == self.agent_id:
                    next_a = self.target_actor(next_obs)
                else:
                    # Use the other agents' target actors
                    next_a = other_agents_target_actors[i](next_obs)
                next_actions.append(next_a)

            # Concatenate all next actions
            next_actions_cat = torch.cat(next_actions, dim=-1)

            # Compute next value (centralized critic)
            next_q = self.target_critic(
                torch.cat([next_global_state, next_actions_cat], dim=-1)
            )

            # TD target
            td_target = agent_reward.unsqueeze(-1) + (
                1 - dones.unsqueeze(-1)
            ) * self.discount * next_q

        # Compute current Q-value
        current_actions_cat = torch.cat(actions, dim=-1)
        current_q = self.critic(
            torch.cat([global_state, current_actions_cat], dim=-1)
        )

        # Critic loss
        critic_loss = ((current_q - td_target) ** 2).mean()

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Step 2: Actor update (decentralized policy improvement)
        # The actor only uses its own observation
        policy_actions = []
        for i, obs in enumerate(observations):
            if i == self.agent_id:
                # Use the current actor for this agent
                action_i = self.actor(obs)
            else:
                # Other agents' actions are treated as fixed (no gradient through them)
                action_i = other_agents_actors[i](obs).detach()
            policy_actions.append(action_i)

        # Compute the Q-value under the current policy
        policy_actions_cat = torch.cat(policy_actions, dim=-1)
        policy_q = self.critic(
            torch.cat([global_state, policy_actions_cat], dim=-1)
        )

        # Policy gradient: maximize the Q-value
        actor_loss = -policy_q.mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft update target networks
        self._soft_update_targets()

        return {
            'critic_loss': critic_loss.item(),
            'actor_loss': actor_loss.item(),
            'avg_q_value': current_q.mean().item()
        }

    def _soft_update_targets(self):
        """Soft update target networks toward the main networks."""
        for target_param, main_param in zip(
            self.target_actor.parameters(),
            self.actor.parameters()
        ):
            target_param.data.copy_(
                self.tau * main_param.data + (1 - self.tau) * target_param.data
            )

        for target_param, main_param in zip(
            self.target_critic.parameters(),
            self.critic.parameters()
        ):
            target_param.data.copy_(
                self.tau * main_param.data + (1 - self.tau) * target_param.data
            )

    def select_action(self, observation):
        """Decentralized action selection."""
        with torch.no_grad():
            action = self.actor(observation)
            # Add exploration noise
            action = action + 0.1 * torch.randn_like(action)
            action = torch.clamp(action, -1, 1)
            return action.cpu().numpy()
```
**MADDPG Key Properties**:

1. **Centralized Critic**: Sees all agents' observations and actions
2. **Decentralized Actors**: Each agent uses only its own observation
3. **Agent-Specific Rewards**: Each agent maximizes its own reward
4. **Handles Competitive/Mixed**: Doesn't assume cooperation
5. **Continuous Actions**: Works well with continuous action spaces

**When MADDPG Works Well**:

- Competitive and mixed-motive scenarios
- Continuous action spaces
- Partial observability (agents don't see each other)
- Need for independent agent rewards


## Part 5: Communication in Multi-Agent Systems

### When and Why Communication Helps

**Problem Without Communication**:

```
Agents with partial observability:
Agent 1: sees position p_1, but NOT p_2
Agent 2: sees position p_2, but NOT p_1

Goal: Avoid collision while moving to targets

Without communication:
Agent 1: "I don't know where Agent 2 is"
Agent 2: "I don't know where Agent 1 is"

Both might move toward the same corridor
Collision: the agents had no way to coordinate!

With communication:
Agent 1: broadcasts "I'm moving left"
Agent 2: receives the message, moves right
No collision!
```

**Communication Trade-offs**:

```
Advantages:
- Enables coordination under partial observability
- Can solve some problems impossible without communication
- Explicit intention sharing

Disadvantages:
- Adds complexity: agents must learn what to communicate
- High variance: messages might mislead
- Computational overhead: processing all messages
- Communication bandwidth is limited in real systems

When to use communication:
- Partial observability prevents coordination
- Explicit roles (e.g., one agent is a "scout")
- Limited field of view, agents are out of sight
- Agents benefit from sharing intentions

When NOT to use communication:
- Full observability (agents see everything)
- Simple coordination (value factorization sufficient)
- Communication is unreliable
```

### CommNet: Learning Communication

**Idea**: Agents learn to send and receive messages to improve coordination.

```
Architecture:
1. Each agent processes its own observation: f_i(o_i) → hidden state h_i
2. Agent broadcasts its hidden state as a "message"
3. Agent receives messages from neighbors
4. Agent aggregates messages: Σ_j M(h_j) (mean pooling or attention)
5. Agent processes the aggregated information: policy π(a_i | h_i, aggregated)

Key: Agents learn what information to broadcast in h_i
Receiving agents learn which messages are useful
```

**Simple Communication Example**:
```python
class CommNetAgent(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()

        # Encoding network: observation → hidden message
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)  # Message to broadcast
        )

        # Communication aggregation (simplified attention)
        self.comm_processor = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),  # Own + received
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

        # Policy network
        self.policy = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def compute_message(self, observation):
        """Generate the message to broadcast to other agents."""
        return self.encoder(observation)

    def forward(self, observation, received_messages):
        """
        Process observation + received messages, output action.

        Args:
            observation: [obs_dim]
            received_messages: list of messages from neighbors

        Returns:
            action: [action_dim]
            my_message: [hidden_dim] message broadcast to others
        """
        # Generate own message
        my_message = self.encoder(observation)

        # Aggregate received messages (mean pooling)
        if received_messages:
            others_messages = torch.stack(received_messages).mean(dim=0)
        else:
            others_messages = torch.zeros_like(my_message)

        # Process aggregated communication
        combined = torch.cat([my_message, others_messages], dim=-1)
        hidden = self.comm_processor(combined)

        # Select action
        action = self.policy(hidden)
        return action, my_message
```
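A rough usage sketch of one communication round with three of these agents (the dimensions and two-round protocol are illustrative assumptions):

```python
import torch

# Hypothetical exchange among three CommNetAgent instances (illustrative only).
agents = [CommNetAgent(obs_dim=8, action_dim=4) for _ in range(3)]
observations = [torch.randn(8) for _ in range(3)]

# Round 1: every agent computes a message from its own observation.
messages = [agent.compute_message(obs) for agent, obs in zip(agents, observations)]

# Round 2: each agent acts on its observation plus the other agents' messages.
actions = []
for i, (agent, obs) in enumerate(zip(agents, observations)):
    received = [m for j, m in enumerate(messages) if j != i]
    action, _ = agent(obs, received)
    actions.append(action)
```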
**Communication Pitfall**: Agents can learn to send misleading messages!

```
Without careful design, agents can learn deceptive communication
(this mainly appears when rewards are individual or mixed, not fully shared):

Agent 1 learns: "If I claim I'm heading for resource A, Agent 2 will go to resource B"
Agent 1 broadcasts: "Heading to A", then actually grabs B first
Agent 2 trusts the message, detours toward B, and arrives too late
Agent 1 gets a higher individual reward (its deception worked)

Solution: Design communication carefully
- Fully shared rewards implicitly encourage truthful messages
- Use communication only when beneficial
- Monitor emergent communication protocols
```


## Part 6: Credit Assignment in Cooperative Teams

### Individual Reward vs Team Reward

**Problem**:

```
Scenario: 3-robot assembly team
Team reward: +100 if assembly succeeds, 0 if it fails

Individual Reward Design:
Option 1 - Split equally: each robot gets +33.33
    Problem: Robot 3 (insignificant) gets the same credit as Robot 1 (crucial)

Option 2 - Use agent contribution:
    Robot 1 (held piece): +60
    Robot 2 (guided insertion): +25
    Robot 3 (steadied base): +15
    Problem: How to compute the contributions? (requires complex analysis)

Option 3 - Use value factorization (QMIX):
    Team value = mixing_network(Q_1, Q_2, Q_3)
    Each robot learns its Q-value
    QMIX learns to weight Q-values by importance
    Result: Fair credit assignment via factorization
```

**QMIX Credit Assignment Mechanism**:

```
Training:
    Observe: robot_1 does action a_1, gets q_1
             robot_2 does action a_2, gets q_2
             robot_3 does action a_3, gets q_3
    Team gets reward r_team

Factorize: r_team ≈ mixing_network(q_1, q_2, q_3)
           ≈ w_1 * q_1 + w_2 * q_2 + w_3 * q_3 + bias
           (schematically: the real mixing is nonlinear and state-dependent)

Learn the weights w_i via the mixing network

If Robot 1 is crucial:
    the mixing network learns w_1 > w_2, w_3
    Robot 1 gets larger credit (w_1 * q_1 dominates)

If Robot 3 is redundant:
    the mixing network learns w_3 ≈ 0
    Robot 3 gets little credit

Result: Each robot learns a fair estimate of its contribution
```

**Value Decomposition Pitfall**: Agents can game the factorization!

```
Example: Learned mixing weights w = [0.9, 0.05, 0.05]

Agent 1 learns: "I must maximize q_1 (it has weight 0.9)"
Agent 1 tries: the action that maximizes its own q_1
Problem: q_1 is only a learned estimate; maximizing it myopically
         might not actually help the team!

Solution: Use proper credit assignment metrics
- Shapley values: game-theoretic approach to credit
- Counterfactual reasoning: what would have happened if the agent hadn't acted?
- Implicit credit (QMIX): let the factorization learn it emergently
```
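One concrete form of counterfactual reasoning is the counterfactual baseline used by COMA. A rough sketch, assuming a centralized critic `q_critic(state, joint_action)` and a dict-style joint action, both of which are illustrative:

```python
import torch

def counterfactual_advantage(q_critic, state, joint_action, agent_id, agent_action_probs):
    """COMA-style advantage: how much better was the chosen action than the
    agent's average action, holding everyone else's actions fixed?"""
    n_actions = agent_action_probs.shape[-1]
    chosen_q = q_critic(state, joint_action)

    # Counterfactual baseline: marginalize agent_id's action under its own policy.
    baseline = 0.0
    for a in range(n_actions):
        counterfactual = dict(joint_action)   # copy of the joint action
        counterfactual[agent_id] = a          # swap only this agent's action
        baseline = baseline + agent_action_probs[a] * q_critic(state, counterfactual)

    return chosen_q - baseline
```

A positive advantage means the agent's actual action helped the team more than its typical action would have, which is exactly the per-agent credit signal the factorization is trying to approximate implicitly.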
## Part 7: Common Multi-Agent RL Failure Modes

### Failure Mode 1: Non-Stationarity Instability

**Symptom**: Learning curves are erratic, no convergence.

```python
# Problem scenario: agents take turns doing independent updates
for episode in range(1000):
    # Agent 1 learns
    for t in range(steps):
        a_1 = agent_1.select_action(o_1)
        a_2 = agent_2.select_action(o_2)  # using its old policy!
        r, o_1_next, o_2_next = env.step(a_1, a_2)
        agent_1.update(a_1, r, o_1_next)

    # Agent 2 improves (the environment changes for Agent 1!)
    for t in range(steps):
        a_1 = agent_1.select_action(o_1)  # OLD value estimates
        a_2 = agent_2.select_action(o_2)  # NEW policy (Agent 2 improved)
        r, o_1_next, o_2_next = env.step(a_1, a_2)
        agent_2.update(a_2, r, o_2_next)

# Result: Agent 1's Q-values become invalid when Agent 2 improves.
# Learning is unstable and doesn't converge.
```

**Solution**: Use CTDE or opponent modeling

```python
# CTDE approach:
# During training, use global information to stabilize learning
trainer.observe(o_1, a_1, o_2, a_2, r)
# The trainer sees both agents' actions, so it can compute a stable target

# During execution:
agent_1.execute(o_1)  # decentralized: own observation only
agent_2.execute(o_2)  # decentralized: own observation only
```

### Failure Mode 2: Reward Ambiguity

**Symptom**: Agents don't improve, stuck at local optima.

```python
# Problem: multi-agent team, shared reward
total_reward = 50

# Distribution: who gets what?
# Agent 1 thinks: "I deserve 50" (overconfident)
# Agent 2 thinks: "I deserve 50" (overconfident)
# Agent 3 thinks: "I deserve 50" (overconfident)

# Each agent overestimates its importance
# Each agent learns selfishly (internal conflict)
# Team coordination breaks

# Result: team performance is worse than if the agents coordinated
```

**Solution**: Use value factorization

```python
# QMIX learns a fair decomposition
q_1, q_2, q_3 = compute_individual_values(a_1, a_2, a_3)
team_value = mixing_network(q_1, q_2, q_3)

# The mixing network learns each agent's importance:
# if Agent 2 is crucial, its effective weight grows.
# Training adjusts the mixing based on who actually helped.

# Result: fair credit, agents coordinate
```

### Failure Mode 3: Algorithm-Reward Mismatch

**Symptom**: Learning fails in specific problem types (cooperative/competitive).

```python
# Problem: using QMIX (cooperative) in a competitive setting
# Competitive game (agents have opposite rewards)

# QMIX assumes a shared reward (so monotonic mixing makes sense)
# But in a competitive game:
#   Q_1 high means Agent 1 is winning
#   Q_2 high means Agent 2 is winning (the opposite outcome!)
# Mixing them monotonically into one team value makes no sense
# Convergence fails

# Solution: use MADDPG (handles competitive settings)
# MADDPG doesn't assume monotonicity
# It works with individual rewards
# It handles competition naturally
```
## Part 8: When to Use Multi-Agent RL

### Problem Characteristics for MARL

**Use MARL when**:

```
1. Multiple simultaneous learners
   - Problem has 2+ agents learning
   - NOT just parallel tasks (that's single-agent x N)

2. Shared/interdependent environment
   - Agents' actions affect each other
   - One agent's action impacts other agents' rewards
   - True interaction (not independent MDPs)

3. Coordination is beneficial
   - Agents can improve by coordinating
   - Alternative: agents could act independently (inefficient)

4. Non-trivial communication/credit
   - Agents need to coordinate or assign credit
   - NOT trivial to decompose into independent subproblems
```

**Use Single-Agent RL when**:

```
1. Single learning agent (others are environment)
   - Example: one RL agent vs static rules-based opponents
   - Environment includes other agents, but they're not learning

2. Independent parallel tasks
   - Example: 10 robots, each with own goal, no interaction
   - Use single-agent RL x 10 (faster, simpler)

3. Fully decomposable problems
   - Example: multi-robot path planning (can use single-agent per robot)
   - Problem decomposes into independent subproblems

4. Scalability critical
   - Single-agent RL scales to huge teams
   - MARL harder to scale (centralized training bottleneck)
```

### Decision Tree

```
Problem: Multiple agents learning together?
    NO  → Use single-agent RL
    YES ↓

Problem: Agents' rewards interdependent?
    NO  → Use single-agent RL x N (parallel)
    YES ↓

Problem: Agents must coordinate?
    NO  → Use independent learning (but expect instability)
    YES ↓

Problem structure:
    COOPERATIVE → Use QMIX, MAPPO, QPLEX
    COMPETITIVE → Use MADDPG, self-play
    MIXED       → Use hybrid (cooperative + competitive algorithms)
```


## Part 9: Opponent Modeling in Competitive Settings

### Why Model Opponents?

**Problem Without Opponent Modeling**:

```
Agent 1 (using MADDPG) learns:
    "Move right gives Q=50"

But this assumes Agent 2 plays policy π_2

When Agent 2 improves to π'_2:
    "Move right gives Q=20" (because Agent 2 now blocks that path)

Agent 1's Q-value estimates become stale!
The environment has changed (the opponent improved)
```

**Solution: Opponent Modeling**
```python
class OpponentModelingAgent:
    def __init__(self, agent_id, n_agents, obs_dim, action_dim):
        self.agent_id = agent_id
        self.obs_dim = obs_dim
        self.action_dim = action_dim

        # Own actor and critic (architectures as in MADDPG; assumed defined elsewhere)
        self.actor = self._build_actor(obs_dim, action_dim)
        self.critic = self._build_critic()

        # Models of opponent policies (for the agents we compete against)
        self.opponent_models = {
            i: self._build_opponent_model() for i in range(n_agents) if i != agent_id
        }
        self.opponent_optimizers = {
            i: Adam(model.parameters(), lr=1e-3)
            for i, model in self.opponent_models.items()
        }

    def _build_opponent_model(self):
        """Model what an opponent will do given its observation."""
        return nn.Sequential(
            nn.Linear(self.obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, self.action_dim)
        )

    def train_step_with_opponent_modeling(self, batch):
        """
        Update own policy AND the opponent models.

        Key insight: predict what each opponent will do,
        then plan against those predictions.
        """
        observations, actions, rewards, next_observations = batch

        # Step 1: Update opponent models (supervised)
        # Predict each opponent's action from its observation
        opponent_losses = {}
        for opponent_id, model in self.opponent_models.items():
            predicted_action = model(observations[opponent_id])
            actual_action = actions[opponent_id]
            opponent_loss = ((predicted_action - actual_action) ** 2).mean()

            # Update this opponent's model
            self.opponent_optimizers[opponent_id].zero_grad()
            opponent_loss.backward()
            self.opponent_optimizers[opponent_id].step()
            opponent_losses[opponent_id] = opponent_loss.item()

        # Step 2: Plan against the opponent predictions
        predicted_opponent_actions = {
            i: model(observations[i])
            for i, model in self.opponent_models.items()
        }

        # Use the predictions in the MADDPG update:
        # - the critic sees: own obs + predicted opponent actions
        # - the actor learns: the best response given those predictions
        # (actor/critic update omitted here; see the MADDPG example above)

        return {'opponent_loss': sum(opponent_losses.values()) / len(opponent_losses)}
```
**Opponent Modeling Trade-offs**:

```
Advantages:
- Accounts for opponent improvements (non-stationarity)
- Enables planning ahead
- Reduces brittleness to opponent policy changes

Disadvantages:
- Requires learning opponent models (additional supervision)
- If the opponent model is wrong, the agent learns the wrong policy
- Computational overhead
- Assumes the opponent is predictable

When to use:
- Competitive settings with clear opponents
- Limited number of distinct opponents
- Opponents have consistent strategies

When NOT to use:
- Too many potential opponents
- Opponents are unpredictable
- Cooperative setting (waste of resources)
```


## Part 10: Advanced: Independent Q-Learning (IQL) for Multi-Agent

### IQL in Multi-Agent Settings

**Idea**: Each agent learns a Q-value using only its own rewards and observations.
```python
class IQLMultiAgent:
    def __init__(self, agent_id, obs_dim, action_dim):
        self.agent_id = agent_id
        self.action_dim = action_dim

        # Q-network for this agent only
        self.q_network = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

        self.optimizer = Adam(self.q_network.parameters(), lr=1e-3)

    def train_step(self, batch):
        """
        Independent Q-learning: each agent learns from its own reward only.

        Batch (for this agent): observations [batch, obs_dim],
        actions [batch, action_dim] (one-hot), rewards [batch],
        next_observations [batch, obs_dim].

        Problem: Non-stationarity
        - Other agents improve their policies
        - The environment from this agent's perspective changes
        - Q-values become invalid

        Benefit: Decentralized
        - No centralized training needed
        - Scalable to many agents
        """
        observations, actions, rewards, next_observations = batch
        batch_size = observations.shape[0]

        # Q-value update (standard Q-learning)
        with torch.no_grad():
            # Greedy next action (assume the agent acts greedily)
            next_q_values = []
            for action in range(self.action_dim):
                a_onehot = F.one_hot(
                    torch.full((batch_size,), action), num_classes=self.action_dim
                ).float()
                q_val = self.q_network(torch.cat([next_observations, a_onehot], dim=-1))
                next_q_values.append(q_val)

            max_next_q = torch.max(torch.stack(next_q_values), dim=0)[0]
            td_target = rewards.unsqueeze(-1) + 0.99 * max_next_q

        # Current Q-value
        q_pred = self.q_network(torch.cat([observations, actions], dim=-1))

        # TD loss
        loss = ((q_pred - td_target) ** 2).mean()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return {'loss': loss.item()}
```
**IQL in Multi-Agent: Pros and Cons**:

```
Advantages:
- Fully decentralized (scalable)
- No communication needed
- Simple implementation
- Works with partial observability

Disadvantages:
- Non-stationarity breaks convergence
- Agents chase moving targets (other agents improving)
- No explicit coordination
- Performance often poor without CTDE

Result:
- IQL works but is unstable in true multi-agent settings
- Better to use CTDE (QMIX, MADDPG) for stability
- IQL useful if centralized training impossible
```


## Part 11: Multi-Agent Experience Replay and Batch Sampling

### Challenges of Experience Replay in Multi-Agent

**Problem**:

```
In single-agent RL:
    Experience replay stores (s, a, r, s', d)
    Sample uniformly from the buffer
    Works well (iid samples)

In multi-agent RL:
    Experience replay stores (s, a_1, a_2, ..., a_n, r, s')
    But the other agents are non-stationary!

    A transition (s, a_1, a_2, r, s') is valid only if:
    - assumptions about other agents' policies still hold
    - if other agents improved, those assumptions are invalid

Solution: Prioritized experience replay for multi-agent
- Prioritize transitions where the agent's assumptions are likely correct
- Down-weight transitions from old policies (outdated assumptions)
- Focus on recent transitions (more relevant)
```

**Batch Sampling Strategy**:
```python
from collections import deque

import numpy as np


class MultiAgentReplayBuffer:
    def __init__(self, capacity=100000, n_agents=3):
        self.buffer = deque(maxlen=capacity)
        self.timestamps = deque(maxlen=capacity)
        self.n_agents = n_agents
        self.t = 0  # global insertion counter

    def add(self, transition):
        """Store experience: (observations, actions, rewards, next_observations, dones)."""
        self.buffer.append(transition)
        self.timestamps.append(self.t)
        self.t += 1

    def _compute_priorities(self):
        """Recency-based priority for the multi-agent setting.

        Newer transitions were generated under policies closer to the current
        ones, so their implicit assumptions about the other agents are less
        stale. (TD-error-based priorities are another common option.)
        """
        ages = self.t - np.array(self.timestamps)
        return 0.99 ** ages  # exponential decay with age

    def sample(self, batch_size):
        """Sample a recency-weighted batch (high priority = more likely)."""
        priorities = self._compute_priorities()
        probs = priorities / priorities.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices]
```
## Part 12: 10+ Critical Pitfalls

1. **Treating as independent agents**: Non-stationarity breaks convergence
2. **Giving equal reward to unequal contributors**: Credit assignment fails
3. **Forgetting decentralized execution**: Agents need independent policies
4. **Communicating too much**: High variance, bandwidth waste
5. **Using cooperative algorithm in competitive game**: Convergence fails
6. **Using competitive algorithm in cooperative game**: Agents conflict
7. **Not using CTDE**: Weak coordination, brittle policies
8. **Assuming other agents will converge**: Non-stationarity = moving targets
9. **Value overestimation in team settings**: Similar to offline RL issues
10. **Forgetting opponent modeling**: In competitive settings, must predict others
11. **Communication deception**: Agents learn to mislead for short-term gain
12. **Scalability (too many agents)**: MARL doesn't scale to 100+ agents
13. **Experience replay staleness**: Old transitions assume old opponent policies
14. **Ignoring observability constraints**: Partial obs needs communication or factorization
15. **Reward structure not matching algorithm**: Cooperative/competitive mismatch


## Part 13: 10+ Rationalization Patterns

Users often rationalize MARL mistakes:

1. **"Independent agents should work"**: Doesn't understand non-stationarity
2. **"My algorithm converged to something"**: Might be local optima due to credit ambiguity
3. **"Communication improved rewards"**: Might be learned deception, not coordination
4. **"QMIX should work everywhere"**: Doesn't check problem for monotonicity
5. **"More agents = more parallelism"**: Ignores centralized training bottleneck
6. **"Rewards are subjective anyway"**: Credit assignment is objective (factorization)
7. **"I'll just add more training"**: Non-stationarity can't be fixed by more epochs
8. **"Other agents are fixed"**: But they're learning too (environment is non-stationary)
9. **"Communication bandwidth doesn't matter"**: In real systems, it does
10. **"Nash equilibrium is always stable"**: No, it's just best-response equilibrium


## Part 14: MAPPO - Multi-Agent Proximal Policy Optimization

### When to Use MAPPO

**Cooperative teams with policy gradients**:
```python
class MAPPOAgent:
    def __init__(self, agent_id, obs_dim, action_dim, hidden_dim=256):
        self.agent_id = agent_id

        # Actor: policy for decentralized execution
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

        # Critic: value function. In full MAPPO the critic is centralized and takes
        # the global state during training; here it takes the agent's observation
        # to keep the sketch short.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        self.actor_optimizer = Adam(self.actor.parameters(), lr=3e-4)
        self.critic_optimizer = Adam(self.critic.parameters(), lr=1e-3)

    def train_step_on_batch(self, observations, actions, returns, advantages):
        """
        MAPPO training: advantage actor-critic with a policy-gradient update.

        actions: [batch, 1] long tensor of discrete action indices.

        Key differences from DDPG:
        - Policy gradient (not off-policy value learning)
        - Centralized training (returns/advantages computed with global information)
        - Decentralized execution (the policy uses only its own observation)

        Note: full MAPPO uses the PPO clipped ratio against the behavior policy;
        this sketch uses the plain policy-gradient form (see the clipped objective below).
        """
        # Actor loss (policy gradient)
        action_probs = torch.softmax(self.actor(observations), dim=-1)
        action_log_probs = torch.log(action_probs.gather(-1, actions))

        policy_loss = -(action_log_probs * advantages).mean()

        # Entropy regularization (exploration)
        entropy = -(action_probs * torch.log(action_probs + 1e-8)).sum(dim=-1).mean()
        actor_loss = policy_loss - 0.01 * entropy

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Critic loss (value estimation)
        values = self.critic(observations)
        critic_loss = ((values - returns) ** 2).mean()

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        return {
            'actor_loss': actor_loss.item(),
            'critic_loss': critic_loss.item(),
            'entropy': entropy.item()
        }
```
**MAPPO vs QMIX**:

```
QMIX:
- Value-based (discrete actions)
- Value factorization (credit assignment)
- Works with partial observability

MAPPO:
- Policy gradient-based
- Centralized critic (advantage estimation)
- On-policy (requires recent trajectories)

Use MAPPO when:
- Continuous or large discrete action spaces
- On-policy learning acceptable
- Value factorization not needed (reward structure simple)

Use QMIX when:
- Discrete actions
- Need explicit credit assignment
- Off-policy learning preferred
```
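For reference, the clipped surrogate objective that full MAPPO inherits from PPO looks roughly like this (a sketch; `old_log_probs` come from the policy that collected the data):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss used by PPO/MAPPO (written as a minimization)."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the more pessimistic of the clipped / unclipped objectives.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```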
## Part 15: Self-Play for Competitive Learning

### Self-Play Mechanism

**Problem**: Training competitive agents requires opponents.

```
Naive approach:
- Agent 1 trains vs a fixed opponent
- Problem: the fixed opponent doesn't adapt
- Agent 1 learns to exploit it (brittle against new opponents)

Self-play:
- Agent 1 trains vs historical versions of itself
- Agent 1 improves → creates a stronger opponent
- New Agent 1 trains vs the stronger Agent 1
- Cycle: both improve together
- Result: robust agent that beats all versions of itself
```

**Self-Play Implementation**:
```python
# (Uses numpy as np and copy, as imported in the earlier examples.)

class SelfPlayTrainer:
    def __init__(self, agent_class, env, n_checkpoint_opponents=5):
        self.current_agent = agent_class()
        self.env = env
        self.opponent_pool = []  # Keep historical versions
        self.n_checkpoints = n_checkpoint_opponents

    def train(self, num_episodes):
        """Train with self-play against previous versions."""
        for episode in range(num_episodes):
            # Select opponent: a copy of the current agent or a historical version
            if not self.opponent_pool or np.random.rand() < 0.5:
                opponent = copy.deepcopy(self.current_agent)
            else:
                opponent = self.opponent_pool[np.random.randint(len(self.opponent_pool))]

            # Play an episode: current_agent vs opponent
            trajectory = self._play_episode(self.current_agent, opponent)

            # Train the current agent on the trajectory
            self.current_agent.train_on_trajectory(trajectory)

            # Periodically add the current agent to the opponent pool
            if episode % max(1, num_episodes // self.n_checkpoints) == 0:
                self.opponent_pool.append(copy.deepcopy(self.current_agent))

        return self.current_agent

    def _play_episode(self, agent1, agent2):
        """Play one episode of agent1 vs agent2 and collect experience."""
        trajectory = []
        state = self.env.reset()
        done = False

        while not done:
            obs1 = state['agent1_obs']
            obs2 = state['agent2_obs']

            # Agent 1 action
            action1 = agent1.select_action(obs1)

            # Agent 2 action (opponent)
            action2 = agent2.select_action(obs2)

            # Step the environment
            state, reward, done = self.env.step(action1, action2)

            trajectory.append({
                'obs1': obs1,
                'obs2': obs2,
                'action1': action1,
                'action2': action2,
                'reward1': reward['agent1'],
                'reward2': reward['agent2']
            })

        return trajectory
```
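A rough usage sketch (here `SelfPlayAgent` and `CompetitiveEnv` are hypothetical classes providing `select_action`, `train_on_trajectory`, and the two-player `reset`/`step` API assumed above):

```python
# Hypothetical wiring of the trainer above; class names are placeholders.
env = CompetitiveEnv()
trainer = SelfPlayTrainer(agent_class=SelfPlayAgent, env=env, n_checkpoint_opponents=5)
champion = trainer.train(num_episodes=10_000)
```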
**Self-Play Benefits and Pitfalls**:

```
Benefits:
- Agents automatically improve together
- Robust to different opponent styles
- Emergent complexity (rock-paper-scissors dynamics)

Pitfalls:
- Agents might exploit specific weaknesses (not generalizable)
- Training unstable if pool too small
- Forgetting how to beat weaker opponents (catastrophic forgetting)
- Computational cost (need to evaluate multiple opponents)

Solution: Diverse opponent pool
- Keep varied historical versions
- Mix self-play with evaluation vs fixed benchmark
- Monitor for forgetting (test vs all opponents periodically)
```


## Part 16: Practical Implementation Considerations

### Observation Space Design

**Key consideration**: Partial vs full observability

```python
# Full Observability (not realistic but simplest)
observation = {
    'own_position': agent_pos,
    'all_agent_positions': [pos1, pos2, pos3],  # See everyone!
    'all_agent_velocities': [vel1, vel2, vel3],
    'targets': [target1, target2, target3]
}

# Partial Observability (more realistic, harder)
observation = {
    'own_position': agent_pos,
    'own_velocity': agent_vel,
    'target': own_target,
    'nearby_agents': agents_within_5m,  # Limited field of view
    # Note: don't see agents far away
}

# Consequence: With partial obs, agents must communicate or learn implicitly
# via environmental interaction (e.g., bumping into others)
```

### Reward Structure Design

**Critical for multi-agent learning**:
```python
# Cooperative game: shared reward
team_reward = 100 if goal_reached else 0
# Problem: ambiguous who contributed

# Cooperative game: mixed rewards (shared + individual)
team_reward = 100 if goal_reached else 0
individual_bonus = 5 if agent_i_did_critical_action else 0
total_reward_i = team_reward + individual_bonus  # incentivizes both

# Competitive game: zero-sum
reward_1 = goals_1 - goals_2
reward_2 = goals_2 - goals_1  # opposite

# Competitive game: individual scores
reward_1 = goals_1
reward_2 = goals_2
# Problem: agents don't care about each other (no implicit competition)

# Mixed: cooperation + competition (team sports)
reward_i = (10 if team_wins else 0) \
         + (1 if agent_i_scores else 0) \
         + 0.1 * team_score  # shared team success bonus
```
**Reward Design Pitfall**: Too much individual reward breaks cooperation

```
Example: 3v3 soccer
reward_i = +100 if agent_i scores (individual goal)
         +   5 if agent_i assists (passes to the scorer)
         +   0 if a teammate scores (not rewarded!)

Result:
Agent learns: "Only my goals matter, don't pass to teammates"
Agent hoards the ball, tries solo shots
Team coordination breaks
Lose to a coordinated opponent team

Solution: Include a team reward
reward_i = +100 if team wins
         +  10 if agent_i scores a goal
         +   2 if agent_i assists
```
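A minimal sketch of that fixed reward as code (the event flags are hypothetical inputs supplied by the environment):

```python
def soccer_reward(team_won: bool, scored: bool, assisted: bool) -> float:
    """Blend team success with individual contribution so passing still pays."""
    return (100.0 if team_won else 0.0) \
         + (10.0 if scored else 0.0) \
         + (2.0 if assisted else 0.0)
```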
## Summary: When to Use Multi-Agent RL

**Multi-agent RL is needed when**:

1. Multiple agents learning simultaneously in shared environment
2. Agent interactions cause non-stationarity
3. Coordination or credit assignment is non-trivial
4. Problem structure matches available algorithm (cooperative/competitive)

**Multi-agent RL is NOT needed when**:

1. Single learning agent (others are static)
2. Agents act independently (no true interaction)
3. Problem easily decomposes (use single-agent RL per agent)
4. Scalability to 100+ agents critical (MARL hard to scale)

**Key Algorithms**:

1. **QMIX**: Cooperative, value factorization, decentralized execution
2. **MADDPG**: Competitive/mixed, continuous actions, centralized critic
3. **MAPPO**: Cooperative, policy gradients, centralized training
4. **Self-Play**: Competitive, agents train vs historical versions
5. **Communication**: For partial observability, explicit coordination
6. **CTDE**: Paradigm enabling stable multi-agent learning

**Algorithm Selection Matrix**:

```
                     Cooperative       Competitive      Mixed
Discrete Action      QMIX              Nash-Q           Hybrid
Continuous Action    MAPPO/MADDPG      MADDPG           MADDPG
Partial Obs          +Comm             +Opponent Mod    +Both
Scalable             IQL (unstable)    IQL              IQL (unstable)
```

**Critical Success Factors**:

1. Match algorithm to problem structure (cooperative vs competitive)
2. Design reward to align with desired coordination
3. Use CTDE for stable training
4. Monitor for non-stationarity issues
5. Validate agents work independently during execution

Use this skill to understand multi-agent problem structure and select appropriate algorithms for coordination challenges.