skills/pufferlib/SKILL.md
|
||||
---
|
||||
name: pufferlib
|
||||
description: This skill should be used when working with reinforcement learning tasks including high-performance RL training, custom environment development, vectorized parallel simulation, multi-agent systems, or integration with existing RL environments (Gymnasium, PettingZoo, Atari, Procgen, etc.). Use this skill for implementing PPO training, creating PufferEnv environments, optimizing RL performance, or developing policies with CNNs/LSTMs.
|
||||
---
|
||||
|
||||
# PufferLib - High-Performance Reinforcement Learning
|
||||
|
||||
## Overview
|
||||
|
||||
PufferLib is a high-performance reinforcement learning library designed for fast parallel environment simulation and training. It achieves training at millions of steps per second through optimized vectorization, native multi-agent support, and efficient PPO implementation (PuffeRL). The library provides the Ocean suite of 20+ environments and seamless integration with Gymnasium, PettingZoo, and specialized RL frameworks.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- **Training RL agents** with PPO on any environment (single or multi-agent)
|
||||
- **Creating custom environments** using the PufferEnv API
|
||||
- **Optimizing performance** for parallel environment simulation (vectorization)
|
||||
- **Integrating existing environments** from Gymnasium, PettingZoo, Atari, Procgen, etc.
|
||||
- **Developing policies** with CNN, LSTM, or custom architectures
|
||||
- **Scaling RL** to millions of steps per second for faster experimentation
|
||||
- **Multi-agent RL** with native multi-agent environment support
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### 1. High-Performance Training (PuffeRL)
|
||||
|
||||
PuffeRL is PufferLib's optimized PPO+LSTM training algorithm, reaching roughly 1M-4M total steps per second with fast (C-based) environments and around 400k with pure-Python ones.
|
||||
|
||||
**Quick start training:**
|
||||
```bash
|
||||
# CLI training
|
||||
puffer train procgen-coinrun --train.device cuda --train.learning-rate 3e-4
|
||||
|
||||
# Distributed training
|
||||
torchrun --nproc_per_node=4 train.py
|
||||
```
|
||||
|
||||
**Python training loop:**
|
||||
```python
|
||||
import pufferlib
|
||||
from pufferlib import PuffeRL
|
||||
|
||||
# Create vectorized environment
|
||||
env = pufferlib.make('procgen-coinrun', num_envs=256)
|
||||
|
||||
# Create trainer
|
||||
trainer = PuffeRL(
|
||||
env=env,
|
||||
policy=my_policy,
|
||||
device='cuda',
|
||||
learning_rate=3e-4,
|
||||
batch_size=32768
|
||||
)
|
||||
|
||||
# Training loop
num_iterations = 1000  # set this to your compute budget
for iteration in range(num_iterations):
    trainer.evaluate()      # Collect rollouts
    trainer.train()         # Train on the collected batch
    trainer.mean_and_log()  # Aggregate and log metrics
|
||||
```
|
||||
|
||||
**For comprehensive training guidance**, read `references/training.md` for:
|
||||
- Complete training workflow and CLI options
|
||||
- Hyperparameter tuning with Protein
|
||||
- Distributed multi-GPU/multi-node training
|
||||
- Logger integration (Weights & Biases, Neptune)
|
||||
- Checkpointing and resume training
|
||||
- Performance optimization tips
|
||||
- Curriculum learning patterns
|
||||
|
||||
### 2. Environment Development (PufferEnv)
|
||||
|
||||
Create custom high-performance environments with the PufferEnv API.
|
||||
|
||||
**Basic environment structure:**
|
||||
```python
|
||||
import numpy as np
|
||||
from pufferlib import PufferEnv
|
||||
|
||||
class MyEnvironment(PufferEnv):
|
||||
def __init__(self, buf=None):
|
||||
super().__init__(buf)
|
||||
|
||||
# Define spaces
|
||||
self.observation_space = self.make_space((4,))
|
||||
self.action_space = self.make_discrete(4)
|
||||
|
||||
self.reset()
|
||||
|
||||
def reset(self):
|
||||
# Reset state and return initial observation
|
||||
return np.zeros(4, dtype=np.float32)
|
||||
|
||||
def step(self, action):
|
||||
# Execute action, compute reward, check done
|
||||
obs = self._get_observation()
|
||||
reward = self._compute_reward()
|
||||
done = self._is_done()
|
||||
info = {}
|
||||
|
||||
return obs, reward, done, info
|
||||
```
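
To sanity-check an environment like the sketch above before vectorizing it, a short random-action rollout is usually enough. This is a minimal smoke test, assuming the helper methods referenced in `step()` (`_get_observation`, `_compute_reward`, `_is_done`) are implemented:

```python
import numpy as np

env = MyEnvironment()
obs = env.reset()

for _ in range(100):
    action = np.random.randint(4)  # 4 discrete actions, as defined above
    obs, reward, done, info = env.step(action)
    assert np.isfinite(reward)
    if done:
        obs = env.reset()
```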
|
||||
|
||||
**Use the template script:** `scripts/env_template.py` provides complete single-agent and multi-agent environment templates with examples of:
|
||||
- Different observation space types (vector, image, dict)
|
||||
- Action space variations (discrete, continuous, multi-discrete)
|
||||
- Multi-agent environment structure
|
||||
- Testing utilities
|
||||
|
||||
**For complete environment development**, read `references/environments.md` for:
|
||||
- PufferEnv API details and in-place operation patterns
|
||||
- Observation and action space definitions
|
||||
- Multi-agent environment creation
|
||||
- Ocean suite (20+ pre-built environments)
|
||||
- Performance optimization (Python to C workflow)
|
||||
- Environment wrappers and best practices
|
||||
- Debugging and validation techniques
|
||||
|
||||
### 3. Vectorization and Performance
|
||||
|
||||
Achieve maximum throughput with optimized parallel simulation.
|
||||
|
||||
**Vectorization setup:**
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Automatic vectorization
|
||||
env = pufferlib.make('environment_name', num_envs=256, num_workers=8)
|
||||
|
||||
# Performance benchmarks:
|
||||
# - Pure Python envs: 100k-500k SPS
|
||||
# - C-based envs: 100M+ SPS
|
||||
# - With training: 400k-4M total SPS
|
||||
```
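
To verify which of the throughput ranges above applies to a given setup, measure steps per second directly. The sketch below is a rough estimate, assuming `env.action_space` is the per-environment space and `step()` accepts a batch of actions (the `num_envs` argument is just the value passed to `make`):

```python
import time
import numpy as np

def measure_sps(env, num_envs, num_steps=1_000):
    """Rough steps-per-second estimate for a vectorized environment."""
    env.reset()
    # Fixed action batch so the loop measures env stepping, not action sampling
    actions = np.stack([env.action_space.sample() for _ in range(num_envs)])
    start = time.time()
    for _ in range(num_steps):
        env.step(actions)
    return num_steps * num_envs / (time.time() - start)

# print(f"{measure_sps(env, num_envs=256):,.0f} steps/second")
```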
|
||||
|
||||
**Key optimizations:**
|
||||
- Shared memory buffers for zero-copy observation passing
|
||||
- Busy-wait flags instead of pipes/queues
|
||||
- Surplus environments for async returns
|
||||
- Multiple environments per worker
|
||||
|
||||
**For vectorization optimization**, read `references/vectorization.md` for:
|
||||
- Architecture and performance characteristics
|
||||
- Worker and batch size configuration
|
||||
- Serial vs multiprocessing vs async modes
|
||||
- Shared memory and zero-copy patterns
|
||||
- Hierarchical vectorization for large scale
|
||||
- Multi-agent vectorization strategies
|
||||
- Performance profiling and troubleshooting
|
||||
|
||||
### 4. Policy Development
|
||||
|
||||
Build policies as standard PyTorch modules with optional utilities.
|
||||
|
||||
**Basic policy structure:**
|
||||
```python
|
||||
import torch.nn as nn
|
||||
from pufferlib.pytorch import layer_init
|
||||
|
||||
class Policy(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
        # Encoder
        self.encoder = nn.Sequential(
            layer_init(nn.Linear(observation_space.shape[0], 256)),
            nn.ReLU(),
            layer_init(nn.Linear(256, 256)),
            nn.ReLU()
        )

        # Actor and critic heads
        self.actor = layer_init(nn.Linear(256, action_space.n), std=0.01)
        self.critic = layer_init(nn.Linear(256, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
features = self.encoder(observations)
|
||||
return self.actor(features), self.critic(features)
|
||||
```
|
||||
|
||||
**For complete policy development**, read `references/policies.md` for:
|
||||
- CNN policies for image observations
|
||||
- Recurrent policies with optimized LSTM (3x faster inference)
|
||||
- Multi-input policies for complex observations
|
||||
- Continuous action policies
|
||||
- Multi-agent policies (shared vs independent parameters)
|
||||
- Advanced architectures (attention, residual)
|
||||
- Observation normalization and gradient clipping
|
||||
- Policy debugging and testing
|
||||
|
||||
### 5. Environment Integration
|
||||
|
||||
Seamlessly integrate environments from popular RL frameworks.
|
||||
|
||||
**Gymnasium integration:**
|
||||
```python
|
||||
import gymnasium as gym
|
||||
import pufferlib
|
||||
|
||||
# Wrap Gymnasium environment
|
||||
gym_env = gym.make('CartPole-v1')
|
||||
env = pufferlib.emulate(gym_env, num_envs=256)
|
||||
|
||||
# Or use make directly
|
||||
env = pufferlib.make('gym-CartPole-v1', num_envs=256)
|
||||
```
|
||||
|
||||
**PettingZoo multi-agent:**
|
||||
```python
|
||||
# Multi-agent environment
|
||||
env = pufferlib.make('pettingzoo-knights-archers-zombies', num_envs=128)
|
||||
```
|
||||
|
||||
**Supported frameworks:**
|
||||
- Gymnasium / OpenAI Gym
|
||||
- PettingZoo (parallel and AEC)
|
||||
- Atari (ALE)
|
||||
- Procgen
|
||||
- NetHack / MiniHack
|
||||
- Minigrid
|
||||
- Neural MMO
|
||||
- Crafter
|
||||
- GPUDrive
|
||||
- MicroRTS
|
||||
- Griddly
|
||||
- And more...
|
||||
|
||||
**For integration details**, read `references/integration.md` for:
|
||||
- Complete integration examples for each framework
|
||||
- Custom wrappers (observation, reward, frame stacking, action repeat)
|
||||
- Space flattening and unflattening
|
||||
- Environment registration
|
||||
- Compatibility patterns
|
||||
- Performance considerations
|
||||
- Integration debugging
|
||||
|
||||
## Quick Start Workflow
|
||||
|
||||
### For Training Existing Environments
|
||||
|
||||
1. Choose environment from Ocean suite or compatible framework
|
||||
2. Use `scripts/train_template.py` as starting point
|
||||
3. Configure hyperparameters for your task
|
||||
4. Run training with CLI or Python script
|
||||
5. Monitor with Weights & Biases or Neptune
|
||||
6. Refer to `references/training.md` for optimization
|
||||
|
||||
### For Creating Custom Environments
|
||||
|
||||
1. Start with `scripts/env_template.py`
|
||||
2. Define observation and action spaces
|
||||
3. Implement `reset()` and `step()` methods
|
||||
4. Test environment locally
|
||||
5. Vectorize with `pufferlib.emulate()` or `make()`
|
||||
6. Refer to `references/environments.md` for advanced patterns
|
||||
7. Optimize with `references/vectorization.md` if needed
|
||||
|
||||
### For Policy Development
|
||||
|
||||
1. Choose architecture based on observations:
|
||||
- Vector observations → MLP policy
|
||||
- Image observations → CNN policy
|
||||
- Sequential tasks → LSTM policy
|
||||
- Complex observations → Multi-input policy
|
||||
2. Use `layer_init` for proper weight initialization
|
||||
3. Follow patterns in `references/policies.md`
|
||||
4. Test with environment before full training
|
||||
|
||||
### For Performance Optimization
|
||||
|
||||
1. Profile current throughput (steps per second)
|
||||
2. Check vectorization configuration (num_envs, num_workers)
|
||||
3. Optimize environment code (in-place ops, numpy vectorization)
|
||||
4. Consider C implementation for critical paths
|
||||
5. Use `references/vectorization.md` for systematic optimization
|
||||
|
||||
## Resources
|
||||
|
||||
### scripts/
|
||||
|
||||
**train_template.py** - Complete training script template with:
|
||||
- Environment creation and configuration
|
||||
- Policy initialization
|
||||
- Logger integration (WandB, Neptune)
|
||||
- Training loop with checkpointing
|
||||
- Command-line argument parsing
|
||||
- Multi-GPU distributed training setup
|
||||
|
||||
**env_template.py** - Environment implementation templates:
|
||||
- Single-agent PufferEnv example (grid world)
|
||||
- Multi-agent PufferEnv example (cooperative navigation)
|
||||
- Multiple observation/action space patterns
|
||||
- Testing utilities
|
||||
|
||||
### references/
|
||||
|
||||
**training.md** - Comprehensive training guide:
|
||||
- Training workflow and CLI options
|
||||
- Hyperparameter configuration
|
||||
- Distributed training (multi-GPU, multi-node)
|
||||
- Monitoring and logging
|
||||
- Checkpointing
|
||||
- Protein hyperparameter tuning
|
||||
- Performance optimization
|
||||
- Common training patterns
|
||||
- Troubleshooting
|
||||
|
||||
**environments.md** - Environment development guide:
|
||||
- PufferEnv API and characteristics
|
||||
- Observation and action spaces
|
||||
- Multi-agent environments
|
||||
- Ocean suite environments
|
||||
- Custom environment development workflow
|
||||
- Python to C optimization path
|
||||
- Third-party environment integration
|
||||
- Wrappers and best practices
|
||||
- Debugging
|
||||
|
||||
**vectorization.md** - Vectorization optimization:
|
||||
- Architecture and key optimizations
|
||||
- Vectorization modes (serial, multiprocessing, async)
|
||||
- Worker and batch configuration
|
||||
- Shared memory and zero-copy patterns
|
||||
- Advanced vectorization (hierarchical, custom)
|
||||
- Multi-agent vectorization
|
||||
- Performance monitoring and profiling
|
||||
- Troubleshooting and best practices
|
||||
|
||||
**policies.md** - Policy architecture guide:
|
||||
- Basic policy structure
|
||||
- CNN policies for images
|
||||
- LSTM policies with optimization
|
||||
- Multi-input policies
|
||||
- Continuous action policies
|
||||
- Multi-agent policies
|
||||
- Advanced architectures (attention, residual)
|
||||
- Observation processing and unflattening
|
||||
- Initialization and normalization
|
||||
- Debugging and testing
|
||||
|
||||
**integration.md** - Framework integration guide:
|
||||
- Gymnasium integration
|
||||
- PettingZoo integration (parallel and AEC)
|
||||
- Third-party environments (Procgen, NetHack, Minigrid, etc.)
|
||||
- Custom wrappers (observation, reward, frame stacking, etc.)
|
||||
- Space conversion and unflattening
|
||||
- Environment registration
|
||||
- Compatibility patterns
|
||||
- Performance considerations
|
||||
- Debugging integration
|
||||
|
||||
## Tips for Success
|
||||
|
||||
1. **Start simple**: Begin with Ocean environments or Gymnasium integration before creating custom environments
|
||||
|
||||
2. **Profile early**: Measure steps per second from the start to identify bottlenecks
|
||||
|
||||
3. **Use templates**: `scripts/train_template.py` and `scripts/env_template.py` provide solid starting points
|
||||
|
||||
4. **Read references as needed**: Each reference file is self-contained and focused on a specific capability
|
||||
|
||||
5. **Optimize progressively**: Start with Python, profile, then optimize critical paths with C if needed
|
||||
|
||||
6. **Leverage vectorization**: PufferLib's vectorization is key to achieving high throughput
|
||||
|
||||
7. **Monitor training**: Use WandB or Neptune to track experiments and identify issues early
|
||||
|
||||
8. **Test environments**: Validate environment logic before scaling up training
|
||||
|
||||
9. **Check existing environments**: Ocean suite provides 20+ pre-built environments
|
||||
|
||||
10. **Use proper initialization**: Always use `layer_init` from `pufferlib.pytorch` for policies
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Training on Standard Benchmarks
|
||||
```python
|
||||
# Atari
|
||||
env = pufferlib.make('atari-pong', num_envs=256)
|
||||
|
||||
# Procgen
|
||||
env = pufferlib.make('procgen-coinrun', num_envs=256)
|
||||
|
||||
# Minigrid
|
||||
env = pufferlib.make('minigrid-empty-8x8', num_envs=256)
|
||||
```
|
||||
|
||||
### Multi-Agent Learning
|
||||
```python
|
||||
# PettingZoo
|
||||
env = pufferlib.make('pettingzoo-pistonball', num_envs=128)
|
||||
|
||||
# Shared policy for all agents
# (create_policy is a placeholder for your own policy constructor)
policy = create_policy(env.observation_space, env.action_space)
|
||||
trainer = PuffeRL(env=env, policy=policy)
|
||||
```
|
||||
|
||||
### Custom Task Development
|
||||
```python
|
||||
# Create custom environment
|
||||
class MyTask(PufferEnv):
    ...  # implement __init__, reset(), and step() as shown above
|
||||
|
||||
# Vectorize and train
|
||||
env = pufferlib.emulate(MyTask, num_envs=256)
|
||||
trainer = PuffeRL(env=env, policy=my_policy)
|
||||
```
|
||||
|
||||
### High-Performance Optimization
|
||||
```python
|
||||
# Maximize throughput
|
||||
env = pufferlib.make(
|
||||
'my-env',
|
||||
num_envs=1024, # Large batch
|
||||
num_workers=16, # Many workers
|
||||
envs_per_worker=64 # Optimize per worker
|
||||
)
|
||||
```
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
uv pip install pufferlib
|
||||
```
|
||||
|
||||
## Documentation
|
||||
|
||||
- Official docs: https://puffer.ai/docs.html
|
||||
- GitHub: https://github.com/PufferAI/PufferLib
|
||||
- Discord: Community support available
skills/pufferlib/references/environments.md
|
||||
# PufferLib Environments Guide
|
||||
|
||||
## Overview
|
||||
|
||||
PufferLib provides the PufferEnv API for creating high-performance custom environments, and the Ocean suite containing 20+ pre-built environments. Environments support both single-agent and multi-agent scenarios with native vectorization.
|
||||
|
||||
## PufferEnv API
|
||||
|
||||
### Core Characteristics
|
||||
|
||||
PufferEnv is designed for performance through in-place operations:
|
||||
- Observations, actions, and rewards are initialized from a shared buffer object
|
||||
- All operations happen in-place to avoid creating and copying arrays
|
||||
- Native support for both single-agent and multi-agent environments
|
||||
- Flat observation/action spaces for efficient vectorization
|
||||
|
||||
### Creating a PufferEnv
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import pufferlib
|
||||
from pufferlib import PufferEnv
|
||||
|
||||
class MyEnvironment(PufferEnv):
|
||||
def __init__(self, buf=None):
|
||||
super().__init__(buf)
|
||||
|
||||
# Define observation and action spaces
|
||||
self.observation_space = self.make_space({
|
||||
'image': (84, 84, 3),
|
||||
'vector': (10,)
|
||||
})
|
||||
|
||||
self.action_space = self.make_discrete(4) # 4 discrete actions
|
||||
|
||||
# Initialize state
|
||||
self.reset()
|
||||
|
||||
def reset(self):
|
||||
"""Reset environment to initial state."""
|
||||
# Reset internal state
|
||||
self.agent_pos = np.array([0, 0])
|
||||
self.step_count = 0
|
||||
|
||||
# Return initial observation
|
||||
obs = {
|
||||
'image': np.zeros((84, 84, 3), dtype=np.uint8),
|
||||
'vector': np.zeros(10, dtype=np.float32)
|
||||
}
|
||||
|
||||
return obs
|
||||
|
||||
def step(self, action):
|
||||
"""Execute one environment step."""
|
||||
# Update state based on action
|
||||
self.step_count += 1
|
||||
|
||||
# Calculate reward
|
||||
reward = self._compute_reward()
|
||||
|
||||
# Check if episode is done
|
||||
done = self.step_count >= 1000
|
||||
|
||||
# Generate observation
|
||||
obs = self._get_observation()
|
||||
|
||||
# Additional info
|
||||
info = {'episode': {'r': reward, 'l': self.step_count}} if done else {}
|
||||
|
||||
return obs, reward, done, info
|
||||
|
||||
def _compute_reward(self):
|
||||
"""Compute reward for current state."""
|
||||
return 1.0
|
||||
|
||||
def _get_observation(self):
|
||||
"""Generate observation from current state."""
|
||||
return {
|
||||
'image': np.random.randint(0, 256, (84, 84, 3), dtype=np.uint8),
|
||||
'vector': np.random.randn(10).astype(np.float32)
|
||||
}
|
||||
```
|
||||
|
||||
### Observation Spaces
|
||||
|
||||
#### Discrete Spaces
|
||||
|
||||
```python
|
||||
# Single discrete value
|
||||
self.observation_space = self.make_discrete(10) # Values 0-9
|
||||
|
||||
# Dict with discrete values
|
||||
self.observation_space = self.make_space({
|
||||
'position': (1,), # Continuous
|
||||
'type': self.make_discrete(5) # Discrete
|
||||
})
|
||||
```
|
||||
|
||||
#### Continuous Spaces
|
||||
|
||||
```python
|
||||
# Box space (continuous)
|
||||
self.observation_space = self.make_space({
|
||||
'image': (84, 84, 3), # Image
|
||||
'vector': (10,), # Vector
|
||||
'scalar': (1,) # Single value
|
||||
})
|
||||
```
|
||||
|
||||
#### Multi-Discrete Spaces
|
||||
|
||||
```python
|
||||
# Multiple discrete values
|
||||
self.observation_space = self.make_multi_discrete([3, 5, 2]) # 3 values, 5 values, 2 values
|
||||
```
|
||||
|
||||
### Action Spaces
|
||||
|
||||
```python
|
||||
# Discrete actions
|
||||
self.action_space = self.make_discrete(4) # 4 actions: 0, 1, 2, 3
|
||||
|
||||
# Continuous actions
|
||||
self.action_space = self.make_space((3,)) # 3D continuous action
|
||||
|
||||
# Multi-discrete actions
|
||||
self.action_space = self.make_multi_discrete([3, 3]) # Two 3-way discrete choices
|
||||
```
|
||||
|
||||
## Multi-Agent Environments
|
||||
|
||||
PufferLib has native multi-agent support, treating single-agent and multi-agent environments uniformly.
|
||||
|
||||
### Multi-Agent PufferEnv
|
||||
|
||||
```python
|
||||
class MultiAgentEnv(PufferEnv):
|
||||
def __init__(self, num_agents=4, buf=None):
|
||||
super().__init__(buf)
|
||||
|
||||
self.num_agents = num_agents
|
||||
|
||||
# Per-agent observation space
|
||||
self.single_observation_space = self.make_space({
|
||||
'position': (2,),
|
||||
'velocity': (2,),
|
||||
'global': (10,)
|
||||
})
|
||||
|
||||
# Per-agent action space
|
||||
self.single_action_space = self.make_discrete(5)
|
||||
|
||||
self.reset()
|
||||
|
||||
def reset(self):
|
||||
"""Reset all agents."""
|
||||
self.agents = {f'agent_{i}': Agent(i) for i in range(self.num_agents)}
|
||||
|
||||
# Return observations for all agents
|
||||
return {
|
||||
agent_id: self._get_obs(agent)
|
||||
for agent_id, agent in self.agents.items()
|
||||
}
|
||||
|
||||
def step(self, actions):
|
||||
"""Step all agents."""
|
||||
# actions is a dict: {agent_id: action}
|
||||
observations = {}
|
||||
rewards = {}
|
||||
dones = {}
|
||||
infos = {}
|
||||
|
||||
for agent_id, action in actions.items():
|
||||
agent = self.agents[agent_id]
|
||||
|
||||
# Update agent
|
||||
agent.update(action)
|
||||
|
||||
# Generate results
|
||||
observations[agent_id] = self._get_obs(agent)
|
||||
rewards[agent_id] = self._compute_reward(agent)
|
||||
dones[agent_id] = agent.is_done()
|
||||
infos[agent_id] = {}
|
||||
|
||||
# Check for global done condition
|
||||
dones['__all__'] = all(dones.values())
|
||||
|
||||
return observations, rewards, dones, infos
|
||||
```
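
A quick way to exercise the multi-agent sketch above is a random rollout that feeds every agent an action each step. This assumes `Agent`, `_get_obs`, and `_compute_reward` are implemented:

```python
import numpy as np

env = MultiAgentEnv(num_agents=4)
obs = env.reset()

for _ in range(100):
    actions = {agent_id: np.random.randint(5) for agent_id in obs}  # 5 discrete actions per agent
    obs, rewards, dones, infos = env.step(actions)
    if dones['__all__']:
        obs = env.reset()
```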
|
||||
|
||||
## Ocean Environment Suite
|
||||
|
||||
PufferLib provides the Ocean suite with 20+ pre-built environments:
|
||||
|
||||
### Available Environments
|
||||
|
||||
#### Arcade Games
|
||||
- **Atari**: Classic Atari 2600 games via Arcade Learning Environment
|
||||
- **Procgen**: Procedurally generated games for generalization testing
|
||||
|
||||
#### Grid-Based
|
||||
- **Minigrid**: Partially observable gridworld environments
|
||||
- **Crafter**: Open-ended survival crafting game
|
||||
- **NetHack**: Classic roguelike dungeon crawler
|
||||
- **MiniHack**: Simplified NetHack variants
|
||||
|
||||
#### Multi-Agent
|
||||
- **PettingZoo**: Multi-agent environment suite (including Butterfly)
|
||||
- **MAgent**: Large-scale multi-agent scenarios
|
||||
- **Neural MMO**: Massively multi-agent survival game
|
||||
|
||||
#### Specialized
|
||||
- **Pokemon Red**: Classic Pokemon game environment
|
||||
- **GPUDrive**: High-performance driving simulator
|
||||
- **Griddly**: Grid-based game engine
|
||||
- **MicroRTS**: Real-time strategy game
|
||||
|
||||
### Using Ocean Environments
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Make environment
|
||||
env = pufferlib.make('procgen-coinrun', num_envs=256)
|
||||
|
||||
# With custom configuration
|
||||
env = pufferlib.make(
|
||||
'atari-pong',
|
||||
num_envs=128,
|
||||
frameskip=4,
|
||||
framestack=4
|
||||
)
|
||||
|
||||
# Multi-agent environment
|
||||
env = pufferlib.make('pettingzoo-knights-archers-zombies', num_agents=4)
|
||||
```
|
||||
|
||||
## Custom Environment Development
|
||||
|
||||
### Development Workflow
|
||||
|
||||
1. **Prototype in Python**: Start with pure Python PufferEnv
|
||||
2. **Optimize Critical Paths**: Identify bottlenecks
|
||||
3. **Implement in C**: Rewrite performance-critical code in C
|
||||
4. **Create Bindings**: Use Python C API
|
||||
5. **Compile**: Build as extension module
|
||||
6. **Register**: Add to Ocean suite
|
||||
|
||||
### Performance Benchmarks
|
||||
|
||||
- **Pure Python**: 100k-500k steps/second
|
||||
- **C Implementation**: 100M+ steps/second
|
||||
- **Training with Python env**: ~400k total SPS
|
||||
- **Training with C env**: ~4M total SPS
|
||||
|
||||
### Python Optimization Tips
|
||||
|
||||
```python
|
||||
# Use NumPy operations instead of Python loops
|
||||
# Bad
|
||||
for i in range(len(array)):
|
||||
array[i] = array[i] * 2
|
||||
|
||||
# Good
|
||||
array *= 2
|
||||
|
||||
# Pre-allocate arrays instead of appending
|
||||
# Bad
|
||||
observations = []
|
||||
for i in range(n):
|
||||
observations.append(generate_obs())
|
||||
|
||||
# Good
|
||||
observations = np.empty((n, obs_shape), dtype=np.float32)
|
||||
for i in range(n):
|
||||
observations[i] = generate_obs()
|
||||
|
||||
# Use in-place operations
|
||||
# Bad
|
||||
new_state = state + delta
|
||||
|
||||
# Good
|
||||
state += delta
|
||||
```
|
||||
|
||||
### C Extension Example
|
||||
|
||||
```c
|
||||
// my_env.c
|
||||
#include <Python.h>
|
||||
#include <numpy/arrayobject.h>
|
||||
|
||||
// Fast environment step implementation
|
||||
static PyObject* fast_step(PyObject* self, PyObject* args) {
|
||||
PyArrayObject* state;
|
||||
int action;
|
||||
|
||||
if (!PyArg_ParseTuple(args, "O!i", &PyArray_Type, &state, &action)) {
|
||||
return NULL;
|
||||
}
|
||||
|
||||
    // High-performance C implementation (elided) fills in these outputs
    PyObject* obs = NULL;  // e.g. a NumPy array built from the updated state
    float reward = 0.0f;
    int done = 0;
    // ...

    return Py_BuildValue("Ofi", obs, reward, done);
|
||||
}
|
||||
|
||||
static PyMethodDef methods[] = {
|
||||
{"fast_step", fast_step, METH_VARARGS, "Fast environment step"},
|
||||
{NULL, NULL, 0, NULL}
|
||||
};
|
||||
|
||||
static struct PyModuleDef module = {
|
||||
PyModuleDef_HEAD_INIT,
|
||||
"my_env_c",
|
||||
NULL,
|
||||
-1,
|
||||
methods
|
||||
};
|
||||
|
||||
PyMODINIT_FUNC PyInit_my_env_c(void) {
|
||||
import_array();
|
||||
return PyModule_Create(&module);
|
||||
}
|
||||
```
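
To compile the extension above, a standard setuptools build is one option. A minimal sketch, assuming the C source is saved as `my_env.c` next to this script:

```python
# setup.py
import numpy as np
from setuptools import setup, Extension

setup(
    name='my_env_c',
    ext_modules=[
        Extension(
            'my_env_c',
            sources=['my_env.c'],
            include_dirs=[np.get_include()],  # for numpy/arrayobject.h
            extra_compile_args=['-O3'],
        )
    ],
)
```

Build in place with `python setup.py build_ext --inplace`, then `import my_env_c` from the Python environment class.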
|
||||
|
||||
## Third-Party Environment Integration
|
||||
|
||||
### Gymnasium Environments
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
import pufferlib
|
||||
|
||||
# Wrap Gymnasium environment
|
||||
gym_env = gym.make('CartPole-v1')
|
||||
puffer_env = pufferlib.emulate(gym_env, num_envs=256)
|
||||
|
||||
# Or use make directly
|
||||
env = pufferlib.make('gym-CartPole-v1', num_envs=256)
|
||||
```
|
||||
|
||||
### PettingZoo Environments
|
||||
|
||||
```python
|
||||
from pettingzoo.butterfly import pistonball_v6
|
||||
import pufferlib
|
||||
|
||||
# Wrap PettingZoo environment
|
||||
pz_env = pistonball_v6.env()
|
||||
puffer_env = pufferlib.emulate(pz_env, num_envs=128)
|
||||
|
||||
# Or use make directly
|
||||
env = pufferlib.make('pettingzoo-pistonball', num_envs=128)
|
||||
```
|
||||
|
||||
### Custom Wrappers
|
||||
|
||||
```python
|
||||
class CustomWrapper(pufferlib.PufferEnv):
|
||||
"""Wrapper to modify environment behavior."""
|
||||
|
||||
def __init__(self, base_env, buf=None):
|
||||
super().__init__(buf)
|
||||
self.base_env = base_env
|
||||
self.observation_space = base_env.observation_space
|
||||
self.action_space = base_env.action_space
|
||||
|
||||
def reset(self):
|
||||
obs = self.base_env.reset()
|
||||
# Modify observation
|
||||
return self._process_obs(obs)
|
||||
|
||||
def step(self, action):
|
||||
# Modify action
|
||||
modified_action = self._process_action(action)
|
||||
|
||||
obs, reward, done, info = self.base_env.step(modified_action)
|
||||
|
||||
# Modify outputs
|
||||
obs = self._process_obs(obs)
|
||||
reward = self._process_reward(reward)
|
||||
|
||||
return obs, reward, done, info
|
||||
```
|
||||
|
||||
## Environment Best Practices
|
||||
|
||||
### State Management
|
||||
|
||||
```python
|
||||
# Store minimal state, compute on demand
|
||||
class EfficientEnv(PufferEnv):
|
||||
def __init__(self, buf=None):
|
||||
super().__init__(buf)
|
||||
self.agent_pos = np.zeros(2) # Minimal state
|
||||
|
||||
def _get_observation(self):
|
||||
# Compute full observation on demand
|
||||
observation = np.zeros((84, 84, 3), dtype=np.uint8)
|
||||
self._render_scene(observation, self.agent_pos)
|
||||
return observation
|
||||
```
|
||||
|
||||
### Reward Scaling
|
||||
|
||||
```python
|
||||
# Normalize rewards to reasonable range
|
||||
def step(self, action):
|
||||
# ... environment logic ...
|
||||
|
||||
# Scale large rewards
|
||||
raw_reward = compute_raw_reward()
|
||||
reward = np.clip(raw_reward / 100.0, -10, 10)
|
||||
|
||||
return obs, reward, done, info
|
||||
```
|
||||
|
||||
### Episode Termination
|
||||
|
||||
```python
|
||||
def step(self, action):
|
||||
# ... environment logic ...
|
||||
|
||||
# Multiple termination conditions
|
||||
timeout = self.step_count >= self.max_steps
|
||||
success = self._check_success()
|
||||
failure = self._check_failure()
|
||||
|
||||
done = timeout or success or failure
|
||||
|
||||
info = {
|
||||
'TimeLimit.truncated': timeout,
|
||||
'success': success
|
||||
}
|
||||
|
||||
return obs, reward, done, info
|
||||
```
|
||||
|
||||
### Memory Efficiency
|
||||
|
||||
```python
|
||||
# Reuse buffers instead of allocating new ones
|
||||
class MemoryEfficientEnv(PufferEnv):
|
||||
def __init__(self, buf=None):
|
||||
super().__init__(buf)
|
||||
|
||||
# Pre-allocate observation buffer
|
||||
self._obs_buffer = np.zeros((84, 84, 3), dtype=np.uint8)
|
||||
|
||||
def _get_observation(self):
|
||||
# Reuse buffer, modify in place
|
||||
self._render_scene(self._obs_buffer)
|
||||
return self._obs_buffer # Return view, not copy
|
||||
```
|
||||
|
||||
## Debugging Environments
|
||||
|
||||
### Validation Checks
|
||||
|
||||
```python
|
||||
# Add assertions to catch bugs
|
||||
def step(self, action):
|
||||
assert self.action_space.contains(action), f"Invalid action: {action}"
|
||||
|
||||
obs, reward, done, info = self._step_impl(action)
|
||||
|
||||
assert self.observation_space.contains(obs), "Invalid observation"
|
||||
assert np.isfinite(reward), "Non-finite reward"
|
||||
|
||||
return obs, reward, done, info
|
||||
```
|
||||
|
||||
### Rendering
|
||||
|
||||
```python
|
||||
class DebuggableEnv(PufferEnv):
|
||||
def __init__(self, buf=None, render_mode=None):
|
||||
super().__init__(buf)
|
||||
self.render_mode = render_mode
|
||||
|
||||
def render(self):
|
||||
"""Render environment for debugging."""
|
||||
if self.render_mode == 'human':
|
||||
# Display to screen
|
||||
self._display_scene()
|
||||
elif self.render_mode == 'rgb_array':
|
||||
# Return image
|
||||
return self._render_to_array()
|
||||
```
|
||||
|
||||
### Logging
|
||||
|
||||
```python
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def step(self, action):
|
||||
logger.debug(f"Step {self.step_count}: action={action}")
|
||||
|
||||
obs, reward, done, info = self._step_impl(action)
|
||||
|
||||
if done:
|
||||
logger.info(f"Episode finished: reward={self.total_reward}")
|
||||
|
||||
return obs, reward, done, info
|
||||
```
skills/pufferlib/references/integration.md
|
||||
# PufferLib Integration Guide
|
||||
|
||||
## Overview
|
||||
|
||||
PufferLib provides an emulation layer that enables seamless integration with popular RL frameworks including Gymnasium, OpenAI Gym, PettingZoo, and many specialized environment libraries. The emulation layer flattens observation and action spaces for efficient vectorization while maintaining compatibility.
|
||||
|
||||
## Gymnasium Integration
|
||||
|
||||
### Basic Gymnasium Environments
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
import pufferlib
|
||||
|
||||
# Method 1: Direct wrapping
|
||||
gym_env = gym.make('CartPole-v1')
|
||||
puffer_env = pufferlib.emulate(gym_env, num_envs=256)
|
||||
|
||||
# Method 2: Using make
|
||||
env = pufferlib.make('gym-CartPole-v1', num_envs=256)
|
||||
|
||||
# Method 3: Custom Gymnasium environment
|
||||
class MyGymEnv(gym.Env):
|
||||
def __init__(self):
|
||||
self.observation_space = gym.spaces.Box(low=-1, high=1, shape=(4,))
|
||||
self.action_space = gym.spaces.Discrete(2)
|
||||
|
||||
def reset(self, seed=None, options=None):
|
||||
super().reset(seed=seed)
|
||||
return self.observation_space.sample(), {}
|
||||
|
||||
def step(self, action):
|
||||
obs = self.observation_space.sample()
|
||||
reward = 1.0
|
||||
terminated = False
|
||||
truncated = False
|
||||
info = {}
|
||||
return obs, reward, terminated, truncated, info
|
||||
|
||||
# Wrap custom environment
|
||||
puffer_env = pufferlib.emulate(MyGymEnv, num_envs=128)
|
||||
```
|
||||
|
||||
### Atari Environments
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
from gymnasium.wrappers import AtariPreprocessing, FrameStack
|
||||
import pufferlib
|
||||
|
||||
# Standard Atari setup
|
||||
def make_atari_env(env_name='ALE/Pong-v5'):
|
||||
env = gym.make(env_name)
|
||||
env = AtariPreprocessing(env, frame_skip=4)
|
||||
env = FrameStack(env, num_stack=4)
|
||||
return env
|
||||
|
||||
# Vectorize with PufferLib
|
||||
env = pufferlib.emulate(make_atari_env, num_envs=256)
|
||||
|
||||
# Or use built-in
|
||||
env = pufferlib.make('atari-pong', num_envs=256, frameskip=4, framestack=4)
|
||||
```
|
||||
|
||||
### Complex Observation Spaces
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
from gymnasium.spaces import Dict, Box, Discrete
|
||||
import pufferlib
|
||||
|
||||
class ComplexObsEnv(gym.Env):
|
||||
def __init__(self):
|
||||
# Dict observation space
|
||||
self.observation_space = Dict({
|
||||
'image': Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8),
|
||||
'vector': Box(low=-np.inf, high=np.inf, shape=(10,), dtype=np.float32),
|
||||
'discrete': Discrete(5)
|
||||
})
|
||||
self.action_space = Discrete(4)
|
||||
|
||||
def reset(self, seed=None, options=None):
|
||||
return {
|
||||
'image': np.zeros((84, 84, 3), dtype=np.uint8),
|
||||
'vector': np.zeros(10, dtype=np.float32),
|
||||
'discrete': 0
|
||||
}, {}
|
||||
|
||||
def step(self, action):
|
||||
obs = {
|
||||
'image': np.random.randint(0, 256, (84, 84, 3), dtype=np.uint8),
|
||||
'vector': np.random.randn(10).astype(np.float32),
|
||||
'discrete': np.random.randint(0, 5)
|
||||
}
|
||||
return obs, 1.0, False, False, {}
|
||||
|
||||
# PufferLib automatically flattens and unflattens complex spaces
|
||||
env = pufferlib.emulate(ComplexObsEnv, num_envs=128)
|
||||
```
|
||||
|
||||
## PettingZoo Integration
|
||||
|
||||
### Parallel Environments
|
||||
|
||||
```python
|
||||
from pettingzoo.butterfly import pistonball_v6
|
||||
import pufferlib
|
||||
|
||||
# Wrap PettingZoo parallel environment
|
||||
pz_env = pistonball_v6.parallel_env()
|
||||
puffer_env = pufferlib.emulate(pz_env, num_envs=128)
|
||||
|
||||
# Or use make directly
|
||||
env = pufferlib.make('pettingzoo-pistonball', num_envs=128)
|
||||
```
|
||||
|
||||
### AEC (Agent Environment Cycle) Environments
|
||||
|
||||
```python
|
||||
from pettingzoo.classic import chess_v5
|
||||
import pufferlib
|
||||
|
||||
# Wrap AEC environment (PufferLib handles conversion to parallel)
|
||||
aec_env = chess_v5.env()
|
||||
puffer_env = pufferlib.emulate(aec_env, num_envs=64)
|
||||
|
||||
# Works with any PettingZoo AEC environment
|
||||
env = pufferlib.make('pettingzoo-chess', num_envs=64)
|
||||
```
|
||||
|
||||
### Multi-Agent Training
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
from pufferlib import PuffeRL
|
||||
|
||||
# Create multi-agent environment
|
||||
env = pufferlib.make('pettingzoo-knights-archers-zombies', num_envs=128)
|
||||
|
||||
# Shared policy for all agents
|
||||
policy = create_policy(env.observation_space, env.action_space)
|
||||
|
||||
# Train
|
||||
trainer = PuffeRL(env=env, policy=policy)
|
||||
|
||||
for iteration in range(num_iterations):
|
||||
# Observations are dicts: {agent_id: batch_obs}
|
||||
rollout = trainer.evaluate()
|
||||
|
||||
# Train on multi-agent data
|
||||
trainer.train()
|
||||
trainer.mean_and_log()
|
||||
```
|
||||
|
||||
## Third-Party Environments
|
||||
|
||||
### Procgen
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Procgen environments
|
||||
env = pufferlib.make('procgen-coinrun', num_envs=256, distribution_mode='easy')
|
||||
|
||||
# Custom configuration
|
||||
env = pufferlib.make(
|
||||
'procgen-coinrun',
|
||||
num_envs=256,
|
||||
num_levels=200, # Number of unique levels
|
||||
start_level=0, # Starting level seed
|
||||
distribution_mode='hard'
|
||||
)
|
||||
```
|
||||
|
||||
### NetHack
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# NetHack Learning Environment
|
||||
env = pufferlib.make('nethack', num_envs=128)
|
||||
|
||||
# MiniHack variants
|
||||
env = pufferlib.make('minihack-corridor', num_envs=128)
|
||||
env = pufferlib.make('minihack-room', num_envs=128)
|
||||
```
|
||||
|
||||
### Minigrid
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Minigrid environments
|
||||
env = pufferlib.make('minigrid-empty-8x8', num_envs=256)
|
||||
env = pufferlib.make('minigrid-doorkey-8x8', num_envs=256)
|
||||
env = pufferlib.make('minigrid-multiroom', num_envs=256)
|
||||
```
|
||||
|
||||
### Neural MMO
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Large-scale multi-agent environment
|
||||
env = pufferlib.make(
|
||||
'neuralmmo',
|
||||
num_envs=64,
|
||||
num_agents=128, # Agents per environment
|
||||
map_size=128
|
||||
)
|
||||
```
|
||||
|
||||
### Crafter
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Open-ended crafting environment
|
||||
env = pufferlib.make('crafter', num_envs=128)
|
||||
```
|
||||
|
||||
### GPUDrive
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# GPU-accelerated driving simulator
|
||||
env = pufferlib.make(
|
||||
'gpudrive',
|
||||
num_envs=1024, # Can handle many environments on GPU
|
||||
num_vehicles=8
|
||||
)
|
||||
```
|
||||
|
||||
### MicroRTS
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Real-time strategy game
|
||||
env = pufferlib.make(
|
||||
'microrts',
|
||||
num_envs=128,
|
||||
map_size=16,
|
||||
max_steps=2000
|
||||
)
|
||||
```
|
||||
|
||||
### Griddly
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Grid-based games
|
||||
env = pufferlib.make('griddly-clusters', num_envs=256)
|
||||
env = pufferlib.make('griddly-sokoban', num_envs=256)
|
||||
```
|
||||
|
||||
## Custom Wrappers
|
||||
|
||||
### Observation Wrappers
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import pufferlib
|
||||
from pufferlib import PufferEnv
|
||||
|
||||
class NormalizeObservations(pufferlib.Wrapper):
|
||||
"""Normalize observations to zero mean and unit variance."""
|
||||
|
||||
def __init__(self, env):
|
||||
super().__init__(env)
|
||||
self.obs_mean = np.zeros(env.observation_space.shape)
|
||||
self.obs_std = np.ones(env.observation_space.shape)
|
||||
self.count = 0
|
||||
|
||||
def reset(self):
|
||||
obs = self.env.reset()
|
||||
return self._normalize(obs)
|
||||
|
||||
def step(self, action):
|
||||
obs, reward, done, info = self.env.step(action)
|
||||
return self._normalize(obs), reward, done, info
|
||||
|
||||
def _normalize(self, obs):
|
||||
# Update running statistics
|
||||
self.count += 1
|
||||
delta = obs - self.obs_mean
|
||||
self.obs_mean += delta / self.count
|
||||
self.obs_std = np.sqrt(((self.count - 1) * self.obs_std ** 2 + delta * (obs - self.obs_mean)) / self.count)
|
||||
|
||||
# Normalize
|
||||
return (obs - self.obs_mean) / (self.obs_std + 1e-8)
|
||||
```
|
||||
|
||||
### Reward Wrappers
|
||||
|
||||
```python
|
||||
class RewardShaping(pufferlib.Wrapper):
|
||||
"""Add shaped rewards to environment."""
|
||||
|
||||
def __init__(self, env, shaping_fn):
|
||||
super().__init__(env)
|
||||
self.shaping_fn = shaping_fn
|
||||
|
||||
def step(self, action):
|
||||
obs, reward, done, info = self.env.step(action)
|
||||
|
||||
# Add shaped reward
|
||||
shaped_reward = reward + self.shaping_fn(obs, action)
|
||||
|
||||
return obs, shaped_reward, done, info
|
||||
|
||||
# Usage
|
||||
def proximity_shaping(obs, action):
|
||||
"""Reward agent for getting closer to goal."""
|
||||
goal_pos = np.array([10, 10])
|
||||
agent_pos = obs[:2]
|
||||
distance = np.linalg.norm(goal_pos - agent_pos)
|
||||
return -0.1 * distance
|
||||
|
||||
env = pufferlib.make('myenv', num_envs=128)
|
||||
env = RewardShaping(env, proximity_shaping)
|
||||
```
|
||||
|
||||
### Frame Stacking
|
||||
|
||||
```python
|
||||
class FrameStack(pufferlib.Wrapper):
|
||||
"""Stack frames for temporal context."""
|
||||
|
||||
def __init__(self, env, num_stack=4):
|
||||
super().__init__(env)
|
||||
self.num_stack = num_stack
|
||||
self.frames = None
|
||||
|
||||
def reset(self):
|
||||
obs = self.env.reset()
|
||||
|
||||
# Initialize frame stack
|
||||
self.frames = np.repeat(obs[np.newaxis], self.num_stack, axis=0)
|
||||
|
||||
return self._get_obs()
|
||||
|
||||
def step(self, action):
|
||||
obs, reward, done, info = self.env.step(action)
|
||||
|
||||
# Update frame stack
|
||||
self.frames = np.roll(self.frames, shift=-1, axis=0)
|
||||
self.frames[-1] = obs
|
||||
|
||||
        # The stack is re-initialized on the next reset(), so keep returning
        # the stacked frames here to preserve a consistent observation shape.
        return self._get_obs(), reward, done, info
|
||||
|
||||
def _get_obs(self):
|
||||
return self.frames
|
||||
```
|
||||
|
||||
### Action Repeat
|
||||
|
||||
```python
|
||||
class ActionRepeat(pufferlib.Wrapper):
|
||||
"""Repeat actions for multiple steps."""
|
||||
|
||||
def __init__(self, env, repeat=4):
|
||||
super().__init__(env)
|
||||
self.repeat = repeat
|
||||
|
||||
def step(self, action):
|
||||
total_reward = 0.0
|
||||
done = False
|
||||
|
||||
for _ in range(self.repeat):
|
||||
obs, reward, done, info = self.env.step(action)
|
||||
total_reward += reward
|
||||
|
||||
if done:
|
||||
break
|
||||
|
||||
return obs, total_reward, done, info
|
||||
```
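
The wrappers above compose in the usual way, with each one wrapping the previous. A usage sketch with the Atari id used earlier in this guide:

```python
env = pufferlib.make('atari-pong', num_envs=128)
env = FrameStack(env, num_stack=4)
env = ActionRepeat(env, repeat=4)
```

Note that wrapper order matters: the outermost wrapper's `step()` runs first.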
|
||||
|
||||
## Space Conversion
|
||||
|
||||
### Flattening Spaces
|
||||
|
||||
PufferLib automatically flattens complex observation/action spaces:
|
||||
|
||||
```python
|
||||
from gymnasium.spaces import Dict, Box, Discrete
|
||||
import pufferlib
|
||||
|
||||
# Complex space
|
||||
original_space = Dict({
|
||||
'image': Box(0, 255, (84, 84, 3), dtype=np.uint8),
|
||||
'vector': Box(-np.inf, np.inf, (10,), dtype=np.float32),
|
||||
'discrete': Discrete(5)
|
||||
})
|
||||
|
||||
# Automatically flattened by PufferLib
|
||||
# Observations are presented as flat arrays for efficient processing
|
||||
# But can be unflattened when needed for policy processing
|
||||
```
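
To see concretely what flattening does, Gymnasium's own space utilities give the same picture (this illustrates the idea; PufferLib's internal layout may differ):

```python
import numpy as np
from gymnasium.spaces import Dict, Box, Discrete, flatten_space, flatten

space = Dict({
    'image': Box(0, 255, (84, 84, 3), dtype=np.uint8),
    'vector': Box(-np.inf, np.inf, (10,), dtype=np.float32),
    'discrete': Discrete(5),
})

flat_space = flatten_space(space)            # a single Box
flat_obs = flatten(space, space.sample())
print(flat_space.shape, flat_obs.shape)      # (21183,) = 84*84*3 + 10 + 5 (one-hot)
```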
|
||||
|
||||
### Unflattening for Policies
|
||||
|
||||
```python
|
||||
from pufferlib.pytorch import unflatten_observations
|
||||
|
||||
class PolicyWithUnflatten(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
self.observation_space = observation_space
|
||||
# ... policy architecture ...
|
||||
|
||||
def forward(self, flat_observations):
|
||||
# Unflatten to original structure
|
||||
observations = unflatten_observations(
|
||||
flat_observations,
|
||||
self.observation_space
|
||||
)
|
||||
|
||||
# Now observations is a dict with 'image', 'vector', 'discrete'
|
||||
image_features = self.image_encoder(observations['image'])
|
||||
vector_features = self.vector_encoder(observations['vector'])
|
||||
# ...
|
||||
```
|
||||
|
||||
## Environment Registration
|
||||
|
||||
### Registering Custom Environments
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Register environment for easy access
|
||||
pufferlib.register(
|
||||
id='my-custom-env',
|
||||
entry_point='my_package.envs:MyEnvironment',
|
||||
kwargs={'param1': 'value1'}
|
||||
)
|
||||
|
||||
# Now can use with make
|
||||
env = pufferlib.make('my-custom-env', num_envs=256)
|
||||
```
|
||||
|
||||
### Registering in Ocean Suite
|
||||
|
||||
To add your environment to Ocean:
|
||||
|
||||
```python
|
||||
# In ocean/environment.py
|
||||
OCEAN_REGISTRY = {
|
||||
'my-env': {
|
||||
'entry_point': 'my_package.envs:MyEnvironment',
|
||||
'kwargs': {
|
||||
'default_param': 'default_value'
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Compatibility Patterns
|
||||
|
||||
### Gymnasium to PufferLib
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
import pufferlib
|
||||
|
||||
# Standard Gymnasium environment
|
||||
class GymEnv(gym.Env):
|
||||
def reset(self, seed=None, options=None):
|
||||
return observation, info
|
||||
|
||||
def step(self, action):
|
||||
return observation, reward, terminated, truncated, info
|
||||
|
||||
# Convert to PufferEnv
|
||||
puffer_env = pufferlib.emulate(GymEnv, num_envs=128)
|
||||
```
|
||||
|
||||
### PettingZoo to PufferLib
|
||||
|
||||
```python
|
||||
from pettingzoo import ParallelEnv
|
||||
import pufferlib
|
||||
|
||||
# PettingZoo parallel environment
|
||||
class PZEnv(ParallelEnv):
|
||||
def reset(self, seed=None, options=None):
|
||||
return {agent: obs for agent, obs in ...}, {agent: info for agent in ...}
|
||||
|
||||
def step(self, actions):
|
||||
return observations, rewards, terminations, truncations, infos
|
||||
|
||||
# Convert to PufferEnv
|
||||
puffer_env = pufferlib.emulate(PZEnv, num_envs=128)
|
||||
```
|
||||
|
||||
### Legacy Gym (v0.21) to PufferLib
|
||||
|
||||
```python
|
||||
import gym # Old gym
|
||||
import pufferlib
|
||||
|
||||
# Legacy gym environment (returns done instead of terminated/truncated)
|
||||
class LegacyEnv(gym.Env):
|
||||
def reset(self):
|
||||
return observation
|
||||
|
||||
def step(self, action):
|
||||
return observation, reward, done, info
|
||||
|
||||
# PufferLib handles legacy format automatically
|
||||
puffer_env = pufferlib.emulate(LegacyEnv, num_envs=128)
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Efficient Integration
|
||||
|
||||
```python
|
||||
# Fast: Use built-in integrations when available
|
||||
env = pufferlib.make('procgen-coinrun', num_envs=256)
|
||||
|
||||
# Slower: Generic wrapper (still fast, but overhead)
|
||||
import gymnasium as gym
|
||||
gym_env = gym.make('CartPole-v1')
|
||||
env = pufferlib.emulate(gym_env, num_envs=256)
|
||||
|
||||
# Slowest: Nested wrappers add overhead
|
||||
import gymnasium as gym
|
||||
gym_env = gym.make('CartPole-v1')
|
||||
gym_env = SomeWrapper(gym_env)
|
||||
gym_env = AnotherWrapper(gym_env)
|
||||
env = pufferlib.emulate(gym_env, num_envs=256)
|
||||
```
|
||||
|
||||
### Minimize Wrapper Overhead
|
||||
|
||||
```python
|
||||
# BAD: Too many wrappers
|
||||
env = gym.make('CartPole-v1')
|
||||
env = Wrapper1(env)
|
||||
env = Wrapper2(env)
|
||||
env = Wrapper3(env)
|
||||
puffer_env = pufferlib.emulate(env, num_envs=256)
|
||||
|
||||
# GOOD: Combine wrapper logic
|
||||
class CombinedWrapper(gym.Wrapper):
|
||||
def step(self, action):
|
||||
obs, reward, done, truncated, info = self.env.step(action)
|
||||
# Apply all transformations at once
|
||||
obs = self._transform_obs(obs)
|
||||
reward = self._transform_reward(reward)
|
||||
return obs, reward, done, truncated, info
|
||||
|
||||
env = gym.make('CartPole-v1')
|
||||
env = CombinedWrapper(env)
|
||||
puffer_env = pufferlib.emulate(env, num_envs=256)
|
||||
```
|
||||
|
||||
## Debugging Integration
|
||||
|
||||
### Verify Environment Compatibility
|
||||
|
||||
```python
|
||||
def test_environment(env, num_steps=100):
|
||||
"""Test environment for common issues."""
|
||||
# Test reset
|
||||
obs = env.reset()
|
||||
assert env.observation_space.contains(obs), "Invalid initial observation"
|
||||
|
||||
# Test steps
|
||||
for _ in range(num_steps):
|
||||
action = env.action_space.sample()
|
||||
obs, reward, done, info = env.step(action)
|
||||
|
||||
assert env.observation_space.contains(obs), "Invalid observation"
|
||||
assert isinstance(reward, (int, float)), "Invalid reward type"
|
||||
assert isinstance(done, bool), "Invalid done type"
|
||||
assert isinstance(info, dict), "Invalid info type"
|
||||
|
||||
if done:
|
||||
obs = env.reset()
|
||||
|
||||
print("✓ Environment passed compatibility test")
|
||||
|
||||
# Test before vectorizing
|
||||
test_environment(MyEnvironment())
|
||||
```
|
||||
|
||||
### Compare Outputs
|
||||
|
||||
```python
|
||||
# Verify PufferLib emulation matches original
|
||||
import gymnasium as gym
|
||||
import pufferlib
|
||||
import numpy as np
|
||||
|
||||
gym_env = gym.make('CartPole-v1')
|
||||
puffer_env = pufferlib.emulate(lambda: gym.make('CartPole-v1'), num_envs=1)
|
||||
|
||||
# Test with same seed
|
||||
gym_env.reset(seed=42)
|
||||
puffer_obs = puffer_env.reset()
|
||||
|
||||
for _ in range(100):
|
||||
action = gym_env.action_space.sample()
|
||||
|
||||
gym_obs, gym_reward, gym_done, gym_truncated, gym_info = gym_env.step(action)
|
||||
puffer_obs, puffer_reward, puffer_done, puffer_info = puffer_env.step(np.array([action]))
|
||||
|
||||
# Compare outputs (accounting for batch dimension)
|
||||
assert np.allclose(gym_obs, puffer_obs[0])
|
||||
assert gym_reward == puffer_reward[0]
|
||||
assert gym_done == puffer_done[0]
|
||||
```
skills/pufferlib/references/policies.md
|
||||
# PufferLib Policies Guide
|
||||
|
||||
## Overview
|
||||
|
||||
PufferLib policies are standard PyTorch modules with optional utilities for observation processing and LSTM integration. The framework provides default architectures and tools while allowing full flexibility in policy design.
|
||||
|
||||
## Policy Architecture
|
||||
|
||||
### Basic Policy Structure
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from pufferlib.pytorch import layer_init
|
||||
|
||||
class BasicPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
self.observation_space = observation_space
|
||||
self.action_space = action_space
|
||||
|
||||
# Encoder network
|
||||
self.encoder = nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space.shape[0], 256)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Linear(256, 256)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
# Policy head (actor)
|
||||
self.actor = layer_init(nn.Linear(256, action_space.n), std=0.01)
|
||||
|
||||
# Value head (critic)
|
||||
self.critic = layer_init(nn.Linear(256, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
"""Forward pass through policy."""
|
||||
# Encode observations
|
||||
features = self.encoder(observations)
|
||||
|
||||
# Get action logits and value
|
||||
logits = self.actor(features)
|
||||
value = self.critic(features)
|
||||
|
||||
return logits, value
|
||||
|
||||
def get_action(self, observations, deterministic=False):
|
||||
"""Sample action from policy."""
|
||||
logits, value = self.forward(observations)
|
||||
|
||||
if deterministic:
|
||||
action = logits.argmax(dim=-1)
|
||||
else:
|
||||
dist = torch.distributions.Categorical(logits=logits)
|
||||
action = dist.sample()
|
||||
|
||||
return action, value
|
||||
```
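
A quick usage check of the policy above with placeholder Gymnasium spaces (the 8-dimensional observation and 4 discrete actions are arbitrary values for illustration):

```python
import torch
from gymnasium.spaces import Box, Discrete

policy = BasicPolicy(Box(-1.0, 1.0, (8,)), Discrete(4))

obs = torch.randn(32, 8)                  # batch of 32 observations
actions, values = policy.get_action(obs)
print(actions.shape, values.shape)        # torch.Size([32]) torch.Size([32, 1])
```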
|
||||
|
||||
### Layer Initialization
|
||||
|
||||
PufferLib provides `layer_init` for proper weight initialization:
|
||||
|
||||
```python
|
||||
from pufferlib.pytorch import layer_init
|
||||
|
||||
# Default orthogonal initialization
|
||||
layer = layer_init(nn.Linear(256, 256))
|
||||
|
||||
# Custom standard deviation
|
||||
actor_head = layer_init(nn.Linear(256, num_actions), std=0.01)
|
||||
critic_head = layer_init(nn.Linear(256, 1), std=1.0)
|
||||
|
||||
# Works with any layer type
|
||||
conv = layer_init(nn.Conv2d(3, 32, kernel_size=8, stride=4))
|
||||
```
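
For reference, this kind of initializer is typically implemented as orthogonal weights plus a constant bias, as in the standard PPO recipe. The sketch below shows the general pattern and is an assumption, not PufferLib's exact source; use `pufferlib.pytorch.layer_init` in practice:

```python
import numpy as np
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    """Orthogonal weight init with a per-head gain and constant bias (sketch)."""
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer
```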
|
||||
|
||||
## CNN Policies
|
||||
|
||||
For image-based observations:
|
||||
|
||||
```python
|
||||
class CNNPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
# CNN encoder for images
|
||||
self.encoder = nn.Sequential(
|
||||
layer_init(nn.Conv2d(3, 32, kernel_size=8, stride=4)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Conv2d(32, 64, kernel_size=4, stride=2)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Conv2d(64, 64, kernel_size=3, stride=1)),
|
||||
nn.ReLU(),
|
||||
nn.Flatten(),
|
||||
layer_init(nn.Linear(64 * 7 * 7, 512)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
self.actor = layer_init(nn.Linear(512, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(512, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
# Normalize pixel values
|
||||
x = observations.float() / 255.0
|
||||
|
||||
features = self.encoder(x)
|
||||
logits = self.actor(features)
|
||||
value = self.critic(features)
|
||||
|
||||
return logits, value
|
||||
```
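
To confirm the `64 * 7 * 7` flatten size used above, run the convolutional stack on a dummy 84x84 frame (the ReLUs are omitted since they do not change shapes):

```python
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=8, stride=4),
    nn.Conv2d(32, 64, kernel_size=4, stride=2),
    nn.Conv2d(64, 64, kernel_size=3, stride=1),
    nn.Flatten(),
)
print(convs(torch.zeros(1, 3, 84, 84)).shape)  # torch.Size([1, 3136]) == 64 * 7 * 7
```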
|
||||
|
||||
### Efficient CNN Architecture
|
||||
|
||||
```python
|
||||
class EfficientCNN(nn.Module):
|
||||
"""Optimized CNN for Atari-style games."""
|
||||
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
in_channels = observation_space.shape[0] # Typically 4 for framestack
|
||||
|
||||
self.network = nn.Sequential(
|
||||
layer_init(nn.Conv2d(in_channels, 32, 8, stride=4)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Conv2d(32, 64, 4, stride=2)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Conv2d(64, 64, 3, stride=1)),
|
||||
nn.ReLU(),
|
||||
nn.Flatten()
|
||||
)
|
||||
|
||||
# Calculate feature size
|
||||
with torch.no_grad():
|
||||
sample = torch.zeros(1, *observation_space.shape)
|
||||
n_features = self.network(sample).shape[1]
|
||||
|
||||
self.fc = layer_init(nn.Linear(n_features, 512))
|
||||
self.actor = layer_init(nn.Linear(512, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(512, 1), std=1.0)
|
||||
|
||||
def forward(self, x):
|
||||
x = x.float() / 255.0
|
||||
x = self.network(x)
|
||||
x = torch.relu(self.fc(x))
|
||||
|
||||
return self.actor(x), self.critic(x)
|
||||
```
|
||||
|
||||
## Recurrent Policies (LSTM)
|
||||
|
||||
PufferLib provides optimized LSTM integration with automatic recurrence handling:
|
||||
|
||||
```python
|
||||
from pufferlib.pytorch import LSTMWrapper
|
||||
|
||||
class RecurrentPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space, hidden_size=256):
|
||||
super().__init__()
|
||||
|
||||
# Observation encoder
|
||||
self.encoder = nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space.shape[0], 128)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
# LSTM layer
|
||||
self.lstm = nn.LSTM(128, hidden_size, num_layers=1)
|
||||
|
||||
# Policy and value heads
|
||||
self.actor = layer_init(nn.Linear(hidden_size, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(hidden_size, 1), std=1.0)
|
||||
|
||||
# Hidden state
|
||||
self.hidden_size = hidden_size
|
||||
|
||||
def forward(self, observations, state=None):
|
||||
"""
|
||||
Args:
|
||||
observations: (batch, obs_dim)
|
||||
state: Optional (h, c) tuple for LSTM
|
||||
|
||||
Returns:
|
||||
logits, value, new_state
|
||||
"""
|
||||
batch_size = observations.shape[0]
|
||||
|
||||
# Encode observations
|
||||
features = self.encoder(observations)
|
||||
|
||||
# Initialize hidden state if needed
|
||||
if state is None:
|
||||
h = torch.zeros(1, batch_size, self.hidden_size, device=features.device)
|
||||
c = torch.zeros(1, batch_size, self.hidden_size, device=features.device)
|
||||
state = (h, c)
|
||||
|
||||
# LSTM forward
|
||||
features = features.unsqueeze(0) # Add sequence dimension
|
||||
lstm_out, new_state = self.lstm(features, state)
|
||||
lstm_out = lstm_out.squeeze(0)
|
||||
|
||||
# Get outputs
|
||||
logits = self.actor(lstm_out)
|
||||
value = self.critic(lstm_out)
|
||||
|
||||
return logits, value, new_state
|
||||
```
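
During rollouts the LSTM state has to be carried between calls (and reset when episodes end). A minimal sketch with placeholder spaces: 8-dimensional observations, 4 actions, and 16 parallel environments:

```python
import torch
from gymnasium.spaces import Box, Discrete

policy = RecurrentPolicy(Box(-1.0, 1.0, (8,)), Discrete(4))
state = None

for t in range(100):
    obs = torch.randn(16, 8)                     # one observation per env
    logits, value, state = policy(obs, state)
    action = torch.distributions.Categorical(logits=logits).sample()
    # On episode end for env i, zero that env's rows in state[0] and state[1]
```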
|
||||
|
||||
### LSTM Optimization
|
||||
|
||||
PufferLib's LSTM optimization uses LSTMCell during rollouts and LSTM during training for up to 3x faster inference:
|
||||
|
||||
```python
|
||||
class OptimizedLSTMPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space, hidden_size=256):
|
||||
super().__init__()
|
||||
|
||||
self.encoder = nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space.shape[0], 128)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
        # Use LSTMCell for step-by-step inference and LSTM for batch training.
        # NOTE: the two modules have separate parameters, so they must be kept
        # in sync (e.g. by copying weights before rollouts; see the sketch
        # after this class) or inference and training will diverge.
        self.lstm_cell = nn.LSTMCell(128, hidden_size)
        self.lstm = nn.LSTM(128, hidden_size, num_layers=1)
|
||||
|
||||
self.actor = layer_init(nn.Linear(hidden_size, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(hidden_size, 1), std=1.0)
|
||||
|
||||
self.hidden_size = hidden_size
|
||||
|
||||
def encode_observations(self, observations, state):
|
||||
"""Fast inference using LSTMCell."""
|
||||
features = self.encoder(observations)
|
||||
|
||||
if state is None:
|
||||
h = torch.zeros(observations.shape[0], self.hidden_size, device=features.device)
|
||||
c = torch.zeros(observations.shape[0], self.hidden_size, device=features.device)
|
||||
else:
|
||||
h, c = state
|
||||
|
||||
# Step-by-step with LSTMCell (faster for inference)
|
||||
h, c = self.lstm_cell(features, (h, c))
|
||||
|
||||
logits = self.actor(h)
|
||||
value = self.critic(h)
|
||||
|
||||
return logits, value, (h, c)
|
||||
|
||||
def decode_actions(self, observations, actions, state):
|
||||
"""Batch training using LSTM."""
|
||||
seq_len, batch_size = observations.shape[:2]
|
||||
|
||||
# Reshape for LSTM
|
||||
obs_flat = observations.reshape(seq_len * batch_size, -1)
|
||||
features = self.encoder(obs_flat)
|
||||
features = features.reshape(seq_len, batch_size, -1)
|
||||
|
||||
if state is None:
|
||||
h = torch.zeros(1, batch_size, self.hidden_size, device=features.device)
|
||||
c = torch.zeros(1, batch_size, self.hidden_size, device=features.device)
|
||||
state = (h, c)
|
||||
|
||||
# Batch processing with LSTM (faster for training)
|
||||
lstm_out, new_state = self.lstm(features, state)
|
||||
|
||||
# Flatten back
|
||||
lstm_out = lstm_out.reshape(seq_len * batch_size, -1)
|
||||
|
||||
logits = self.actor(lstm_out)
|
||||
value = self.critic(lstm_out)
|
||||
|
||||
return logits, value, new_state
|
||||
```
|
||||
|
||||
## Multi-Input Policies
|
||||
|
||||
For environments with multiple observation types:
|
||||
|
||||
```python
|
||||
class MultiInputPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
# Separate encoders for different observation types
|
||||
self.image_encoder = nn.Sequential(
|
||||
layer_init(nn.Conv2d(3, 32, 8, stride=4)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Conv2d(32, 64, 4, stride=2)),
|
||||
nn.ReLU(),
|
||||
nn.Flatten()
|
||||
)
|
||||
|
||||
self.vector_encoder = nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space['vector'].shape[0], 128)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
# Combine features
|
||||
        combined_size = 64 * 9 * 9 + 128  # Conv features (assumes 84x84 RGB input) + vector features
|
||||
self.combiner = nn.Sequential(
|
||||
layer_init(nn.Linear(combined_size, 512)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
self.actor = layer_init(nn.Linear(512, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(512, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
# Process each observation type
|
||||
image_features = self.image_encoder(observations['image'].float() / 255.0)
|
||||
vector_features = self.vector_encoder(observations['vector'])
|
||||
|
||||
# Combine
|
||||
combined = torch.cat([image_features, vector_features], dim=-1)
|
||||
features = self.combiner(combined)
|
||||
|
||||
return self.actor(features), self.critic(features)
|
||||
```
|
||||
|
||||
## Continuous Action Policies
|
||||
|
||||
For continuous control tasks:
|
||||
|
||||
```python
|
||||
class ContinuousPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
self.encoder = nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space.shape[0], 256)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Linear(256, 256)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
# Mean of action distribution
|
||||
self.actor_mean = layer_init(nn.Linear(256, action_space.shape[0]), std=0.01)
|
||||
|
||||
# Log std of action distribution
|
||||
self.actor_logstd = nn.Parameter(torch.zeros(1, action_space.shape[0]))
|
||||
|
||||
# Value head
|
||||
self.critic = layer_init(nn.Linear(256, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
features = self.encoder(observations)
|
||||
|
||||
action_mean = self.actor_mean(features)
|
||||
action_std = torch.exp(self.actor_logstd)
|
||||
|
||||
value = self.critic(features)
|
||||
|
||||
return action_mean, action_std, value
|
||||
|
||||
def get_action(self, observations, deterministic=False):
|
||||
action_mean, action_std, value = self.forward(observations)
|
||||
|
||||
if deterministic:
|
||||
return action_mean, value
|
||||
else:
|
||||
dist = torch.distributions.Normal(action_mean, action_std)
|
||||
action = dist.sample()
|
||||
return torch.tanh(action), value # Bound actions to [-1, 1]
|
||||
```
|
||||
|
||||
## Observation Processing
|
||||
|
||||
PufferLib provides utilities for unflattening observations:
|
||||
|
||||
```python
|
||||
from pufferlib.pytorch import unflatten_observations
|
||||
|
||||
class PolicyWithUnflatten(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
self.observation_space = observation_space
|
||||
|
||||
# Define encoders for each observation component
|
||||
self.encoders = nn.ModuleDict({
|
||||
'image': self._make_image_encoder(),
|
||||
'vector': self._make_vector_encoder()
|
||||
})
|
||||
|
||||
# ... rest of policy ...
|
||||
|
||||
def forward(self, flat_observations):
|
||||
# Unflatten observations into structured format
|
||||
observations = unflatten_observations(
|
||||
flat_observations,
|
||||
self.observation_space
|
||||
)
|
||||
|
||||
# Process each component
|
||||
image_features = self.encoders['image'](observations['image'])
|
||||
vector_features = self.encoders['vector'](observations['vector'])
|
||||
|
||||
# Combine and continue...
|
||||
```
|
||||
|
||||
## Multi-Agent Policies
|
||||
|
||||
### Shared Parameters
|
||||
|
||||
All agents use the same policy:
|
||||
|
||||
```python
|
||||
class SharedMultiAgentPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space, num_agents):
|
||||
super().__init__()
|
||||
|
||||
self.num_agents = num_agents
|
||||
|
||||
# Single policy shared across all agents
|
||||
self.encoder = nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space.shape[0], 256)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
self.actor = layer_init(nn.Linear(256, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(256, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
"""
|
||||
Args:
|
||||
observations: (batch * num_agents, obs_dim)
|
||||
Returns:
|
||||
logits: (batch * num_agents, num_actions)
|
||||
values: (batch * num_agents, 1)
|
||||
"""
|
||||
features = self.encoder(observations)
|
||||
return self.actor(features), self.critic(features)
|
||||
```
|
||||
|
||||
### Independent Parameters
|
||||
|
||||
Each agent has its own policy:
|
||||
|
||||
```python
|
||||
class IndependentMultiAgentPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space, num_agents):
|
||||
super().__init__()
|
||||
|
||||
self.num_agents = num_agents
|
||||
|
||||
# Separate policy for each agent
|
||||
self.policies = nn.ModuleList([
|
||||
self._make_policy(observation_space, action_space)
|
||||
for _ in range(num_agents)
|
||||
])
|
||||
|
||||
def _make_policy(self, observation_space, action_space):
|
||||
return nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space.shape[0], 256)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Linear(256, 256)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
def forward(self, observations, agent_ids):
|
||||
"""
|
||||
Args:
|
||||
observations: (batch, obs_dim)
|
||||
agent_ids: (batch,) which agent each obs belongs to
|
||||
"""
|
||||
outputs = []
|
||||
for agent_id in range(self.num_agents):
|
||||
mask = agent_ids == agent_id
|
||||
if mask.any():
|
||||
agent_obs = observations[mask]
|
||||
agent_out = self.policies[agent_id](agent_obs)
|
||||
outputs.append(agent_out)
|
||||
|
||||
return torch.cat(outputs, dim=0)
|
||||
```
|
||||
|
||||
## Advanced Architectures
|
||||
|
||||
### Attention-Based Policy
|
||||
|
||||
```python
|
||||
class AttentionPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space, d_model=256, nhead=8):
|
||||
super().__init__()
|
||||
|
||||
self.encoder = layer_init(nn.Linear(observation_space.shape[0], d_model))
|
||||
|
||||
self.attention = nn.MultiheadAttention(d_model, nhead, batch_first=True)
|
||||
|
||||
self.actor = layer_init(nn.Linear(d_model, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(d_model, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
# Encode
|
||||
features = self.encoder(observations)
|
||||
|
||||
# Self-attention
|
||||
features = features.unsqueeze(1) # Add sequence dimension
|
||||
attn_out, _ = self.attention(features, features, features)
|
||||
attn_out = attn_out.squeeze(1)
|
||||
|
||||
return self.actor(attn_out), self.critic(attn_out)
|
||||
```
|
||||
|
||||
### Residual Policy
|
||||
|
||||
```python
|
||||
class ResidualBlock(nn.Module):
|
||||
def __init__(self, dim):
|
||||
super().__init__()
|
||||
self.block = nn.Sequential(
|
||||
layer_init(nn.Linear(dim, dim)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Linear(dim, dim))
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return x + self.block(x)
|
||||
|
||||
class ResidualPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space, num_blocks=4):
|
||||
super().__init__()
|
||||
|
||||
dim = 256
|
||||
|
||||
self.encoder = layer_init(nn.Linear(observation_space.shape[0], dim))
|
||||
|
||||
self.blocks = nn.Sequential(
|
||||
*[ResidualBlock(dim) for _ in range(num_blocks)]
|
||||
)
|
||||
|
||||
self.actor = layer_init(nn.Linear(dim, action_space.n), std=0.01)
|
||||
self.critic = layer_init(nn.Linear(dim, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
x = torch.relu(self.encoder(observations))
|
||||
x = self.blocks(x)
|
||||
return self.actor(x), self.critic(x)
|
||||
```
|
||||
|
||||
## Policy Best Practices
|
||||
|
||||
### Initialization
|
||||
|
||||
```python
|
||||
# Always use layer_init for proper initialization
|
||||
good_layer = layer_init(nn.Linear(256, 256))
|
||||
|
||||
# Use small std for actor head (more stable early training)
|
||||
actor = layer_init(nn.Linear(256, num_actions), std=0.01)
|
||||
|
||||
# Use std=1.0 for critic head
|
||||
critic = layer_init(nn.Linear(256, 1), std=1.0)
|
||||
```
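
The `layer_init` helper used throughout these examples comes from `pufferlib.pytorch`. As a rough sketch, it follows the common CleanRL-style initializer shown below (orthogonal weights scaled by `std`, constant bias); the exact PufferLib signature may differ, so treat this as an assumption:

```python
import torch.nn as nn

def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Orthogonal weight init scaled by std, constant bias (CleanRL convention)
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer
```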
|
||||
|
||||
### Observation Normalization
|
||||
|
||||
```python
|
||||
class NormalizedPolicy(nn.Module):
|
||||
def __init__(self, observation_space, action_space):
|
||||
super().__init__()
|
||||
|
||||
# Running statistics for normalization
|
||||
self.obs_mean = nn.Parameter(torch.zeros(observation_space.shape[0]), requires_grad=False)
|
||||
self.obs_std = nn.Parameter(torch.ones(observation_space.shape[0]), requires_grad=False)
|
||||
|
||||
# ... rest of policy ...
|
||||
|
||||
def forward(self, observations):
|
||||
# Normalize observations
|
||||
normalized_obs = (observations - self.obs_mean) / (self.obs_std + 1e-8)
|
||||
|
||||
# Continue with normalized observations
|
||||
return self.policy(normalized_obs)
|
||||
|
||||
    def update_normalization(self, observations, momentum=0.99):
        """Update running statistics with an exponential moving average."""
        self.obs_mean.data = momentum * self.obs_mean.data + (1 - momentum) * observations.mean(dim=0)
        self.obs_std.data = momentum * self.obs_std.data + (1 - momentum) * observations.std(dim=0)
|
||||
```
|
||||
|
||||
### Gradient Clipping
|
||||
|
||||
```python
|
||||
# PufferLib trainer handles gradient clipping automatically
|
||||
trainer = PuffeRL(
|
||||
env=env,
|
||||
policy=policy,
|
||||
max_grad_norm=0.5 # Clip gradients to this norm
|
||||
)
|
||||
```
|
||||
|
||||
### Model Compilation
|
||||
|
||||
```python
|
||||
# Enable torch.compile for faster training (PyTorch 2.0+)
|
||||
policy = MyPolicy(observation_space, action_space)
|
||||
|
||||
# Compile the model
|
||||
policy = torch.compile(policy, mode='reduce-overhead')
|
||||
|
||||
# Use with trainer
|
||||
trainer = PuffeRL(env=env, policy=policy, compile=True)
|
||||
```
|
||||
|
||||
## Debugging Policies
|
||||
|
||||
### Check Output Shapes
|
||||
|
||||
```python
|
||||
def test_policy_shapes(policy, observation_space, batch_size=32):
|
||||
"""Verify policy output shapes."""
|
||||
# Create dummy observations
|
||||
obs = torch.randn(batch_size, *observation_space.shape)
|
||||
|
||||
# Forward pass
|
||||
logits, value = policy(obs)
|
||||
|
||||
# Check shapes
|
||||
assert logits.shape == (batch_size, policy.action_space.n)
|
||||
assert value.shape == (batch_size, 1)
|
||||
|
||||
print("✓ Policy shapes correct")
|
||||
```
|
||||
|
||||
### Verify Gradients
|
||||
|
||||
```python
|
||||
def check_gradients(policy, observation_space):
|
||||
"""Check that gradients flow properly."""
|
||||
obs = torch.randn(1, *observation_space.shape, requires_grad=True)
|
||||
|
||||
logits, value = policy(obs)
|
||||
|
||||
# Backward pass
|
||||
loss = logits.sum() + value.sum()
|
||||
loss.backward()
|
||||
|
||||
# Check gradients exist
|
||||
for name, param in policy.named_parameters():
|
||||
if param.grad is None:
|
||||
print(f"⚠ No gradient for {name}")
|
||||
elif torch.isnan(param.grad).any():
|
||||
print(f"⚠ NaN gradient for {name}")
|
||||
else:
|
||||
print(f"✓ Gradient OK for {name}")
|
||||
```
|
||||
360
skills/pufferlib/references/training.md
Normal file
@@ -0,0 +1,360 @@
|
||||
# PufferLib Training Guide
|
||||
|
||||
## Overview
|
||||
|
||||
PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO with LSTM support and extended with original research improvements. It achieves training at millions of steps per second through optimized vectorization and an efficient implementation.
|
||||
|
||||
## Training Workflow
|
||||
|
||||
### Basic Training Loop
|
||||
|
||||
The PuffeRL trainer provides three core methods:
|
||||
|
||||
```python
|
||||
# Collect environment interactions
|
||||
rollout_data = trainer.evaluate()
|
||||
|
||||
# Train on collected batch
|
||||
train_metrics = trainer.train()
|
||||
|
||||
# Aggregate and log results
|
||||
trainer.mean_and_log()
|
||||
```
|
||||
|
||||
### CLI Training
|
||||
|
||||
Quick start training via command line:
|
||||
|
||||
```bash
|
||||
# Basic training
|
||||
puffer train environment_name --train.device cuda --train.learning-rate 0.001
|
||||
|
||||
# Custom configuration
|
||||
puffer train environment_name \
|
||||
--train.device cuda \
|
||||
--train.batch-size 32768 \
|
||||
--train.learning-rate 0.0003 \
|
||||
--train.num-iterations 10000
|
||||
```
|
||||
|
||||
### Python Training Script
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
from pufferlib import PuffeRL
|
||||
|
||||
# Initialize environment
|
||||
env = pufferlib.make('environment_name', num_envs=256)
|
||||
|
||||
# Create trainer
|
||||
trainer = PuffeRL(
|
||||
env=env,
|
||||
policy=my_policy,
|
||||
device='cuda',
|
||||
learning_rate=3e-4,
|
||||
batch_size=32768,
|
||||
n_epochs=4,
|
||||
gamma=0.99,
|
||||
gae_lambda=0.95,
|
||||
clip_coef=0.2,
|
||||
ent_coef=0.01,
|
||||
vf_coef=0.5,
|
||||
max_grad_norm=0.5
|
||||
)
|
||||
|
||||
# Training loop
|
||||
for iteration in range(num_iterations):
|
||||
# Collect rollouts
|
||||
rollout_data = trainer.evaluate()
|
||||
|
||||
# Train on batch
|
||||
train_metrics = trainer.train()
|
||||
|
||||
# Log results
|
||||
trainer.mean_and_log()
|
||||
```
|
||||
|
||||
## Key Training Parameters
|
||||
|
||||
### Core Hyperparameters
|
||||
|
||||
- **learning_rate**: Learning rate for optimizer (default: 3e-4)
|
||||
- **batch_size**: Number of timesteps per training batch (default: 32768; see the note after this list)
|
||||
- **n_epochs**: Number of training epochs per batch (default: 4)
|
||||
- **num_envs**: Number of parallel environments (default: 256)
|
||||
- **num_steps**: Steps per environment per rollout (default: 128)
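
As noted above, `batch_size`, `num_envs`, and `num_steps` are linked: one rollout collects roughly `num_envs * num_steps` timesteps, so the defaults are consistent with each other:

```python
num_envs, num_steps = 256, 128
batch_size = num_envs * num_steps  # 32768, matching the default batch_size above
```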
|
||||
|
||||
### PPO Parameters
|
||||
|
||||
- **gamma**: Discount factor (default: 0.99)
|
||||
- **gae_lambda**: Lambda for GAE advantage estimation (default: 0.95; see the sketch after this list)
|
||||
- **clip_coef**: PPO clipping coefficient (default: 0.2)
|
||||
- **ent_coef**: Entropy coefficient for exploration (default: 0.01)
|
||||
- **vf_coef**: Value function loss coefficient (default: 0.5)
|
||||
- **max_grad_norm**: Maximum gradient norm for clipping (default: 0.5)
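
To make `gamma` and `gae_lambda` concrete, here is a minimal sketch of Generalized Advantage Estimation as it is typically computed inside a PPO trainer. This is illustrative, not PuffeRL's internal code; shapes and names are assumptions:

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    # rewards, values, dones: (num_steps, num_envs); dones is 1.0 where an episode ended
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(last_value)
    for t in reversed(range(num_steps)):
        next_value = last_value if t == num_steps - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```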
|
||||
|
||||
### Performance Parameters
|
||||
|
||||
- **device**: Computing device ('cuda' or 'cpu')
|
||||
- **compile**: Use torch.compile for faster training (default: True)
|
||||
- **num_workers**: Number of vectorization workers (default: auto)
|
||||
|
||||
## Distributed Training
|
||||
|
||||
### Multi-GPU Training
|
||||
|
||||
Use torchrun for distributed training across multiple GPUs:
|
||||
|
||||
```bash
|
||||
torchrun --nproc_per_node=4 train.py \
|
||||
--train.device cuda \
|
||||
--train.batch-size 131072
|
||||
```
|
||||
|
||||
### Multi-Node Training
|
||||
|
||||
For distributed training across multiple nodes:
|
||||
|
||||
```bash
|
||||
# On main node (rank 0)
|
||||
torchrun --nproc_per_node=8 \
|
||||
--nnodes=4 \
|
||||
--node_rank=0 \
|
||||
--master_addr=MASTER_IP \
|
||||
--master_port=29500 \
|
||||
train.py
|
||||
|
||||
# On worker nodes (rank 1, 2, 3)
|
||||
torchrun --nproc_per_node=8 \
|
||||
--nnodes=4 \
|
||||
--node_rank=NODE_RANK \
|
||||
--master_addr=MASTER_IP \
|
||||
--master_port=29500 \
|
||||
train.py
|
||||
```
|
||||
|
||||
## Monitoring and Logging
|
||||
|
||||
### Logger Integration
|
||||
|
||||
PufferLib supports multiple logging backends:
|
||||
|
||||
#### Weights & Biases
|
||||
|
||||
```python
|
||||
from pufferlib import WandbLogger
|
||||
|
||||
logger = WandbLogger(
|
||||
project='my_project',
|
||||
entity='my_team',
|
||||
name='experiment_name',
|
||||
config=trainer_config
|
||||
)
|
||||
|
||||
trainer = PuffeRL(env, policy, logger=logger)
|
||||
```
|
||||
|
||||
#### Neptune
|
||||
|
||||
```python
|
||||
from pufferlib import NeptuneLogger
|
||||
|
||||
logger = NeptuneLogger(
|
||||
project='my_team/my_project',
|
||||
name='experiment_name',
|
||||
api_token='YOUR_TOKEN'
|
||||
)
|
||||
|
||||
trainer = PuffeRL(env, policy, logger=logger)
|
||||
```
|
||||
|
||||
#### No Logger
|
||||
|
||||
```python
|
||||
from pufferlib import NoLogger
|
||||
|
||||
trainer = PuffeRL(env, policy, logger=NoLogger())
|
||||
```
|
||||
|
||||
### Key Metrics
|
||||
|
||||
Training logs include:
|
||||
|
||||
- **Performance Metrics**:
|
||||
- Steps per second (SPS)
|
||||
- Training throughput
|
||||
- Wall-clock time per iteration
|
||||
|
||||
- **Learning Metrics** (explained variance and clipfrac are defined in the sketch after this list):
|
||||
- Episode rewards (mean, min, max)
|
||||
- Episode lengths
|
||||
- Value function loss
|
||||
- Policy loss
|
||||
- Entropy
|
||||
- Explained variance
|
||||
- Clipfrac
|
||||
|
||||
- **Environment Metrics**:
|
||||
- Environment-specific rewards
|
||||
- Success rates
|
||||
- Custom metrics
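
Two of the less self-explanatory metrics above, explained variance and clipfrac, are typically computed as follows. This is a hedged sketch of the standard definitions; the exact quantities logged by PuffeRL may differ slightly:

```python
import torch

def explained_variance(values_pred, returns):
    # 1.0 means the value function fully explains the returns; <= 0 means it is uninformative
    var_returns = returns.var()
    if var_returns == 0:
        return float('nan')
    return (1.0 - (returns - values_pred).var() / var_returns).item()

def clipfrac(ratio, clip_coef=0.2):
    # Fraction of policy-ratio samples clipped by the PPO objective
    return ((ratio - 1.0).abs() > clip_coef).float().mean().item()
```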
|
||||
|
||||
### Terminal Dashboard
|
||||
|
||||
PufferLib provides a real-time terminal dashboard showing:
|
||||
- Training progress
|
||||
- Current SPS
|
||||
- Episode statistics
|
||||
- Loss values
|
||||
- GPU utilization
|
||||
|
||||
## Checkpointing
|
||||
|
||||
### Saving Checkpoints
|
||||
|
||||
```python
|
||||
# Save checkpoint
|
||||
trainer.save_checkpoint('checkpoint.pt')
|
||||
|
||||
# Save with additional metadata
|
||||
trainer.save_checkpoint(
|
||||
'checkpoint.pt',
|
||||
metadata={'iteration': iteration, 'best_reward': best_reward}
|
||||
)
|
||||
```
|
||||
|
||||
### Loading Checkpoints
|
||||
|
||||
```python
|
||||
# Load checkpoint
|
||||
trainer.load_checkpoint('checkpoint.pt')
|
||||
|
||||
# Resume training
|
||||
for iteration in range(resume_iteration, num_iterations):
|
||||
trainer.evaluate()
|
||||
trainer.train()
|
||||
trainer.mean_and_log()
|
||||
```
|
||||
|
||||
## Hyperparameter Tuning with Protein
|
||||
|
||||
The Protein system enables automatic hyperparameter and reward tuning:
|
||||
|
||||
```python
|
||||
from pufferlib import Protein
|
||||
|
||||
# Define search space
|
||||
search_space = {
|
||||
'learning_rate': [1e-4, 3e-4, 1e-3],
|
||||
'batch_size': [16384, 32768, 65536],
|
||||
'ent_coef': [0.001, 0.01, 0.1],
|
||||
'clip_coef': [0.1, 0.2, 0.3]
|
||||
}
|
||||
|
||||
# Run hyperparameter search
|
||||
protein = Protein(
|
||||
env_name='environment_name',
|
||||
search_space=search_space,
|
||||
num_trials=100,
|
||||
metric='mean_reward'
|
||||
)
|
||||
|
||||
best_config = protein.optimize()
|
||||
```
|
||||
|
||||
## Performance Optimization Tips
|
||||
|
||||
### Maximizing Throughput
|
||||
|
||||
1. **Batch Size**: Increase batch_size to fully utilize GPU
|
||||
2. **Num Envs**: Balance between CPU and GPU utilization
|
||||
3. **Compile**: Enable torch.compile for 10-20% speedup
|
||||
4. **Workers**: Adjust num_workers based on environment complexity
|
||||
5. **Device**: Always use 'cuda' for neural network training
|
||||
|
||||
### Environment Speed
|
||||
|
||||
- Pure Python environments: ~100k-500k SPS
|
||||
- C-based environments: ~4M SPS
|
||||
- With training overhead: ~1M-4M total SPS
|
||||
|
||||
### Memory Management
|
||||
|
||||
- Reduce batch_size if running out of GPU memory
|
||||
- Decrease num_envs if running out of CPU memory
|
||||
- Use gradient accumulation for large effective batch sizes (see the sketch below)
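
A minimal sketch of the gradient accumulation pattern in plain PyTorch is shown below. `compute_loss`, `policy`, `optimizer`, and `minibatches` are placeholders for your own objects; whether PuffeRL exposes accumulation directly is not assumed here:

```python
import torch

def train_with_accumulation(policy, optimizer, minibatches, compute_loss,
                            accumulation_steps=4, max_grad_norm=0.5):
    """Accumulate gradients over several minibatches before each optimizer step."""
    optimizer.zero_grad()
    for i, minibatch in enumerate(minibatches):
        loss = compute_loss(policy, minibatch)
        (loss / accumulation_steps).backward()  # scale so accumulated gradients average correctly
        if (i + 1) % accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
            optimizer.step()
            optimizer.zero_grad()
```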
|
||||
|
||||
## Common Training Patterns
|
||||
|
||||
### Curriculum Learning
|
||||
|
||||
```python
|
||||
# Start with easy tasks, gradually increase difficulty
|
||||
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]
|
||||
|
||||
for difficulty in difficulty_levels:
|
||||
env = pufferlib.make('environment_name', difficulty=difficulty)
|
||||
trainer = PuffeRL(env, policy)
|
||||
|
||||
for iteration in range(iterations_per_level):
|
||||
trainer.evaluate()
|
||||
trainer.train()
|
||||
trainer.mean_and_log()
|
||||
```
|
||||
|
||||
### Reward Shaping
|
||||
|
||||
```python
|
||||
# Wrap environment with custom reward shaping
|
||||
class RewardShapedEnv(pufferlib.PufferEnv):
|
||||
def step(self, actions):
|
||||
obs, rewards, dones, infos = super().step(actions)
|
||||
|
||||
# Add shaped rewards
|
||||
        shaped_rewards = rewards + 0.1 * proximity_bonus  # proximity_bonus: a task-specific term your environment computes
|
||||
|
||||
return obs, shaped_rewards, dones, infos
|
||||
```
|
||||
|
||||
### Multi-Stage Training
|
||||
|
||||
```python
|
||||
# Train in multiple stages with different configurations
|
||||
stages = [
|
||||
{'learning_rate': 1e-3, 'iterations': 1000}, # Exploration
|
||||
{'learning_rate': 3e-4, 'iterations': 5000}, # Main training
|
||||
{'learning_rate': 1e-4, 'iterations': 2000} # Fine-tuning
|
||||
]
|
||||
|
||||
for stage in stages:
|
||||
    trainer.learning_rate = stage['learning_rate']  # depending on the trainer, you may need to update the optimizer's param groups instead
|
||||
for iteration in range(stage['iterations']):
|
||||
trainer.evaluate()
|
||||
trainer.train()
|
||||
trainer.mean_and_log()
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Low Performance
|
||||
|
||||
- Check environment is vectorized correctly
|
||||
- Verify GPU utilization with `nvidia-smi`
|
||||
- Increase batch_size to saturate GPU
|
||||
- Enable compile mode
|
||||
- Profile with `torch.profiler` (see the sketch below)
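
A minimal `torch.profiler` sketch around one evaluate/train iteration (assuming a `trainer` as in the examples above):

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.evaluate()
    trainer.train()

# Show where time is spent, split by CPU and CUDA ops
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```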
|
||||
|
||||
### Training Instability
|
||||
|
||||
- Reduce learning_rate
|
||||
- Decrease batch_size
|
||||
- Increase num_envs for more diverse samples
|
||||
- Add entropy coefficient for more exploration
|
||||
- Check reward scaling
|
||||
|
||||
### Memory Issues
|
||||
|
||||
- Reduce batch_size or num_envs
|
||||
- Use gradient accumulation
|
||||
- Disable compile mode if causing OOM
|
||||
- Check for memory leaks in custom environments
|
||||
557
skills/pufferlib/references/vectorization.md
Normal file
@@ -0,0 +1,557 @@
|
||||
# PufferLib Vectorization Guide
|
||||
|
||||
## Overview
|
||||
|
||||
PufferLib's vectorization system enables high-performance parallel environment simulation, achieving millions of steps per second through optimized implementation inspired by EnvPool. The system supports both synchronous and asynchronous vectorization with minimal overhead.
|
||||
|
||||
## Vectorization Architecture
|
||||
|
||||
### Key Optimizations
|
||||
|
||||
1. **Shared Memory Buffer**: Single unified buffer across all environments (unlike Gymnasium's per-environment buffers)
|
||||
2. **Busy-Wait Flags**: Workers busy-wait on unlocked shared flags rather than using pipes/queues (see the sketch after this list)
|
||||
3. **Zero-Copy Batching**: Contiguous worker subsets return observations without copying
|
||||
4. **Surplus Environments**: Simulates more environments than batch size for async returns
|
||||
5. **Multiple Envs per Worker**: Optimizes performance for lightweight environments
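
To make the shared-memory and busy-wait ideas concrete, here is a toy sketch of a worker that spins on a shared flag and writes observations into a shared buffer in place. This only illustrates the pattern; it is not PufferLib's implementation, and the buffer/flag names are placeholders created by the main process:

```python
import numpy as np
from multiprocessing import shared_memory

def worker_loop(obs_name, flag_name, num_obs):
    """Toy worker: busy-wait on a shared flag, write observations into shared memory."""
    obs_shm = shared_memory.SharedMemory(name=obs_name)
    flag_shm = shared_memory.SharedMemory(name=flag_name)
    obs = np.ndarray((num_obs,), dtype=np.float32, buffer=obs_shm.buf)
    flag = np.ndarray((1,), dtype=np.int8, buffer=flag_shm.buf)

    while flag[0] != 2:                       # 2 = shutdown signal
        if flag[0] == 1:                      # 1 = step requested by the main process
            obs[:] = np.random.rand(num_obs)  # write in place, no copy
            flag[0] = 0                       # 0 = result ready
```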
|
||||
|
||||
### Performance Characteristics
|
||||
|
||||
- **Pure Python environments**: 100k-500k SPS
|
||||
- **C-based environments**: 100M+ SPS
|
||||
- **With training**: 400k-4M total SPS
|
||||
- **Vectorization overhead**: <5% with optimal configuration
|
||||
|
||||
## Creating Vectorized Environments
|
||||
|
||||
### Basic Vectorization
|
||||
|
||||
```python
|
||||
import pufferlib
|
||||
|
||||
# Automatic vectorization
|
||||
env = pufferlib.make('environment_name', num_envs=256)
|
||||
|
||||
# With explicit configuration
|
||||
env = pufferlib.make(
|
||||
'environment_name',
|
||||
num_envs=256,
|
||||
num_workers=8,
|
||||
envs_per_worker=32
|
||||
)
|
||||
```
|
||||
|
||||
### Manual Vectorization
|
||||
|
||||
```python
|
||||
from pufferlib import PufferEnv
|
||||
from pufferlib.vectorization import Serial, Multiprocessing
|
||||
|
||||
# Serial vectorization (single process)
|
||||
vec_env = Serial(
|
||||
env_creator=lambda: MyEnvironment(),
|
||||
num_envs=16
|
||||
)
|
||||
|
||||
# Multiprocessing vectorization
|
||||
vec_env = Multiprocessing(
|
||||
env_creator=lambda: MyEnvironment(),
|
||||
num_envs=256,
|
||||
num_workers=8
|
||||
)
|
||||
```
|
||||
|
||||
## Vectorization Modes
|
||||
|
||||
### Serial Vectorization
|
||||
|
||||
Best for debugging and lightweight environments:
|
||||
|
||||
```python
|
||||
from pufferlib.vectorization import Serial
|
||||
|
||||
vec_env = Serial(
|
||||
env_creator=env_creator_fn,
|
||||
num_envs=16
|
||||
)
|
||||
|
||||
# All environments run in main process
|
||||
# No multiprocessing overhead
|
||||
# Easier debugging with standard tools
|
||||
```
|
||||
|
||||
**When to use:**
|
||||
- Development and debugging
|
||||
- Very fast environments (< 1μs per step)
|
||||
- Small number of environments (< 32)
|
||||
- Single-threaded profiling
|
||||
|
||||
### Multiprocessing Vectorization
|
||||
|
||||
Best for most production use cases:
|
||||
|
||||
```python
|
||||
from pufferlib.vectorization import Multiprocessing
|
||||
|
||||
vec_env = Multiprocessing(
|
||||
env_creator=env_creator_fn,
|
||||
num_envs=256,
|
||||
num_workers=8,
|
||||
envs_per_worker=32
|
||||
)
|
||||
|
||||
# Parallel execution across workers
|
||||
# True parallelism for CPU-bound environments
|
||||
# Scales to hundreds of environments
|
||||
```
|
||||
|
||||
**When to use:**
|
||||
- Production training
|
||||
- CPU-intensive environments
|
||||
- Large-scale parallel simulation
|
||||
- Maximizing throughput
|
||||
|
||||
### Async Vectorization
|
||||
|
||||
For environments with variable step times:
|
||||
|
||||
```python
|
||||
vec_env = Multiprocessing(
|
||||
env_creator=env_creator_fn,
|
||||
num_envs=256,
|
||||
num_workers=8,
|
||||
mode='async',
|
||||
surplus_envs=32 # Simulate extra environments
|
||||
)
|
||||
|
||||
# Returns batches as soon as ready
|
||||
# Better GPU utilization
|
||||
# Handles variable environment speeds
|
||||
```
|
||||
|
||||
**When to use:**
|
||||
- Variable environment step times
|
||||
- Maximizing GPU utilization
|
||||
- Network-based environments
|
||||
- External simulators
|
||||
|
||||
## Optimizing Vectorization Performance
|
||||
|
||||
### Worker Configuration
|
||||
|
||||
```python
|
||||
import multiprocessing
|
||||
|
||||
# Calculate optimal workers
|
||||
num_cpus = multiprocessing.cpu_count()
|
||||
|
||||
# Conservative (leave headroom for training)
|
||||
num_workers = num_cpus - 2
|
||||
|
||||
# Aggressive (maximize environment throughput)
|
||||
num_workers = num_cpus
|
||||
|
||||
# With hyperthreading
|
||||
num_workers = num_cpus // 2 # Physical cores only
|
||||
```
|
||||
|
||||
### Envs Per Worker
|
||||
|
||||
```python
|
||||
# Fast environments (< 10μs per step)
|
||||
envs_per_worker = 64 # More envs per worker
|
||||
|
||||
# Medium environments (10-100μs per step)
|
||||
envs_per_worker = 32 # Balanced
|
||||
|
||||
# Slow environments (> 100μs per step)
|
||||
envs_per_worker = 16 # Fewer envs per worker
|
||||
|
||||
# Calculate from target batch size
batch_size = 32768
num_steps = 128                            # steps per env per rollout
num_envs = batch_size // num_steps         # 256 parallel environments
num_workers = 8
envs_per_worker = num_envs // num_workers  # 32 envs per worker
|
||||
```
|
||||
|
||||
### Batch Size Tuning
|
||||
|
||||
```python
|
||||
# Small batch (< 8k): Good for fast iteration
|
||||
batch_size = 4096
|
||||
num_envs = 256
|
||||
steps_per_env = batch_size // num_envs # 16 steps
|
||||
|
||||
# Medium batch (8k-32k): Good balance
|
||||
batch_size = 16384
|
||||
num_envs = 512
|
||||
steps_per_env = 32
|
||||
|
||||
# Large batch (> 32k): Maximum throughput
|
||||
batch_size = 65536
|
||||
num_envs = 1024
|
||||
steps_per_env = 64
|
||||
```
|
||||
|
||||
## Shared Memory Optimization
|
||||
|
||||
### Buffer Management
|
||||
|
||||
PufferLib uses shared memory for zero-copy observation passing:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from multiprocessing import shared_memory
|
||||
|
||||
class OptimizedEnv(PufferEnv):
|
||||
def __init__(self, buf=None):
|
||||
super().__init__(buf)
|
||||
|
||||
# Environment will use provided shared buffer
|
||||
self.observation_space = self.make_space({'obs': (84, 84, 3)})
|
||||
|
||||
# Observations written directly to shared memory
|
||||
self._obs_buffer = None
|
||||
|
||||
def reset(self):
|
||||
# Write to shared memory in-place
|
||||
if self._obs_buffer is None:
|
||||
self._obs_buffer = np.zeros((84, 84, 3), dtype=np.uint8)
|
||||
|
||||
self._render_to_buffer(self._obs_buffer)
|
||||
return {'obs': self._obs_buffer}
|
||||
|
||||
def step(self, action):
|
||||
# In-place updates only
|
||||
self._update_state(action)
|
||||
self._render_to_buffer(self._obs_buffer)
|
||||
|
||||
return {'obs': self._obs_buffer}, reward, done, info
|
||||
```
|
||||
|
||||
### Zero-Copy Patterns
|
||||
|
||||
```python
|
||||
# BAD: Creates copies
|
||||
def get_observation(self):
|
||||
obs = np.zeros((84, 84, 3))
|
||||
# ... fill obs ...
|
||||
return obs.copy() # Unnecessary copy!
|
||||
|
||||
# GOOD: Reuses buffer
|
||||
def get_observation(self):
|
||||
# Use pre-allocated buffer
|
||||
self._render_to_buffer(self._obs_buffer)
|
||||
return self._obs_buffer # No copy
|
||||
|
||||
# BAD: Allocates new arrays
|
||||
def step(self, action):
|
||||
new_state = self.state + action # Allocates
|
||||
self.state = new_state
|
||||
return obs, reward, done, info
|
||||
|
||||
# GOOD: In-place operations
|
||||
def step(self, action):
|
||||
self.state += action # In-place
|
||||
return obs, reward, done, info
|
||||
```
|
||||
|
||||
## Advanced Vectorization
|
||||
|
||||
### Custom Vectorization
|
||||
|
||||
```python
|
||||
from pufferlib.vectorization import VectorEnv
|
||||
|
||||
class CustomVectorEnv(VectorEnv):
|
||||
"""Custom vectorization implementation."""
|
||||
|
||||
def __init__(self, env_creator, num_envs, **kwargs):
|
||||
super().__init__()
|
||||
|
||||
self.envs = [env_creator() for _ in range(num_envs)]
|
||||
self.num_envs = num_envs
|
||||
|
||||
def reset(self):
|
||||
"""Reset all environments."""
|
||||
observations = [env.reset() for env in self.envs]
|
||||
return self._stack_obs(observations)
|
||||
|
||||
def step(self, actions):
|
||||
"""Step all environments."""
|
||||
results = [env.step(action) for env, action in zip(self.envs, actions)]
|
||||
|
||||
obs, rewards, dones, infos = zip(*results)
|
||||
|
||||
return (
|
||||
self._stack_obs(obs),
|
||||
np.array(rewards),
|
||||
np.array(dones),
|
||||
list(infos)
|
||||
)
|
||||
|
||||
def _stack_obs(self, observations):
|
||||
"""Stack observations into batch."""
|
||||
return np.stack(observations, axis=0)
|
||||
```
|
||||
|
||||
### Hierarchical Vectorization
|
||||
|
||||
For very large-scale parallelism:
|
||||
|
||||
```python
|
||||
# Outer: Multiprocessing vectorization (8 workers)
|
||||
# Inner: Each worker runs serial vectorization (32 envs)
|
||||
# Total: 256 parallel environments
|
||||
|
||||
def create_serial_vec_env():
|
||||
return Serial(
|
||||
env_creator=lambda: MyEnvironment(),
|
||||
num_envs=32
|
||||
)
|
||||
|
||||
outer_vec_env = Multiprocessing(
|
||||
env_creator=create_serial_vec_env,
|
||||
num_envs=8, # 8 serial vec envs
|
||||
num_workers=8
|
||||
)
|
||||
|
||||
# Total environments: 8 * 32 = 256
|
||||
```
|
||||
|
||||
## Multi-Agent Vectorization
|
||||
|
||||
### Native Multi-Agent Support
|
||||
|
||||
PufferLib treats multi-agent environments as first-class citizens:
|
||||
|
||||
```python
|
||||
# Multi-agent environment automatically vectorized
|
||||
env = pufferlib.make(
|
||||
'pettingzoo-knights-archers-zombies',
|
||||
num_envs=128,
|
||||
num_agents=4
|
||||
)
|
||||
|
||||
# Observations: {agent_id: [batch_obs]} for each agent
|
||||
# Actions: {agent_id: [batch_actions]} for each agent
|
||||
# Rewards: {agent_id: [batch_rewards]} for each agent
|
||||
```
|
||||
|
||||
### Custom Multi-Agent Vectorization
|
||||
|
||||
```python
|
||||
class MultiAgentVectorEnv(VectorEnv):
|
||||
def step(self, actions):
|
||||
"""
|
||||
Args:
|
||||
actions: Dict of {agent_id: [batch_actions]}
|
||||
|
||||
Returns:
|
||||
observations: Dict of {agent_id: [batch_obs]}
|
||||
rewards: Dict of {agent_id: [batch_rewards]}
|
||||
dones: Dict of {agent_id: [batch_dones]}
|
||||
infos: List of dicts
|
||||
"""
|
||||
# Distribute actions to environments
|
||||
env_actions = self._distribute_actions(actions)
|
||||
|
||||
# Step each environment
|
||||
results = [env.step(act) for env, act in zip(self.envs, env_actions)]
|
||||
|
||||
# Collect and batch results
|
||||
return self._batch_results(results)
|
||||
```
|
||||
|
||||
## Performance Monitoring
|
||||
|
||||
### Profiling Vectorization
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def profile_vectorization(vec_env, num_steps=10000):
|
||||
"""Profile vectorization performance."""
|
||||
start = time.time()
|
||||
|
||||
vec_env.reset()
|
||||
|
||||
for _ in range(num_steps):
|
||||
actions = vec_env.action_space.sample()
|
||||
vec_env.step(actions)
|
||||
|
||||
elapsed = time.time() - start
|
||||
sps = (num_steps * vec_env.num_envs) / elapsed
|
||||
|
||||
print(f"Steps per second: {sps:,.0f}")
|
||||
print(f"Time per step: {elapsed/num_steps*1000:.2f}ms")
|
||||
|
||||
return sps
|
||||
```
|
||||
|
||||
### Bottleneck Analysis
|
||||
|
||||
```python
|
||||
import cProfile
|
||||
import pstats
|
||||
|
||||
def analyze_bottlenecks(vec_env):
|
||||
"""Identify vectorization bottlenecks."""
|
||||
profiler = cProfile.Profile()
|
||||
|
||||
profiler.enable()
|
||||
|
||||
vec_env.reset()
|
||||
for _ in range(1000):
|
||||
actions = vec_env.action_space.sample()
|
||||
vec_env.step(actions)
|
||||
|
||||
profiler.disable()
|
||||
|
||||
stats = pstats.Stats(profiler)
|
||||
stats.sort_stats('cumulative')
|
||||
stats.print_stats(20)
|
||||
```
|
||||
|
||||
### Real-Time Monitoring
|
||||
|
||||
```python
|
||||
class MonitoredVectorEnv(VectorEnv):
|
||||
"""Vector environment with performance monitoring."""
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
|
||||
self.step_times = []
|
||||
self.step_count = 0
|
||||
|
||||
def step(self, actions):
|
||||
start = time.perf_counter()
|
||||
|
||||
result = super().step(actions)
|
||||
|
||||
elapsed = time.perf_counter() - start
|
||||
self.step_times.append(elapsed)
|
||||
self.step_count += 1
|
||||
|
||||
# Log every 1000 steps
|
||||
if self.step_count % 1000 == 0:
|
||||
mean_time = np.mean(self.step_times[-1000:])
|
||||
sps = self.num_envs / mean_time
|
||||
print(f"SPS: {sps:,.0f} | Step time: {mean_time*1000:.2f}ms")
|
||||
|
||||
return result
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Low Throughput
|
||||
|
||||
```python
|
||||
# Check configuration
|
||||
print(f"Num envs: {vec_env.num_envs}")
|
||||
print(f"Num workers: {vec_env.num_workers}")
|
||||
print(f"Envs per worker: {vec_env.num_envs // vec_env.num_workers}")
|
||||
|
||||
# Profile single environment
|
||||
single_env = MyEnvironment()
|
||||
single_sps = profile_single_env(single_env)
|
||||
print(f"Single env SPS: {single_sps:,.0f}")
|
||||
|
||||
# Compare vectorized
|
||||
vec_sps = profile_vectorization(vec_env)
|
||||
print(f"Vectorized SPS: {vec_sps:,.0f}")
|
||||
print(f"Speedup: {vec_sps / single_sps:.1f}x")
|
||||
```
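
The snippet above calls `profile_single_env`, which is not defined in this guide. A matching helper, analogous to `profile_vectorization`, might look like this:

```python
import time

def profile_single_env(env, num_steps=10000):
    """Measure raw steps per second of a single, unvectorized environment."""
    env.reset()
    start = time.time()
    for _ in range(num_steps):
        action = env.action_space.sample()
        _, _, done, _ = env.step(action)
        if done:
            env.reset()
    return num_steps / (time.time() - start)
```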
|
||||
|
||||
### Memory Issues
|
||||
|
||||
```python
|
||||
# Reduce number of environments
|
||||
num_envs = 128 # Instead of 256
|
||||
|
||||
# Reduce envs per worker
|
||||
envs_per_worker = 16 # Instead of 32
|
||||
|
||||
# Use Serial mode for debugging
|
||||
vec_env = Serial(env_creator, num_envs=16)
|
||||
```
|
||||
|
||||
### Synchronization Problems
|
||||
|
||||
```python
|
||||
# Ensure thread-safe operations
|
||||
import threading
|
||||
|
||||
class ThreadSafeEnv(PufferEnv):
|
||||
def __init__(self, buf=None):
|
||||
super().__init__(buf)
|
||||
self.lock = threading.Lock()
|
||||
|
||||
def step(self, action):
|
||||
with self.lock:
|
||||
return super().step(action)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Configuration Guidelines
|
||||
|
||||
```python
|
||||
# Start conservative
|
||||
config = {
|
||||
'num_envs': 64,
|
||||
'num_workers': 4,
|
||||
'envs_per_worker': 16
|
||||
}
|
||||
|
||||
# Scale up iteratively
|
||||
config = {
|
||||
'num_envs': 256, # 4x increase
|
||||
'num_workers': 8, # 2x increase
|
||||
'envs_per_worker': 32 # 2x increase
|
||||
}
|
||||
|
||||
# Monitor and adjust
|
||||
if sps < target_sps:
|
||||
# Try increasing num_envs or num_workers
|
||||
pass
|
||||
if memory_usage > threshold:
|
||||
# Reduce num_envs or envs_per_worker
|
||||
pass
|
||||
```
|
||||
|
||||
### Environment Design
|
||||
|
||||
```python
|
||||
# Minimize per-step allocations
|
||||
class EfficientEnv(PufferEnv):
|
||||
def __init__(self, buf=None):
|
||||
super().__init__(buf)
|
||||
|
||||
# Pre-allocate all buffers
|
||||
self._obs = np.zeros((84, 84, 3), dtype=np.uint8)
|
||||
self._state = np.zeros(10, dtype=np.float32)
|
||||
|
||||
def step(self, action):
|
||||
# Use pre-allocated buffers
|
||||
self._update_state_inplace(action)
|
||||
self._render_to_obs()
|
||||
|
||||
return self._obs, reward, done, info
|
||||
```
|
||||
|
||||
### Testing
|
||||
|
||||
```python
|
||||
# Test vectorization matches serial
|
||||
serial_env = Serial(env_creator, num_envs=4)
|
||||
vec_env = Multiprocessing(env_creator, num_envs=4, num_workers=2)
|
||||
|
||||
# Run parallel and verify results match
|
||||
serial_env.seed(42)
|
||||
vec_env.seed(42)
|
||||
|
||||
serial_obs = serial_env.reset()
|
||||
vec_obs = vec_env.reset()
|
||||
|
||||
assert np.allclose(serial_obs, vec_obs), "Vectorization mismatch!"
|
||||
```
|
||||
340
skills/pufferlib/scripts/env_template.py
Normal file
@@ -0,0 +1,340 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
PufferLib Environment Template
|
||||
|
||||
This template provides a starting point for creating custom PufferEnv environments.
|
||||
Customize the observation space, action space, and environment logic for your task.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import pufferlib
|
||||
from pufferlib import PufferEnv
|
||||
|
||||
|
||||
class MyEnvironment(PufferEnv):
|
||||
"""
|
||||
Custom PufferLib environment template.
|
||||
|
||||
This is a simple grid world example. Customize it for your specific task.
|
||||
"""
|
||||
|
||||
def __init__(self, buf=None, grid_size=10, max_steps=1000):
|
||||
"""
|
||||
Initialize environment.
|
||||
|
||||
Args:
|
||||
buf: Shared memory buffer (managed by PufferLib)
|
||||
grid_size: Size of the grid world
|
||||
max_steps: Maximum steps per episode
|
||||
"""
|
||||
super().__init__(buf)
|
||||
|
||||
self.grid_size = grid_size
|
||||
self.max_steps = max_steps
|
||||
|
||||
# Define observation space
|
||||
# Option 1: Flat vector observation
|
||||
self.observation_space = self.make_space((4,)) # [x, y, goal_x, goal_y]
|
||||
|
||||
# Option 2: Dict observation with multiple components
|
||||
# self.observation_space = self.make_space({
|
||||
# 'position': (2,),
|
||||
# 'goal': (2,),
|
||||
# 'grid': (grid_size, grid_size)
|
||||
# })
|
||||
|
||||
# Option 3: Image observation
|
||||
# self.observation_space = self.make_space((grid_size, grid_size, 3))
|
||||
|
||||
# Define action space
|
||||
# Option 1: Discrete actions
|
||||
self.action_space = self.make_discrete(4) # 0: up, 1: right, 2: down, 3: left
|
||||
|
||||
# Option 2: Continuous actions
|
||||
# self.action_space = self.make_space((2,)) # [dx, dy]
|
||||
|
||||
# Option 3: Multi-discrete actions
|
||||
# self.action_space = self.make_multi_discrete([3, 3]) # Two 3-way choices
|
||||
|
||||
# Initialize state
|
||||
self.agent_pos = None
|
||||
self.goal_pos = None
|
||||
self.step_count = 0
|
||||
|
||||
self.reset()
|
||||
|
||||
def reset(self):
|
||||
"""
|
||||
Reset environment to initial state.
|
||||
|
||||
Returns:
|
||||
observation: Initial observation
|
||||
"""
|
||||
# Reset state
|
||||
self.agent_pos = np.array([0, 0], dtype=np.float32)
|
||||
self.goal_pos = np.array([self.grid_size - 1, self.grid_size - 1], dtype=np.float32)
|
||||
self.step_count = 0
|
||||
|
||||
# Return initial observation
|
||||
return self._get_observation()
|
||||
|
||||
def step(self, action):
|
||||
"""
|
||||
Execute one environment step.
|
||||
|
||||
Args:
|
||||
action: Action to take
|
||||
|
||||
Returns:
|
||||
observation: New observation
|
||||
reward: Reward for this step
|
||||
done: Whether episode is complete
|
||||
info: Additional information
|
||||
"""
|
||||
self.step_count += 1
|
||||
|
||||
# Execute action
|
||||
self._apply_action(action)
|
||||
|
||||
# Compute reward
|
||||
reward = self._compute_reward()
|
||||
|
||||
# Check if episode is done
|
||||
done = self._is_done()
|
||||
|
||||
# Get new observation
|
||||
observation = self._get_observation()
|
||||
|
||||
# Additional info
|
||||
info = {}
|
||||
if done:
|
||||
# Include episode statistics when episode ends
|
||||
info['episode'] = {
|
||||
'r': reward,
|
||||
'l': self.step_count
|
||||
}
|
||||
|
||||
return observation, reward, done, info
|
||||
|
||||
def _apply_action(self, action):
|
||||
"""Apply action to update environment state."""
|
||||
# Discrete actions: 0=up, 1=right, 2=down, 3=left
|
||||
if action == 0: # Up
|
||||
self.agent_pos[1] = min(self.agent_pos[1] + 1, self.grid_size - 1)
|
||||
elif action == 1: # Right
|
||||
self.agent_pos[0] = min(self.agent_pos[0] + 1, self.grid_size - 1)
|
||||
elif action == 2: # Down
|
||||
self.agent_pos[1] = max(self.agent_pos[1] - 1, 0)
|
||||
elif action == 3: # Left
|
||||
self.agent_pos[0] = max(self.agent_pos[0] - 1, 0)
|
||||
|
||||
def _compute_reward(self):
|
||||
"""Compute reward for current state."""
|
||||
# Distance to goal
|
||||
distance = np.linalg.norm(self.agent_pos - self.goal_pos)
|
||||
|
||||
# Reward shaping: negative distance + bonus for reaching goal
|
||||
reward = -distance / self.grid_size
|
||||
|
||||
# Goal reached
|
||||
if distance < 0.5:
|
||||
reward += 10.0
|
||||
|
||||
return reward
|
||||
|
||||
def _is_done(self):
|
||||
"""Check if episode is complete."""
|
||||
# Episode ends if goal reached or max steps exceeded
|
||||
distance = np.linalg.norm(self.agent_pos - self.goal_pos)
|
||||
goal_reached = distance < 0.5
|
||||
timeout = self.step_count >= self.max_steps
|
||||
|
||||
return goal_reached or timeout
|
||||
|
||||
def _get_observation(self):
|
||||
"""Generate observation from current state."""
|
||||
# Return flat vector observation
|
||||
observation = np.concatenate([
|
||||
self.agent_pos,
|
||||
self.goal_pos
|
||||
]).astype(np.float32)
|
||||
|
||||
return observation
|
||||
|
||||
|
||||
class MultiAgentEnvironment(PufferEnv):
|
||||
"""
|
||||
Multi-agent environment template.
|
||||
|
||||
Example: Cooperative navigation task where agents must reach individual goals.
|
||||
"""
|
||||
|
||||
def __init__(self, buf=None, num_agents=4, grid_size=10, max_steps=1000):
|
||||
super().__init__(buf)
|
||||
|
||||
self.num_agents = num_agents
|
||||
self.grid_size = grid_size
|
||||
self.max_steps = max_steps
|
||||
|
||||
# Per-agent observation space
|
||||
self.single_observation_space = self.make_space({
|
||||
'position': (2,),
|
||||
'goal': (2,),
|
||||
'others': (2 * (num_agents - 1),) # Positions of other agents
|
||||
})
|
||||
|
||||
# Per-agent action space
|
||||
self.single_action_space = self.make_discrete(5) # 4 directions + stay
|
||||
|
||||
# Initialize state
|
||||
self.agent_positions = None
|
||||
self.goal_positions = None
|
||||
self.step_count = 0
|
||||
|
||||
self.reset()
|
||||
|
||||
def reset(self):
|
||||
"""Reset all agents."""
|
||||
# Random initial positions
|
||||
self.agent_positions = np.random.rand(self.num_agents, 2) * self.grid_size
|
||||
|
||||
# Random goal positions
|
||||
self.goal_positions = np.random.rand(self.num_agents, 2) * self.grid_size
|
||||
|
||||
self.step_count = 0
|
||||
|
||||
# Return observations for all agents
|
||||
return {
|
||||
f'agent_{i}': self._get_obs(i)
|
||||
for i in range(self.num_agents)
|
||||
}
|
||||
|
||||
def step(self, actions):
|
||||
"""
|
||||
Step all agents.
|
||||
|
||||
Args:
|
||||
actions: Dict of {agent_id: action}
|
||||
|
||||
Returns:
|
||||
observations: Dict of {agent_id: observation}
|
||||
rewards: Dict of {agent_id: reward}
|
||||
dones: Dict of {agent_id: done}
|
||||
infos: Dict of {agent_id: info}
|
||||
"""
|
||||
self.step_count += 1
|
||||
|
||||
observations = {}
|
||||
rewards = {}
|
||||
dones = {}
|
||||
infos = {}
|
||||
|
||||
# Update all agents
|
||||
for agent_id, action in actions.items():
|
||||
agent_idx = int(agent_id.split('_')[1])
|
||||
|
||||
# Apply action
|
||||
self._apply_action(agent_idx, action)
|
||||
|
||||
# Generate outputs
|
||||
observations[agent_id] = self._get_obs(agent_idx)
|
||||
rewards[agent_id] = self._compute_reward(agent_idx)
|
||||
dones[agent_id] = self._is_done(agent_idx)
|
||||
infos[agent_id] = {}
|
||||
|
||||
# Global done condition
|
||||
dones['__all__'] = all(dones.values()) or self.step_count >= self.max_steps
|
||||
|
||||
return observations, rewards, dones, infos
|
||||
|
||||
def _apply_action(self, agent_idx, action):
|
||||
"""Apply action for specific agent."""
|
||||
if action == 0: # Up
|
||||
self.agent_positions[agent_idx, 1] += 1
|
||||
elif action == 1: # Right
|
||||
self.agent_positions[agent_idx, 0] += 1
|
||||
elif action == 2: # Down
|
||||
self.agent_positions[agent_idx, 1] -= 1
|
||||
elif action == 3: # Left
|
||||
self.agent_positions[agent_idx, 0] -= 1
|
||||
# action == 4: Stay
|
||||
|
||||
# Clip to grid bounds
|
||||
self.agent_positions[agent_idx] = np.clip(
|
||||
self.agent_positions[agent_idx],
|
||||
0,
|
||||
self.grid_size - 1
|
||||
)
|
||||
|
||||
def _compute_reward(self, agent_idx):
|
||||
"""Compute reward for specific agent."""
|
||||
distance = np.linalg.norm(
|
||||
self.agent_positions[agent_idx] - self.goal_positions[agent_idx]
|
||||
)
|
||||
return -distance / self.grid_size
|
||||
|
||||
def _is_done(self, agent_idx):
|
||||
"""Check if specific agent is done."""
|
||||
distance = np.linalg.norm(
|
||||
self.agent_positions[agent_idx] - self.goal_positions[agent_idx]
|
||||
)
|
||||
return distance < 0.5
|
||||
|
||||
def _get_obs(self, agent_idx):
|
||||
"""Get observation for specific agent."""
|
||||
# Get positions of other agents
|
||||
other_positions = np.concatenate([
|
||||
self.agent_positions[i]
|
||||
for i in range(self.num_agents)
|
||||
if i != agent_idx
|
||||
])
|
||||
|
||||
return {
|
||||
'position': self.agent_positions[agent_idx].astype(np.float32),
|
||||
'goal': self.goal_positions[agent_idx].astype(np.float32),
|
||||
'others': other_positions.astype(np.float32)
|
||||
}
|
||||
|
||||
|
||||
def test_environment():
|
||||
"""Test environment to verify it works correctly."""
|
||||
print("Testing single-agent environment...")
|
||||
env = MyEnvironment()
|
||||
|
||||
obs = env.reset()
|
||||
print(f"Initial observation shape: {obs.shape}")
|
||||
|
||||
for step in range(10):
|
||||
action = env.action_space.sample()
|
||||
obs, reward, done, info = env.step(action)
|
||||
|
||||
print(f"Step {step}: reward={reward:.3f}, done={done}")
|
||||
|
||||
if done:
|
||||
obs = env.reset()
|
||||
print("Episode finished, resetting...")
|
||||
|
||||
print("\nTesting multi-agent environment...")
|
||||
multi_env = MultiAgentEnvironment(num_agents=4)
|
||||
|
||||
obs = multi_env.reset()
|
||||
print(f"Number of agents: {len(obs)}")
|
||||
|
||||
for step in range(10):
|
||||
actions = {
|
||||
agent_id: multi_env.single_action_space.sample()
|
||||
for agent_id in obs.keys()
|
||||
}
|
||||
obs, rewards, dones, infos = multi_env.step(actions)
|
||||
|
||||
print(f"Step {step}: mean_reward={np.mean(list(rewards.values())):.3f}")
|
||||
|
||||
if dones.get('__all__', False):
|
||||
obs = multi_env.reset()
|
||||
print("Episode finished, resetting...")
|
||||
|
||||
print("\n✓ Environment tests passed!")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_environment()
|
||||
239
skills/pufferlib/scripts/train_template.py
Normal file
@@ -0,0 +1,239 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
PufferLib Training Template
|
||||
|
||||
This template provides a complete training script for reinforcement learning
|
||||
with PufferLib. Customize the environment, policy, and training configuration
|
||||
as needed for your use case.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import pufferlib
|
||||
from pufferlib import PuffeRL
|
||||
from pufferlib.pytorch import layer_init
|
||||
|
||||
|
||||
class Policy(nn.Module):
|
||||
"""Example policy network."""
|
||||
|
||||
def __init__(self, observation_space, action_space, hidden_size=256):
|
||||
super().__init__()
|
||||
|
||||
self.observation_space = observation_space
|
||||
self.action_space = action_space
|
||||
|
||||
# Encoder network
|
||||
self.encoder = nn.Sequential(
|
||||
layer_init(nn.Linear(observation_space.shape[0], hidden_size)),
|
||||
nn.ReLU(),
|
||||
layer_init(nn.Linear(hidden_size, hidden_size)),
|
||||
nn.ReLU()
|
||||
)
|
||||
|
||||
# Policy head (actor)
|
||||
self.actor = layer_init(nn.Linear(hidden_size, action_space.n), std=0.01)
|
||||
|
||||
# Value head (critic)
|
||||
self.critic = layer_init(nn.Linear(hidden_size, 1), std=1.0)
|
||||
|
||||
def forward(self, observations):
|
||||
"""Forward pass through policy."""
|
||||
features = self.encoder(observations)
|
||||
logits = self.actor(features)
|
||||
value = self.critic(features)
|
||||
return logits, value
|
||||
|
||||
|
||||
def make_env():
|
||||
"""Create environment. Customize this for your task."""
|
||||
# Option 1: Use Ocean environment
|
||||
return pufferlib.make('procgen-coinrun', num_envs=256)
|
||||
|
||||
# Option 2: Use Gymnasium environment
|
||||
# return pufferlib.make('gym-CartPole-v1', num_envs=256)
|
||||
|
||||
# Option 3: Use custom environment
|
||||
# from my_envs import MyEnvironment
|
||||
# return pufferlib.emulate(MyEnvironment, num_envs=256)
|
||||
|
||||
|
||||
def create_policy(env):
|
||||
"""Create policy network."""
|
||||
return Policy(
|
||||
observation_space=env.observation_space,
|
||||
action_space=env.action_space,
|
||||
hidden_size=256
|
||||
)
|
||||
|
||||
|
||||
def train(args):
|
||||
"""Main training function."""
|
||||
# Set random seeds
|
||||
torch.manual_seed(args.seed)
|
||||
|
||||
# Create environment
|
||||
print(f"Creating environment with {args.num_envs} parallel environments...")
|
||||
env = pufferlib.make(
|
||||
args.env_name,
|
||||
num_envs=args.num_envs,
|
||||
num_workers=args.num_workers
|
||||
)
|
||||
|
||||
# Create policy
|
||||
print("Initializing policy...")
|
||||
policy = create_policy(env)
|
||||
|
||||
if args.device == 'cuda' and torch.cuda.is_available():
|
||||
policy = policy.cuda()
|
||||
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
|
||||
else:
|
||||
print("Using CPU")
|
||||
|
||||
# Create logger
|
||||
if args.logger == 'wandb':
|
||||
from pufferlib import WandbLogger
|
||||
logger = WandbLogger(
|
||||
project=args.project,
|
||||
name=args.exp_name,
|
||||
config=vars(args)
|
||||
)
|
||||
elif args.logger == 'neptune':
|
||||
from pufferlib import NeptuneLogger
|
||||
logger = NeptuneLogger(
|
||||
project=args.project,
|
||||
name=args.exp_name,
|
||||
api_token=args.neptune_token
|
||||
)
|
||||
else:
|
||||
from pufferlib import NoLogger
|
||||
logger = NoLogger()
|
||||
|
||||
# Create trainer
|
||||
print("Creating trainer...")
|
||||
trainer = PuffeRL(
|
||||
env=env,
|
||||
policy=policy,
|
||||
device=args.device,
|
||||
learning_rate=args.learning_rate,
|
||||
batch_size=args.batch_size,
|
||||
n_epochs=args.n_epochs,
|
||||
gamma=args.gamma,
|
||||
gae_lambda=args.gae_lambda,
|
||||
clip_coef=args.clip_coef,
|
||||
ent_coef=args.ent_coef,
|
||||
vf_coef=args.vf_coef,
|
||||
max_grad_norm=args.max_grad_norm,
|
||||
logger=logger,
|
||||
compile=args.compile
|
||||
)
|
||||
|
||||
# Training loop
|
||||
print(f"Starting training for {args.num_iterations} iterations...")
|
||||
for iteration in range(1, args.num_iterations + 1):
|
||||
# Collect rollouts
|
||||
rollout_data = trainer.evaluate()
|
||||
|
||||
# Train on batch
|
||||
train_metrics = trainer.train()
|
||||
|
||||
# Log results
|
||||
trainer.mean_and_log()
|
||||
|
||||
# Save checkpoint
|
||||
if iteration % args.save_freq == 0:
|
||||
checkpoint_path = f"{args.checkpoint_dir}/checkpoint_{iteration}.pt"
|
||||
trainer.save_checkpoint(checkpoint_path)
|
||||
print(f"Saved checkpoint to {checkpoint_path}")
|
||||
|
||||
# Print progress
|
||||
if iteration % args.log_freq == 0:
|
||||
mean_reward = rollout_data.get('mean_reward', 0)
|
||||
sps = rollout_data.get('sps', 0)
|
||||
print(f"Iteration {iteration}/{args.num_iterations} | "
|
||||
f"Mean Reward: {mean_reward:.2f} | "
|
||||
f"SPS: {sps:,.0f}")
|
||||
|
||||
print("Training complete!")
|
||||
|
||||
# Save final model
|
||||
final_path = f"{args.checkpoint_dir}/final_model.pt"
|
||||
trainer.save_checkpoint(final_path)
|
||||
print(f"Saved final model to {final_path}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='PufferLib Training')
|
||||
|
||||
# Environment
|
||||
parser.add_argument('--env-name', type=str, default='procgen-coinrun',
|
||||
help='Environment name')
|
||||
parser.add_argument('--num-envs', type=int, default=256,
|
||||
help='Number of parallel environments')
|
||||
parser.add_argument('--num-workers', type=int, default=8,
|
||||
help='Number of vectorization workers')
|
||||
|
||||
# Training
|
||||
parser.add_argument('--num-iterations', type=int, default=10000,
|
||||
help='Number of training iterations')
|
||||
parser.add_argument('--learning-rate', type=float, default=3e-4,
|
||||
help='Learning rate')
|
||||
parser.add_argument('--batch-size', type=int, default=32768,
|
||||
help='Batch size for training')
|
||||
parser.add_argument('--n-epochs', type=int, default=4,
|
||||
help='Number of training epochs per batch')
|
||||
parser.add_argument('--device', type=str, default='cuda',
|
||||
choices=['cuda', 'cpu'], help='Device to use')
|
||||
|
||||
# PPO Parameters
|
||||
parser.add_argument('--gamma', type=float, default=0.99,
|
||||
help='Discount factor')
|
||||
parser.add_argument('--gae-lambda', type=float, default=0.95,
|
||||
help='GAE lambda')
|
||||
parser.add_argument('--clip-coef', type=float, default=0.2,
|
||||
help='PPO clipping coefficient')
|
||||
parser.add_argument('--ent-coef', type=float, default=0.01,
|
||||
help='Entropy coefficient')
|
||||
parser.add_argument('--vf-coef', type=float, default=0.5,
|
||||
help='Value function coefficient')
|
||||
parser.add_argument('--max-grad-norm', type=float, default=0.5,
|
||||
help='Maximum gradient norm')
|
||||
|
||||
# Logging
|
||||
parser.add_argument('--logger', type=str, default='none',
|
||||
choices=['wandb', 'neptune', 'none'],
|
||||
help='Logger to use')
|
||||
parser.add_argument('--project', type=str, default='pufferlib-training',
|
||||
help='Project name for logging')
|
||||
parser.add_argument('--exp-name', type=str, default='experiment',
|
||||
help='Experiment name')
|
||||
parser.add_argument('--neptune-token', type=str, default=None,
|
||||
help='Neptune API token')
|
||||
parser.add_argument('--log-freq', type=int, default=10,
|
||||
help='Logging frequency (iterations)')
|
||||
|
||||
# Checkpointing
|
||||
parser.add_argument('--checkpoint-dir', type=str, default='checkpoints',
|
||||
help='Directory to save checkpoints')
|
||||
parser.add_argument('--save-freq', type=int, default=100,
|
||||
help='Checkpoint save frequency (iterations)')
|
||||
|
||||
# Misc
|
||||
parser.add_argument('--seed', type=int, default=42,
|
||||
help='Random seed')
|
||||
parser.add_argument('--compile', action='store_true',
|
||||
help='Use torch.compile for faster training')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Create checkpoint directory
|
||||
import os
|
||||
os.makedirs(args.checkpoint_dir, exist_ok=True)
|
||||
|
||||
# Run training
|
||||
train(args)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||