# PufferLib Training Guide

## Overview

PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO with LSTMs and enhanced with proprietary research improvements. It achieves training at millions of steps per second through optimized vectorization and an efficient implementation.
## Training Workflow

### Basic Training Loop

The PuffeRL trainer provides three core methods:

```python
# Collect environment interactions
rollout_data = trainer.evaluate()

# Train on collected batch
train_metrics = trainer.train()

# Aggregate and log results
trainer.mean_and_log()
```
### CLI Training

Quick-start training from the command line:

```bash
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001

# Custom configuration
puffer train environment_name \
    --train.device cuda \
    --train.batch-size 32768 \
    --train.learning-rate 0.0003 \
    --train.num-iterations 10000
```
### Python Training Script

```python
import pufferlib
from pufferlib import PuffeRL

# Initialize environment
env = pufferlib.make('environment_name', num_envs=256)

# Create trainer (my_policy is your policy network)
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)

# Training loop
num_iterations = 10_000
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()

    # Train on batch
    train_metrics = trainer.train()

    # Log results
    trainer.mean_and_log()
```
## Key Training Parameters

### Core Hyperparameters

- **learning_rate**: Learning rate for the optimizer (default: 3e-4)
- **batch_size**: Number of timesteps per training batch (default: 32768)
- **n_epochs**: Number of training epochs per batch (default: 4)
- **num_envs**: Number of parallel environments (default: 256)
- **num_steps**: Steps per environment per rollout (default: 128)
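Note that with the defaults above, one rollout yields num_envs × num_steps = 256 × 128 = 32,768 timesteps, which lines up with the default batch_size.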
### PPO Parameters

Most of these are coefficients in the standard clipped PPO objective and the GAE advantage estimator; reference formulas follow the list.

- **gamma**: Discount factor (default: 0.99)
- **gae_lambda**: Lambda for GAE calculation (default: 0.95)
- **clip_coef**: PPO clipping coefficient (default: 0.2)
- **ent_coef**: Entropy coefficient for exploration (default: 0.01)
- **vf_coef**: Value function loss coefficient (default: 0.5)
- **max_grad_norm**: Maximum gradient norm for clipping (default: 0.5)
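For reference, the standard definitions these coefficients plug into (textbook PPO/GAE notation, not PufferLib-specific symbols) are:

$$
\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

$$
L(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big] - c_{vf}\, L^{VF} + c_{ent}\, S[\pi_\theta]
$$

where $\rho_t(\theta)$ is the policy probability ratio, $\epsilon$ is `clip_coef`, $c_{vf}$ is `vf_coef`, $c_{ent}$ is `ent_coef`, $\lambda$ is `gae_lambda`, and $\gamma$ is `gamma`. The objective is maximized, and gradient norms are clipped to `max_grad_norm` before each optimizer step.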
### Performance Parameters

- **device**: Computing device ('cuda' or 'cpu')
- **compile**: Use torch.compile for faster training (default: True)
- **num_workers**: Number of vectorization workers (default: auto)
## Distributed Training

### Multi-GPU Training

Use torchrun for distributed training across multiple GPUs:

```bash
torchrun --nproc_per_node=4 train.py \
    --train.device cuda \
    --train.batch-size 131072
```
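Inside `train.py`, each process launched by torchrun typically needs the usual distributed setup before building its trainer. A minimal sketch of that boilerplate, assuming you wire the trainer up yourself (how PuffeRL consumes the process group may differ):

```python
import os

import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Build the environment and PuffeRL trainer here, one copy per process,
# exactly as in the single-GPU script above.
```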
### Multi-Node Training

For distributed training across multiple nodes:

```bash
# On main node (rank 0)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py

# On worker nodes (rank 1, 2, 3)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=NODE_RANK \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py
```
## Monitoring and Logging

### Logger Integration

PufferLib supports multiple logging backends:

#### Weights & Biases

```python
from pufferlib import WandbLogger

logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config
)

trainer = PuffeRL(env, policy, logger=logger)
```

#### Neptune

```python
from pufferlib import NeptuneLogger

logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)

trainer = PuffeRL(env, policy, logger=logger)
```

#### No Logger

```python
from pufferlib import NoLogger

trainer = PuffeRL(env, policy, logger=NoLogger())
```
### Key Metrics

Training logs include:

- **Performance Metrics**:
  - Steps per second (SPS)
  - Training throughput
  - Wall-clock time per iteration

- **Learning Metrics**:
  - Episode rewards (mean, min, max)
  - Episode lengths
  - Value function loss
  - Policy loss
  - Entropy
  - Explained variance
  - Clipfrac

- **Environment Metrics**:
  - Environment-specific rewards
  - Success rates
  - Custom metrics
### Terminal Dashboard

PufferLib provides a real-time terminal dashboard showing:

- Training progress
- Current SPS
- Episode statistics
- Loss values
- GPU utilization
## Checkpointing

### Saving Checkpoints

```python
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')

# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
```

### Loading Checkpoints

```python
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')

# Resume training
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```
## Hyperparameter Tuning with Protein

The Protein system enables automatic hyperparameter and reward tuning:

```python
from pufferlib import Protein

# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}

# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)

best_config = protein.optimize()
```
## Performance Optimization Tips

### Maximizing Throughput

1. **Batch Size**: Increase batch_size to fully utilize the GPU
2. **Num Envs**: Tune num_envs to balance CPU and GPU utilization
3. **Compile**: Enable torch.compile for roughly a 10-20% speedup
4. **Workers**: Adjust num_workers based on environment complexity
5. **Device**: Always use 'cuda' for neural network training (see the combined sketch after this list)
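A sketch combining these tips into one configuration. The values are illustrative rather than tuned recommendations, and it assumes `compile` is accepted as a constructor argument like the other parameters listed above:

```python
import pufferlib
from pufferlib import PuffeRL

# More parallel environments keep the GPU fed (tip 2)
env = pufferlib.make('environment_name', num_envs=512)

trainer = PuffeRL(
    env=env,
    policy=my_policy,   # your policy network
    device='cuda',      # tip 5: always train on GPU
    batch_size=65536,   # tip 1: larger batches to saturate the GPU
    compile=True,       # tip 3: enable torch.compile
)
```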
### Environment Speed

- Pure Python environments: ~100k-500k SPS
- C-based environments: ~4M SPS
- With training overhead: ~1M-4M total SPS
### Memory Management

- Reduce batch_size if running out of GPU memory
- Decrease num_envs if running out of CPU memory
- Use gradient accumulation for large effective batch sizes (see the sketch after this list)
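Gradient accumulation is not shown elsewhere in this guide, so here is a minimal, generic PyTorch sketch of the idea (a standalone toy example, not PuffeRL's internal update loop):

```python
import torch

# Toy model and stand-in data just to make the example runnable
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
micro_batches = [torch.randn(32, 16) for _ in range(8)]

accum_steps = 4  # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for i, x in enumerate(micro_batches):
    # Scale the loss so accumulated gradients average over the large batch
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```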
## Common Training Patterns

### Curriculum Learning

```python
# Start with easy tasks, gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]

for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    trainer = PuffeRL(env, policy)

    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
### Reward Shaping

```python
# Wrap environment with custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)

        # Add shaped rewards; proximity_bonus is an illustrative,
        # environment-specific signal you compute yourself
        proximity_bonus = self._proximity_bonus()
        shaped_rewards = rewards + 0.1 * proximity_bonus

        return obs, shaped_rewards, dones, infos
```
### Multi-Stage Training

```python
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},   # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},   # Main training
    {'learning_rate': 1e-4, 'iterations': 2000},   # Fine-tuning
]

for stage in stages:
    # Depending on the trainer, you may instead need to update the
    # optimizer's param-group learning rate directly
    trainer.learning_rate = stage['learning_rate']
    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
## Troubleshooting

### Low Performance

- Check that the environment is vectorized correctly
- Verify GPU utilization with `nvidia-smi`
- Increase batch_size to saturate the GPU
- Enable compile mode
- Profile with `torch.profiler` (see the sketch after this list)
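A minimal profiling sketch around one iteration, assuming the `trainer` object from the training script above:

```python
from torch.profiler import profile, ProfilerActivity

# Profile a single evaluate/train iteration to see where time goes
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.evaluate()
    trainer.train()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```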
### Training Instability

- Reduce learning_rate
- Decrease batch_size
- Increase num_envs for more diverse samples
- Raise ent_coef for more exploration
- Check reward scaling

### Memory Issues

- Reduce batch_size or num_envs
- Use gradient accumulation
- Disable compile mode if it causes out-of-memory errors
- Check for memory leaks in custom environments