# PufferLib Training Guide
## Overview
PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO with LSTM support and extended with original research improvements. It trains at millions of steps per second through optimized vectorization and an efficient implementation.
## Training Workflow
### Basic Training Loop
The PuffeRL trainer provides three core methods:
```python
# Collect environment interactions
rollout_data = trainer.evaluate()
# Train on collected batch
train_metrics = trainer.train()
# Aggregate and log results
trainer.mean_and_log()
```
### CLI Training
Quick start training via command line:
```bash
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001
# Custom configuration
puffer train environment_name \
    --train.device cuda \
    --train.batch-size 32768 \
    --train.learning-rate 0.0003 \
    --train.num-iterations 10000
```
### Python Training Script
```python
import pufferlib
from pufferlib import PuffeRL
# Initialize environment
env = pufferlib.make('environment_name', num_envs=256)
# Create trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,  # your policy network (a torch.nn.Module)
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)
# Training loop
num_iterations = 10000
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()
    # Train on batch
    train_metrics = trainer.train()
    # Log results
    trainer.mean_and_log()
```
## Key Training Parameters
### Core Hyperparameters
- **learning_rate**: Learning rate for optimizer (default: 3e-4)
- **batch_size**: Number of timesteps per training batch (default: 32768)
- **n_epochs**: Number of training epochs per batch (default: 4)
- **num_envs**: Number of parallel environments (default: 256)
- **num_steps**: Steps per environment per rollout (default: 128)
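These defaults fit together: a single rollout collects `num_envs * num_steps` timesteps, which matches the default `batch_size` exactly.
```python
# How the rollout defaults relate to the default batch size
num_envs = 256
num_steps = 128
batch_size = num_envs * num_steps  # 32768, the default training batch
```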
### PPO Parameters
- **gamma**: Discount factor (default: 0.99)
- **gae_lambda**: Lambda for GAE calculation (default: 0.95)
- **clip_coef**: PPO clipping coefficient (default: 0.2)
- **ent_coef**: Entropy coefficient for exploration (default: 0.01)
- **vf_coef**: Value function loss coefficient (default: 0.5)
- **max_grad_norm**: Maximum gradient norm for clipping (default: 0.5)
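These coefficients follow the standard PPO formulation. As a rough sketch of where each one enters the update (standard CleanRL-style math, not necessarily PufferLib's exact internals):
```python
import torch

# Sketch of where the PPO coefficients enter the update. This follows
# the standard CleanRL-style formulation; PufferLib's internals may differ.

def compute_gae(rewards, values, dones, next_value,
                gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of T steps."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        nxt = next_value if t == T - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]  # no bootstrap past episode ends
        delta = rewards[t] + gamma * nxt * nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * nonterminal * last_gae
        advantages[t] = last_gae
    return advantages

def ppo_loss(new_logprob, old_logprob, advantages, new_values, returns,
             entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    ratio = (new_logprob - old_logprob).exp()
    # clip_coef bounds how far a single update can move the policy
    pg_loss = torch.max(
        -advantages * ratio,
        -advantages * ratio.clamp(1 - clip_coef, 1 + clip_coef),
    ).mean()
    v_loss = 0.5 * (new_values - returns).pow(2).mean()
    # ent_coef encourages exploration; vf_coef weights the critic loss
    return pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()
```
`max_grad_norm` is applied separately, via `torch.nn.utils.clip_grad_norm_`, between the backward pass and the optimizer step.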
### Performance Parameters
- **device**: Computing device ('cuda' or 'cpu')
- **compile**: Use torch.compile for faster training (default: True)
- **num_workers**: Number of vectorization workers (default: auto)
## Distributed Training
### Multi-GPU Training
Use torchrun for distributed training across multiple GPUs:
```bash
torchrun --nproc_per_node=4 train.py \
    --train.device cuda \
    --train.batch-size 131072
```
### Multi-Node Training
For distributed training across multiple nodes:
```bash
# On main node (rank 0)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py

# On worker nodes (ranks 1-3)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=NODE_RANK \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py
```
## Monitoring and Logging
### Logger Integration
PufferLib supports multiple logging backends:
#### Weights & Biases
```python
from pufferlib import WandbLogger
logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config
)
trainer = PuffeRL(env, policy, logger=logger)
```
#### Neptune
```python
from pufferlib import NeptuneLogger
logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)
trainer = PuffeRL(env, policy, logger=logger)
```
#### No Logger
```python
from pufferlib import NoLogger
trainer = PuffeRL(env, policy, logger=NoLogger())
```
### Key Metrics
Training logs include:
- **Performance Metrics**:
  - Steps per second (SPS)
  - Training throughput
  - Wall-clock time per iteration
- **Learning Metrics**:
  - Episode rewards (mean, min, max)
  - Episode lengths
  - Value function loss
  - Policy loss
  - Entropy
  - Explained variance
  - Clipfrac
- **Environment Metrics**:
  - Environment-specific rewards
  - Success rates
  - Custom metrics
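Most of these are standard PPO diagnostics. The two least self-explanatory can be computed as follows (standard definitions, as used in CleanRL):
```python
import torch

def explained_variance(values, returns):
    # 1.0: the value function predicts returns perfectly;
    # ~0.0: it does no better than predicting the mean return.
    return float(1.0 - (returns - values).var() / returns.var())

def clipfrac(ratio, clip_coef=0.2):
    # Fraction of samples whose probability ratio hit the PPO clip range.
    # Persistently high values suggest the policy is changing too fast.
    return float(((ratio - 1.0).abs() > clip_coef).float().mean())
```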
### Terminal Dashboard
PufferLib provides a real-time terminal dashboard showing:
- Training progress
- Current SPS
- Episode statistics
- Loss values
- GPU utilization
## Checkpointing
### Saving Checkpoints
```python
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')
# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
```
### Loading Checkpoints
```python
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')
# Resume training
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```
## Hyperparameter Tuning with Protein
The Protein system enables automatic hyperparameter and reward tuning:
```python
from pufferlib import Protein
# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}
# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)
best_config = protein.optimize()
```
## Performance Optimization Tips
### Maximizing Throughput
1. **Batch Size**: Increase batch_size to fully utilize GPU
2. **Num Envs**: Balance between CPU and GPU utilization
3. **Compile**: Enable torch.compile for 10-20% speedup
4. **Workers**: Adjust num_workers based on environment complexity
5. **Device**: Always use 'cuda' for neural network training
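For example, if you build the policy network yourself, compilation is a one-liner; the `compile` flag above is assumed to apply an equivalent wrapper internally:
```python
import torch

# Hand-compiling a policy network; PufferLib's `compile` flag is
# assumed to do something equivalent for you.
policy = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 8)
).to('cuda')
policy = torch.compile(policy, mode='reduce-overhead')
```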
### Environment Speed
- Pure Python environments: ~100k-500k SPS
- C-based environments: ~4M SPS
- With training overhead: ~1M-4M total SPS
### Memory Management
- Reduce batch_size if running out of GPU memory
- Decrease num_envs if running out of CPU memory
- Use gradient accumulation for large effective batch sizes
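The last point deserves a sketch: gradient accumulation splits one large logical batch into several smaller backward passes, trading a little speed for a much lower peak memory footprint. A generic PyTorch pattern, independent of PufferLib's trainer API:
```python
import torch

# Four micro-batches of 8192 samples produce the same averaged gradient
# as one batch of 32768, at a fraction of the peak activation memory.
policy = torch.nn.Linear(128, 8)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

batch = torch.randn(32768, 128)   # stand-in for a rollout batch
targets = torch.randn(32768, 8)
accum_steps = 4

optimizer.zero_grad()
for micro_x, micro_y in zip(batch.chunk(accum_steps), targets.chunk(accum_steps)):
    loss = torch.nn.functional.mse_loss(policy(micro_x), micro_y)
    (loss / accum_steps).backward()  # gradients accumulate across calls
optimizer.step()
```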
## Common Training Patterns
### Curriculum Learning
```python
# Start with easy tasks, gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]
for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    # Reuse the same policy object so learned weights carry over
    trainer = PuffeRL(env, policy)
    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
### Reward Shaping
```python
# Wrap environment with custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)
        # Add shaped rewards. `compute_proximity_bonus` is a placeholder
        # for whatever environment-specific shaping signal you compute.
        proximity_bonus = self.compute_proximity_bonus()
        shaped_rewards = rewards + 0.1 * proximity_bonus
        return obs, shaped_rewards, dones, infos
```
### Multi-Stage Training
```python
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},  # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},  # Main training
    {'learning_rate': 1e-4, 'iterations': 2000},  # Fine-tuning
]
for stage in stages:
    trainer.learning_rate = stage['learning_rate']
    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
## Troubleshooting
### Low Performance
- Check environment is vectorized correctly
- Verify GPU utilization with `nvidia-smi`
- Increase batch_size to saturate GPU
- Enable compile mode
- Profile with `torch.profiler`
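For the last point, a minimal profiling pass around a few iterations (reusing the `trainer` object from the examples above) could look like:
```python
from torch.profiler import profile, ProfilerActivity

# Profile a handful of iterations to see where the time goes
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        trainer.evaluate()
        trainer.train()
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))
```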
### Training Instability
- Reduce learning_rate
- Decrease batch_size
- Increase num_envs for more diverse samples
- Increase ent_coef to encourage more exploration
- Check reward scaling
### Memory Issues
- Reduce batch_size or num_envs
- Use gradient accumulation
- Disable compile mode if causing OOM
- Check for memory leaks in custom environments