# PufferLib Training Guide

## Overview

PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO + LSTM and augmented with the project's own research improvements. It trains at millions of steps per second through optimized vectorization and an efficient implementation.

## Training Workflow

### Basic Training Loop

The PuffeRL trainer provides three core methods:

```python
# Collect environment interactions
rollout_data = trainer.evaluate()

# Train on collected batch
train_metrics = trainer.train()

# Aggregate and log results
trainer.mean_and_log()
```

### CLI Training

Quick start training via command line:

```bash
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001

# Custom configuration
puffer train environment_name \
    --train.device cuda \
    --train.batch-size 32768 \
    --train.learning-rate 0.0003 \
    --train.num-iterations 10000
```

### Python Training Script

```python
import pufferlib
from pufferlib import PuffeRL

# Initialize vectorized environment
env = pufferlib.make('environment_name', num_envs=256)

# Create trainer (my_policy is your policy network, e.g. a torch.nn.Module)
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)

# Training loop
num_iterations = 10_000
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()

    # Train on the collected batch
    train_metrics = trainer.train()

    # Aggregate and log results
    trainer.mean_and_log()
```

## Key Training Parameters

### Core Hyperparameters

- **learning_rate**: Learning rate for the optimizer (default: 3e-4)
- **batch_size**: Number of timesteps per training batch (default: 32768)
- **n_epochs**: Number of training epochs per batch (default: 4)
- **num_envs**: Number of parallel environments (default: 256)
- **num_steps**: Steps per environment per rollout (default: 128)

Note that the defaults are mutually consistent: batch_size = num_envs × num_steps = 256 × 128 = 32768 timesteps per rollout.

### PPO Parameters

- **gamma**: Discount factor (default: 0.99)
- **gae_lambda**: Lambda for GAE calculation (default: 0.95; see the sketch below)
- **clip_coef**: PPO clipping coefficient (default: 0.2)
- **ent_coef**: Entropy coefficient for exploration (default: 0.01)
- **vf_coef**: Value function loss coefficient (default: 0.5)
- **max_grad_norm**: Maximum gradient norm for clipping (default: 0.5)
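
The roles of **gamma** and **gae_lambda** are easiest to see in code. Below is a minimal sketch of Generalized Advantage Estimation using standard PPO math; it illustrates the parameters rather than PufferLib's internal implementation, and the tensor names are illustrative:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of T steps.

    For simplicity, this assumes the rollout ends at an episode
    boundary, so the final step bootstraps from a value of zero.
    """
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        mask = 1.0 - dones[t].float()  # zero the bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_gae = delta + gamma * gae_lambda * mask * last_gae
        advantages[t] = last_gae
    return advantages
```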

### Performance Parameters

- **device**: Computing device ('cuda' or 'cpu')
- **compile**: Use torch.compile for faster training (default: True)
- **num_workers**: Number of vectorization workers (default: auto)
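
Assuming these map to constructor keyword arguments in the same way as the PPO parameters above (worth verifying against your PufferLib version), a minimal sketch:

```python
# Hypothetical: performance options passed alongside the PPO settings
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    compile=True,    # torch.compile the policy and update step
    num_workers=8    # vectorization workers; tune to your CPU
)
```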

## Distributed Training

### Multi-GPU Training

Use torchrun for distributed training across multiple GPUs:

```bash
torchrun --nproc_per_node=4 train.py \
    --train.device cuda \
    --train.batch-size 131072
```

Note that the batch size here scales with the process count: 131072 is 4 × the single-process default of 32768.

### Multi-Node Training

For distributed training across multiple nodes:

```bash
# On main node (rank 0)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py

# On worker nodes (rank 1, 2, 3)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=NODE_RANK \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py
```

## Monitoring and Logging

### Logger Integration

PufferLib supports multiple logging backends:

#### Weights & Biases

```python
from pufferlib import WandbLogger

logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config  # dict of hyperparameters to record with the run
)

trainer = PuffeRL(env, policy, logger=logger)
```

#### Neptune

```python
from pufferlib import NeptuneLogger

logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)

trainer = PuffeRL(env, policy, logger=logger)
```

#### No Logger

```python
from pufferlib import NoLogger

trainer = PuffeRL(env, policy, logger=NoLogger())
```

### Key Metrics

Training logs include:

- **Performance Metrics**:
  - Steps per second (SPS)
  - Training throughput
  - Wall-clock time per iteration

- **Learning Metrics**:
  - Episode rewards (mean, min, max)
  - Episode lengths
  - Value function loss
  - Policy loss
  - Entropy
  - Explained variance
  - Clipfrac

- **Environment Metrics**:
  - Environment-specific rewards
  - Success rates
  - Custom metrics
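
Most of these are standard PPO diagnostics. Two of the less obvious ones, explained variance and clipfrac, can be computed as follows (generic definitions with illustrative tensor names, not PufferLib internals):

```python
import torch

def explained_variance(values, returns):
    # 1 - Var[returns - values] / Var[returns]; 1.0 means a perfect value fit
    return float(1.0 - (returns - values).var() / (returns.var() + 1e-8))

def clipfrac(ratio, clip_coef=0.2):
    # Fraction of policy ratios clipped by the PPO surrogate objective
    return float(((ratio - 1.0).abs() > clip_coef).float().mean())
```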

### Terminal Dashboard

PufferLib provides a real-time terminal dashboard showing:

- Training progress
- Current SPS
- Episode statistics
- Loss values
- GPU utilization

## Checkpointing

### Saving Checkpoints

```python
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')

# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
```

### Loading Checkpoints

```python
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')

# Resume training
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```
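
A common pattern is to checkpoint on a fixed cadence inside the training loop. A minimal sketch using the `save_checkpoint` method shown above (the cadence and filename scheme are illustrative):

```python
save_every = 100  # iterations between checkpoints

for iteration in range(num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()

    # Write a numbered checkpoint every save_every iterations
    if iteration % save_every == 0:
        trainer.save_checkpoint(f'checkpoint_{iteration:06d}.pt')
```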

## Hyperparameter Tuning with Protein

The Protein system enables automatic hyperparameter and reward tuning:

```python
from pufferlib import Protein

# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}

# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)

best_config = protein.optimize()
```
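
Assuming `best_config` maps parameter names to values matching PuffeRL's keyword arguments (an assumption to verify against your version), it can be fed straight back into a fresh trainer:

```python
# Hypothetical: retrain with the best configuration found by the search
trainer = PuffeRL(env=env, policy=my_policy, **best_config)
```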

## Performance Optimization Tips

### Maximizing Throughput

1. **Batch Size**: Increase batch_size to fully utilize the GPU
2. **Num Envs**: Tune num_envs to balance CPU and GPU utilization
3. **Compile**: Enable torch.compile for a 10-20% speedup
4. **Workers**: Adjust num_workers based on environment complexity
5. **Device**: Use 'cuda' whenever a GPU is available for neural network training

### Environment Speed

- Pure Python environments: ~100k-500k SPS
- C-based environments: ~4M SPS
- With training overhead: ~1M-4M total SPS

### Memory Management

- Reduce batch_size if running out of GPU memory
- Decrease num_envs if running out of CPU memory
- Use gradient accumulation for large effective batch sizes (see the sketch below)
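
A minimal sketch of gradient accumulation with a generic PyTorch optimizer. This is the general pattern, not PufferLib API; `policy.loss` is a hypothetical loss hook standing in for your PPO loss computation:

```python
import torch

def train_accumulated(policy, optimizer, minibatches, accum_steps=4):
    """Accumulate gradients over several minibatches before stepping,
    giving an effective batch of accum_steps * minibatch size."""
    optimizer.zero_grad()
    for i, batch in enumerate(minibatches):
        loss = policy.loss(batch)        # hypothetical loss hook
        (loss / accum_steps).backward()  # scale so accumulated grads average
        if (i + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
            optimizer.step()
            optimizer.zero_grad()
```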

## Common Training Patterns

### Curriculum Learning

```python
# Start with easy tasks, gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]

for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    trainer = PuffeRL(env, policy)

    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```

Because the policy object is reused across levels, learned weights carry over even though the trainer (and its optimizer state) is recreated at each difficulty step.

### Reward Shaping

```python
# Wrap environment with custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)

        # Add shaped rewards; proximity_bonus is an environment-specific
        # term that you compute from obs or infos for your task
        shaped_rewards = rewards + 0.1 * proximity_bonus

        return obs, shaped_rewards, dones, infos
```

### Multi-Stage Training

```python
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},  # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},  # Main training
    {'learning_rate': 1e-4, 'iterations': 2000}   # Fine-tuning
]

for stage in stages:
    # Assumes the trainer exposes learning_rate as a mutable attribute;
    # if it does not, recreate the trainer with the new value instead
    trainer.learning_rate = stage['learning_rate']
    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```

## Troubleshooting

### Low Performance

- Check that the environment is vectorized correctly
- Verify GPU utilization with `nvidia-smi`
- Increase batch_size to saturate the GPU
- Enable compile mode
- Profile with `torch.profiler` (see the sketch below)
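
A minimal profiling sketch around one rollout/update cycle, using the trainer from the earlier sections:

```python
from torch.profiler import profile, ProfilerActivity

# Profile a single evaluate/train cycle on both CPU and CUDA
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.evaluate()
    trainer.train()

# Print the ten most expensive operations by total CUDA time
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))
```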

### Training Instability

- Reduce learning_rate
- Decrease batch_size
- Increase num_envs for more diverse samples
- Increase ent_coef to encourage more exploration
- Check reward scaling

### Memory Issues

- Reduce batch_size or num_envs
- Use gradient accumulation
- Disable compile mode if it causes OOM
- Check for memory leaks in custom environments