PufferLib Training Guide
Overview
PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO with LSTM support and extended with the project's own research improvements. Through optimized vectorization and an efficient implementation, it can train at millions of environment steps per second.
Training Workflow
Basic Training Loop
The PuffeRL trainer provides three core methods:
# Collect environment interactions
rollout_data = trainer.evaluate()
# Train on collected batch
train_metrics = trainer.train()
# Aggregate and log results
trainer.mean_and_log()
CLI Training
Quick-start training from the command line:
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001
# Custom configuration
puffer train environment_name \
--train.device cuda \
--train.batch-size 32768 \
--train.learning-rate 0.0003 \
--train.num-iterations 10000
Python Training Script
import pufferlib
from pufferlib import PuffeRL
# Initialize the vectorized environment
env = pufferlib.make('environment_name', num_envs=256)
# Create the trainer (my_policy is your policy network, e.g. a torch.nn.Module)
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)
# Training loop
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()
    # Train on the collected batch
    train_metrics = trainer.train()
    # Aggregate and log results
    trainer.mean_and_log()
Key Training Parameters
Core Hyperparameters
- learning_rate: Learning rate for optimizer (default: 3e-4)
- batch_size: Number of timesteps per training batch (default: 32768)
- n_epochs: Number of training epochs per batch (default: 4)
- num_envs: Number of parallel environments (default: 256)
- num_steps: Steps per environment per rollout (default: 128)
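These sizes are linked: each rollout gathers num_envs * num_steps transitions, and batch_size is typically set to that product. With the defaults above, 256 × 128 = 32768, which matches the default batch_size. A quick sanity check:
# Sanity check with the default values listed above
num_envs, num_steps, batch_size = 256, 128, 32768
assert num_envs * num_steps == batch_size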
PPO Parameters
- gamma: Discount factor (default: 0.99)
- gae_lambda: Lambda for GAE calculation (default: 0.95)
- clip_coef: PPO clipping coefficient (default: 0.2)
- ent_coef: Entropy coefficient for exploration (default: 0.01)
- vf_coef: Value function loss coefficient (default: 0.5)
- max_grad_norm: Maximum gradient norm for clipping (default: 0.5)
Performance Parameters
- device: Computing device ('cuda' or 'cpu')
- compile: Use torch.compile for faster training (default: True)
- num_workers: Number of vectorization workers (default: auto)
Distributed Training
Multi-GPU Training
Use torchrun for distributed training across multiple GPUs:
torchrun --nproc_per_node=4 train.py \
--train.device cuda \
--train.batch-size 131072
Multi-Node Training
For distributed training across multiple nodes:
# On main node (rank 0)
torchrun --nproc_per_node=8 \
--nnodes=4 \
--node_rank=0 \
--master_addr=MASTER_IP \
--master_port=29500 \
train.py
# On worker nodes (rank 1, 2, 3)
torchrun --nproc_per_node=8 \
--nnodes=4 \
--node_rank=NODE_RANK \
--master_addr=MASTER_IP \
--master_port=29500 \
train.py
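torchrun launches one process per GPU and sets rendezvous variables such as LOCAL_RANK; train.py is then responsible for initializing the process group before building the trainer. A minimal sketch of such an entry point, assuming PuffeRL synchronizes gradients itself once torch.distributed is initialized (an assumption, not confirmed PufferLib behavior; my_policy and num_iterations are as in the script above):
import os
import torch
import pufferlib
from pufferlib import PuffeRL

def main():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend='nccl')
    env = pufferlib.make('environment_name', num_envs=256)
    trainer = PuffeRL(env=env, policy=my_policy, device=f'cuda:{local_rank}')
    for iteration in range(num_iterations):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()

if __name__ == '__main__':
    main()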
Monitoring and Logging
Logger Integration
PufferLib supports multiple logging backends:
Weights & Biases
from pufferlib import WandbLogger
logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config
)
trainer = PuffeRL(env, policy, logger=logger)
Neptune
from pufferlib import NeptuneLogger
logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)
trainer = PuffeRL(env, policy, logger=logger)
No Logger
from pufferlib import NoLogger
trainer = PuffeRL(env, policy, logger=NoLogger())
Key Metrics
Training logs include:
- Performance Metrics:
  - Steps per second (SPS)
  - Training throughput
  - Wall-clock time per iteration
- Learning Metrics:
  - Episode rewards (mean, min, max)
  - Episode lengths
  - Value function loss
  - Policy loss
  - Entropy
  - Explained variance
  - Clipfrac
- Environment Metrics:
  - Environment-specific rewards
  - Success rates
  - Custom metrics
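If you want programmatic access rather than a dashboard, the return value of train() can be inspected directly. A sketch, assuming train() returns a dict of scalar metrics (an assumption; the key names below are illustrative, not confirmed PufferLib names):
# Key names are illustrative; check your logger output for the exact names
train_metrics = trainer.train()
for key in ('policy_loss', 'value_loss', 'entropy'):
    if key in train_metrics:
        print(f'{key}: {train_metrics[key]:.4f}')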
Terminal Dashboard
PufferLib provides a real-time terminal dashboard showing:
- Training progress
- Current SPS
- Episode statistics
- Loss values
- GPU utilization
Checkpointing
Saving Checkpoints
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')
# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
Loading Checkpoints
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')
# Resume training
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
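A common pattern is to save on a fixed interval inside the training loop, using the same save_checkpoint API shown above (checkpoint_interval is an illustrative name, not a PufferLib parameter):
checkpoint_interval = 100  # iterations between saves
for iteration in range(num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
    if iteration % checkpoint_interval == 0:
        trainer.save_checkpoint(
            f'checkpoint_{iteration}.pt',
            metadata={'iteration': iteration}
        )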
Hyperparameter Tuning with Protein
The Protein system enables automatic hyperparameter and reward tuning:
from pufferlib import Protein
# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}
# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)
best_config = protein.optimize()
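The winning configuration can then seed a full-length run. A sketch, assuming best_config is a plain dict of PuffeRL keyword arguments (an assumption about Protein's return type):
# Train a fresh run with the best configuration found by the search
env = pufferlib.make('environment_name', num_envs=256)
trainer = PuffeRL(env=env, policy=my_policy, device='cuda', **best_config)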
Performance Optimization Tips
Maximizing Throughput
- Batch Size: Increase batch_size to fully utilize GPU
- Num Envs: Balance between CPU and GPU utilization
- Compile: Enable torch.compile for 10-20% speedup
- Workers: Adjust num_workers based on environment complexity
- Device: Always use 'cuda' for neural network training
Environment Speed
- Pure Python environments: ~100k-500k SPS
- C-based environments: ~4M SPS
- With training overhead: ~1M-4M total SPS
Memory Management
- Reduce batch_size if running out of GPU memory
- Decrease num_envs if running out of CPU memory
- Use gradient accumulation for large effective batch sizes (see the sketch below)
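PuffeRL's update step is internal to the trainer, but if you are writing a custom update loop, the accumulation pattern looks like this in plain PyTorch (a generic, self-contained sketch, not PufferLib API):
import torch

# Toy model and data purely to illustrate the pattern
policy = torch.nn.Linear(8, 4)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
minibatches = [torch.randn(64, 8) for _ in range(8)]

accum_steps = 4  # effective batch = accum_steps * 64 samples
optimizer.zero_grad()
for i, batch in enumerate(minibatches):
    loss = policy(batch).pow(2).mean()  # stand-in for the PPO loss
    (loss / accum_steps).backward()     # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
        optimizer.step()
        optimizer.zero_grad()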
Common Training Patterns
Curriculum Learning
# Start with easy tasks, gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]
for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    trainer = PuffeRL(env, policy)
    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
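Because the same policy object is handed to each new trainer, the weights learned at one difficulty carry over to the next level. Note that the difficulty keyword is environment-specific; it only works for environments that expose such a parameter.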
Reward Shaping
# Wrap environment with custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)
        # Add shaped rewards; proximity_bonus stands in for an
        # environment-specific signal you compute from the state
        shaped_rewards = rewards + 0.1 * proximity_bonus
        return obs, shaped_rewards, dones, infos
Multi-Stage Training
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},  # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},  # Main training
    {'learning_rate': 1e-4, 'iterations': 2000}   # Fine-tuning
]
for stage in stages:
    trainer.learning_rate = stage['learning_rate']
    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
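Whether assigning trainer.learning_rate propagates to the underlying optimizer depends on how PuffeRL exposes it. If the trainer makes its torch optimizer accessible (trainer.optimizer here is an assumption), the standard PyTorch way is to update the param groups directly:
# Assumes the trainer exposes its torch optimizer as trainer.optimizer
for group in trainer.optimizer.param_groups:
    group['lr'] = stage['learning_rate']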
Troubleshooting
Low Performance
- Check that the environment is vectorized correctly
- Verify GPU utilization with nvidia-smi
- Increase batch_size to saturate the GPU
- Enable compile mode
- Profile with torch.profiler (see the sketch below)
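For the profiling step, torch.profiler can wrap one full iteration to show where time goes (standard PyTorch API; only the trainer calls come from this guide):
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one collect/train iteration on both CPU and GPU
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.evaluate()
    trainer.train()
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))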
Training Instability
- Reduce learning_rate
- Decrease batch_size
- Increase num_envs for more diverse samples
- Increase ent_coef for more exploration
- Check reward scaling
Memory Issues
- Reduce batch_size or num_envs
- Use gradient accumulation
- Disable compile mode if causing OOM
- Check for memory leaks in custom environments