# PufferLib Training Guide
## Overview
PuffeRL is PufferLib's high-performance training algorithm, based on CleanRL's PPO with LSTM support and extended with original research improvements. It trains at millions of steps per second through optimized vectorization and an efficient implementation.
## Training Workflow
### Basic Training Loop
The PuffeRL trainer provides three core methods:
```python
# Collect environment interactions
rollout_data = trainer.evaluate()
# Train on collected batch
train_metrics = trainer.train()
# Aggregate and log results
trainer.mean_and_log()
```
### CLI Training
Quick start training via command line:
```bash
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001
# Custom configuration
puffer train environment_name \
    --train.device cuda \
    --train.batch-size 32768 \
    --train.learning-rate 0.0003 \
    --train.num-iterations 10000
```
### Python Training Script
```python
import pufferlib
from pufferlib import PuffeRL
# Initialize environment
env = pufferlib.make('environment_name', num_envs=256)
# Create trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,  # your policy network (a torch.nn.Module)
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)
# Training loop
num_iterations = 10000
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()
    # Train on batch
    train_metrics = trainer.train()
    # Log results
    trainer.mean_and_log()
```
## Key Training Parameters
### Core Hyperparameters
- **learning_rate**: Learning rate for optimizer (default: 3e-4)
- **batch_size**: Number of timesteps per training batch (default: 32768)
- **n_epochs**: Number of training epochs per batch (default: 4)
- **num_envs**: Number of parallel environments (default: 256)
- **num_steps**: Steps per environment per rollout (default: 128)
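These defaults fit together: a single rollout collects `num_envs * num_steps` timesteps, which matches the default `batch_size` exactly.
```python
# How the rollout defaults relate to the default batch size
num_envs = 256
num_steps = 128
batch_size = num_envs * num_steps  # 32768, the default training batch
```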
### PPO Parameters
- **gamma**: Discount factor (default: 0.99)
- **gae_lambda**: Lambda for GAE calculation (default: 0.95)
- **clip_coef**: PPO clipping coefficient (default: 0.2)
- **ent_coef**: Entropy coefficient for exploration (default: 0.01)
- **vf_coef**: Value function loss coefficient (default: 0.5)
- **max_grad_norm**: Maximum gradient norm for clipping (default: 0.5)
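These coefficients follow the standard PPO formulation. As a rough sketch of where each one enters the update (standard CleanRL-style math, not necessarily PufferLib's exact internals):
```python
import torch

# Sketch of where the PPO coefficients enter the update. This follows
# the standard CleanRL-style formulation; PufferLib's internals may differ.

def compute_gae(rewards, values, dones, next_value,
                gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of T steps."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        nxt = next_value if t == T - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]  # no bootstrap past episode ends
        delta = rewards[t] + gamma * nxt * nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * nonterminal * last_gae
        advantages[t] = last_gae
    return advantages

def ppo_loss(new_logprob, old_logprob, advantages, new_values, returns,
             entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    ratio = (new_logprob - old_logprob).exp()
    # clip_coef bounds how far a single update can move the policy
    pg_loss = torch.max(
        -advantages * ratio,
        -advantages * ratio.clamp(1 - clip_coef, 1 + clip_coef),
    ).mean()
    v_loss = 0.5 * (new_values - returns).pow(2).mean()
    # ent_coef encourages exploration; vf_coef weights the critic loss
    return pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()
```
`max_grad_norm` is applied separately, via `torch.nn.utils.clip_grad_norm_`, between the backward pass and the optimizer step.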
### Performance Parameters
- **device**: Computing device ('cuda' or 'cpu')
- **compile**: Use torch.compile for faster training (default: True)
- **num_workers**: Number of vectorization workers (default: auto)
## Distributed Training
### Multi-GPU Training
Use torchrun for distributed training across multiple GPUs:
```bash
torchrun --nproc_per_node=4 train.py \
    --train.device cuda \
    --train.batch-size 131072
```
### Multi-Node Training
For distributed training across multiple nodes:
```bash
# On main node (rank 0)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py

# On worker nodes (ranks 1-3)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=NODE_RANK \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py
```
## Monitoring and Logging
### Logger Integration
PufferLib supports multiple logging backends:
#### Weights & Biases
```python
from pufferlib import WandbLogger
logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config
)
trainer = PuffeRL(env, policy, logger=logger)
```
#### Neptune
```python
from pufferlib import NeptuneLogger
logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)
trainer = PuffeRL(env, policy, logger=logger)
```
#### No Logger
```python
from pufferlib import NoLogger
trainer = PuffeRL(env, policy, logger=NoLogger())
```
### Key Metrics
Training logs include:
- **Performance Metrics**:
  - Steps per second (SPS)
  - Training throughput
  - Wall-clock time per iteration
- **Learning Metrics**:
  - Episode rewards (mean, min, max)
  - Episode lengths
  - Value function loss
  - Policy loss
  - Entropy
  - Explained variance
  - Clipfrac
- **Environment Metrics**:
  - Environment-specific rewards
  - Success rates
  - Custom metrics
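Most of these are standard PPO diagnostics. The two least self-explanatory can be computed as follows (standard definitions, as used in CleanRL):
```python
import torch

def explained_variance(values, returns):
    # 1.0: the value function predicts returns perfectly;
    # ~0.0: it does no better than predicting the mean return.
    return float(1.0 - (returns - values).var() / returns.var())

def clipfrac(ratio, clip_coef=0.2):
    # Fraction of samples whose probability ratio hit the PPO clip range.
    # Persistently high values suggest the policy is changing too fast.
    return float(((ratio - 1.0).abs() > clip_coef).float().mean())
```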
### Terminal Dashboard
PufferLib provides a real-time terminal dashboard showing:
- Training progress
- Current SPS
- Episode statistics
- Loss values
- GPU utilization
## Checkpointing
### Saving Checkpoints
```python
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')
# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
```
### Loading Checkpoints
```python
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')
# Resume training
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```
## Hyperparameter Tuning with Protein
The Protein system enables automatic hyperparameter and reward tuning:
```python
from pufferlib import Protein
# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}
# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)
best_config = protein.optimize()
```
## Performance Optimization Tips
### Maximizing Throughput
1. **Batch Size**: Increase batch_size to fully utilize GPU
2. **Num Envs**: Balance between CPU and GPU utilization
3. **Compile**: Enable torch.compile for 10-20% speedup
4. **Workers**: Adjust num_workers based on environment complexity
5. **Device**: Always use 'cuda' for neural network training
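For example, if you build the policy network yourself, compilation is a one-liner; the `compile` flag above is assumed to apply an equivalent wrapper internally:
```python
import torch

# Hand-compiling a policy network; PufferLib's `compile` flag is
# assumed to do something equivalent for you.
policy = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 8)
).to('cuda')
policy = torch.compile(policy, mode='reduce-overhead')
```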
### Environment Speed
- Pure Python environments: ~100k-500k SPS
- C-based environments: ~4M SPS
- With training overhead: ~1M-4M total SPS
### Memory Management
- Reduce batch_size if running out of GPU memory
- Decrease num_envs if running out of CPU memory
- Use gradient accumulation for large effective batch sizes
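The last point deserves a sketch: gradient accumulation splits one large logical batch into several smaller backward passes, trading a little speed for a much lower peak memory footprint. A generic PyTorch pattern, independent of PufferLib's trainer API:
```python
import torch

# Four micro-batches of 8192 samples produce the same averaged gradient
# as one batch of 32768, at a fraction of the peak activation memory.
policy = torch.nn.Linear(128, 8)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

batch = torch.randn(32768, 128)   # stand-in for a rollout batch
targets = torch.randn(32768, 8)
accum_steps = 4

optimizer.zero_grad()
for micro_x, micro_y in zip(batch.chunk(accum_steps), targets.chunk(accum_steps)):
    loss = torch.nn.functional.mse_loss(policy(micro_x), micro_y)
    (loss / accum_steps).backward()  # gradients accumulate across calls
optimizer.step()
```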
## Common Training Patterns
### Curriculum Learning
```python
# Start with easy tasks, gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]
for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    # Reuse the same policy object so learned weights carry over
    trainer = PuffeRL(env, policy)
    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
### Reward Shaping
```python
# Wrap environment with custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)
        # Add shaped rewards. `compute_proximity_bonus` is a placeholder
        # for whatever environment-specific shaping signal you compute.
        proximity_bonus = self.compute_proximity_bonus()
        shaped_rewards = rewards + 0.1 * proximity_bonus
        return obs, shaped_rewards, dones, infos
```
### Multi-Stage Training
```python
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},  # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},  # Main training
    {'learning_rate': 1e-4, 'iterations': 2000},  # Fine-tuning
]
for stage in stages:
    trainer.learning_rate = stage['learning_rate']
    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```
## Troubleshooting
### Low Performance
- Check environment is vectorized correctly
- Verify GPU utilization with `nvidia-smi`
- Increase batch_size to saturate GPU
- Enable compile mode
- Profile with `torch.profiler`
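For the last point, a minimal profiling pass around a few iterations (reusing the `trainer` object from the examples above) could look like:
```python
from torch.profiler import profile, ProfilerActivity

# Profile a handful of iterations to see where the time goes
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        trainer.evaluate()
        trainer.train()
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))
```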
### Training Instability
- Reduce learning_rate
- Decrease batch_size
- Increase num_envs for more diverse samples
- Increase ent_coef to encourage more exploration
- Check reward scaling
### Memory Issues
- Reduce batch_size or num_envs
- Use gradient accumulation
- Disable compile mode if causing OOM
- Check for memory leaks in custom environments