# PufferLib Training Guide

## Overview

PuffeRL is PufferLib's high-performance training algorithm. It is based on CleanRL's PPO with LSTMs and extended with the project's own research improvements. Through optimized vectorization and an efficient implementation, it reaches training throughput of millions of steps per second.

## Training Workflow

### Basic Training Loop

The PuffeRL trainer provides three core methods:

```python
# Collect environment interactions
rollout_data = trainer.evaluate()

# Train on the collected batch
train_metrics = trainer.train()

# Aggregate and log results
trainer.mean_and_log()
```

### CLI Training

Quick-start training via the command line:

```bash
# Basic training
puffer train environment_name --train.device cuda --train.learning-rate 0.001

# Custom configuration
puffer train environment_name \
    --train.device cuda \
    --train.batch-size 32768 \
    --train.learning-rate 0.0003 \
    --train.num-iterations 10000
```

### Python Training Script

```python
import pufferlib
from pufferlib import PuffeRL

# Initialize the environment
env = pufferlib.make('environment_name', num_envs=256)

# Create the trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768,
    n_epochs=4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5
)

# Training loop
for iteration in range(num_iterations):
    # Collect rollouts
    rollout_data = trainer.evaluate()

    # Train on the batch
    train_metrics = trainer.train()

    # Log results
    trainer.mean_and_log()
```

## Key Training Parameters

### Core Hyperparameters

- **learning_rate**: Learning rate for the optimizer (default: 3e-4)
- **batch_size**: Number of timesteps per training batch (default: 32768)
- **n_epochs**: Number of training epochs per batch (default: 4)
- **num_envs**: Number of parallel environments (default: 256)
- **num_steps**: Steps per environment per rollout (default: 128)

### PPO Parameters

- **gamma**: Discount factor (default: 0.99)
- **gae_lambda**: Lambda for the GAE calculation (default: 0.95)
- **clip_coef**: PPO clipping coefficient (default: 0.2)
- **ent_coef**: Entropy coefficient for exploration (default: 0.01)
- **vf_coef**: Value function loss coefficient (default: 0.5)
- **max_grad_norm**: Maximum gradient norm for clipping (default: 0.5)

### Performance Parameters

- **device**: Computing device ('cuda' or 'cpu')
- **compile**: Use torch.compile for faster training (default: True)
- **num_workers**: Number of vectorization workers (default: auto)

## Distributed Training

### Multi-GPU Training

Use torchrun for distributed training across multiple GPUs:

```bash
torchrun --nproc_per_node=4 train.py \
    --train.device cuda \
    --train.batch-size 131072
```

### Multi-Node Training

For distributed training across multiple nodes:

```bash
# On the main node (rank 0)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py

# On the worker nodes (ranks 1, 2, 3)
torchrun --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=NODE_RANK \
    --master_addr=MASTER_IP \
    --master_port=29500 \
    train.py
```
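When launching with torchrun, every process runs the same training script. A common convention is to restrict logging to the rank-0 process. The sketch below assumes the standard `RANK` environment variable that torchrun sets; it is a generic distributed PyTorch pattern, not a documented PuffeRL requirement, so adapt it to how your trainer handles distributed logging.

```python
import os

# torchrun sets RANK for every process it launches; by convention,
# only rank 0 writes logs. This is a generic distributed-training
# pattern, not a PufferLib-specific API.
is_main_process = int(os.environ.get('RANK', '0')) == 0

for iteration in range(num_iterations):
    trainer.evaluate()
    trainer.train()
    if is_main_process:
        trainer.mean_and_log()
```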
## Monitoring and Logging

### Logger Integration

PufferLib supports multiple logging backends:

#### Weights & Biases

```python
from pufferlib import WandbLogger

logger = WandbLogger(
    project='my_project',
    entity='my_team',
    name='experiment_name',
    config=trainer_config
)

trainer = PuffeRL(env, policy, logger=logger)
```

#### Neptune

```python
from pufferlib import NeptuneLogger

logger = NeptuneLogger(
    project='my_team/my_project',
    name='experiment_name',
    api_token='YOUR_TOKEN'
)

trainer = PuffeRL(env, policy, logger=logger)
```

#### No Logger

```python
from pufferlib import NoLogger

trainer = PuffeRL(env, policy, logger=NoLogger())
```

### Key Metrics

Training logs include:

- **Performance Metrics**:
  - Steps per second (SPS)
  - Training throughput
  - Wall-clock time per iteration
- **Learning Metrics**:
  - Episode rewards (mean, min, max)
  - Episode lengths
  - Value function loss
  - Policy loss
  - Entropy
  - Explained variance
  - Clipfrac
- **Environment Metrics**:
  - Environment-specific rewards
  - Success rates
  - Custom metrics

### Terminal Dashboard

PufferLib provides a real-time terminal dashboard showing:

- Training progress
- Current SPS
- Episode statistics
- Loss values
- GPU utilization

## Checkpointing

### Saving Checkpoints

```python
# Save checkpoint
trainer.save_checkpoint('checkpoint.pt')

# Save with additional metadata
trainer.save_checkpoint(
    'checkpoint.pt',
    metadata={'iteration': iteration, 'best_reward': best_reward}
)
```

### Loading Checkpoints

```python
# Load checkpoint
trainer.load_checkpoint('checkpoint.pt')

# Resume training
for iteration in range(resume_iteration, num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```

## Hyperparameter Tuning with Protein

The Protein system enables automatic hyperparameter and reward tuning:

```python
from pufferlib import Protein

# Define search space
search_space = {
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'batch_size': [16384, 32768, 65536],
    'ent_coef': [0.001, 0.01, 0.1],
    'clip_coef': [0.1, 0.2, 0.3]
}

# Run hyperparameter search
protein = Protein(
    env_name='environment_name',
    search_space=search_space,
    num_trials=100,
    metric='mean_reward'
)

best_config = protein.optimize()
```
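The returned configuration can then be fed back into a regular training run. This is a minimal sketch, assuming `best_config` is a flat dict whose keys match `PuffeRL` keyword arguments such as `learning_rate` and `batch_size`:

```python
# Minimal sketch: retrain with the best configuration from the search.
# Assumes best_config maps directly to PuffeRL keyword arguments.
env = pufferlib.make('environment_name', num_envs=256)
trainer = PuffeRL(env=env, policy=my_policy, device='cuda', **best_config)

for iteration in range(num_iterations):
    trainer.evaluate()
    trainer.train()
    trainer.mean_and_log()
```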
## Performance Optimization Tips

### Maximizing Throughput

1. **Batch Size**: Increase batch_size to fully utilize the GPU
2. **Num Envs**: Balance CPU and GPU utilization
3. **Compile**: Enable torch.compile for a 10-20% speedup
4. **Workers**: Adjust num_workers based on environment complexity
5. **Device**: Always use 'cuda' for neural network training

### Environment Speed

- Pure Python environments: ~100k-500k SPS
- C-based environments: ~4M SPS
- With training overhead: ~1M-4M total SPS

### Memory Management

- Reduce batch_size if running out of GPU memory
- Decrease num_envs if running out of CPU memory
- Use gradient accumulation for large effective batch sizes

## Common Training Patterns

### Curriculum Learning

```python
# Start with easy tasks and gradually increase difficulty
difficulty_levels = [0.1, 0.3, 0.5, 0.7, 1.0]

for difficulty in difficulty_levels:
    env = pufferlib.make('environment_name', difficulty=difficulty)
    trainer = PuffeRL(env, policy)

    for iteration in range(iterations_per_level):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```

### Reward Shaping

```python
# Wrap the environment with custom reward shaping
class RewardShapedEnv(pufferlib.PufferEnv):
    def step(self, actions):
        obs, rewards, dones, infos = super().step(actions)

        # Add shaped rewards; proximity_bonus stands in for an
        # environment-specific shaping term
        shaped_rewards = rewards + 0.1 * proximity_bonus
        return obs, shaped_rewards, dones, infos
```

### Multi-Stage Training

```python
# Train in multiple stages with different configurations
stages = [
    {'learning_rate': 1e-3, 'iterations': 1000},  # Exploration
    {'learning_rate': 3e-4, 'iterations': 5000},  # Main training
    {'learning_rate': 1e-4, 'iterations': 2000}   # Fine-tuning
]

for stage in stages:
    trainer.learning_rate = stage['learning_rate']

    for iteration in range(stage['iterations']):
        trainer.evaluate()
        trainer.train()
        trainer.mean_and_log()
```

## Troubleshooting

### Low Performance

- Check that the environment is vectorized correctly
- Verify GPU utilization with `nvidia-smi`
- Increase batch_size to saturate the GPU
- Enable compile mode
- Profile with `torch.profiler` (see the sketch at the end of this section)

### Training Instability

- Reduce learning_rate
- Decrease batch_size
- Increase num_envs for more diverse samples
- Raise the entropy coefficient for more exploration
- Check reward scaling

### Memory Issues

- Reduce batch_size or num_envs
- Use gradient accumulation
- Disable compile mode if it causes OOM
- Check for memory leaks in custom environments
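If throughput remains low after the steps above, a profiler pass usually shows where the time goes. Below is a minimal sketch using PyTorch's built-in `torch.profiler` around a single training iteration; `trainer` is the PuffeRL instance from the earlier examples.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one rollout/train iteration to see where time is spent.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.evaluate()
    trainer.train()

# Print the top operations sorted by GPU time.
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=20))
```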