# RL Debugging Methodology
## When to Use This Skill
Invoke this skill when you encounter:
- **Agent Won't Learn**: Reward stuck at baseline, not improving
- **Training Unstable**: Loss bouncing, reward highly variable
- **Suboptimal Policy**: Agent learned something but worse than expected
- **Reward Hacking**: Agent gaming the reward function
- **Exploration Issues**: Agent stuck in local optimum or exploring poorly
- **Hyperparameter Sensitivity**: Small changes break training
- **Learning Rate Tuning**: Not sure what value is right
- **Convergence Problems**: Loss doesn't decrease or decreases then stops
- **Environment vs Algorithm**: Unsure if problem is environment or RL algorithm
- **Logging Confusion**: Not sure what metrics to monitor
- **Gradual Performance Degradation**: Early training good, later poor
- **Sparse Reward Challenge**: Agent never finds reward signal
**Core Problem**: RL debugging often becomes random hyperparameter tweaking. Agents are complex systems with many failure modes. Systematic diagnosis finds root causes; random tweaking wastes time and leads to contradictory findings.
## Do NOT Use This Skill For
- **Learning RL theory** (route to rl-foundations for MDPs, Bellman equations, policy gradients)
- **Implementing new algorithms** (route to algorithm-specific skills like value-based-methods, policy-gradient-methods, actor-critic-methods)
- **Environment API questions** (route to rl-environments for Gym/Gymnasium API, custom environments, wrappers)
- **Evaluation methodology** (route to rl-evaluation for rigorous statistical testing, generalization assessment)
- **Initial algorithm selection** (route to using-deep-rl router or rl-foundations for choosing the right algorithm family)
## Core Principle: The 80/20 Rule
**The most important insight in RL debugging:**
```
80% of RL failures are in:
1. Environment design (agent can't see true state)
2. Reward function (misaligned or wrong scale)
3. Observation/action representation (missing information)
15% are in:
4. Hyperparameters (learning rate, batch size, etc.)
5. Exploration strategy (too much or too little)
5% are in:
6. Algorithm selection (wrong algorithm for problem)
```
**Consequence**: If training fails, check environment and reward FIRST. Change the algorithm last.
### Why This Order?
**Scenario 1: Broken Environment**
```python
# BROKEN ENVIRONMENT: Agent can't win no matter what algorithm
class BrokenEnv:
    def reset(self):
        self.state = random_state()  # Agent can't control this
        return self.state

    def step(self, action):
        # Reward independent of action!
        reward = random.random()
        return self.state, reward

# No amount of PPO, DQN, SAC can learn from random reward

# CORRECT ENVIRONMENT: Agent can win with right policy
class CorrectEnv:
    def reset(self):
        self.state = initial_state
        return self.state

    def step(self, action):
        # Reward depends on action
        reward = compute_reward(self.state, action)
        self.state = compute_next_state(self.state, action)
        return self.state, reward
```
**If environment is broken, no algorithm will learn.**
**Scenario 2: Reward Scale Issue**
```python
# WRONG SCALE: Reward in [0, 1000000]
# Algorithm gradient updates: param = param - lr * grad
# If gradient huge (due to reward scale), single step breaks everything
# CORRECT SCALE: Reward in [-1, 1]
# Gradients are reasonable, learning stable
# Fix is simple: divide reward by scale factor
# But if you don't know to check reward scale, you'll try 10 learning rates instead
```
**Consequence: Always check reward scale before tuning learning rate.**
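When the scale is off, a reward wrapper keeps the fix in one place. A minimal sketch using the classic Gym `RewardWrapper` API; the environment name and `scale` value are placeholders (estimate the scale from random-policy rollouts, as in Check 1 below):
```python
import gym
import numpy as np

class ScaleReward(gym.RewardWrapper):
    """Divide rewards by a fixed scale so they land roughly in [-1, 1]."""
    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return float(np.clip(reward / self.scale, -1.0, 1.0))

# Usage (hypothetical env and scale):
# env = ScaleReward(gym.make("MyEnv-v0"), scale=1000.0)
```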
## Part 1: Systematic Debugging Framework
### The Debugging Process (Not Random Tweaking)
```
START: Agent not learning (or training unstable, or suboptimal)
Step 1: ENVIRONMENT CHECK (Does agent have what it needs?)
├─ Can agent see the state? (Is observation sufficient?)
├─ Is environment deterministic or stochastic? (Affects algorithm choice)
├─ Can agent actually win? (Does optimal policy exist?)
└─ Is environment reset working? (Fresh episode each reset?)
Step 2: REWARD SCALE CHECK (Is reward in reasonable range?)
├─ What's the range of rewards? (Min, max, typical)
├─ Are rewards normalized? (Should be ≈ [-1, 1])
├─ Is reward aligned with desired behavior? (No reward hacking)
└─ Are rewards sparse or dense? (Affects exploration strategy)
Step 3: OBSERVATION REPRESENTATION (Is information preserved?)
├─ Are observations normalized? (Images: [0, 255] or [0, 1]?)
├─ Is temporal information included? (Frame stacking for Atari?)
├─ Are observations consistent? (Same format each episode?)
└─ Is observation sufficient to solve problem? (Can human win from this info?)
Step 4: BASIC ALGORITHM CHECK (Is the RL algorithm working at all?)
├─ Run on simple environment (CartPole, simple task)
├─ Can algorithm learn on simple env? (If not: algorithm issue)
├─ Can algorithm beat random baseline? (If not: something is broken)
└─ Does loss decrease? (If not: learning not happening)
Step 5: HYPERPARAMETER TUNING (Only after above passed)
├─ Is learning rate in reasonable range? (1e-5 to 1e-3 typical)
├─ Is batch size appropriate? (Power of 2: 32, 64, 128, 256)
├─ Is exploration sufficient? (Epsilon decaying? Entropy positive?)
└─ Are network layers reasonable? (3 hidden layers typical)
Step 6: LOGGING ANALYSIS (What do the metrics say?)
├─ Policy loss: decreasing? exploding? zero?
├─ Value loss: decreasing? stable?
├─ Reward curve: trending up? flat? oscillating?
├─ Entropy: decreasing over time? (Exploration → exploitation)
└─ Gradient norms: reasonable? exploding? vanishing?
Step 7: IDENTIFY ROOT CAUSE (Synthesize findings)
└─ Where is the actual problem? (Environment, reward, algorithm, hyperparameters)
```
### Why This Order Matters
**Common mistake: Jump to Step 5 (hyperparameter tuning)**
```python
# Agent not learning. Frustration sets in.
# "I'll try learning rate 1e-4" (Step 5, skipped 1-4)
# Doesn't work.
# "I'll try batch size 64" (more Step 5 tweaking)
# Doesn't work.
# "I'll try a bigger network" (still Step 5)
# Doesn't work.
# Hours wasted.
# Correct approach: Follow Steps 1-4 first.
# Step 1: Oh! Environment reset is broken, always same initial state
# Fix environment.
# Now agent learns immediately with default hyperparameters.
```
**The order reflects probability**: It's more likely the environment is broken than the algorithm; more likely the reward scale is wrong than learning rate is wrong.
## Part 2: Diagnosis Trees by Symptom
### Diagnosis Tree 1: "Agent Won't Learn"
**Symptom**: Reward stuck near random baseline. Loss doesn't decrease meaningfully.
```
START: Agent Won't Learn
├─ STEP 1: Can agent beat random baseline?
│ ├─ YES → Skip to STEP 4
│ └─ NO → Environment issue likely
│ ├─ Check 1A: Is environment output sane?
│ │ ├─ Print first 5 episodes: state, action, reward, next_state
│ │ ├─ Verify types match (shapes, ranges, dtypes)
│ │ └─ Is reward always same? Always zero? (Red flag: no signal)
│ ├─ Check 1B: Can you beat it manually?
│ │ ├─ Play environment by hand (hardcode a policy)
│ │ ├─ Can you get >0 reward? (If not: environment is broken)
│ │ └─ If yes: Agent is missing something
│ └─ Check 1C: Is reset working?
│ ├─ Call reset() twice, check states differ
│ └─ If states same: reset is broken, fix it
├─ STEP 2: Is reward scale reasonable?
│ ├─ Compute: min, max, mean, std of rewards from random policy
│ ├─ If range >> 1 (e.g., [0, 10000]):
│ │ ├─ Action: Normalize rewards to [-1, 1]
│ │ ├─ Code: reward = reward / max_possible_reward
│ │ └─ Retest: Usually fixes "won't learn"
│ ├─ If range << 1 (e.g., [0, 0.001]):
│ │ ├─ Action: Scale up rewards
│ │ ├─ Code: reward = reward * 1000
│ │ └─ Or increase network capacity (more signal needed)
│ └─ If reward is [0, 1] (looks fine):
│ └─ Continue to STEP 3
├─ STEP 3: Is observation sufficient?
│ ├─ Check 3A: Are observations normalized?
│ │ ├─ If images [0, 255]: normalize to [0, 1] or [-1, 1]
│ │ ├─ Code: observation = observation / 255.0
│ │ └─ Retest
│ ├─ Check 3B: Is temporal info included? (For vision: frame stacking)
│ │ ├─ If using images: last 4 frames stacked?
│ │ ├─ If using states: includes velocity/derivatives?
│ │ └─ Missing temporal info → agent can't infer velocity
│ └─ Check 3C: Is observation Markovian?
│ ├─ Can optimal policy be derived from this observation?
│ ├─ If not: observation insufficient (red flag)
│ └─ Example: Only position, not velocity → agent can't control
├─ STEP 4: Run sanity check on simple environment
│ ├─ Switch to CartPole or equivalent simple env
│ ├─ Train with default hyperparameters
│ ├─ Does simple env learn? (Should learn in 1000-5000 steps)
│ ├─ YES → Your algorithm works, issue is your env/hyperparameters
│ └─ NO → Algorithm itself broken (rare, check algorithm implementation)
├─ STEP 5: Check exploration
│ ├─ Is agent exploring or stuck?
│ ├─ Log entropy (for stochastic policies)
│ ├─ If entropy → 0 early: agent exploiting before exploring
│ │ └─ Solution: Increase entropy regularization or ε
│ ├─ If entropy always high: too much exploration
│ │ └─ Solution: Decay entropy or ε more aggressively
│ └─ Visualize: Plot policy actions over time, should see diversity early
├─ STEP 6: Check learning rate
│ ├─ Is learning rate in [1e-5, 1e-3]? (typical range)
│ ├─ If > 1e-3: Try reducing (might be too aggressive)
│ ├─ If < 1e-5: Try increasing (might be too conservative)
│ ├─ Watch loss first step: If loss increases → LR too high
│ └─ Safe default: 3e-4
└─ STEP 7: Check network architecture
├─ For continuous control: small networks ok (1-2 hidden layers, 64-256 units)
├─ For vision: use CNN (don't use FC on pixels)
├─ Check if network has enough capacity
└─ Tip: Start with simple, add complexity if needed
```
**ROOT CAUSES in order of likelihood:**
1. **Reward scale wrong** (40% of cases)
2. **Environment broken** (25% of cases)
3. **Observation insufficient** (15% of cases)
4. **Learning rate too high/low** (12% of cases)
5. **Algorithm issue** (8% of cases)
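For STEP 4 above, a known-good reference implementation takes your own training code out of the equation. A minimal sketch, assuming `stable-baselines3` (with Gymnasium) is installed; any trusted PPO/DQN implementation serves the same purpose:
```python
# Can a reference algorithm learn CartPole at all?
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)  # CartPole-v1 should approach ~500 reward/episode

# If this works, point the same reference algorithm at YOUR environment.
# If it fails there too, suspect your env/reward/observation, not your algorithm.
```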
### Diagnosis Tree 2: "Training Unstable"
**Symptom**: Loss bounces wildly, reward spikes then crashes, training oscillates.
```
START: Training Unstable
├─ STEP 1: Characterize the instability
│ ├─ Plot loss curve: Does it bounce at same magnitude or grow?
│ ├─ Plot reward curve: Does it oscillate around mean or trend down?
│ ├─ Compute: reward variance over 100 episodes
│ └─ This tells you: Is it normal variance or pathological instability?
├─ STEP 2: Check if environment is deterministic
│ ├─ Deterministic environment + stochastic policy = normal variance
│ ├─ Stochastic environment + any policy = high variance (expected)
│ ├─ If stochastic: Can you reduce randomness? Or accept higher variance?
│ └─ Some instability is normal; distinguish from pathological
├─ STEP 3: Check reward scale
│ ├─ If rewards >> 1: Gradient updates too large
│ │ ├─ Single step might overshoot optimum
│ │ ├─ Solution: Normalize rewards to [-1, 1]
│ │ └─ This often fixes instability immediately
│ ├─ If reward has outliers: Single large reward breaks training
│ │ ├─ Solution: Reward clipping or scaling
│ │ └─ Example: r = np.clip(reward, -1, 1)
│ └─ Check: Is reward scale consistent?
├─ STEP 4: Check learning rate (LR often causes instability)
│ ├─ If loss oscillates: LR likely too high
│ │ ├─ Try reducing by 2-5× (e.g., 1e-3 → 3e-4)
│ │ ├─ Watch first 100 steps: Loss should decrease monotonically
│ │ └─ If still oscillates: try 10× reduction
│ ├─ If you have LR scheduler: Check if it's too aggressive
│ │ ├─ Scheduler reducing LR too fast can cause steps
│ │ └─ Solution: Slower schedule (more steps to final LR)
│ └─ Test: Set LR very low (1e-5), see if training is smooth
│ ├─ YES → Increase LR gradually until instability starts
│ └─ This bracketing finds safe LR range
├─ STEP 5: Check batch size
│ ├─ Small batch (< 32): High gradient variance, bouncy updates
│ │ ├─ Solution: Increase batch size (32, 64, 128)
│ │ └─ But not too large: training becomes slow
│ ├─ Large batch (> 512): Might overfit, large gradient steps
│ │ ├─ Solution: Use gradient accumulation
│ │ └─ Or reduce learning rate slightly
│ └─ Start with batch_size=64, adjust if needed
├─ STEP 6: Check gradient clipping
│ ├─ Are gradients exploding? (Check max gradient norm)
│ │ ├─ If max grad norm > 100: Likely exploding gradients
│ │ ├─ Solution: Enable gradient clipping (max_norm=1.0)
│ │ └─ Code: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
│ ├─ If max grad norm reasonable (< 10): Skip this step
│ └─ Watch grad norm over training: Should stay roughly constant
├─ STEP 7: Check algorithm-specific parameters
│ ├─ For PPO: Is clipping epsilon reasonable? (0.2 default)
│ │ ├─ Too high: Over-clips, doesn't update
│ │ └─ Too low: Allows large updates, instability
│ ├─ For DQN: Is target network update frequency appropriate?
│ │ ├─ Update too often: Target constantly changing
│ │ └─ Update too rarely: Stale targets
│ └─ For A3C/A2C: Check entropy coefficient
│ ├─ Too high: Too much exploration, policy noisy
│ └─ Too low: Premature convergence
└─ STEP 8: Check exploration decay
├─ Is exploration decaying too fast? (Policy becomes deterministic)
│ └─ If entropy→0 early: Agent exploits before exploring
├─ Is exploration decaying too slow? (Policy stays noisy)
│ └─ If entropy stays high: Too much randomness in later training
└─ Entropy should decay: high early, low late
└─ Plot entropy over training: should show clear decay curve
```
**ROOT CAUSES in order of likelihood:**
1. **Learning rate too high** (35% of cases)
2. **Reward scale too large** (25% of cases)
3. **Batch size too small** (15% of cases)
4. **Gradient explosion** (10% of cases)
5. **Algorithm parameters** (10% of cases)
6. **Environment stochasticity** (5% of cases)
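For STEP 6 above, `torch.nn.utils.clip_grad_norm_` returns the total gradient norm *before* clipping, so clipping and monitoring can share one call. A minimal sketch of a single update; the model, optimizer, and loss come from whatever training loop you already have:
```python
import torch

def optimization_step(model, optimizer, loss, max_norm=1.0):
    """One gradient update with clipping; returns the pre-clip gradient norm."""
    optimizer.zero_grad()
    loss.backward()
    # Clips in place and returns the total norm computed before clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return float(grad_norm)

# Log the returned norm every update: persistently > 100 suggests exploding
# gradients (reduce LR / reward scale); persistently < 1e-4 suggests vanishing ones.
```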
### Diagnosis Tree 3: "Suboptimal Policy"
**Symptom**: Agent learned something but performs worse than expected. Better than random baseline, but not good enough.
```
START: Suboptimal Policy
├─ STEP 1: How suboptimal? (Quantify the gap)
│ ├─ Compute: Agent reward vs theoretical optimal
│ ├─ If 80% of optimal: Normal (RL usually gets 80-90% optimal)
│ ├─ If 50% of optimal: Significantly suboptimal
│ ├─ If 20% of optimal: Very bad
│ └─ This tells you: Is it "good enough" or truly broken?
├─ STEP 2: Is it stuck in local optimum?
│ ├─ Run multiple seeds: Do you get similar reward each seed?
│ ├─ If rewards similar across seeds: Consistent local optimum
│ ├─ If rewards vary wildly: High variance, need more training
│ └─ Solution if local optimum: More exploration or better reward shaping
├─ STEP 3: Check reward hacking
│ ├─ Visualize agent behavior: Does it match intent?
│ ├─ Example: Cart-pole reward is [0, 1] per timestep
│ │ ├─ Agent might learn: "Stay in center, don't move"
│ │ ├─ Policy is suboptimal but still gets reward
│ │ └─ Solution: Reward engineering (bonus for progress)
│ └─ Hacking signs:
│ ├─ Agent does something weird but gets reward
│ ├─ Behavior makes no intuitive sense
│ └─ Reward increases but performance bad
├─ STEP 4: Is exploration sufficient?
│ ├─ Check entropy: Does policy explore initially?
│ ├─ Check epsilon decay (if using ε-greedy): Does it decay appropriately?
│ ├─ Is agent exploring broadly or stuck in small region?
│ ├─ Solution: Slower exploration decay or intrinsic motivation
│ └─ Use RND/curiosity if environment has sparse rewards
├─ STEP 5: Check network capacity
│ ├─ Is network too small to represent optimal policy?
│ ├─ For vision: Use standard CNN (not tiny network)
│ ├─ For continuous control: 2-3 hidden layers, 128-256 units
│ ├─ Test: Double network size, does performance improve?
│ └─ If yes: Original network was too small
├─ STEP 6: Check data efficiency
│ ├─ Is agent training long enough?
│ ├─ RL usually needs: simple tasks 100k steps, complex tasks 1M+ steps
│ ├─ If training only 10k steps: Too short, agent didn't converge
│ ├─ Solution: Train longer (but check reward curve first)
│ └─ If reward plateaus early: Extend training won't help
├─ STEP 7: Check observation and action spaces
│ ├─ Is action space continuous or discrete?
│ ├─ Is action discretization appropriate?
│ │ ├─ Too coarse: Can't express fine control
│ │ ├─ Too fine: Huge action space, hard to learn
│ │ └─ Example: 100 actions for simple control = too many
│ ├─ Is observation sufficient? (See Diagnosis Tree 1, Step 3)
│ └─ Missing information in observation = impossible to be optimal
├─ STEP 8: Check reward structure
│ ├─ Is reward dense or sparse?
│ ├─ Sparse reward + suboptimal policy: Agent might not be exploring to good region
│ │ ├─ Solution: Reward shaping (bonus for progress)
│ │ └─ Or: Intrinsic motivation (RND/curiosity)
│ ├─ Dense reward + suboptimal: Possible misalignment with intent
│ └─ Can you improve by reshaping reward?
└─ STEP 9: Compare with baseline algorithm
├─ Run reference implementation on same env
├─ Does reference get better reward?
├─ YES → Your implementation has a bug
├─ NO → Problem is inherent to algorithm or environment
└─ This isolates: Implementation issue vs fundamental difficulty
```
**ROOT CAUSES in order of likelihood:**
1. **Exploration insufficient** (30% of cases)
2. **Training not long enough** (25% of cases)
3. **Reward hacking** (20% of cases)
4. **Network too small** (12% of cases)
5. **Observation insufficient** (8% of cases)
6. **Algorithm mismatch** (5% of cases)
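For STEP 8 above, potential-based shaping adds a dense progress signal without changing which policy is optimal. A minimal sketch; the potential function is a problem-specific heuristic you supply (the goal-distance example here is hypothetical):
```python
import numpy as np

GAMMA = 0.99  # use your algorithm's discount factor

def potential(state, goal):
    # Hypothetical heuristic: higher = closer to the goal
    return -float(np.linalg.norm(np.asarray(state) - np.asarray(goal)))

def shaped_reward(reward, state, next_state, goal, done):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s).
    This form provably preserves the optimal policy."""
    phi_next = 0.0 if done else potential(next_state, goal)
    return reward + GAMMA * phi_next - potential(state, goal)
```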
## Part 3: What to Check First
### Critical Checks (Do These First)
#### Check 1: Reward Scale Analysis
**Why**: Reward scale is the MOST COMMON source of RL failures.
```python
# DIAGNOSTIC SCRIPT
import numpy as np

# Collect rewards from random policy
rewards = []
for episode in range(100):
    state = env.reset()
    for step in range(1000):
        action = env.action_space.sample()  # Random action
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break
rewards = np.array(rewards)

print("Reward statistics from random policy:")
print(f"  Min: {rewards.min()}")
print(f"  Max: {rewards.max()}")
print(f"  Mean: {rewards.mean()}")
print(f"  Std: {rewards.std()}")
print(f"  Range: [{rewards.min()}, {rewards.max()}]")

# RED FLAGS
if abs(rewards.max()) > 100 or abs(rewards.min()) > 100:
    print("⚠️ RED FLAG: Rewards >> 1, normalize them!")
if rewards.std() > 10:
    print("⚠️ RED FLAG: High reward variance, normalize or clip")
if rewards.min() == rewards.max():
    print("⚠️ RED FLAG: Constant rewards, no signal to learn from!")
if (rewards >= -1).all() and (rewards <= 1).all():
    print("✓ Reward scale looks reasonable ([-1, 1] range)")
```
**Action if scale is wrong:**
```python
# Normalize to [-1, 1]
reward = reward / max(abs(rewards.max()), abs(rewards.min()))
# Or clip
reward = np.clip(reward, -1, 1)
# Or shift and scale
reward = 2 * (reward - rewards.min()) / (rewards.max() - rewards.min()) - 1
```
#### Check 2: Environment Sanity Check
**Why**: Broken environment → no algorithm will work.
```python
# DIAGNOSTIC SCRIPT
import numpy as np

def sanity_check_env(env, num_episodes=5):
    """Quick check if environment is sane."""
    for episode in range(num_episodes):
        state = env.reset()
        print(f"\nEpisode {episode}:")
        print(f"  Initial state shape: {state.shape}, dtype: {state.dtype}")
        print(f"  Initial state range: [{state.min()}, {state.max()}]")
        for step in range(10):
            action = env.action_space.sample()
            next_state, reward, done, info = env.step(action)
            print(f"  Step {step}: action={action}, reward={reward}, done={done}")
            print(f"    State shape: {next_state.shape}, range: [{next_state.min()}, {next_state.max()}]")
            # Check for NaN
            if np.isnan(next_state).any() or np.isnan(reward):
                print("  ⚠️ NaN detected!")
            # Check for reasonable values
            if np.abs(next_state).max() > 1e6:
                print("  ⚠️ State explosion (values > 1e6)")
            if done:
                break
    print("\n✓ Environment check complete")

sanity_check_env(env)
```
**RED FLAGS:**
- NaN or inf in observations/rewards
- State values exploding (> 1e6)
- Reward always same (no signal)
- Done flag never true (infinite episodes)
- State never changes despite actions
#### Check 3: Can You Beat It Manually?
**Why**: If a human can't solve it, the agent won't either (barring reward hacking).
```python
# Manual policy: Hardcoded behavior
def manual_policy(state):
    # Example for CartPole: if pole tilting right, push right
    if state[2] > 0:  # angle > 0
        return 1  # Push right
    else:
        return 0  # Push left

# Test manual policy
total_reward = 0
for episode in range(10):
    state = env.reset()
    for step in range(500):
        action = manual_policy(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break

avg_reward = total_reward / 10
print(f"Manual policy average reward: {avg_reward}")
# If avg_reward > 0: Environment is learnable
# If avg_reward ≤ 0: Environment is broken or impossible
```
#### Check 4: Observation Normalization
**Why**: Non-normalized observations cause learning problems.
```python
# Check if observations are normalized
for episode in range(10):
    state = env.reset()
    print(f"Episode {episode}: state range [{state.min()}, {state.max()}]")

    # For images: should be [0, 1] or [-1, 1]
    # For physical states: should be roughly [-1, 1]
    if state.min() < -10 or state.max() > 10:
        print("⚠️ Observations not normalized!")
        # Solution:
        state = state / np.abs(state).max()  # Normalize
```
## Part 4: Common RL Bugs Catalog
### Bug 1: Reward Scale > 1
**Symptom**: Training unstable, loss spikes, agent doesn't learn
**Root Cause**: Gradients too large due to reward scale
**Code Example**:
```python
# WRONG: Reward in [0, 1000]
reward = success_count * 1000
# CORRECT: Normalize by the maximum possible value
reward = success_count * 1000
reward = reward / max_possible_reward  # Result: in [0, 1]
```
**Fix**: Divide rewards by max possible value
**Detection**:
```python
rewards = ...  # rewards collected from 100 random-policy episodes
if max(abs(r) for r in rewards) > 1:
    print("⚠️ Reward scale issue detected")
```
### Bug 2: Environment Reset Broken
**Symptom**: Agent learns initial state but can't adapt
**Root Cause**: Reset doesn't randomize initial state or returns same state
**Code Example**:
```python
# WRONG: Reset always same state
def reset(self):
    self.state = np.array([0, 0, 0, 0])  # Always [0,0,0,0]
    return self.state

# CORRECT: Reset randomizes initial state
def reset(self):
    self.state = np.random.uniform(-0.1, 0.1, size=4)  # Random
    return self.state
```
**Fix**: Make reset() randomize initial state
**Detection**:
```python
states = [env.reset() for _ in range(10)]
if len(set(map(tuple, states))) == 1:
    print("⚠️ Reset broken, always same state")
```
### Bug 3: Observation Insufficient (Partial Observability)
**Symptom**: Agent can't learn because it doesn't see enough
**Root Cause**: Observation missing velocity, derivatives, or temporal info
**Code Example**:
```python
# WRONG: Only position, no velocity
state = np.array([position]) # Can't infer velocity from position alone
# CORRECT: Position + velocity
state = np.array([position, velocity])
# WRONG for images: Single frame
observation = env.render() # Single frame, no temporal info
# CORRECT for images: Stacked frames
frames = [frame_tm3, frame_tm2, frame_tm1, frame_t]  # 4 most recent frames, oldest first
observation = np.stack(frames, axis=-1) # Shape: (84, 84, 4)
```
**Fix**: Add missing information to observation
**Detection**:
```python
# If agent converges to bad performance despite long training
# Check: Can you compute optimal action from observation?
# If no: Observation is insufficient
```
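A deque-based wrapper is the usual way to implement the frame stack above. A minimal sketch, assuming the 4-tuple Gym-style `step` API used throughout this skill (Gym/Gymnasium also ship a ready-made frame-stacking wrapper):
```python
from collections import deque
import numpy as np

class FrameStack:
    """Stack the last k observations along the last axis."""
    def __init__(self, env, k=4):
        self.env = env
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):  # fill the stack with the first frame
            self.frames.append(obs)
        return np.stack(self.frames, axis=-1)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.stack(self.frames, axis=-1), reward, done, info
```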
### Bug 4: Reward Always Same (No Signal)
**Symptom**: Loss may decrease, but the policy doesn't improve over time; reward curve stays flat
**Root Cause**: Reward is constant or nearly constant
**Code Example**:
```python
# WRONG: Constant reward
reward = 1.0  # Every step gets +1, no differentiation

# CORRECT: Differentiate good and bad outcomes
if reached_goal:
    reward = 1.0
else:
    reward = 0.0  # Or -0.1 for living cost
```
**Fix**: Ensure reward differentiates outcomes
**Detection**:
```python
rewards = ...  # np.array of rewards collected from a random policy
if rewards.std() < 0.01:
    print("⚠️ Reward has no variance, no signal to learn")
```
### Bug 5: Learning Rate Too High
**Symptom**: Loss oscillates or explodes, training unstable
**Root Cause**: Gradient updates too large, overshooting optimum
**Code Example**:
```python
# WRONG: Learning rate 1e-2 (too high)
optimizer = Adam(model.parameters(), lr=1e-2)
# CORRECT: Learning rate 3e-4 (safe default)
optimizer = Adam(model.parameters(), lr=3e-4)
```
**Fix**: Reduce learning rate by 2-5×
**Detection**:
```python
# Watch loss first 100 steps
# If loss increases first step: LR too high
# If loss decreases but oscillates: LR probably high
```
### Bug 6: Learning Rate Too Low
**Symptom**: Agent learns very slowly, training takes forever
**Root Cause**: Gradient updates too small, learning crawls
**Code Example**:
```python
# WRONG: Learning rate 1e-6 (too low)
optimizer = Adam(model.parameters(), lr=1e-6)
# CORRECT: Learning rate 3e-4
optimizer = Adam(model.parameters(), lr=3e-4)
```
**Fix**: Increase learning rate by 2-5×
**Detection**:
```python
# Training curve increases very slowly
# If training 1M steps and reward barely improved: LR too low
```
### Bug 7: No Exploration Decay
**Symptom**: Agent learns but remains noisy, doesn't fully exploit
**Root Cause**: Exploration (epsilon or entropy) not decaying
**Code Example**:
```python
# WRONG: Constant epsilon
epsilon = 0.3 # Forever
# CORRECT: Decay epsilon
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.01)
```
**Fix**: Add exploration decay schedule
**Detection**:
```python
# Plot entropy or epsilon over training
# Should show clear decay from high to low
# If flat: not decaying
```
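`epsilon_linear` above is not a library function. A minimal sketch of the kind of schedule it stands for, annealing over the first half of training and then holding the floor value:
```python
def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01,
                   decay_fraction=0.5):
    """Linearly anneal epsilon over the first `decay_fraction` of training,
    then hold it at epsilon_end."""
    decay_steps = max(1, int(total_steps * decay_fraction))
    frac = min(step / decay_steps, 1.0)
    return epsilon_start + frac * (epsilon_end - epsilon_start)

# epsilon_linear(0, 1_000_000)        -> 1.0
# epsilon_linear(250_000, 1_000_000)  -> ~0.505
# epsilon_linear(600_000, 1_000_000)  -> 0.01
```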
### Bug 8: Exploration Decay Too Fast
**Symptom**: Agent plateaus early, stuck in local optimum
**Root Cause**: Exploration stops before finding good policy
**Code Example**:
```python
# WRONG: Effectively zero within ~50k steps of a 1M-step training run
epsilon = 0.99 ** (step / 100)  # Hits ~0.01 around step 46k, far too early

# CORRECT: Decays over the full training run
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.01)
```
**Fix**: Use longer decay schedule
**Detection**:
```python
# Plot epsilon over training
# Should reach final value at 50-80% through training
# Not at 5%
```
### Bug 9: Reward Hacking
**Symptom**: Agent achieves high reward but behavior is useless
**Root Cause**: Agent found way to game reward not aligned with intent
**Code Example**:
```python
# WRONG: Reward for just staying alive
reward = 1.0 # Every timestep
# Agent learns: Stay in corner, don't move, get infinite reward
# CORRECT: Reward for progress + living cost
position_before = self.state[0]
self.state = compute_next_state(...)
position_after = self.state[0]
progress = position_after - position_before
reward = progress - 0.01 # Progress bonus, living cost
```
**Fix**: Reshape reward to align with intent
**Detection**:
```python
# Visualize agent behavior
# If behavior weird but reward high: hacking
# If reward increases but task performance bad: hacking
```
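"Visualize agent behavior" usually just means watching a few greedy rollouts. A minimal sketch, assuming the env supports `render()` and the agent exposes a deterministic `act(state)`:
```python
def watch_agent(env, agent, episodes=3, max_steps=1000):
    """Render a few greedy rollouts and print their rewards.
    High reward + behavior that looks nothing like the intended task
    is the classic signature of reward hacking."""
    for ep in range(episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            env.render()
            action = agent.act(state)  # greedy / deterministic action
            state, reward, done, _ = env.step(action)
            total += reward
            if done:
                break
        print(f"Episode {ep}: reward = {total:.2f}")
```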
### Bug 10: Testing with Exploration
**Symptom**: Test performance much worse than training, high variance
**Root Cause**: Using stochastic policy at test time
**Code Example**:
```python
# WRONG: Test with epsilon > 0
for test_episode in range(100):
    action = epsilon_greedy(q_values, epsilon=0.05)  # Wrong!
    # Agent still explores at test

# CORRECT: Test greedy
for test_episode in range(100):
    action = np.argmax(q_values)  # Deterministic
```
**Fix**: Use greedy/deterministic policy at test time
**Detection**:
```python
# Test reward variance high?
# Test reward < train reward?
# Check: Are you using exploration at test time?
```
## Part 5: Logging and Monitoring
### What Metrics to Track
```python
# Minimal set of metrics for RL debugging
class RLLogger:
    def __init__(self):
        self.episode_rewards = []
        self.policy_losses = []
        self.value_losses = []
        self.entropies = []
        self.gradient_norms = []

    def log_episode(self, episode_reward):
        self.episode_rewards.append(episode_reward)

    def log_losses(self, policy_loss, value_loss, entropy):
        self.policy_losses.append(policy_loss)
        self.value_losses.append(value_loss)
        self.entropies.append(entropy)

    def log_gradient_norm(self, norm):
        self.gradient_norms.append(norm)

    def plot_training(self):
        """Visualize training progress."""
        # Plot 1: Episode rewards over time (smoothed)
        # Plot 2: Policy and value losses
        # Plot 3: Entropy (should decay)
        # Plot 4: Gradient norms
        pass
```
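Wiring the logger in is just a matter of calling its methods at the right points. A sketch of the hook points; `train_one_update()`, `episode_just_ended`, and `last_episode_reward` are placeholders for your own loop:
```python
logger = RLLogger()
for step in range(total_steps):
    # Placeholder: your update step returns these scalars
    policy_loss, value_loss, entropy, grad_norm = train_one_update()
    logger.log_losses(policy_loss, value_loss, entropy)
    logger.log_gradient_norm(grad_norm)
    if episode_just_ended:  # placeholder flag from your rollout code
        logger.log_episode(last_episode_reward)
logger.plot_training()
```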
### What Each Metric Means
#### Metric 1: Episode Reward
**What to look for**:
- Should trend upward over time
- Should have decreasing variance (less oscillation)
- Slight noise is normal
**Red flags**:
- Flat line: Not learning
- Downward trend: Getting worse
- Wild oscillations: Instability or unlucky randomness
**Code**:
```python
rewards = agent.get_episode_rewards()
reward_smoothed = np.convolve(rewards, np.ones(100)/100, mode='valid')
plt.plot(reward_smoothed) # Smooth to see trend
```
#### Metric 2: Policy Loss
**What to look for**:
- Should decrease over training
- Decrease should smooth out (not oscillating)
**Red flags**:
- Loss increasing: Learning rate too high
- Loss oscillating: Learning rate too high or reward scale wrong
- Loss = 0: Policy not updating
**Code**:
```python
if policy_loss > policy_loss_prev:
    print("⚠️ Policy loss increased, LR might be too high")
```
#### Metric 3: Value Loss (for critic-based methods)
**What to look for**:
- Should decrease initially, then plateau
- Should not oscillate heavily
**Red flags**:
- Loss exploding: LR too high
- Loss not changing: Not updating
**Code**:
```python
value_loss_smoothed = np.convolve(value_losses, np.ones(100)/100, mode='valid')
if value_loss_smoothed[-1] > value_loss_smoothed[-100]:
    print("⚠️ Value loss increasing recently")
```
#### Metric 4: Entropy (Policy Randomness)
**What to look for**:
- Should start high (exploring)
- Should decay to low (exploiting)
- Clear downward trend
**Red flags**:
- Entropy always high: Too much exploration
- Entropy drops to zero: Over-exploiting
- No decay: Entropy not decreasing
**Code**:
```python
if entropy[-1] > entropy[-100]:
    print("⚠️ Entropy increasing, exploration not decaying")
```
#### Metric 5: Gradient Norms
**What to look for**:
- Should stay roughly constant over training
- Typical range: 0.1 to 10
**Red flags**:
- Gradient norms > 100: Exploding gradients
- Gradient norms < 0.001: Vanishing gradients
- Sudden spikes: Outlier data or numerical issue
**Code**:
```python
total_norm = 0.0
for p in model.parameters():
    if p.grad is None:
        continue
    param_norm = p.grad.norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
if total_norm > 100:
    print("⚠️ Gradient explosion detected")
```
### Visualization Script
```python
import matplotlib.pyplot as plt
import numpy as np

def plot_rl_training(rewards, policy_losses, value_losses, entropies):
    """Plot training metrics for RL debugging."""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Plot 1: Episode rewards
    ax = axes[0, 0]
    ax.plot(rewards, alpha=0.3, label='Episode reward')
    reward_smooth = np.convolve(rewards, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(rewards)), reward_smooth, label='Smoothed (100 episodes)')
    ax.set_xlabel('Episode')
    ax.set_ylabel('Reward')
    ax.set_title('Episode Rewards Over Time')
    ax.legend()
    ax.grid()

    # Plot 2: Policy loss
    ax = axes[0, 1]
    ax.plot(policy_losses, alpha=0.3)
    loss_smooth = np.convolve(policy_losses, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(policy_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Policy Loss')
    ax.set_title('Policy Loss Over Time')
    ax.legend()
    ax.grid()

    # Plot 3: Entropy
    ax = axes[1, 0]
    ax.plot(entropies, label='Policy entropy')
    ax.set_xlabel('Step')
    ax.set_ylabel('Entropy')
    ax.set_title('Policy Entropy (Should Decrease)')
    ax.legend()
    ax.grid()

    # Plot 4: Value loss
    ax = axes[1, 1]
    ax.plot(value_losses, alpha=0.3)
    loss_smooth = np.convolve(value_losses, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(value_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Value Loss')
    ax.set_title('Value Loss Over Time')
    ax.legend()
    ax.grid()

    plt.tight_layout()
    plt.show()
```
## Part 6: Common Pitfalls and Red Flags
### Pitfall 1: "Bigger Network = Better Learning"
**Wrong**: Oversized networks overfit and learn slowly
**Right**: Start with small network (2-3 hidden layers, 64-256 units)
**Red Flag**: Network has > 10M parameters for simple task
**Fix**:
```python
import torch.nn as nn

# Too big
model = nn.Sequential(
    nn.Linear(4, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2)
)

# Right size
model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 2)
)
```
### Pitfall 2: "Random Seed Doesn't Matter"
**Wrong**: Different seeds give very different results (indicates instability)
**Right**: Results should be consistent across seeds (within reasonable variance)
**Red Flag**: Reward varies by 50%+ across 5 seeds
**Fix**:
```python
# Test across multiple seeds
rewards_by_seed = []
for seed in range(5):
    np.random.seed(seed)
    torch.manual_seed(seed)
    reward = train_agent(seed)
    rewards_by_seed.append(reward)

print(f"Mean: {np.mean(rewards_by_seed)}, Std: {np.std(rewards_by_seed)}")
if np.std(rewards_by_seed) > 0.5 * np.mean(rewards_by_seed):
    print("⚠️ High variance across seeds, training unstable")
```
### Pitfall 3: "Skip Observation Normalization"
**Wrong**: Non-normalized observations (scale [-1e6, 1e6])
**Right**: Normalized observations (scale [-1, 1])
**Red Flag**: State values > 100 or < -100
**Fix**:
```python
# Normalize images
observation = observation.astype(np.float32) / 255.0
# Normalize states
observation = (observation - observation_mean) / observation_std
# Or standardize on-the-fly
normalized_obs = (obs - running_mean) / (running_std + 1e-8)
```
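The `running_mean` / `running_std` above come from statistics you maintain during training. A minimal sketch of an online (Welford-style) observation normalizer:
```python
import numpy as np

class RunningObsNormalizer:
    """Track a running mean/variance of observations and normalize with them."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps

    def update(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count

    def normalize(self, obs):
        return (obs - self.mean) / (np.sqrt(self.var) + 1e-8)
```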
### Pitfall 4: "Ignore the Reward Curve Shape"
**Wrong**: Only look at final reward, ignore curve shape
**Right**: Curve shape tells you what's wrong
**Red Flag**: Curve shapes indicate:
- Flat then sudden jump: Long exploration, then found policy
- Oscillating: Unstable learning
- Decreasing after peak: Catastrophic forgetting
**Fix**:
```python
# Act on the shape of the (smoothed) reward curve; the flags below describe
# what you see when you inspect the plot
if curve_is_flat:
    print("Not learning, check environment/reward")
elif curve_oscillates:
    print("Unstable, check LR or reward scale")
elif curve_peaks_then_drops:
    print("Overfitting or exploration decay wrong")
```
### Pitfall 5: "Skip the Random Baseline Check"
**Wrong**: Train agent without knowing what random baseline is
**Right**: Always compute random baseline first
**Red Flag**: Agent barely beats random (within 5% of baseline)
**Fix**:
```python
# Compute random baseline
random_rewards = []
for _ in range(100):
    state = env.reset()
    episode_reward = 0
    for step in range(1000):
        action = env.action_space.sample()
        state, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            break
    random_rewards.append(episode_reward)

random_baseline = np.mean(random_rewards)
print(f"Random baseline: {random_baseline}")

# Compare agent
agent_reward = train_agent()
improvement = (agent_reward - random_baseline) / abs(random_baseline)  # abs() so negative baselines don't flip the sign
print(f"Agent improvement: {improvement*100}%")
```
### Pitfall 6: "Changing Multiple Hyperparameters at Once"
**Wrong**: Change 5 things, training breaks, don't know which caused it
**Right**: Change one thing at a time, test, measure, iterate
**Red Flag**: Code has "TUNING" comments with 10 simultaneous changes
**Fix**:
```python
# Scientific method for debugging
def debug_lr():
    for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
        reward = train_with_lr(lr)
        print(f"LR={lr}: Reward={reward}")
    # Only change LR, keep everything else the same

def debug_batch_size():
    for batch in [32, 64, 128, 256]:
        reward = train_with_batch(batch)
        print(f"Batch={batch}: Reward={reward}")
    # Only change batch size, keep everything else the same
```
### Pitfall 7: "Using Training Metrics to Judge Performance"
**Wrong**: Trust training reward, test once at the end
**Right**: Monitor test reward during training (with exploration off)
**Red Flag**: Training reward high, test reward low (overfitting)
**Fix**:
```python
# Evaluate with greedy policy (no exploration)
def evaluate(agent, num_episodes=10):
    episode_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        for step in range(1000):
            action = agent.act(state, explore=False)  # Greedy
            state, reward, done, _ = env.step(action)
            episode_reward += reward
            if done:
                break
        episode_rewards.append(episode_reward)
    return np.mean(episode_rewards)

# Monitor during training
for step in range(total_steps):
    train_agent_step()
    if step % 10000 == 0:
        test_reward = evaluate(agent)  # Evaluate periodically
        print(f"Step {step}: Test reward={test_reward}")
```
## Part 7: Red Flags Checklist
```
CRITICAL RED FLAGS (Stop and debug immediately):
[ ] NaN in loss or rewards
→ Check: reward scale, gradients, network outputs
[ ] Gradient norms > 100 (exploding)
→ Check: Enable gradient clipping, reduce LR
[ ] Gradient norms < 1e-4 (vanishing)
→ Check: Increase LR, check network initialization
[ ] Reward always same
→ Check: Is reward function broken? No differentiation?
[ ] Agent never improves beyond random baseline
→ Check: Reward scale, environment, observation, exploration
[ ] Loss oscillates wildly
→ Check: Learning rate (likely too high), reward scale
[ ] Episode length decreases over training
→ Check: Agent learning bad behavior, poor reward shaping
[ ] Test reward >> training reward
→ Check: Exploration noise depressing training reward, or train/eval setups differ
[ ] Training gets worse after improving
→ Check: Catastrophic forgetting, stability issue
IMPORTANT RED FLAGS (Debug within a few training runs):
[ ] Entropy not decaying (always high)
→ Check: Entropy regularization, exploration decay
[ ] Entropy goes to zero early
→ Check: Entropy coefficient too low, exploration too aggressive
[ ] Variance across seeds > 50% of mean
→ Check: Training is unstable or lucky, try more seeds
[ ] Network weights not changing
→ Check: Gradient zero, LR zero, network not connected
[ ] Loss = 0 (perfect fit)
→ Check: Network overfitting, reward too easy
MINOR RED FLAGS (Watch for patterns):
[ ] Training slower than expected
→ Check: LR too low, batch size too small, network too small
[ ] Occasional loss spikes
→ Check: Outlier data, reward outliers, clipping needed
[ ] Reward variance high
→ Check: Normal if environment stochastic, check if aligns with intent
[ ] Agent behavior seems random even late in training
→ Check: Entropy not decaying, exploration not stopping
```
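The first critical flag (NaN) is worth catching automatically rather than noticing hours later. A minimal guard to call after each update:
```python
import math
import numpy as np

def assert_finite(step, loss, reward, grad_norm=None):
    """Fail fast when training produces non-finite numbers."""
    if not math.isfinite(float(loss)):
        raise RuntimeError(f"Step {step}: non-finite loss {loss}")
    if not np.all(np.isfinite(reward)):
        raise RuntimeError(f"Step {step}: non-finite reward {reward}")
    if grad_norm is not None and not math.isfinite(float(grad_norm)):
        raise RuntimeError(f"Step {step}: non-finite gradient norm {grad_norm}")

# Call after each update, e.g.:
# assert_finite(step, loss.item(), reward, grad_norm)
```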
## Part 8: Rationalization Resistance
| Rationalization | Reality | Counter-Guidance |
|-----------------|---------|------------------|
| "Higher learning rate will speed up learning" | Can cause instability, often slows learning | Start with 3e-4, measure effect, don't assume |
| "Bigger network always learns better" | Oversized networks overfit, slow training | Start small (64-256 units), increase only if needed |
| "Random seed doesn't matter, RL is random anyway" | High variance indicates instability, not inherent randomness | Run 5+ seeds, variance should be low, not high |
| "I'll try all hyperparameters (grid search)" | Combinatorial explosion, wastes time, no diagnosis | Check environment/reward FIRST, then tune one param at a time |
| "Adding regularization helps unstable training" | Regularization is for overfitting, not instability | Instability usually LR or reward scale, not overfitting |
| "My algorithm is broken" | 80% chance environment, reward, or observation is broken | Check those FIRST before blaming algorithm |
| "More training always helps" | If reward plateaus, more training won't help | Check if training converged, if not why |
| "Skip observation normalization, network will learn to normalize" | Network should not spend capacity learning normalization | Normalize observations before network |
| "Test with epsilon > 0 to reduce variance" | Test should use learned policy, exploration harms test | Use greedy policy at test time |
| "If loss doesn't decrease, algorithm is broken" | More likely: reward scale wrong, gradient clipping needed | Check reward scale, enable gradient clipping before changing algorithm |
## Key Takeaways
1. **Follow the systematic process**: Don't random tweak. Check environment → reward → observation → algorithm.
2. **80/20 rule**: Most failures are in environment, reward, or observation. Check those first.
3. **Reward scale is critical**: Most common bug. Normalize to [-1, 1].
4. **Diagnosis trees**: Use them. Different symptoms have different root causes.
5. **Metrics tell you everything**: Loss, entropy, gradient norms reveal what's wrong.
6. **Rationalization is the enemy**: Don't assume, measure. Plot curves, check outputs, verify.
7. **Simple environment first**: If agent can't learn CartPole, bigger environment won't help.
8. **One seed is not enough**: Run 5+ seeds, look at variance, not just mean.
This skill is about **systematic debugging**, not random tweaking. Apply the framework, follow diagnosis trees, and you'll find the bug.