# RL Debugging Methodology

## When to Use This Skill

Invoke this skill when you encounter:

- **Agent Won't Learn**: Reward stuck at baseline, not improving
- **Training Unstable**: Loss bouncing, reward highly variable
- **Suboptimal Policy**: Agent learned something but worse than expected
- **Reward Hacking**: Agent gaming the reward function
- **Exploration Issues**: Agent stuck in local optimum or exploring poorly
- **Hyperparameter Sensitivity**: Small changes break training
- **Learning Rate Tuning**: Not sure what value is right
- **Convergence Problems**: Loss doesn't decrease or decreases then stops
- **Environment vs Algorithm**: Unsure if problem is environment or RL algorithm
- **Logging Confusion**: Not sure what metrics to monitor
- **Gradual Performance Degradation**: Early training good, later poor
- **Sparse Reward Challenge**: Agent never finds reward signal

**Core Problem**: RL debugging often becomes random hyperparameter tweaking. Agents are complex systems with many failure modes. Systematic diagnosis finds root causes; random tweaking wastes time and leads to contradictory findings.

## Do NOT Use This Skill For

- **Learning RL theory** (route to rl-foundations for MDPs, Bellman equations, policy gradients)
- **Implementing new algorithms** (route to algorithm-specific skills like value-based-methods, policy-gradient-methods, actor-critic-methods)
- **Environment API questions** (route to rl-environments for Gym/Gymnasium API, custom environments, wrappers)
- **Evaluation methodology** (route to rl-evaluation for rigorous statistical testing, generalization assessment)
- **Initial algorithm selection** (route to using-deep-rl router or rl-foundations for choosing the right algorithm family)

## Core Principle: The 80/20 Rule

**The most important insight in RL debugging:**

```
80% of RL failures are in:
1. Environment design (agent can't see true state)
2. Reward function (misaligned or wrong scale)
3. Observation/action representation (missing information)

15% are in:
4. Hyperparameters (learning rate, batch size, etc.)
5. Exploration strategy (too much or too little)

5% are in:
6. Algorithm selection (wrong algorithm for problem)
```

**Consequence**: If training fails, check environment and reward FIRST. Change the algorithm last.

### Why This Order?

**Scenario 1: Broken Environment**

```python
# BROKEN ENVIRONMENT: Agent can't win no matter what algorithm
class BrokenEnv:
    def reset(self):
        self.state = random_state()  # Agent can't control this
        return self.state

    def step(self, action):
        # Reward independent of action!
        reward = random.random()
        return self.state, reward

# No amount of PPO, DQN, SAC can learn from random reward

# CORRECT ENVIRONMENT: Agent can win with right policy
class CorrectEnv:
    def reset(self):
        self.state = initial_state
        return self.state

    def step(self, action):
        # Reward depends on action
        reward = compute_reward(self.state, action)
        self.state = compute_next_state(self.state, action)
        return self.state, reward
```

**If environment is broken, no algorithm will learn.**

**Scenario 2: Reward Scale Issue**

```python
# WRONG SCALE: Reward in [0, 1000000]
# Algorithm gradient updates: param = param - lr * grad
# If gradient huge (due to reward scale), single step breaks everything

# CORRECT SCALE: Reward in [-1, 1]
# Gradients are reasonable, learning stable

# Fix is simple: divide reward by scale factor
# But if you don't know to check reward scale, you'll try 10 learning rates instead
```

**Consequence: Always check reward scale before tuning learning rate.**

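
As a concrete illustration, the fix is often a one-line reward-scaling wrapper. This is a minimal sketch assuming a classic Gym-style environment; `scale` is whatever constant brings typical rewards into roughly [-1, 1] for your task:

```python
import gym


class ScaleReward(gym.RewardWrapper):
    """Divide every reward by a constant so it lands roughly in [-1, 1]."""

    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = float(scale)

    def reward(self, reward):
        return float(reward) / self.scale


# Usage sketch: estimate the scale from a random rollout, then wrap the env.
# env = ScaleReward(gym.make("MountainCar-v0"), scale=100.0)
```

The same idea works with any wrapper mechanism your framework provides; the point is to normalize once at the environment boundary instead of compensating with the learning rate.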

## Part 1: Systematic Debugging Framework

### The Debugging Process (Not Random Tweaking)

```
START: Agent not learning (or training unstable, or suboptimal)

Step 1: ENVIRONMENT CHECK (Does agent have what it needs?)
├─ Can agent see the state? (Is observation sufficient?)
├─ Is environment deterministic or stochastic? (Affects algorithm choice)
├─ Can agent actually win? (Does optimal policy exist?)
└─ Is environment reset working? (Fresh episode each reset?)

Step 2: REWARD SCALE CHECK (Is reward in reasonable range?)
├─ What's the range of rewards? (Min, max, typical)
├─ Are rewards normalized? (Should be ≈ [-1, 1])
├─ Is reward aligned with desired behavior? (No reward hacking)
└─ Are rewards sparse or dense? (Affects exploration strategy)

Step 3: OBSERVATION REPRESENTATION (Is information preserved?)
├─ Are observations normalized? (Images: [0, 255] or [0, 1]?)
├─ Is temporal information included? (Frame stacking for Atari?)
├─ Are observations consistent? (Same format each episode?)
└─ Is observation sufficient to solve problem? (Can human win from this info?)

Step 4: BASIC ALGORITHM CHECK (Is the RL algorithm working at all?)
├─ Run on simple environment (CartPole, simple task)
├─ Can algorithm learn on simple env? (If not: algorithm issue)
├─ Can algorithm beat random baseline? (If not: something is broken)
└─ Does loss decrease? (If not: learning not happening)

Step 5: HYPERPARAMETER TUNING (Only after above passed)
├─ Is learning rate in reasonable range? (1e-5 to 1e-3 typical)
├─ Is batch size appropriate? (Power of 2: 32, 64, 128, 256)
├─ Is exploration sufficient? (Epsilon decaying? Entropy positive?)
└─ Are network layers reasonable? (3 hidden layers typical)

Step 6: LOGGING ANALYSIS (What do the metrics say?)
├─ Policy loss: decreasing? exploding? zero?
├─ Value loss: decreasing? stable?
├─ Reward curve: trending up? flat? oscillating?
├─ Entropy: decreasing over time? (Exploration → exploitation)
└─ Gradient norms: reasonable? exploding? vanishing?

Step 7: IDENTIFY ROOT CAUSE (Synthesize findings)
└─ Where is the actual problem? (Environment, reward, algorithm, hyperparameters)
```

### Why This Order Matters

**Common mistake: Jump to Step 5 (hyperparameter tuning)**

```python
# Agent not learning. Frustration sets in.
# "I'll try learning rate 1e-4" (Step 5, skipped 1-4)
# Doesn't work.
# "I'll try batch size 64" (more Step 5 tweaking)
# Doesn't work.
# "I'll try a bigger network" (still Step 5)
# Doesn't work.
# Hours wasted.

# Correct approach: Follow Steps 1-4 first.
# Step 1: Oh! Environment reset is broken, always same initial state
# Fix environment.
# Now agent learns immediately with default hyperparameters.
```

**The order reflects probability**: It's more likely the environment is broken than the algorithm, and more likely the reward scale is wrong than the learning rate.

## Part 2: Diagnosis Trees by Symptom

### Diagnosis Tree 1: "Agent Won't Learn"

**Symptom**: Reward stuck near random baseline. Loss doesn't decrease meaningfully.

```
START: Agent Won't Learn

├─ STEP 1: Can agent beat random baseline?
│ ├─ YES → Skip to STEP 4
│ └─ NO → Environment issue likely
│   ├─ Check 1A: Is environment output sane?
│   │ ├─ Print first 5 episodes: state, action, reward, next_state
│   │ ├─ Verify types match (shapes, ranges, dtypes)
│   │ └─ Is reward always same? Always zero? (Red flag: no signal)
│   ├─ Check 1B: Can you beat it manually?
│   │ ├─ Play environment by hand (hardcode a policy)
│   │ ├─ Can you get >0 reward? (If not: environment is broken)
│   │ └─ If yes: Agent is missing something
│   └─ Check 1C: Is reset working?
│     ├─ Call reset() twice, check states differ
│     └─ If states same: reset is broken, fix it

├─ STEP 2: Is reward scale reasonable?
│ ├─ Compute: min, max, mean, std of rewards from random policy
│ ├─ If range >> 1 (e.g., [0, 10000]):
│ │ ├─ Action: Normalize rewards to [-1, 1]
│ │ ├─ Code: reward = reward / max_possible_reward
│ │ └─ Retest: Usually fixes "won't learn"
│ ├─ If range << 1 (e.g., [0, 0.001]):
│ │ ├─ Action: Scale up rewards
│ │ ├─ Code: reward = reward * 1000
│ │ └─ Or increase network capacity (more signal needed)
│ └─ If reward is [0, 1] (looks fine):
│   └─ Continue to STEP 3

├─ STEP 3: Is observation sufficient?
│ ├─ Check 3A: Are observations normalized?
│ │ ├─ If images [0, 255]: normalize to [0, 1] or [-1, 1]
│ │ ├─ Code: observation = observation / 255.0
│ │ └─ Retest
│ ├─ Check 3B: Is temporal info included? (For vision: frame stacking)
│ │ ├─ If using images: last 4 frames stacked?
│ │ ├─ If using states: includes velocity/derivatives?
│ │ └─ Missing temporal info → agent can't infer velocity
│ └─ Check 3C: Is observation Markovian?
│   ├─ Can optimal policy be derived from this observation?
│   ├─ If not: observation insufficient (red flag)
│   └─ Example: Only position, not velocity → agent can't control

├─ STEP 4: Run sanity check on simple environment
│ ├─ Switch to CartPole or equivalent simple env
│ ├─ Train with default hyperparameters
│ ├─ Does simple env learn? (Should learn in 1000-5000 steps)
│ ├─ YES → Your algorithm works, issue is your env/hyperparameters
│ └─ NO → Algorithm itself broken (rare, check algorithm implementation)

├─ STEP 5: Check exploration
│ ├─ Is agent exploring or stuck?
│ ├─ Log entropy (for stochastic policies)
│ ├─ If entropy → 0 early: agent exploiting before exploring
│ │ └─ Solution: Increase entropy regularization or ε
│ ├─ If entropy always high: too much exploration
│ │ └─ Solution: Decay entropy or ε more aggressively
│ └─ Visualize: Plot policy actions over time, should see diversity early

├─ STEP 6: Check learning rate
│ ├─ Is learning rate in [1e-5, 1e-3]? (typical range)
│ ├─ If > 1e-3: Try reducing (might be too aggressive)
│ ├─ If < 1e-5: Try increasing (might be too conservative)
│ ├─ Watch loss first step: If loss increases → LR too high
│ └─ Safe default: 3e-4

└─ STEP 7: Check network architecture
    ├─ For continuous control: small networks ok (1-2 hidden layers, 64-256 units)
    ├─ For vision: use CNN (don't use FC on pixels)
    ├─ Check if network has enough capacity
    └─ Tip: Start with simple, add complexity if needed
```

**ROOT CAUSES in order of likelihood:**

1. **Reward scale wrong** (40% of cases)
2. **Environment broken** (25% of cases)
3. **Observation insufficient** (15% of cases)
4. **Learning rate too high/low** (12% of cases)
5. **Algorithm issue** (8% of cases)

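
STEP 4 of the tree above (sanity-check the algorithm on a simple environment) can be scripted. A sketch using Stable-Baselines3 and CartPole, assuming that library is installed (depending on your versions you may need `gymnasium` instead of `gym`); any trusted reference implementation serves the same purpose:

```python
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train a reference PPO agent on an easy task with default hyperparameters.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=50_000)

# If mean reward approaches 500, the algorithm side works; suspect your env/reward instead.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"CartPole sanity check: {mean_reward:.1f} +/- {std_reward:.1f}")
```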

### Diagnosis Tree 2: "Training Unstable"

**Symptom**: Loss bounces wildly, reward spikes then crashes, training oscillates.

```
START: Training Unstable

├─ STEP 1: Characterize the instability
│ ├─ Plot loss curve: Does it bounce at same magnitude or grow?
│ ├─ Plot reward curve: Does it oscillate around mean or trend down?
│ ├─ Compute: reward variance over 100 episodes
│ └─ This tells you: Is it normal variance or pathological instability?

├─ STEP 2: Check if environment is deterministic
│ ├─ Deterministic environment + stochastic policy = normal variance
│ ├─ Stochastic environment + any policy = high variance (expected)
│ ├─ If stochastic: Can you reduce randomness? Or accept higher variance?
│ └─ Some instability is normal; distinguish from pathological

├─ STEP 3: Check reward scale
│ ├─ If rewards >> 1: Gradient updates too large
│ │ ├─ Single step might overshoot optimum
│ │ ├─ Solution: Normalize rewards to [-1, 1]
│ │ └─ This often fixes instability immediately
│ ├─ If reward has outliers: Single large reward breaks training
│ │ ├─ Solution: Reward clipping or scaling
│ │ └─ Example: r = np.clip(reward, -1, 1)
│ └─ Check: Is reward scale consistent?

├─ STEP 4: Check learning rate (LR often causes instability)
│ ├─ If loss oscillates: LR likely too high
│ │ ├─ Try reducing by 2-5× (e.g., 1e-3 → 3e-4)
│ │ ├─ Watch first 100 steps: Loss should decrease monotonically
│ │ └─ If still oscillates: try 10× reduction
│ ├─ If you have LR scheduler: Check if it's too aggressive
│ │ ├─ Scheduler reducing LR too fast can stall learning
│ │ └─ Solution: Slower schedule (more steps to final LR)
│ └─ Test: Set LR very low (1e-5), see if training is smooth
│   ├─ YES → Increase LR gradually until instability starts
│   └─ This bracketing finds safe LR range

├─ STEP 5: Check batch size
│ ├─ Small batch (< 32): High gradient variance, bouncy updates
│ │ ├─ Solution: Increase batch size (32, 64, 128)
│ │ └─ But not too large: training becomes slow
│ ├─ Large batch (> 512): Might overfit, large gradient steps
│ │ ├─ Solution: Use gradient accumulation
│ │ └─ Or reduce learning rate slightly
│ └─ Start with batch_size=64, adjust if needed

├─ STEP 6: Check gradient clipping
│ ├─ Are gradients exploding? (Check max gradient norm)
│ │ ├─ If max grad norm > 100: Likely exploding gradients
│ │ ├─ Solution: Enable gradient clipping (max_norm=1.0)
│ │ └─ Code: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
│ ├─ If max grad norm reasonable (< 10): Skip this step
│ └─ Watch grad norm over training: Should stay roughly constant

├─ STEP 7: Check algorithm-specific parameters
│ ├─ For PPO: Is clipping epsilon reasonable? (0.2 default)
│ │ ├─ Too high: Allows large policy updates, instability
│ │ └─ Too low: Over-clips, policy barely updates
│ ├─ For DQN: Is target network update frequency appropriate?
│ │ ├─ Update too often: Target constantly changing
│ │ └─ Update too rarely: Stale targets
│ └─ For A3C/A2C: Check entropy coefficient
│   ├─ Too high: Too much exploration, policy noisy
│   └─ Too low: Premature convergence

└─ STEP 8: Check exploration decay
    ├─ Is exploration decaying too fast? (Policy becomes deterministic)
    │ └─ If entropy→0 early: Agent exploits before exploring
    ├─ Is exploration decaying too slow? (Policy stays noisy)
    │ └─ If entropy stays high: Too much randomness in later training
    └─ Entropy should decay: high early, low late
      └─ Plot entropy over training: should show clear decay curve
```

**ROOT CAUSES in order of likelihood:**

1. **Learning rate too high** (35% of cases)
2. **Reward scale too large** (25% of cases)
3. **Batch size too small** (15% of cases)
4. **Gradient explosion** (10% of cases)
5. **Algorithm parameters** (10% of cases)
6. **Environment stochasticity** (5% of cases)

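
The bracketing test from STEP 4 of this tree can be automated. A rough sketch, where `train_for_steps(lr, n_steps)` is a placeholder for your own training loop returning the per-update loss history; the "smooth" heuristic here is crude and only meant as a starting point:

```python
import numpy as np


def lr_bracket(train_for_steps, lrs=(1e-5, 3e-5, 1e-4, 3e-4, 1e-3), n_steps=2_000):
    """Increase LR until the loss curve stops decreasing smoothly; return the largest safe LR."""
    safe = []
    for lr in lrs:
        losses = np.asarray(train_for_steps(lr, n_steps))
        decreasing = losses[-100:].mean() < losses[:100].mean()
        smooth = np.std(np.diff(losses)) < np.std(losses)  # crude oscillation check
        print(f"lr={lr:.0e}  decreasing={decreasing}  smooth={smooth}")
        if decreasing and smooth:
            safe.append(lr)
    return max(safe) if safe else None
```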

### Diagnosis Tree 3: "Suboptimal Policy"

**Symptom**: Agent learned something but performs worse than expected. Better than random baseline, but not good enough.

```
START: Suboptimal Policy

├─ STEP 1: How suboptimal? (Quantify the gap)
│ ├─ Compute: Agent reward vs theoretical optimal
│ ├─ If 80% of optimal: Normal (RL usually gets 80-90% optimal)
│ ├─ If 50% of optimal: Significantly suboptimal
│ ├─ If 20% of optimal: Very bad
│ └─ This tells you: Is it "good enough" or truly broken?

├─ STEP 2: Is it stuck in local optimum?
│ ├─ Run multiple seeds: Do you get similar reward each seed?
│ ├─ If rewards similar across seeds: Consistent local optimum
│ ├─ If rewards vary wildly: High variance, need more training
│ └─ Solution if local optimum: More exploration or better reward shaping

├─ STEP 3: Check reward hacking
│ ├─ Visualize agent behavior: Does it match intent?
│ ├─ Example: CartPole reward is +1 per timestep
│ │ ├─ Agent might learn: "Stay in center, don't move"
│ │ ├─ Policy is suboptimal but still gets reward
│ │ └─ Solution: Reward engineering (bonus for progress)
│ └─ Hacking signs:
│   ├─ Agent does something weird but gets reward
│   ├─ Behavior makes no intuitive sense
│   └─ Reward increases but performance bad

├─ STEP 4: Is exploration sufficient?
│ ├─ Check entropy: Does policy explore initially?
│ ├─ Check epsilon decay (if using ε-greedy): Does it decay appropriately?
│ ├─ Is agent exploring broadly or stuck in small region?
│ ├─ Solution: Slower exploration decay or intrinsic motivation
│ └─ Use RND/curiosity if environment has sparse rewards

├─ STEP 5: Check network capacity
│ ├─ Is network too small to represent optimal policy?
│ ├─ For vision: Use standard CNN (not tiny network)
│ ├─ For continuous control: 2-3 hidden layers, 128-256 units
│ ├─ Test: Double network size, does performance improve?
│ └─ If yes: Original network was too small

├─ STEP 6: Check data efficiency
│ ├─ Is agent training long enough?
│ ├─ RL usually needs: simple tasks 100k steps, complex tasks 1M+ steps
│ ├─ If training only 10k steps: Too short, agent didn't converge
│ ├─ Solution: Train longer (but check reward curve first)
│ └─ If reward plateaus early: Extending training won't help

├─ STEP 7: Check observation and action spaces
│ ├─ Is action space continuous or discrete?
│ ├─ Is action discretization appropriate?
│ │ ├─ Too coarse: Can't express fine control
│ │ ├─ Too fine: Huge action space, hard to learn
│ │ └─ Example: 100 actions for simple control = too many
│ ├─ Is observation sufficient? (See Diagnosis Tree 1, Step 3)
│ └─ Missing information in observation = impossible to be optimal

├─ STEP 8: Check reward structure
│ ├─ Is reward dense or sparse?
│ ├─ Sparse reward + suboptimal policy: Agent might not be exploring to good region
│ │ ├─ Solution: Reward shaping (bonus for progress)
│ │ └─ Or: Intrinsic motivation (RND/curiosity)
│ ├─ Dense reward + suboptimal: Possible misalignment with intent
│ └─ Can you improve by reshaping reward?

└─ STEP 9: Compare with baseline algorithm
    ├─ Run reference implementation on same env
    ├─ Does reference get better reward?
    ├─ YES → Your implementation has a bug
    ├─ NO → Problem is inherent to algorithm or environment
    └─ This isolates: Implementation issue vs fundamental difficulty
```

**ROOT CAUSES in order of likelihood:**

1. **Exploration insufficient** (30% of cases)
2. **Training not long enough** (25% of cases)
3. **Reward hacking** (20% of cases)
4. **Network too small** (12% of cases)
5. **Observation insufficient** (8% of cases)
6. **Algorithm mismatch** (5% of cases)

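
For STEP 1 of this tree, the gap can be quantified directly when you know an upper bound or have a strong reference policy. A minimal sketch; `optimal_reward` is an assumption you supply (a known bound such as 500 for CartPole-v1, or the score of a hand-written reference policy):

```python
import numpy as np


def optimality_gap(agent_rewards, random_rewards, optimal_reward):
    """Place the agent on the random-to-optimal scale (0 = random, 1 = optimal)."""
    agent = np.mean(agent_rewards)
    rand = np.mean(random_rewards)
    frac = (agent - rand) / (optimal_reward - rand + 1e-8)
    print(f"Agent reaches {100 * frac:.0f}% of the random-to-optimal range")
    return frac
```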

## Part 3: What to Check First

### Critical Checks (Do These First)

#### Check 1: Reward Scale Analysis

**Why**: Reward scale is the MOST COMMON source of RL failures.

```python
# DIAGNOSTIC SCRIPT
import numpy as np

# Collect rewards from random policy
rewards = []
for episode in range(100):
    state = env.reset()
    for step in range(1000):
        action = env.action_space.sample()  # Random action
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

rewards = np.array(rewards)

print(f"Reward statistics from random policy:")
print(f"  Min: {rewards.min()}")
print(f"  Max: {rewards.max()}")
print(f"  Mean: {rewards.mean()}")
print(f"  Std: {rewards.std()}")
print(f"  Range: [{rewards.min()}, {rewards.max()}]")

# RED FLAGS
if abs(rewards.max()) > 100 or abs(rewards.min()) > 100:
    print("⚠️ RED FLAG: Rewards >> 1, normalize them!")

if rewards.std() > 10:
    print("⚠️ RED FLAG: High reward variance, normalize or clip")

if rewards.mean() == rewards.max():
    print("⚠️ RED FLAG: Constant rewards, no signal to learn from!")

if (np.abs(rewards) <= 1).all():
    print("✓ Reward scale looks reasonable ([-1, 1] range)")
```

**Action if scale is wrong:**

```python
# Normalize to [-1, 1]
reward = reward / max(abs(rewards.max()), abs(rewards.min()))

# Or clip
reward = np.clip(reward, -1, 1)

# Or shift and scale
reward = 2 * (reward - rewards.min()) / (rewards.max() - rewards.min()) - 1
```

#### Check 2: Environment Sanity Check

**Why**: Broken environment → no algorithm will work.

```python
# DIAGNOSTIC SCRIPT
import numpy as np


def sanity_check_env(env, num_episodes=5):
    """Quick check if environment is sane."""

    for episode in range(num_episodes):
        state = env.reset()
        print(f"\nEpisode {episode}:")
        print(f"  Initial state shape: {state.shape}, dtype: {state.dtype}")
        print(f"  Initial state range: [{state.min()}, {state.max()}]")

        for step in range(10):
            action = env.action_space.sample()
            next_state, reward, done, info = env.step(action)

            print(f"  Step {step}: action={action}, reward={reward}, done={done}")
            print(f"  State shape: {next_state.shape}, range: [{next_state.min()}, {next_state.max()}]")

            # Check for NaN
            if np.isnan(next_state).any() or np.isnan(reward):
                print("  ⚠️ NaN detected!")

            # Check for reasonable values
            if np.abs(next_state).max() > 1e6:
                print("  ⚠️ State explosion (values > 1e6)")

            if done:
                break

    print("\n✓ Environment check complete")


sanity_check_env(env)
```

**RED FLAGS:**

- NaN or inf in observations/rewards
- State values exploding (> 1e6)
- Reward always same (no signal)
- Done flag never true (infinite episodes)
- State never changes despite actions

#### Check 3: Can You Beat It Manually?

**Why**: If a human can't solve it, the agent won't either (unless reward hacking).

```python
# Manual policy: Hardcoded behavior
def manual_policy(state):
    # Example for CartPole: if pole tilting right, push right
    if state[2] > 0:  # angle > 0
        return 1  # Push right
    else:
        return 0  # Push left


# Test manual policy
total_reward = 0
for episode in range(10):
    state = env.reset()
    for step in range(500):
        action = manual_policy(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break

avg_reward = total_reward / 10
print(f"Manual policy average reward: {avg_reward}")

# If avg_reward > 0: Environment is learnable
# If avg_reward ≤ 0: Environment is broken or impossible
```

#### Check 4: Observation Normalization

**Why**: Non-normalized observations cause learning problems.

```python
# Check if observations are normalized
for episode in range(10):
    state = env.reset()
    print(f"Episode {episode}: state range [{state.min()}, {state.max()}]")

# For images: should be [0, 1] or [-1, 1]
# For physical states: should be roughly [-1, 1]

if state.min() < -10 or state.max() > 10:
    print("⚠️ Observations not normalized!")
    # Solution:
    state = state / np.abs(state).max()  # Normalize
```

## Part 4: Common RL Bugs Catalog

### Bug 1: Reward Scale > 1

**Symptom**: Training unstable, loss spikes, agent doesn't learn

**Root Cause**: Gradients too large due to reward scale

**Code Example**:

```python
# WRONG: Reward in [0, 1000]
reward = success_count * 1000

# CORRECT: Normalize to [-1, 1]
reward = success_count * 1000
reward = reward / max_possible_reward  # Result: [-1, 1]
```

**Fix**: Divide rewards by max possible value

**Detection**:

```python
# rewards: list of rewards collected from ~100 random-policy episodes
if max(abs(r) for r in rewards) > 1:
    print("⚠️ Reward scale issue detected")
```

### Bug 2: Environment Reset Broken

**Symptom**: Agent learns initial state but can't adapt

**Root Cause**: Reset doesn't randomize initial state or returns same state

**Code Example**:

```python
# WRONG: Reset always same state
def reset(self):
    self.state = np.array([0, 0, 0, 0])  # Always [0,0,0,0]
    return self.state

# CORRECT: Reset randomizes initial state
def reset(self):
    self.state = np.random.uniform(-0.1, 0.1, size=4)  # Random
    return self.state
```

**Fix**: Make reset() randomize initial state

**Detection**:

```python
states = [env.reset() for _ in range(10)]
if len(set(map(tuple, states))) == 1:
    print("⚠️ Reset broken, always same state")
```

### Bug 3: Observation Insufficient (Partial Observability)

**Symptom**: Agent can't learn because it doesn't see enough

**Root Cause**: Observation missing velocity, derivatives, or temporal info

**Code Example**:

```python
# WRONG: Only position, no velocity
state = np.array([position])  # Can't infer velocity from position alone

# CORRECT: Position + velocity
state = np.array([position, velocity])

# WRONG for images: Single frame
observation = env.render()  # Single frame, no temporal info

# CORRECT for images: Stacked frames
frames = [frame_tm3, frame_tm2, frame_tm1, frame_t]  # last 4 frames (t-3 ... t)
observation = np.stack(frames, axis=-1)  # Shape: (84, 84, 4)
```

**Fix**: Add missing information to observation

**Detection**:

```python
# If agent converges to bad performance despite long training
# Check: Can you compute optimal action from observation?
# If no: Observation is insufficient
```
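
For the image case, frame stacking is usually done with a small wrapper. A minimal sketch assuming a Gym-style image environment and the classic 4-tuple `step` API used elsewhere in this document; many libraries ship an equivalent FrameStack wrapper, so treat this as illustrative:

```python
from collections import deque

import gym
import numpy as np


class FrameStack(gym.Wrapper):
    """Keep the last k observations and return them stacked along the last axis."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        for _ in range(self.k):
            self.frames.append(obs)  # fill the stack with the first frame
        return np.stack(self.frames, axis=-1)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.stack(self.frames, axis=-1), reward, done, info
```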

### Bug 4: Reward Always Same (No Signal)

**Symptom**: Loss decreases but the reward stays flat over time

**Root Cause**: Reward is constant or nearly constant

**Code Example**:

```python
# WRONG: Constant reward
reward = 1.0  # Every step gets +1, no differentiation

# CORRECT: Differentiate good and bad outcomes
if reached_goal:
    reward = 1.0
else:
    reward = 0.0  # Or -0.1 for living cost
```

**Fix**: Ensure reward differentiates outcomes

**Detection**:

```python
# rewards: np.array of rewards collected with a random policy
if rewards.std() < 0.01:
    print("⚠️ Reward has no variance, no signal to learn")
```

### Bug 5: Learning Rate Too High

**Symptom**: Loss oscillates or explodes, training unstable

**Root Cause**: Gradient updates too large, overshooting optimum

**Code Example**:

```python
# WRONG: Learning rate 1e-2 (too high)
optimizer = Adam(model.parameters(), lr=1e-2)

# CORRECT: Learning rate 3e-4 (safe default)
optimizer = Adam(model.parameters(), lr=3e-4)
```

**Fix**: Reduce learning rate by 2-5×

**Detection**:

```python
# Watch loss first 100 steps
# If loss increases first step: LR too high
# If loss decreases but oscillates: LR probably high
```
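
The detection idea above can be made concrete. A small sketch, assuming `losses` holds the per-update losses from roughly the first 100 updates of a fresh run:

```python
import numpy as np


def early_lr_check(losses):
    """Flag an LR that is probably too high from the first ~100 update losses."""
    losses = np.asarray(losses, dtype=float)
    if losses[1] > 1.5 * losses[0]:
        print("⚠️ Loss jumped on the first update: LR probably too high")
    elif np.std(np.diff(losses)) > np.std(losses):
        print("⚠️ Loss oscillating: try reducing LR by 2-5x")
    else:
        print("First updates look stable")
```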

### Bug 6: Learning Rate Too Low

**Symptom**: Agent learns very slowly, training takes forever

**Root Cause**: Gradient updates too small, learning crawls

**Code Example**:

```python
# WRONG: Learning rate 1e-6 (too low)
optimizer = Adam(model.parameters(), lr=1e-6)

# CORRECT: Learning rate 3e-4
optimizer = Adam(model.parameters(), lr=3e-4)
```

**Fix**: Increase learning rate by 2-5×

**Detection**:

```python
# Training curve increases very slowly
# If training 1M steps and reward barely improved: LR too low
```

### Bug 7: No Exploration Decay

**Symptom**: Agent learns but remains noisy, doesn't fully exploit

**Root Cause**: Exploration (epsilon or entropy) not decaying

**Code Example**:

```python
# WRONG: Constant epsilon
epsilon = 0.3  # Forever

# CORRECT: Decay epsilon
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.01)
```

**Fix**: Add exploration decay schedule

**Detection**:

```python
# Plot entropy or epsilon over training
# Should show clear decay from high to low
# If flat: not decaying
```
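
`epsilon_linear` in the example above (and in Bug 8 below) is not a library function; it stands for a plain linear decay schedule. A minimal sketch:

```python
def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01):
    """Linearly anneal epsilon from epsilon_start to epsilon_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return epsilon_start + frac * (epsilon_end - epsilon_start)
```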

### Bug 8: Exploration Decay Too Fast

**Symptom**: Agent plateaus early, stuck in local optimum

**Root Cause**: Exploration stops before finding good policy

**Code Example**:

```python
# WRONG: Decays to zero in 10k steps (for 1M step training)
epsilon = 0.999 ** step  # Near zero after ~10k steps

# CORRECT: Decays over full training
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.01)
```

**Fix**: Use longer decay schedule

**Detection**:

```python
# Plot epsilon over training
# Should reach final value at 50-80% through training
# Not at 5%
```

### Bug 9: Reward Hacking

**Symptom**: Agent achieves high reward but behavior is useless

**Root Cause**: Agent found a way to game a reward not aligned with intent

**Code Example**:

```python
# WRONG: Reward for just staying alive
reward = 1.0  # Every timestep
# Agent learns: Stay in corner, don't move, get infinite reward

# CORRECT: Reward for progress + living cost
position_before = self.state[0]
self.state = compute_next_state(...)
position_after = self.state[0]
progress = position_after - position_before

reward = progress - 0.01  # Progress bonus, living cost
```

**Fix**: Reshape reward to align with intent

**Detection**:

```python
# Visualize agent behavior
# If behavior weird but reward high: hacking
# If reward increases but task performance bad: hacking
```
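
One way to make this detection concrete is to log an independent task metric next to the reward and watch for divergence. A sketch; `agent` and `env` are your own objects, and `task_success(info)` is a hypothetical, environment-specific measure of what you actually want:

```python
import numpy as np

reward_log, success_log = [], []

for episode in range(50):
    state = env.reset()
    episode_reward, episode_success = 0.0, 0.0
    done = False
    while not done:
        action = agent.act(state)
        state, reward, done, info = env.step(action)
        episode_reward += reward
        episode_success = max(episode_success, task_success(info))  # hypothetical metric
    reward_log.append(episode_reward)
    success_log.append(episode_success)

# Reward trending up while the task metric stays flat is a strong hacking signal.
if np.mean(reward_log[-10:]) > np.mean(reward_log[:10]) and np.mean(success_log) < 0.1:
    print("⚠️ Reward rising but task success flat: likely reward hacking")
```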

### Bug 10: Testing with Exploration

**Symptom**: Test performance much worse than training, high variance

**Root Cause**: Using stochastic policy at test time

**Code Example**:

```python
# WRONG: Test with epsilon > 0
for test_episode in range(100):
    action = epsilon_greedy(q_values, epsilon=0.05)  # Wrong!
    # Agent still explores at test

# CORRECT: Test greedy
for test_episode in range(100):
    action = np.argmax(q_values)  # Deterministic
```

**Fix**: Use greedy/deterministic policy at test time

**Detection**:

```python
# Test reward variance high?
# Test reward < train reward?
# Check: Are you using exploration at test time?
```

## Part 5: Logging and Monitoring

### What Metrics to Track

```python
# Minimal set of metrics for RL debugging
class RLLogger:
    def __init__(self):
        self.episode_rewards = []
        self.policy_losses = []
        self.value_losses = []
        self.entropies = []
        self.gradient_norms = []

    def log_episode(self, episode_reward):
        self.episode_rewards.append(episode_reward)

    def log_losses(self, policy_loss, value_loss, entropy):
        self.policy_losses.append(policy_loss)
        self.value_losses.append(value_loss)
        self.entropies.append(entropy)

    def log_gradient_norm(self, norm):
        self.gradient_norms.append(norm)

    def plot_training(self):
        """Visualize training progress."""
        # Plot 1: Episode rewards over time (smoothed)
        # Plot 2: Policy and value losses
        # Plot 3: Entropy (should decay)
        # Plot 4: Gradient norms
        pass
```
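
A sketch of how such a logger slots into a training loop; the `agent.train_one_step()` call and its return values are hypothetical placeholders for your own update code:

```python
logger = RLLogger()
total_steps = 1_000_000

for step in range(total_steps):
    # Hypothetical: one update that returns the finished-episode reward (or None)
    # plus the losses produced by that update.
    episode_reward, (policy_loss, value_loss, entropy, grad_norm) = agent.train_one_step()

    logger.log_losses(policy_loss, value_loss, entropy)
    logger.log_gradient_norm(grad_norm)
    if episode_reward is not None:  # an episode just finished
        logger.log_episode(episode_reward)

    if step % 10_000 == 0:
        logger.plot_training()
```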

### What Each Metric Means

#### Metric 1: Episode Reward

**What to look for**:

- Should trend upward over time
- Should have decreasing variance (less oscillation)
- Slight noise is normal

**Red flags**:

- Flat line: Not learning
- Downward trend: Getting worse
- Wild oscillations: Instability or unlucky randomness

**Code**:

```python
rewards = agent.get_episode_rewards()
reward_smoothed = np.convolve(rewards, np.ones(100)/100, mode='valid')
plt.plot(reward_smoothed)  # Smooth to see trend
```

#### Metric 2: Policy Loss

**What to look for**:

- Should decrease over training
- Decrease should smooth out (not oscillating)

**Red flags**:

- Loss increasing: Learning rate too high
- Loss oscillating: Learning rate too high or reward scale wrong
- Loss = 0: Policy not updating

**Code**:

```python
if policy_loss > policy_loss_prev:
    print("⚠️ Policy loss increased, LR might be too high")
```

#### Metric 3: Value Loss (for critic-based methods)

**What to look for**:

- Should decrease initially, then plateau
- Should not oscillate heavily

**Red flags**:

- Loss exploding: LR too high
- Loss not changing: Not updating

**Code**:

```python
value_loss_smoothed = np.convolve(value_losses, np.ones(100)/100, mode='valid')
if value_loss_smoothed[-1] > value_loss_smoothed[-100]:
    print("⚠️ Value loss increasing recently")
```

#### Metric 4: Entropy (Policy Randomness)

**What to look for**:

- Should start high (exploring)
- Should decay to low (exploiting)
- Clear downward trend

**Red flags**:

- Entropy always high: Too much exploration
- Entropy drops to zero early: Premature exploitation
- No decay: Entropy not decreasing

**Code**:

```python
if entropy[-1] > entropy[-100]:
    print("⚠️ Entropy increasing, exploration not decaying")
```

#### Metric 5: Gradient Norms

**What to look for**:

- Should stay roughly constant over training
- Typical range: 0.1 to 10

**Red flags**:

- Gradient norms > 100: Exploding gradients
- Gradient norms < 0.001: Vanishing gradients
- Sudden spikes: Outlier data or numerical issue

**Code**:

```python
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:  # skip params without gradients
        total_norm += p.grad.norm(2).item() ** 2
total_norm = total_norm ** 0.5

if total_norm > 100:
    print("⚠️ Gradient explosion detected")
```

### Visualization Script

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_rl_training(rewards, policy_losses, value_losses, entropies):
    """Plot training metrics for RL debugging."""

    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Plot 1: Episode rewards
    ax = axes[0, 0]
    ax.plot(rewards, alpha=0.3, label='Episode reward')
    reward_smooth = np.convolve(rewards, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(rewards)), reward_smooth, label='Smoothed (100 episodes)')
    ax.set_xlabel('Episode')
    ax.set_ylabel('Reward')
    ax.set_title('Episode Rewards Over Time')
    ax.legend()
    ax.grid()

    # Plot 2: Policy loss
    ax = axes[0, 1]
    ax.plot(policy_losses, alpha=0.3)
    loss_smooth = np.convolve(policy_losses, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(policy_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Policy Loss')
    ax.set_title('Policy Loss Over Time')
    ax.legend()
    ax.grid()

    # Plot 3: Entropy
    ax = axes[1, 0]
    ax.plot(entropies, label='Policy entropy')
    ax.set_xlabel('Step')
    ax.set_ylabel('Entropy')
    ax.set_title('Policy Entropy (Should Decrease)')
    ax.legend()
    ax.grid()

    # Plot 4: Value loss
    ax = axes[1, 1]
    ax.plot(value_losses, alpha=0.3)
    loss_smooth = np.convolve(value_losses, np.ones(100)/100, mode='valid')
    ax.plot(range(99, len(value_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Value Loss')
    ax.set_title('Value Loss Over Time')
    ax.legend()
    ax.grid()

    plt.tight_layout()
    plt.show()
```

## Part 6: Common Pitfalls and Red Flags

### Pitfall 1: "Bigger Network = Better Learning"

**Wrong**: Oversized networks overfit and learn slowly

**Right**: Start with small network (2-3 hidden layers, 64-256 units)

**Red Flag**: Network has > 10M parameters for simple task

**Fix**:

```python
import torch.nn as nn

# Too big
model = nn.Sequential(
    nn.Linear(4, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2)
)

# Right size
model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 2)
)
```

### Pitfall 2: "Random Seed Doesn't Matter"

**Wrong**: Different seeds give very different results (indicates instability)

**Right**: Results should be consistent across seeds (within reasonable variance)

**Red Flag**: Reward varies by 50%+ across 5 seeds

**Fix**:

```python
# Test across multiple seeds
rewards_by_seed = []
for seed in range(5):
    np.random.seed(seed)
    torch.manual_seed(seed)
    reward = train_agent(seed)
    rewards_by_seed.append(reward)

print(f"Mean: {np.mean(rewards_by_seed)}, Std: {np.std(rewards_by_seed)}")
if np.std(rewards_by_seed) > 0.5 * np.mean(rewards_by_seed):
    print("⚠️ High variance across seeds, training unstable")
```

### Pitfall 3: "Skip Observation Normalization"

**Wrong**: Non-normalized observations (scale [-1e6, 1e6])

**Right**: Normalized observations (scale [-1, 1])

**Red Flag**: State values > 100 or < -100

**Fix**:

```python
# Normalize images
observation = observation.astype(np.float32) / 255.0

# Normalize states
observation = (observation - observation_mean) / observation_std

# Or standardize on-the-fly
normalized_obs = (obs - running_mean) / (running_std + 1e-8)
```
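
The on-the-fly variant needs running statistics. A minimal Welford-style sketch; this is illustrative only, and libraries such as Stable-Baselines3 ship an equivalent (`VecNormalize`):

```python
import numpy as np


class RunningNormalizer:
    """Track a running mean/std of observations and standardize them."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, obs):
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        # Incremental update of the (population) variance
        self.var += (delta * (obs - self.mean) - self.var) / self.count

    def normalize(self, obs):
        self.update(obs)
        return (obs - self.mean) / (np.sqrt(self.var) + self.eps)
```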

### Pitfall 4: "Ignore the Reward Curve Shape"

**Wrong**: Only look at final reward, ignore curve shape

**Right**: Curve shape tells you what's wrong

**Red Flag**: Curve shapes indicate:

- Flat then sudden jump: Long exploration, then found policy
- Oscillating: Unstable learning
- Decreasing after peak: Catastrophic forgetting

**Fix**:

```python
# Look at curve shape (pseudocode):
# if the reward curve is flat:
#     -> not learning: check environment/reward
# elif the reward curve oscillates:
#     -> unstable: check LR or reward scale
# elif the reward curve peaks then drops:
#     -> overfitting or exploration decay wrong
```
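
A rough numeric version of these shape checks, assuming `rewards` is a 1-D array of episode rewards; the thresholds are arbitrary and task-dependent:

```python
import numpy as np


def diagnose_curve(rewards, window=100):
    """Crude classification of a reward curve into flat / oscillating / peaked-then-dropped."""
    rewards = np.asarray(rewards, dtype=float)
    smooth = np.convolve(rewards, np.ones(window) / window, mode="valid")
    start, end, peak = smooth[:window].mean(), smooth[-window:].mean(), smooth.max()

    if abs(end - start) < 0.05 * (abs(start) + 1e-8):
        print("Flat curve: not learning, check environment/reward")
    elif np.std(np.diff(smooth)) > 0.5 * np.std(smooth):
        print("Oscillating curve: unstable, check LR or reward scale")
    elif end < 0.8 * peak:
        print("Peaked then dropped: forgetting or exploration decay wrong")
    else:
        print("Curve trending up: keep training")
```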

### Pitfall 5: "Skip the Random Baseline Check"

**Wrong**: Train agent without knowing what random baseline is

**Right**: Always compute random baseline first

**Red Flag**: Agent barely beats random (within 5% of baseline)

**Fix**:

```python
# Compute random baseline
random_rewards = []
for _ in range(100):
    state = env.reset()
    episode_reward = 0
    for step in range(1000):
        action = env.action_space.sample()
        state, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            break
    random_rewards.append(episode_reward)

random_baseline = np.mean(random_rewards)
print(f"Random baseline: {random_baseline}")

# Compare agent
agent_reward = train_agent()
improvement = (agent_reward - random_baseline) / random_baseline
print(f"Agent improvement: {improvement*100}%")
```

### Pitfall 6: "Changing Multiple Hyperparameters at Once"

**Wrong**: Change 5 things, training breaks, don't know which caused it

**Right**: Change one thing at a time, test, measure, iterate

**Red Flag**: Code has "TUNING" comments with 10 simultaneous changes

**Fix**:

```python
# Scientific method for debugging
def debug_lr():
    for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
        reward = train_with_lr(lr)
        print(f"LR={lr}: Reward={reward}")
        # Only change LR, keep everything else same


def debug_batch_size():
    for batch in [32, 64, 128, 256]:
        reward = train_with_batch(batch)
        print(f"Batch={batch}: Reward={reward}")
        # Only change batch, keep everything else same
```

### Pitfall 7: "Using Training Metrics to Judge Performance"

**Wrong**: Trust training reward, test once at the end

**Right**: Monitor test reward during training (with exploration off)

**Red Flag**: Training reward high, test reward low (overfitting)

**Fix**:

```python
# Evaluate with greedy policy (no exploration)
def evaluate(agent, num_episodes=10):
    episode_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        for step in range(1000):
            action = agent.act(state, explore=False)  # Greedy
            state, reward, done, _ = env.step(action)
            episode_reward += reward
            if done:
                break
        episode_rewards.append(episode_reward)
    return np.mean(episode_rewards)


# Monitor during training
for step in range(total_steps):
    train_agent_step()

    if step % 10000 == 0:
        test_reward = evaluate(agent)  # Evaluate periodically
        print(f"Step {step}: Test reward={test_reward}")
```

## Part 7: Red Flags Checklist

```
CRITICAL RED FLAGS (Stop and debug immediately):

[ ] NaN in loss or rewards
    → Check: reward scale, gradients, network outputs

[ ] Gradient norms > 100 (exploding)
    → Check: Enable gradient clipping, reduce LR

[ ] Gradient norms < 1e-4 (vanishing)
    → Check: Increase LR, check network initialization

[ ] Reward always same
    → Check: Is reward function broken? No differentiation?

[ ] Agent never improves beyond random baseline
    → Check: Reward scale, environment, observation, exploration

[ ] Loss oscillates wildly
    → Check: Learning rate (likely too high), reward scale

[ ] Episode length decreases over training
    → Check: Agent learning bad behavior, poor reward shaping

[ ] Training reward >> test reward
    → Check: Training run was lucky or overfit; test is representative

[ ] Training gets worse after improving
    → Check: Catastrophic forgetting, stability issue


IMPORTANT RED FLAGS (Debug within a few training runs):

[ ] Entropy not decaying (always high)
    → Check: Entropy regularization, exploration decay

[ ] Entropy goes to zero early
    → Check: Entropy coefficient too low, exploration decay too aggressive

[ ] Variance across seeds > 50% of mean
    → Check: Training is unstable or lucky, try more seeds

[ ] Network weights not changing
    → Check: Gradient zero, LR zero, network not connected

[ ] Loss = 0 (perfect fit)
    → Check: Network overfitting, reward too easy


MINOR RED FLAGS (Watch for patterns):

[ ] Training slower than expected
    → Check: LR too low, batch size too small, network too small

[ ] Occasional loss spikes
    → Check: Outlier data, reward outliers, clipping needed

[ ] Reward variance high
    → Check: Normal if environment stochastic, check if aligns with intent

[ ] Agent behavior seems random even late in training
    → Check: Entropy not decaying, exploration not stopping
```
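
Several of the critical flags above can be scanned automatically from logged metrics. A sketch, assuming you pass in arrays/lists of the logged values described in Part 5:

```python
import numpy as np


def scan_red_flags(rewards, losses, grad_norms, entropies):
    """Print the critical red flags that can be detected from logs alone."""
    rewards, losses = np.asarray(rewards), np.asarray(losses)
    grad_norms, entropies = np.asarray(grad_norms), np.asarray(entropies)

    if np.isnan(losses).any() or np.isnan(rewards).any():
        print("[CRITICAL] NaN in loss or rewards")
    if grad_norms.max() > 100:
        print("[CRITICAL] Exploding gradients (norm > 100)")
    if grad_norms.min() < 1e-4:
        print("[CRITICAL] Vanishing gradients (norm < 1e-4)")
    if rewards.std() < 1e-6:
        print("[CRITICAL] Reward has no variance")
    if entropies[-1] >= 0.95 * entropies[0]:
        print("[IMPORTANT] Entropy not decaying")
```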

## Part 8: Rationalization Resistance

| Rationalization | Reality | Counter-Guidance |
|-----------------|---------|------------------|
| "Higher learning rate will speed up learning" | Can cause instability, often slows learning | Start with 3e-4, measure effect, don't assume |
| "Bigger network always learns better" | Oversized networks overfit, slow training | Start small (64-256 units), increase only if needed |
| "Random seed doesn't matter, RL is random anyway" | High variance indicates instability, not inherent randomness | Run 5+ seeds, variance should be low, not high |
| "I'll try all hyperparameters (grid search)" | Combinatorial explosion, wastes time, no diagnosis | Check environment/reward FIRST, then tune one param at a time |
| "Adding regularization helps unstable training" | Regularization is for overfitting, not instability | Instability usually LR or reward scale, not overfitting |
| "My algorithm is broken" | 80% chance environment, reward, or observation is broken | Check those FIRST before blaming algorithm |
| "More training always helps" | If reward plateaus, more training won't help | Check if training converged, if not why |
| "Skip observation normalization, network will learn to normalize" | Network should not spend capacity learning normalization | Normalize observations before network |
| "Test with epsilon > 0 to reduce variance" | Test should use learned policy, exploration harms test | Use greedy policy at test time |
| "If loss doesn't decrease, algorithm is broken" | More likely: reward scale wrong, gradient clipping needed | Check reward scale, enable gradient clipping before changing algorithm |

## Key Takeaways

1. **Follow the systematic process**: Don't tweak at random. Check environment → reward → observation → algorithm.

2. **80/20 rule**: Most failures are in environment, reward, or observation. Check those first.

3. **Reward scale is critical**: Most common bug. Normalize to [-1, 1].

4. **Diagnosis trees**: Use them. Different symptoms have different root causes.

5. **Metrics tell you everything**: Loss, entropy, gradient norms reveal what's wrong.

6. **Rationalization is the enemy**: Don't assume, measure. Plot curves, check outputs, verify.

7. **Simple environment first**: If agent can't learn CartPole, a bigger environment won't help.

8. **One seed is not enough**: Run 5+ seeds, look at variance, not just mean.

This skill is about **systematic debugging**, not random tweaking. Apply the framework, follow the diagnosis trees, and you'll find the bug.