# RL Debugging Methodology

## When to Use This Skill

Invoke this skill when you encounter:

- **Agent Won't Learn**: Reward stuck at baseline, not improving
- **Training Unstable**: Loss bouncing, reward highly variable
- **Suboptimal Policy**: Agent learned something but worse than expected
- **Reward Hacking**: Agent gaming the reward function
- **Exploration Issues**: Agent stuck in local optimum or exploring poorly
- **Hyperparameter Sensitivity**: Small changes break training
- **Learning Rate Tuning**: Not sure what value is right
- **Convergence Problems**: Loss doesn't decrease, or decreases then stops
- **Environment vs Algorithm**: Unsure whether the problem is the environment or the RL algorithm
- **Logging Confusion**: Not sure what metrics to monitor
- **Gradual Performance Degradation**: Early training good, later poor
- **Sparse Reward Challenge**: Agent never finds the reward signal

**Core Problem**: RL debugging often becomes random hyperparameter tweaking. Agents are complex systems with many failure modes. Systematic diagnosis finds root causes; random tweaking wastes time and leads to contradictory findings.

## Do NOT Use This Skill For

- **Learning RL theory** (route to rl-foundations for MDPs, Bellman equations, policy gradients)
- **Implementing new algorithms** (route to algorithm-specific skills like value-based-methods, policy-gradient-methods, actor-critic-methods)
- **Environment API questions** (route to rl-environments for Gym/Gymnasium API, custom environments, wrappers)
- **Evaluation methodology** (route to rl-evaluation for rigorous statistical testing, generalization assessment)
- **Initial algorithm selection** (route to the using-deep-rl router or rl-foundations for choosing the right algorithm family)

## Core Principle: The 80/20 Rule

**The most important insight in RL debugging:**

```
80% of RL failures are in:
  1. Environment design (agent can't see true state)
  2. Reward function (misaligned or wrong scale)
  3. Observation/action representation (missing information)

15% are in:
  4. Hyperparameters (learning rate, batch size, etc.)
  5. Exploration strategy (too much or too little)

5% are in:
  6. Algorithm selection (wrong algorithm for problem)
```

**Consequence**: If training fails, check environment and reward FIRST. Change the algorithm last.

### Why This Order?

**Scenario 1: Broken Environment**

```python
# BROKEN ENVIRONMENT: Agent can't win no matter what algorithm
class BrokenEnv:
    def reset(self):
        self.state = random_state()  # Agent can't control this
        return self.state

    def step(self, action):
        # Reward independent of action!
        reward = random.random()
        return self.state, reward

# No amount of PPO, DQN, SAC can learn from random reward


# CORRECT ENVIRONMENT: Agent can win with right policy
class CorrectEnv:
    def reset(self):
        self.state = initial_state
        return self.state

    def step(self, action):
        # Reward depends on action
        reward = compute_reward(self.state, action)
        self.state = compute_next_state(self.state, action)
        return self.state, reward
```

**If environment is broken, no algorithm will learn.**
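A quick way to catch a `BrokenEnv`-style environment before any training: check whether the reward responds to actions at all. The sketch below is a hypothetical probe, not part of the skill's required tooling; it assumes the classic Gym `step()` signature (`state, reward, done, info`) used in the snippets throughout this document and compares episode returns under two different fixed actions.

```python
import numpy as np

def returns_for_action(env, action, episodes=20, max_steps=200):
    """Total reward per episode when the agent always takes `action`."""
    returns = []
    for _ in range(episodes):
        env.reset()
        total, done, steps = 0.0, False, 0
        while not done and steps < max_steps:
            _, reward, done, _ = env.step(action)
            total += reward
            steps += 1
        returns.append(total)
    return np.array(returns)

def reward_depends_on_action(env, action_a, action_b):
    """Rough probe: do two fixed actions yield different return statistics?"""
    r_a = returns_for_action(env, action_a)
    r_b = returns_for_action(env, action_b)
    gap = abs(r_a.mean() - r_b.mean())
    noise = (r_a.std() + r_b.std()) / 2 + 1e-8
    print(f"mean return: action_a={r_a.mean():.2f}, action_b={r_b.mean():.2f}")
    if gap < 0.1 * noise:
        print("⚠️ Returns barely change with the action — reward may carry no signal")
```

For a discrete environment like CartPole, `reward_depends_on_action(env, 0, 1)` is usually enough to spot a reward that ignores the policy entirely.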
**Scenario 2: Reward Scale Issue**

```python
# WRONG SCALE: Reward in [0, 1000000]
# Algorithm gradient updates: param = param - lr * grad
# If gradient huge (due to reward scale), single step breaks everything

# CORRECT SCALE: Reward in [-1, 1]
# Gradients are reasonable, learning stable

# Fix is simple: divide reward by scale factor
# But if you don't know to check reward scale, you'll try 10 learning rates instead
```

**Consequence: Always check reward scale before tuning learning rate.**
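If the scale is the problem, the most durable fix is to wrap the environment once rather than edit reward code in several places. A minimal sketch, assuming a Gym/Gymnasium-style `RewardWrapper`; the `scale` value is something you estimate yourself, e.g., the largest reward magnitude seen under a random policy (Check 1 in Part 3 shows a full diagnostic).

```python
import gym  # or: import gymnasium as gym
import numpy as np

class ScaleRewardWrapper(gym.RewardWrapper):
    """Divide rewards by a fixed scale and clip to [-1, 1]."""
    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return float(np.clip(reward / self.scale, -1.0, 1.0))

# Usage (scale chosen from random-policy reward statistics):
# env = ScaleRewardWrapper(raw_env, scale=1000.0)
```

Keeping the scaling in a wrapper means the reward seen by every algorithm you try is identical, so a later algorithm swap doesn't silently reintroduce the scale problem.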
## Part 1: Systematic Debugging Framework

### The Debugging Process (Not Random Tweaking)

```
START: Agent not learning (or training unstable, or suboptimal)

Step 1: ENVIRONMENT CHECK (Does agent have what it needs?)
├─ Can agent see the state? (Is observation sufficient?)
├─ Is environment deterministic or stochastic? (Affects algorithm choice)
├─ Can agent actually win? (Does optimal policy exist?)
└─ Is environment reset working? (Fresh episode each reset?)

Step 2: REWARD SCALE CHECK (Is reward in reasonable range?)
├─ What's the range of rewards? (Min, max, typical)
├─ Are rewards normalized? (Should be ≈ [-1, 1])
├─ Is reward aligned with desired behavior? (No reward hacking)
└─ Are rewards sparse or dense? (Affects exploration strategy)

Step 3: OBSERVATION REPRESENTATION (Is information preserved?)
├─ Are observations normalized? (Images: [0, 255] or [0, 1]?)
├─ Is temporal information included? (Frame stacking for Atari?)
├─ Are observations consistent? (Same format each episode?)
└─ Is observation sufficient to solve problem? (Can human win from this info?)

Step 4: BASIC ALGORITHM CHECK (Is the RL algorithm working at all?)
├─ Run on simple environment (CartPole, simple task)
├─ Can algorithm learn on simple env? (If not: algorithm issue)
├─ Can algorithm beat random baseline? (If not: something is broken)
└─ Does loss decrease? (If not: learning not happening)

Step 5: HYPERPARAMETER TUNING (Only after above passed)
├─ Is learning rate in reasonable range? (1e-5 to 1e-3 typical)
├─ Is batch size appropriate? (Power of 2: 32, 64, 128, 256)
├─ Is exploration sufficient? (Epsilon decaying? Entropy positive?)
└─ Are network layers reasonable? (3 hidden layers typical)

Step 6: LOGGING ANALYSIS (What do the metrics say?)
├─ Policy loss: decreasing? exploding? zero?
├─ Value loss: decreasing? stable?
├─ Reward curve: trending up? flat? oscillating?
├─ Entropy: decreasing over time? (Exploration → exploitation)
└─ Gradient norms: reasonable? exploding? vanishing?

Step 7: IDENTIFY ROOT CAUSE (Synthesize findings)
└─ Where is the actual problem? (Environment, reward, algorithm, hyperparameters)
```

### Why This Order Matters

**Common mistake: Jump to Step 5 (hyperparameter tuning)**

```python
# Agent not learning. Frustration sets in.
# "I'll try learning rate 1e-4" (Step 5, skipped 1-4)
# Doesn't work.
# "I'll try batch size 64" (more Step 5 tweaking)
# Doesn't work.
# "I'll try a bigger network" (still Step 5)
# Doesn't work.
# Hours wasted.

# Correct approach: Follow Steps 1-4 first.
# Step 1: Oh! Environment reset is broken, always same initial state
# Fix environment.
# Now agent learns immediately with default hyperparameters.
```

**The order reflects probability**: It's more likely the environment is broken than the algorithm; more likely the reward scale is wrong than the learning rate is wrong.

## Part 2: Diagnosis Trees by Symptom

### Diagnosis Tree 1: "Agent Won't Learn"

**Symptom**: Reward stuck near random baseline. Loss doesn't decrease meaningfully.

```
START: Agent Won't Learn

├─ STEP 1: Can agent beat random baseline?
│  ├─ YES → Skip to STEP 4
│  └─ NO → Environment issue likely
│     ├─ Check 1A: Is environment output sane?
│     │  ├─ Print first 5 episodes: state, action, reward, next_state
│     │  ├─ Verify types match (shapes, ranges, dtypes)
│     │  └─ Is reward always same? Always zero? (Red flag: no signal)
│     ├─ Check 1B: Can you beat it manually?
│     │  ├─ Play environment by hand (hardcode a policy)
│     │  ├─ Can you get >0 reward? (If not: environment is broken)
│     │  └─ If yes: Agent is missing something
│     └─ Check 1C: Is reset working?
│        ├─ Call reset() twice, check states differ
│        └─ If states same: reset is broken, fix it
├─ STEP 2: Is reward scale reasonable?
│  ├─ Compute: min, max, mean, std of rewards from random policy
│  ├─ If range >> 1 (e.g., [0, 10000]):
│  │  ├─ Action: Normalize rewards to [-1, 1]
│  │  ├─ Code: reward = reward / max_possible_reward
│  │  └─ Retest: Usually fixes "won't learn"
│  ├─ If range << 1 (e.g., [0, 0.001]):
│  │  ├─ Action: Scale up rewards
│  │  ├─ Code: reward = reward * 1000
│  │  └─ Or increase network capacity (more signal needed)
│  └─ If reward is [0, 1] (looks fine):
│     └─ Continue to STEP 3
├─ STEP 3: Is observation sufficient?
│  ├─ Check 3A: Are observations normalized?
│  │  ├─ If images [0, 255]: normalize to [0, 1] or [-1, 1]
│  │  ├─ Code: observation = observation / 255.0
│  │  └─ Retest
│  ├─ Check 3B: Is temporal info included? (For vision: frame stacking)
│  │  ├─ If using images: last 4 frames stacked?
│  │  ├─ If using states: includes velocity/derivatives?
│  │  └─ Missing temporal info → agent can't infer velocity
│  └─ Check 3C: Is observation Markovian?
│     ├─ Can optimal policy be derived from this observation?
│     ├─ If not: observation insufficient (red flag)
│     └─ Example: Only position, not velocity → agent can't control
├─ STEP 4: Run sanity check on simple environment
│  ├─ Switch to CartPole or equivalent simple env
│  ├─ Train with default hyperparameters
│  ├─ Does simple env learn? (Should learn in 1000-5000 steps)
│  ├─ YES → Your algorithm works, issue is your env/hyperparameters
│  └─ NO → Algorithm itself broken (rare, check algorithm implementation)
├─ STEP 5: Check exploration
│  ├─ Is agent exploring or stuck?
│  ├─ Log entropy (for stochastic policies)
│  ├─ If entropy → 0 early: agent exploiting before exploring
│  │  └─ Solution: Increase entropy regularization or ε
│  ├─ If entropy always high: too much exploration
│  │  └─ Solution: Decay entropy or ε more aggressively
│  └─ Visualize: Plot policy actions over time, should see diversity early
├─ STEP 6: Check learning rate
│  ├─ Is learning rate in [1e-5, 1e-3]? (typical range)
│  ├─ If > 1e-3: Try reducing (might be too aggressive)
│  ├─ If < 1e-5: Try increasing (might be too conservative)
│  ├─ Watch loss first step: If loss increases → LR too high
│  └─ Safe default: 3e-4
└─ STEP 7: Check network architecture
   ├─ For continuous control: small networks ok (1-2 hidden layers, 64-256 units)
   ├─ For vision: use CNN (don't use FC on pixels)
   ├─ Check if network has enough capacity
   └─ Tip: Start with simple, add complexity if needed
```

**ROOT CAUSES in order of likelihood:**

1. **Reward scale wrong** (40% of cases)
2. **Environment broken** (25% of cases)
3. **Observation insufficient** (15% of cases)
4. **Learning rate too high/low** (12% of cases)
5. **Algorithm issue** (8% of cases)
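STEP 4 of this tree is easiest with a known-good reference implementation on a known-easy environment. The sketch below assumes Stable-Baselines3 and Gymnasium are installed (neither is required by this skill; any reference implementation you trust works the same way).

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Known-good algorithm on a known-easy environment: if this fails,
# suspect the installation or training loop, not your custom environment.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=50_000)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"CartPole sanity check: {mean_reward:.1f} +/- {std_reward:.1f}")
# Random play on CartPole-v1 scores roughly 20; a working setup should
# land well above that (typically 400+ after this much training).
```

If this passes but your own environment does not learn, the problem is almost certainly in your environment, reward, or observation, not in the algorithm.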
### Diagnosis Tree 2: "Training Unstable"

**Symptom**: Loss bounces wildly, reward spikes then crashes, training oscillates.

```
START: Training Unstable

├─ STEP 1: Characterize the instability
│  ├─ Plot loss curve: Does it bounce at same magnitude or grow?
│  ├─ Plot reward curve: Does it oscillate around mean or trend down?
│  ├─ Compute: reward variance over 100 episodes
│  └─ This tells you: Is it normal variance or pathological instability?
├─ STEP 2: Check if environment is deterministic
│  ├─ Deterministic environment + stochastic policy = normal variance
│  ├─ Stochastic environment + any policy = high variance (expected)
│  ├─ If stochastic: Can you reduce randomness? Or accept higher variance?
│  └─ Some instability is normal; distinguish from pathological
├─ STEP 3: Check reward scale
│  ├─ If rewards >> 1: Gradient updates too large
│  │  ├─ Single step might overshoot optimum
│  │  ├─ Solution: Normalize rewards to [-1, 1]
│  │  └─ This often fixes instability immediately
│  ├─ If reward has outliers: Single large reward breaks training
│  │  ├─ Solution: Reward clipping or scaling
│  │  └─ Example: r = np.clip(reward, -1, 1)
│  └─ Check: Is reward scale consistent?
├─ STEP 4: Check learning rate (LR often causes instability)
│  ├─ If loss oscillates: LR likely too high
│  │  ├─ Try reducing by 2-5× (e.g., 1e-3 → 3e-4)
│  │  ├─ Watch first 100 steps: Loss should decrease monotonically
│  │  └─ If still oscillates: try 10× reduction
│  ├─ If you have LR scheduler: Check if it's too aggressive
│  │  ├─ Scheduler reducing LR too fast can cause steps
│  │  └─ Solution: Slower schedule (more steps to final LR)
│  └─ Test: Set LR very low (1e-5), see if training is smooth
│     ├─ YES → Increase LR gradually until instability starts
│     └─ This bracketing finds safe LR range
├─ STEP 5: Check batch size
│  ├─ Small batch (< 32): High gradient variance, bouncy updates
│  │  ├─ Solution: Increase batch size (32, 64, 128)
│  │  └─ But not too large: training becomes slow
│  ├─ Large batch (> 512): Might overfit, large gradient steps
│  │  ├─ Solution: Use gradient accumulation
│  │  └─ Or reduce learning rate slightly
│  └─ Start with batch_size=64, adjust if needed
├─ STEP 6: Check gradient clipping
│  ├─ Are gradients exploding? (Check max gradient norm)
│  │  ├─ If max grad norm > 100: Likely exploding gradients
│  │  ├─ Solution: Enable gradient clipping (max_norm=1.0)
│  │  └─ Code: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
│  ├─ If max grad norm reasonable (< 10): Skip this step
│  └─ Watch grad norm over training: Should stay roughly constant
├─ STEP 7: Check algorithm-specific parameters
│  ├─ For PPO: Is clipping epsilon reasonable? (0.2 default)
│  │  ├─ Too low: Over-clips, barely updates
│  │  └─ Too high: Allows large updates, instability
│  ├─ For DQN: Is target network update frequency appropriate?
│  │  ├─ Update too often: Target constantly changing
│  │  └─ Update too rarely: Stale targets
│  └─ For A3C/A2C: Check entropy coefficient
│     ├─ Too high: Too much exploration, policy noisy
│     └─ Too low: Premature convergence
└─ STEP 8: Check exploration decay
   ├─ Is exploration decaying too fast? (Policy becomes deterministic)
   │  └─ If entropy → 0 early: Agent exploits before exploring
   ├─ Is exploration decaying too slow? (Policy stays noisy)
   │  └─ If entropy stays high: Too much randomness in later training
   └─ Entropy should decay: high early, low late
      └─ Plot entropy over training: should show clear decay curve
```

**ROOT CAUSES in order of likelihood:**

1. **Learning rate too high** (35% of cases)
2. **Reward scale too large** (25% of cases)
3. **Batch size too small** (15% of cases)
4. **Gradient explosion** (10% of cases)
5. **Algorithm parameters** (10% of cases)
6. **Environment stochasticity** (5% of cases)
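STEP 4's bracketing test can be scripted instead of run by hand. A minimal sketch: `train_for(steps, lr)` is a hypothetical helper standing in for your own short training run that returns the per-step loss history, and the thresholds are rough heuristics, not fixed rules.

```python
import numpy as np

def bracket_learning_rate(train_for, lrs=(1e-5, 3e-5, 1e-4, 3e-4, 1e-3), steps=2_000):
    """Try increasing learning rates; report where loss stops decreasing smoothly."""
    for lr in lrs:
        losses = np.asarray(train_for(steps=steps, lr=lr))  # fresh run per LR (hypothetical helper)
        trend = losses[-100:].mean() - losses[:100].mean()   # negative means improving
        jitter = np.std(np.diff(losses)) / (abs(losses.mean()) + 1e-8)
        status = "ok" if trend < 0 and jitter < 1.0 else "unstable"
        print(f"lr={lr:.0e}: trend={trend:+.4f}, jitter={jitter:.2f} -> {status}")
    # Pick the largest LR that is still "ok"; back off 2-5x from the first unstable one.
```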
### Diagnosis Tree 3: "Suboptimal Policy"

**Symptom**: Agent learned something but performs worse than expected. Better than random baseline, but not good enough.

```
START: Suboptimal Policy

├─ STEP 1: How suboptimal? (Quantify the gap)
│  ├─ Compute: Agent reward vs theoretical optimal
│  ├─ If 80% of optimal: Normal (RL usually gets 80-90% of optimal)
│  ├─ If 50% of optimal: Significantly suboptimal
│  ├─ If 20% of optimal: Very bad
│  └─ This tells you: Is it "good enough" or truly broken?
├─ STEP 2: Is it stuck in local optimum?
│  ├─ Run multiple seeds: Do you get similar reward each seed?
│  ├─ If rewards similar across seeds: Consistent local optimum
│  ├─ If rewards vary wildly: High variance, need more training
│  └─ Solution if local optimum: More exploration or better reward shaping
├─ STEP 3: Check reward hacking
│  ├─ Visualize agent behavior: Does it match intent?
│  ├─ Example: Cart-pole reward is [0, 1] per timestep
│  │  ├─ Agent might learn: "Stay in center, don't move"
│  │  ├─ Policy is suboptimal but still gets reward
│  │  └─ Solution: Reward engineering (bonus for progress)
│  └─ Hacking signs:
│     ├─ Agent does something weird but gets reward
│     ├─ Behavior makes no intuitive sense
│     └─ Reward increases but performance bad
├─ STEP 4: Is exploration sufficient?
│  ├─ Check entropy: Does policy explore initially?
│  ├─ Check epsilon decay (if using ε-greedy): Does it decay appropriately?
│  ├─ Is agent exploring broadly or stuck in small region?
│  ├─ Solution: Slower exploration decay or intrinsic motivation
│  └─ Use RND/curiosity if environment has sparse rewards
├─ STEP 5: Check network capacity
│  ├─ Is network too small to represent optimal policy?
│  ├─ For vision: Use standard CNN (not tiny network)
│  ├─ For continuous control: 2-3 hidden layers, 128-256 units
│  ├─ Test: Double network size, does performance improve?
│  └─ If yes: Original network was too small
├─ STEP 6: Check data efficiency
│  ├─ Is agent training long enough?
│  ├─ RL usually needs: simple tasks 100k steps, complex tasks 1M+ steps
│  ├─ If training only 10k steps: Too short, agent didn't converge
│  ├─ Solution: Train longer (but check reward curve first)
│  └─ If reward plateaus early: Extending training won't help
├─ STEP 7: Check observation and action spaces
│  ├─ Is action space continuous or discrete?
│  ├─ Is action discretization appropriate?
│  │  ├─ Too coarse: Can't express fine control
│  │  ├─ Too fine: Huge action space, hard to learn
│  │  └─ Example: 100 actions for simple control = too many
│  ├─ Is observation sufficient? (See Diagnosis Tree 1, Step 3)
│  └─ Missing information in observation = impossible to be optimal
├─ STEP 8: Check reward structure
│  ├─ Is reward dense or sparse?
│  ├─ Sparse reward + suboptimal policy: Agent might not be exploring to good region
│  │  ├─ Solution: Reward shaping (bonus for progress)
│  │  └─ Or: Intrinsic motivation (RND/curiosity)
│  ├─ Dense reward + suboptimal: Possible misalignment with intent
│  └─ Can you improve by reshaping reward?
└─ STEP 9: Compare with baseline algorithm
   ├─ Run reference implementation on same env
   ├─ Does reference get better reward?
   ├─ YES → Your implementation has a bug
   ├─ NO → Problem is inherent to algorithm or environment
   └─ This isolates: Implementation issue vs fundamental difficulty
```

**ROOT CAUSES in order of likelihood:**

1. **Exploration insufficient** (30% of cases)
2. **Training not long enough** (25% of cases)
3. **Reward hacking** (20% of cases)
4. **Network too small** (12% of cases)
5. **Observation insufficient** (8% of cases)
6. **Algorithm mismatch** (5% of cases)
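STEPs 1-2 of this tree are easy to quantify. A minimal sketch, assuming a hypothetical `train_agent(seed)` that returns the final evaluation reward and that you know (or can bound) the best achievable reward for the task.

```python
import numpy as np

def quantify_suboptimality(train_agent, optimal_reward, seeds=(0, 1, 2, 3, 4)):
    """Report the gap to optimal and whether seeds agree (consistent local optimum)."""
    rewards = np.array([train_agent(seed) for seed in seeds])  # hypothetical trainer
    frac = rewards.mean() / optimal_reward
    print(f"Mean reward {rewards.mean():.1f} = {100 * frac:.0f}% of optimal")
    print(f"Per-seed rewards: {np.round(rewards, 1)}")
    if rewards.std() < 0.1 * abs(rewards.mean()):
        print("Seeds agree -> likely a consistent local optimum (check exploration / reward shaping)")
    else:
        print("Seeds disagree -> high variance; train longer or average over more seeds")
```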
## Part 3: What to Check First

### Critical Checks (Do These First)

#### Check 1: Reward Scale Analysis

**Why**: Reward scale is the MOST COMMON source of RL failures.

```python
# DIAGNOSTIC SCRIPT
import numpy as np

# Collect rewards from random policy
rewards = []
for episode in range(100):
    state = env.reset()
    for step in range(1000):
        action = env.action_space.sample()  # Random action
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            break

rewards = np.array(rewards)
print("Reward statistics from random policy:")
print(f"  Min: {rewards.min()}")
print(f"  Max: {rewards.max()}")
print(f"  Mean: {rewards.mean()}")
print(f"  Std: {rewards.std()}")
print(f"  Range: [{rewards.min()}, {rewards.max()}]")

# RED FLAGS
if abs(rewards.max()) > 100 or abs(rewards.min()) > 100:
    print("⚠️ RED FLAG: Rewards >> 1, normalize them!")

if rewards.std() > 10:
    print("⚠️ RED FLAG: High reward variance, normalize or clip")

if rewards.mean() == rewards.max():
    print("⚠️ RED FLAG: Constant rewards, no signal to learn from!")

if (np.abs(rewards) <= 1).all():
    print("✓ Reward scale looks reasonable ([-1, 1] range)")
```

**Action if scale is wrong:**

```python
# Normalize to [-1, 1]
reward = reward / max(abs(rewards.max()), abs(rewards.min()))

# Or clip
reward = np.clip(reward, -1, 1)

# Or shift and scale
reward = 2 * (reward - rewards.min()) / (rewards.max() - rewards.min()) - 1
```

#### Check 2: Environment Sanity Check

**Why**: Broken environment → no algorithm will work.
```python
# DIAGNOSTIC SCRIPT
import numpy as np

def sanity_check_env(env, num_episodes=5):
    """Quick check if environment is sane."""
    for episode in range(num_episodes):
        state = env.reset()
        print(f"\nEpisode {episode}:")
        print(f"  Initial state shape: {state.shape}, dtype: {state.dtype}")
        print(f"  Initial state range: [{state.min()}, {state.max()}]")

        for step in range(10):
            action = env.action_space.sample()
            next_state, reward, done, info = env.step(action)
            print(f"  Step {step}: action={action}, reward={reward}, done={done}")
            print(f"    State shape: {next_state.shape}, range: [{next_state.min()}, {next_state.max()}]")

            # Check for NaN
            if np.isnan(next_state).any() or np.isnan(reward):
                print("  ⚠️ NaN detected!")

            # Check for reasonable values
            if np.abs(next_state).max() > 1e6:
                print("  ⚠️ State explosion (values > 1e6)")

            if done:
                break

    print("\n✓ Environment check complete")

sanity_check_env(env)
```

**RED FLAGS:**

- NaN or inf in observations/rewards
- State values exploding (> 1e6)
- Reward always same (no signal)
- Done flag never true (infinite episodes)
- State never changes despite actions

#### Check 3: Can You Beat It Manually?

**Why**: If a human can't solve it, the agent won't either (unless reward hacking).

```python
# Manual policy: Hardcoded behavior
def manual_policy(state):
    # Example for CartPole: if pole tilting right, push right
    if state[2] > 0:  # angle > 0
        return 1  # Push right
    else:
        return 0  # Push left

# Test manual policy
total_reward = 0
for episode in range(10):
    state = env.reset()
    for step in range(500):
        action = manual_policy(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break

avg_reward = total_reward / 10
print(f"Manual policy average reward: {avg_reward}")

# If avg_reward > 0: Environment is learnable
# If avg_reward ≤ 0: Environment is broken or impossible
```

#### Check 4: Observation Normalization

**Why**: Non-normalized observations cause learning problems.

```python
# Check if observations are normalized
for episode in range(10):
    state = env.reset()
    print(f"Episode {episode}: state range [{state.min()}, {state.max()}]")

# For images: should be [0, 1] or [-1, 1]
# For physical states: should be roughly [-1, 1]

if state.min() < -10 or state.max() > 10:
    print("⚠️ Observations not normalized!")
    # Solution:
    state = state / np.abs(state).max()  # Normalize
```
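When observations are on awkward scales, fixing them once in a wrapper is safer than normalizing ad hoc in the training loop. Below is a minimal running-statistics sketch, assuming a Gym/Gymnasium-style `ObservationWrapper`; recent Gymnasium versions also ship a ready-made `NormalizeObservation` wrapper you can use instead.

```python
import gym  # or: import gymnasium as gym
import numpy as np

class RunningNormalizeObservation(gym.ObservationWrapper):
    """Standardize observations with mean/variance estimated online."""
    def __init__(self, env, eps=1e-8):
        super().__init__(env)
        shape = env.observation_space.shape
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps
        self.eps = eps

    def observation(self, obs):
        # Welford-style running update of mean and variance, then standardize
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count
        return (obs - self.mean) / np.sqrt(self.var + self.eps)
```

Because the statistics adapt during training, freeze (or save) them before evaluation so the test-time observations are transformed the same way as at the end of training.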
## Part 4: Common RL Bugs Catalog

### Bug 1: Reward Scale > 1

**Symptom**: Training unstable, loss spikes, agent doesn't learn

**Root Cause**: Gradients too large due to reward scale

**Code Example**:

```python
# WRONG: Reward in [0, 1000]
reward = success_count * 1000

# CORRECT: Normalize to [-1, 1]
reward = success_count * 1000
reward = reward / max_possible_reward  # Result: [-1, 1]
```

**Fix**: Divide rewards by max possible value

**Detection**:

```python
# rewards = rewards collected over 100 episodes with a random policy
if max(abs(r) for r in rewards) > 1:
    print("⚠️ Reward scale issue detected")
```

### Bug 2: Environment Reset Broken

**Symptom**: Agent learns initial state but can't adapt

**Root Cause**: Reset doesn't randomize initial state or returns same state

**Code Example**:

```python
# WRONG: Reset always same state
def reset(self):
    self.state = np.array([0, 0, 0, 0])  # Always [0, 0, 0, 0]
    return self.state

# CORRECT: Reset randomizes initial state
def reset(self):
    self.state = np.random.uniform(-0.1, 0.1, size=4)  # Random
    return self.state
```

**Fix**: Make reset() randomize the initial state

**Detection**:

```python
states = [env.reset() for _ in range(10)]
if len(set(map(tuple, states))) == 1:
    print("⚠️ Reset broken, always same state")
```

### Bug 3: Observation Insufficient (Partial Observability)

**Symptom**: Agent can't learn because it doesn't see enough

**Root Cause**: Observation missing velocity, derivatives, or temporal info

**Code Example**:

```python
# WRONG: Only position, no velocity
state = np.array([position])  # Can't infer velocity from position alone

# CORRECT: Position + velocity
state = np.array([position, velocity])

# WRONG for images: Single frame
observation = env.render()  # Single frame, no temporal info

# CORRECT for images: Stacked frames
frames = [frame_tminus3, frame_tminus2, frame_tminus1, frame_t]  # Last 4 frames
observation = np.stack(frames, axis=-1)  # Shape: (84, 84, 4)
```

**Fix**: Add missing information to the observation

**Detection**:

```python
# If agent converges to bad performance despite long training
# Check: Can you compute the optimal action from the observation?
# If no: Observation is insufficient
```
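Frame stacking is usually done once in a wrapper rather than inside the training code. A minimal sketch with `collections.deque`, assuming channel-last image observations and the classic Gym 4-tuple `step()` used elsewhere in this document; Gym/Gymnasium also provide a ready-made `FrameStack` wrapper, and a real wrapper would additionally widen `observation_space`.

```python
from collections import deque

import gym  # or: import gymnasium as gym (adapt to its 5-tuple step API)
import numpy as np

class FrameStackWrapper(gym.Wrapper):
    """Return the last `k` frames stacked along the channel axis."""
    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        for _ in range(self.k):          # fill the buffer with the first frame
            self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=-1)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=-1), reward, done, info
```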
### Bug 4: Reward Always Same (No Signal)

**Symptom**: Loss decreases but doesn't improve over time, reward flat

**Root Cause**: Reward is constant or nearly constant

**Code Example**:

```python
# WRONG: Constant reward
reward = 1.0  # Every step gets +1, no differentiation

# CORRECT: Differentiate good and bad outcomes
if reached_goal:
    reward = 1.0
else:
    reward = 0.0  # Or -0.1 for living cost
```

**Fix**: Ensure the reward differentiates outcomes

**Detection**:

```python
# rewards = np.array of rewards collected with a random policy
if rewards.std() < 0.01:
    print("⚠️ Reward has no variance, no signal to learn")
```

### Bug 5: Learning Rate Too High

**Symptom**: Loss oscillates or explodes, training unstable

**Root Cause**: Gradient updates too large, overshooting optimum

**Code Example**:

```python
# WRONG: Learning rate 1e-2 (too high)
optimizer = Adam(model.parameters(), lr=1e-2)

# CORRECT: Learning rate 3e-4 (safe default)
optimizer = Adam(model.parameters(), lr=3e-4)
```

**Fix**: Reduce learning rate by 2-5×

**Detection**:

```python
# Watch loss over the first 100 steps
# If loss increases on the first step: LR too high
# If loss decreases but oscillates: LR probably high
```

### Bug 6: Learning Rate Too Low

**Symptom**: Agent learns very slowly, training takes forever

**Root Cause**: Gradient updates too small, learning crawls

**Code Example**:

```python
# WRONG: Learning rate 1e-6 (too low)
optimizer = Adam(model.parameters(), lr=1e-6)

# CORRECT: Learning rate 3e-4
optimizer = Adam(model.parameters(), lr=3e-4)
```

**Fix**: Increase learning rate by 2-5×

**Detection**:

```python
# Training curve increases very slowly
# If training 1M steps and reward barely improved: LR too low
```

### Bug 7: No Exploration Decay

**Symptom**: Agent learns but remains noisy, doesn't fully exploit

**Root Cause**: Exploration (epsilon or entropy) not decaying

**Code Example**:

```python
# WRONG: Constant epsilon
epsilon = 0.3  # Forever

# CORRECT: Decay epsilon
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.01)
```

**Fix**: Add an exploration decay schedule

**Detection**:

```python
# Plot entropy or epsilon over training
# Should show clear decay from high to low
# If flat: not decaying
```

### Bug 8: Exploration Decay Too Fast

**Symptom**: Agent plateaus early, stuck in local optimum

**Root Cause**: Exploration stops before finding a good policy

**Code Example**:

```python
# WRONG: Decays to zero in ~10k steps (for 1M-step training)
epsilon = 0.99 ** (step / 100)  # Reaches ~0 far too fast

# CORRECT: Decays over the full training run
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.01)
```

**Fix**: Use a longer decay schedule

**Detection**:

```python
# Plot epsilon over training
# Should reach its final value at 50-80% through training
# Not at 5%
```

### Bug 9: Reward Hacking

**Symptom**: Agent achieves high reward but behavior is useless

**Root Cause**: Agent found a way to game a reward not aligned with intent

**Code Example**:

```python
# WRONG: Reward for just staying alive
reward = 1.0  # Every timestep
# Agent learns: Stay in corner, don't move, get infinite reward

# CORRECT: Reward for progress + living cost
position_before = self.state[0]
self.state = compute_next_state(...)
position_after = self.state[0]
progress = position_after - position_before
reward = progress - 0.01  # Progress bonus, living cost
```

**Fix**: Reshape the reward to align with intent

**Detection**:

```python
# Visualize agent behavior
# If behavior weird but reward high: hacking
# If reward increases but task performance bad: hacking
```
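Reward hacking is easiest to catch by logging a task metric you actually care about alongside the reward and watching the two diverge. A minimal sketch; `task_metric(info)` is a hypothetical function that extracts ground-truth progress (e.g., distance to goal) from the env's `info` dict, and the correlation threshold is only a heuristic.

```python
import numpy as np

def audit_reward_vs_task(env, policy, task_metric, episodes=20, max_steps=1000):
    """Compare episode reward with an independent task metric across episodes."""
    rewards, metrics = [], []
    for _ in range(episodes):
        state = env.reset()
        total_reward, last_metric = 0.0, 0.0
        for _ in range(max_steps):
            state, reward, done, info = env.step(policy(state))
            total_reward += reward
            last_metric = task_metric(info)  # hypothetical: true progress, not the reward
            if done:
                break
        rewards.append(total_reward)
        metrics.append(last_metric)

    if np.std(rewards) < 1e-8 or np.std(metrics) < 1e-8:
        print("Degenerate data: rewards or task metric constant — inspect behavior directly")
        return
    corr = np.corrcoef(rewards, metrics)[0, 1]
    print(f"reward-vs-task correlation: {corr:.2f}")
    if corr < 0.5:
        print("⚠️ Reward and task metric weakly related — inspect behavior for hacking")
```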
### Bug 10: Testing with Exploration

**Symptom**: Test performance much worse than training, high variance

**Root Cause**: Using a stochastic policy at test time

**Code Example**:

```python
# WRONG: Test with epsilon > 0
for test_episode in range(100):
    action = epsilon_greedy(q_values, epsilon=0.05)  # Wrong!
    # Agent still explores at test time

# CORRECT: Test greedy
for test_episode in range(100):
    action = np.argmax(q_values)  # Deterministic
```

**Fix**: Use a greedy/deterministic policy at test time

**Detection**:

```python
# Test reward variance high?
# Test reward < train reward?
# Check: Are you using exploration at test time?
```

## Part 5: Logging and Monitoring

### What Metrics to Track

```python
# Minimal set of metrics for RL debugging
class RLLogger:
    def __init__(self):
        self.episode_rewards = []
        self.policy_losses = []
        self.value_losses = []
        self.entropies = []
        self.gradient_norms = []

    def log_episode(self, episode_reward):
        self.episode_rewards.append(episode_reward)

    def log_losses(self, policy_loss, value_loss, entropy):
        self.policy_losses.append(policy_loss)
        self.value_losses.append(value_loss)
        self.entropies.append(entropy)

    def log_gradient_norm(self, norm):
        self.gradient_norms.append(norm)

    def plot_training(self):
        """Visualize training progress."""
        # Plot 1: Episode rewards over time (smoothed)
        # Plot 2: Policy and value losses
        # Plot 3: Entropy (should decay)
        # Plot 4: Gradient norms
        pass
```
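How the logger plugs into a training loop, as a sketch: the agent interface (`agent.act`, an `agent.update()` returning `(policy_loss, value_loss, entropy)`) and `grad_norm_fn` are hypothetical stand-ins for your own update step and the gradient-norm snippet shown under Metric 5 below.

```python
def train_with_logging(env, agent, num_episodes, grad_norm_fn):
    """Wire RLLogger into a generic training loop (agent API is hypothetical)."""
    logger = RLLogger()
    for _ in range(num_episodes):
        state, episode_reward, done = env.reset(), 0.0, False
        while not done:
            state, reward, done, _ = env.step(agent.act(state))
            episode_reward += reward

            policy_loss, value_loss, entropy = agent.update()  # hypothetical update step
            logger.log_losses(policy_loss, value_loss, entropy)
            logger.log_gradient_norm(grad_norm_fn())            # e.g., the Metric 5 snippet

        logger.log_episode(episode_reward)
    return logger
```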
### What Each Metric Means

#### Metric 1: Episode Reward

**What to look for**:

- Should trend upward over time
- Should have decreasing variance (less oscillation)
- Slight noise is normal

**Red flags**:

- Flat line: Not learning
- Downward trend: Getting worse
- Wild oscillations: Instability or unlucky randomness

**Code**:

```python
rewards = agent.get_episode_rewards()
reward_smoothed = np.convolve(rewards, np.ones(100) / 100, mode='valid')
plt.plot(reward_smoothed)  # Smooth to see the trend
```

#### Metric 2: Policy Loss

**What to look for**:

- Should decrease over training
- Decrease should smooth out (not oscillating)

**Red flags**:

- Loss increasing: Learning rate too high
- Loss oscillating: Learning rate too high or reward scale wrong
- Loss = 0: Policy not updating

**Code**:

```python
if policy_loss > policy_loss_prev:
    print("⚠️ Policy loss increased, LR might be too high")
```

#### Metric 3: Value Loss (for critic-based methods)

**What to look for**:

- Should decrease initially, then plateau
- Should not oscillate heavily

**Red flags**:

- Loss exploding: LR too high
- Loss not changing: Not updating

**Code**:

```python
value_loss_smoothed = np.convolve(value_losses, np.ones(100) / 100, mode='valid')
if value_loss_smoothed[-1] > value_loss_smoothed[-100]:
    print("⚠️ Value loss increasing recently")
```

#### Metric 4: Entropy (Policy Randomness)

**What to look for**:

- Should start high (exploring)
- Should decay to low (exploiting)
- Clear downward trend

**Red flags**:

- Entropy always high: Too much exploration
- Entropy drops to zero: Over-exploiting
- No decay: Entropy not decreasing

**Code**:

```python
if entropy[-1] > entropy[-100]:
    print("⚠️ Entropy increasing, exploration not decaying")
```

#### Metric 5: Gradient Norms

**What to look for**:

- Should stay roughly constant over training
- Typical range: 0.1 to 10

**Red flags**:

- Gradient norms > 100: Exploding gradients
- Gradient norms < 0.001: Vanishing gradients
- Sudden spikes: Outlier data or numerical issue

**Code**:

```python
total_norm = 0.0
for p in model.parameters():
    if p.grad is None:
        continue
    param_norm = p.grad.norm(2)
    total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5

if total_norm > 100:
    print("⚠️ Gradient explosion detected")
```

### Visualization Script

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_rl_training(rewards, policy_losses, value_losses, entropies):
    """Plot training metrics for RL debugging."""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Plot 1: Episode rewards
    ax = axes[0, 0]
    ax.plot(rewards, alpha=0.3, label='Episode reward')
    reward_smooth = np.convolve(rewards, np.ones(100) / 100, mode='valid')
    ax.plot(range(99, len(rewards)), reward_smooth, label='Smoothed (100 episodes)')
    ax.set_xlabel('Episode')
    ax.set_ylabel('Reward')
    ax.set_title('Episode Rewards Over Time')
    ax.legend()
    ax.grid()

    # Plot 2: Policy loss
    ax = axes[0, 1]
    ax.plot(policy_losses, alpha=0.3)
    loss_smooth = np.convolve(policy_losses, np.ones(100) / 100, mode='valid')
    ax.plot(range(99, len(policy_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Policy Loss')
    ax.set_title('Policy Loss Over Time')
    ax.legend()
    ax.grid()

    # Plot 3: Entropy
    ax = axes[1, 0]
    ax.plot(entropies, label='Policy entropy')
    ax.set_xlabel('Step')
    ax.set_ylabel('Entropy')
    ax.set_title('Policy Entropy (Should Decrease)')
    ax.legend()
    ax.grid()

    # Plot 4: Value loss
    ax = axes[1, 1]
    ax.plot(value_losses, alpha=0.3)
    loss_smooth = np.convolve(value_losses, np.ones(100) / 100, mode='valid')
    ax.plot(range(99, len(value_losses)), loss_smooth, label='Smoothed')
    ax.set_xlabel('Step')
    ax.set_ylabel('Value Loss')
    ax.set_title('Value Loss Over Time')
    ax.legend()
    ax.grid()

    plt.tight_layout()
    plt.show()
```

## Part 6: Common Pitfalls and Red Flags

### Pitfall 1: "Bigger Network = Better Learning"

**Wrong**: Oversized networks overfit and learn slowly

**Right**: Start with a small network (2-3 hidden layers, 64-256 units)

**Red Flag**: Network has > 10M parameters for a simple task

**Fix**:

```python
# Too big
model = nn.Sequential(
    nn.Linear(4, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 2)
)

# Right size
model = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2)
)
```

### Pitfall 2: "Random Seed Doesn't Matter"

**Wrong**: Different seeds give very different results (indicates instability)

**Right**: Results should be consistent across seeds (within reasonable variance)

**Red Flag**: Reward varies by 50%+ across 5 seeds

**Fix**:

```python
# Test across multiple seeds
rewards_by_seed = []
for seed in range(5):
    np.random.seed(seed)
    torch.manual_seed(seed)
    reward = train_agent(seed)
    rewards_by_seed.append(reward)

print(f"Mean: {np.mean(rewards_by_seed)}, Std: {np.std(rewards_by_seed)}")
if np.std(rewards_by_seed) > 0.5 * np.mean(rewards_by_seed):
    print("⚠️ High variance across seeds, training unstable")
```

### Pitfall 3: "Skip Observation Normalization"

**Wrong**: Non-normalized observations (scale [-1e6, 1e6])

**Right**: Normalized observations (scale [-1, 1])

**Red Flag**: State values > 100 or < -100

**Fix**:

```python
# Normalize images
observation = observation.astype(np.float32) / 255.0

# Normalize states
observation = (observation - observation_mean) / observation_std

# Or standardize on-the-fly
normalized_obs = (obs - running_mean) / (running_std + 1e-8)
```
### Pitfall 4: "Ignore the Reward Curve Shape"

**Wrong**: Only look at the final reward, ignore curve shape

**Right**: The curve shape tells you what's wrong

**Red Flag**: Curve shapes indicate:

- Flat then sudden jump: Long exploration, then found policy
- Oscillating: Unstable learning
- Decreasing after peak: Catastrophic forgetting

**Fix**:

```python
# Look at curve shape (simple heuristics on a smoothed reward curve;
# thresholds are rough rules of thumb, adjust to your reward scale)
smoothed = np.convolve(rewards, np.ones(100) / 100, mode='valid')
gain = smoothed[-1] - smoothed[0]
jitter = np.std(np.diff(smoothed))

if abs(gain) < 0.05 * (abs(smoothed[0]) + 1e-8):
    print("Flat: not learning, check environment/reward")
elif jitter > 0.1 * (np.abs(smoothed).mean() + 1e-8):
    print("Oscillating: unstable, check LR or reward scale")
elif smoothed[-1] < 0.8 * smoothed.max():
    print("Peaked then dropped: overfitting or exploration decay wrong")
```

### Pitfall 5: "Skip the Random Baseline Check"

**Wrong**: Train the agent without knowing what the random baseline is

**Right**: Always compute the random baseline first

**Red Flag**: Agent barely beats random (within 5% of baseline)

**Fix**:

```python
# Compute random baseline
random_rewards = []
for _ in range(100):
    state = env.reset()
    episode_reward = 0
    for step in range(1000):
        action = env.action_space.sample()
        state, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            break
    random_rewards.append(episode_reward)

random_baseline = np.mean(random_rewards)
print(f"Random baseline: {random_baseline}")

# Compare agent
agent_reward = train_agent()
improvement = (agent_reward - random_baseline) / random_baseline
print(f"Agent improvement: {improvement * 100}%")
```

### Pitfall 6: "Changing Multiple Hyperparameters at Once"

**Wrong**: Change 5 things, training breaks, don't know which caused it

**Right**: Change one thing at a time, test, measure, iterate

**Red Flag**: Code has "TUNING" comments with 10 simultaneous changes

**Fix**:

```python
# Scientific method for debugging
def debug_lr():
    for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
        reward = train_with_lr(lr)
        print(f"LR={lr}: Reward={reward}")
    # Only change LR, keep everything else the same

def debug_batch_size():
    for batch in [32, 64, 128, 256]:
        reward = train_with_batch(batch)
        print(f"Batch={batch}: Reward={reward}")
    # Only change batch size, keep everything else the same
```

### Pitfall 7: "Using Training Metrics to Judge Performance"

**Wrong**: Trust training reward, test once at the end

**Right**: Monitor test reward during training (with exploration off)

**Red Flag**: Training reward high, test reward low (overfitting)

**Fix**:

```python
# Evaluate with greedy policy (no exploration)
def evaluate(agent, num_episodes=10):
    episode_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        for step in range(1000):
            action = agent.act(state, explore=False)  # Greedy
            state, reward, done, _ = env.step(action)
            episode_reward += reward
            if done:
                break
        episode_rewards.append(episode_reward)
    return np.mean(episode_rewards)

# Monitor during training
for step in range(total_steps):
    train_agent_step()
    if step % 10000 == 0:
        test_reward = evaluate(agent)  # Evaluate periodically
        print(f"Step {step}: Test reward={test_reward}")
```

## Part 7: Red Flags Checklist

```
CRITICAL RED FLAGS (Stop and debug immediately):

[ ] NaN in loss or rewards
    → Check: reward scale, gradients, network outputs
[ ] Gradient norms > 100 (exploding)
    → Check: Enable gradient clipping, reduce LR
[ ] Gradient norms < 1e-4 (vanishing)
    → Check: Increase LR, check network initialization
[ ] Reward always same
    → Check: Is reward function broken? No differentiation?
[ ] Agent never improves beyond random baseline
    → Check: Reward scale, environment, observation, exploration
[ ] Loss oscillates wildly
    → Check: Learning rate (likely too high), reward scale
[ ] Episode length decreases over training
    → Check: Agent learning bad behavior, poor reward shaping
[ ] Test reward >> training reward
    → Check: Exploration noise depressing training reward, or evaluation setup not representative
[ ] Training gets worse after improving
    → Check: Catastrophic forgetting, stability issue

IMPORTANT RED FLAGS (Debug within a few training runs):

[ ] Entropy not decaying (always high)
    → Check: Entropy regularization, exploration decay
[ ] Entropy goes to zero early
    → Check: Entropy coefficient too low, exploration too aggressive
[ ] Variance across seeds > 50% of mean
    → Check: Training is unstable or lucky, try more seeds
[ ] Network weights not changing
    → Check: Gradient zero, LR zero, network not connected
[ ] Loss = 0 (perfect fit)
    → Check: Network overfitting, reward too easy

MINOR RED FLAGS (Watch for patterns):

[ ] Training slower than expected
    → Check: LR too low, batch size too small, network too small
[ ] Occasional loss spikes
    → Check: Outlier data, reward outliers, clipping needed
[ ] Reward variance high
    → Check: Normal if environment stochastic, check if it aligns with intent
[ ] Agent behavior seems random even late in training
    → Check: Entropy not decaying, exploration not stopping
```
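Several of the critical flags can be checked automatically every few thousand steps. A minimal sketch, assuming you already track recent losses, rewards, and gradient norms (for example with the `RLLogger` from Part 5); the thresholds mirror the checklist above and are heuristics, not hard limits.

```python
import numpy as np

def check_red_flags(recent_losses, recent_rewards, recent_grad_norms):
    """Automated pass over the critical red flags; call every few thousand steps."""
    losses = np.asarray(recent_losses, dtype=np.float64)
    rewards = np.asarray(recent_rewards, dtype=np.float64)
    grads = np.asarray(recent_grad_norms, dtype=np.float64)

    if np.isnan(losses).any() or np.isnan(rewards).any():
        print("CRITICAL: NaN in loss or rewards — check reward scale, gradients, network outputs")
    if grads.size and grads.max() > 100:
        print("CRITICAL: exploding gradients (norm > 100) — enable clipping, reduce LR")
    if grads.size and grads.max() < 1e-4:
        print("CRITICAL: vanishing gradients — increase LR, check initialization")
    if rewards.size and rewards.std() < 1e-6:
        print("CRITICAL: reward has no variance — reward function may be broken")
    if losses.size > 10 and np.std(np.diff(losses)) > abs(losses.mean()):
        print("WARNING: loss oscillating wildly — check LR and reward scale")
```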
## Part 8: Rationalization Resistance

| Rationalization | Reality | Counter-Guidance |
|-----------------|---------|------------------|
| "Higher learning rate will speed up learning" | Can cause instability, often slows learning | Start with 3e-4, measure effect, don't assume |
| "Bigger network always learns better" | Oversized networks overfit, slow training | Start small (64-256 units), increase only if needed |
| "Random seed doesn't matter, RL is random anyway" | High variance indicates instability, not inherent randomness | Run 5+ seeds, variance should be low, not high |
| "I'll try all hyperparameters (grid search)" | Combinatorial explosion, wastes time, no diagnosis | Check environment/reward FIRST, then tune one param at a time |
| "Adding regularization helps unstable training" | Regularization is for overfitting, not instability | Instability usually LR or reward scale, not overfitting |
| "My algorithm is broken" | 80% chance environment, reward, or observation is broken | Check those FIRST before blaming algorithm |
| "More training always helps" | If reward plateaus, more training won't help | Check if training converged, if not why |
| "Skip observation normalization, network will learn to normalize" | Network should not spend capacity learning normalization | Normalize observations before network |
| "Test with epsilon > 0 to reduce variance" | Test should use learned policy, exploration harms test | Use greedy policy at test time |
| "If loss doesn't decrease, algorithm is broken" | More likely: reward scale wrong, gradient clipping needed | Check reward scale, enable gradient clipping before changing algorithm |

## Key Takeaways

1. **Follow the systematic process**: Don't random tweak. Check environment → reward → observation → algorithm.
2. **80/20 rule**: Most failures are in environment, reward, or observation. Check those first.
3. **Reward scale is critical**: Most common bug. Normalize to [-1, 1].
4. **Diagnosis trees**: Use them. Different symptoms have different root causes.
5. **Metrics tell you everything**: Loss, entropy, gradient norms reveal what's wrong.
6. **Rationalization is the enemy**: Don't assume, measure. Plot curves, check outputs, verify.
7. **Simple environment first**: If the agent can't learn CartPole, a bigger environment won't help.
8. **One seed is not enough**: Run 5+ seeds, look at variance, not just the mean.

This skill is about **systematic debugging**, not random tweaking. Apply the framework, follow the diagnosis trees, and you'll find the bug.