# Exploration Strategies in Deep RL

## When to Use This Skill

Invoke this skill when you encounter:

- **Exploration-Exploitation Problem**: Agent stuck in a local optimum, not finding sparse rewards
- **ε-Greedy Tuning**: Designing or debugging epsilon decay schedules
- **Sparse Reward Environments**: Montezuma's Revenge, goal-conditioned tasks, minimal feedback
- **Large State Spaces**: Too many states for random exploration to be effective
- **Curiosity-Driven Learning**: Implementing or understanding intrinsic motivation
- **RND (Random Network Distillation)**: Novelty-based exploration for sparse rewards
- **Count-Based Exploration**: Encouraging discovery in discrete/tabular domains
- **Exploration Stability**: Agent explores too much/little, inconsistent performance
- **Method Selection**: Which exploration strategy for this problem?
- **Computational Cost**: Balancing exploration sophistication vs overhead
- **Boltzmann Exploration**: Softmax-based action selection and temperature tuning

**Core Problem:** Many RL agents get stuck exploiting a local optimum, never finding sparse rewards or exploring high-dimensional state spaces effectively. Choosing the right exploration strategy is fundamental to success.

## Do NOT Use This Skill For

- **Algorithm selection** (route to rl-foundations or specific algorithm skills like value-based-methods, policy-gradient-methods)
- **Reward design issues** (route to reward-shaping-engineering)
- **Environment bugs causing poor exploration** (route to rl-debugging first to verify the environment works correctly)
- **Basic RL concepts** (route to rl-foundations for MDPs, value functions, Bellman equations)
- **Training instability unrelated to exploration** (route to the appropriate algorithm skill or rl-debugging)

## Core Principle: The Exploration-Exploitation Tradeoff

### The Fundamental Tension

In reinforcement learning, every action selection is a decision:

- **Exploit**: Take the action with the highest estimated value (maximize immediate reward)
- **Explore**: Try a different action to learn about its value (find better actions)

```
Exploitation Extreme:
- Only take the best-known action
- High immediate reward (in training)
- BUT: Stuck in a local optimum if the initial action wasn't optimal
- Risk: Never find the actual best reward

Exploration Extreme:
- Take random actions uniformly
- Will eventually find any reward
- BUT: Wasting resources on clearly bad actions
- Risk: No learning because too much randomness

Optimal Balance:
- Explore enough to find good actions
- Exploit enough to benefit from learning
```
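To make the tension concrete, here is a minimal, hypothetical two-armed bandit sketch (the reward probabilities, seed, and step count are illustrative, not from the original text). A purely greedy agent frequently locks onto whichever arm happened to pay off first; a small ε fixes that.

```python
import numpy as np

def run_bandit(epsilon, true_means=(0.3, 0.7), steps=2000, seed=0):
    """Toy 2-armed Bernoulli bandit: returns average reward per step."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)        # estimated action values
    counts = np.zeros(2)   # pulls per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(2))     # explore: random arm
        else:
            a = int(np.argmax(q))        # exploit: best-known arm
        r = float(rng.random() < true_means[a])
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]   # incremental mean update
        total += r
    return total / steps

print("greedy  :", run_bandit(epsilon=0.0))   # typically stuck near 0.3 (worse arm)
print("eps=0.1 :", run_bandit(epsilon=0.1))   # reliably discovers the better arm
```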
### Why Exploration Matters

**Scenario 1: Sparse Reward Environment**

Imagine an agent in Montezuma's Revenge (a classic exploration benchmark):

- Most states give reward = 0
- The first coin gives +1 (at step 500+)
- Without exploring systematically, random actions won't find that coin in millions of steps

Without an exploration strategy:

```
Steps 0-1,000: Random actions, no reward signal
Steps 1,000-10,000: Learned to get to the coin, finally seeing reward
Problem: Took 1,000 steps of pure random exploration!

With smart exploration (RND):
Steps 0-100: RND detects novel states, guides toward unexplored areas
Steps 100-500: Finds the coin much faster because it explores strategically
Result: Reward found in 10% of the steps
```

**Scenario 2: Local Optimum Trap**

The agent finds a small reward (+1) from a simple policy:

```
Without decay:
- Agent learns exploit_policy achieves +1
- ε-greedy with ε=0.3: Still 30% random (good, explores)
- BUT: 70% exploiting a suboptimal policy indefinitely

With decay:
- Step 0: ε=1.0, 100% explore
- Step 100k: ε=0.05, 5% explore
- Step 500k: ε=0.01, 1% explore
- Result: Enough exploration to find the +5 reward, then exploit it
```

### Core Rule

**Exploration is an investment with declining returns.**

- Early training: Exploration critical (don't know anything yet)
- Mid training: Balanced (learning but not confident)
- Late training: Exploitation dominant (confident in good actions)

## Part 1: ε-Greedy Exploration

### The Baseline Method

ε-Greedy is the simplest exploration strategy: with probability ε, take a random action; otherwise, take the greedy (best-known) action.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """
    Select an action using ε-greedy.

    Args:
        q_values: Q(s, ·) - values for all actions
        epsilon: exploration probability in [0, 1]

    Returns:
        action: int (0 to num_actions - 1)
    """
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(len(q_values))
    else:
        # Exploit: best action
        return np.argmax(q_values)
```

### Why ε-Greedy Works

1. **Simple**: Easy to implement and understand
2. **Guaranteed Exploration**: Every action in every visited state keeps being tried (if ε > 0)
3. **Effective Baseline**: Works surprisingly well for many tasks
4. **Interpretable**: ε has a clear meaning (probability of a random action)

### When ε-Greedy Fails

```
Problem Space → Exploration Effectiveness:

Small discrete spaces (< 100 actions):
- ε-greedy: Excellent ✓
- Reason: Random exploration covers the space quickly

Large discrete spaces (100-10,000 actions):
- ε-greedy: Poor ✗
- Reason: A random action is almost always bad
- Example: Game with 500 actions → a random pick has a 1/500 chance of being the right one

Continuous action spaces:
- ε-greedy: Terrible ✗
- Reason: A uniformly random action is meaningless noise
- Alternative: Gaussian noise on the action (not true ε-greedy)

Sparse rewards, large state spaces:
- ε-greedy: Hopeless ✗
- Reason: Random exploration won't find the rare reward before heat death
- Alternative: Curiosity, RND, intrinsic motivation
```

### ε-Decay Schedules

The key insight: ε should decay over time. Explore early, exploit late.

#### Linear Decay

```python
def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.1):
    """
    Linear decay from epsilon_start to epsilon_end.

    ε(t) = ε_start - (ε_start - ε_end) * t / T
    """
    t = min(step, total_steps)
    return epsilon_start - (epsilon_start - epsilon_end) * t / total_steps
```

**Properties:**
- Simple, predictable, easy to tune
- Equal exploration reduction per step
- Good for most tasks

**Guidance:**
- Use if you have no special knowledge about the task
- `epsilon_start = 1.0` (explore fully initially)
- `epsilon_end = 0.01` to `0.1` (small residual exploration)
- `total_steps = 1,000,000` (typical deep RL)
#### Exponential Decay

```python
def epsilon_exponential(step, decay_rate=0.9995):
    """
    Exponential decay with a constant per-step rate.

    ε(t) = ε_0 * decay_rate^t
    """
    return 1.0 * (decay_rate ** step)
```

**Properties:**
- Fast initial decay, slow tail
- Aggressive early exploration cutoff
- Exploration drops exponentially

**Guidance:**
- Use if task rewards are found quickly
- `decay_rate = 0.9995` is gentle (ε halves roughly every 1,400 steps)
- `decay_rate = 0.999` is aggressive (ε halves roughly every 700 steps)
- Watch for premature convergence to a local optimum

#### Polynomial Decay

```python
def epsilon_polynomial(step, total_steps, epsilon_start=1.0, epsilon_end=0.01, power=2.0):
    """
    Polynomial decay: ε(t) = ε_end + (ε_start - ε_end) * (1 - t/T)^p

    power=1: Linear
    power=2: Quadratic (faster early decay)
    power=0.5: Slower decay
    """
    t = min(step, total_steps)
    fraction = t / total_steps
    return epsilon_end + (epsilon_start - epsilon_end) * (1 - fraction) ** power
```

**Properties:**
- Smooth, tunable decay curve
- Power > 1: Fast early decay, slow tail
- Power < 1: Slow early decay, fast tail

**Guidance:**
- `power = 2.0`: Quadratic (balanced, common)
- `power = 3.0`: Cubic (aggressive early decay)
- `power = 0.5`: Slower (gentle early decay)

### Practical Guidance: Choosing Epsilon Parameters

```
Rule of Thumb:
- epsilon_start = 1.0 (explore uniformly initially)
- epsilon_end = 0.01 to 0.1 (maintain minimal exploration)
  - 0.01: For large action spaces (random actions are rarely useful)
  - 0.05: Default choice
  - 0.1: For small action spaces (can afford random actions)
- total_steps: Based on training duration
  - Usually 500k to 1M steps
  - Longer if rewards are sparse or delayed

Task-Specific Adjustments:
- Sparse rewards: Longer decay (explore for more steps)
- Dense rewards: Shorter decay (can exploit earlier)
- Large action space: Lower epsilon_end (random actions rarely help)
- Small action space: Higher epsilon_end (random actions are cheap)
```
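To sanity-check a schedule before training, it can help to print ε at a few milestones. A minimal sketch using the three schedules defined above (the decay rate and milestone steps below are illustrative choices):

```python
total_steps = 1_000_000

for step in [0, 10_000, 100_000, 500_000, 1_000_000]:
    lin = epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.05)
    exp = epsilon_exponential(step, decay_rate=0.99999)   # gentler rate suited to 1M steps
    poly = epsilon_polynomial(step, total_steps, power=2.0)
    print(f"step={step:>9,}  linear={lin:.3f}  exponential={exp:.3f}  polynomial={poly:.3f}")
```

If ε hits its floor long before the agent can plausibly have found any reward, the schedule is too fast.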
### ε-Greedy Pitfall 1: Decay Too Fast

```python
# WRONG: Decays to ~0.01 within just 10k steps
epsilon_final = 0.01
decay_steps = 10_000
epsilon = epsilon_final ** (step / decay_steps)  # ← exploration is gone after 10k steps

# CORRECT: Decays gently over the whole training run
total_steps = 1_000_000
epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01)
```

**Symptom:** Agent plateaus early, never improves past the initial local optimum

**Fix:** Use a longer decay schedule, ensure epsilon_end > 0

### ε-Greedy Pitfall 2: Never Decays (Constant ε)

```python
# WRONG: Fixed epsilon forever
epsilon = 0.3  # Constant

# CORRECT: Decay epsilon over time
epsilon = epsilon_linear(step, total_steps=1_000_000)
```

**Symptom:** Agent learns but performance is noisy, can't fully exploit the learned policy

**Fix:** Add an epsilon decay schedule

### ε-Greedy Pitfall 3: Epsilon on Continuous Actions

```python
# WRONG: Discrete epsilon-greedy on continuous actions
action = np.random.uniform(-1, 1) if random() < epsilon else greedy_action

# CORRECT: Gaussian noise on continuous actions
def continuous_exploration(action, exploration_std=0.1):
    return action + np.random.normal(0, exploration_std, action.shape)
```

**Symptom:** Continuous action spaces don't benefit from ε-greedy (a uniformly random action is meaningless)

**Fix:** Use Gaussian noise or other continuous exploration methods

## Part 2: Boltzmann Exploration

### Temperature-Based Action Selection

Instead of a deterministic greedy action, select actions proportional to their Q-values using a softmax with temperature T.

```python
def boltzmann_exploration(q_values, temperature=1.0):
    """
    Select an action using the Boltzmann distribution.

    P(a) = exp(Q(s,a) / T) / Σ_a' exp(Q(s,a') / T)

    Args:
        q_values: Q(s, ·) - values for all actions
        temperature: Exploration parameter
            T → 0: Becomes deterministic (greedy)
            T → ∞: Becomes uniform random

    Returns:
        action: int (sampled from the distribution)
    """
    # Subtract max for numerical stability
    q_shifted = q_values - np.max(q_values)

    # Compute probabilities
    probabilities = np.exp(q_shifted / temperature)
    probabilities = probabilities / np.sum(probabilities)

    # Sample action
    return np.random.choice(len(q_values), p=probabilities)
```

### Properties vs ε-Greedy

| Feature | ε-Greedy | Boltzmann |
|---------|----------|-----------|
| Good actions | Probability: 1-ε | Probability: higher (proportional to Q) |
| Bad actions | Probability: ε/(n-1) | Probability: lower (proportional to Q) |
| Action selection | Deterministic or random | Stochastic distribution |
| Exploration | Uniform random | Biased toward better actions |
| Tuning | ε (1 parameter) | T (1 parameter) |

**Key Advantage:** Boltzmann balances better: good actions are preferred, but second-best actions still get meaningful chances.

```
Example: Three actions with Q = [10, 0, -10]

ε-Greedy (ε=0.2):
- Action 0: P = 0.8 (exploit best)
- Action 1: P = 0.1 (random)
- Action 2: P = 0.1 (random)
- Problem: The decent action (Q=0) and the terrible one (Q=-10) are sampled equally often

Boltzmann (T=5):
- Action 0: P ≈ 0.87 (exp(10/5) = e^2 ≈ 7.39)
- Action 1: P ≈ 0.12 (exp(0/5) = 1)
- Action 2: P ≈ 0.02 (exp(-10/5) ≈ 0.14)
- Better: Action 1 still gets ~12% (not negligible), while Action 2 gets almost nothing
```
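You can reproduce these numbers directly with the same softmax used in `boltzmann_exploration` above (a quick check; the temperature values are just for illustration):

```python
import numpy as np

q = np.array([10.0, 0.0, -10.0])
for T in (5.0, 1.0, 0.1):
    p = np.exp((q - q.max()) / T)
    p /= p.sum()
    print(f"T={T}: probabilities = {np.round(p, 3)}")
# T=5 spreads probability to the runner-up; T=0.1 is effectively greedy.
```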
### Temperature Decay Schedule

Like epsilon, temperature should decay: start high (explore), end low (exploit).

```python
def temperature_decay(step, total_steps, temp_start=1.0, temp_end=0.1):
    """
    Linear temperature decay.

    T(t) = T_start - (T_start - T_end) * t / T_total
    """
    t = min(step, total_steps)
    return temp_start - (temp_start - temp_end) * t / total_steps

# Usage in a training loop
for step in range(total_steps):
    T = temperature_decay(step, total_steps)
    action = boltzmann_exploration(q_values, temperature=T)
    # ...
```

### When to Use Boltzmann vs ε-Greedy

```
Choose ε-Greedy if:
- Simple implementation preferred
- Discrete action space
- Task has clear good/bad actions (wide Q-value spread)

Choose Boltzmann if:
- Actions have similar Q-values (nuanced exploration)
- Want to bias exploration toward promising actions
- Fine-grained control over exploration desired
```

## Part 3: UCB (Upper Confidence Bound)

### Theoretical Optimality

UCB has strong (near-optimal regret) guarantees for the multi-armed bandit problem:

```python
def ucb_action(q_values, action_counts, total_visits, c=1.0):
    """
    Select an action using the Upper Confidence Bound.

    UCB(a) = Q(a) + c * sqrt(ln(N) / N(a))

    Args:
        q_values: Current Q-value estimates
        action_counts: N(a) - times each action was visited
        total_visits: N - total visits to the state
        c: Exploration constant (usually 1.0 or sqrt(2))

    Returns:
        action: int (maximizing UCB)
    """
    # Avoid division by zero
    action_counts = np.maximum(action_counts, 1)

    # Compute the exploration bonus
    exploration_bonus = c * np.sqrt(np.log(total_visits) / action_counts)

    # Upper confidence bound
    ucb = q_values + exploration_bonus

    return np.argmax(ucb)
```

### Why UCB Works

UCB balances exploitation and exploration via **optimism under uncertainty**:

- If Q(a) is high → exploit it
- If Q(a) is uncertain (rarely visited) → the exploration bonus makes UCB high

```
Example: Bandit with 2 arms, c = sqrt(2) ≈ 1.41
- Arm A: Visited 100 times, estimated Q=2.0
- Arm B: Visited 10 times, estimated Q=1.5

UCB(A) = 2.0 + 1.41 * sqrt(ln(110) / 100) ≈ 2.0 + 0.31 = 2.31
UCB(B) = 1.5 + 1.41 * sqrt(ln(110) / 10)  ≈ 1.5 + 0.97 = 2.47

Result: Try Arm B despite its lower Q estimate (it is less certain)
```

### Critical Limitation: Doesn't Scale to Deep RL

UCB assumes a **tabular setting** (a small, discrete state space where you can count visits):

```python
# WORKS: Tabular Q-learning
state_action_counts = defaultdict(int)  # N(s, a)
state_counts = defaultdict(int)         # N(s)

# BREAKS in deep RL:
# With function approximation, states don't repeat exactly.
# Can't count "how many times I visited state X" in continuous/image observations.
```

**Practical Issue:** In image-based RL (Atari, vision), you never see the same pixel image twice. State counting is impossible.

### When UCB Applies

```
Use UCB if:
✓ Discrete action space (< 100 actions)
✓ Discrete state space (< 10,000 states)
✓ Tabular Q-learning (no function approximation)
✓ Rewards come quickly (don't need long-term planning)

Examples: Simple bandits, small Gridworlds, discrete card games

DO NOT use UCB if:
✗ Using neural networks (state approximation)
✗ Continuous actions or large state space
✗ Image observations (pixel space too large)
✗ Sparse rewards (need different methods)
```

### Connection to Deep RL

For deep RL, you need to estimate **uncertainty** without explicit counts:

```python
def deep_ucb_approximation(mean_q, uncertainty, c=1.0):
    """
    Approximate UCB using learned uncertainty (not action counts).

    Used in methods like:
    - Deep Ensembles: Use ensemble variance as uncertainty
    - Dropout: Use MC-dropout variance
    - Bootstrap DQN: Ensemble of Q-networks

    UCB ≈ Q(s,a) + c * uncertainty(s,a)
    """
    return mean_q + c * uncertainty
```

**Modern Approach:** Instead of counting visits, learn uncertainty through:

- **Ensemble Methods**: Train multiple Q-networks, use their disagreement
- **Bayesian Methods**: Learn a posterior over Q-values
- **Bootstrap DQN**: Separate Q-networks give uncertainty estimates

These adapt UCB principles to deep RL.
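As a concrete (simplified) sketch of the ensemble idea: keep several independently initialized Q-networks, use their mean as the value estimate and their standard deviation as the uncertainty term. The network sizes and bonus coefficient below are illustrative, not tuned values.

```python
import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    """K independently initialized Q-networks; their disagreement approximates uncertainty."""

    def __init__(self, state_dim, num_actions, k=5, hidden=128):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, num_actions),
            )
            for _ in range(k)
        )

    def ucb_action(self, state, c=1.0):
        # state: tensor of shape (state_dim,)
        qs = torch.stack([m(state) for m in self.members])  # (k, num_actions)
        mean_q = qs.mean(dim=0)
        uncertainty = qs.std(dim=0)
        return int(torch.argmax(mean_q + c * uncertainty))
```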
## Part 4: Curiosity-Driven Exploration (ICM)

### The Core Insight

**Prediction Error as an Exploration Signal**

The agent is "curious" about states where it can't predict the next state well:

```
Intuition: If I can't predict what will happen, I probably haven't
learned about this state yet. Let me explore here!

Intrinsic Reward = ||next_state - predicted_next_state||^2
```

### Intrinsic Curiosity Module (ICM)

```python
import torch
import torch.nn as nn

class IntrinsicCuriosityModule(nn.Module):
    """
    ICM = Forward Model + Inverse Model

    Forward Model: Predicts the next state from (state, action)
    - Input: current state + action taken
    - Output: predicted next state
    - Error: prediction error = surprise

    Inverse Model: Predicts the action from (state, next_state)
    - Input: current state and next state
    - Output: predicted action taken
    - Purpose: Learn a representation that distinguishes states
    """

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()

        # Inverse model: (s, s') → a
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

        # Forward model: (s, a) → s'
        # (named forward_model so it doesn't shadow nn.Module.forward)
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim)
        )

    def compute_intrinsic_reward(self, state, action, next_state):
        """
        Curiosity reward = prediction error of the forward model.

        high error → unfamiliar state → reward exploration
        low error  → familiar state   → ignore (already learned)
        """
        # Predict the next state
        predicted_next = self.forward_model(torch.cat([state, action], dim=-1))

        # Compute the prediction error
        prediction_error = torch.norm(next_state - predicted_next, dim=-1)

        # The intrinsic reward is the prediction error (exploration bonus)
        return prediction_error

    def loss(self, state, action, next_state):
        """
        Combine forward and inverse losses.

        Forward loss: forward model's next-state prediction error
        Inverse loss: inverse model's action prediction error
        """
        # Forward loss
        predicted_next = self.forward_model(torch.cat([state, action], dim=-1))
        forward_loss = torch.mean((next_state - predicted_next) ** 2)

        # Inverse loss (action assumed to be a one-hot or continuous vector)
        predicted_action = self.inverse_model(torch.cat([state, next_state], dim=-1))
        inverse_loss = torch.mean((action - predicted_action) ** 2)

        return forward_loss + inverse_loss
```
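A minimal usage sketch for the module above (the dimensions, batch size, one-hot action encoding, and `icm_optimizer` are illustrative assumptions, not part of the original text):

```python
state_dim, action_dim = 8, 4
icm = IntrinsicCuriosityModule(state_dim, action_dim)
icm_optimizer = torch.optim.Adam(icm.parameters(), lr=1e-4)

# A batch of transitions; discrete actions are one-hot encoded for the models
states = torch.randn(32, state_dim)
next_states = torch.randn(32, state_dim)
actions = nn.functional.one_hot(torch.randint(0, action_dim, (32,)), action_dim).float()

# Exploration bonus to add to the task reward (the scale λ is discussed in Part 8)
with torch.no_grad():
    r_intrinsic = icm.compute_intrinsic_reward(states, actions, next_states)

# Train the forward and inverse models on the same batch
icm_loss = icm.loss(states, actions, next_states)
icm_optimizer.zero_grad()
icm_loss.backward()
icm_optimizer.step()
```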
### Why Both Forward and Inverse Models?

```
Forward model alone:
- Can predict the next state without learning useful features
- Might just memorize (e.g., "Do pixels change when I do action X?")
- Doesn't necessarily learn a task-relevant state representation

Inverse model:
- Forces feature learning that distinguishes states
- Can only predict the action if states are well-represented
- Improves the forward model's learned representation

Together: Forward + Inverse
- Better feature learning (inverse helps)
- Better prediction (forward is primary)
```

### Critical Pitfall: Random Environment Trap

```python
# WRONG: Using curiosity in a stochastic environment
# Environment: Atari with pixel randomness/motion artifacts

# The agent gets rewarded for failing to predict pixel noise:
# prediction error stays high wherever pixels change randomly,
# so the intrinsic reward flows to the noisiest states!

# Result: Agent learns nothing about the task, it just seeks out random pixels

# CORRECT: Use RND instead (next section)
# RND uses a FROZEN random network and doesn't keep rewarding genuine noise
```

**Key Distinction:**

- ICM: Learns to predict the environment (breaks if the environment has noise/randomness)
- RND: Uses a frozen random network (robust to environment randomness)

### Computational Cost

```python
# ICM adds significant overhead:
# - Forward model network (encoder + layers + output)
# - Inverse model network (encoder + layers + output)
# - Training both networks every step

# Overhead estimate:
# Base agent: 1 network (policy/value)
# With ICM: 3+ networks (policy + forward + inverse)
# Training time: ~2-3× longer
# Memory: ~3× larger

# When justified:
# - Sparse rewards (ICM critical)
# - Large state spaces (ICM helps)
#
# When NOT justified:
# - Dense rewards (environment signal sufficient)
# - Continuous control with simple rewards (ε-greedy enough)
```

## Part 5: RND (Random Network Distillation)

### The Elegant Solution

RND is simpler and more robust than ICM:

```python
class RandomNetworkDistillation(nn.Module):
    """
    RND: Intrinsic reward = prediction error against a random target network

    Key innovation: The target network is RANDOM and FROZEN (never updated)

    Two networks:
    1. Target (random, frozen): f_target(s) - fixed throughout training
    2. Predictor (trained):     f_predict(s) - learns to predict the target

    Intrinsic reward = ||f_target(s) - f_predict(s)||^2

    New state (s not seen)  → high prediction error → reward exploration
    Seen state (s familiar) → low prediction error  → ignore
    """

    def __init__(self, state_dim, embedding_dim=128):
        super().__init__()

        # Target network: random, never updated
        self.target = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

        # Predictor network: learns to mimic the target
        self.predictor = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

        # Freeze the target network
        for param in self.target.parameters():
            param.requires_grad = False

    def compute_intrinsic_reward(self, state, scale=1.0):
        """
        Intrinsic reward = prediction error against the frozen target.

        Args:
            state: Current observation
            scale: Scale factor for the reward (usually 0.1-1.0)

        Returns:
            Intrinsic reward (novelty signal)
        """
        with torch.no_grad():
            target_features = self.target(state)
            predicted_features = self.predictor(state)

        # L2 prediction error
        prediction_error = torch.norm(
            target_features - predicted_features, dim=-1, p=2
        )

        return scale * prediction_error

    def predictor_loss(self, state):
        """
        Loss for the predictor: minimize prediction error.
        Only the predictor is updated (the target stays frozen).
        """
        with torch.no_grad():
            target_features = self.target(state)

        predicted_features = self.predictor(state)

        # MSE loss
        return torch.mean((target_features - predicted_features) ** 2)
```
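A minimal wiring sketch for the class above (the batch shape, learning rate, scale, and separate optimizer are illustrative assumptions; tuning is discussed below and in Part 8):

```python
state_dim = 8
rnd = RandomNetworkDistillation(state_dim)
rnd_optimizer = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)

states = torch.randn(32, state_dim)  # batch of observations

# Exploration bonus for each observation in the batch
r_intrinsic = rnd.compute_intrinsic_reward(states, scale=0.01)

# Update only the predictor (the target stays frozen)
loss = rnd.predictor_loss(states)
rnd_optimizer.zero_grad()
loss.backward()
rnd_optimizer.step()
```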
### Why RND is Elegant

1. **No Environment Model**: Doesn't need to model dynamics (unlike ICM)
2. **Robust to Randomness**: The random target isn't trying to predict anything real, so environment noise doesn't fool it the way it fools a learned dynamics model
3. **Simple**: Just predict random features
4. **Fast**: Only the predictor is trained (the target is frozen)

### RND vs ICM Comparison

| Aspect | ICM | RND |
|--------|-----|-----|
| Networks | Forward + Inverse | Target (frozen) + Predictor |
| Learns | Environment dynamics | Random feature prediction |
| Robust to noise | No (breaks with stochastic envs) | Yes (random target is immune) |
| Complexity | High (3+ networks, 2 losses) | Medium (2 networks, 1 loss) |
| Computation | 2-3× base agent | 1.5-2× base agent |
| When to use | Dense features, clean env | Sparse rewards, noisy env |

### RND Pitfall: Training Instability

```python
# WRONG: High learning rate, large reward scale
rnd_loss = rnd.predictor_loss(state)
optimizer.zero_grad()
rnd_loss.backward()
optimizer.step()  # ← using the main optimizer's high learning rate causes divergence

# CORRECT: Careful hyperparameter tuning
rnd_lr = 1e-4  # Much smaller than the main agent's learning rate
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=rnd_lr)

# Scale the intrinsic reward appropriately
intrinsic_reward = rnd.compute_intrinsic_reward(state, scale=0.01)
```

**Symptom:** RND rewards explode, agent overfits to novelty

**Fix:** Lower the learning rate for RND, scale intrinsic rewards carefully
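A common stabilization trick (used in the original RND paper and many implementations) is to divide the intrinsic reward by a running estimate of its standard deviation, so its scale stays roughly constant as the predictor improves. A minimal sketch; the `RunningStd` helper and its use here are illustrative:

```python
class RunningStd:
    """Tracks a running standard deviation via Welford's algorithm."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, values):
        for v in values.flatten().tolist():
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    def std(self):
        return (self.m2 / max(self.count - 1, 1)) ** 0.5 + self.eps

reward_stats = RunningStd()
raw = rnd.compute_intrinsic_reward(states)   # unscaled novelty signal
reward_stats.update(raw)
r_intrinsic = raw / reward_stats.std()       # roughly unit-scale bonus
```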
## Part 6: Count-Based Exploration

### State Visitation Counts

For **discrete/tabular** environments, track how many times each state has been visited:

```python
from collections import defaultdict

class CountBasedExploration:
    """
    Count-based exploration: encourage visiting rarely-seen states.

    Works for:
    ✓ Tabular (small discrete state space)
    ✓ Gridworlds, simple games

    Doesn't work for:
    ✗ Continuous spaces
    ✗ Image observations (never see the same image twice)
    ✗ Large state spaces
    """

    def __init__(self):
        self.state_counts = defaultdict(int)

    def compute_intrinsic_reward(self, state, reward_scale=1.0):
        """
        Intrinsic reward inversely proportional to state visitation.

        intrinsic_reward = reward_scale / sqrt(N(s))

        Rarely visited states (small N)     → high intrinsic reward
        Frequently visited states (large N) → low intrinsic reward
        """
        count = max(self.state_counts[state], 1)  # Avoid division by zero
        return reward_scale / np.sqrt(count)

    def update_counts(self, state):
        """Increment the visitation count for a state."""
        self.state_counts[state] += 1
```

### Example: Gridworld with Sparse Reward

```python
# Gridworld: 10×10 grid, reward at (9, 9), start at (0, 0)
# Without exploration: random walking takes a very long time to reach the corner
# With count-based bonuses: the agent is directed toward unexplored cells

# Pseudocode:
for episode in range(episodes):
    state = env.reset()
    for step in range(max_steps):
        # Q-learning action selection
        action = epsilon_greedy(q_values[state], epsilon)
        next_state, env_reward = env.step(action)

        # Exploration bonus for the state just reached
        intrinsic_reward = count_explorer.compute_intrinsic_reward(next_state)

        # Combine with the task reward
        combined_reward = env_reward + lambda_intrinsic * intrinsic_reward

        # Q-learning update with the combined reward
        q_values[state][action] += alpha * (
            combined_reward + gamma * max(q_values[next_state])
            - q_values[state][action]
        )

        # Update counts
        count_explorer.update_counts(next_state)
        state = next_state
```

### Critical Limitation: Doesn't Scale

```python
# Works: Small state space
state_space_size = 100  # 10×10 grid
# Can track counts for all states

# Fails: Large/continuous state space
state_space_size = 10 ** 18  # e.g., image observations
# Can't track visitation counts for 10^18 unique states!
```

## Part 7: When Exploration is Critical

### Decision Framework

**Exploration matters when:**

1. **Sparse Rewards** (rewards rare, hard to find)
   - Examples: Montezuma's Revenge, goal-conditioned tasks, real robotics
   - No dense reward signal to guide learning
   - Agent must explore to find any reward
   - Solution: Intrinsic motivation (curiosity, RND)

2. **Large State Spaces** (too many possible states)
   - Examples: Image-based RL, continuous control
   - Random exploration covers an infinitesimal fraction
   - Systematic exploration essential
   - Solution: Curiosity-driven or RND

3. **Long Horizons** (many steps before reward)
   - Examples: Multi-goal tasks, planning problems
   - Temporal credit assignment is hard
   - Need to explore systematically to connect actions to delayed rewards
   - Solution: Sophisticated exploration strategy

4. **Deceptive Reward Landscape** (local optima common)
   - Examples: Multiple solutions, trade-offs
   - Easy to get stuck in a suboptimal policy
   - Exploration helps escape local optima
   - Solution: Slow decay schedule, maintain exploration

### Decision Framework (Quick Check)

```
Do you have SPARSE rewards?
  YES → Use intrinsic motivation (curiosity, RND)
  NO  → Continue

Is the state space large (images, continuous)?
  YES → Use curiosity-driven or RND
  NO  → Continue

Is exploration reasonably efficient with ε-greedy?
  YES → Use ε-greedy + an appropriate decay schedule
  NO  → Use curiosity-driven or RND
```

### Example: Reward Structure Analysis

```python
def analyze_reward_structure(rewards):
    """Determine whether a dedicated exploration strategy is needed."""

    # Check sparsity
    nonzero_rewards = np.count_nonzero(rewards)
    sparsity = 1 - (nonzero_rewards / len(rewards))

    if sparsity > 0.95:
        print("SPARSE REWARDS detected")
        print("  → Use: Intrinsic motivation (RND or curiosity)")
        print("  → Why: Reward signal too rare to guide learning")

    # Check reward magnitude
    reward_std = np.std(rewards)

    if reward_std < 0.1:
        print("WEAK/FLAT REWARDS detected")
        print("  → Use: Intrinsic motivation")
        print("  → Why: Reward signal carries too little information to learn from")

    # Check episode length
    episode_length = len(rewards)

    if episode_length > 1000:
        print("LONG HORIZONS detected")
        print("  → Use: Strong exploration decay or intrinsic motivation")
        print("  → Why: Temporal credit assignment is difficult")
```
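For example, on a mostly-zero reward trace the function flags the problem immediately (the synthetic array below is purely illustrative):

```python
rewards = np.zeros(2000)
rewards[::500] = 1.0          # a reward only every 500 steps → ~99.8% zeros
analyze_reward_structure(rewards)
# Triggers the sparse-reward, weak-signal, and long-horizon warnings
```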
## Part 8: Combining Exploration with Task Rewards

### Combining Intrinsic and Extrinsic Rewards

When using intrinsic motivation, balance it against the task reward:

```python
def combine_rewards(extrinsic_reward, intrinsic_reward, intrinsic_scale=0.01):
    """
    Combine extrinsic (task) and intrinsic (curiosity) rewards.

    r_total = r_extrinsic + λ * r_intrinsic

    λ controls the tradeoff:
    - λ = 0:    Ignore the intrinsic reward (no exploration bonus)
    - λ = 0.01: Curiosity helps, task reward primary (typical)
    - λ = 0.1:  Curiosity significant
    - λ = 1.0:  Curiosity dominates (might ignore the task)
    """
    return extrinsic_reward + intrinsic_scale * intrinsic_reward
```

### Challenges: Reward Hacking

```python
# PROBLEM: The intrinsic reward encourages anything novel,
# even if the novel thing is useless for the task.

# Example: Atari with RND
# If the game has pixel randomness, RND rewards exploring random pixels
# instead of exploring to find coins/power-ups.

# SOLUTION 1: Scale the intrinsic reward carefully
# Make it significant but not dominant.

# SOLUTION 2: Curriculum on the intrinsic scale
# Start with a high intrinsic reward (discover the environment),
# gradually reduce it as the agent finds real reward signals.
```
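SOLUTION 2 can be implemented exactly like an epsilon schedule. A minimal sketch (the start/end values and the loop variables `step`, `r_task`, `r_intrinsic` are illustrative):

```python
def intrinsic_scale_schedule(step, total_steps, scale_start=0.1, scale_end=0.001):
    """Linearly anneal the intrinsic reward weight λ over training."""
    t = min(step, total_steps)
    return scale_start - (scale_start - scale_end) * t / total_steps

# Usage inside the training loop
lam = intrinsic_scale_schedule(step, total_steps=1_000_000)
r_total = combine_rewards(r_task, r_intrinsic, intrinsic_scale=lam)
```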
### Intrinsic Reward Scale Tuning

```python
# Quick tuning procedure (sketch):
for intrinsic_scale in [0.001, 0.01, 0.1, 1.0]:
    agent = RL_Agent(intrinsic_reward_scale=intrinsic_scale)
    for episode in episodes:
        performance = train_episode(agent)
    print(f"Scale={intrinsic_scale}: Performance={performance}")

# Find the scale where the agent learns the task well AND explores
# Usually 0.01-0.1 is the sweet spot
```

## Part 9: Common Pitfalls and Debugging

### Pitfall 1: Epsilon Decay Too Fast

**Symptom:** Agent plateaus at poor performance early in training

**Root Cause:** Epsilon decays to near-zero before the agent finds good actions

```python
# WRONG: Decays away within ~10k steps
epsilon_final = 0.0
epsilon_decay = 0.999  # Per-step multiplicative decay
# After 10k steps: ε ≈ 0.999^10,000 ≈ 5e-5, essentially no exploration left

# CORRECT: Decay over the full training run
total_training_steps = 1_000_000
epsilon_linear(step, total_training_steps, epsilon_start=1.0, epsilon_end=0.01)
```

**Diagnosis:**
- Plot epsilon over training: does it reach ~0 too early?
- Check whether performance still improves after epsilon reaches low values

**Fix:**
- Use a longer decay (more steps)
- Use a higher epsilon_end (never go to pure exploitation)

### Pitfall 2: Intrinsic Reward Too Strong

**Symptom:** Agent explores forever, ignores the task reward

**Root Cause:** Intrinsic reward scale too high

```python
# WRONG: Intrinsic reward dominates
r_total = r_task + 1.0 * r_intrinsic
# Agent optimizes novelty, ignores the task

# CORRECT: Intrinsic reward is a small bonus
r_total = r_task + 0.01 * r_intrinsic
# Task reward primary, intrinsic reward helps exploration
```

**Diagnosis:**
- Agent explores everywhere but doesn't collect task rewards
- Intrinsic reward keeps flowing to seemingly useless states

**Fix:**
- Reduce intrinsic_reward_scale (try 0.01, 0.001)
- Verify the agent eventually starts collecting task rewards

### Pitfall 3: ε-Greedy on Continuous Actions

**Symptom:** Exploration ineffective, agent doesn't learn

**Root Cause:** A uniformly random action in a continuous space is meaningless

```python
# WRONG: ε-greedy on continuous actions
if random() < epsilon:
    action = np.random.uniform(-1, 1)  # Random point in the action space
else:
    action = network(state)  # Neural network action
# The random action is far from the learned policy, completely unhelpful

# CORRECT: Gaussian noise around the policy action
action = network(state)
noisy_action = action + np.random.normal(0, exploration_std)
noisy_action = np.clip(noisy_action, -1, 1)
```

**Diagnosis:**
- Continuous action space and using ε-greedy
- Agent not learning effectively

**Fix:**
- Use Gaussian noise: action + N(0, σ), as sketched below
- Decay exploration_std over time (like epsilon decay)
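A minimal sketch of that fix: a noise wrapper whose standard deviation decays on the same shape of schedule as epsilon (the function and parameter names here are illustrative; `exploration_std_linear` is also referenced in the Typical Configurations in Part 12):

```python
def exploration_std_linear(step, total_steps, std_start=0.3, std_end=0.01):
    """Linearly decay the Gaussian exploration noise, mirroring epsilon_linear."""
    t = min(step, total_steps)
    return std_start - (std_start - std_end) * t / total_steps

def noisy_continuous_action(policy_action, step, total_steps, low=-1.0, high=1.0):
    """Add decaying Gaussian noise to a continuous policy action and clip to bounds."""
    std = exploration_std_linear(step, total_steps)
    noisy = policy_action + np.random.normal(0, std, np.shape(policy_action))
    return np.clip(noisy, low, high)
```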
### Pitfall 4: Forgetting to Decay Exploration

**Symptom:** Training loss decreases but the policy doesn't improve; noisy behavior

**Root Cause:** Agent keeps exploring randomly instead of exploiting the learned policy

```python
# WRONG: Constant exploration forever
epsilon = 0.3

# CORRECT: Decaying exploration
epsilon = epsilon_linear(step, total_steps)
```

**Diagnosis:**
- No epsilon decay schedule anywhere in the code
- Agent behaves randomly even after many training steps

**Fix:**
- Add a decay schedule (linear, exponential, polynomial)

### Pitfall 5: Using Exploration at Test Time

**Symptom:** Test performance worse than training, highly variable

**Root Cause:** Applying the exploration strategy (ε > 0) at test time

```python
# WRONG: Test with exploration
for test_episode in test_episodes:
    action = epsilon_greedy(q_values, epsilon=0.05)  # Wrong!
    # Agent still explores at test time

# CORRECT: Test with the greedy policy
for test_episode in test_episodes:
    action = np.argmax(q_values)  # Deterministic, no exploration
```

**Diagnosis:**
- Test performance has high variance
- Test performance < training performance (exploration hurts)

**Fix:**
- At test time, use the greedy/deterministic policy
- No ε-greedy, no Boltzmann, no exploration noise

### Pitfall 6: RND Predictor Overfitting

**Symptom:** RND loss decreases but intrinsic rewards stay large everywhere

**Root Cause:** Predictor overfits to training data, doesn't generalize to new states

```python
# WRONG: High learning rate, no regularization
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.001)
rnd_loss.backward()
rnd_optimizer.step()
# Predictor fits perfectly to seen states but doesn't generalize

# CORRECT: Lower learning rate, weight decay for regularization
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.0001, weight_decay=1e-5)
```

**Diagnosis:**
- RND training loss is low (close to 0)
- But intrinsic rewards are still high for most states
- Suggests the predictor fitted the training states but isn't generalizing

**Fix:**
- Reduce the RND learning rate
- Add weight decay (L2 regularization)
- Use batch normalization in the predictor

### Pitfall 7: Count-Based on Non-Tabular Problems

**Symptom:** Exploration ineffective, agent keeps revisiting similar states

**Root Cause:** State counting doesn't work for continuous/image spaces

```python
# WRONG: Counting state IDs in image-based RL
state = env.render(mode='rgb_array')   # 84x84 image
state_id = hash(state.tobytes())       # Nearly every frame hashes to a new ID!
count_based_explorer.update_counts(state_id)
# Tiny pixel differences make every frame look "new",
# so counts never accumulate and the bonus never decays

# CORRECT: Use RND or curiosity instead
rnd = RandomNetworkDistillation(state_dim)
# RND handles high-dimensional states
```

**Diagnosis:**
- Using count-based exploration with images/continuous observations
- Exploration not working effectively

**Fix:**
- Switch to RND or curiosity-driven methods
- Count-based exploration only for small discrete state spaces

## Part 10: Red Flags and Pressure Tests

### Red Flags Checklist

- [ ] **Constant epsilon (no decay)**: Agent explores forever, can't exploit
- [ ] **Epsilon decays to zero (pure exploitation)**: Agent stuck in a local optimum
- [ ] **ε-greedy on continuous actions**: Exploration unguided, learning slow
- [ ] **Intrinsic reward >> task reward**: Agent ignores the task
- [ ] **No exploration specified**: Might be using a default (often insufficient)
- [ ] **Using RND in a highly random/stochastic environment**: Novelty bonus chases noise
- [ ] **Count-based on images**: Can't count unique images, worthless
- [ ] **Exploration at test time**: Test performance degrades from exploration noise
- [ ] **ICM on a pixel-randomness environment**: Forward model tries to predict noise
- [ ] **Ignoring sparse rewards**: Using ε-greedy only, no intrinsic motivation
### Pressure Test Scenario 1: "My Agent Is Stuck in a Local Optimum"

**User Query:** "I trained my agent and it found a solution giving +10 reward. But I know the optimal solution is +100. The agent is stuck at +10. How do I fix it?"

**What Skill Must Provide:**
1. Diagnosis: Likely an exploration problem (local optimum)
2. Solution: Increase exploration (slower decay, or intrinsic motivation)
3. Prevention: Design the exploration strategy before training

**Skill Response:**

"Your agent found a local optimum (+10) and exploitation is too strong. This is a classic exploration-exploitation problem.

**Quick Fixes:**
1. Check epsilon decay: Is epsilon decaying too fast? Use a slower decay schedule.
2. Check epsilon_end: Is epsilon_end = 0? Change it to 0.01-0.05.
3. Add intrinsic motivation: Your environment might need curiosity-driven exploration.

**Systematic Approach:**
- If rewards are dense (frequent signals): ε-greedy with a slow decay
- If rewards are sparse (rare signals): Add RND or curiosity
- If the state space is huge: Definitely use RND

**Example Fix:**
```python
# Instead of epsilon decaying to 0.01 in 100k steps,
# decay over 1M steps with epsilon_end=0.05
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.05)
```"

### Pressure Test Scenario 2: "Which Exploration for Montezuma's Revenge?"

**User Query:** "I'm training on Montezuma's Revenge (the classic sparse reward benchmark). ε-Greedy isn't working (the agent never finds the first coin). What exploration strategy should I use?"

**What Skill Must Provide:**
1. Recognition: This is THE benchmark for intrinsic motivation
2. Solution: RND is the standard choice, ICM is an alternative
3. Implementation: Code/guidance for using RND

**Skill Response:**

"Montezuma's Revenge is the canonical example where ε-greedy fails. The first coin is at step 500+ with no reward signal along the way. Random exploration won't find it.

**Recommended:** RND (Random Network Distillation)

Why RND:
- Robust to stochasticity (doesn't try to predict the environment)
- Simpler than ICM (no environment model needed)
- Proven to work on Montezuma's Revenge

**Implementation:**
```python
rnd = RandomNetworkDistillation(state_dim=...)
lambda_intrinsic = 0.01  # Scale for the intrinsic reward

for step in training_steps:
    state = env.observation

    # Select an action and step the environment
    action = agent.select_action(state)
    next_state, r_task = env.step(action)

    # Intrinsic reward for the new observation
    r_intrinsic = rnd.compute_intrinsic_reward(next_state)

    # Combine with the task reward and train the agent on r_total
    r_total = r_task + lambda_intrinsic * r_intrinsic

    # Update the RND predictor
    rnd_loss = rnd.predictor_loss(next_state)
    rnd_optimizer.zero_grad()
    rnd_loss.backward()
    rnd_optimizer.step()
```

**Alternative:** Curiosity-driven exploration (ICM) also works if the environment is deterministic.

**What to tune:**
- lambda_intrinsic: Start at 0.01, increase if the agent doesn't explore enough
- RND learning rate: Usually 1e-4 (much smaller than the main agent's)"
### Pressure Test Scenario 3: "ε-Greedy Decay Schedule"

**User Query:** "I'm using ε-greedy but not sure about the decay schedule. I have 100k training steps. Should I decay epsilon linearly? Over how many steps? To what final value?"

**What Skill Must Provide:**
1. Formula: Linear, exponential, polynomial decay
2. Guidance: Typical values for different problem types
3. Tuning: How to adjust based on the task

**Skill Response:**

"**Quick Answer:**
```python
total_steps = 100_000
epsilon = epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.05)
```

**Reasoning:**
- `epsilon_start=1.0`: Explore fully initially (you know nothing)
- `epsilon_end=0.05`: Keep 5% exploration always (helps with stochasticity)
- Linear decay: Simple, works well for most tasks

**Decision Tree:**

1. **How dense are rewards?**
   - Sparse: Use a slower decay (maybe epsilon_end=0.1)
   - Dense: Use a faster decay (epsilon_end=0.01)

2. **How many steps?**
   - 100k steps: Decay over 50k-100k (half to full training)
   - 1M steps: Decay over 500k-1M
   - Rule: Decay over 50-100% of the training time

3. **Which decay curve?**
   - Linear (`epsilon_linear`): Safe default
   - Exponential (`epsilon_exponential`): Faster early decay
   - Polynomial with power=2: Balanced

**Example Tuning:**
```python
# Sparse rewards: more exploration needed
epsilon = epsilon_linear(step, total_steps=100_000, epsilon_start=1.0, epsilon_end=0.1)

# Very sparse rewards: even more
epsilon = epsilon_linear(step, total_steps=100_000, epsilon_start=1.0, epsilon_end=0.2)

# Dense rewards: faster decay to exploitation
epsilon = epsilon_linear(step, total_steps=50_000, epsilon_start=1.0, epsilon_end=0.01)
```"

### Pressure Test Scenario 4: "ICM vs RND vs ε-Greedy"

**User Query:** "I have sparse rewards and a deterministic environment. Should I use ε-greedy, ICM curiosity, or RND? How do I decide?"

**What Skill Must Provide:**
1. Comparison table: Pros/cons of each
2. Decision framework: Which method for which situation
3. Implementation difficulty: Is it worth the complexity?

**Skill Response:**

"**Comparison:**

| Method | Complexity | Robustness | When to Use |
|--------|-----------|-----------|------------|
| ε-Greedy | Very simple | Good on dense rewards | Dense rewards, simple env |
| ICM | Complex (2 models) | Poor on stochastic envs | Sparse, deterministic, representation learning matters |
| RND | Medium (1 trained model) | Excellent | Sparse rewards, any environment |

**Decision Framework:**

1. **Sparse rewards + deterministic + learned dynamics matter:**
   → Try ICM first (the forward model learns useful features)

2. **Sparse rewards + any environment:**
   → Use RND (robust, simpler)

3. **Sparse rewards + large pixel space:**
   → RND (better for high-dimensional observations)

4. **Dense rewards:**
   → Just use ε-greedy (intrinsic motivation not needed)

5. **Not sure, want the safe choice:**
   → Use RND (works almost everywhere)

**Implementation Difficulty:**
- ε-Greedy: ~5 lines of code
- ICM: 100+ lines (forward + inverse models, 2 losses)
- RND: ~50 lines (target + predictor)

**My Recommendation:** Start with ε-greedy. If it's not working after tuning the decay schedule, switch to RND. Only use ICM if the environment is fully deterministic and you care about learning the dynamics."
### Pressure Test Scenario 5: "Computational Cost of Curiosity"

**User Query:** "I want to add curiosity-driven exploration (ICM) to my agent. What's the computational overhead? Will it slow down training significantly?"

**What Skill Must Provide:**
1. Overhead estimate: Rough computation cost
2. Tradeoff: When the overhead is justified
3. Alternatives: Lighter-weight options

**Skill Response:**

"**Computational Overhead Estimate:**

Base agent:
- 1 policy/value network
- Training time: 1.0×

With ICM:
- Policy/value network + forward model + inverse model
- Forward model: encoder + hidden layers + state-dim output
- Inverse model: concatenated states + hidden layers + action-dim output
- Training time: ~2.5-3.0×

**Memory:**
- Base: X MB
- With ICM: ~3-4× X MB
- GPU memory is often the bottleneck

**When the Overhead is Justified:**
- Sparse rewards (ICM critical to success)
- Large state space (intrinsic motivation helps)
- Willing to wait longer for better exploration

**When the Overhead is NOT Justified:**
- Dense rewards (ε-greedy sufficient)
- Real-time training constraints
- Limited GPU memory

**Lighter Alternative:** Use RND instead of ICM:
- ~1.5-2.0× overhead (vs 2.5-3.0× for ICM)
- Similar exploration benefits
- Simpler to implement

**Scaling to Large Models:**
```python
# ICM with huge state encoders can be prohibitive
# Example: vision-transformer encoder feeding forward + inverse models
# That's very expensive

# RND scales better: the predictor can be small
# and doesn't need a sophisticated encoder
```

**Bottom Line:** ICM costs roughly 2-3× training time. If you can afford it and rewards are very sparse, it's worth it. Otherwise try RND, or even ε-greedy with a slower decay, first."

## Part 11: Rationalization Resistance Table

| Rationalization | Reality | Counter-Guidance | Red Flag |
|-----------------|---------|------------------|----------|
| "ε-Greedy works everywhere" | Fails on sparse rewards, large spaces | Use ε-greedy for dense/small, intrinsic motivation for sparse/large | Applying ε-greedy to Montezuma's Revenge |
| "Higher epsilon is better" | High ε → too random, doesn't exploit | Use a decay schedule (ε high early, low late) | Using constant ε=0.5 throughout training |
| "Decay epsilon to zero" | Agent needs residual exploration | Keep ε_end=0.01-0.1 always | Setting ε_final=0 (pure exploitation) |
| "Curiosity always helps" | Can break with stochasticity (model tries to predict noise) | Use RND for stochastic, ICM for deterministic | Agent learns to explore random noise instead of the task |
| "RND is just ICM simplified" | RND is fundamentally different (frozen random target vs learned model) | Understand that the frozen network prevents chasing predictable noise | Not grasping why RND's frozen network matters |
| "More intrinsic reward = faster exploration" | Too much intrinsic reward drowns out the task signal | Balance with λ=0.01-0.1, tune on task performance | Agent explores forever, ignores the task |
| "Count-based works anywhere" | Only works tabular (can't count unique images) | Use RND for continuous/high-dimensional spaces | Trying count-based on Atari images |
| "Boltzmann is always better than ε-greedy" | Boltzmann is smoother but harder to tune | Use ε-greedy for simplicity (it works well) | Switching to Boltzmann without a clear benefit |
| "Test with ε>0 for exploration" | Test should use the learned policy, not explore | ε=0 or greedy policy at test time | Variable test performance from exploration |
| "Longer decay is always better" | Very slow decay wastes time in early training | Match decay to task difficulty (faster for easy, slower for hard) | Decaying over 10M steps when training only 1M |
| "Skip exploration, increase the learning rate" | Learning rate is for optimization, exploration is for coverage | Use both: an exploration strategy and a sensible learning rate | Agent oscillates without exploration |
| "ICM is the SOTA exploration" | RND is simpler and more robust | Use RND unless you need an environment model | Implementing ICM when RND would suffice |
## Part 12: Summary and Decision Framework

### Quick Decision Tree

```
START: Need an exploration strategy?

├─ Are rewards sparse? (rare reward signal)
│   ├─ YES → Need intrinsic motivation
│   │   ├─ Environment stochastic?
│   │   │   ├─ YES → RND
│   │   │   └─ NO  → ICM (or RND for simplicity)
│   │   └─ When unsure, choose RND for safety
│   │
│   └─ NO → Dense rewards
│       └─ Use ε-greedy + a decay schedule

├─ Is the state space large? (images, continuous)
│   ├─ YES → Intrinsic motivation (RND/curiosity)
│   └─ NO  → ε-greedy usually sufficient

└─ Choosing the decay schedule:
    ├─ Sparse rewards → slower decay (ε_end=0.05-0.1)
    ├─ Dense rewards  → faster decay (ε_end=0.01)
    └─ Default: Linear decay over 50% of training
```

### Implementation Checklist

- [ ] Define the reward structure (dense vs sparse)
- [ ] Estimate the state space size (discrete vs continuous)
- [ ] Choose an exploration method (ε-greedy, curiosity, RND, UCB, count-based)
- [ ] Set epsilon/temperature parameters (start, end)
- [ ] Choose a decay schedule (linear, exponential, polynomial)
- [ ] If using intrinsic motivation: set λ (usually 0.01)
- [ ] Use the greedy policy at test time (ε=0)
- [ ] Monitor exploration vs exploitation (plot the epsilon decay)
- [ ] Tune hyperparameters (decay schedule, λ) based on task performance

### Typical Configurations

**Dense Rewards, Small Action Space (e.g., simple game)**
```python
epsilon = epsilon_linear(step, total_steps=100_000, epsilon_start=1.0, epsilon_end=0.01)
# Fast move to exploitation, little exploration needed
```

**Sparse Rewards, Discrete Actions (e.g., Atari)**
```python
rnd = RandomNetworkDistillation(...)
epsilon = epsilon_linear(step, total_steps=1_000_000, epsilon_start=1.0, epsilon_end=0.05)
r_total = r_task + 0.01 * r_intrinsic
# Intrinsic motivation + slow decay
```

**Continuous Control, Sparse Rewards (e.g., Robotics)**
```python
rnd = RandomNetworkDistillation(...)
action = policy(state) + gaussian_noise(std=exploration_std)
exploration_std = exploration_std_linear(..., std_end=0.01)  # decays like epsilon (see Part 9, Pitfall 3)
r_total = r_task + 0.01 * r_intrinsic
# Gaussian noise + RND
```

## Key Takeaways

1. **Exploration is fundamental**: Don't ignore it. Design the exploration strategy before training.
2. **Match the method to the problem**:
   - Dense rewards → ε-greedy
   - Sparse rewards → Intrinsic motivation (RND preferred)
   - Large state space → Intrinsic motivation
3. **Decay exploration over time**: Explore early, exploit late.
4. **Avoid common pitfalls**:
   - Don't decay to zero (ε_end > 0)
   - Don't use ε-greedy on continuous actions
   - Don't forget the decay schedule
   - Don't use exploration at test time
5. **Balance intrinsic and extrinsic**: If using intrinsic rewards, don't let them dominate.
6. **RND is the safe choice**: Works for most exploration problems, simpler than ICM.
7. **Test the exploration hypothesis**: Plot epsilon or intrinsic rewards; verify the exploration strategy is actually active.

This skill is about **systematic exploration design**, not just tuning one hyperparameter.