# Exploration Strategies in Deep RL
## When to Use This Skill
Invoke this skill when you encounter:
- **Exploration-Exploitation Problem**: Agent stuck in local optimum, not finding sparse rewards
- **ε-Greedy Tuning**: Designing or debugging epsilon decay schedules
- **Sparse Reward Environments**: Montezuma's Revenge, goal-conditioned tasks, minimal feedback
- **Large State Spaces**: Too many states for random exploration to be effective
- **Curiosity-Driven Learning**: Implementing or understanding intrinsic motivation
- **RND (Random Network Distillation)**: Novelty-based exploration for sparse rewards
- **Count-Based Exploration**: Encouraging discovery in discrete/tabular domains
- **Exploration Stability**: Agent explores too much/little, inconsistent performance
- **Method Selection**: Which exploration strategy for this problem?
- **Computational Cost**: Balancing exploration sophistication vs overhead
- **Boltzmann Exploration**: Softmax-based action selection and temperature tuning
**Core Problem:** Many RL agents get stuck exploiting a local optimum, never finding sparse rewards or exploring high-dimensional state spaces effectively. Choosing the right exploration strategy is fundamental to success.
## Do NOT Use This Skill For
- **Algorithm selection** (route to rl-foundations or specific algorithm skills like value-based-methods, policy-gradient-methods)
- **Reward design issues** (route to reward-shaping-engineering)
- **Environment bugs causing poor exploration** (route to rl-debugging first to verify environment works correctly)
- **Basic RL concepts** (route to rl-foundations for MDPs, value functions, Bellman equations)
- **Training instability unrelated to exploration** (route to appropriate algorithm skill or rl-debugging)
## Core Principle: The Exploration-Exploitation Tradeoff
### The Fundamental Tension
In reinforcement learning, every action selection is a decision:
- **Exploit**: Take the action with highest estimated value (maximize immediate reward)
- **Explore**: Try a different action to learn about its value (find better actions)
```
Exploitation Extreme:
- Only take the best-known action
- High immediate reward (in training)
- BUT: Stuck in local optimum if initial action wasn't optimal
- Risk: Never find the actual best reward
Exploration Extreme:
- Take random actions uniformly
- Will eventually find any reward
- BUT: Wasting resources on clearly bad actions
- Risk: No learning because too much randomness
Optimal Balance:
- Explore enough to find good actions
- Exploit enough to benefit from learning
```
### Why Exploration Matters
**Scenario 1: Sparse Reward Environment**
Imagine an agent in Montezuma's Revenge (classic exploration benchmark):
- Most states give reward = 0
- First coin gives +1 (at step 500+)
- Without exploring systematically, random actions won't find that coin in millions of steps
Without exploration strategy:
```
Steps 0-1,000: Random actions, no reward signal
Steps 1,000-10,000: Eventually stumbles onto the coin and slowly learns a path to it
Problem: Thousands of steps of undirected random exploration before any signal
With smart exploration (RND):
Steps 0-100: RND flags novel states, guiding the agent toward unexplored areas
Steps 100-500: Finds the coin far sooner because exploration is directed
Result: Reward found in a small fraction of the steps
```
**Scenario 2: Local Optimum Trap**
Agent finds a small reward (+1) from a simple policy, while a better policy worth +5 exists:
```
Without decay (constant ε=0.3):
- Agent learns that exploit_policy achieves +1
- 30% of actions are still random, but rarely systematic enough to find the +5 policy
- BUT: 70% of the time it exploits the suboptimal policy indefinitely
With decay:
- Step 0: ε=1.0, 100% explore
- Step 100k: ε=0.05, 5% explore
- Step 500k: ε=0.01, 1% explore
- Result: Heavy early exploration finds the +5 policy, then the agent exploits it
```
### Core Rule
**Exploration is an investment with declining returns.**
- Early training: Exploration critical (don't know anything yet)
- Mid training: Balanced (learning but not confident)
- Late training: Exploitation dominant (confident in good actions)
## Part 1: ε-Greedy Exploration
### The Baseline Method
ε-Greedy is the simplest exploration strategy: with probability ε, take a random action; otherwise, take the greedy (best-known) action.
```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """
    Select action using ε-greedy.

    Args:
        q_values: Q(s, *) - values for all actions
        epsilon: exploration probability [0, 1]

    Returns:
        action: int (0 to num_actions-1)
    """
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(len(q_values))
    else:
        # Exploit: best action
        return np.argmax(q_values)
```
### Why ε-Greedy Works
1. **Simple**: Easy to implement and understand
2. **Guaranteed Convergence**: Will eventually visit all states (if ε > 0)
3. **Effective Baseline**: Works surprisingly well for many tasks
4. **Interpretable**: ε has clear meaning (probability of random action)
### When ε-Greedy Fails
```
Problem Space → Exploration Effectiveness:
Small discrete spaces (< 100 actions):
- ε-greedy: Excellent ✓
- Reason: Random exploration covers space quickly
Large discrete spaces (100-10,000 actions):
- ε-greedy: Poor ✗
- Reason: Random action is almost always bad
- Example: Game with 500 actions; a random action has only a 1/500 chance of being the right one
Continuous action spaces:
- ε-greedy: Terrible ✗
- Reason: Random action in [-∞, ∞] is meaningless noise
- Alternative: Gaussian noise on action (not true ε-greedy)
Sparse rewards, large state spaces:
- ε-greedy: Hopeless ✗
- Reason: Random exploration won't find rare reward before heat death
- Alternative: Curiosity, RND, intrinsic motivation
```
### ε-Decay Schedules
The key insight: ε should decay over time. Explore early, exploit late.
#### Linear Decay
```python
def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.1):
    """
    Linear decay from epsilon_start to epsilon_end.

    ε(t) = ε_start - (ε_start - ε_end) * t / T
    """
    t = min(step, total_steps)
    return epsilon_start - (epsilon_start - epsilon_end) * t / total_steps
```
**Properties:**
- Simple, predictable, easy to tune
- Equal exploration reduction per step
- Good for most tasks
**Guidance:**
- Use if no special knowledge about task
- `epsilon_start = 1.0` (explore fully initially)
- `epsilon_end = 0.01` to `0.1` (small residual exploration)
- `total_steps = 1,000,000` (typical deep RL)
#### Exponential Decay
```python
def epsilon_exponential(step, decay_rate=0.9995):
    """
    Exponential decay with constant rate.

    ε(t) = ε_0 * decay_rate^t
    """
    return 1.0 * (decay_rate ** step)
```
**Properties:**
- Fast initial decay, slow tail
- Aggressive early exploration cutoff
- Exploration drops exponentially
**Guidance:**
- Use if task rewards are found quickly
- `decay_rate = 0.9995` is gentle (ε halves roughly every 1,400 steps)
- `decay_rate = 0.999` is more aggressive (ε halves roughly every 700 steps)
- Watch for premature convergence to local optimum
#### Polynomial Decay
```python
def epsilon_polynomial(step, total_steps, epsilon_start=1.0,
                       epsilon_end=0.01, power=2.0):
    """
    Polynomial decay: ε(t) = ε_end + (ε_start - ε_end) * (1 - t/T)^p

    power=1: Linear
    power=2: Quadratic (faster early decay)
    power=0.5: Slower decay
    """
    t = min(step, total_steps)
    fraction = t / total_steps
    return epsilon_end + (epsilon_start - epsilon_end) * (1 - fraction) ** power
```
**Properties:**
- Smooth, tunable decay curve
- Power > 1: Fast early decay, slow tail
- Power < 1: Slow early decay, fast tail
**Guidance:**
- `power = 2.0`: Quadratic (balanced, common)
- `power = 3.0`: Cubic (aggressive early decay)
- `power = 0.5`: Slower (gentle early decay)
### Practical Guidance: Choosing Epsilon Parameters
```
Rule of Thumb:
- epsilon_start = 1.0 (explore uniformly initially)
- epsilon_end = 0.01 to 0.1 (maintain minimal exploration)
- 0.01: For large action spaces (need some exploration)
- 0.05: Default choice
- 0.1: For small action spaces (can afford random actions)
- total_steps: Based on training duration
- Usually 500k to 1M steps
- Longer if rewards are sparse or delayed
Task-Specific Adjustments:
- Sparse rewards: Longer decay (explore for more steps)
- Dense rewards: Shorter decay (can exploit earlier)
- Large action space: Higher epsilon_end (maintain exploration)
- Small action space: Lower epsilon_end (exploitation is cheap)
```
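Before committing to a schedule, it helps to print ε at a few checkpoints and confirm the curve matches your training budget. A minimal sketch using the three schedule functions above (the checkpoint steps and the exponential `decay_rate`, chosen here for a 1M-step run, are illustrative):
```python
# Compare the three schedules at a few checkpoints before committing to one.
total_steps = 1_000_000

for step in [0, 10_000, 100_000, 500_000, 1_000_000]:
    lin = epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.05)
    expo = epsilon_exponential(step, decay_rate=0.999995)  # tuned for ~1M steps
    poly = epsilon_polynomial(step, total_steps, epsilon_start=1.0,
                              epsilon_end=0.05, power=2.0)
    print(f"step={step:>9,}  linear={lin:.3f}  exponential={expo:.3f}  poly2={poly:.3f}")
```
If the printed values reach the floor long before the end of training, the schedule is too aggressive for your budget.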
### ε-Greedy Pitfall 1: Decay Too Fast
```python
# WRONG: drops to epsilon_final after only 10k steps, then keeps shrinking toward 0
epsilon_final = 0.01
decay_steps = 10_000
epsilon = epsilon_final ** (step / decay_steps) # ← BUG
# CORRECT: Decays gently over training
total_steps = 1_000_000
epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01)
```
**Symptom:** Agent plateaus early, never improves past initial local optimum
**Fix:** Use longer decay schedule, ensure epsilon_end > 0
### ε-Greedy Pitfall 2: Never Decays (Constant ε)
```python
# WRONG: Fixed epsilon forever
epsilon = 0.3 # Constant
# CORRECT: Decay epsilon over time
epsilon = epsilon_linear(step, total_steps=1_000_000)
```
**Symptom:** Agent learns but performance noisy, can't fully exploit learned policy
**Fix:** Add epsilon decay schedule
### ε-Greedy Pitfall 3: Epsilon on Continuous Actions
```python
# WRONG: Discrete epsilon-greedy on continuous actions
action = np.random.uniform(-1, 1) if np.random.random() < epsilon else greedy_action

# CORRECT: Gaussian noise on continuous actions
def continuous_exploration(action, exploration_std=0.1):
    return action + np.random.normal(0, exploration_std, action.shape)
```
**Symptom:** Continuous action spaces don't benefit from ε-greedy (random action is meaningless)
**Fix:** Use Gaussian noise or other continuous exploration methods
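For continuous actions, the analogue of an ε schedule is a decaying noise scale. A minimal sketch, assuming actions bounded in [-1, 1]; the helper name `exploration_std_linear` matches the one referenced in the configuration examples at the end of this skill, and the start/end values are illustrative:
```python
import numpy as np

def exploration_std_linear(step, total_steps, std_start=0.3, std_end=0.01):
    """Linearly decay the Gaussian exploration noise, mirroring epsilon decay."""
    t = min(step, total_steps)
    return std_start - (std_start - std_end) * t / total_steps

def noisy_continuous_action(policy_action, step, total_steps, low=-1.0, high=1.0):
    """Add decaying zero-mean Gaussian noise to a deterministic policy action."""
    std = exploration_std_linear(step, total_steps)
    noise = np.random.normal(0.0, std, size=np.shape(policy_action))
    return np.clip(policy_action + noise, low, high)
```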
## Part 2: Boltzmann Exploration
### Temperature-Based Action Selection
Instead of deterministic greedy action, select actions proportional to their Q-values using softmax with temperature T.
```python
def boltzmann_exploration(q_values, temperature=1.0):
    """
    Select action using the Boltzmann (softmax) distribution.

    P(a) = exp(Q(s,a) / T) / Σ exp(Q(s,a') / T)

    Args:
        q_values: Q(s, *) - values for all actions
        temperature: Exploration parameter
            T → 0: Becomes deterministic (greedy)
            T → ∞: Becomes uniform random

    Returns:
        action: int (sampled from distribution)
    """
    # Subtract max for numerical stability
    q_shifted = q_values - np.max(q_values)
    # Compute probabilities
    probabilities = np.exp(q_shifted / temperature)
    probabilities = probabilities / np.sum(probabilities)
    # Sample action
    return np.random.choice(len(q_values), p=probabilities)
```
### Properties vs ε-Greedy
| Feature | ε-Greedy | Boltzmann |
|---------|----------|-----------|
| Good actions | Probability: 1-ε | Probability: higher (proportional to Q) |
| Bad actions | Probability: ε/(n-1) | Probability: lower (proportional to Q) |
| Action selection | Deterministic or random | Stochastic distribution |
| Exploration | Uniform random | Biased toward better actions |
| Tuning | ε (1 parameter) | T (1 parameter) |
**Key Advantage:** Boltzmann balances better—good actions are preferred but still get chances.
```
Example: Three actions with Q=[10, 0, -10]
ε-Greedy (ε=0.2):
- Action 0: P ≈ 0.87 (greedy 0.8 plus its share of the random draws)
- Action 1: P ≈ 0.07 (random)
- Action 2: P ≈ 0.07 (random)
- Problem: Actions 1 and 2 are sampled equally, ignoring that Q=0 is far better than Q=-10
Boltzmann (T=5):
- Action 0: P ≈ 0.87 (exp(10/5) = e^2 ≈ 7.4)
- Action 1: P ≈ 0.12 (exp(0/5) = 1)
- Action 2: P ≈ 0.02 (exp(-10/5) ≈ 0.14)
- Better: Action 1 still gets ~12% (not negligible), while Action 2 is nearly ignored
```
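The probabilities above can be reproduced directly from the softmax definition (a small standalone check, separate from the sampling function):
```python
import numpy as np

def boltzmann_probabilities(q_values, temperature):
    """Softmax over Q-values at a given temperature (numerically stable)."""
    q_shifted = np.asarray(q_values, dtype=float) - np.max(q_values)
    exp_q = np.exp(q_shifted / temperature)
    return exp_q / np.sum(exp_q)

print(boltzmann_probabilities([10, 0, -10], temperature=5.0))
# → approximately [0.87, 0.12, 0.02]
```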
### Temperature Decay Schedule
Like epsilon, temperature should decay: start high (explore), end low (exploit).
```python
def temperature_decay(step, total_steps, temp_start=1.0, temp_end=0.1):
    """
    Linear temperature decay.

    T(t) = T_start - (T_start - T_end) * t / total_steps
    """
    t = min(step, total_steps)
    return temp_start - (temp_start - temp_end) * t / total_steps

# Usage in training loop
for step in range(total_steps):
    T = temperature_decay(step, total_steps)
    action = boltzmann_exploration(q_values, temperature=T)
    # ...
```
### When to Use Boltzmann vs ε-Greedy
```
Choose ε-Greedy if:
- Simple implementation preferred
- Discrete action space
- Task has clear good/bad actions (wide Q-value spread)
Choose Boltzmann if:
- Actions have similar Q-values (nuanced exploration)
- Want to bias exploration toward promising actions
- Fine-grained control over exploration desired
```
## Part 3: UCB (Upper Confidence Bound)
### Theoretical Optimality
UCB is provably optimal for the multi-armed bandit problem:
```python
def ucb_action(q_values, action_counts, total_visits, c=1.0):
    """
    Select action using Upper Confidence Bound.

    UCB(a) = Q(a) + c * sqrt(ln(N) / N(a))

    Args:
        q_values: Current Q-value estimates
        action_counts: N(a) - times each action visited
        total_visits: N - total visits to state
        c: Exploration constant (usually 1.0 or sqrt(2))

    Returns:
        action: int (maximizing UCB)
    """
    # Avoid division by zero
    action_counts = np.maximum(action_counts, 1)
    # Compute exploration bonus
    exploration_bonus = c * np.sqrt(np.log(total_visits) / action_counts)
    # Upper confidence bound
    ucb = q_values + exploration_bonus
    return np.argmax(ucb)
```
### Why UCB Works
UCB balances exploitation and exploration via **optimism under uncertainty**:
- If Q(a) is high → exploit it
- If Q(a) is uncertain (rarely visited) → exploration bonus makes UCB high
```
Example: Bandit with 2 arms
- Arm A: Visited 100 times, estimated Q=2.0
- Arm B: Visited 5 times, estimated Q=1.5
UCB(A) = 2.0 + 1.0 * sqrt(ln(105) / 100) ≈ 2.0 + 0.22 = 2.22
UCB(B) = 1.5 + 1.0 * sqrt(ln(105) / 5)  ≈ 1.5 + 0.96 = 2.46
Result: Try Arm B despite its lower Q estimate (the estimate is less certain)
```
### Critical Limitation: Doesn't Scale to Deep RL
UCB assumes **tabular setting** (small, discrete state space where you can count visits):
```python
from collections import defaultdict

# WORKS: Tabular Q-learning
state_action_counts = defaultdict(int)  # N(s, a)
state_counts = defaultdict(int)         # N(s)
# BREAKS in deep RL:
# With function approximation, states don't repeat exactly
# Can't count "how many times visited state X" in continuous/image observations
```
**Practical Issue:**
In image-based RL (Atari, vision), never see the same pixel image twice. State counting is impossible.
### When UCB Applies
```
Use UCB if:
✓ Discrete action space (< 100 actions)
✓ Discrete state space (< 10,000 states)
✓ Tabular Q-learning (no function approximation)
✓ Rewards come quickly (don't need long-term planning)
Examples: Simple bandits, small Gridworlds, discrete card games
DO NOT use UCB if:
✗ Using neural networks (state approximation)
✗ Continuous actions or large state space
✗ Image observations (pixel space too large)
✗ Sparse rewards (need different methods)
```
### Connection to Deep RL
For deep RL, need to estimate **uncertainty** without explicit counts:
```python
def deep_ucb_approximation(mean_q, uncertainty, c=1.0):
    """
    Approximate UCB using learned uncertainty (not action counts).

    Used in methods like:
    - Deep Ensembles: Use ensemble variance as uncertainty
    - Dropout: Use MC-dropout variance
    - Bootstrap DQN: Ensemble of Q-networks

    UCB ≈ Q(s,a) + c * uncertainty(s,a)
    """
    return mean_q + c * uncertainty
```
**Modern Approach:** Instead of counting visits, learn uncertainty through:
- **Ensemble Methods**: Train multiple Q-networks, use disagreement
- **Bayesian Methods**: Learn posterior over Q-values
- **Bootstrap DQN**: Separate Q-networks give uncertainty estimates
These adapt UCB principles to deep RL.
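A minimal sketch of the ensemble idea, assuming a small discrete action space: train K independent Q-heads and use their disagreement (standard deviation) as the uncertainty term in a UCB-style bonus. The head count and layer sizes are illustrative, and in practice each head is trained on its own bootstrapped minibatches (as in Bootstrapped DQN).
```python
import torch
import torch.nn as nn

class EnsembleQNetwork(nn.Module):
    """K independent Q-heads; their disagreement serves as an uncertainty proxy."""
    def __init__(self, state_dim, num_actions, num_heads=5, hidden_dim=128):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_actions)
            )
            for _ in range(num_heads)
        ])

    def ucb_action(self, state, c=1.0):
        """Pick the action maximizing mean Q plus c times the ensemble std."""
        with torch.no_grad():
            q_all = torch.stack([head(state) for head in self.heads])  # (K, num_actions)
            mean_q = q_all.mean(dim=0)
            std_q = q_all.std(dim=0)
            return int(torch.argmax(mean_q + c * std_q).item())
```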
## Part 4: Curiosity-Driven Exploration (ICM)
### The Core Insight
**Prediction Error as Exploration Signal**
Agent is "curious" about states where it can't predict the next state well:
```
Intuition: If I can't predict what will happen, I probably
haven't learned about this state yet. Let me explore here!
Intrinsic Reward = ||next_state - predicted_next_state||^2
```
### Intrinsic Curiosity Module (ICM)
```python
import torch
import torch.nn as nn

class IntrinsicCuriosityModule(nn.Module):
    """
    ICM = Forward Model + Inverse Model

    Forward Model: Predicts next state from (state, action)
    - Input: current state + action taken
    - Output: predicted next state
    - Error: prediction error = surprise

    Inverse Model: Predicts action from (state, next_state)
    - Input: current state and next state
    - Output: predicted action taken
    - Purpose: Learn representation that distinguishes states

    Note: actions are assumed to be vectors (e.g. one-hot for discrete
    actions) so they can be concatenated with states.
    """
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        # Inverse model: (s, s') → a
        self.inverse = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        # Forward model: (s, a) → s'
        # (named forward_model so it doesn't shadow nn.Module.forward)
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim)
        )

    def compute_intrinsic_reward(self, state, action, next_state):
        """
        Curiosity reward = prediction error of forward model.

        high error → unfamiliar transition → reward exploration
        low error  → familiar transition  → ignore (already learned)
        """
        with torch.no_grad():
            # Predict next state (no gradient: this is a reward, not a loss)
            predicted_next = self.forward_model(torch.cat([state, action], dim=-1))
            prediction_error = torch.norm(next_state - predicted_next, dim=-1)
        return prediction_error

    def loss(self, state, action, next_state):
        """
        Combine forward and inverse losses.

        Forward loss: forward model's next-state prediction error
        Inverse loss: inverse model's action prediction error
        """
        # Forward loss
        predicted_next = self.forward_model(torch.cat([state, action], dim=-1))
        forward_loss = torch.mean((next_state - predicted_next) ** 2)
        # Inverse loss
        predicted_action = self.inverse(torch.cat([state, next_state], dim=-1))
        inverse_loss = torch.mean((action - predicted_action) ** 2)
        return forward_loss + inverse_loss
```
### Why Both Forward and Inverse Models?
```
Forward model alone:
- Can predict next state without learning features
- Might exploit trivial regularities (e.g., predicting that most pixels barely change for any action) rather than learning useful features
- Doesn't necessarily learn task-relevant state representation
Inverse model:
- Forces feature learning that distinguishes states
- Can only predict action if states are well-represented
- Improves forward model's learned representation
Together: Forward + Inverse
- Better feature learning (inverse helps)
- Better prediction (forward is primary)
```
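A minimal sketch of how the ICM bonus might be folded into a training loop, assuming a discrete action space with one-hot encoded actions, states as 1-D torch tensors, and hypothetical `agent`/`env` objects; the 0.01 scale is illustrative:
```python
icm = IntrinsicCuriosityModule(state_dim, action_dim)
icm_optimizer = torch.optim.Adam(icm.parameters(), lr=1e-4)
lambda_intrinsic = 0.01

state = env.reset()
for step in range(total_steps):
    action_idx = agent.select_action(state)
    next_state, extrinsic_reward, done = env.step(action_idx)

    # One-hot encode the discrete action for the ICM models
    action = torch.zeros(action_dim)
    action[action_idx] = 1.0

    # Curiosity bonus plus task reward
    r_intrinsic = icm.compute_intrinsic_reward(state, action, next_state)
    r_total = extrinsic_reward + lambda_intrinsic * r_intrinsic.item()
    # ...train the agent on (state, action_idx, r_total, next_state) here...

    # Train the ICM forward and inverse models
    icm_loss = icm.loss(state, action, next_state)
    icm_optimizer.zero_grad()
    icm_loss.backward()
    icm_optimizer.step()

    state = env.reset() if done else next_state
```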
### Critical Pitfall: Random Environment Trap
```python
# WRONG: Using curiosity in stochastic environment
# Environment: Atari with pixel randomness/motion artifacts
# Agent gets reward for predicting pixel noise
# Prediction error = pixels changed randomly
# Intrinsic reward goes to the noisiest state!
# Result: Agent learns nothing about task, just explores random pixels
# CORRECT: Use RND instead (next section)
# RND uses FROZEN random network, doesn't get reward for actual noise
```
**Key Distinction:**
- ICM: Learns to predict environment (breaks if environment has noise/randomness)
- RND: Uses frozen random network (robust to environment randomness)
### Computational Cost
```python
# ICM adds significant overhead:
# - Forward model network (encoder + layers + output)
# - Inverse model network (encoder + layers + output)
# - Training both networks every step
# Overhead estimate:
# Base agent: 1 network (policy/value)
# With ICM: 3+ networks (policy + forward + inverse)
# Training time: ~2-3× longer
# Memory: ~3× larger
# When justified:
# - Sparse rewards (ICM critical)
# - Large state spaces (ICM helps)
#
# When NOT justified:
# - Dense rewards (environment signal sufficient)
# - Continuous control with simple rewards (ε-greedy enough)
```
## Part 5: RND (Random Network Distillation)
### The Elegant Solution
RND is simpler and more robust than ICM:
```python
class RandomNetworkDistillation(nn.Module):
    """
    RND: Intrinsic reward = prediction error of target network

    Key innovation: Target network is RANDOM and FROZEN
    (never updated)

    Two networks:
    1. Target (random, frozen): f_target(s) - fixed throughout training
    2. Predictor (trained): f_predict(s) - learns to predict target

    Intrinsic reward = ||f_target(s) - f_predict(s)||^2

    New state (s not seen) → high prediction error → reward exploration
    Seen state (s familiar) → low prediction error → ignore
    """
    def __init__(self, state_dim, embedding_dim=128):
        super().__init__()
        # Target network: random, never updates
        self.target = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )
        # Predictor network: learns to mimic target
        self.predictor = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )
        # Freeze target network
        for param in self.target.parameters():
            param.requires_grad = False

    def compute_intrinsic_reward(self, state, scale=1.0):
        """
        Intrinsic reward = prediction error of target network.

        Args:
            state: Current observation
            scale: Scale factor for reward (usually 0.1-1.0)

        Returns:
            Intrinsic reward (novelty signal)
        """
        with torch.no_grad():
            target_features = self.target(state)
            predicted_features = self.predictor(state)
            # L2 prediction error (no gradient: this is a reward, not a loss)
            prediction_error = torch.norm(
                target_features - predicted_features,
                p=2,
                dim=-1
            )
        return scale * prediction_error

    def predictor_loss(self, state):
        """
        Loss for predictor: minimize prediction error.
        Only update predictor (target stays frozen).
        """
        with torch.no_grad():
            target_features = self.target(state)
        predicted_features = self.predictor(state)
        # MSE loss
        return torch.mean((target_features - predicted_features) ** 2)
```
### Why RND is Elegant
1. **No Environment Model**: Doesn't need to model dynamics (unlike ICM)
2. **Robust to Randomness**: Random network isn't trying to predict anything real, so environment noise doesn't fool it
3. **Simple**: Just predict random features
4. **Fast**: Train only predictor (target frozen)
### RND vs ICM Comparison
| Aspect | ICM | RND |
|--------|-----|-----|
| Networks | Forward + Inverse | Target (frozen) + Predictor |
| Learns | Environment dynamics | Random feature prediction |
| Robust to noise | No (breaks with stochastic envs) | Yes (random target immune) |
| Complexity | High (3+ networks, 2 losses) | Medium (2 networks, 1 loss) |
| Computation | 2-3× base agent | 1.5-2× base agent |
| When to use | Dense features, clean env | Sparse rewards, noisy env |
### RND Pitfall: Training Instability
```python
# WRONG: High learning rate, large reward scale
rnd_loss = rnd.predictor_loss(state)
optimizer.zero_grad()
rnd_loss.backward()
optimizer.step() # ← high learning rate causes divergence
# CORRECT: Careful hyperparameter tuning
from torch.optim import Adam

rnd_lr = 1e-4  # Much smaller than the main agent's learning rate
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=rnd_lr)

# Scale intrinsic reward appropriately
intrinsic_reward = rnd.compute_intrinsic_reward(state, scale=0.01)
```
**Symptom:** RND rewards explode, agent overfits to novelty
**Fix:** Lower learning rate for RND, scale intrinsic rewards carefully
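One further mitigation is to normalize intrinsic rewards by a running estimate of their standard deviation so their scale stays stable as the predictor improves (the RND paper does something similar using intrinsic returns). A minimal sketch using a simple Welford-style running variance; the epsilon constant is an illustrative choice:
```python
class RunningRewardNormalizer:
    """Track a running std of intrinsic rewards and divide by it."""
    def __init__(self, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.epsilon = epsilon

    def update(self, value):
        # Welford's online update for mean and variance
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def normalize(self, value):
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return value / (std + self.epsilon)

# Usage: keep the raw novelty signal, but feed the agent a normalized version
normalizer = RunningRewardNormalizer()
raw = float(rnd.compute_intrinsic_reward(state))
normalizer.update(raw)
r_intrinsic = normalizer.normalize(raw)
```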
## Part 6: Count-Based Exploration
### State Visitation Counts
For **discrete/tabular** environments, track how many times each state visited:
```python
from collections import defaultdict
import numpy as np

class CountBasedExploration:
    """
    Count-based exploration: encourage visiting rarely-seen states.

    Works for:
    ✓ Tabular (small discrete state space)
    ✓ Gridworlds, simple games

    Doesn't work for:
    ✗ Continuous spaces
    ✗ Image observations (never see same image twice)
    ✗ Large state spaces
    """
    def __init__(self):
        self.state_counts = defaultdict(int)

    def compute_intrinsic_reward(self, state, reward_scale=1.0):
        """
        Intrinsic reward inversely proportional to state visitation.

        intrinsic_reward = reward_scale / sqrt(N(s))

        Rarely visited states (small N) → high intrinsic reward
        Frequently visited states (large N) → low intrinsic reward
        """
        count = max(self.state_counts[state], 1)  # Avoid division by zero
        return reward_scale / np.sqrt(count)

    def update_counts(self, state):
        """Increment visitation count for state."""
        self.state_counts[state] += 1
```
### Example: Gridworld with Sparse Reward
```python
# Gridworld: 10×10 grid, reward at (9, 9), start at (0, 0)
# Without exploration: random walking takes exponential time
# With count-based: directed toward unexplored cells

# Pseudocode (assumes tabular q_values, epsilon_greedy, and a count_explorer):
intrinsic_weight = 0.1  # λ: weight on the exploration bonus

for episode in range(episodes):
    state = env.reset()
    for step in range(max_steps):
        # Act with ε-greedy on current Q-values
        action = epsilon_greedy(q_values[state], epsilon)
        next_state, env_reward = env.step(action)

        # Exploration bonus for the state we just reached
        intrinsic_reward = count_explorer.compute_intrinsic_reward(next_state)

        # Combine with task reward
        combined_reward = env_reward + intrinsic_weight * intrinsic_reward

        # Q-learning update with combined reward
        q_values[state][action] += alpha * (
            combined_reward + gamma * max(q_values[next_state]) - q_values[state][action]
        )

        # Update counts
        count_explorer.update_counts(next_state)
        state = next_state
```
### Critical Limitation: Doesn't Scale
```python
# Works: Small state space
state_space_size = 100  # 10×10 grid
# Can track counts for all states

# Fails: Large/continuous state space
state_space_size = 10**18  # e.g. image observations
# Can't track visitation counts for 10^18 unique states!
```
## Part 7: When Exploration is Critical
### Decision Framework
**Exploration matters when:**
1. **Sparse Rewards** (rewards rare, hard to find)
- Examples: Montezuma's Revenge, goal-conditioned tasks, real robotics
- No dense reward signal to guide learning
- Agent must explore to find any reward
- Solution: Intrinsic motivation (curiosity, RND)
2. **Large State Spaces** (too many possible states)
- Examples: Image-based RL, continuous control
- Random exploration covers infinitesimal fraction
- Systematic exploration essential
- Solution: Curiosity-driven or RND
3. **Long Horizons** (many steps before reward)
- Examples: Multi-goal tasks, planning problems
- Temporal credit assignment hard
- Need to explore systematically to connect actions to delayed rewards
- Solution: Sophisticated exploration strategy
4. **Deceptive Reward Landscape** (local optima common)
- Examples: Multiple solutions, trade-offs
- Easy to get stuck in suboptimal policy
- Exploration helps escape local optima
- Solution: Slow decay schedule, maintain exploration
### Decision Framework (Quick Check)
```
Do you have SPARSE rewards?
YES → Use intrinsic motivation (curiosity, RND)
NO → Continue
Is state space large (images, continuous)?
YES → Use curiosity-driven or RND
NO → Continue
Is exploration reasonably efficient with ε-greedy?
YES → Use ε-greedy + appropriate decay schedule
NO → Use curiosity-driven or RND
```
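The same quick check, expressed as a small helper; the boolean inputs (and where you draw the line for "sparse" or "large") are judgment calls, not fixed rules:
```python
def recommend_exploration(rewards_are_sparse, state_space_is_large,
                          epsilon_greedy_working=False):
    """Mirror the quick-check decision tree above."""
    if rewards_are_sparse:
        return "intrinsic motivation (RND, or ICM if the env is deterministic)"
    if state_space_is_large:
        return "curiosity-driven or RND"
    if epsilon_greedy_working:
        return "epsilon-greedy with an appropriate decay schedule"
    return "curiosity-driven or RND"
```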
### Example: Reward Structure Analysis
```python
def analyze_reward_structure(rewards):
    """Determine whether an exploration strategy is needed."""
    # Check sparsity
    nonzero_rewards = np.count_nonzero(rewards)
    sparsity = 1 - (nonzero_rewards / len(rewards))
    if sparsity > 0.95:
        print("SPARSE REWARDS detected")
        print("  → Use: Intrinsic motivation (RND or curiosity)")
        print("  → Why: Reward signal too rare to guide learning")

    # Check reward magnitude
    reward_std = np.std(rewards)
    if reward_std < 0.1:
        print("WEAK/NOISY REWARDS detected")
        print("  → Use: Intrinsic motivation")
        print("  → Why: Reward signal insufficient to learn from")

    # Check episode length (proxy for credit-assignment difficulty)
    episode_length = len(rewards)
    if episode_length > 1000:
        print("LONG HORIZONS detected")
        print("  → Use: Strong exploration decay or intrinsic motivation")
        print("  → Why: Temporal credit assignment difficult")
```
## Part 8: Combining Exploration with Task Rewards
### Combining Intrinsic and Extrinsic Rewards
When using intrinsic motivation, balance with task reward:
```python
def combine_rewards(extrinsic_reward, intrinsic_reward,
                    intrinsic_scale=0.01):
    """
    Combine extrinsic (task) and intrinsic (curiosity) rewards.

    r_total = r_extrinsic + λ * r_intrinsic

    λ controls the tradeoff:
    - λ = 0: Ignore intrinsic reward (no exploration bonus)
    - λ = 0.01: Curiosity helps, task reward primary (typical)
    - λ = 0.1: Curiosity significant
    - λ = 1.0: Curiosity dominates (might ignore task)
    """
    return extrinsic_reward + intrinsic_scale * intrinsic_reward
```
### Challenges: Reward Hacking
```python
# PROBLEM: Intrinsic reward encourages anything novel
# Even if novel thing is useless for task
# Example: Atari with RND
# If game has pixel randomness, RND rewards exploring random pixels
# Instead of exploring to find coins/power-ups
# SOLUTION: Scale intrinsic reward carefully
# Make it significant but not dominant
# SOLUTION 2: Curriculum learning
# Start with high intrinsic reward (discover environment)
# Gradually reduce as agent finds reward signals
```
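A minimal sketch of the curriculum idea above: start with a relatively large intrinsic weight λ and anneal it toward a small floor, analogous to ε-decay (the start/end values are illustrative, and `combine_rewards` is the helper defined earlier):
```python
def intrinsic_scale_schedule(step, total_steps, scale_start=0.1, scale_end=0.001):
    """Linearly anneal the intrinsic reward weight λ over training."""
    t = min(step, total_steps)
    return scale_start - (scale_start - scale_end) * t / total_steps

# Usage inside the training loop
lam = intrinsic_scale_schedule(step, total_steps)
r_total = combine_rewards(r_task, r_intrinsic, intrinsic_scale=lam)
```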
### Intrinsic Reward Scale Tuning
```python
# Quick tuning procedure (sketch):
for intrinsic_scale in [0.001, 0.01, 0.1, 1.0]:
    agent = RL_Agent(intrinsic_reward_scale=intrinsic_scale)
    for episode in episodes:
        performance = train_episode(agent)
    print(f"Scale={intrinsic_scale}: Performance={performance}")

# Find the scale where the agent learns the task well AND explores
# Usually 0.01-0.1 is the sweet spot
```
## Part 9: Common Pitfalls and Debugging
### Pitfall 1: Epsilon Decay Too Fast
**Symptom:** Agent plateaus at poor performance early in training
**Root Cause:** Epsilon decays to near-zero before agent finds good actions
```python
# WRONG: Decays away within 10k steps
epsilon_final = 0.0
epsilon_decay = 0.999  # Per-step multiplicative decay
# After 10k steps: ε ≈ 0.999^10000 ≈ 0.00005, almost no exploration left

# CORRECT: Decay over the full training run
total_training_steps = 1_000_000
epsilon_linear(step, total_training_steps,
               epsilon_start=1.0, epsilon_end=0.01)
```
**Diagnosis:**
- Plot epsilon over training: does it reach 0 too early?
- Check if performance improves after epsilon reaches low values
**Fix:**
- Use longer decay (more steps)
- Use higher epsilon_end (never go to pure exploitation)
### Pitfall 2: Intrinsic Reward Too Strong
**Symptom:** Agent explores forever, ignores task reward
**Root Cause:** Intrinsic reward scale too high
```python
# WRONG: Intrinsic reward dominates
r_total = r_task + 1.0 * r_intrinsic
# Agent optimizes novelty, ignores task
# CORRECT: Intrinsic reward is small bonus
r_total = r_task + 0.01 * r_intrinsic
# Task reward primary, intrinsic helps exploration
```
**Diagnosis:**
- Agent explores everywhere but doesn't collect task rewards
- Intrinsic reward signal going to seemingly useless states
**Fix:**
- Reduce intrinsic_reward_scale (try 0.01, 0.001)
- Verify agent eventually starts collecting task rewards
### Pitfall 3: ε-Greedy on Continuous Actions
**Symptom:** Exploration ineffective, agent doesn't learn
**Root Cause:** Random action in continuous space is meaningless
```python
# WRONG: ε-greedy on continuous actions
if np.random.random() < epsilon:
    action = np.random.uniform(-1, 1)   # Random point in action space
else:
    action = network(state)             # Neural network action
# The random action is far from the learned policy, completely unhelpful

# CORRECT: Gaussian noise on the policy's action
action = network(state)
noisy_action = action + np.random.normal(0, exploration_std, size=np.shape(action))
noisy_action = np.clip(noisy_action, -1, 1)
```
**Diagnosis:**
- Continuous action space and using ε-greedy
- Agent not learning effectively
**Fix:**
- Use Gaussian noise: action + N(0, σ)
- Decay exploration_std over time (like epsilon decay)
### Pitfall 4: Forgetting to Decay Exploration
**Symptom:** Training loss decreases but policy doesn't improve, noisy behavior
**Root Cause:** Agent keeps exploring randomly instead of exploiting learned policy
```python
# WRONG: Constant exploration forever
epsilon = 0.3
# CORRECT: Decaying exploration
epsilon = epsilon_linear(step, total_steps)
```
**Diagnosis:**
- No epsilon decay schedule mentioned in code
- Agent behaves randomly even after many training steps
**Fix:**
- Add decay schedule (linear, exponential, polynomial)
### Pitfall 5: Using Exploration at Test Time
**Symptom:** Test performance worse than training, highly variable
**Root Cause:** Applying exploration strategy (ε > 0) at test time
```python
# WRONG: Test with exploration
for test_episode in test_episodes:
    action = epsilon_greedy(q_values, epsilon=0.05)  # Agent still explores at test time!

# CORRECT: Test with greedy policy
for test_episode in test_episodes:
    action = np.argmax(q_values)  # Deterministic, no exploration
```
**Diagnosis:**
- Test performance has high variance
- Test performance < training performance (exploration hurts)
**Fix:**
- At test time, use greedy/deterministic policy
- No ε-greedy, no Boltzmann, no exploration noise
### Pitfall 6: RND Predictor Overfitting
**Symptom:** RND loss decreases but intrinsic rewards still large everywhere
**Root Cause:** Predictor overfits to training data, doesn't generalize to new states
```python
# WRONG: High learning rate, no regularization
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.001)
rnd_loss.backward()
rnd_optimizer.step()
# Predictor fits perfectly to seen states but doesn't generalize

# CORRECT: Lower learning rate plus weight decay for regularization
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.0001, weight_decay=1e-5)
```
**Diagnosis:**
- RND training loss is low (close to 0)
- But intrinsic rewards still high for most states
- Suggests predictor fitted to training states but not generalizing
**Fix:**
- Reduce RND learning rate
- Add weight decay (L2 regularization)
- Use batch normalization in predictor
### Pitfall 7: Count-Based on Non-Tabular Problems
**Symptom:** Exploration ineffective, agent keeps revisiting similar states
**Root Cause:** State counting doesn't work for continuous/image spaces
```python
# WRONG: Counting state IDs in image-based RL
state = env.render(mode='rgb_array') # 84x84 image
state_id = hash(state.tobytes()) # Different hash every time!
count_based_explorer.update_counts(state_id)
# Every frame is "new" because of slight pixel differences
# State counting broken
# CORRECT: Use RND or curiosity instead
rnd = RandomNetworkDistillation(state_dim)
# RND handles high-dimensional states
```
**Diagnosis:**
- Using count-based exploration with images/continuous observations
- Exploration not working effectively
**Fix:**
- Switch to RND or curiosity-driven methods
- Count-based only for small discrete state spaces
## Part 10: Red Flags and Pressure Tests
### Red Flags Checklist
- [ ] **Constant epsilon (no decay)**: Agent explores forever, can't exploit
- [ ] **Epsilon decays to zero (pure exploitation)**: Agent stuck in local optimum
- [ ] **ε-greedy on continuous actions**: Exploration unguided, learning slow
- [ ] **Intrinsic reward >> task reward**: Agent ignores task
- [ ] **No exploration specified**: Might be using default (often insufficient)
- [ ] **Using RND in random/stochastic environment**: Breaks because predicting noise
- [ ] **Count-based on images**: Can't count unique images, worthless
- [ ] **Exploration at test time**: Test performance degrades from exploration noise
- [ ] **ICM on pixel-randomness environment**: Forward model tries to predict noise
- [ ] **Ignoring sparse rewards**: Using ε-greedy only, no intrinsic motivation
### Pressure Test Scenario 1: "My Agent Stuck in Local Optimum"
**User Query:** "I trained my agent and it found a solution giving +10 reward. But I know the optimal solution is +100. Agent is stuck at +10. How do I fix it?"
**What Skill Must Provide:**
1. Diagnosis: Likely exploration problem (local optimum)
2. Solution: Increase exploration (slower decay, or intrinsic motivation)
3. Prevention: Design exploration strategy before training
**Skill Response:**
"Your agent found a local optimum (+10) and exploitation is too strong. This is a classic exploration-exploitation problem.
**Quick Fixes:**
1. Check epsilon decay: Is epsilon decaying too fast? Use slower decay schedule.
2. Check epsilon_end: Is epsilon_end = 0? Change to 0.01-0.05.
3. Add intrinsic motivation: Your environment might need curiosity-driven exploration.
**Systematic Approach:**
- If rewards are dense (frequent signals): ε-greedy with slow decay
- If rewards are sparse (rare signals): Add RND or curiosity
- If state space huge: Definitely use RND
**Example Fix:**
```
# Instead of epsilon decaying to 0.01 in 100k steps
# Decay over 1M steps with epsilon_end=0.05
epsilon = epsilon_linear(step, total_steps=1_000_000,
epsilon_start=1.0, epsilon_end=0.05)
```"
### Pressure Test Scenario 2: "Which Exploration for Montezuma's Revenge?"
**User Query:** "I'm training on Montezuma's Revenge (classic sparse reward benchmark). ε-Greedy isn't working (agent never finds first coin). What exploration strategy should I use?"
**What Skill Must Provide:**
1. Recognition: This is THE benchmark for intrinsic motivation
2. Solution: RND is standard, ICM is alternative
3. Implementation: Code/guidance for using RND
**Skill Response:**
"Montezuma's Revenge is the canonical example where ε-greedy fails. The first coin is at step 500+ with no reward signal. Random exploration won't find it.
**Recommended:** RND (Random Network Distillation)
Why RND:
- Robust to stochasticity (doesn't try to predict environment)
- Simpler than ICM (no environment model needed)
- Proven to work on Montezuma's Revenge
**Implementation:**
```python
rnd = RandomNetworkDistillation(state_dim=...)
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=1e-4)
lambda_intrinsic = 0.01  # Scale on the intrinsic reward

state = env.reset()
for step in range(training_steps):
    # Act in the environment
    action = agent.select_action(state)
    next_state, r_task = env.step(action)

    # Intrinsic reward for the newly reached state
    r_intrinsic = float(rnd.compute_intrinsic_reward(next_state))

    # Combine with task reward; train the agent on r_total
    r_total = r_task + lambda_intrinsic * r_intrinsic

    # Update RND predictor on observed states
    rnd_loss = rnd.predictor_loss(next_state)
    rnd_optimizer.zero_grad()
    rnd_loss.backward()
    rnd_optimizer.step()

    state = next_state
```
**Alternative:** Curiosity-driven (ICM) also works if environment deterministic.
**What to tune:**
- lambda_intrinsic: Start 0.01, increase if agent doesn't explore enough
- RND learning rate: Usually 1e-4 (much smaller than main agent)"
### Pressure Test Scenario 3: "ε-Greedy Decay Schedule"
**User Query:** "I'm using ε-greedy but not sure about the decay schedule. I have 100k training steps. Should I decay epsilon linearly? Over how many steps? To what final value?"
**What Skill Must Provide:**
1. Formula: Linear, exponential, polynomial decay
2. Guidance: Typical values for different problem types
3. Tuning: How to adjust based on task
**Skill Response:**
"**Quick Answer:**
```python
total_steps = 100_000
epsilon = epsilon_linear(step, total_steps,
epsilon_start=1.0, epsilon_end=0.05)
```
**Reasoning:**
- `epsilon_start=1.0`: Explore fully initially (know nothing)
- `epsilon_end=0.05`: Keep 5% exploration always (helps with stochasticity)
- Linear decay: Simple, works well for most tasks
**Decision Tree:**
1. **How dense are rewards?**
- Sparse: Use slower decay (maybe 0.1 epsilon_end)
- Dense: Use faster decay (0.01 epsilon_end)
2. **How many steps?**
- 100k steps: Decay over 50k-100k (half to full training)
- 1M steps: Decay over 500k-1M
- Rule: Decay over 50-100% of training time
3. **Which decay curve?**
- Linear (ε_linear): Safe default
- Exponential (ε_exponential): Faster early decay
- Polynomial with power=2: Balanced
**Example Tuning:**
```python
# Sparse rewards: More exploration needed
epsilon = epsilon_linear(step, total_steps=100_000,
epsilon_start=1.0, epsilon_end=0.1)
# Very sparse rewards: Even more
epsilon = epsilon_linear(step, total_steps=100_000,
epsilon_start=1.0, epsilon_end=0.2)
# Dense rewards: Faster decay to exploitation
epsilon = epsilon_linear(step, total_steps=50_000,
epsilon_start=1.0, epsilon_end=0.01)
```"
### Pressure Test Scenario 4: "ICM vs RND vs ε-Greedy"
**User Query:** "I have sparse rewards and a deterministic environment. Should I use ε-greedy, ICM curiosity, or RND? How do I decide?"
**What Skill Must Provide:**
1. Comparison table: Pros/cons of each
2. Decision framework: Which for which situation
3. Implementation difficulty: Is it worth the complexity?
**Skill Response:**
"**Comparison:**
| Method | Complexity | Robustness | When to Use |
|--------|-----------|-----------|------------|
| ε-Greedy | Very Simple | Good on dense rewards | Dense rewards, simple env |
| ICM | Complex (2 models) | Poor on stochastic | Sparse, deterministic, learning important |
| RND | Medium (1 model) | Excellent | Sparse, any environment |
**Decision Framework:**
1. **Sparse rewards + Deterministic + Lots of learning:**
Try ICM first (forward model learns useful features)
2. **Sparse rewards + Any environment:**
Use RND (robust, simpler)
3. **Sparse rewards + Large pixel space:**
RND (better for high-dimensional observations)
4. **Dense rewards:**
Just use ε-greedy (intrinsic motivation not needed)
5. **Not sure, want safe choice:**
Use RND (works almost everywhere)
**Implementation Difficulty:**
- ε-Greedy: 5 lines of code
- ICM: 100+ lines (forward + inverse models, 2 losses)
- RND: 50 lines (target + predictor)
**My Recommendation:**
Start ε-greedy. If not working after tuning decay schedules, switch to RND. Only use ICM if environment is fully deterministic and you care about learning dynamics."
### Pressure Test Scenario 5: "Computational Cost of Curiosity"
**User Query:** "I want to add curiosity-driven exploration (ICM) to my agent. What's the computational overhead? Will it slow down training significantly?"
**What Skill Must Provide:**
1. Overhead estimate: Exact computation cost
2. Tradeoff: When overhead is justified
3. Alternatives: Lighter-weight options
**Skill Response:**
"**Computational Overhead Estimate:**
Base Agent:
- 1 policy/value network
- Training time: 1.0×
With ICM:
- Policy/value network + Forward model + Inverse model
- Forward model: Encoder + hidden layers + state-dim output
- Inverse model: Concatenate states + hidden + action-dim output
- Training time: ~2.5-3.0×
**Memory:**
- Base: X MB
- With ICM: ~3-4× X MB
- GPU memory often bottleneck
**When Overhead is Justified:**
- Sparse rewards (ICM critical to success)
- Large state space (intrinsic motivation helps)
- Willing to wait longer for better exploration
**When Overhead is NOT Justified:**
- Dense rewards (ε-greedy sufficient)
- Real-time training constraints
- Limited GPU memory
**Lighter Alternative:**
Use RND instead of ICM:
- ~1.5-2.0× overhead (vs 2.5-3.0× for ICM)
- Same exploration benefits
- Simpler to implement
**Scaling to Large Models:**
```python
# ICM with huge state encoders can be prohibitive
# Example: Vision transformer encoder → ICM
# That's very expensive
# RND scales better: predictor can be small
# Don't need sophisticated encoder
```
**Bottom Line:**
ICM costs 2-3× training time. If you can afford it and rewards are very sparse, worth it. Otherwise try RND or even ε-greedy with slower decay first."
## Part 11: Rationalization Resistance Table
| Rationalization | Reality | Counter-Guidance | Red Flag |
|-----------------|---------|------------------|----------|
| "ε-Greedy works everywhere" | Fails on sparse rewards, large spaces | Use ε-greedy for dense/small, intrinsic motivation for sparse/large | Applying ε-greedy to Montezuma's Revenge |
| "Higher epsilon is better" | High ε → too random, doesn't exploit | Use decay schedule (ε high early, low late) | Using constant ε=0.5 throughout training |
| "Decay epsilon to zero" | Agent needs residual exploration | Keep ε_end=0.01-0.1 always | Setting ε_final=0 (pure exploitation) |
| "Curiosity always helps" | Can break with stochasticity (model tries to predict noise) | Use RND for stochastic, ICM for deterministic | Agent learns to explore random noise instead of task |
| "RND is just ICM simplified" | RND is fundamentally different (frozen random vs learned model) | Understand frozen network prevents overfitting/noise | Not grasping why RND frozen network matters |
| "More intrinsic reward = faster exploration" | Too much intrinsic reward drowns out task signal | Balance with λ=0.01-0.1, tune on task performance | Agent explores forever, ignores task |
| "Count-based works anywhere" | Only works tabular (can't count unique images) | Use RND for continuous/high-dimensional spaces | Trying count-based on Atari images |
| "Boltzmann is always better than ε-greedy" | Boltzmann smoother but harder to tune | Use ε-greedy for simplicity (it works well) | Switching to Boltzmann without clear benefit |
| "Test with ε>0 for exploration" | Test should use learned policy, not explore | ε=0 or greedy policy at test time | Variable test performance from exploration |
| "Longer decay is always better" | Very slow decay wastes time in early training | Match decay to task difficulty (faster for easy, slower for hard) | Decaying over 10M steps when training only 1M |
| "Skip exploration, increase learning rate" | Learning rate is for optimization, exploration for coverage | Use both: exploration strategy + learning rate | Agent oscillates without exploration |
| "ICM is the SOTA exploration" | RND simpler and more robust | Use RND unless you need environment model | Implementing ICM when RND would suffice |
## Part 12: Summary and Decision Framework
### Quick Decision Tree
```
START: Need exploration strategy?
├─ Are rewards sparse? (rare reward signal)
│ ├─ YES → Need intrinsic motivation
│ │ ├─ Environment stochastic?
│ │ │ ├─ YES → RND
│ │ │ └─ NO → ICM (or RND for simplicity)
│ │ └─ Choose RND for safety
│ │
│ └─ NO → Dense rewards
│ └─ Use ε-greedy + decay schedule
├─ Is state space large? (images, continuous)
│ ├─ YES → Intrinsic motivation (RND/curiosity)
│ └─ NO → ε-greedy usually sufficient
└─ Choosing decay schedule:
├─ Sparse rewards → slower decay (ε_end=0.05-0.1)
├─ Dense rewards → faster decay (ε_end=0.01)
└─ Default: Linear decay over 50% of training
```
### Implementation Checklist
- [ ] Define reward structure (dense vs sparse)
- [ ] Estimate state space size (discrete vs continuous)
- [ ] Choose exploration method (ε-greedy, curiosity, RND, UCB, count-based)
- [ ] Set epsilon/temperature parameters (start, end)
- [ ] Choose decay schedule (linear, exponential, polynomial)
- [ ] If using intrinsic motivation: set λ (usually 0.01)
- [ ] Use greedy policy at test time (ε=0)
- [ ] Monitor exploration vs exploitation (plot epsilon decay)
- [ ] Tune hyperparameters (decay schedule, λ) based on task performance
### Typical Configurations
**Dense Rewards, Small Action Space (e.g., simple game)**
```python
epsilon = epsilon_linear(step, total_steps=100_000,
epsilon_start=1.0, epsilon_end=0.01)
# Fast exploitation, low exploration needed
```
**Sparse Rewards, Discrete Actions (e.g., Atari)**
```python
rnd = RandomNetworkDistillation(...)
epsilon = epsilon_linear(step, total_steps=1_000_000,
epsilon_start=1.0, epsilon_end=0.05)
r_total = r_task + 0.01 * r_intrinsic
# Intrinsic motivation + slow decay
```
**Continuous Control, Sparse (e.g., Robotics)**
```python
rnd = RandomNetworkDistillation(...)
action = policy(state) + gaussian_noise(std=exploration_std)
exploration_std = exploration_std_linear(..., std_end=0.01)
r_total = r_task + 0.01 * r_intrinsic
# Gaussian noise + RND
```
## Key Takeaways
1. **Exploration is fundamental**: Don't ignore it. Design exploration strategy before training.
2. **Match method to problem**:
- Dense rewards → ε-greedy
- Sparse rewards → Intrinsic motivation (RND preferred)
- Large state space → Intrinsic motivation
3. **Decay exploration over time**: Explore early, exploit late.
4. **Avoid common pitfalls**:
- Don't decay to zero (ε_end > 0)
- Don't use ε-greedy on continuous actions
- Don't forget decay schedule
- Don't use exploration at test time
5. **Balance intrinsic and extrinsic**: If using intrinsic rewards, don't let them dominate.
6. **RND is the safe choice**: Works for most exploration problems, simpler than ICM.
7. **Test exploration hypothesis**: Plot epsilon or intrinsic rewards, verify exploration strategy is active.
This skill is about **systematic exploration design**, not just tuning one hyperparameter.