# Reward-Shaping Engineering

## When to Use This Skill

Invoke this skill when you encounter:

- **Reward Design**: "How do I design the reward function for my task?"
- **Slow Learning**: "Training is extremely slow with sparse rewards" or "Dense rewards cause weird behavior"
- **Reward Hacking**: "My agent learned a trick that works in training but fails on test", "Agent oscillating instead of balancing"
- **Potential-Based Shaping**: "How to add shaping without breaking the optimal policy?"
- **Distance-Based Rewards**: "How to reward progress toward goal without hacking?"
- **Inverse RL**: "I have expert demonstrations, can I learn reward from them?"
- **Auxiliary Rewards**: "Should I add helper rewards like action smoothness?"
- **Reward Scale Issues**: "Training feels different when rewards change magnitude"
- **Sparse vs Dense**: "When should I use sparse vs dense rewards?"
- **Reward Validation**: "How do I verify my reward function is correct?"
- **Credit Assignment**: "How to help agent understand which actions led to success?"
- **Normalization**: "Should I clip or normalize rewards?"

**This skill provides systematic frameworks and concrete patterns for reward engineering.**

Do NOT use for:

- Algorithm selection (route to rl-foundations or specific algorithm skill)
- General RL debugging (route to rl-debugging-methodology)
- Exploration strategies (route to exploration-strategies)
- Environment design (route to environment-design-patterns)

## Core Principle

**Reward design is often the hardest part of RL.** The reward function defines the entire objective the agent optimizes. Under a poorly designed reward, the agent either:

1. Learns something unintended (reward hacking)
2. Learns slowly due to a sparse or noisy signal (credit assignment crisis)
3. Learns the right behavior but unstably, due to scale/normalization issues

The key insight: **you're solving an inverse problem.** You want an agent that achieves behavior X, so you need to specify a function R(s, a, s') such that the optimal policy under R produces behavior X. This is much harder than it sounds because:

- Agents optimize expected return, not your intentions (they find loopholes)
- Credit assignment requires a clear reward signal (sparse rewards make it hard)
- Scale and normalization matter (reward magnitude affects gradient magnitude)
- Shaping can help or hurt (it must preserve the optimal policy)

## Part 1: Reward Design Principles

### Principle 1: Reward Must Align With Task

**The Problem**: You want the agent to do X, but the reward incentivizes Y.

**Example (CartPole)**:

- Task: Balance the pole near the center for as long as possible
- Bad reward: +1 per step survived → the agent learns to oscillate side to side (unintended, but it still collects +1 every step)
- Better reward: +1 per step plus a penalty for deviation from center

**Example (Robotics)**:

- Task: Grasp an object efficiently
- Bad reward: just +1 when grasped → the agent grasps sloppily with jerky movements
- Better reward: +1 for a successful grasp plus a small per-action penalty (rewards efficiency)

**Pattern**: Specify WHAT success looks like, not HOW to achieve it. Let the agent find the HOW.

```python
import numpy as np

# Anti-pattern: specify HOW
bad_reward = -0.1 * np.sum(np.abs(action))  # Penalizes all movement, says nothing about success

# Pattern: specify WHAT
good_reward = (1.0 if grasp_success else 0.0) + (-0.01 * np.sum(action**2))
# Says: success is good, movements have a small cost.
# The agent figures out efficient movements to minimize the action cost.
```

### Principle 2: Reward Should Enable Credit Assignment

**The Problem**: Sparse rewards mean the agent can't tell which actions led to success.

**Example (Goal Navigation)**:

- Sparse: only +1 on reaching the goal (which early in training might happen once in 1000 episodes)
- The agent can't tell whether the action 10 steps ago or the action 5 steps ago helped
- Solution: add a shaping reward based on progress

**Credit Assignment Window**:

```
Short window (< 10 steps): need dense rewards every 1-2 steps
Medium window (10-100 steps): a reward every 5-10 steps is OK
Long window (> 100 steps): sparse rewards are very hard; you need shaping
```

**When to Add Shaping**:

- Episode length > 50 steps AND rewards are sparse
- The agent can't achieve >10% success even after substantial exploration

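As rough intuition for why long credit-assignment windows hurt, the short sketch below (not from the original text; it simply evaluates γ = 0.99 at a few delays) shows how quickly discounting attenuates the credit assigned to a reward received many steps after the action:

```python
# Back-of-the-envelope: weight of a reward received `delay` steps after an action
gamma = 0.99
for delay in (10, 100, 500):
    print(delay, round(gamma ** delay, 4))
# 10 -> 0.9044, 100 -> 0.366, 500 -> 0.0066: distant rewards carry very little signal
```
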
### Principle 3: Reward Should Prevent Hacking

**The Problem**: The agent finds unintended loopholes.

**Classic Hacking Patterns**:

1. **Shortcut Exploitation**: Taking an unintended path to the goal
   - Example: a quadruped learns to flip instead of walking
   - Solution: specify movement requirements in the reward

2. **Side-Effect Exploitation**: Achieving a side effect that yields reward
   - Example: a robotic arm oscillates because it still collects the per-step reward
   - Solution: add penalties for suspicious behavior

3. **Scale Exploitation**: Abusing an unbounded reward term
   - Example: the agent learns to make the reward signal spike, then oscillates around that spike
   - Solution: use clipped/normalized rewards

**Prevention Framework**:

```python
def design_robust_reward(s, a, s_next):
    # Core task reward
    task_reward = compute_task_reward(s_next)

    # Anti-hacking penalties
    action_penalty = -0.01 * np.sum(a**2)          # Penalize unnecessary action
    suspension_penalty = check_suspension(s_next)  # Penalize weird postures

    return task_reward + action_penalty + suspension_penalty
```

### Principle 4: Reward Scale and Normalization Matter

**The Problem**: Reward magnitude affects gradient flow.

**Example**:

```
Task A rewards: 0 to 1000
Task B rewards: 0 to 1

Same optimizer with a fixed learning rate:
  Task A: step sizes huge, diverges
  Task B: step sizes tiny, barely learns

Solution: normalize both to [-1, 1]
```

**Standard Normalization Pipeline**:

```python
def normalize_reward(r):
    # 1. Clip to a reasonable range (prevents scale explosions)
    r_clipped = np.clip(r, -1000, 1000)

    # 2. Normalize using running statistics
    reward_mean = running_mean(r_clipped)
    reward_std = running_std(r_clipped)
    r_normalized = (r_clipped - reward_mean) / (reward_std + 1e-8)

    # 3. Clip again to [-1, 1] for stability
    return np.clip(r_normalized, -1.0, 1.0)
```

## Part 2: Potential-Based Shaping (The Theorem)

### The Fundamental Problem

You want to:

- Help the agent learn faster (add shaping rewards)
- Preserve the optimal policy (so shaping doesn't change what's best)

**The Solution: Potential-Based Shaping**

The theorem (Ng, Harada, and Russell, 1999) states: if you add a shaping reward of the form

```
F(s, a, s') = γ * Φ(s') - Φ(s)
```

where Φ(s) is ANY function of state, then:

1. The optimal policy remains unchanged
2. The optimal value function is shifted by -Φ (i.e., V'(s) = V(s) - Φ(s))
3. Learning often accelerates because the signal is denser (this part is empirical, not guaranteed by the theorem)

**Why This Matters**: You can safely add shaping terms like distance-to-goal without worrying that you're changing what the agent should do.

### Mathematical Foundation

The original MDP has Q-function: `Q^π(s,a) = E[R(s,a,s') + γV^π(s')]`

With potential-based shaping, the shaped return telescopes:

```
Q'^π(s,a) = E[ Σ_t γ^t ( r_t + γΦ(s_{t+1}) - Φ(s_t) ) ]
          = E[ Σ_t γ^t r_t ] - Φ(s)        (the Φ terms telescope)
          = Q^π(s,a) - Φ(s)
```

The key insight: Φ(s) acts as a state-dependent offset on the value function. Because the offset depends only on the current state, every action at s is shifted by the same amount, so the relative ordering of actions (which one is best) is unchanged.

**Proof Sketch**:

- The policy compares Q(s,a₁) vs Q(s,a₂) to pick an action
- Both are shifted by the same amount, -Φ(s), which depends only on s
- Relative ordering preserved → same optimal action

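To make the policy-invariance claim concrete, here is a minimal numerical check on a hypothetical 5-state chain MDP (the environment, the potential Φ(s) = negative distance to goal, and all names below are illustrative assumptions, not part of this skill's API). It value-iterates Q for the original and the shaped reward and confirms that the greedy policy matches and that Q'(s,a) = Q(s,a) - Φ(s):

```python
import numpy as np

n_states, gamma = 5, 0.9
goal = n_states - 1
actions = (-1, +1)  # move left, move right

def step(s, a):
    s2 = min(max(s + a, 0), n_states - 1)
    return s2, (1.0 if s2 == goal else 0.0), s2 == goal

def phi(s):
    return -abs(goal - s)  # potential = negative distance to goal (0 at the goal)

def q_iteration(shaped, iters=500):
    Q = np.zeros((n_states, len(actions)))
    for _ in range(iters):
        new_Q = np.zeros_like(Q)
        for s in range(n_states):
            for i, a in enumerate(actions):
                s2, r, done = step(s, a)
                if shaped:
                    r += gamma * phi(s2) - phi(s)  # potential-based shaping
                new_Q[s, i] = r + (0.0 if done else gamma * Q[s2].max())
        Q = new_Q
    return Q

Q = q_iteration(shaped=False)
Q_shaped = q_iteration(shaped=True)
offsets = np.array([phi(s) for s in range(n_states)])[:, None]

assert (Q.argmax(axis=1) == Q_shaped.argmax(axis=1)).all()  # same greedy policy
assert np.allclose(Q_shaped, Q - offsets)                   # Q' = Q - Phi(s)
print("Greedy policy unchanged; Q shifted by -Phi(s), as the theorem predicts.")
```
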
### Practical Implementation

```python
def potential_based_shaping(s, a, s_next, gamma=0.99):
    """
    Compute shaping reward that preserves the optimal policy.

    Args:
        s: current state
        a: action taken
        s_next: next state (result of action)
        gamma: discount factor

    Returns:
        Shaping reward to ADD to the environment reward
    """
    # Define potential function (e.g., negative distance to goal)
    phi = compute_potential(s)
    phi_next = compute_potential(s_next)

    # Potential-based shaping formula
    shaping_reward = gamma * phi_next - phi

    return shaping_reward


def compute_potential(s):
    """
    Potential function: usually based on distance to the goal.

    Negative distance works well:
    - States farther from the goal have lower potential
    - Moving closer increases potential (positive shaping reward)
    - Reaching the goal gives the highest potential
    """
    if goal_reached(s):
        return 0.0  # Peak potential
    else:
        distance = euclidean_distance(s['position'], s['goal'])
        return -distance  # Negative distance
```

### Critical Error: NOT Using Potential-Based Shaping

**Common Mistake**:

```python
# WRONG: this can change the optimal policy!
shaping_reward = -0.1 * distance_to_goal

# WHY WRONG: this is a raw per-step penalty, not a potential difference.
# Summed over an episode, it depends on the entire path and on episode length,
# so it does not cancel out. The agent may now prefer policies that reduce the
# accumulated penalty (e.g., ending episodes early or hovering near the goal)
# over policies that actually maximize the task reward.
```

**Right Way**:

```python
# CORRECT: potential-based shaping
def shaping(s, a, s_next, gamma=0.99):
    phi_s = -distance(s, goal)            # potential = negative distance
    phi_s_next = -distance(s_next, goal)

    return gamma * phi_s_next - phi_s

# Moving from d=1.0 to d=0.5: shaping = 0.99 * (-0.5) - (-1.0) = +0.505
# Moving from d=3.0 to d=2.5: shaping = 0.99 * (-2.5) - (-3.0) = +0.525
# Summed over any trajectory, these increments telescope to
# gamma^T * Phi(s_T) - Phi(s_0): a constant offset at the start state,
# so the ranking of actions (and policies) is preserved.
```

### Using Potential-Based Shaping

```python
def compute_total_reward(s, a, s_next, env_reward, gamma=0.99):
    """
    Combine environment reward with potential-based shaping.

    Pattern: R_total = R_env + R_shaping
    """
    # 1. Get reward from environment
    task_reward = env_reward

    # 2. Compute potential-based shaping (safe to add)
    potential = -distance_to_goal(s_next)
    potential_prev = -distance_to_goal(s)
    shaping_reward = gamma * potential - potential_prev

    # 3. Combine
    total_reward = task_reward + shaping_reward

    return total_reward
```

## Part 3: Sparse vs Dense Rewards

| Aspect | Sparse Rewards | Dense Rewards |
|--------|----------------|---------------|
| **Credit Assignment** | Hard (credit window huge) | Easy (immediate feedback) |
| **Learning Speed** | Slow (few positive examples) | Fast (constant signal) |
| **Reward Hacking** | Less likely (fewer targets) | More likely (many targets to exploit) |
| **Convergence** | May stall if the reward is never found | May converge to a hacked solution |
| **Real-World Fit** | Matches reality (goals are sparse) | Artificial, but helps learning |

### Decision Framework

**Use SPARSE when**:

- The task naturally has sparse rewards (goal-reaching, game win/loss)
- Episodes are short (< 20 steps)
- You want a solution robust to reward hacking
- Final performance matters more than learning speed

**Use DENSE when**:

- Episodes are long (> 50 steps) with no natural sub-goals
- Learning speed is critical (limited training budget)
- You can design safe auxiliary rewards
- You'll validate extensively against hacking

**Use HYBRID when**:

- You can combine a sparse task reward with dense shaping
- Example: +1 for reaching the goal (sparse) + negative-distance shaping (dense)
- This is the most practical approach for long-horizon tasks

### Design Pattern: Sparse Task + Dense Shaping

```python
def reward_function(s, a, s_next, done):
    """
    Standard pattern: sparse task reward + potential-based shaping.

    This gets the best of both worlds:
    - Sparse task reward prevents hacking on the main objective
    - Dense shaping prevents a credit assignment crisis
    """
    # 1. Sparse task reward (what we truly care about)
    if goal_reached(s_next):
        task_reward = 1.0
    else:
        task_reward = 0.0

    # 2. Dense potential-based shaping (helps learning)
    gamma = 0.99
    phi_s = -np.linalg.norm(s['position'] - s['goal'])
    phi_s_next = -np.linalg.norm(s_next['position'] - s_next['goal'])
    shaping_reward = gamma * phi_s_next - phi_s

    # 3. Combine: sparse main objective + dense guidance
    total = task_reward + 0.1 * shaping_reward
    # Scale shaping (0.1) relative to task (1.0) so the task dominates

    return total
```

### Validation: Confirming Sparse/Dense Choice

```python
def validate_reward_choice(sparse_reward_fn, dense_reward_fn, env, n_trials=10):
    """
    Compare sparse vs dense by checking:
    1. Learning speed (how fast does the agent improve?)
    2. Final performance (does dense cause hacking?)
    3. Stability (does one diverge?)
    """
    results = {
        'sparse': train_agent(sparse_reward_fn, env, n_trials),
        'dense': train_agent(dense_reward_fn, env, n_trials)
    }

    # Check learning curves
    print("Sparse learning speed:", results['sparse']['steps_to_50pct'])
    print("Dense learning speed:", results['dense']['steps_to_50pct'])

    # Check if dense causes hacking
    print("Sparse final score:", results['sparse']['final_score'])
    print("Dense final score:", results['dense']['final_score'])

    # If dense learned faster AND achieved the same or higher score: use dense + validation
    # If sparse achieved a higher score: reward hacking likely in the dense variant
```

## Part 4: Reward Hacking - Patterns and Detection

### Common Hacking Patterns

#### Pattern 1: Shortcut Exploitation

The agent finds an unintended path to success.

**Example (Quadruped)**:

- Task: walk forward 10 meters
- Intended: a gait pattern that moves forward
- Hack: the agent learns to flip upside down (its center of mass moves forward during the flip!)

**Detection**:

```python
# Test on distribution shift
if test_on_different_terrain(agent) < 0.5 * train_performance:  # far below training (0.5 is an illustrative threshold)
    print("ALERT: Shortcut exploitation detected")
    print("Agent doesn't generalize → learned a specific trick")
```

**Prevention**:

```python
def robust_reward(s, a, s_next):
    # Forward progress
    progress = s_next['x'] - s['x']

    # Requirement: stay upright (prevents the flipping hack)
    upright_penalty = -1.0 if not is_upright(s_next) else 0.0

    # Requirement: reasonable movement (prevents wiggling)
    movement_penalty = -0.1 * np.sum(a**2)

    return progress + upright_penalty + movement_penalty
```

#### Pattern 2: Reward Signal Exploitation

The agent exploits the reward signal directly rather than solving the task.

**Example (Oscillation)**:

- Task: balance the pole in the center
- Intended: keep the pole balanced
- Hack: the agent oscillates rapidly around the center while still collecting the per-step reward

**Detection**:

```python
def detect_oscillation(trajectory):
    positions = [s['pole_angle'] for s in trajectory]
    # Count zero crossings
    crossings = sum(1 for i in range(len(positions) - 1)
                    if positions[i] * positions[i + 1] < 0)

    if crossings > len(trajectory) / 3:
        print("ALERT: Oscillation detected")
```

**Prevention**:

```python
def non_hackable_reward(s, a, s_next):
    # Task: balanced pole (penalty shrinks to zero when centered)
    balance_penalty = -(s_next['pole_angle']**2)

    # Prevent oscillation: penalize rapid angle changes
    angle_velocity = s_next['pole_angle'] - s['pole_angle']
    oscillation_penalty = -0.1 * abs(angle_velocity)

    return balance_penalty + oscillation_penalty
```

#### Pattern 3: Unbounded Reward Exploitation

The agent maximizes some reward term without bound.

**Example (Camera Hack)**:

- Task: detect an object (reward for correct detection)
- Hack: the agent learns to point the camera lens at a bright light source (which always triggers a detection)

**Detection**:

```python
def detect_unbounded_exploitation(training_history):
    rewards = np.asarray(training_history['episode_returns'])

    # Check if returns keep growing far beyond the early-training level
    # (the factor of 10 is an illustrative threshold)
    if rewards[-100:].mean() > 10 * abs(rewards[100:200].mean()):
        print("ALERT: Rewards diverging")
        print("Possible unbounded exploitation")
```

**Prevention**:

```python
# Use reward clipping
def clipped_reward(r):
    return np.clip(r, -1.0, 1.0)

# Or normalize
def normalized_reward(r, running_mean, running_std):
    r_norm = (r - running_mean) / (running_std + 1e-8)
    return np.clip(r_norm, -1.0, 1.0)
```

### Systematic Hacking Detection Framework

```python
def check_for_hacking(agent, train_env, test_envs, holdout_env):
    """
    Comprehensive hacking detection.
    """
    # 1. Distribution shift test
    train_perf = evaluate(agent, train_env)
    test_perf = np.mean([evaluate(agent, e) for e in test_envs])  # variations of train

    if test_perf < 0.5 * train_perf:  # illustrative threshold for a large train/test gap
        print("HACKING: Agent doesn't generalize to distribution shift")
        return "shortcut_exploitation"

    # 2. Behavioral inspection
    trajectory = run_episode(agent, holdout_env)
    if has_suspicious_pattern(trajectory):
        print("HACKING: Suspicious behavior detected")
        return "pattern_exploitation"

    # 3. Reward curve analysis
    if rewards_diverging(agent.training_history):
        print("HACKING: Unbounded reward exploitation")
        return "reward_signal_exploitation"

    return "no_obvious_hacking"
```

## Part 5: Auxiliary Rewards and Shaping Examples

### Example 1: Distance-Based Shaping

**Most common shaping pattern. Safe when done with the potential-based formula.**

```python
def distance_shaping(s, a, s_next, gamma=0.99):
    """
    Reward the agent for getting closer to the goal.

    CRITICAL: use the potential-based formula to preserve the optimal policy.
    """
    goal_position = s['goal']
    curr_pos = s['position']
    next_pos = s_next['position']

    # Potential function: negative distance
    phi = -np.linalg.norm(curr_pos - goal_position)
    phi_next = -np.linalg.norm(next_pos - goal_position)

    # Potential-based shaping (preserves the optimal policy)
    shaping_reward = gamma * phi_next - phi

    return shaping_reward
```

### Example 2: Auxiliary Smoothness Reward

### Example 2: Auxiliary Smoothness Reward

**Help the agent learn smooth actions; keep the penalty small so it barely perturbs the task objective.**

```python
def smoothness_shaping(a, a_prev):
    """
    Penalize jittery/jerky actions.
    Helps with efficiency and generalization.
    """
    # Difference between consecutive actions
    action_jerk = np.linalg.norm(a - a_prev)

    # Penalty (small, doesn't dominate the task reward)
    smoothness_penalty = -0.01 * action_jerk

    return smoothness_penalty
```

### Example 3: Energy/Control Efficiency

### Example 3: Energy/Control Efficiency

**Encourage efficient control.**

```python
def efficiency_reward(a):
    """
    Penalize excessive control effort.
    Makes solutions more robust.
    """
    # Squared L2 norm of the action (total control magnitude)
    effort = np.sum(a**2)

    # Small penalty
    return -0.001 * effort
```

### Example 4: Staying Safe Reward

### Example 4: Staying-Safe Reward

**Discourage dangerous states (softer than hard constraints).**

```python
def safety_reward(s):
    """
    Soft penalty for dangerous states.
    Often more learnable than hard constraints.
    """
    danger_score = 0.0

    # Example: discourage near-collisions
    min_clearance = np.min(s['collision_distances'])
    if min_clearance < 0.1:
        danger_score += 10.0 * (0.1 - min_clearance)

    # Example: discourage extreme states
    if np.abs(s['position']).max() > 5.0:
        danger_score += 1.0

    return -danger_score
```

### When to Add Auxiliary Rewards

**Add an auxiliary reward if**:

- It's potential-based (safe), or small enough that it can't dominate the task reward
- The task reward already roughly works (agent > 10% success)
- The auxiliary term targets a clear sub-goal
- You validate performance with and without it

**Don't add one if**:

- The task reward doesn't work at all (fix that first)
- It creates new exploitation opportunities
- It makes the reward engineering too complex

A sketch of how these pieces compose is shown below.

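The following is a minimal composition sketch, not a prescribed recipe: the weights are illustrative assumptions, and it assumes the helper functions from the examples above (`distance_shaping`, `smoothness_shaping`, `efficiency_reward`, `safety_reward`) are in scope.

```python
def combined_reward(s, a, a_prev, s_next, env_reward, gamma=0.99):
    total = env_reward                                    # sparse task reward from the environment
    total += 0.1 * distance_shaping(s, a, s_next, gamma)  # potential-based, policy-preserving
    total += smoothness_shaping(a, a_prev)                # small auxiliary penalties from above
    total += efficiency_reward(a)
    total += safety_reward(s_next)
    return total
```
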
## Part 6: Inverse RL - Learning Rewards from Demonstrations

### The Problem

You have expert demonstrations but no explicit reward function. How can the agent learn?

**Options**:

1. Behavioral cloning: copy actions directly (doesn't learn *why*)
2. Reward learning (inverse RL): infer the reward structure from demonstrations
3. Imitation learning: match the expert's behavior distribution (GAIL-style)

### Inverse RL Concept

**Idea**: The expert is (near-)optimal under some reward function. Infer what reward structure makes the expert's behavior optimal.

```
Expert demonstrations → infer reward function → train agent on learned reward
```

**Key insight**: If the expert is optimal under reward R, then expert trajectories should accumulate much more reward under R than alternative trajectories.

### Practical Inverse RL (Simplified MaxEnt-Style Sketch)

```python
import torch
import torch.nn as nn


class InverseRLLearner:
    """
    Learn a reward function from expert demonstrations.

    Assumes the expert follows a near-optimal policy under the true reward.
    Note: this is a simplified margin-style objective in the spirit of
    MaxEnt/adversarial IRL; a full MaxEnt treatment also needs a
    partition-function (or entropy) term plus regularization.
    """

    def __init__(self, state_dim, action_dim):
        # Reward function (small neural network)
        self.reward_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        self.optimizer = torch.optim.Adam(self.reward_net.parameters())

    def compute_reward(self, s, a):
        """Learned reward; returns a tensor so gradients can flow through it."""
        sa = torch.cat([torch.as_tensor(s, dtype=torch.float32),
                        torch.as_tensor(a, dtype=torch.float32)])
        return self.reward_net(sa).squeeze()

    def train_step(self, expert_trajectories, agent_trajectories):
        """
        Update the reward to make the expert look better than the agent.

        Principle: maximize expert returns, minimize agent returns under the current reward.
        """
        # Expert reward sum
        expert_returns = sum(
            sum(self.compute_reward(s, a) for s, a in traj)
            for traj in expert_trajectories
        )

        # Agent reward sum
        agent_returns = sum(
            sum(self.compute_reward(s, a) for s, a in traj)
            for traj in agent_trajectories
        )

        # Loss: want expert >> agent (unregularized here; clip or regularize in practice)
        loss = agent_returns - expert_returns

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
```

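A typical way to use such a learner is to alternate reward updates with policy updates. The sketch below is a hypothetical outer loop; `make_initial_policy`, `collect_trajectories`, `train_policy`, `expert_trajs`, and the dimensions are assumptions, not part of this skill's API.

```python
irl = InverseRLLearner(state_dim=8, action_dim=2)
policy = make_initial_policy()  # e.g., random or behavior-cloned (assumed helper)

for iteration in range(100):
    agent_trajs = collect_trajectories(policy, env)            # roll out the current policy
    irl.train_step(expert_trajs, agent_trajs)                   # make the expert look better
    policy = train_policy(env, reward_fn=irl.compute_reward)    # RL on the learned reward
```
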
### When to Use Inverse RL

**Use when**:

- The reward is hard to specify but easy to demonstrate
- You have expert demonstrations (human, reference controller)
- The task is complex enough that the desired behavior is easier to show than to write down
- Your training budget allows a two-stage process

**Don't use when**:

- The reward is easy to specify (just specify it!)
- No expert demonstrations are available
- Demonstration quality is poor or inconsistent
- You need fast learning (inverse RL is slow)

## Part 7: Reward Normalization and Clipping

### Why Normalize?

Reward scale directly affects gradient magnitude and training stability.

```python
# Without normalization
reward_taskA = 1000 * task_metric              # Large magnitude
loss = -policy_gradient * reward_taskA         # Huge gradients

# With normalization
reward_normalized = reward_taskA / reward_std  # Roughly unit magnitude
loss = -policy_gradient * reward_normalized    # Reasonable gradients
```

### Standard Normalization Pipeline

```python
class RewardNormalizer:
    def __init__(self, epsilon=1e-8):
        self.mean = 0.0
        self.var = 1.0
        self.epsilon = epsilon

    def update_statistics(self, rewards):
        """Update running mean and variance."""
        rewards = np.array(rewards)
        # Exponential moving average (online update)
        alpha = 0.01
        self.mean = (1 - alpha) * self.mean + alpha * rewards.mean()
        self.var = (1 - alpha) * self.var + alpha * rewards.var()

    def normalize(self, reward):
        """Apply standardization then clipping."""
        # 1. Standardize (zero mean, unit variance)
        normalized = (reward - self.mean) / np.sqrt(self.var + self.epsilon)

        # 2. Clip to [-1, 1] for stability
        clipped = np.clip(normalized, -1.0, 1.0)

        return clipped
```

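A minimal usage sketch follows. The synthetic rewards are illustrative assumptions; in practice you would update the statistics from each batch of rollouts and feed the normalized values to the learner.

```python
import numpy as np

normalizer = RewardNormalizer()
raw_rewards = list(np.random.uniform(0, 1000, size=256))   # e.g., rewards from one rollout batch
normalizer.update_statistics(raw_rewards)                   # refresh running mean/variance
scaled = [normalizer.normalize(r) for r in raw_rewards]     # values now lie in [-1, 1]
```
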
### Clipping Strategy

```python
def clip_reward(r, clip_range=(-1.0, 1.0)):
    """
    Clip reward to a fixed range.

    Prevents large reward spikes from destabilizing training.
    """
    return np.clip(r, clip_range[0], clip_range[1])

# Usage
def total_reward(task_r, shaping_r):
    # Combine rewards
    combined = task_r + shaping_r

    # Clip the combined reward
    clipped = clip_reward(combined)

    return clipped
```

## Part 8: Validating Reward Functions

### Validation Checklist

```python
def validate_reward_function(reward_fn, env, agent_class, n_trials=5):
    """
    Systematic validation of a reward design.
    """
    results = {}

    # 1. Learning speed test
    agent = train_agent(agent_class, env, reward_fn, steps=100000)
    success_rate = evaluate(agent, env, n_episodes=100)
    results['learning_speed'] = success_rate

    if success_rate < 0.3:
        print("WARNING: Agent can't learn → reward signal too sparse")
        return False

    # 2. Generalization test
    test_variants = [modify_env(env) for _ in range(5)]
    test_rates = [evaluate(agent, test_env, 20) for test_env in test_variants]

    if np.mean(test_rates) < 0.7 * success_rate:
        print("WARNING: Hacking detected → agent doesn't generalize")
        return False

    # 3. Stability test
    agents = [train_agent(agent_class, env, reward_fn, steps=100000) for _ in range(n_trials)]
    variance = np.var([evaluate(a, env, 20) for a in agents])

    if variance > 0.3:
        print("WARNING: Training unstable → reward scale issue?")
        return False

    # 4. Behavioral inspection
    trajectory = run_episode(agent, env)
    if suspicious_behavior(trajectory):
        print("WARNING: Agent exhibiting strange behavior")
        return False

    print("PASSED: Reward function validated")
    return True
```

### Red Flags During Validation

| Red Flag | Likely Cause | Fix |
|----------|--------------|-----|
| Success rate < 10% after 50k steps | Reward too sparse | Add shaping |
| High variance across seeds | Reward scale/noise | Normalize/clip |
| Passes train but fails test | Reward hacking | Add anti-hacking penalties |
| Rewards diverging to infinity | Unbounded reward | Use clipping |
| Agent oscillates/twitches | Per-step reward exploitation | Penalize action change |
| Learning suddenly stops | Reward scale issue | Check normalization |

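Several of these red flags can be checked automatically from training artifacts. The sketch below mirrors the table but its thresholds are illustrative assumptions, and `returns_by_seed`, `train_score`, and `test_score` are whatever your training harness happens to record.

```python
import numpy as np

def training_red_flags(returns_by_seed, train_score, test_score):
    """Return a list of red-flag messages based on simple heuristics."""
    flags = []

    # Red flag: high variance across seeds
    final_scores = [np.mean(r[-100:]) for r in returns_by_seed]  # last-100-episode average per seed
    if np.var(final_scores) > 0.3:
        flags.append("high variance across seeds -> check reward scale / normalization")

    # Red flag: passes train but fails test
    if test_score < 0.7 * train_score:
        flags.append("train/test gap -> possible reward hacking")

    # Red flag: returns growing without bound (floor of 1.0 is arbitrary)
    first = returns_by_seed[0]
    if np.mean(first[-100:]) > 10 * max(abs(np.mean(first[:100])), 1.0):
        flags.append("returns growing without bound -> clip or normalize rewards")

    return flags
```
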
## Part 9: Common Pitfalls and Rationalizations

### Pitfall 1: "Let me just add distance reward"

- **Rationalization**: "I'll add a reward for getting closer to the goal, it can't hurt"
- **Problem**: Without the potential-based formula, it changes the optimal policy
- **Reality Check**: Measure the policy difference with and without shaping

### Pitfall 2: "Sparse rewards are always better"

- **Rationalization**: "Sparse rewards prevent hacking"
- **Problem**: The agent can't learn in long-horizon tasks (credit assignment crisis)
- **Reality Check**: Long stretches (tens of steps or more) without reward → add shaping or training will likely stall

### Pitfall 3: "Normalize everything"

- **Rationalization**: "I'll normalize all rewards to [-1, 1]"
- **Problem**: Aggressive normalization/clipping can squash meaningful structure (e.g., a large goal bonus ends up the same size as small shaping terms)
- **Reality Check**: Validate that the normalized reward still trains well

### Pitfall 4: "Inverse RL is the answer"

- **Rationalization**: "I don't know how to specify rewards, I'll learn from demos"
- **Problem**: Inverse RL is slow and requires good demonstrations
- **Reality Check**: If you can specify the reward clearly, just do it

### Pitfall 5: "More auxiliary rewards = faster learning"

- **Rationalization**: "I'll add smoothness, energy, and safety rewards"
- **Problem**: Each auxiliary reward is another hacking target
- **Reality Check**: Validate each auxiliary term independently

### Pitfall 6: "This should work, why doesn't it?"

- **Rationalization**: "The reward looks right, must be an algorithm issue"
- **Problem**: Reward design is usually the bottleneck, not the algorithm
- **Reality Check**: Systematically validate the reward using the test framework

### Pitfall 7: "Agent learned the task, my reward was right"

- **Rationalization**: "The agent succeeded, so the reward design was good"
- **Problem**: The agent might have succeeded via a hacked solution, not the true task
- **Reality Check**: Test on distribution shift / different environment variants

### Pitfall 8: "Dense rewards cause overfitting"

- **Rationalization**: "Sparse rewards generalize better"
- **Problem**: Sparse rewards often simply fail to learn in long episodes
- **Reality Check**: Compare learning curves and final policy generalization

### Pitfall 9: "Clipping breaks the signal"

- **Rationalization**: "If I clip rewards, I lose information"
- **Problem**: Unbounded rewards cause training instability
- **Reality Check**: For moderate rewards the relative ordering is preserved after clipping; only extreme spikes are compressed

### Pitfall 10: "Potential-based shaping doesn't matter"

- **Rationalization**: "A reward penalty is a reward penalty"
- **Problem**: Non-potential-based shaping CAN change the optimal policy
- **Reality Check**: Verify that your shaping term really has the F = γΦ(s') - Φ(s) structure

## Part 10: Reward Engineering Patterns for Common Tasks

### Pattern 1: Goal-Reaching Tasks

```python
def reaching_reward(s, a, s_next, gamma=0.99):
    """
    Task: reach a target location.
    """
    goal = s['goal']

    # Sparse task reward
    if np.linalg.norm(s_next['position'] - goal) < 0.1:
        task_reward = 1.0
    else:
        task_reward = 0.0

    # Dense potential-based shaping
    distance = np.linalg.norm(s_next['position'] - goal)
    distance_prev = np.linalg.norm(s['position'] - goal)

    phi = -distance
    phi_prev = -distance_prev
    shaping = gamma * phi - phi_prev

    # Efficiency penalty (optional)
    efficiency = -0.001 * np.sum(a**2)

    return task_reward + 0.1 * shaping + efficiency
```

### Pattern 2: Locomotion Tasks

```python
def locomotion_reward(s, a, s_next):
    """
    Task: move forward efficiently.
    """
    # Forward progress (dense, per step)
    forward_reward = s_next['x_pos'] - s['x_pos']

    # Staying alive (don't fall)
    alive_bonus = 1.0 if is_alive(s_next) else 0.0

    # Energy efficiency
    action_penalty = -0.0001 * np.sum(a**2)

    return forward_reward + alive_bonus + action_penalty
```

### Pattern 3: Multi-Objective Tasks

```python
def multi_objective_reward(s, a, s_next):
    """
    Task: multiple objectives (e.g., reach the goal AND minimize energy).
    """
    goal_reward = 10.0 * goal_progress(s, s_next)
    energy_reward = -0.01 * np.sum(a**2)
    safety_reward = -1.0 * collision_risk(s_next)

    # Weight the objectives
    return 1.0 * goal_reward + 0.1 * energy_reward + 0.5 * safety_reward
```

## Summary: Reward Engineering Workflow

1. **Specify what success looks like** (task reward)
2. **Choose sparse or dense** based on episode length
3. **If dense, use potential-based shaping** (preserves the optimal policy)
4. **Add anti-hacking penalties** if needed
5. **Normalize and clip** for stability
6. **Validate** systematically (generalization, hacking, stability)
7. **Iterate** based on validation results

## Key Equations Reference

```
Potential-Based Shaping:
    F(s,a,s') = γΦ(s') - Φ(s)

Value Function Shift (with shaping):
    V'(s) = V(s) - Φ(s)

Optimal Policy Preservation:
    argmax_a Q'(s,a) = argmax_a Q(s,a)   (same action, shifted Q-values)

Reward Normalization:
    r_norm = (r - μ) / (σ + ε)

Clipping:
    r_clipped = clip(r_norm, -1, 1)
```

## Testing Scenarios (13+)

The skill addresses these scenarios:

1. Detecting reward hacking from test set failure
2. Implementing potential-based shaping correctly
3. Choosing sparse vs dense based on episode length
4. Designing distance-based rewards without changing the optimal policy
5. Adding auxiliary rewards without hacking
6. Normalizing rewards across task variants
7. Validating that shaping preserves the optimal policy
8. Applying inverse RL to expert demonstrations
9. Debugging when the reward signal causes oscillation
10. Engineering rewards for specific task families
11. Recognizing when the reward is the bottleneck vs the algorithm
12. Explaining reward hacking in principal-agent terms
13. Implementing an end-to-end reward validation pipeline

## Practical Checklist

- [ ] Task reward clearly specifies success
- [ ] Reward function can't be exploited by shortcuts
- [ ] Episode length < 20 steps → sparse OK
- [ ] Episode length > 50 steps → need shaping
- [ ] Using potential-based formula F = γΦ(s') - Φ(s)
- [ ] Clipping/normalizing rewards to [-1, 1]
- [ ] Tested on distribution shift (different env variant)
- [ ] Behavioral inspection (is the agent doing what you expect?)
- [ ] Training stability across seeds (variance < 0.3)
- [ ] Learning curves look reasonable (no sudden divergence)
- [ ] Final policy generalizes to test distribution