# RL Foundations

## When to Use This Skill

Invoke this skill when you encounter:

- **New to RL**: User asks "what is RL" or "how does RL work"
- **Theory Questions**: MDP, value functions, Bellman equations, policy optimization
- **Conceptual Confusion**: Mixing up V(s) and Q(s,a), value iteration vs policy iteration
- **Before Implementation**: User wants to implement RL algorithms without understanding foundations
- **Debugging Theory**: Why discount factor matters, why exploration is needed, how algorithms differ
- **Foundation Check**: User jumps to DQN/PPO without understanding MDPs

**This skill provides the theoretical foundation for ALL other deep-rl skills.**

Do NOT use this skill for:

- Algorithm implementation (use value-based-methods, policy-gradient-methods, actor-critic-methods)
- Debugging code (use rl-debugging)
- Environment setup (use rl-environments)

## Core Principle

**Understanding the theory enables everything else.**

Reinforcement learning is built on a rigorous mathematical foundation:

1. **MDP** (Markov Decision Process) - the framework
2. **Value Functions** - quantify expected return
3. **Bellman Equations** - recursive decomposition
4. **Optimal Policy** - maximize expected return
5. **Algorithms** - methods to find the optimal policy

**Without this foundation, you're copy-pasting code you cannot debug, adapt, or extend.**

## Part 1: Markov Decision Process (MDP)

### What is an MDP?

An MDP is the mathematical framework for sequential decision-making under uncertainty.

**Formal Definition**: A Markov Decision Process is a 5-tuple (S, A, P, R, γ):

- **S**: State space (set of all possible states)
- **A**: Action space (set of all possible actions)
- **P**: Transition probability P(s'|s,a) - probability of reaching state s' from state s after action a
- **R**: Reward function R(s,a,s') - immediate reward for the transition
- **γ**: Discount factor (0 ≤ γ ≤ 1) - controls importance of future rewards

**Key Property**: **Markov Property**

```
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
```

**Meaning**: The future depends only on the present state, not the history.

**Why this matters**: It allows recursive algorithms (Bellman equations). If the Markov property is violated, standard RL algorithms may fail.

### Example 1: GridWorld MDP

**Problem**: Agent navigates a 4x4 grid to reach the goal.

```
S = {(0,0), (0,1), ..., (3,3)}  # 16 states
A = {UP, DOWN, LEFT, RIGHT}     # 4 actions
R = -1 for each step, +10 at goal
γ = 0.9
P: Deterministic (UP always moves up unless blocked by a wall)
```

**Visualization**:

```
S . . .
. . . .
. # . .    # = wall
. . . G    G = goal (+10)
```

**Transition Example**:

- State s = (1,1), Action a = RIGHT
- Deterministic: P(s'=(1,2) | s=(1,1), a=RIGHT) = 1.0
- Reward: R(s,a,s') = -1
- Next state: s' = (1,2)

**Markov Property Holds**: Future position depends only on current position and action, not how you got there.

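The 5-tuple above can be written down directly as code. Below is a minimal sketch (names like `GRID` and `step` are illustrative, not from any library; walls are omitted for brevity) encoding this deterministic GridWorld's S, A, P, and R:

```python
# Minimal sketch: a deterministic GridWorld encoded as explicit (S, A, P, R).
GRID = 4
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
GOAL = (3, 3)

def step(s, a):
    """Deterministic transition: P(s'|s,a) = 1 for exactly one s'."""
    x, y = s
    dx, dy = ACTIONS[a]
    # Clamp to the grid boundary
    nx = min(max(x + dx, 0), GRID - 1)
    ny = min(max(y + dy, 0), GRID - 1)
    return (nx, ny)

def reward(s, a, s_next):
    """R(s,a,s'): -1 per step, +10 on reaching the goal."""
    return 10 if s_next == GOAL else -1

# The Markov property is built in: step() reads only (s, a), never the history.
print(step((1, 1), "RIGHT"), reward((1, 1), "RIGHT", (1, 2)))  # (1, 2) -1
```
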
### Example 2: Stochastic GridWorld

**Modification**: Actions succeed with probability 0.8, move perpendicular with probability 0.1 each.

```
P((1,2) | (1,1), RIGHT) = 0.8  # intended
P((0,1) | (1,1), RIGHT) = 0.1  # slip up
P((2,1) | (1,1), RIGHT) = 0.1  # slip down
```

**Why Stochastic**: Models real-world uncertainty (robot actuators, wind, slippery surfaces).

**Consequence**: Agent must consider probabilities when choosing actions.

### Example 3: Continuous State MDP (Cartpole)

```
S ⊂ ℝ⁴: (cart_position, cart_velocity, pole_angle, pole_angular_velocity)
A = {LEFT, RIGHT}  # discrete actions, continuous state
R = +1 for each timestep upright
γ = 0.99
P: Physics-based transition (continuous dynamics)
```

**Key Difference**: State space is continuous, requires function approximation (neural networks).

**Still an MDP**: Markov property holds (physics is Markovian given state).

### When is the Markov Property Violated?

**Example: Poker**

```
State: Current cards visible
Markov Violated: Opponents' strategies depend on past betting patterns
```

**Solution**: Augment state with history (last N actions), or use a partially observable MDP (POMDP).

**Example: Robot with Noisy Sensors**

```
State: Raw sensor reading (single frame)
Markov Violated: True position requires integrating multiple frames
```

**Solution**: Stack frames (last 4 frames as state), or use a recurrent network (LSTM).

### Episodic vs Continuing Tasks

**Episodic**: Task terminates (games, reaching goal)

```
Episode: s₀ → s₁ → ... → s_T (terminal state)
Return: G_t = r_t + γr_{t+1} + ... + γ^{T-t}r_T
```

**Continuing**: Task never ends (stock trading, robot operation)

```
Return: G_t = r_t + γr_{t+1} + γ²r_{t+2} + ... (infinite)
```

**Critical**: Continuing tasks REQUIRE γ < 1 (else the return is infinite).

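To see why, here is a small sketch comparing partial sums of the return for a constant reward of +1 per step under different discount factors:

```python
# Sketch: partial sums of the discounted return for a constant reward stream.
# With gamma < 1 the series converges to r / (1 - gamma); with gamma = 1 it grows forever.
def partial_return(r, gamma, n_steps):
    return sum(gamma**t * r for t in range(n_steps))

for gamma in (0.9, 0.99, 1.0):
    print(gamma, round(partial_return(1.0, gamma, 10_000), 2))
# 0.9  -> ~10.0   (= 1/(1-0.9))
# 0.99 -> ~100.0  (= 1/(1-0.99))
# 1.0  -> 10000.0, and still growing with n_steps
```
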
### MDP Pitfall #1: Using Wrong State Representation

**Bad**: State = current frame only (when velocity matters)

```python
# Pong: Ball position alone doesn't tell velocity
state = current_frame  # WRONG - not Markovian
```

**Good**: State = last 4 frames (velocity from differences)

```python
# Frame stacking preserves the Markov property
# (frames is a list of recent observations, newest last)
state = np.concatenate([frames[-1], frames[-2], frames[-3], frames[-4]])
```

**Why**: Ball velocity = (position_t - position_{t-1}) / dt, so you need history.

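A common way to package this is a small wrapper that keeps a rolling window of observations. The sketch below is an assumed minimal version, not the Gym/Gymnasium `FrameStack` API:

```python
from collections import deque
import numpy as np

# Sketch of frame stacking to restore the Markov property.
class FrameStack:
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        for _ in range(self.k):          # pad with copies of the first frame
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)        # oldest frame drops out automatically
        return self.state()

    def state(self):
        return np.concatenate(list(self.frames))  # stacked observation

stack = FrameStack(k=4)
s = stack.reset(np.zeros(4))
s = stack.push(np.ones(4))   # the state now encodes change between frames
print(s.shape)               # (16,)
```
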
### MDP Pitfall #2: Reward Function Shapes Behavior

**Example**: Robot navigating to goal

**Bad Reward**:

```python
reward = +1 if at_goal else 0  # Sparse
```

**Problem**: No signal until goal reached, hard to learn.

**Better Reward**:

```python
reward = -distance_to_goal  # Dense
```

**Problem**: Agent learns to get closer but may not reach goal (local optimum).

**Best Reward** (Potential-Based Shaping):

```python
reward = (distance_prev - distance_curr) + large_bonus_at_goal
```

**Why**: Encourages progress + explicit goal reward.

**Takeaway**: Reward function engineering is CRITICAL. Route to the reward-shaping skill for details.

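For reference, the standard potential-based form is F(s,s') = γΦ(s') − Φ(s), which provably leaves the optimal policy unchanged. A minimal sketch, using an illustrative negative-distance potential (the choice of Φ is an assumption, not a fixed recipe):

```python
# Sketch of potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).
def phi(s, goal=(3, 3)):
    # Illustrative potential: negative Manhattan distance to the goal
    return -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))

def shaped_reward(r, s, s_next, gamma=0.9):
    return r + gamma * phi(s_next) - phi(s)

print(round(shaped_reward(-1, (1, 1), (1, 2)), 2))  # 0.3: progress adds a bonus
```
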
### MDP Formulation Checklist

Before implementing any RL algorithm, answer:

- [ ] **States**: What information defines the situation? Is it Markovian?
- [ ] **Actions**: What can the agent do? Discrete or continuous?
- [ ] **Transitions**: Deterministic or stochastic? Do you know P(s'|s,a)?
- [ ] **Rewards**: Immediate reward for each transition? Sparse or dense?
- [ ] **Discount**: Episodic (can use γ=1) or continuing (need γ<1)?
- [ ] **Markov Property**: Does the current state fully determine the future?

**If you cannot answer these, you cannot implement RL algorithms effectively.**

## Part 2: Value Functions

### What is a Value Function?

A value function quantifies "how good" a state (or state-action pair) is.

**State-Value Function V^π(s)**:

```
V^π(s) = E_π[G_t | s_t = s]
       = E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s]
```

**Meaning**: Expected cumulative discounted reward starting from state s and following policy π.

**Action-Value Function Q^π(s,a)**:

```
Q^π(s,a) = E_π[G_t | s_t = s, a_t = a]
         = E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s, a_t = a]
```

**Meaning**: Expected cumulative discounted reward starting from state s, taking action a, then following policy π.

**Relationship**:

```
V^π(s) = Σ_a π(a|s) Q^π(s,a)
```

**Intuition**: V(s) = value of a state, Q(s,a) = value of a state-action pair.

### Critical Distinction: Value vs Reward

**Reward r(s,a)**: Immediate, one-step payoff.

**Value V(s)**: Long-term, cumulative expected reward.

**Example: GridWorld**

```
Reward: r = -1 every step, r = +10 at goal
Value at state 2 steps from goal:
V(s) ≈ -1 + γ(-1) + γ²(+10)
     = -1 - 0.9 + 0.81*10
     = -1.9 + 8.1 = 6.2
```

**Key**: Value is higher than the immediate reward because it accounts for the future goal reward.

**Common Mistake**: Setting V(s) = r(s). This ignores all future rewards.

### Example: Computing V^π for a Simple Policy

**GridWorld**: 3x3 grid, goal at (2,2), γ=0.9, r=-1 per step.

**Policy π**: Always move right or down (deterministic).

**Manual Calculation**:

```
V^π((2,2)) = 0  (goal, no future rewards)

V^π((2,1)) = r + γ V^π((2,2))
           = -1 + 0.9 * 0 = -1

V^π((1,2)) = r + γ V^π((2,2))
           = -1 + 0.9 * 0 = -1

V^π((1,1)) = r + γ V^π((1,2))  (assuming action = RIGHT)
           = -1 + 0.9 * (-1) = -1.9

V^π((0,0)) = r + γ V^π((0,1))
           = ... (depends on path)
```

**Observation**: Values decrease as distance from the goal increases (more -1 rewards to collect).

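The manual calculation can be automated with a short backward recursion along the policy's path. A sketch assuming the same 3x3 grid and a right-then-down policy:

```python
# Sketch: evaluate the manual example by recursion along the policy's path.
gamma, goal = 0.9, (2, 2)

def policy(s):
    # Move RIGHT until the last column, then DOWN
    return "RIGHT" if s[1] < 2 else "DOWN"

def next_state(s):
    x, y = s
    return (x, y + 1) if policy(s) == "RIGHT" else (x + 1, y)

def v_pi(s):
    if s == goal:
        return 0.0                               # no future reward at the goal
    return -1 + gamma * v_pi(next_state(s))      # r = -1 per step

print(round(v_pi((1, 1)), 2))  # -1.9, matching the manual calculation
print(round(v_pi((0, 0)), 2))  # -3.44 along this policy's path
```
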
### Optimal Value Functions

**Optimal State-Value Function V*(s)**:

```
V*(s) = max_π V^π(s)
```

**Meaning**: Maximum value achievable from state s under ANY policy.

**Optimal Action-Value Function Q*(s,a)**:

```
Q*(s,a) = max_π Q^π(s,a)
```

**Meaning**: Maximum value achievable from state s, taking action a, then acting optimally.

**Optimal Policy π***:

```
π*(s) = argmax_a Q*(s,a)
```

**Meaning**: Policy that achieves V*(s) at all states.

**Key Insight**: If you know Q*(s,a), the optimal policy is trivial (pick the action with max Q).

### Value Function Pitfall #1: Confusing V and Q

**Wrong Understanding**:

- V(s) = value of state s
- Q(s,a) = value of action a (WRONG - ignores state)

**Correct Understanding**:

- V(s) = value of state s (average over actions under the policy)
- Q(s,a) = value of taking action a IN STATE s

**Example**: GridWorld

```
State s = (1,1)
V(s) might be 5.0 (average value under policy)

Q(s, RIGHT) = 6.0  (moving right is good)
Q(s, LEFT)  = 2.0  (moving left is bad)
Q(s, UP)    = 4.0
Q(s, DOWN)  = 7.0  (moving down is best)

V(s) = π(RIGHT|s)*6 + π(LEFT|s)*2 + π(UP|s)*4 + π(DOWN|s)*7
```

**Takeaway**: Q depends on BOTH state and action. V depends only on state.

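A two-line check of the relationship V(s) = Σ_a π(a|s) Q(s,a), using the numbers above:

```python
import numpy as np

# V(s) as the policy-weighted average of Q(s, a)
q = np.array([6.0, 2.0, 4.0, 7.0])        # Q(s, RIGHT), Q(s, LEFT), Q(s, UP), Q(s, DOWN)
pi = np.array([0.25, 0.25, 0.25, 0.25])   # uniform random policy over the four actions
print(pi @ q)   # V(s) = 4.75 under this policy
print(q.max())  # 7.0 = V(s) under the greedy policy (always DOWN)
```
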
### Value Function Pitfall #2: Forgetting Expectation

**Wrong**: V(s) = sum of rewards on one trajectory.

**Correct**: V(s) = expected sum over ALL possible trajectories.

**Example**: Stochastic GridWorld

```python
# WRONG: Compute V by running one episode
episode_return = sum([r_0, r_1, ..., r_T])
V[s_0] = episode_return  # This is ONE sample, not an expectation

# CORRECT: Compute V by averaging over many episodes
returns = []
for _ in range(1000):
    episode_return = run_episode(policy, start_state=s)
    returns.append(episode_return)
V[s] = np.mean(returns)  # Expectation via Monte Carlo
```

**Key**: Value is an expectation, not a single sample.

### Value Function Pitfall #3: Ignoring Discount Factor

**Scenario**: User computes V without discounting.

**Wrong**:

```python
V[s] = r_0 + r_1 + r_2 + ...  # No discount
```

**Correct**:

```python
V[s] = r_0 + gamma*r_1 + gamma**2*r_2 + ...
```

**Why It Matters**: Without discounting, values blow up in continuing tasks.

**Example**: Continuing task with r=1 every step

```
Without discount: V = 1 + 1 + 1 + ... = ∞
With γ=0.9: V = 1 + 0.9 + 0.81 + ... = 1/(1-0.9) = 10
```

**Takeaway**: Always discount future rewards in continuing tasks.

## Part 3: Policies

### What is a Policy?

A policy π is a mapping from states to actions (or action probabilities).

**Deterministic Policy**: π: S → A

```
π(s) = a  (always take action a in state s)
```

**Stochastic Policy**: π: S × A → [0,1]

```
π(a|s) = probability of taking action a in state s
Σ_a π(a|s) = 1  (probabilities sum to 1)
```

### Example: Policies in GridWorld

**Deterministic Policy**:

```python
def policy(state):
    if state[0] < 2:
        return "RIGHT"
    else:
        return "DOWN"
```

**Stochastic Policy**:

```python
def policy(state):
    # 70% right, 20% down, 10% up
    return np.random.choice(["RIGHT", "DOWN", "UP"],
                            p=[0.7, 0.2, 0.1])
```

**Uniform Random Policy**:

```python
def policy(state):
    return np.random.choice(["UP", "DOWN", "LEFT", "RIGHT"])
```

### Policy Evaluation

**Problem**: Given policy π, compute V^π(s) for all states.

**Approach 1: Monte Carlo** (sample trajectories)

```python
# Run many episodes, average returns
V = defaultdict(float)
counts = defaultdict(int)

for episode in range(10000):
    trajectory = run_episode(policy)
    G = 0
    for (s, a, r) in reversed(trajectory):
        G = r + gamma * G
        V[s] += G
        counts[s] += 1

for s in V:
    V[s] /= counts[s]  # Average
```

**Approach 2: Bellman Expectation** (iterative)

```python
# pi, P, R assumed as nested dicts: pi[s][a], P[s][a][s2], R[s][a][s2]
# Initialize V arbitrarily
V = {s: 0 for s in states}

# Iterate until convergence
while not converged:
    V_new = {}
    for s in states:
        V_new[s] = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                      for s2 in states)
                       for a in actions)
    V = V_new
```

**Approach 2 requires knowing P(s'|s,a)** (model-based).

### Policy Improvement

**Theorem**: Given V^π, the greedy policy π' with respect to V^π is at least as good as π.

```
π'(s) = argmax_a Q^π(s,a)
      = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γV^π(s')]
```

**Proof Sketch**: By construction, π' maximizes expected immediate reward + future value.

**Consequence**: Iterating policy evaluation + policy improvement converges to the optimal policy π*.

### Optimal Policy π*

**Theorem**: There exists an optimal policy π* that achieves V*(s) at all states.

**How to find π* from Q***:

```python
def optimal_policy(state):
    return argmax(Q_star[state, :])  # Greedy w.r.t. Q*
```

**How to find π* from V***:

```python
def optimal_policy(state):
    # One-step lookahead; P and R assumed as nested dicts
    return max(actions,
               key=lambda a: sum(P[state][a][s2] * (R[state][a][s2] + gamma * V_star[s2])
                                 for s2 in states))
```

**Key**: The optimal policy is deterministic (greedy w.r.t. Q* or V*).

**Exception**: In stochastic games with multiple optimal actions, any distribution over optimal actions is fine.

### Policy Pitfall #1: Greedy Policy Without Exploration

**Problem**: Always taking argmax(Q) means never trying new actions.

**Example**:

```python
# Pure greedy policy (WRONG for learning)
def policy(state):
    return argmax(Q[state, :])
```

**Why It Fails**: If Q is initialized wrong, the agent never explores better actions.

**Solution**: ε-greedy policy

```python
def epsilon_greedy_policy(state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # Explore
    else:
        return argmax(Q[state, :])     # Exploit
```

**Exploration-Exploitation Tradeoff**: Explore to find better actions, exploit to maximize reward.

### Policy Pitfall #2: Stochastic Policy for Deterministic Optimal

**Scenario**: The optimal policy is deterministic (most MDPs), but the user uses a stochastic policy.

**Effect**: Suboptimal performance (randomness doesn't help).

**Example**: The GridWorld optimal policy always moves toward the goal (deterministic).

**When Stochastic is Needed**:

1. **During Learning**: Exploration (ε-greedy, Boltzmann)
2. **Partially Observable**: Stochasticity can help in POMDPs
3. **Multi-Agent**: Randomness prevents exploitation by opponents

**Takeaway**: After learning, the optimal policy is usually deterministic. Use stochastic policies for exploration.

## Part 4: Bellman Equations

### Bellman Expectation Equation

**For V^π**:

```
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]
```

**Intuition**: Value of state s = expected immediate reward + discounted value of the next state.

**For Q^π**:

```
Q^π(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ Σ_{a'} π(a'|s') Q^π(s',a')]
```

**Intuition**: Value of (s,a) = expected immediate reward + discounted value of the next (s',a').

**Relationship**:

```
V^π(s) = Σ_a π(a|s) Q^π(s,a)
Q^π(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]
```

### Bellman Optimality Equation

**For V***:

```
V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
```

**Intuition**: Optimal value = max over actions of (immediate reward + discounted optimal future value).

**For Q***:

```
Q*(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q*(s',a')]
```

**Intuition**: Optimal Q-value = expected immediate reward + discounted optimal Q-value of the next state.

**Relationship**:

```
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
```

### Deriving the Bellman Equation

**Start with the definition of V^π**:

```
V^π(s) = E_π[G_t | s_t = s]
       = E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s]
```

**Factor out the first reward**:

```
V^π(s) = E_π[r_t + γ(r_{t+1} + γr_{t+2} + ...) | s_t = s]
       = E_π[r_t | s_t = s] + γ E_π[r_{t+1} + γr_{t+2} + ... | s_t = s]
```

**The second term is the expected value of V^π(s_{t+1})**:

```
V^π(s) = E_π[r_t | s_t = s] + γ E_π[V^π(s_{t+1}) | s_t = s]
```

**Expand the expectations**:

```
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]
```

**This is the Bellman Expectation Equation.**

**Key Insight**: The value function satisfies a consistency equation (recursive).

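The consistency equation is easy to verify numerically. A sketch on a 3-state chain (the same shape as Example 3 later in this document):

```python
# Sketch: checking Bellman consistency on a chain s0 -> s1 -> s2 -> goal,
# with r = -1 per step, +10 on the final step into the goal, gamma = 0.9.
gamma = 0.9
V = {3: 0.0}                     # terminal state
V[2] = 10 + gamma * V[3]         # final step into the goal
V[1] = -1 + gamma * V[2]
V[0] = -1 + gamma * V[1]

# Every state satisfies V(s) = r + gamma * V(s'):
for s in (0, 1, 2):
    r = 10 if s == 2 else -1
    assert abs(V[s] - (r + gamma * V[s + 1])) < 1e-12

print({s: round(v, 2) for s, v in V.items()})  # {3: 0.0, 2: 10.0, 1: 8.0, 0: 6.2}
```
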
### Why Bellman Equations Matter

**1. Iterative Algorithms**: Use the Bellman equation as an update rule

```
# Value Iteration
V_new[s] = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V[s']]

# Q-Learning
Q[s,a] += alpha * (r + gamma * max_{a'} Q[s',a'] - Q[s,a])
```

**2. Convergence Guarantees**: The Bellman operator is a contraction, which guarantees convergence.

**3. Understanding Algorithms**: All RL algorithms approximate Bellman equations.

**Takeaway**: Bellman equations are the foundation of RL algorithms.

### Bellman Pitfall #1: Forgetting Max vs Expectation

**Bellman Expectation** (for policy π):

```
V^π(s) = Σ_a π(a|s) ...  # Expectation over policy
```

**Bellman Optimality** (for optimal policy):

```
V*(s) = max_a ...  # Maximize over actions
```

**Consequence**:

- Policy evaluation uses the Bellman expectation equation
- Value iteration uses the Bellman optimality equation

**Common Mistake**: Using max when evaluating a non-greedy policy.

### Bellman Pitfall #2: Ignoring Transition Probabilities

**Deterministic Transition**:

```
V^π(s) = R(s,a) + γ V^π(s')  # Direct, s' is deterministic
```

**Stochastic Transition**:

```
V^π(s) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]  # Weighted sum
```

**Example**: Stochastic GridWorld

```
# Action RIGHT from (1,1)
V((1,1)) = 0.8 * [r + γ V((1,2))]   # 80% intended
         + 0.1 * [r + γ V((0,1))]   # 10% slip up
         + 0.1 * [r + γ V((2,1))]   # 10% slip down
```

**Takeaway**: Don't forget to weight by transition probabilities in stochastic environments.

## Part 5: Discount Factor γ

### What Does γ Control?

**Discount factor γ ∈ [0, 1]** controls how much the agent cares about future rewards.

**γ = 0**: Only immediate reward matters

```
V(s) = E[r_t]  (myopic)
```

**γ = 1**: All future rewards matter equally

```
V(s) = E[r_t + r_{t+1} + r_{t+2} + ...]  (far-sighted)
```

**γ = 0.9**: Future discounted exponentially

```
V(s) = E[r_t + 0.9*r_{t+1} + 0.81*r_{t+2} + ...]
```

**Reward 10 steps away**:

- γ=0.9: worth 0.9^10 ≈ 0.35 of an immediate reward
- γ=0.99: worth 0.99^10 ≈ 0.90 of an immediate reward

### Planning Horizon

**Effective Horizon**: How far ahead does the agent plan?

**Approximation**: Horizon ≈ 1/(1-γ)

**Examples**:

- γ=0.9 → Horizon ≈ 10 steps
- γ=0.99 → Horizon ≈ 100 steps
- γ=0.5 → Horizon ≈ 2 steps
- γ=0.999 → Horizon ≈ 1000 steps

**Intuition**: After horizon-many steps, rewards are discounted to ~37% (e⁻¹).

**Formal**: Σ_{t=0}^∞ γ^t = 1/(1-γ) (sum of a geometric series).

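A quick sketch confirming both facts: the total discount weight sums to 1/(1-γ), and rewards beyond the horizon are discounted to roughly e⁻¹:

```python
# Sketch: where the discount weight concentrates.
gamma = 0.9
horizon = int(1 / (1 - gamma))                      # ~10 steps
total = sum(gamma**t for t in range(10_000))        # ~1/(1-gamma) = 10
within_horizon = sum(gamma**t for t in range(horizon)) / total
print(round(total, 2), round(within_horizon, 2))    # 10.0 0.65
print(round(gamma**horizon, 2))                     # 0.35 ~ e^-1
```
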
### Choosing γ

**Rule of Thumb**:

- **Task horizon known**: Choose γ such that 1/(1-γ) ≈ task_length
- **Short episodes** (< 100 steps): γ = 0.9 to 0.95
- **Long episodes** (100-1000 steps): γ = 0.99
- **Very long** (> 1000 steps): γ = 0.999

**Example: Pong** (episode ~ 1000 steps)

```
γ = 0.99  # Horizon ≈ 100, sees ~10% of episode
```

**Example: Cartpole** (episode ~ 200 steps)

```
γ = 0.99  # Horizon ≈ 100, sees half of episode
```

**Example: Chess** (game ~ 40 moves = 80 steps)

```
γ = 0.95  # Horizon ≈ 20, sees a quarter of the game
```

### γ = 1 Special Case

**When γ = 1**:

- Only valid for **episodic tasks** (guaranteed termination)
- Continuing tasks: V = ∞ (unbounded)

**Example: GridWorld** (terminates at goal)

```
γ = 1.0  # OK, episode ends
V(s) = -steps_to_goal + 10  (finite)
```

**Example: Stock trading** (never terminates)

```
γ = 1.0   # WRONG, V = ∞
γ = 0.99  # Correct
```

**Takeaway**: Use γ < 1 for continuing tasks; γ = 1 is allowed for episodic tasks.

### Discount Factor Pitfall #1: Too Small γ

**Scenario**: Task requires 50 steps to reach the goal, γ=0.9.

**Problem**:

```
Reward at step 50 discounted by 0.9^50 = 0.0052
```

**Effect**: Agent effectively blind to long-term goals (can't see the reward).

**Solution**: Increase γ to 0.99 (0.99^50 = 0.61, still significant).

**Symptom**: Agent learns a suboptimal policy (ignores distant goals).

### Discount Factor Pitfall #2: γ = 1 in Continuing Tasks

**Scenario**: Continuing task (never terminates), γ=1.

**Problem**:

```
V(s) = r + r + r + ... = ∞  (unbounded)
```

**Effect**: Value iteration and Q-learning diverge (values explode).

**Solution**: Use γ < 1 (e.g., γ=0.99).

**Symptom**: Values grow without bound, algorithm doesn't converge.

### Discount Factor Pitfall #3: Treating γ as a Hyperparameter

**Wrong Mindset**: "Let's grid search γ in [0.9, 0.95, 0.99]."

**Correct Mindset**: "The task requires planning X steps ahead, so γ = 1 - 1/X."

**Example**: Goal 100 steps away

```
Required horizon = 100
γ = 1 - 1/100 = 0.99
```

**Takeaway**: γ is not arbitrary. Choose it based on the task horizon.

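As a sketch, the rule of thumb inverted into a (hypothetical) helper:

```python
# gamma = 1 - 1/horizon, the rule of thumb from this section
def gamma_for_horizon(horizon):
    return 1.0 - 1.0 / horizon

for h in (10, 100, 1000):
    print(h, gamma_for_horizon(h))  # 0.9, 0.99, 0.999
```
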
## Part 6: Algorithm Families

### Three Paradigms

**1. Dynamic Programming (DP)**:

- Requires full MDP model (P, R known)
- Exact algorithms (no sampling)
- Examples: Value Iteration, Policy Iteration

**2. Monte Carlo (MC)**:

- Model-free (learn from experience)
- Learns from complete episodes
- Examples: First-visit MC, Every-visit MC

**3. Temporal Difference (TD)**:

- Model-free (learn from experience)
- Learns from incomplete episodes
- Examples: TD(0), Q-learning, SARSA

**Key Differences**:

- DP: Needs model, no sampling
- MC: No model, full episodes
- TD: No model, partial episodes (most flexible)

### Value Iteration

**Algorithm**: Iteratively apply the Bellman optimality operator.

```python
# Sketch: states, actions, gamma assumed given; P and R assumed available
# as functions P(s, a, s_next) and R(s, a, s_next) (model-based).
threshold = 1e-6  # convergence tolerance

# Initialize
V = {s: 0 for s in states}

# Iterate until convergence
while True:
    V_new = {}
    for s in states:
        # Bellman optimality backup
        V_new[s] = max(sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                           for s_next in states)
                       for a in actions)

    converged = max(abs(V_new[s] - V[s]) for s in states) < threshold
    V = V_new
    if converged:
        break

# Extract policy (greedy one-step lookahead)
policy = {s: max(actions,
                 key=lambda a: sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                                   for s_next in states))
          for s in states}
```

**Convergence**: Guaranteed (the Bellman operator is a contraction).

**Computational Cost**: O(|S|² |A|) per iteration.

**When to Use**: Small state spaces (< 10,000 states), full model available.

### Policy Iteration

**Algorithm**: Alternate between policy evaluation and policy improvement.

```python
# Sketch: same assumptions as value iteration (P, R, states, actions, threshold).
# Initialize random policy
policy = {s: random.choice(actions) for s in states}

while True:
    # Policy Evaluation: compute V^π for the current policy
    V = {s: 0 for s in states}
    while True:
        V_new = {}
        for s in states:
            a = policy[s]
            V_new[s] = sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                           for s_next in states)
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < threshold:
            break

    # Policy Improvement: make the policy greedy w.r.t. V
    policy_stable = True
    for s in states:
        old_action = policy[s]
        policy[s] = max(actions,
                        key=lambda a: sum(P(s, a, s_next) * (R(s, a, s_next) + gamma * V[s_next])
                                          for s_next in states))
        if old_action != policy[s]:
            policy_stable = False

    if policy_stable:
        break
```

**Convergence**: Guaranteed, often in fewer iterations than value iteration.

**When to Use**: When the policy converges faster than the values (common).

**Key Difference from Value Iteration**:

- Value iteration: no explicit policy until the end
- Policy iteration: maintain and improve a policy each iteration

### Monte Carlo Methods

**Idea**: Estimate V^π(s) by averaging returns from state s.

```python
# First-visit MC
V = defaultdict(float)
counts = defaultdict(int)

for episode in range(num_episodes):
    trajectory = run_episode(policy)  # [(s_0, a_0, r_0), ..., (s_T, a_T, r_T)]

    G = 0
    visited = set()
    for (s, a, r) in reversed(trajectory):
        G = r + gamma * G  # Accumulate return

        if s not in visited:  # First-visit
            V[s] += G
            counts[s] += 1
            visited.add(s)

for s in counts:
    V[s] /= counts[s]  # Average return
```

**Advantages**:

- No model needed (model-free)
- Can handle stochastic environments
- Unbiased estimates

**Disadvantages**:

- Requires complete episodes (can't learn mid-episode)
- High variance (one trajectory is noisy)
- Slow convergence

**When to Use**: Episodic tasks, when a model is unavailable.

### Temporal Difference (TD) Learning

**Idea**: Update V after each step using bootstrapping.

**TD(0) Update**:

```python
V[s] += alpha * (r + gamma * V[s_next] - V[s])
#               \____________________________/
#                          TD error
```

**Bootstrapping**: Use the current estimate V[s_next] instead of the true return.

**Full Algorithm**:

```python
V = {s: 0 for s in states}

for episode in range(num_episodes):
    s = initial_state()
    done = False

    while not done:
        a = policy(s)
        s_next, r, done = environment.step(s, a)  # assumed env interface

        # TD update
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

        s = s_next
```

**Advantages**:

- No model needed (model-free)
- Can learn from incomplete episodes (online)
- Lower variance than MC

**Disadvantages**:

- Biased estimates (the bootstrap uses an estimate)
- Requires tuning α (learning rate)

**When to Use**: Model-free, need online learning.

### Q-Learning (TD for Q-values)

**TD for action-values Q(s,a)**:

```
Q[s,a] += alpha * (r + gamma * max_{a'} Q[s',a'] - Q[s,a])
```

**Full Algorithm**:

```python
Q = defaultdict(lambda: defaultdict(float))

for episode in range(num_episodes):
    s = initial_state()
    done = False

    while not done:
        # ε-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[s][act])

        s_next, r, done = environment.step(s, a)  # assumed env interface

        # Q-learning update (off-policy)
        Q[s][a] += alpha * (r + gamma * max(Q[s_next][a2] for a2 in actions) - Q[s][a])

        s = s_next
```

**Key**: Off-policy (learns the optimal Q regardless of the behavior policy).

**When to Use**: Model-free, discrete actions, want the optimal policy.

### SARSA (On-Policy TD)

**Difference from Q-learning**: Uses the next action actually chosen by the policy (on-policy).

```
Q[s,a] += alpha * (r + gamma * Q[s',a'] - Q[s,a])
#                              ^^^^^^^^
#                              Action from the policy, not max
```

**Full Algorithm** (the `epsilon_greedy` helper it assumes is sketched below):

```python
Q = defaultdict(lambda: defaultdict(float))

for episode in range(num_episodes):
    s = initial_state()
    a = epsilon_greedy(Q[s], epsilon)  # Choose first action
    done = False

    while not done:
        s_next, r, done = environment.step(s, a)
        a_next = epsilon_greedy(Q[s_next], epsilon)  # Next action from policy

        # SARSA update (on-policy)
        Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

        s, a = s_next, a_next
```

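The snippet above assumes an `epsilon_greedy` helper. A minimal version consistent with the Q-table layout used here (where `Q[s]` is a dict mapping actions to values) might look like:

```python
import random

# Assumed helper for the SARSA sketch above. Empty dicts (states with no
# estimates yet) fall back to a random action from the global action list.
def epsilon_greedy(q_s, epsilon, actions=("UP", "DOWN", "LEFT", "RIGHT")):
    if random.random() < epsilon or not q_s:
        return random.choice(list(actions))   # explore (or no estimates yet)
    return max(q_s, key=q_s.get)              # exploit the highest-valued action
```
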
**Difference from Q-learning**:

- Q-learning: learns the optimal policy (off-policy)
- SARSA: learns the policy being followed (on-policy)

**When to Use**: When you want the policy to reflect the exploration strategy.

### Algorithm Comparison

| Algorithm | Model? | Episodes? | Convergence | Use Case |
|-----------|--------|-----------|-------------|----------|
| Value Iteration | Yes (P, R) | No | Guaranteed | Small MDPs, known model |
| Policy Iteration | Yes (P, R) | No | Guaranteed, faster | Small MDPs, good init policy |
| Monte Carlo | No | Complete | Slow, high variance | Episodic, model-free |
| TD(0) | No | Partial | Faster, lower variance | Online, model-free |
| Q-Learning | No | Partial | Guaranteed* | Discrete actions, off-policy |
| SARSA | No | Partial | Guaranteed* | On-policy, safe exploration |

\*With appropriate exploration and learning rate schedule.

### Algorithm Pitfall #1: Using DP Without a Model

**Scenario**: User tries value iteration on a real robot (no model).

**Problem**: Value iteration requires P(s'|s,a) and R(s,a,s').

**Solution**: Use model-free methods (Q-learning, SARSA, policy gradients).

**Red Flag**: "Let's use policy iteration for Atari games." (No model available.)

### Algorithm Pitfall #2: Monte Carlo on Non-Episodic Tasks

**Scenario**: Continuing task (never terminates), try MC.

**Problem**: MC requires complete episodes to compute the return.

**Solution**: Use TD methods (learn from partial trajectories).

**Red Flag**: "Let's use MC for stock trading." (Continuing task.)

### Algorithm Pitfall #3: Confusing Q-Learning and SARSA

**Scenario**: User uses Q-learning but expects on-policy behavior.

**Example**: Cliff walking with ε-greedy

- Q-learning: Learns the optimal (risky) path along the cliff
- SARSA: Learns the safe path away from the cliff (accounts for exploration)

**Takeaway**:

- Q-learning: Learns the optimal policy (off-policy)
- SARSA: Learns the policy being followed (on-policy)

**Choose based on whether you want the optimal policy or a policy that accounts for exploration.**

## Part 7: Exploration vs Exploitation

### The Tradeoff

**Exploitation**: Choose the action with the highest known value (maximize immediate reward).

**Exploration**: Try new actions to discover if they're better (maximize long-term information).

**Dilemma**: You must explore to find the optimal policy, but exploration sacrifices short-term reward.

**Example**: Restaurant choice

- Exploitation: Go to your favorite restaurant (known good)
- Exploration: Try a new restaurant (might be better, might be worse)

### Why Exploration is Necessary

**Scenario**: GridWorld, Q-values initialized to 0.

**Without Exploration**:

```python
# Greedy policy: always argmax_a Q[s, a]
policy = lambda s: argmax(Q[s, :])  # All zeros initially, picks an arbitrary action
```

**Problem**: If the first action happens to be BAD, Q[s,a] becomes negative and is never tried again.

**Result**: Agent stuck in a suboptimal policy (local optimum).

**With Exploration**:

```python
# ε-greedy
if random.random() < epsilon:
    action = random.choice(actions)  # Explore
else:
    action = argmax(Q[s, :])         # Exploit
```

**Result**: Eventually tries all actions, discovers the optimum.

### ε-Greedy Exploration

**Algorithm**:

```python
def epsilon_greedy(state, Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # Explore with prob ε
    else:
        return argmax(Q[state, :])     # Exploit with prob 1-ε
```

**Tuning ε**:

- **ε = 0**: No exploration (greedy, can get stuck)
- **ε = 1**: Random policy (no exploitation, never converges)
- **ε = 0.1**: Common choice (10% exploration)

**Decay Schedule**:

```python
epsilon = max(epsilon_min, epsilon * decay_rate)
# Start high (ε=1.0), decay to low (ε=0.01)
```

**Rationale**: Explore heavily early, exploit more as you learn.

### Upper Confidence Bound (UCB)

**Idea**: Choose the action that balances value and uncertainty.

**UCB Formula**:

```
action = argmax_a ( Q[s,a] + c * sqrt(log(N[s]) / N[s,a]) )
#                   ^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#                   Exploitation      Exploration bonus
```

**Where**:

- N[s] = number of times state s was visited
- N[s,a] = number of times action a was taken in state s
- c = exploration constant

**Intuition**: Actions tried less often get an exploration bonus (uncertainty).

**Advantage over ε-greedy**: Adaptive exploration (focuses on uncertain actions).

### Optimistic Initialization

**Idea**: Initialize Q-values to high values (optimistic).

```python
Q = defaultdict(lambda: defaultdict(lambda: 10.0))  # Optimistic
```

**Effect**: All actions initially seem good, which encourages exploration.

**How it works** (see the bandit sketch below):

1. All Q-values start high (optimistic)
2. Agent tries an action, gets the real reward (likely lower)
3. That Q-value decreases, so the agent tries other actions
4. Continues until all actions are explored

**Advantage**: Simple, no ε parameter.

**Disadvantage**: Only works for finite action spaces, and exploration stops after the initial phase.

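A minimal sketch of optimistic initialization on a 3-armed bandit (true means 1, 5, 3 — the same setup as Example 5 later in this document; the seed is illustrative):

```python
import numpy as np

# Q starts far above any real reward, so greedy selection still tries every arm.
rng = np.random.default_rng(0)
true_q = [1.0, 5.0, 3.0]
Q = np.full(3, 10.0)          # optimistic start
counts = np.zeros(3)

for _ in range(200):
    a = int(np.argmax(Q))                  # purely greedy, no epsilon
    r = true_q[a] + rng.normal(0, 0.5)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]         # sample-average update drags Q down

print(np.round(Q, 2), counts)  # the best arm (index 1) ends up pulled most often
```
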
### Boltzmann Exploration (Softmax)

**Idea**: Choose actions probabilistically based on Q-values.

```python
def softmax(Q, temperature=1.0):
    exp_Q = np.exp(Q / temperature)
    return exp_Q / np.sum(exp_Q)

probs = softmax(Q[state, :])
action = np.random.choice(actions, p=probs)
```

**Temperature**:

- High temperature (τ→∞): Uniform random (more exploration)
- Low temperature (τ→0): Greedy (more exploitation)

**Advantage**: Naturally weights exploration by Q-values (poor actions are less likely).

**Disadvantage**: Requires tuning the temperature, computationally more expensive.

### Exploration Pitfall #1: No Exploration

**Scenario**: Pure greedy policy.

```python
action = argmax(Q[state, :])  # No randomness
```

**Problem**: Agent never explores, gets stuck in a local optimum.

**Example**: Q-values initialized to 0, first action is UP (arbitrary).

- Agent always chooses UP (Q still 0 for the others)
- Never discovers that RIGHT is optimal
- Stuck forever

**Solution**: Always use some exploration (ε-greedy with ε ≥ 0.01).

### Exploration Pitfall #2: Too Much Exploration

**Scenario**: ε = 0.5 (50% random actions).

**Problem**: Agent wastes time on known-bad actions.

**Effect**: Slow convergence, poor performance even after learning.

**Solution**: Decay ε over time (start high, end low).

```python
epsilon = max(0.01, epsilon * 0.995)  # Decay to 1%
```

### Exploration Pitfall #3: Exploration at Test Time

**Scenario**: Evaluating a learned policy with ε-greedy (ε=0.1).

**Problem**: Test performance is artificially low (10% random actions).

**Solution**: Use the greedy policy at test time.

```python
# Training
action = epsilon_greedy(state, Q, epsilon=0.1)

# Testing
action = argmax(Q[state, :])  # Greedy, no exploration
```

**Takeaway**: Exploration is for learning, not evaluation.

## Part 8: When Theory is Sufficient

### Theory vs Implementation

**When Understanding Theory is Enough**:

1. **Debugging**: Understanding the Bellman equation explains why Q-values aren't converging
2. **Hyperparameter Tuning**: Understanding γ explains why the agent is myopic
3. **Algorithm Selection**: Understanding model-free vs model-based explains why value iteration fails
4. **Conceptual Design**: Understanding exploration explains why the agent gets stuck

**When You Need Implementation**:

1. **Real Problems**: Toy examples don't teach debugging real environments
2. **Scaling**: Neural networks, replay buffers, parallel environments
3. **Engineering**: Practical details (learning rate schedules, reward clipping)

**This Skill's Scope**: Theory, intuition, foundations.

**Other Skills for Implementation**: value-based-methods, policy-gradient-methods, actor-critic-methods.

### What This Skill Taught You

**1. MDP Formulation**: S, A, P, R, γ - the framework for RL.

**2. Value Functions**: V(s) = expected cumulative reward, Q(s,a) = value of an action in a state.

**3. Bellman Equations**: Recursive decomposition, foundation of all algorithms.

**4. Discount Factor**: γ controls the planning horizon (≈ 1/(1-γ)).

**5. Policies**: Deterministic vs stochastic, optimal policy π*.

**6. Algorithms**:

- DP: Value iteration, policy iteration (model-based)
- MC: Monte Carlo (episodic, model-free)
- TD: Q-learning, SARSA (online, model-free)

**7. Exploration**: ε-greedy, UCB, necessary for learning.

**8. Theory-Practice Gap**: When theory suffices vs when to implement.

### Next Steps

After mastering foundations, route to:

**For Discrete Actions**:

- **value-based-methods**: DQN, Double DQN, Dueling DQN (Q-learning + neural networks)

**For Continuous Actions**:

- **actor-critic-methods**: SAC, TD3, A2C (policy + value function)

**For Any Action Space**:

- **policy-gradient-methods**: REINFORCE, PPO (direct policy optimization)

**For Debugging**:

- **rl-debugging**: Why the agent is not learning, reward issues, convergence problems

**For Environment Setup**:

- **rl-environments**: Gym, custom environments, wrappers

## Part 9: Common Pitfalls

### Pitfall #1: Skipping MDP Formulation

**Symptom**: Implementing Q-learning without defining states, actions, and rewards clearly.

**Consequence**: The algorithm fails, and the user doesn't know why.

**Solution**: Always answer:

- What are the states? (Markovian?)
- What are the actions? (Discrete/continuous?)
- What is the reward function? (Sparse/dense?)
- What is the discount factor? (Based on horizon?)

### Pitfall #2: Confusing Value and Reward

**Symptom**: Setting V(s) = r(s).

**Consequence**: Ignores future rewards, policy suboptimal.

**Solution**: V(s) = E[r + γr' + γ²r'' + ...], not just r.

### Pitfall #3: Arbitrary Discount Factor

**Symptom**: "Let's use γ=0.9 because it's common."

**Consequence**: Agent can't see long-term goals (if γ too small) or values diverge (if γ=1 in a continuing task).

**Solution**: Choose γ based on the horizon (γ = 1 - 1/horizon).

### Pitfall #4: No Exploration

**Symptom**: Pure greedy policy during learning.

**Consequence**: Agent stuck in a local optimum.

**Solution**: ε-greedy with ε ≥ 0.01, decayed over time.

### Pitfall #5: Using DP Without a Model

**Symptom**: Trying value iteration on a real robot.

**Consequence**: The algorithm requires P(s'|s,a) and R(s,a,s'), which are unknown.

**Solution**: Use model-free methods (Q-learning, policy gradients).

### Pitfall #6: Monte Carlo on Continuing Tasks

**Symptom**: Using MC on a task that never terminates.

**Consequence**: Cannot compute the return (the episode never ends).

**Solution**: Use TD methods (learn from partial trajectories).

### Pitfall #7: Confusing Q-Learning and SARSA

**Symptom**: Using Q-learning but expecting safe exploration.

**Consequence**: Q-learning learns the optimal (risky) policy, ignoring exploration safety.

**Solution**: Use SARSA for safe on-policy learning, Q-learning for optimal off-policy learning.

### Pitfall #8: Exploration at Test Time

**Symptom**: Evaluating with ε-greedy (ε > 0).

**Consequence**: Test performance artificially low.

**Solution**: Greedy policy at test time (ε=0).

### Pitfall #9: Treating Bellman as a Black Box

**Symptom**: Using the Q-learning update without understanding why.

**Consequence**: Cannot debug convergence issues or tune hyperparameters.

**Solution**: Derive the Bellman equation, understand bootstrapping.

### Pitfall #10: Ignoring Transition Probabilities

**Symptom**: Using the deterministic Bellman equation in a stochastic environment.

**Consequence**: Wrong value estimates.

**Solution**: Weight by P(s'|s,a) in stochastic environments.

## Part 10: Rationalization Resistance

### Rationalization Table

| Rationalization | Reality | Counter-Guidance | Red Flag |
|-----------------|---------|------------------|----------|
| "I'll just copy Q-learning code" | Doesn't understand Q(s,a) meaning, cannot debug | "Let's understand what Q represents: expected cumulative reward. Why does the Bellman equation have max?" | Jumping to code without theory |
| "V(s) is the reward at state s" | V is cumulative, r is immediate | "V(s) = E[r + γr' + ...], not just r. Value is long-term." | Confusing value and reward |
| "γ=0.9 is standard" | γ depends on task horizon | "What's your task horizon? γ=0.9 means ~10 steps. Need more?" | Arbitrary discount factor |
| "I don't need exploration, greedy is fine" | Gets stuck in local optimum | "Without exploration, you never try new actions. Use ε-greedy." | No exploration strategy |
| "Value iteration for Atari" | Atari doesn't have a model (P, R unknown) | "Value iteration needs a full model. Use model-free (DQN)." | DP on model-free problem |
| "Monte Carlo for continuing task" | MC requires episodes (termination) | "MC needs complete episodes. Use TD for continuing tasks." | MC on continuing task |
| "Q-learning and SARSA are the same" | Q-learning off-policy, SARSA on-policy | "Q-learning learns optimal, SARSA learns policy followed." | Confusing on-policy and off-policy |
| "I'll test with ε-greedy (ε=0.1)" | Test should be greedy (exploit only) | "Exploration is for learning. Test with ε=0 (greedy)." | Exploration at test time |
| "Bellman equation is just a formula" | It's the foundation of all algorithms | "Derive it. Understand why V(s) = r + γV(s'). Enables debugging." | Black-box understanding |
| "Deterministic transition, no need for P" | Correct, but must recognize when stochastic | "If stochastic, must weight by P(s'\|s,a). Check environment." | Ignoring stochasticity |

## Part 11: Red Flags

Watch for these signs of misunderstanding:

- [ ] **Skipping MDP Formulation**: Implementing an algorithm without defining S, A, P, R, γ
- [ ] **Value-Reward Confusion**: Treating V(s) as immediate reward instead of cumulative
- [ ] **Arbitrary γ**: Choosing a discount factor without considering the task horizon
- [ ] **No Exploration**: Pure greedy policy during learning
- [ ] **DP Without Model**: Using value/policy iteration when no model is available
- [ ] **MC on Continuing**: Using Monte Carlo on non-episodic tasks
- [ ] **Q-SARSA Confusion**: Not understanding on-policy vs off-policy
- [ ] **Test Exploration**: Using ε-greedy during evaluation
- [ ] **Bellman Black Box**: Using TD updates without understanding the Bellman equation
- [ ] **Ignoring Stochasticity**: Forgetting transition probabilities in stochastic environments
- [ ] **Planning Horizon Mismatch**: γ=0.9 for a task requiring 100-step planning
- [ ] **Policy-Value Confusion**: Confusing π(s) and V(s), or Q(s,a) and π(a|s)

**If any red flag triggered → Explain theory → Derive equation → Connect to algorithm**

## Part 12: Code Examples

### Example 1: Value Iteration on GridWorld

```python
import numpy as np

# GridWorld: 4x4, goal at (3,3), walls at (1,1) and (2,2)
grid_size = 4
goal = (3, 3)
walls = {(1, 1), (2, 2)}

# MDP definition
gamma = 0.9
actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']

def next_state(s, a):
    """Deterministic transition"""
    x, y = s
    if a == 'UP': x -= 1
    elif a == 'DOWN': x += 1
    elif a == 'LEFT': y -= 1
    elif a == 'RIGHT': y += 1

    # Boundary check
    x = max(0, min(grid_size - 1, x))
    y = max(0, min(grid_size - 1, y))

    # Wall check
    if (x, y) in walls:
        return s  # Bounce back
    return (x, y)

def reward(s, a, s_next):
    """Reward function"""
    if s_next == goal:
        return 10
    elif s_next in walls:
        return -5  # never triggered here: next_state() bounces off walls
    else:
        return -1

# Value Iteration
V = np.zeros((grid_size, grid_size))
threshold = 0.01
max_iterations = 1000

for iteration in range(max_iterations):
    V_new = np.zeros((grid_size, grid_size))

    for x in range(grid_size):
        for y in range(grid_size):
            s = (x, y)

            if s == goal or s in walls:
                V_new[x, y] = 0  # Terminal state / unreachable wall
                continue

            # Bellman optimality backup
            values = []
            for a in actions:
                s_next = next_state(s, a)
                r = reward(s, a, s_next)
                value = r + gamma * V[s_next[0], s_next[1]]
                values.append(value)

            V_new[x, y] = max(values)

    # Check convergence (after adopting the new values)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < threshold:
        print(f"Converged in {iteration} iterations")
        break

# Extract policy
policy = {}
for x in range(grid_size):
    for y in range(grid_size):
        s = (x, y)
        if s == goal or s in walls:
            policy[s] = None
            continue

        best_action = None
        best_value = -float('inf')
        for a in actions:
            s_next = next_state(s, a)
            r = reward(s, a, s_next)
            value = r + gamma * V[s_next[0], s_next[1]]
            if value > best_value:
                best_value = value
                best_action = a
        policy[s] = best_action

print("Value Function:")
print(np.round(V, 2))
print("\nOptimal Policy:")
for x in range(grid_size):
    row = []
    for y in range(grid_size):
        if (x, y) == goal:
            symbol = 'G'
        elif (x, y) in walls:
            symbol = '#'
        else:
            symbol = {'UP': '↑', 'DOWN': '↓', 'LEFT': '←', 'RIGHT': '→'}[policy[(x, y)]]
        row.append(symbol)
    print(' '.join(row))
```

**Output**:

```
Converged in 6 iterations
Value Function:
[[ 1.81  3.12  4.58  6.2 ]
 [ 3.12  0.    6.2   8.  ]
 [ 4.58  6.2   0.   10.  ]
 [ 6.2   8.   10.    0.  ]]

Optimal Policy:
↓ → ↓ ↓
↓ # → ↓
↓ ↓ # ↓
→ → → G
```

**Key Observations**:

- Values increase as you get closer to the goal
- The policy points toward the goal (shortest path)
- Walls (shown as #, value 0) are unreachable and avoided

### Example 2: Q-Learning on GridWorld

```python
import numpy as np
import random

# Same GridWorld setup (reuses next_state() and reward() from Example 1)
grid_size = 4
goal = (3, 3)
walls = {(1, 1), (2, 2)}
actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']
gamma = 0.9
alpha = 0.1    # Learning rate
epsilon = 0.1  # Exploration

# Q-table
Q = {}
for x in range(grid_size):
    for y in range(grid_size):
        for a in actions:
            Q[((x, y), a)] = 0.0

def epsilon_greedy(s, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    else:
        # Greedy
        best_action = actions[0]
        best_value = Q[(s, best_action)]
        for a in actions:
            if Q[(s, a)] > best_value:
                best_value = Q[(s, a)]
                best_action = a
        return best_action

# Training
num_episodes = 1000

for episode in range(num_episodes):
    s = (0, 0)  # Start state

    while s != goal:
        # Choose action
        a = epsilon_greedy(s, epsilon)

        # Take action
        s_next = next_state(s, a)
        r = reward(s, a, s_next)

        # Q-learning update
        if s_next == goal:
            max_Q_next = 0  # Terminal
        else:
            max_Q_next = max(Q[(s_next, a_prime)] for a_prime in actions)

        Q[(s, a)] += alpha * (r + gamma * max_Q_next - Q[(s, a)])

        s = s_next

# Extract policy
print("Learned Policy:")
for x in range(grid_size):
    row = []
    for y in range(grid_size):
        s = (x, y)
        if s == goal:
            row.append('G')
        elif s in walls:
            row.append('#')
        else:
            best_action = max(actions, key=lambda a: Q[(s, a)])
            row.append({'UP': '↑', 'DOWN': '↓', 'LEFT': '←', 'RIGHT': '→'}[best_action])
    print(' '.join(row))
```

**Output** (similar to value iteration; tie-breaking and exploration noise can flip some arrows):

```
↓ → ↓ ↓
↓ # → ↓
↓ ↓ # ↓
→ → → G
```

**Key Differences from Value Iteration**:

- Q-learning is model-free (doesn't need P, R)
- Learns from experience (episodes)
- Uses ε-greedy exploration
- Requires many episodes to converge

### Example 3: Policy Evaluation (MC vs TD)

```python
import numpy as np
from collections import defaultdict

# Simple chain MDP: s0 → s1 → s2 → goal (state 3)
# Deterministic policy: always go right
# Reward: -1 per step, +10 on the final step into the goal
# gamma = 0.9

gamma = 0.9

# Monte Carlo Policy Evaluation
def mc_policy_evaluation(num_episodes=1000):
    V = defaultdict(float)
    counts = defaultdict(int)

    for _ in range(num_episodes):
        # Generate episode: (state, reward received on leaving that state)
        trajectory = [
            (0, -1),
            (1, -1),
            (2, 10),  # final step into the goal
        ]

        # Compute returns
        G = 0
        visited = set()
        for s, r in reversed(trajectory):
            G = r + gamma * G
            if s not in visited:  # First-visit
                V[s] += G
                counts[s] += 1
                visited.add(s)

    for s in V:
        V[s] /= counts[s]

    return V

# TD(0) Policy Evaluation
def td_policy_evaluation(num_episodes=1000, alpha=0.1):
    V = defaultdict(float)

    for _ in range(num_episodes):
        s = 0

        while s != 3:  # Until goal
            # Take action (deterministic policy)
            s_next = s + 1
            r = 10 if s_next == 3 else -1

            # TD update
            V[s] += alpha * (r + gamma * V[s_next] - V[s])

            s = s_next

    return V

# Compare
V_mc = mc_policy_evaluation()
V_td = td_policy_evaluation()

print("Monte Carlo V:")
print({s: round(V_mc[s], 2) for s in [0, 1, 2]})

print("\nTD(0) V:")
print({s: round(V_td[s], 2) for s in [0, 1, 2]})

# True values (analytical)
V_true = {
    0: -1 + gamma * (-1 + gamma * 10),
    1: -1 + gamma * 10,
    2: 10.0,
}
print("\nTrue V:")
print({s: round(V_true[s], 2) for s in [0, 1, 2]})
```

**Output**:

```
Monte Carlo V:
{0: 6.2, 1: 8.0, 2: 10.0}

TD(0) V:
{0: 6.2, 1: 8.0, 2: 10.0}

True V:
{0: 6.2, 1: 8.0, 2: 10.0}
```

**Observations**:

- Both MC and TD converge to the true values
- TD uses bootstrapping (updates before the episode ends)
- MC waits for the complete episode

### Example 4: Discount Factor Impact

```python
import numpy as np

# Simple MDP: chain of 10 states, +1 reward at the end
# Compare different gamma values

def value_iteration_chain(gamma, num_states=10):
    V = np.zeros(num_states + 1)  # +1 for goal

    # Value iteration
    for _ in range(100):
        V_new = np.zeros(num_states + 1)
        for s in range(num_states):
            # Deterministic: s → s+1, reward = +1 on the step into the goal
            s_next = s + 1
            r = 1 if s_next == num_states else 0
            V_new[s] = r + gamma * V[s_next]
        V = V_new

    return V[:num_states]  # Exclude goal

# Compare gamma values
for gamma in [0.5, 0.9, 0.99, 1.0]:
    V = value_iteration_chain(gamma)
    horizon = 1 / (1 - gamma) if gamma < 1 else float('inf')
    print(f"γ={gamma}:")
    print(f"  V(s_0) = {V[0]:.4f}")
    print(f"  V(s_5) = {V[5]:.4f}")
    print(f"  V(s_9) = {V[9]:.4f}")
    print(f"  Effective horizon = {horizon:.1f}\n")
```

**Output**:

```
γ=0.5:
  V(s_0) = 0.0020
  V(s_5) = 0.0625
  V(s_9) = 1.0000
  Effective horizon = 2.0

γ=0.9:
  V(s_0) = 0.3874
  V(s_5) = 0.6561
  V(s_9) = 1.0000
  Effective horizon = 10.0

γ=0.99:
  V(s_0) = 0.9135
  V(s_5) = 0.9606
  V(s_9) = 1.0000
  Effective horizon = 100.0

γ=1.0:
  V(s_0) = 1.0000
  V(s_5) = 1.0000
  V(s_9) = 1.0000
  Effective horizon = inf
```

**Key Insights**:

- γ=0.5: Value at s_0 is tiny (can't "see" a reward ~10 steps away)
- γ=0.9: Moderate values (horizon ≈ 10, matches the task length)
- γ=0.99: High values (can plan far ahead)
- γ=1.0: All states have the same value (no discounting)

**Lesson**: Choose γ based on how far ahead the agent must plan.

### Example 5: Exploration Comparison

```python
import numpy as np
import random

# Simple bandit: 3 actions, true Q* = [1.0, 5.0, 3.0]
# Compare exploration strategies

true_Q = [1.0, 5.0, 3.0]
num_actions = 3

def sample_reward(action):
    """Stochastic reward"""
    return true_Q[action] + np.random.randn() * 0.5

# Strategy 1: ε-greedy
def epsilon_greedy_experiment(epsilon=0.1, num_steps=1000):
    Q = [0.0] * num_actions
    counts = [0] * num_actions

    total_reward = 0
    for _ in range(num_steps):
        # Choose action
        if random.random() < epsilon:
            action = random.randint(0, num_actions - 1)
        else:
            action = np.argmax(Q)

        # Observe reward
        reward = sample_reward(action)
        total_reward += reward

        # Update Q
        counts[action] += 1
        Q[action] += (reward - Q[action]) / counts[action]

    return total_reward / num_steps

# Strategy 2: UCB
def ucb_experiment(c=2.0, num_steps=1000):
    Q = [0.0] * num_actions
    counts = [0] * num_actions

    # Initialize: try each action once
    for a in range(num_actions):
        reward = sample_reward(a)
        counts[a] = 1
        Q[a] = reward

    total_reward = 0
    for t in range(num_actions, num_steps):
        # UCB action selection
        ucb_values = [Q[a] + c * np.sqrt(np.log(t) / counts[a])
                      for a in range(num_actions)]
        action = np.argmax(ucb_values)

        # Observe reward
        reward = sample_reward(action)
        total_reward += reward

        # Update Q
        counts[action] += 1
        Q[action] += (reward - Q[action]) / counts[action]

    return total_reward / num_steps

# Strategy 3: Greedy (no exploration)
def greedy_experiment(num_steps=1000):
    Q = [0.0] * num_actions
    counts = [0] * num_actions

    total_reward = 0
    for _ in range(num_steps):
        action = np.argmax(Q)
        reward = sample_reward(action)
        total_reward += reward

        counts[action] += 1
        Q[action] += (reward - Q[action]) / counts[action]

    return total_reward / num_steps

# Compare (average over 100 runs)
num_runs = 100

greedy_rewards = [greedy_experiment() for _ in range(num_runs)]
epsilon_rewards = [epsilon_greedy_experiment() for _ in range(num_runs)]
ucb_rewards = [ucb_experiment() for _ in range(num_runs)]

print(f"Greedy:   {np.mean(greedy_rewards):.2f} ± {np.std(greedy_rewards):.2f}")
print(f"ε-greedy: {np.mean(epsilon_rewards):.2f} ± {np.std(epsilon_rewards):.2f}")
print(f"UCB:      {np.mean(ucb_rewards):.2f} ± {np.std(ucb_rewards):.2f}")
print(f"\nOptimal:  {max(true_Q):.2f}")
```

**Output**:

```
Greedy:   1.05 ± 0.52
ε-greedy: 4.62 ± 0.21
UCB:      4.83 ± 0.18

Optimal:  5.00
```

**Insights**:

- Greedy: Gets stuck on the first action (often suboptimal)
- ε-greedy: Explores, finds near-optimal
- UCB: Slightly better, focuses exploration on uncertain actions

**Lesson**: Exploration is critical. UCB > ε-greedy > greedy.

## Part 13: When to Route Elsewhere

This skill covers **theory and foundations**. Route to other skills for:

**Implementation**:

- **value-based-methods**: DQN, Double DQN, Dueling DQN (Q-learning + neural networks)
- **policy-gradient-methods**: REINFORCE, PPO, TRPO (policy optimization)
- **actor-critic-methods**: A2C, SAC, TD3 (policy + value)

**Debugging**:

- **rl-debugging**: Agent not learning, reward issues, convergence problems

**Infrastructure**:

- **rl-environments**: Gym API, custom environments, wrappers

**Special Topics**:

- **exploration-strategies**: Curiosity, RND, intrinsic motivation
- **reward-shaping**: Potential-based shaping, inverse RL
- **multi-agent-rl**: QMIX, MADDPG, cooperative/competitive
- **offline-rl**: CQL, IQL, learning from fixed datasets
- **model-based-rl**: MBPO, Dreamer, world models

**Evaluation**:

- **rl-evaluation**: Proper evaluation methodology, metrics

## Summary

**You now understand**:

1. **MDP**: S, A, P, R, γ - the framework for RL
2. **Value Functions**: V(s) = cumulative expected reward, Q(s,a) = value of an action
3. **Bellman Equations**: Recursive decomposition, foundation of algorithms
4. **Discount Factor**: γ controls the planning horizon (≈ 1/(1-γ))
5. **Policies**: π(s) maps states to actions, π* is optimal
6. **Algorithms**: DP (value/policy iteration), MC, TD (Q-learning, SARSA)
7. **Exploration**: ε-greedy, UCB, necessary for learning
8. **Theory-Practice**: When understanding suffices vs when to implement

**Key Takeaways**:

- **MDP formulation comes first** (define S, A, P, R, γ before implementing)
- **Value ≠ Reward** (V is cumulative, r is immediate)
- **γ is not arbitrary** (choose based on task horizon)
- **Exploration is mandatory** (ε-greedy, UCB, not pure greedy)
- **Algorithm families differ** (DP needs a model, MC needs episodes, TD is most flexible)
- **Bellman equations enable everything** (understand them to debug algorithms)

**Next**: Route to implementation skills (value-based, policy-gradient, actor-critic) to build real agents.

**This foundation will enable you to implement, debug, and extend RL algorithms effectively.**