# RL Foundations
## When to Use This Skill
Invoke this skill when you encounter:
- **New to RL**: User asks "what is RL" or "how does RL work"
- **Theory Questions**: MDP, value functions, Bellman equations, policy optimization
- **Conceptual Confusion**: Mixing up V(s) and Q(s,a), value iteration vs policy iteration
- **Before Implementation**: User wants to implement RL algorithms without understanding foundations
- **Debugging Theory**: Why discount factor matters, why exploration needed, how algorithms differ
- **Foundation Check**: User jumps to DQN/PPO without understanding MDPs
**This skill provides the theoretical foundation for ALL other deep-rl skills.**
Do NOT use this skill for:
- Algorithm implementation (use value-based-methods, policy-gradient-methods, actor-critic-methods)
- Debugging code (use rl-debugging)
- Environment setup (use rl-environments)
## Core Principle
**Understanding the theory enables everything else.**
Reinforcement learning is built on a rigorous mathematical foundation:
1. **MDP** (Markov Decision Process) - the framework
2. **Value Functions** - quantify expected return
3. **Bellman Equations** - recursive decomposition
4. **Optimal Policy** - maximize expected return
5. **Algorithms** - methods to find optimal policy
**Without this foundation, you're copy-pasting code you cannot debug, adapt, or extend.**
## Part 1: Markov Decision Process (MDP)
### What is an MDP?
An MDP is the mathematical framework for sequential decision-making under uncertainty.
**Formal Definition**: A Markov Decision Process is a 5-tuple (S, A, P, R, γ):
- **S**: State space (set of all possible states)
- **A**: Action space (set of all possible actions)
- **P**: Transition probability P(s'|s,a) - probability of reaching state s' from state s after action a
- **R**: Reward function R(s,a,s') - immediate reward for transition
- **γ**: Discount factor (0 ≤ γ ≤ 1) - controls importance of future rewards
**Key Property**: **Markov Property**
```
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
```
**Meaning**: The future depends only on the present state, not the history.
**Why this matters**: It enables recursive algorithms (Bellman equations). If the Markov property is violated, standard RL algorithms may fail.
### Example 1: GridWorld MDP
**Problem**: Agent navigates 4x4 grid to reach goal.
```
S = {(0,0), (0,1), ..., (3,3)} # 16 states
A = {UP, DOWN, LEFT, RIGHT} # 4 actions
R = -1 for each step, +10 at goal
γ = 0.9
P: Deterministic (up always moves up if not wall)
```
**Visualization**:
```
S . . .
. . . .
. # . . # = wall
. . . G G = goal (+10)
```
**Transition Example**:
- State s = (1,1), Action a = RIGHT
- Deterministic: P(s'=(1,2) | s=(1,1), a=RIGHT) = 1.0
- Reward: R(s,a,s') = -1
- Next state: s' = (1,2)
**Markov Property Holds**: Future position depends only on current position and action, not how you got there.
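To make the 5-tuple concrete, here is a minimal sketch that encodes this GridWorld as plain Python data and reproduces the transition above. The `P[s][a]` / `R[s][a]` dictionary layout is just one convenient convention, and the wall is omitted for brevity.
```python
# Minimal sketch: the deterministic GridWorld MDP as plain Python data.
# Convention (an assumption, not required by anything above): P[s][a] -> next state,
# R[s][a] -> immediate reward. The wall is omitted to keep the sketch short.
GRID, GOAL, GAMMA = 4, (3, 3), 0.9
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
states = [(x, y) for x in range(GRID) for y in range(GRID)]
P, R = {}, {}
for s in states:
    P[s], R[s] = {}, {}
    for a, (dx, dy) in ACTIONS.items():
        nx = min(max(s[0] + dx, 0), GRID - 1)   # clip at the grid boundary
        ny = min(max(s[1] + dy, 0), GRID - 1)
        P[s][a] = (nx, ny)                      # deterministic transition
        R[s][a] = 10 if (nx, ny) == GOAL else -1
# Reproduce the transition example: RIGHT from (1,1) lands in (1,2) with reward -1
print(P[(1, 1)]["RIGHT"], R[(1, 1)]["RIGHT"])   # (1, 2) -1
```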
### Example 2: Stochastic GridWorld
**Modification**: Actions succeed with probability 0.8, move perpendicular with probability 0.1 each.
```
P((1,2) | (1,1), RIGHT) = 0.8 # intended
P((0,1) | (1,1), RIGHT) = 0.1 # slip up
P((2,1) | (1,1), RIGHT) = 0.1 # slip down
```
**Why Stochastic**: Models real-world uncertainty (robot actuators, wind, slippery surfaces).
**Consequence**: Agent must consider probabilities when choosing actions.
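Computationally, this means any one-step lookahead from (1,1) must weight each possible next state by its probability. A minimal sketch, with made-up next-state values used purely for illustration:
```python
# Expected one-step value of RIGHT from (1,1) under the stochastic model above.
# The next-state values V are made-up numbers, only to show the weighted sum.
gamma, step_reward = 0.9, -1
transition = {(1, 2): 0.8,   # intended move
              (0, 1): 0.1,   # slip up
              (2, 1): 0.1}   # slip down
V = {(1, 2): 6.0, (0, 1): 2.0, (2, 1): 4.0}   # assumed values of the next states
q_right = sum(p * (step_reward + gamma * V[s_next]) for s_next, p in transition.items())
print(f"one-step estimate of Q((1,1), RIGHT) = {q_right:.2f}")
```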
### Example 3: Continuous State MDP (Cartpole)
```
S ⊂ ℝ⁴: (cart_position, cart_velocity, pole_angle, pole_angular_velocity)
A = {LEFT, RIGHT} # discrete actions, continuous state
R = +1 for each timestep upright
γ = 0.99
P: Physics-based transition (continuous dynamics)
```
**Key Difference**: State space is continuous, requires function approximation (neural networks).
**Still an MDP**: Markov property holds (physics is Markovian given state).
### When is Markov Property Violated?
**Example: Poker**
```
State: Current cards visible
Markov Violated: Opponents' strategies depend on past betting patterns
```
**Solution**: Augment state with history (last N actions), or use partially observable MDP (POMDP).
**Example: Robot with Noisy Sensors**
```
State: Raw sensor reading (single frame)
Markov Violated: True position requires integrating multiple frames
```
**Solution**: Stack frames (last 4 frames as state), or use recurrent network (LSTM).
### Episodic vs Continuing Tasks
**Episodic**: Task terminates (games, reaching goal)
```
Episode: s₀ → s₁ → ... → s_T (terminal state)
Return: G_t = r_t + γr_{t+1} + ... + γ^{T-t}r_T
```
**Continuing**: Task never ends (stock trading, robot operation)
```
Return: G_t = r_t + γr_{t+1} + γ²r_{t+2} + ... (infinite)
```
**Critical**: Continuing tasks REQUIRE γ < 1 (else return infinite).
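A small sketch of how a return is accumulated from a reward sequence, and why γ < 1 keeps the continuing case bounded (the reward sequences below are illustrative):
```python
# Sketch: discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):   # backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g
# Episodic: a short episode ending in a goal reward
print(discounted_return([-1, -1, -1, 10], gamma=0.9))   # -1 - 0.9 - 0.81 + 7.29 = 4.58
# Continuing: r = 1 forever; with gamma < 1 the return stays below 1/(1-gamma) = 10
print(discounted_return([1.0] * 10_000, gamma=0.9))     # ~= 10
```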
### MDP Pitfall #1: Using Wrong State Representation
**Bad**: State = current frame only (when velocity matters)
```python
# Pong: Ball position alone doesn't tell velocity
state = current_frame # WRONG - not Markovian
```
**Good**: State = last 4 frames (velocity from difference)
```python
# Frame stacking preserves Markov property
state = np.concatenate([frame_t, frame_t_minus_1, frame_t_minus_2, frame_t_minus_3])  # current frame plus the three previous frames
```
**Why**: Ball velocity = (position_t - position_{t-1}) / dt, so the state must include history.
### MDP Pitfall #2: Reward Function Shapes Behavior
**Example**: Robot navigating to goal
**Bad Reward**:
```python
reward = +1 if at_goal else 0 # Sparse
```
**Problem**: No signal until goal reached, hard to learn.
**Better Reward**:
```python
reward = -distance_to_goal # Dense
```
**Problem**: Agent learns to get closer but may not reach goal (local optimum).
**Best Reward** (Potential-Based Shaping):
```python
reward = (distance_prev - distance_curr) + large_bonus_at_goal
```
**Why**: Encourages progress + explicit goal reward.
**Takeaway**: Reward function engineering is CRITICAL. Route to reward-shaping skill for details.
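Before routing there, a minimal sketch of the potential-based form mentioned above, F(s, s') = γΦ(s') - Φ(s) with Φ(s) = -distance_to_goal; the distance function, goal, and γ below are stand-ins for illustration:
```python
# Sketch of potential-based shaping: shaped r = env reward + gamma*phi(s') - phi(s).
# Using phi(s) = -distance_to_goal rewards progress without changing the optimal policy.
gamma = 0.99

def phi(state, goal=(3, 3)):
    # Stand-in potential: negative Manhattan distance to an assumed goal cell.
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(env_reward, state, next_state):
    return env_reward + gamma * phi(next_state) - phi(state)

# Moving from 2 steps away to 1 step away earns a small bonus on top of the -1 step cost:
print(round(shaped_reward(env_reward=-1, state=(3, 1), next_state=(3, 2)), 3))   # 0.01
```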
### MDP Formulation Checklist
Before implementing any RL algorithm, answer:
- [ ] **States**: What information defines the situation? Is it Markovian?
- [ ] **Actions**: What can the agent do? Discrete or continuous?
- [ ] **Transitions**: Deterministic or stochastic? Do you know P(s'|s,a)?
- [ ] **Rewards**: Immediate reward for each transition? Sparse or dense?
- [ ] **Discount**: Episodic (can use γ=1) or continuing (need γ<1)?
- [ ] **Markov Property**: Does current state fully determine future?
**If you cannot answer these, you cannot implement RL algorithms effectively.**
## Part 2: Value Functions
### What is a Value Function?
A value function quantifies "how good" a state (or state-action pair) is.
**State-Value Function V^π(s)**:
```
V^π(s) = E_π[G_t | s_t = s]
= E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s]
```
**Meaning**: Expected cumulative discounted reward starting from state s and following policy π.
**Action-Value Function Q^π(s,a)**:
```
Q^π(s,a) = E_π[G_t | s_t = s, a_t = a]
= E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s, a_t = a]
```
**Meaning**: Expected cumulative discounted reward starting from state s, taking action a, then following policy π.
**Relationship**:
```
V^π(s) = Σ_a π(a|s) Q^π(s,a)
```
**Intuition**: V(s) = value of state, Q(s,a) = value of state-action pair.
### Critical Distinction: Value vs Reward
**Reward r(s,a)**: Immediate, one-step payoff.
**Value V(s)**: Long-term, cumulative expected reward.
**Example: GridWorld**
```
Reward: r = -1 every step, r = +10 at goal
Value at state 2 steps from goal:
V(s) ≈ -1 + γ(-1) + γ²(+10)
= -1 - 0.9 + 0.81*10
= -1.9 + 8.1 = 6.2
```
**Key**: Value is higher than immediate reward because it accounts for future goal reward.
**Common Mistake**: Setting V(s) = r(s). This ignores all future rewards.
### Example: Computing V^π for Simple Policy
**GridWorld**: 3x3 grid, goal at (2,2), γ=0.9, r=-1 per step.
**Policy π**: Always move right or down (deterministic).
**Manual Calculation**:
```
V^π((2,2)) = 0 (goal, no future rewards)
V^π((2,1)) = r + γ V^π((2,2))
= -1 + 0.9 * 0 = -1
V^π((1,2)) = r + γ V^π((2,2))
= -1 + 0.9 * 0 = -1
V^π((1,1)) = r + γ V^π((1,2)) (assuming action = DOWN)
= -1 + 0.9 * (-1) = -1.9
V^π((0,0)) = r + γ V^π((0,1))
= ... (depends on path)
```
**Observation**: Values decrease as distance from goal increases (more -1 rewards to collect).
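The same evaluation can be done programmatically. A minimal sketch, assuming the concrete policy "move RIGHT until the last column, then DOWN" (one instance of "right or down") and the conventions above (r = -1 per step, γ = 0.9, goal (2,2) terminal with V = 0):
```python
# Sketch: iterative evaluation of a fixed deterministic policy on the 3x3 grid,
# sweeping V(s) = r + gamma * V(next(s)) until the values settle.
gamma, goal = 0.9, (2, 2)
states = [(x, y) for x in range(3) for y in range(3)]

def policy_next(s):
    x, y = s
    return (x, y + 1) if y < 2 else (x + 1, y)   # RIGHT until the last column, then DOWN

V = {s: 0.0 for s in states}
for _ in range(100):                 # far more sweeps than needed to converge here
    for s in states:
        if s == goal:
            continue                 # terminal state keeps V = 0
        V[s] = -1 + gamma * V[policy_next(s)]

print(round(V[(2, 1)], 2), round(V[(1, 1)], 2))   # -1.0 -1.9, matching the hand calculation
```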
### Optimal Value Functions
**Optimal State-Value Function V*(s)**:
```
V*(s) = max_π V^π(s)
```
**Meaning**: Maximum value achievable from state s under ANY policy.
**Optimal Action-Value Function Q*(s,a)**:
```
Q*(s,a) = max_π Q^π(s,a)
```
**Meaning**: Maximum value achievable from state s, taking action a, then acting optimally.
**Optimal Policy π***:
```
π*(s) = argmax_a Q*(s,a)
```
**Meaning**: Policy that achieves V*(s) at all states.
**Key Insight**: If you know Q*(s,a), optimal policy is trivial (pick action with max Q).
### Value Function Pitfall #1: Confusing V and Q
**Wrong Understanding**:
- V(s) = value of state s
- Q(s,a) = value of action a (WRONG - ignores state)
**Correct Understanding**:
- V(s) = value of state s (average over actions under policy)
- Q(s,a) = value of taking action a IN STATE s
**Example**: GridWorld
```
State s = (1,1)
V(s) might be 5.0 (average value under policy)
Q(s, RIGHT) = 6.0 (moving right is good)
Q(s, LEFT) = 2.0 (moving left is bad)
Q(s, UP) = 4.0
Q(s, DOWN) = 7.0 (moving down is best)
V(s) = π(RIGHT|s)*6 + π(LEFT|s)*2 + π(UP|s)*4 + π(DOWN|s)*7
```
**Takeaway**: Q depends on BOTH state and action. V depends only on state.
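A one-line check of the relationship V(s) = Σ_a π(a|s) Q(s,a), using the Q-values above and an assumed policy distribution (the probabilities are illustrative, not part of the example):
```python
# Sketch: V(s) as the policy-weighted average of Q(s, a).
Q = {"RIGHT": 6.0, "LEFT": 2.0, "UP": 4.0, "DOWN": 7.0}    # Q-values from the example
pi = {"RIGHT": 0.4, "LEFT": 0.1, "UP": 0.1, "DOWN": 0.4}   # assumed π(a|s), sums to 1
V = sum(pi[a] * Q[a] for a in Q)
print(V)   # 0.4*6 + 0.1*2 + 0.1*4 + 0.4*7 = 5.8
```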
### Value Function Pitfall #2: Forgetting Expectation
**Wrong**: V(s) = sum of rewards on one trajectory.
**Correct**: V(s) = expected sum over ALL possible trajectories.
**Example**: Stochastic GridWorld
```python
# WRONG: Compute V by running one episode
episode_return = sum([r_0, r_1, ..., r_T])
V[s_0] = episode_return # This is ONE sample, not expectation
# CORRECT: Compute V by averaging over many episodes
returns = []
for _ in range(1000):
episode_return = run_episode(policy, start_state=s)
returns.append(episode_return)
V[s] = np.mean(returns) # Expectation via Monte Carlo
```
**Key**: Value is an expectation, not a single sample.
### Value Function Pitfall #3: Ignoring Discount Factor
**Scenario**: User computes V without discounting.
**Wrong**:
```python
V[s] = r_0 + r_1 + r_2 + ... # No discount
```
**Correct**:
```python
V[s] = r_0 + gamma*r_1 + gamma**2*r_2 + ...
```
**Why It Matters**: Without discount, values blow up in continuing tasks.
**Example**: Continuing task with r=1 every step
```
Without discount: V = 1 + 1 + 1 + ... = ∞
With γ=0.9: V = 1 + 0.9 + 0.81 + ... = 1/(1-0.9) = 10
```
**Takeaway**: Always discount future rewards in continuing tasks.
## Part 3: Policies
### What is a Policy?
A policy π is a mapping from states to actions (or action probabilities).
**Deterministic Policy**: π: S → A
```
π(s) = a (always take action a in state s)
```
**Stochastic Policy**: π: S × A → [0,1]
```
π(a|s) = probability of taking action a in state s
Σ_a π(a|s) = 1 (probabilities sum to 1)
```
### Example: Policies in GridWorld
**Deterministic Policy**:
```python
def policy(state):
if state[0] < 2:
return "RIGHT"
else:
return "DOWN"
```
**Stochastic Policy**:
```python
def policy(state):
# 70% right, 20% down, 10% up
return np.random.choice(["RIGHT", "DOWN", "UP"],
p=[0.7, 0.2, 0.1])
```
**Uniform Random Policy**:
```python
def policy(state):
return np.random.choice(["UP", "DOWN", "LEFT", "RIGHT"])
```
### Policy Evaluation
**Problem**: Given policy π, compute V^π(s) for all states.
**Approach 1: Monte Carlo** (sample trajectories)
```python
# Run many episodes, average returns
V = defaultdict(float)
counts = defaultdict(int)
for episode in range(10000):
trajectory = run_episode(policy)
G = 0
for (s, a, r) in reversed(trajectory):
G = r + gamma * G
V[s] += G
counts[s] += 1
for s in V:
V[s] /= counts[s] # Average
```
**Approach 2: Bellman Expectation** (iterative)
```python
# Initialize V arbitrarily
V = {s: 0 for s in states}
# Iterate until convergence
while not converged:
V_new = {}
for s in states:
        V_new[s] = sum(
            pi[s][a] * (R(s, a) + gamma * sum(P[s][a][s2] * V[s2] for s2 in states))
            for a in actions
        )  # pi[s][a] = π(a|s), P[s][a][s2] = P(s2|s,a), R(s, a) = expected reward
V = V_new
```
**Approach 2 requires knowing P(s'|s,a)** (model-based).
### Policy Improvement
**Theorem**: Given V^π, greedy policy π' with respect to V^π is at least as good as π.
```
π'(s) = argmax_a Q^π(s,a)
= argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γV^π(s')]
```
**Proof Sketch**: By construction, π' maximizes expected immediate reward + future value.
**Consequence**: Iterating policy evaluation + policy improvement converges to optimal policy π*.
### Optimal Policy π*
**Theorem**: There exists an optimal policy π* that achieves V*(s) at all states.
**How to find π* from Q***:
```python
def optimal_policy(state):
return argmax(Q_star[state, :]) # Greedy w.r.t. Q*
```
**How to find π* from V***:
```python
def optimal_policy(state):
    # One-step lookahead (needs the model: P[s][a][s2] = P(s2|s,a), R(s, a) = expected reward)
    return argmax([
        R(state, a) + gamma * sum(P[state][a][s2] * V_star[s2] for s2 in states)
        for a in actions
    ])
```
**Key**: The optimal policy is deterministic (greedy w.r.t. Q* or V*).
**Exception**: When several actions are optimal in a state, any distribution over those optimal actions is also optimal.
### Policy Pitfall #1: Greedy Policy Without Exploration
**Problem**: Always taking argmax(Q) means never trying new actions.
**Example**:
```python
# Pure greedy policy (WRONG for learning)
def policy(state):
return argmax(Q[state, :])
```
**Why It Fails**: If Q is initialized wrong, agent never explores better actions.
**Solution**: ε-greedy policy
```python
def epsilon_greedy_policy(state, epsilon=0.1):
if random.random() < epsilon:
return random.choice(actions) # Explore
else:
return argmax(Q[state, :]) # Exploit
```
**Exploration-Exploitation Tradeoff**: Explore to find better actions, exploit to maximize reward.
### Policy Pitfall #2: Stochastic Policy for Deterministic Optimal
**Scenario**: Optimal policy is deterministic (most MDPs), but user uses stochastic policy.
**Effect**: Suboptimal performance (randomness doesn't help).
**Example**: GridWorld optimal policy always moves toward goal (deterministic).
**When Stochastic is Needed**:
1. **During Learning**: Exploration (ε-greedy, Boltzmann)
2. **Partially Observable**: Stochasticity can help in POMDPs
3. **Multi-Agent**: Randomness prevents exploitation by opponents
**Takeaway**: After learning, optimal policy is usually deterministic. Use stochastic for exploration.
## Part 4: Bellman Equations
### Bellman Expectation Equation
**For V^π**:
```
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]
```
**Intuition**: Value of state s = expected immediate reward + discounted value of next state.
**For Q^π**:
```
Q^π(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ Σ_{a'} π(a'|s') Q^π(s',a')]
```
**Intuition**: Value of (s,a) = expected immediate reward + discounted value of next (s',a').
**Relationship**:
```
V^π(s) = Σ_a π(a|s) Q^π(s,a)
Q^π(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]
```
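These identities can be checked numerically. A sketch on a tiny made-up two-state MDP: iterate the expectation backup until V^π stabilizes, compute Q^π from it, and confirm that V^π(s) = Σ_a π(a|s) Q^π(s,a):
```python
# Sketch: verify the Bellman expectation identities on a tiny made-up 2-state MDP.
import itertools

states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9
# P[s][a] = {s': prob}, R[s][a] = expected immediate reward (arbitrary illustrative numbers)
P = {
    "s0": {"a0": {"s0": 0.2, "s1": 0.8}, "a1": {"s0": 0.9, "s1": 0.1}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s0": 0.0, "s1": 1.0}},
}
R = {"s0": {"a0": 1.0, "a1": 0.0}, "s1": {"a0": -1.0, "a1": 2.0}}
pi = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.3, "a1": 0.7}}   # fixed stochastic policy

V = {s: 0.0 for s in states}
for _ in range(500):   # iterate the Bellman expectation backup to (near) convergence
    V = {
        s: sum(pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states))
               for a in actions)
        for s in states
    }

Q = {
    (s, a): R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
    for s, a in itertools.product(states, actions)
}
for s in states:
    print(f"{s}: V = {V[s]:.4f}, sum_a pi*Q = {sum(pi[s][a] * Q[(s, a)] for a in actions):.4f}")
```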
### Bellman Optimality Equation
**For V***:
```
V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
```
**Intuition**: Optimal value = max over actions of (immediate reward + discounted optimal future value).
**For Q***:
```
Q*(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q*(s',a')]
```
**Intuition**: Optimal Q-value = expected immediate reward + discounted optimal Q-value of next state.
**Relationship**:
```
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
```
### Deriving the Bellman Equation
**Start with definition of V^π**:
```
V^π(s) = E_π[G_t | s_t = s]
= E_π[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s]
```
**Factor out first reward**:
```
V^π(s) = E_π[r_t + γ(r_{t+1} + γr_{t+2} + ...) | s_t = s]
= E_π[r_t | s_t = s] + γ E_π[r_{t+1} + γr_{t+2} + ... | s_t = s]
```
**Second term is V^π(s_{t+1})**:
```
V^π(s) = E_π[r_t | s_t = s] + γ E_π[V^π(s_{t+1}) | s_t = s]
```
**Expand expectations**:
```
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]
```
**This is the Bellman Expectation Equation.**
**Key Insight**: Value function satisfies a consistency equation (recursive).
### Why Bellman Equations Matter
**1. Iterative Algorithms**: Use Bellman equation as update rule
```python
# Value Iteration (model-based backup)
V_new[s] = max(sum(P[s][a][s2] * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
               for a in actions)
# Q-Learning (sample-based update)
Q[s, a] += alpha * (r + gamma * max(Q[s_next, a2] for a2 in actions) - Q[s, a])
```
**2. Convergence Guarantees**: Bellman operator is a contraction, guarantees convergence.
**3. Understanding Algorithms**: All RL algorithms approximate Bellman equations.
**Takeaway**: Bellman equations are the foundation of RL algorithms.
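A small sketch of point 2 on a made-up two-state MDP: applying the optimality backup to two very different value functions shrinks their sup-norm distance by at least a factor of γ per sweep, which is exactly why value iteration converges:
```python
# Sketch: the Bellman optimality backup is a gamma-contraction in the sup norm.
gamma = 0.9
states, actions = [0, 1], ["a", "b"]
P = {  # P[s][a] = {s': prob}, made-up numbers
    0: {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.1, 1: 0.9}},
    1: {"a": {0: 1.0, 1: 0.0}, "b": {0: 0.3, 1: 0.7}},
}
R = {0: {"a": 0.0, "b": 1.0}, 1: {"a": 2.0, "b": -1.0}}   # expected rewards (illustrative)

def bellman_optimality(V):
    return {
        s: max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states) for a in actions)
        for s in states
    }

V1 = {0: 0.0, 1: 0.0}
V2 = {0: 100.0, 1: -50.0}          # deliberately far from V1
for sweep in range(5):
    gap = max(abs(V1[s] - V2[s]) for s in states)
    print(f"sweep {sweep}: ||V1 - V2||_inf = {gap:.3f}")   # shrinks by at least gamma each time
    V1, V2 = bellman_optimality(V1), bellman_optimality(V2)
```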
### Bellman Pitfall #1: Forgetting Max vs Expectation
**Bellman Expectation** (for policy π):
```
V^π(s) = Σ_a π(a|s) ... # Expectation over policy
```
**Bellman Optimality** (for optimal policy):
```
V*(s) = max_a ... # Maximize over actions
```
**Consequence**:
- Policy evaluation uses Bellman expectation
- Value iteration uses Bellman optimality
**Common Mistake**: Using max when evaluating a non-greedy policy.
### Bellman Pitfall #2: Ignoring Transition Probabilities
**Deterministic Transition**:
```
V^π(s) = R(s,a) + γ V^π(s') # Direct, s' is deterministic
```
**Stochastic Transition**:
```
V^π(s) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')] # Weighted sum
```
**Example**: Stochastic GridWorld
```
# Action RIGHT from (1,1)
V((1,1)) = 0.8 * [r + γ V((1,2))] # 80% intended
+ 0.1 * [r + γ V((0,1))] # 10% slip up
+ 0.1 * [r + γ V((2,1))] # 10% slip down
```
**Takeaway**: Don't forget to weight by transition probabilities in stochastic environments.
## Part 5: Discount Factor γ
### What Does γ Control?
**Discount factor γ ∈ [0, 1]** controls how much the agent cares about future rewards.
**γ = 0**: Only immediate reward matters
```
V(s) = E[r_t] (myopic)
```
**γ = 1**: All future rewards matter equally
```
V(s) = E[r_t + r_{t+1} + r_{t+2} + ...] (far-sighted)
```
**γ = 0.9**: Future discounted exponentially
```
V(s) = E[r_t + 0.9*r_{t+1} + 0.81*r_{t+2} + ...]
```
**Reward 10 steps away**:
- γ=0.9: worth 0.9^10 = 0.35 of immediate reward
- γ=0.99: worth 0.99^10 = 0.90 of immediate reward
### Planning Horizon
**Effective Horizon**: How far ahead does agent plan?
**Approximation**: Horizon ≈ 1/(1-γ)
**Examples**:
- γ=0.9 → Horizon ≈ 10 steps
- γ=0.99 → Horizon ≈ 100 steps
- γ=0.5 → Horizon ≈ 2 steps
- γ=0.999 → Horizon ≈ 1000 steps
**Intuition**: After horizon steps, rewards are discounted to ~37% (e^{-1}).
**Formal**: Σ_{t=0}^∞ γ^t = 1/(1-γ) (sum of geometric series).
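A quick sketch that prints the rule-of-thumb horizon and the weight a reward still carries after that many steps (roughly e^{-1} ≈ 0.37):
```python
# Sketch: effective horizon 1/(1-gamma) and the discount weight at that horizon.
for gamma in (0.5, 0.9, 0.99, 0.999):
    horizon = 1 / (1 - gamma)
    print(f"gamma={gamma}: horizon = {horizon:.0f} steps, "
          f"gamma^horizon = {gamma ** horizon:.2f}")
```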
### Choosing γ
**Rule of Thumb**:
- **Task horizon known**: γ such that 1/(1-γ) ≈ task_length
- **Short episodes** (< 100 steps): γ = 0.9 to 0.95
- **Long episodes** (100-1000 steps): γ = 0.99
- **Very long** (> 1000 steps): γ = 0.999
**Example: Pong** (episode ~ 1000 steps)
```
γ = 0.99 # Horizon ≈ 100, sees ~10% of episode
```
**Example: Cartpole** (episode ~ 200 steps)
```
γ = 0.99 # Horizon ≈ 100, sees half of episode
```
**Example: Chess** (game ~ 40 moves = 80 steps)
```
γ = 0.95 # Horizon ≈ 20, sees quarter of game
```
### γ = 1 Special Case
**When γ = 1**:
- Only valid for **episodic tasks** (guaranteed termination)
- Continuing tasks: V = ∞ (unbounded)
**Example: GridWorld** (terminates at goal)
```
γ = 1.0 # OK, episode ends
V(s) = -steps_to_goal + 10 (finite)
```
**Example: Stock trading** (never terminates)
```
γ = 1.0 # WRONG, V = ∞
γ = 0.99 # Correct
```
**Takeaway**: Use γ < 1 for continuing tasks, γ = 1 allowed for episodic.
### Discount Factor Pitfall #1: Too Small γ
**Scenario**: Task requires 50 steps to reach goal, γ=0.9.
**Problem**:
```
Reward at step 50 discounted by 0.9^50 = 0.0052
```
**Effect**: Agent effectively blind to long-term goals (can't see reward).
**Solution**: Increase γ to 0.99 (0.99^50 = 0.61, still significant).
**Symptom**: Agent learns suboptimal policy (ignores distant goals).
### Discount Factor Pitfall #2: γ = 1 in Continuing Tasks
**Scenario**: Continuing task (never terminates), γ=1.
**Problem**:
```
V(s) = r + r + r + ... = ∞ (unbounded)
```
**Effect**: Value iteration, Q-learning diverge (values explode).
**Solution**: Use γ < 1 (e.g., γ=0.99).
**Symptom**: Values grow without bound, algorithm doesn't converge.
### Discount Factor Pitfall #3: Treating γ as Hyperparameter
**Wrong Mindset**: "Let's grid search γ in [0.9, 0.95, 0.99]."
**Correct Mindset**: "Task requires planning X steps ahead, so γ = 1 - 1/X."
**Example**: Goal 100 steps away
```
Required horizon = 100
γ = 1 - 1/100 = 0.99
```
**Takeaway**: γ is not arbitrary. Choose based on task horizon.
## Part 6: Algorithm Families
### Three Paradigms
**1. Dynamic Programming (DP)**:
- Requires full MDP model (P, R known)
- Exact algorithms (no sampling)
- Examples: Value Iteration, Policy Iteration
**2. Monte Carlo (MC)**:
- Model-free (learn from experience)
- Learns from complete episodes
- Examples: First-visit MC, Every-visit MC
**3. Temporal Difference (TD)**:
- Model-free (learn from experience)
- Learns from incomplete episodes
- Examples: TD(0), Q-learning, SARSA
**Key Differences**:
- DP: Needs model, no sampling
- MC: No model, full episodes
- TD: No model, partial episodes (most flexible)
### Value Iteration
**Algorithm**: Iteratively apply Bellman optimality operator.
```python
# Initialize
V = {s: 0 for s in states}
# Iterate until convergence
while not converged:
V_new = {}
for s in states:
# Bellman optimality backup
        V_new[s] = max(
            sum(P[s][a][s_next] * (R(s, a, s_next) + gamma * V[s_next])
                for s_next in states)
            for a in actions
        )  # P[s][a][s_next] = P(s_next|s,a)
if max(abs(V_new[s] - V[s]) for s in states) < threshold:
converged = True
V = V_new
# Extract policy
policy = {
    s: argmax([
        sum(P[s][a][s_next] * (R(s, a, s_next) + gamma * V[s_next]) for s_next in states)
        for a in actions
    ])
    for s in states
}
```
**Convergence**: Guaranteed (Bellman operator is contraction).
**Computational Cost**: O(|S|² |A|) per iteration.
**When to Use**: Small state spaces (< 10,000 states), full model available.
### Policy Iteration
**Algorithm**: Alternate between policy evaluation and policy improvement.
```python
# Initialize random policy
policy = {s: random.choice(actions) for s in states}
while not converged:
# Policy Evaluation: Compute V^π
V = {s: 0 for s in states}
while not converged_V:
V_new = {}
for s in states:
a = policy[s]
            V_new[s] = sum(P[s][a][s_next] * (R(s, a, s_next) + gamma * V[s_next])
                           for s_next in states)
V = V_new
# Policy Improvement: Make policy greedy w.r.t. V
policy_stable = True
for s in states:
old_action = policy[s]
        policy[s] = argmax([
            sum(P[s][a][s_next] * (R(s, a, s_next) + gamma * V[s_next]) for s_next in states)
            for a in actions
        ])
if old_action != policy[s]:
policy_stable = False
if policy_stable:
converged = True
```
**Convergence**: Guaranteed, often fewer iterations than value iteration.
**When to Use**: When policy converges faster than values (common).
**Key Difference from Value Iteration**:
- Value iteration: no explicit policy until end
- Policy iteration: maintain and improve policy each iteration
### Monte Carlo Methods
**Idea**: Estimate V^π(s) by averaging returns from state s.
```python
# First-visit MC
V = defaultdict(float)
counts = defaultdict(int)
for episode in range(num_episodes):
trajectory = run_episode(policy) # [(s_0, a_0, r_0), ..., (s_T, a_T, r_T)]
G = 0
visited = set()
for (s, a, r) in reversed(trajectory):
G = r + gamma * G # Accumulate return
if s not in visited: # First-visit
V[s] += G
counts[s] += 1
visited.add(s)
for s in counts:
V[s] /= counts[s] # Average return
```
**Advantages**:
- No model needed (model-free)
- Can handle stochastic environments
- Unbiased estimates
**Disadvantages**:
- Requires complete episodes (can't learn mid-episode)
- High variance (one trajectory is noisy)
- Slow convergence
**When to Use**: Episodic tasks, when model unavailable.
### Temporal Difference (TD) Learning
**Idea**: Update V after each step using bootstrapping.
**TD(0) Update**:
```python
V[s] += alpha * (r + gamma * V[s_next] - V[s])
# \_____________________/
# TD error
```
**Bootstrapping**: Use current estimate V[s_next] instead of true return.
**Full Algorithm**:
```python
V = {s: 0 for s in states}
for episode in range(num_episodes):
s = initial_state()
while not terminal:
a = policy(s)
s_next, r = environment.step(s, a)
# TD update
V[s] += alpha * (r + gamma * V[s_next] - V[s])
s = s_next
```
**Advantages**:
- No model needed (model-free)
- Can learn from incomplete episodes (online)
- Lower variance than MC
**Disadvantages**:
- Biased estimates (bootstrap uses estimate)
- Requires tuning α (learning rate)
**When to Use**: Model-free, need online learning.
### Q-Learning (TD for Q-values)
**TD for action-values Q(s,a)**:
```python
Q[s, a] += alpha * (r + gamma * max(Q[s_next, a2] for a2 in actions) - Q[s, a])
```
**Full Algorithm**:
```python
Q = defaultdict(lambda: defaultdict(float))
for episode in range(num_episodes):
s = initial_state()
while not terminal:
# ε-greedy action selection
if random.random() < epsilon:
a = random.choice(actions)
else:
a = argmax(Q[s])
s_next, r = environment.step(s, a)
# Q-learning update (off-policy)
Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
s = s_next
```
**Key**: Off-policy (learns optimal Q regardless of behavior policy).
**When to Use**: Model-free, discrete actions, want optimal policy.
### SARSA (On-Policy TD)
**Difference from Q-learning**: Uses next action from policy (on-policy).
```python
Q[s,a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s,a])
# ^^^^^^
# Action from policy, not max
```
**Full Algorithm**:
```python
Q = defaultdict(lambda: defaultdict(float))
for episode in range(num_episodes):
s = initial_state()
a = epsilon_greedy(Q[s], epsilon) # Choose first action
while not terminal:
s_next, r = environment.step(s, a)
a_next = epsilon_greedy(Q[s_next], epsilon) # Next action from policy
# SARSA update (on-policy)
Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
s, a = s_next, a_next
```
**Difference from Q-learning**:
- Q-learning: learns optimal policy (off-policy)
- SARSA: learns policy being followed (on-policy)
**When to Use**: When you want policy to reflect exploration strategy.
### Algorithm Comparison
| Algorithm | Model? | Episodes? | Convergence | Use Case |
|-----------|--------|-----------|-------------|----------|
| Value Iteration | Yes (P, R) | No | Guaranteed | Small MDPs, known model |
| Policy Iteration | Yes (P, R) | No | Guaranteed, faster | Small MDPs, good init policy |
| Monte Carlo | No | Complete | Slow, high variance | Episodic, model-free |
| TD(0) | No | Partial | Faster, lower variance | Online, model-free |
| Q-Learning | No | Partial | Guaranteed* | Discrete actions, off-policy |
| SARSA | No | Partial | Guaranteed* | On-policy, safe exploration |
*With appropriate exploration and learning rate schedule.
### Algorithm Pitfall #1: Using DP Without Model
**Scenario**: User tries value iteration on real robot (no model).
**Problem**: Value iteration requires P(s'|s,a) and R(s,a,s').
**Solution**: Use model-free methods (Q-learning, SARSA, policy gradients).
**Red Flag**: "Let's use policy iteration for Atari games." (No model available.)
### Algorithm Pitfall #2: Monte Carlo on Non-Episodic Tasks
**Scenario**: Continuing task (never terminates), try MC.
**Problem**: MC requires complete episodes to compute return.
**Solution**: Use TD methods (learn from partial trajectories).
**Red Flag**: "Let's use MC for stock trading." (Continuing task.)
### Algorithm Pitfall #3: Confusing Q-Learning and SARSA
**Scenario**: User uses Q-learning but expects on-policy behavior.
**Example**: Cliff walking with epsilon-greedy
- Q-learning: Learns optimal (risky) path along cliff
- SARSA: Learns safe path away from cliff (accounts for exploration)
**Takeaway**:
- Q-learning: Learns optimal policy (off-policy)
- SARSA: Learns policy being followed (on-policy)
**Choose based on whether you want optimal policy or policy that accounts for exploration.**
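Concretely, the two algorithms compute different backup targets for the same transition (s, a, r, s'). A sketch with made-up Q-values for the next state:
```python
# Sketch: Q-learning vs SARSA targets for one transition. Q(s', .) and the
# sampled next action are made up for illustration.
import random

gamma = 0.9
Q_next = {"LEFT": 1.0, "RIGHT": 3.0}        # assumed Q(s', a') values
r = -1.0
a_next = random.choice(list(Q_next))        # action the exploring policy actually takes next

q_learning_target = r + gamma * max(Q_next.values())   # off-policy: best next action
sarsa_target = r + gamma * Q_next[a_next]              # on-policy: action actually taken
print(f"Q-learning target = {q_learning_target:.2f}")
print(f"SARSA target (a'={a_next}) = {sarsa_target:.2f}")
# Either target is then used as: Q[s][a] += alpha * (target - Q[s][a])
```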
## Part 7: Exploration vs Exploitation
### The Tradeoff
**Exploitation**: Choose action with highest known value (maximize immediate reward).
**Exploration**: Try new actions to discover if they're better (maximize long-term information).
**Dilemma**: Must explore to find optimal policy, but exploration sacrifices short-term reward.
**Example**: Restaurant choice
- Exploitation: Go to your favorite restaurant (known good)
- Exploration: Try a new restaurant (might be better, might be worse)
### Why Exploration is Necessary
**Scenario**: GridWorld, Q-values initialized to 0.
**Without Exploration**:
```python
# Greedy policy
policy(s) = argmax(Q[s, :]) # Always 0 initially, picks arbitrary action
```
**Problem**: If first action happens to be BAD, Q[s,a] becomes negative, never tried again.
**Result**: Agent stuck in suboptimal policy (local optimum).
**With Exploration**:
```python
# ε-greedy
if random.random() < epsilon:
action = random.choice(actions) # Explore
else:
action = argmax(Q[s, :]) # Exploit
```
**Result**: Eventually tries all actions, discovers optimal.
### ε-Greedy Exploration
**Algorithm**:
```python
def epsilon_greedy(state, Q, epsilon=0.1):
if random.random() < epsilon:
return random.choice(actions) # Explore with prob ε
else:
return argmax(Q[state, :]) # Exploit with prob 1-ε
```
**Tuning ε**:
- **ε = 0**: No exploration (greedy, can get stuck)
- **ε = 1**: Random policy (no exploitation, never converges)
- **ε = 0.1**: Common choice (10% exploration)
**Decay Schedule**:
```python
epsilon = max(epsilon_min, epsilon * decay_rate)
# Start high (ε=1.0), decay to low (ε=0.01)
```
**Rationale**: Explore heavily early, exploit more as you learn.
### Upper Confidence Bound (UCB)
**Idea**: Choose action that balances value and uncertainty.
**UCB Formula**:
```python
action = argmax(Q[s,a] + c * sqrt(log(N[s]) / N[s,a]))
# ^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Exploitation Exploration bonus
```
**Where**:
- N[s] = number of times state s visited
- N[s,a] = number of times action a taken in state s
- c = exploration constant
**Intuition**: Actions tried less often get exploration bonus (uncertainty).
**Advantage over ε-greedy**: Adaptive exploration (focuses on uncertain actions).
### Optimistic Initialization
**Idea**: Initialize Q-values to high values (optimistic).
```python
Q = defaultdict(lambda: defaultdict(lambda: 10.0)) # Optimistic
```
**Effect**: All actions initially seem good, encourages exploration.
**How it works**:
1. All Q-values start high (optimistic)
2. Agent tries action, gets real reward (likely lower)
3. Q-value decreases, agent tries other actions
4. Continues until all actions explored
**Advantage**: Simple, no ε parameter.
**Disadvantage**: Only works for finite action spaces, exploration stops after initial phase.
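A minimal sketch on a made-up 3-armed bandit: with optimistic initial values, a purely greedy agent still tries every arm, because each optimistic estimate has to be pulled down by real rewards before any single arm looks best for good:
```python
# Sketch: optimistic initialization on a made-up 3-armed bandit, greedy selection only.
import numpy as np

rng = np.random.default_rng(0)
true_means = [1.0, 5.0, 3.0]          # unknown to the agent
Q = np.full(3, 10.0)                  # optimistic: well above any real reward
counts = np.zeros(3)
pulls = []

for _ in range(30):
    a = int(np.argmax(Q))             # greedy, no epsilon
    reward = true_means[a] + rng.normal(0, 0.5)
    counts[a] += 1
    Q[a] += (reward - Q[a]) / counts[a]   # incremental sample-average update
    pulls.append(a)

print("arms pulled:", pulls)          # all three arms appear early on
print("final estimates:", np.round(Q, 2))
```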
### Boltzmann Exploration (Softmax)
**Idea**: Choose actions probabilistically based on Q-values.
```python
def softmax(Q, temperature=1.0):
exp_Q = np.exp(Q / temperature)
return exp_Q / np.sum(exp_Q)
probs = softmax(Q[state, :])
action = np.random.choice(actions, p=probs)
```
**Temperature**:
- High temperature (τ→∞): Uniform random (more exploration)
- Low temperature (τ→0): Greedy (more exploitation)
**Advantage**: Naturally weights exploration by Q-values (poor actions less likely).
**Disadvantage**: Requires tuning temperature, computationally more expensive.
### Exploration Pitfall #1: No Exploration
**Scenario**: Pure greedy policy.
```python
action = argmax(Q[state, :]) # No randomness
```
**Problem**: Agent never explores, gets stuck in local optimum.
**Example**: Q-values initialized to 0, first action is UP (arbitrary).
- Agent always chooses UP (Q still 0 for others)
- Never discovers RIGHT is optimal
- Stuck forever
**Solution**: Always use some exploration (ε-greedy with ε ≥ 0.01).
### Exploration Pitfall #2: Too Much Exploration
**Scenario**: ε = 0.5 (50% random actions).
**Problem**: Agent wastes time on known-bad actions.
**Effect**: Slow convergence, poor performance even after learning.
**Solution**: Decay ε over time (start high, end low).
```python
epsilon = max(0.01, epsilon * 0.995) # Decay to 1%
```
### Exploration Pitfall #3: Exploration at Test Time
**Scenario**: Evaluating learned policy with ε-greedy (ε=0.1).
**Problem**: Test performance artificially low (10% random actions).
**Solution**: Use greedy policy at test time.
```python
# Training
action = epsilon_greedy(state, Q, epsilon=0.1)
# Testing
action = argmax(Q[state, :]) # Greedy, no exploration
```
**Takeaway**: Exploration is for learning, not evaluation.
## Part 8: When Theory is Sufficient
### Theory vs Implementation
**When Understanding Theory is Enough**:
1. **Debugging**: Understanding Bellman equation explains why Q-values aren't converging
2. **Hyperparameter Tuning**: Understanding γ explains why agent is myopic
3. **Algorithm Selection**: Understanding model-free vs model-based explains why value iteration fails
4. **Conceptual Design**: Understanding exploration explains why agent gets stuck
**When You Need Implementation**:
1. **Real Problems**: Toy examples don't teach debugging real environments
2. **Scaling**: Neural networks, replay buffers, parallel environments
3. **Engineering**: Practical details (learning rate schedules, reward clipping)
**This Skill's Scope**: Theory, intuition, foundations.
**Other Skills for Implementation**: value-based-methods, policy-gradient-methods, actor-critic-methods.
### What This Skill Taught You
**1. MDP Formulation**: S, A, P, R, γ - the framework for RL.
**2. Value Functions**: V(s) = expected cumulative reward, Q(s,a) = value of action in state.
**3. Bellman Equations**: Recursive decomposition, foundation of all algorithms.
**4. Discount Factor**: γ controls planning horizon (1/(1-γ)).
**5. Policies**: Deterministic vs stochastic, optimal policy π*.
**6. Algorithms**:
- DP: Value iteration, policy iteration (model-based)
- MC: Monte Carlo (episodic, model-free)
- TD: Q-learning, SARSA (online, model-free)
**7. Exploration**: ε-greedy, UCB, necessary for learning.
**8. Theory-Practice Gap**: When theory suffices vs when to implement.
### Next Steps
After mastering foundations, route to:
**For Discrete Actions**:
- **value-based-methods**: DQN, Double DQN, Dueling DQN (Q-learning + neural networks)
**For Continuous Actions**:
- **actor-critic-methods**: SAC, TD3, A2C (policy + value function)
**For Any Action Space**:
- **policy-gradient-methods**: REINFORCE, PPO (direct policy optimization)
**For Debugging**:
- **rl-debugging**: Why agent not learning, reward issues, convergence problems
**For Environment Setup**:
- **rl-environments**: Gym, custom environments, wrappers
## Part 9: Common Pitfalls
### Pitfall #1: Skipping MDP Formulation
**Symptom**: Implementing Q-learning without defining states, actions, rewards clearly.
**Consequence**: Algorithm fails, user doesn't know why.
**Solution**: Always answer:
- What are states? (Markovian?)
- What are actions? (Discrete/continuous?)
- What is reward function? (Sparse/dense?)
- What is discount factor? (Based on horizon?)
### Pitfall #2: Confusing Value and Reward
**Symptom**: Setting V(s) = r(s).
**Consequence**: Ignores future rewards, policy suboptimal.
**Solution**: V(s) = E[r + γr' + γ²r'' + ...], not just r.
### Pitfall #3: Arbitrary Discount Factor
**Symptom**: "Let's use γ=0.9 because it's common."
**Consequence**: Agent can't see long-term goals (if γ too small) or values diverge (if γ=1 in continuing task).
**Solution**: Choose γ based on horizon (γ = 1 - 1/horizon).
### Pitfall #4: No Exploration
**Symptom**: Pure greedy policy during learning.
**Consequence**: Agent stuck in local optimum.
**Solution**: ε-greedy with ε ≥ 0.01, decay over time.
### Pitfall #5: Using DP Without Model
**Symptom**: Trying value iteration on real robot.
**Consequence**: Algorithm requires P(s'|s,a), R(s,a), which are unknown.
**Solution**: Use model-free methods (Q-learning, policy gradients).
### Pitfall #6: Monte Carlo on Continuing Tasks
**Symptom**: Using MC on task that never terminates.
**Consequence**: Cannot compute return (episode never ends).
**Solution**: Use TD methods (learn from partial trajectories).
### Pitfall #7: Confusing Q-Learning and SARSA
**Symptom**: Using Q-learning but expecting safe exploration.
**Consequence**: Q-learning learns optimal (risky) policy, ignores exploration safety.
**Solution**: Use SARSA for safe on-policy learning, Q-learning for optimal off-policy.
### Pitfall #8: Exploration at Test Time
**Symptom**: Evaluating with ε-greedy (ε > 0).
**Consequence**: Test performance artificially low.
**Solution**: Greedy policy at test time (ε=0).
### Pitfall #9: Treating Bellman as Black Box
**Symptom**: Using Q-learning update without understanding why.
**Consequence**: Cannot debug convergence issues, tune hyperparameters.
**Solution**: Derive Bellman equation, understand bootstrapping.
### Pitfall #10: Ignoring Transition Probabilities
**Symptom**: Using deterministic Bellman equation in stochastic environment.
**Consequence**: Wrong value estimates.
**Solution**: Weight by P(s'|s,a) in stochastic environments.
## Part 10: Rationalization Resistance
### Rationalization Table
| Rationalization | Reality | Counter-Guidance | Red Flag |
|-----------------|---------|------------------|----------|
| "I'll just copy Q-learning code" | Doesn't understand Q(s,a) meaning, cannot debug | "Let's understand what Q represents: expected cumulative reward. Why does Bellman equation have max?" | Jumping to code without theory |
| "V(s) is the reward at state s" | V is cumulative, r is immediate | "V(s) = E[r + γr' + ...], not just r. Value is long-term." | Confusing value and reward |
| "γ=0.9 is standard" | γ depends on task horizon | "What's your task horizon? γ=0.9 means ~10 steps. Need more?" | Arbitrary discount factor |
| "I don't need exploration, greedy is fine" | Gets stuck in local optimum | "Without exploration, you never try new actions. Use ε-greedy." | No exploration strategy |
| "Value iteration for Atari" | Atari doesn't have model (P, R unknown) | "Value iteration needs full model. Use model-free (DQN)." | DP on model-free problem |
| "Monte Carlo for continuing task" | MC requires episodes (termination) | "MC needs complete episodes. Use TD for continuing tasks." | MC on continuing task |
| "Q-learning and SARSA are the same" | Q-learning off-policy, SARSA on-policy | "Q-learning learns optimal, SARSA learns policy followed." | Confusing on-policy and off-policy |
| "I'll test with ε-greedy (ε=0.1)" | Test should be greedy (exploit only) | "Exploration is for learning. Test with ε=0 (greedy)." | Exploration at test time |
| "Bellman equation is just a formula" | It's the foundation of all algorithms | "Derive it. Understand why V(s) = r + γV(s'). Enables debugging." | Black-box understanding |
| "Deterministic transition, no need for P" | Correct, but must recognize when stochastic | "If stochastic, must weight by P(s'|s,a). Check environment." | Ignoring stochasticity |
## Part 11: Red Flags
Watch for these signs of misunderstanding:
- [ ] **Skipping MDP Formulation**: Implementing algorithm without defining S, A, P, R, γ
- [ ] **Value-Reward Confusion**: Treating V(s) as immediate reward instead of cumulative
- [ ] **Arbitrary γ**: Choosing discount factor without considering task horizon
- [ ] **No Exploration**: Pure greedy policy during learning
- [ ] **DP Without Model**: Using value/policy iteration when model unavailable
- [ ] **MC on Continuing**: Using Monte Carlo on non-episodic tasks
- [ ] **Q-SARSA Confusion**: Not understanding on-policy vs off-policy
- [ ] **Test Exploration**: Using ε-greedy during evaluation
- [ ] **Bellman Black Box**: Using TD updates without understanding Bellman equation
- [ ] **Ignoring Stochasticity**: Forgetting transition probabilities in stochastic environments
- [ ] **Planning Horizon Mismatch**: γ=0.9 for task requiring 100-step planning
- [ ] **Policy-Value Confusion**: Confusing π(s) and V(s), or Q(s,a) and π(a|s)
**If any red flag triggered → Explain theory → Derive equation → Connect to algorithm**
## Part 12: Code Examples
### Example 1: Value Iteration on GridWorld
```python
import numpy as np
# GridWorld: 4x4, goal at (3,3), walls at (1,1) and (2,2)
grid_size = 4
goal = (3, 3)
walls = {(1, 1), (2, 2)}
# MDP definition
gamma = 0.9
actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']
def next_state(s, a):
"""Deterministic transition"""
x, y = s
if a == 'UP': x -= 1
elif a == 'DOWN': x += 1
elif a == 'LEFT': y -= 1
elif a == 'RIGHT': y += 1
# Boundary check
x = max(0, min(grid_size - 1, x))
y = max(0, min(grid_size - 1, y))
# Wall check
if (x, y) in walls:
return s # Bounce back
return (x, y)
def reward(s, a, s_next):
"""Reward function"""
if s_next == goal:
return 10
elif s_next in walls:
return -5
else:
return -1
# Value Iteration
V = np.zeros((grid_size, grid_size))
threshold = 0.01
max_iterations = 1000
for iteration in range(max_iterations):
V_new = np.zeros((grid_size, grid_size))
for x in range(grid_size):
for y in range(grid_size):
s = (x, y)
if s == goal:
V_new[x, y] = 0 # Terminal state
continue
# Bellman optimality backup
values = []
for a in actions:
s_next = next_state(s, a)
r = reward(s, a, s_next)
value = r + gamma * V[s_next[0], s_next[1]]
values.append(value)
V_new[x, y] = max(values)
# Check convergence
if np.max(np.abs(V_new - V)) < threshold:
print(f"Converged in {iteration} iterations")
break
V = V_new
# Extract policy
policy = {}
for x in range(grid_size):
for y in range(grid_size):
s = (x, y)
if s == goal:
policy[s] = None
continue
best_action = None
best_value = -float('inf')
for a in actions:
s_next = next_state(s, a)
r = reward(s, a, s_next)
value = r + gamma * V[s_next[0], s_next[1]]
if value > best_value:
best_value = value
best_action = a
policy[s] = best_action
print("Value Function:")
print(V)
print("\nOptimal Policy:")
for x in range(grid_size):
row = []
for y in range(grid_size):
action = policy.get((x, y), '')
        if action == 'UP': symbol = '↑'
        elif action == 'DOWN': symbol = '↓'
        elif action == 'LEFT': symbol = '←'
        elif action == 'RIGHT': symbol = '→'
        else: symbol = 'G' # Goal
row.append(symbol)
print(' '.join(row))
```
**Output** (values rounded):
```
Converged in 6 iterations
Value Function:
[[ 1.81  3.12  4.58  6.2 ]
 [ 3.12  4.58  6.2   8.  ]
 [ 4.58  6.2   8.   10.  ]
 [ 6.2   8.   10.    0.  ]]
Optimal Policy:
↓ → ↓ ↓
↓ ↓ → ↓
↓ ↓ ↓ ↓
→ → → G
```
**Key Observations**:
- Values increase as you get closer to goal
- Policy points toward goal (shortest path)
- Moves into walls bounce back, so the greedy paths route around them (the unreachable wall cells still get values assigned by the sweep)
### Example 2: Q-Learning on GridWorld
```python
import numpy as np
import random
# Same GridWorld setup (reuses next_state() and reward() from Example 1)
grid_size = 4
goal = (3, 3)
walls = {(1, 1), (2, 2)}
actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']
gamma = 0.9
alpha = 0.1 # Learning rate
epsilon = 0.1 # Exploration
# Q-table
Q = {}
for x in range(grid_size):
for y in range(grid_size):
for a in actions:
Q[((x, y), a)] = 0.0
def epsilon_greedy(s, epsilon):
if random.random() < epsilon:
return random.choice(actions)
else:
# Greedy
best_action = actions[0]
best_value = Q[(s, best_action)]
for a in actions:
if Q[(s, a)] > best_value:
best_value = Q[(s, a)]
best_action = a
return best_action
# Training
num_episodes = 1000
for episode in range(num_episodes):
s = (0, 0) # Start state
while s != goal:
# Choose action
a = epsilon_greedy(s, epsilon)
# Take action
s_next = next_state(s, a)
r = reward(s, a, s_next)
# Q-learning update
if s_next == goal:
max_Q_next = 0 # Terminal
else:
max_Q_next = max(Q[(s_next, a_prime)] for a_prime in actions)
Q[(s, a)] += alpha * (r + gamma * max_Q_next - Q[(s, a)])
s = s_next
# Extract policy
print("Learned Policy:")
for x in range(grid_size):
row = []
for y in range(grid_size):
s = (x, y)
if s == goal:
row.append('G')
else:
best_action = max(actions, key=lambda a: Q[(s, a)])
            if best_action == 'UP': symbol = '↑'
            elif best_action == 'DOWN': symbol = '↓'
            elif best_action == 'LEFT': symbol = '←'
            elif best_action == 'RIGHT': symbol = '→'
row.append(symbol)
print(' '.join(row))
```
**Output** (representative run, similar to value iteration; arrows can vary where several actions tie, and the never-visited wall cells keep arbitrary greedy actions):
```
↓ → ↓ ↓
↓ ↑ → ↓
↓ ↓ ↑ ↓
→ → → G
```
**Key Differences from Value Iteration**:
- Q-learning is model-free (doesn't need P, R)
- Learns from experience (episodes)
- Uses ε-greedy exploration
- Requires many episodes to converge
### Example 3: Policy Evaluation (MC vs TD)
```python
import numpy as np
from collections import defaultdict
import random
# Simple chain MDP: s0 → s1 → s2 → goal
# Deterministic policy: always go right
# Reward: -1 per step, +10 at goal
# gamma = 0.9
gamma = 0.9
# Monte Carlo Policy Evaluation
def mc_policy_evaluation(num_episodes=1000):
V = defaultdict(float)
counts = defaultdict(int)
for _ in range(num_episodes):
# Generate episode
        trajectory = [            # (state, reward received on the transition out of that state)
            (0, -1),              # s0 -> s1
            (1, -1),              # s1 -> s2
            (2, 10),              # s2 -> goal (+10)
        ]
# Compute returns
G = 0
visited = set()
for s, r in reversed(trajectory):
G = r + gamma * G
if s not in visited:
V[s] += G
counts[s] += 1
visited.add(s)
for s in V:
V[s] /= counts[s]
return V
# TD(0) Policy Evaluation
def td_policy_evaluation(num_episodes=1000, alpha=0.1):
V = defaultdict(float)
for _ in range(num_episodes):
s = 0
while s != 3: # Until goal
# Take action (deterministic policy)
s_next = s + 1
r = 10 if s_next == 3 else -1
# TD update
V[s] += alpha * (r + gamma * V[s_next] - V[s])
s = s_next
return V
# Compare
V_mc = mc_policy_evaluation()
V_td = td_policy_evaluation()
print("Monte Carlo V:")
print({s: round(V_mc[s], 2) for s in [0, 1, 2]})
print("\nTD(0) V:")
print({s: round(V_td[s], 2) for s in [0, 1, 2]})
# True values (analytical)
V_true = {
    0: -1 + gamma * (-1 + gamma * 10),   # two -1 steps, then +10 entering the goal
    1: -1 + gamma * 10,
    2: 10.0,
}
print("\nTrue V:")
print({s: round(V_true[s], 2) for s in [0, 1, 2]})
```
**Output**:
```
Monte Carlo V:
{0: 6.2, 1: 8.0, 2: 10.0}
TD(0) V:
{0: 6.2, 1: 8.0, 2: 10.0}
True V:
{0: 6.2, 1: 8.0, 2: 10.0}
```
**Observations**:
- Both MC and TD converge to true values
- TD uses bootstrapping (updates before episode ends)
- MC waits for complete episode
### Example 4: Discount Factor Impact
```python
import numpy as np
# Simple MDP: chain of 10 states, +1 reward at end
# Compare different gamma values
def value_iteration_chain(gamma, num_states=10):
V = np.zeros(num_states + 1) # +1 for goal
# Value iteration
for _ in range(100):
V_new = np.zeros(num_states + 1)
for s in range(num_states):
# Deterministic: s → s+1, reward = +1 at goal
s_next = s + 1
r = 1 if s_next == num_states else 0
V_new[s] = r + gamma * V[s_next]
V = V_new
return V[:num_states] # Exclude goal
# Compare gamma values
for gamma in [0.5, 0.9, 0.99, 1.0]:
V = value_iteration_chain(gamma)
print(f"γ={gamma}:")
print(f" V(s_0) = {V[0]:.4f}")
print(f" V(s_5) = {V[5]:.4f}")
print(f" V(s_9) = {V[9]:.4f}")
    horizon = f"{1 / (1 - gamma):.1f}" if gamma < 1 else "inf"
    print(f" Effective horizon = {horizon}\n")
```
**Output**:
```
γ=0.5:
V(s_0) = 0.0020
V(s_5) = 0.0625
V(s_9) = 1.0000
Effective horizon = 2.0
γ=0.9:
V(s_0) = 0.3874
V(s_5) = 0.6561
V(s_9) = 1.0000
Effective horizon = 10.0
γ=0.99:
V(s_0) = 0.9135
V(s_5) = 0.9606
V(s_9) = 1.0000
Effective horizon = 100.0
γ=1.0:
V(s_0) = 1.0000
V(s_5) = 1.0000
V(s_9) = 1.0000
Effective horizon = inf
```
**Key Insights**:
- γ=0.5: Value at s_0 is tiny (can't "see" reward 10 steps away)
- γ=0.9: Moderate values (horizon ≈ 10, matches task length)
- γ=0.99: High values (can plan far ahead)
- γ=1.0: All states have same value (no discounting)
**Lesson**: Choose γ based on how far ahead agent must plan.
### Example 5: Exploration Comparison
```python
import numpy as np
import random
# Simple bandit: 3 actions, true Q* = [1.0, 5.0, 3.0]
# Compare exploration strategies
true_Q = [1.0, 5.0, 3.0]
num_actions = 3
def sample_reward(action):
"""Stochastic reward"""
return true_Q[action] + np.random.randn() * 0.5
# Strategy 1: ε-greedy
def epsilon_greedy_experiment(epsilon=0.1, num_steps=1000):
Q = [0.0] * num_actions
counts = [0] * num_actions
total_reward = 0
for _ in range(num_steps):
# Choose action
if random.random() < epsilon:
action = random.randint(0, num_actions - 1)
else:
action = np.argmax(Q)
# Observe reward
reward = sample_reward(action)
total_reward += reward
# Update Q
counts[action] += 1
Q[action] += (reward - Q[action]) / counts[action]
return total_reward / num_steps
# Strategy 2: UCB
def ucb_experiment(c=2.0, num_steps=1000):
Q = [0.0] * num_actions
counts = [0] * num_actions
# Initialize: try each action once
for a in range(num_actions):
reward = sample_reward(a)
counts[a] = 1
Q[a] = reward
total_reward = 0
for t in range(num_actions, num_steps):
# UCB action selection
ucb_values = [Q[a] + c * np.sqrt(np.log(t) / counts[a])
for a in range(num_actions)]
action = np.argmax(ucb_values)
# Observe reward
reward = sample_reward(action)
total_reward += reward
# Update Q
counts[action] += 1
Q[action] += (reward - Q[action]) / counts[action]
return total_reward / num_steps
# Strategy 3: Greedy (no exploration)
def greedy_experiment(num_steps=1000):
Q = [0.0] * num_actions
counts = [0] * num_actions
total_reward = 0
for _ in range(num_steps):
action = np.argmax(Q)
reward = sample_reward(action)
total_reward += reward
counts[action] += 1
Q[action] += (reward - Q[action]) / counts[action]
return total_reward / num_steps
# Compare (average over 100 runs)
num_runs = 100
greedy_rewards = [greedy_experiment() for _ in range(num_runs)]
epsilon_rewards = [epsilon_greedy_experiment() for _ in range(num_runs)]
ucb_rewards = [ucb_experiment() for _ in range(num_runs)]
print(f"Greedy: {np.mean(greedy_rewards):.2f} ± {np.std(greedy_rewards):.2f}")
print(f"ε-greedy: {np.mean(epsilon_rewards):.2f} ± {np.std(epsilon_rewards):.2f}")
print(f"UCB: {np.mean(ucb_rewards):.2f} ± {np.std(ucb_rewards):.2f}")
print(f"\nOptimal: {max(true_Q):.2f}")
```
**Output**:
```
Greedy: 1.05 ± 0.52
ε-greedy: 4.62 ± 0.21
UCB: 4.83 ± 0.18
Optimal: 5.00
```
**Insights**:
- Greedy: Gets stuck on first action (often suboptimal)
- ε-greedy: Explores, finds near-optimal
- UCB: Slightly better, focuses exploration on uncertain actions
**Lesson**: Exploration is critical. UCB > ε-greedy > greedy.
## Part 13: When to Route Elsewhere
This skill covers **theory and foundations**. Route to other skills for:
**Implementation**:
- **value-based-methods**: DQN, Double DQN, Dueling DQN (Q-learning + neural networks)
- **policy-gradient-methods**: REINFORCE, PPO, TRPO (policy optimization)
- **actor-critic-methods**: A2C, SAC, TD3 (policy + value)
**Debugging**:
- **rl-debugging**: Agent not learning, reward issues, convergence problems
**Infrastructure**:
- **rl-environments**: Gym API, custom environments, wrappers
**Special Topics**:
- **exploration-strategies**: Curiosity, RND, intrinsic motivation
- **reward-shaping**: Potential-based shaping, inverse RL
- **multi-agent-rl**: QMIX, MADDPG, cooperative/competitive
- **offline-rl**: CQL, IQL, learning from fixed datasets
- **model-based-rl**: MBPO, Dreamer, world models
**Evaluation**:
- **rl-evaluation**: Proper evaluation methodology, metrics
## Summary
**You now understand**:
1. **MDP**: S, A, P, R, γ - the framework for RL
2. **Value Functions**: V(s) = cumulative expected reward, Q(s,a) = value of action
3. **Bellman Equations**: Recursive decomposition, foundation of algorithms
4. **Discount Factor**: γ controls planning horizon (1/(1-γ))
5. **Policies**: π(s) maps states to actions, π* is optimal
6. **Algorithms**: DP (value/policy iteration), MC, TD (Q-learning, SARSA)
7. **Exploration**: ε-greedy, UCB, necessary for learning
8. **Theory-Practice**: When understanding suffices vs when to implement
**Key Takeaways**:
- **MDP formulation comes first** (define S, A, P, R, γ before implementing)
- **Value ≠ Reward** (V is cumulative, r is immediate)
- **γ is not arbitrary** (choose based on task horizon)
- **Exploration is mandatory** (ε-greedy, UCB, not pure greedy)
- **Algorithm families differ** (DP needs model, MC needs episodes, TD is most flexible)
- **Bellman equations enable everything** (understand them to debug algorithms)
**Next**: Route to implementation skills (value-based, policy-gradient, actor-critic) to build real agents.
**This foundation will enable you to implement, debug, and extend RL algorithms effectively.**