# Learning Rate Scheduling Skill

## When to Use This Skill

Use this skill when:
- User asks "should I use a learning rate scheduler?"
- Training plateaus or loss stops improving
- Training transformers or large models (warmup critical)
- User wants to implement OneCycleLR or specific scheduler
- Training is unstable in early epochs
- User asks "what learning rate should I use?"
- Deciding between constant LR and scheduled LR
- User is copying a paper's training recipe
- Implementing modern training pipelines (vision, NLP, RL)
- User suggests "just use constant LR" (rationalization)

Do NOT use when:
- User has specific bugs unrelated to scheduling
- Only discussing optimizer choice (no schedule questions)
- Training already working well and no LR questions asked

## Core Principles

### 1. Why Learning Rate Scheduling Matters

Learning rate scheduling is one of the MOST IMPACTFUL hyperparameters:

**High LR Early (Exploration):**
- Fast initial progress through parameter space
- Escape poor local minima
- Rapid loss reduction in early epochs

**Low LR Late (Exploitation):**
- Fine-tune to sharper, better minima
- Improve generalization (test accuracy)
- Stable convergence without oscillation

**Quantitative Impact:**
- Proper scheduling improves final test accuracy by 2-5% (SIGNIFICANT)
- Standard practice in all SOTA papers (ResNet, EfficientNet, ViT, BERT, GPT)
- Not optional for competitive performance

**When Constant LR Fails:**
- Can't explore quickly AND converge precisely
- Either too high (never converges) or too low (too slow)
- Leaves 2-5% performance on the table

### 2. Decision Framework: When to Schedule vs Constant LR

## Use Scheduler When:

✅ **Long training (>30 epochs)**
- Scheduling essential for multi-stage training
- Different LR regimes needed across training
- Example: 90-epoch ImageNet training

✅ **Large model on large dataset**
- Training from scratch on ImageNet, COCO, etc.
- Benefits from exploration → exploitation strategy
- Example: ResNet-50 on ImageNet

✅ **Training plateaus or loss stops improving**
- Current LR too high for current parameter regime
- Reducing LR breaks plateau
- Example: Validation loss stuck for 10+ epochs

✅ **Following established training recipes**
- Papers publish schedules for reproducibility
- Vision models typically use MultiStepLR or Cosine
- Example: ResNet paper specifies drop at epochs 30, 60, 90

✅ **Want competitive SOTA performance**
- Squeezing out last 2-5% accuracy
- Required for benchmarks and competitions
- Example: Targeting SOTA on CIFAR-10

## Maybe Don't Need Scheduler When:

❌ **Very short training (<10 epochs)**
- Not enough time for multi-stage scheduling
- Constant LR or simple linear decay sufficient
- Example: Quick fine-tuning for 5 epochs

❌ **OneCycle is the strategy itself**
- OneCycleLR IS the training strategy (not separate)
- Don't combine OneCycle with another scheduler
- Example: FastAI-style 20-epoch training

❌ **Hyperparameter search phase**
- Constant LR simpler to compare across runs
- Add scheduling after finding good architecture/optimizer
- Example: Running 50 architecture trials

❌ **Transfer learning fine-tuning**
- Small number of epochs on pretrained model
- Constant small LR often sufficient
- Example: Fine-tuning BERT for 3 epochs

❌ **Reinforcement learning**
- RL typically uses constant LR (exploration/exploitation balance different)
- Some exceptions (PPO sometimes uses linear decay)
- Example: DQN, A3C usually constant LR

## Default Recommendation:

**For >30 epoch training:** USE A SCHEDULER (typically CosineAnnealingLR)
**For <10 epoch training:** Constant LR usually fine
**For 10-30 epochs:** Try both, scheduler usually wins
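These defaults can be condensed into a small helper. The sketch below is illustrative only (the function name, thresholds, and `max_lr` default are assumptions, not a library API), assuming you already built the optimizer and data loader:

```python
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

def pick_default_scheduler(optimizer, epochs, steps_per_epoch=None, max_lr=0.1):
    """Hypothetical helper encoding the default recommendation above."""
    if epochs < 10:
        return None  # constant LR is usually fine for very short runs
    if epochs <= 30 and steps_per_epoch is not None:
        # OneCycle steps every BATCH and its max_lr should be tuned (see OneCycleLR section)
        return OneCycleLR(optimizer, max_lr=max_lr,
                          epochs=epochs, steps_per_epoch=steps_per_epoch)
    # Long training: cosine decay; add warmup for transformers / large models
    return CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-5)
```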
### 3. Major Scheduler Types - Complete Comparison

## StepLR / MultiStepLR (Classic Vision)

**Use When:**
- Training CNNs (ResNet, VGG, etc.)
- Following established recipe from paper
- Want simple, interpretable schedule

**How It Works:**
- Drop LR by constant factor at specific epochs
- StepLR: every N epochs
- MultiStepLR: at specified milestone epochs

**Implementation:**
```python
# StepLR: Drop every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=30,  # Drop every 30 epochs
    gamma=0.1      # Multiply LR by 0.1 (10x reduction)
)

# MultiStepLR: Drop at specific milestones (more control)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],  # Drop at these epochs
    gamma=0.1                 # Multiply by 0.1 each time
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch
```

**Example Schedule (initial_lr=0.1):**
- Epochs 0-29: LR = 0.1
- Epochs 30-59: LR = 0.01 (dropped by 10x)
- Epochs 60-89: LR = 0.001 (dropped by 10x again)
- Epochs 90-99: LR = 0.0001

**Pros:**
- Simple and interpretable
- Well-established in papers (easy to reproduce)
- Works well for vision models

**Cons:**
- Requires manual milestone selection
- Sharp LR drops can cause temporary instability
- Need to know total training epochs in advance

**Best For:** Classical CNN training (ResNet, VGG) following paper recipes

## CosineAnnealingLR (Modern Default)

**Use When:**
- Training modern vision models (ViT, EfficientNet)
- Want smooth decay without manual milestones
- Don't want to tune milestone positions

**How It Works:**
- Smooth cosine curve from initial_lr to eta_min
- Gradual decay, no sharp drops
- LR = eta_min + (initial_lr - eta_min) * (1 + cos(π * epoch / T_max)) / 2

**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=100,    # Total epochs (LR reaches eta_min at epoch 100)
    eta_min=1e-5  # Minimum LR (default: 0)
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch
```

**Example Schedule (initial_lr=0.1, eta_min=1e-5):**
- Epoch 0: LR = 0.1
- Epoch 25: LR ≈ 0.085
- Epoch 50: LR ≈ 0.05
- Epoch 75: LR ≈ 0.015
- Epoch 100: LR = 0.00001
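These values follow directly from the cosine formula above. A quick standalone check (not tied to any training code) reproduces the listed schedule:

```python
import math

base_lr, eta_min, T_max = 0.1, 1e-5, 100

def cosine_lr(epoch):
    # Closed form of the formula above (matches one scheduler.step() per epoch)
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2

for epoch in (0, 25, 50, 75, 100):
    print(epoch, cosine_lr(epoch))
# ≈ 0.1, 0.085, 0.05, 0.015, 0.00001
```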
**Pros:**
- No milestone tuning required
- Smooth decay (no instability from sharp drops)
- Widely used in modern papers
- Works well across many domains

**Cons:**
- Must know total epochs in advance
- Can't adjust schedule during training

**Best Practice: ALWAYS COMBINE WITH WARMUP for large models:**
```python
# Warmup for 5 epochs, then cosine for 95 epochs
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # Ramp to 100%
    total_iters=5       # Over 5 epochs
)

cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,     # 95 epochs after warmup
    eta_min=1e-5
)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after 5 epochs
)
```

**Best For:** Modern vision models, transformers, default choice for most problems

## ReduceLROnPlateau (Adaptive)

**Use When:**
- Don't know optimal schedule in advance
- Want adaptive approach based on validation performance
- Training plateaus and you want automatic LR reduction

**How It Works:**
- Monitors validation metric (loss or accuracy)
- Reduces LR when metric stops improving
- Requires passing metric to scheduler.step()

**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',            # 'min' for loss, 'max' for accuracy
    factor=0.1,            # Reduce LR by 10x when plateau detected
    patience=10,           # Wait 10 epochs before reducing
    threshold=1e-4,        # Minimum change to count as improvement
    threshold_mode='rel',  # 'rel' or 'abs'
    cooldown=0,            # Epochs to wait after LR reduction
    min_lr=1e-6,           # Don't reduce below this
    verbose=True           # Print when LR reduced
)

# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)

    # IMPORTANT: Pass validation metric to step()
    scheduler.step(val_loss)  # NOT scheduler.step() alone!
```

**Example Behavior (patience=10, factor=0.1):**
- Epochs 0-30: Val loss improving, LR = 0.001
- Epochs 31-40: Val loss plateaus at 0.15, patience counting
- Epoch 41: Plateau detected, LR reduced to 0.0001
- Epochs 42-60: Val loss improving again with lower LR
- Epoch 61: Plateau again, LR reduced to 0.00001

**Pros:**
- Adaptive - no manual tuning required
- Based on actual training progress
- Good for unknown optimal schedule

**Cons:**
- Can be too conservative (waits long before reducing)
- Requires validation metric (can't use train loss alone)
- May reduce LR too late or not enough

**Tuning Tips:**
- Smaller patience (5-10) for faster adaptation
- Larger patience (10-20) for more conservative
- Factor of 0.1 (10x) is standard, but 0.5 (2x) more gradual

**Best For:** Exploratory training, unknown optimal schedule, adaptive pipelines

## OneCycleLR (Fast Training)

**Use When:**
- Limited compute budget (want fast convergence)
- Training for relatively few epochs (10-30)
- Following FastAI-style training
- Want aggressive schedule for quick results

**How It Works:**
- Ramps UP from low LR to max_lr (first 30% by default)
- Ramps DOWN from max_lr to very low LR (remaining 70%)
- Steps EVERY BATCH (not every epoch) - CRITICAL DIFFERENCE

**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                         # Peak learning rate (TUNE THIS!)
    steps_per_epoch=len(train_loader),  # Batches per epoch
    epochs=20,                          # Total epochs
    pct_start=0.3,                      # Ramp up for first 30%
    anneal_strategy='cos',              # 'cos' or 'linear'
    div_factor=25,                      # initial_lr = max_lr / 25
    final_div_factor=10000              # final_lr = max_lr / 10000
)

# Training loop - NOTE: step() EVERY BATCH
for epoch in range(20):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # CALL EVERY BATCH, NOT EVERY EPOCH!
```

**Example Schedule (max_lr=0.1, 20 epochs, 400 batches/epoch):**
- Batches 0-2400 (epochs 0-6): LR ramps from 0.004 → 0.1
- Batches 2400-8000 (epochs 6-20): LR ramps from 0.1 → 0.00001

**CRITICAL: Tuning max_lr:**

OneCycleLR is VERY sensitive to max_lr choice. Too high = instability.

**Method 1 - LR Finder (RECOMMENDED):**
```python
# Run LR finder first (see LR Finder section)
optimal_lr = find_lr(model, train_loader, optimizer)  # e.g., 0.01
max_lr = optimal_lr * 10  # Use 10x optimal as max_lr
```

**Method 2 - Manual tuning:**
- Start with max_lr = 0.1
- If training unstable, try 0.03, 0.01
- If training too slow, try 0.3, 1.0

**Pros:**
- Very fast convergence (fewer epochs needed)
- Strong final performance
- Popular in FastAI community

**Cons:**
- Sensitive to max_lr (requires tuning)
- Steps every batch (easy to mess up)
- Not ideal for very long training (>50 epochs)

**Common Mistakes:**
1. Calling scheduler.step() per epoch instead of per batch
2. Not tuning max_lr (using default blindly)
3. Using for very long training (OneCycle designed for shorter cycles)

**Best For:** FastAI-style training, limited compute budget, 10-30 epoch training

## Advanced OneCycleLR Tuning

If lowering max_lr doesn't resolve instability, try these advanced tuning options:

**1. Adjust pct_start (warmup fraction):**
```python
# Default: 0.3 (30% warmup, 70% cooldown)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader), pct_start=0.3)  # Default

# If unstable at peak: Increase to 0.4 or 0.5 (longer warmup)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader), pct_start=0.5)  # Gentler ramp to peak

# If unstable in cooldown: Decrease to 0.2 (shorter warmup, gentler descent)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader), pct_start=0.2)
```

**2. Adjust div_factor (initial LR):**
```python
# Default: 25 (initial_lr = max_lr / 25)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader), div_factor=25)   # Start at 0.004

# If unstable at start: Increase to 50 or 100 (start even lower)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader), div_factor=100)  # Start at 0.001
```

**3. Adjust final_div_factor (final LR):**
```python
# Default: 10000 (final_lr = max_lr / 10000)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader), final_div_factor=10000)  # End at 0.00001

# If unstable at end: Decrease to 1000 (end at higher LR)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader), final_div_factor=1000)   # End at 0.0001
```

**4. Add gradient clipping:**
```python
# In training loop
for batch in train_loader:
    loss = train_step(model, batch, optimizer)
    loss.backward()

    # Clip gradients to prevent instability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    scheduler.step()
```
**5. Consider OneCycle may not be right for your problem:**
- **Very deep networks (>100 layers):** May be too unstable for OneCycle's aggressive schedule
- **Large models (>100M params):** May need gentler schedule (Cosine + warmup)
- **Sensitive architectures (some transformers):** OneCycle's rapid LR changes can destabilize

**Alternative:** Use CosineAnnealing + warmup for more stable training:
```python
# More stable alternative to OneCycle
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
```

## LinearLR (Warmup)

**Use When:**
- Need warmup at training start
- Ramping up LR gradually over first few epochs
- Combining with another scheduler (SequentialLR)

**How It Works:**
- Linearly interpolates LR from start_factor to end_factor
- Typically used for warmup: start_factor=0.01, end_factor=1.0

**Implementation:**
```python
# Standalone linear warmup
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# More common: Combine with main scheduler
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)

main = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95
)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, main],
    milestones=[5]  # Switch after 5 epochs
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

**Example Schedule (base_lr=0.1):**
- Epoch 0: LR = 0.001 (1%)
- Epoch 1: LR = 0.0208 (20.8%)
- Epoch 2: LR = 0.0406 (40.6%)
- Epoch 3: LR = 0.0604 (60.4%)
- Epoch 4: LR = 0.0802 (80.2%)
- Epoch 5: LR = 0.1 (100%, then switch to main scheduler)

**Best For:** Warmup phase for transformers and large models

## ExponentialLR (Continuous Decay)

**Use When:**
- Want smooth, continuous decay
- Simpler alternative to Cosine
- Prefer exponential over linear decay

**How It Works:**
- Multiply LR by gamma every epoch
- LR(epoch) = initial_lr * gamma^epoch

**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95  # Multiply by 0.95 each epoch
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

**Example Schedule (initial_lr=0.1, gamma=0.95):**
- Epoch 0: LR = 0.1
- Epoch 10: LR = 0.0599
- Epoch 50: LR = 0.0077
- Epoch 100: LR ≈ 0.0006

**Tuning gamma:**
- Want 10x decay over 100 epochs: gamma = 0.977 (0.1^(1/100))
- Want 100x decay over 100 epochs: gamma = 0.955 (0.01^(1/100))
- General formula: gamma = (target_lr / initial_lr)^(1/epochs)

**Pros:**
- Very smooth decay
- Simple to implement

**Cons:**
- Hard to intuit gamma value for desired final LR
- Less popular than Cosine (Cosine is better default)

**Best For:** Cases where you want exponential decay specifically
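The general formula can be applied directly. A small standalone sketch (the numbers and the dummy model are illustrative assumptions) that picks gamma for a desired final LR and passes it to the scheduler:

```python
import torch

# Illustrative target: decay 0.1 -> 0.001 (100x) over 150 epochs
initial_lr, target_lr, epochs = 0.1, 0.001, 150
gamma = (target_lr / initial_lr) ** (1 / epochs)  # ≈ 0.9698

model = torch.nn.Linear(10, 10)  # dummy model so the example is self-contained
optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

print(f"gamma = {gamma:.4f}, LR after {epochs} epochs ≈ {initial_lr * gamma ** epochs:.4g}")
```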
## LambdaLR (Custom Schedules)

**Use When:**
- Need custom schedule not provided by standard schedulers
- Implementing paper-specific schedule
- Advanced use cases (e.g., transformer inverse sqrt schedule)

**How It Works:**
- Provide function that computes LR multiplier for each epoch
- LR(epoch) = initial_lr * lambda(epoch)

**Implementation:**
```python
# Example: Warmup then constant
def warmup_lambda(epoch):
    if epoch < 5:
        return (epoch + 1) / 5  # Linear warmup
    else:
        return 1.0              # Constant after warmup

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=warmup_lambda
)

# Example: Transformer inverse square root schedule
def transformer_schedule(epoch):
    warmup_steps = 4000
    step = epoch + 1
    return min(step ** (-0.5), step * warmup_steps ** (-1.5))

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=transformer_schedule
)

# Example: Polynomial decay
def polynomial_decay(epoch):
    return (1 - epoch / 100) ** 0.9  # Decay to 0 at epoch 100

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=polynomial_decay
)
```

**Best For:** Custom schedules, implementing specific papers, advanced users

### 4. Warmup Strategies - CRITICAL FOR TRANSFORMERS

## Why Warmup is Essential

**Problem at Training Start:**
- Weights are randomly initialized
- Gradients can be very large and unstable
- BatchNorm statistics are uninitialized
- High LR can cause immediate divergence (NaN loss)

**Solution: Gradual LR Increase**
- Start with very low LR (1% of target)
- Linearly increase to target LR over first few epochs
- Allows model to stabilize before aggressive learning

**Quantitative Impact:**
- Transformers WITHOUT warmup: Often diverge or train very unstably
- Transformers WITH warmup: Stable training, better final performance
- Vision models: Warmup improves stability, sometimes +0.5-1% accuracy

## When Warmup is MANDATORY

**ALWAYS use warmup when:**

✅ **Training transformers (ViT, BERT, GPT, T5, etc.)**
- Transformers REQUIRE warmup - not optional
- Without warmup, training often diverges
- Standard practice in all transformer papers

✅ **Large batch sizes (>512)**
- Large batches → larger effective learning rate
- Warmup prevents early instability
- Standard for distributed training

✅ **High initial learning rates**
- If starting with LR > 0.001, use warmup
- Warmup allows higher peak LR safely

✅ **Training from scratch (not fine-tuning)**
- Random initialization needs gentle start
- Fine-tuning can often skip warmup (weights already good)

**Usually use warmup when:**

✅ Large models (>100M parameters)
✅ Using AdamW optimizer (common with transformers)
✅ Following modern training recipes

**May skip warmup when:**

❌ Fine-tuning pretrained models (weights already trained)
❌ Small learning rates (< 0.0001)
❌ Small models (<10M parameters)
❌ Established recipe without warmup (e.g., some CNN papers)

## Warmup Implementation Patterns

### Pattern 1: Linear Warmup + Cosine Decay (Most Common)

```python
import torch.optim.lr_scheduler as lr_scheduler

# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# Cosine decay for remaining 95 epochs
cosine = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,     # 95 epochs after warmup
    eta_min=1e-5  # Final LR
)

# Combine sequentially
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after epoch 5
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

**Schedule Visualization (base_lr=0.001):**
- Epochs 0-4: Linear ramp from 0.00001 → 0.001 (warmup)
- Epochs 5-99: Cosine decay from 0.001 → 0.00001

**Use For:** Vision transformers, modern CNNs, most large-scale training

### Pattern 2: Linear Warmup + MultiStepLR

```python
# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)

# Step decay at 30, 60, 90
steps = lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],
    gamma=0.1
)
# Combine
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, steps],
    milestones=[5]
)
```

**Use For:** Classical CNN training with warmup

### Pattern 3: Manual Warmup (More Control)

```python
import math

def get_lr_schedule(epoch, total_epochs, base_lr, warmup_epochs=5):
    """
    Custom schedule with warmup and cosine decay.
    """
    if epoch < warmup_epochs:
        # Linear warmup
        return base_lr * (epoch + 1) / warmup_epochs
    else:
        # Cosine decay
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Training loop
for epoch in range(100):
    lr = get_lr_schedule(epoch, total_epochs=100, base_lr=0.001)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    train_one_epoch(model, train_loader, optimizer)
```

**Use For:** Custom schedules, research, maximum control

### Pattern 4: Transformer-Style Warmup (Inverse Square Root)

```python
def transformer_lr_schedule(step, d_model, warmup_steps):
    """
    Transformer schedule from "Attention is All You Need".
    LR increases during warmup, then decreases proportionally
    to inverse sqrt of step.
    """
    step = step + 1  # 1-indexed
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: transformer_lr_schedule(step, d_model=512, warmup_steps=4000)
)

# Training loop - NOTE: step every BATCH for this schedule
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Step every batch
```

**Use For:** Transformer models (BERT, GPT), following original papers

## Warmup Duration Guidelines

**How many warmup epochs?**
- **Transformers:** 5-20 epochs (or 5-10% of total training)
- **Vision models:** 5-10 epochs
- **Very large models (>1B params):** 10-20 epochs
- **Small models:** 3-5 epochs

**Rule of thumb:** 5-10% of total training epochs

**Examples:**
- 100-epoch training: 5-10 epoch warmup
- 20-epoch training: 2-3 epoch warmup
- 300-epoch training: 15-30 epoch warmup
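The rule of thumb combines naturally with the T_max bookkeeping from Pattern 1. A minimal sketch, assuming a 5% warmup fraction and a dummy model so it runs standalone:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

total_epochs = 100
warmup_epochs = max(3, round(0.05 * total_epochs))  # 5% of training, at least 3 epochs -> 5

model = torch.nn.Linear(10, 10)  # dummy model so the sketch is self-contained
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
main = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, main], milestones=[warmup_epochs])
```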
## "But My Transformer Trained Fine Without Warmup"

Some users report training transformers without warmup successfully. Here's the reality:

**What "fine" actually means:**
- Training didn't diverge (NaN loss) - that's a low bar
- Got reasonable accuracy - but NOT optimal accuracy
- One successful run doesn't mean it's optimal or reliable

**What you're missing without warmup:**

**1. Performance gap (1-3% accuracy):**
```
Without warmup: Training works, achieves 85% accuracy
With warmup:    Same model achieves 87-88% accuracy
```

That 2-3% is SIGNIFICANT:
- Difference between competitive and SOTA
- Difference between accepted and rejected paper
- Difference between passing and failing business metrics

**2. Training stability:**
```
Without warmup:
- Some runs diverge → need to restart with lower LR
- Sensitive to initialization seed
- Requires careful LR tuning
- Success rate: 60-80% of runs

With warmup:
- Stable training → consistent results
- Robust to initialization
- Wider stable LR range
- Success rate: 95-100% of runs
```

**3. Hyperparameter sensitivity:**

Without warmup:
- Very sensitive to initial LR choice (0.001 works, 0.0015 diverges)
- Sensitive to batch size
- Sensitive to optimizer settings

With warmup:
- More forgiving LR range (0.0005-0.002 all work)
- Less sensitive to batch size
- Robust optimizer configuration

**Empirical Evidence - Published Papers:**

Check transformer papers - ALL use warmup:

| Model | Paper | Warmup |
|-------|-------|--------|
| ViT | Dosovitskiy et al., 2020 | ✅ Linear, 10k steps |
| DeiT | Touvron et al., 2021 | ✅ Linear, 5 epochs |
| Swin | Liu et al., 2021 | ✅ Linear, 20 epochs |
| BERT | Devlin et al., 2018 | ✅ Linear, 10k steps |
| GPT-2 | Radford et al., 2019 | ✅ Linear warmup |
| GPT-3 | Brown et al., 2020 | ✅ Linear warmup |
| T5 | Raffel et al., 2020 | ✅ Inverse sqrt warmup |

**Every competitive transformer model uses warmup - there's a reason.**

**"But I got 85% accuracy without warmup!"**

Great! Now try with warmup and see if you get 87-88%. You probably will.

**The cost-benefit analysis:**
```python
# Cost: One line of code
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
scheduler = SequentialLR(optimizer, [warmup, main], [5])

# Benefit:
# - 1-3% better accuracy
# - More stable training
# - Higher success rate
# - Wider stable hyperparameter range
```

**Recommendation:**
1. Run ablation study: Train your model with and without warmup
2. Compare: Final test accuracy, training stability, number of failed runs
3. You'll find warmup gives better results with minimal complexity

**Bottom line:** Just because something "works" doesn't mean it's optimal. Warmup is standard practice for transformers because it consistently improves results.

### 5. LR Finder - Finding Optimal Initial LR

## What is LR Finder?

**Method from Leslie Smith (2015):** Cyclical Learning Rates paper

**Core Idea:**
1. Start with very small LR (1e-8)
2. Gradually increase LR (multiply by ~1.1 each batch)
3. Train for a few hundred steps, record loss at each LR
4. Plot loss vs LR
5. Choose LR where loss decreases fastest (steepest descent)

**Why It Works:**
- Too low LR: Loss decreases very slowly
- Optimal LR: Loss decreases rapidly (steepest slope)
- Too high LR: Loss plateaus or increases (instability)

**Typical Findings:**
- Loss decreases fastest at some LR (e.g., 0.01)
- Loss starts increasing at higher LR (e.g., 0.1)
- Choose LR slightly below fastest descent point (e.g., 0.003-0.01)

## LR Finder Implementation

```python
import copy

import torch
import matplotlib.pyplot as plt
import numpy as np

def find_lr(model, train_loader, optimizer, loss_fn, device,
            start_lr=1e-8, end_lr=10, num_iter=100, smooth_f=0.05):
    """
    LR Finder: Sweep learning rates and plot loss curve.
    Args:
        model: PyTorch model
        train_loader: Training data loader
        optimizer: Optimizer (will be modified)
        loss_fn: Loss function
        device: Device to train on
        start_lr: Starting learning rate (default: 1e-8)
        end_lr: Ending learning rate (default: 10)
        num_iter: Number of iterations (default: 100)
        smooth_f: Smoothing factor for loss (default: 0.05)

    Returns:
        lrs: List of learning rates tested
        losses: List of losses at each LR
    """
    # Save initial model state to restore later (deep copy so training doesn't overwrite it)
    model.train()
    initial_state = copy.deepcopy(model.state_dict())

    # Calculate LR multiplier for exponential increase
    lr_mult = (end_lr / start_lr) ** (1 / num_iter)

    lrs = []
    losses = []
    best_loss = float('inf')
    avg_loss = 0
    lr = start_lr

    # Iterate through training data
    iterator = iter(train_loader)

    for iteration in range(num_iter):
        try:
            data, target = next(iterator)
        except StopIteration:
            # Restart iterator if we run out of data
            iterator = iter(train_loader)
            data, target = next(iterator)

        # Set learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Forward pass
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)

        # Compute smoothed loss (exponential moving average)
        if iteration == 0:
            avg_loss = loss.item()
        else:
            avg_loss = smooth_f * loss.item() + (1 - smooth_f) * avg_loss

        # Record
        lrs.append(lr)
        losses.append(avg_loss)

        # Track best loss
        if avg_loss < best_loss:
            best_loss = avg_loss

        # Stop if loss explodes (>4x best loss)
        if avg_loss > 4 * best_loss:
            print(f"Stopping early at iteration {iteration}: loss exploded")
            break

        # Backward pass
        loss.backward()
        optimizer.step()

        # Increase learning rate
        lr *= lr_mult
        if lr > end_lr:
            break

    # Restore model to initial state
    model.load_state_dict(initial_state)

    # Plot results
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('LR Finder')
    plt.grid(True, alpha=0.3)

    # Mark suggested LR (10x below minimum loss)
    min_loss_idx = np.argmin(losses)
    suggested_lr = lrs[max(0, min_loss_idx - 5)]  # A bit before minimum
    plt.axvline(suggested_lr, color='red', linestyle='--',
                label=f'Suggested LR: {suggested_lr:.2e}')
    plt.legend()
    plt.show()

    print(f"\nLR Finder Results:")
    print(f"  Minimum loss at LR: {lrs[min_loss_idx]:.2e}")
    print(f"  Suggested starting LR: {suggested_lr:.2e}")
    print(f"  (Choose LR where loss decreases fastest, before minimum)")

    return lrs, losses

def suggest_lr_from_finder(lrs, losses):
    """
    Suggest optimal learning rate from LR finder results.
    Strategy: Find LR where loss gradient is steepest (fastest decrease).
    """
    # Compute gradient of loss w.r.t. log(LR)
    log_lrs = np.log10(lrs)
    gradients = np.gradient(losses, log_lrs)

    # Find steepest descent (most negative gradient)
    steepest_idx = np.argmin(gradients)

    # Suggested LR is at steepest point or slightly before
    suggested_lr = lrs[steepest_idx]

    return suggested_lr
```

## Using LR Finder

### Basic Usage:

```python
# Setup model, optimizer, loss
model = YourModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # LR will be overridden
loss_fn = torch.nn.CrossEntropyLoss()

# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Manually inspect plot and choose LR
# Look for: steepest descent point (fastest loss decrease)
# Typically: 10x lower than loss minimum
# Example: If minimum is at 0.1, choose 0.01 as starting LR
base_lr = 0.01  # Based on plot inspection
```

### Automated LR Selection:

```python
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Get suggested LR
suggested_lr = suggest_lr_from_finder(lrs, losses)

# Use suggested LR
optimizer = torch.optim.SGD(model.parameters(), lr=suggested_lr)
```

### Using with OneCycleLR:

```python
# Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01

# OneCycleLR: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10  # e.g., 0.1
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20
)
```

## Interpreting LR Finder Results

**Typical Plot Patterns:**

```
Loss
 |
 |X X X X                                   X   <-- Loss explodes (LR too high)
 |        X                                X
 |          X                             X
 |           X                           X
 |            X  <-- CHOOSE HERE        X
 |             X     (steepest descent)
 |              X                      X
 |                X                  X
 |                   X X X X X X X X     <-- Loss minimum (still too high)
 |______________________________________________
   1e-8      1e-4      1e-2      0.1   1.0   10
                   Learning Rate
```

**How to Choose:**

1. **Steepest Descent (BEST):**
   - Find where loss decreases fastest (steepest downward slope)
   - This is optimal LR for rapid convergence
   - Example: If steepest at 0.01, choose 0.01

2. **Before Minimum (SAFE):**
   - Find minimum loss LR (e.g., 0.1)
   - Choose 10x lower (e.g., 0.01)
   - More conservative, safer choice

3. **Avoid:**
   - Don't choose minimum itself (often too high)
   - Don't choose where loss is flat (too low, slow progress)
   - Don't choose where loss increases (way too high)

**Guidelines:**
- For SGD: Choose at steepest descent
- For Adam: Choose 10x below steepest (Adam more sensitive)
- For OneCycle: Use steepest as optimal, 5-10x as max_lr

## When to Use LR Finder

**Use LR Finder When:**

✅ Starting new project (unknown optimal LR)
✅ New architecture or dataset
✅ Tuning OneCycleLR (finding max_lr)
✅ Transitioning between optimizers
✅ Having training instability issues

**Can Skip When:**

❌ Following established paper recipe (LR already known)
❌ Fine-tuning (small LR like 1e-5 typically works)
❌ Very constrained time/resources
❌ Using adaptive methods (ReduceLROnPlateau)

**Best Practice:**
- Run LR finder once at project start
- Use found LR for all subsequent runs
- Re-run if changing optimizer, architecture, or batch size significantly

### 6. Scheduler Selection Guide

## Selection Flowchart

**1. What's your training duration?**
- **<10 epochs:** Constant LR or simple linear decay
- **10-30 epochs:** OneCycleLR (fast) or CosineAnnealingLR
- **>30 epochs:** CosineAnnealingLR or MultiStepLR
**2. What's your model type?**
- **Transformer (ViT, BERT, GPT):** CosineAnnealing + WARMUP (mandatory)
- **CNN (ResNet, EfficientNet):** MultiStepLR or CosineAnnealing + optional warmup
- **Small model:** Simpler schedulers (StepLR) or constant LR

**3. Do you know optimal schedule?**
- **Yes (from paper):** Use paper's schedule (MultiStepLR usually)
- **No (exploring):** ReduceLROnPlateau or CosineAnnealing
- **Want fast results:** OneCycleLR + LR finder

**4. What's your compute budget?**
- **High budget (100+ epochs):** CosineAnnealing or MultiStepLR
- **Low budget (10-20 epochs):** OneCycleLR
- **Adaptive budget:** ReduceLROnPlateau (stops when plateau)

## Paper Recipe vs Modern Best Practices

**If goal is EXACT REPRODUCTION:**

Use paper's exact schedule (down to every detail):
```python
# Example: Reproducing ResNet paper (He et al., 2015)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# No warmup (paper didn't use it)
# Train for 100 epochs
```

**Rationale:**
- Reproduce results exactly
- Enable apples-to-apples comparison
- Validate paper's claims
- Establish baseline before improvements

**If goal is BEST PERFORMANCE:**

Use modern recipe (benefit from years of community learning):
```python
# Modern equivalent: ResNet with modern practices
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs
```

**Rationale:**
- Typically +0.5-2% better accuracy than original paper
- More stable training
- Reflects 5-10 years of community improvements
- SOTA competitive performance

**Evolution of LR Scheduling Practices:**

**Early Deep Learning (2012-2016):**
- Scheduler: StepLR with manual milestones
- Warmup: Not used (not yet discovered)
- Optimizer: SGD with momentum
- Examples: AlexNet, VGG, ResNet, Inception

**Mid Period (2017-2019):**
- Scheduler: CosineAnnealing introduced, OneCycleLR popular
- Warmup: Starting to be used for large batch training
- Optimizer: SGD still dominant, Adam increasingly common
- Examples: ResNeXt, DenseNet, MobileNet

**Modern Era (2020-2025):**
- Scheduler: CosineAnnealing default, OneCycle for fast training
- Warmup: Standard practice (mandatory for transformers)
- Optimizer: AdamW increasingly preferred for transformers
- Examples: ViT, EfficientNet, ConvNeXt, Swin, DeiT

**Practical Workflow:**

**Step 1: Reproduce paper recipe**
```python
# Use exact paper settings
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Should match paper's reported accuracy (e.g., 76.5%)
```

**Step 2: Validate reproduction**
- If you get 76.5% (matches paper): ✅ Reproduction successful
- If you get 74% (2% worse): ❌ Implementation bug, fix first
- If you get 78% (2% better): ✅ Great! Proceed to modern recipe

**Step 3: Try modern recipe**
```python
# Add warmup + cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Expect +0.5-2% improvement (e.g., 77-78.5%)
```

**Step 4: Compare results**

| Version | Accuracy | Notes |
|---------|----------|-------|
| Paper recipe | 76.5% | Baseline (reproduces paper) |
| Modern recipe | 78.0% | +1.5% from warmup + cosine |

**When to Use Which:**

**Use Paper Recipe:**
- Publishing reproduction study
- Comparing to paper's baseline
- Validating implementation correctness
- Research requiring exact reproducibility

**Use Modern Recipe:**
- Building production system (want best performance)
- Competing in benchmark (need SOTA results)
- Publishing new method (should use modern baseline)
- Limited compute (modern practices more efficient)

**Trade-off Table:**

| Aspect | Paper Recipe | Modern Recipe |
|--------|--------------|---------------|
| Reproducibility | ✅ Exact | ⚠️ Better but different |
| Performance | ⚠️ Good (for its time) | ✅ Better (+0.5-2%) |
| Comparability | ✅ To paper | ✅ To SOTA |
| Compute efficiency | ⚠️ May be suboptimal | ✅ Modern optimizations |
| Training stability | ⚠️ Variable | ✅ More stable (warmup) |

**Bottom Line:** Both are valid depending on your goal:
- **Research/reproduction:** Start with paper recipe
- **Production/competition:** Use modern recipe
- **Best practice:** Validate with paper recipe, deploy with modern recipe

## Domain-Specific Recommendations

### Image Classification (CNNs)

**Standard Recipe (ResNet, VGG):**
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Train for 100 epochs
```

**Modern Recipe (EfficientNet, RegNet):**
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs
```

### Vision Transformers (ViT, Swin, DeiT)

**Standard Recipe:**
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=290, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# Train for 300 epochs
# WARMUP IS MANDATORY
```

### NLP Transformers (BERT, GPT, T5)

**Standard Recipe:**
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)

# Linear warmup + linear decay
def lr_lambda(step):
    warmup_steps = 10000
    total_steps = 100000
    if step < warmup_steps:
        return step / warmup_steps
    else:
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Step every batch, not epoch
# WARMUP IS MANDATORY
```

### Object Detection (Faster R-CNN, YOLO)

**Standard Recipe:**
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)
# Train for 26 epochs
```

### Fast Training (Limited Compute)

**FastAI Recipe:**
```python
# Run LR finder first
optimal_lr = find_lr(model, train_loader, optimizer, loss_fn, device)
max_lr = optimal_lr * 10

optimizer = torch.optim.SGD(model.parameters(), lr=max_lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3
)
# Train for 20 epochs
# Step every batch
```

### 7. Common Scheduling Pitfalls

## Pitfall 1: No Warmup for Transformers

**WRONG:**
```python
# Training Vision Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# ❌ No warmup - training will be very unstable or diverge
```

**RIGHT:**
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# ✅ Warmup prevents early instability
```

**Why It Matters:**
- Transformers with high LR at start → NaN loss, divergence
- Random initialization needs gradual LR ramp
- 5-10 epoch warmup is STANDARD practice

**How to Detect:**
- Loss is NaN or explodes in first few epochs
- Training very unstable early, stabilizes later
- Gradients extremely large at start

## Pitfall 2: Wrong scheduler.step() Placement

**WRONG (Most Schedulers):**
```python
for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ❌ Stepping every batch, not every epoch
```

**RIGHT:**
```python
for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
    scheduler.step()  # ✅ Step AFTER each epoch
```

**EXCEPTION (OneCycleLR):**
```python
for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ✅ OneCycle steps EVERY BATCH
```

**Why It Matters:**
- CosineAnnealing with T_max=100 expects 100 steps (epochs)
- Stepping every batch: If 390 batches/epoch, LR decays in <1 epoch
- LR reaches minimum way too fast

**How to Detect:**
- LR decays to minimum in first epoch
- Print LR each step: `print(optimizer.param_groups[0]['lr'])`
- Check if LR changes every batch (wrong) vs every epoch (right)

**Rule:**
- **Most schedulers (Step, Cosine, Exponential):** Step per epoch
- **OneCycleLR only:** Step per batch
- **ReduceLROnPlateau:** Step per epoch with validation metric

## Pitfall 3: scheduler.step() Before optimizer.step()

**WRONG:**
```python
loss.backward()
scheduler.step()  # ❌ Wrong order
optimizer.step()
```

**RIGHT:**
```python
loss.backward()
optimizer.step()  # ✅ Update weights first
scheduler.step()  # Then update LR
```

**Why It Matters:**
- Scheduler updates LR based on current epoch/step
- Should update weights with current LR, THEN move to next LR
- Wrong order = off-by-one error in schedule

**How to Detect:**
- Usually subtle, hard to notice
- Best practice: always optimizer.step() then scheduler.step()
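The off-by-one is easy to see on a toy setup. This standalone sketch (dummy linear model, StepLR with step_size=1 — all illustrative choices) prints which LR each update actually uses under both orderings; recent PyTorch versions also warn when scheduler.step() runs before any optimizer.step():

```python
import torch

def lrs_used(scheduler_first):
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    used = []
    for _ in range(3):
        if scheduler_first:
            scheduler.step()                          # ❌ wrong order
        used.append(optimizer.param_groups[0]['lr'])  # LR this update will use
        optimizer.step()
        if not scheduler_first:
            scheduler.step()                          # ✅ right order
    return used

print(lrs_used(scheduler_first=False))  # ≈ [0.1, 0.01, 0.001]   - starts at the base LR
print(lrs_used(scheduler_first=True))   # ≈ [0.01, 0.001, 0.0001] - first LR value skipped
```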
## Pitfall 4: Not Passing Metric to ReduceLROnPlateau

**WRONG:**
```python
scheduler = ReduceLROnPlateau(optimizer)

for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # ❌ No metric passed
```

**RIGHT:**
```python
scheduler = ReduceLROnPlateau(optimizer, mode='min')

for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    scheduler.step(val_loss)  # ✅ Pass validation metric
```

**Why It Matters:**
- ReduceLROnPlateau NEEDS metric to detect plateau
- Without metric, scheduler doesn't know when to reduce LR
- Will get error or incorrect behavior

**How to Detect:**
- Error about a missing metrics argument when calling scheduler.step()
- LR never reduces even when training plateaus

## Pitfall 5: Using OneCycle for Long Training

**SUBOPTIMAL:**
```python
# Training for 200 epochs
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=200, steps_per_epoch=len(train_loader))
# ❌ OneCycle designed for shorter training (10-30 epochs)
```

**BETTER:**
```python
# For long training, use Cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=190, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# ✅ Cosine better suited for long training
```

**Why It Matters:**
- OneCycle's aggressive up-then-down profile works for short training
- For long training, gentler cosine decay more stable
- OneCycle typically used for 10-30 epochs in FastAI style

**When to Use Each:**
- **OneCycle:** 10-30 epochs, limited compute, want fast results
- **Cosine:** 50+ epochs, full training, want best final performance

## Pitfall 6: Not Tuning max_lr for OneCycle

**WRONG:**
```python
# Just guessing max_lr
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20, steps_per_epoch=len(train_loader))
# ❌ Random max_lr without tuning
# Might be too high (unstable) or too low (slow)
```

**RIGHT:**
```python
# Step 1: Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01

# Step 2: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10  # e.g., 0.1
scheduler = OneCycleLR(optimizer, max_lr=max_lr, epochs=20, steps_per_epoch=len(train_loader))
# ✅ Tuned max_lr based on LR finder
```

**Why It Matters:**
- OneCycle is VERY sensitive to max_lr
- Too high: Training unstable, loss explodes
- Too low: Slow training, underperforms
- LR finder finds optimal, use 5-10x as max_lr

**How to Tune:**
1. Run LR finder (see LR Finder section)
2. Find optimal LR (steepest descent point)
3. Use 5-10x optimal as max_lr for OneCycle
4. If still unstable, reduce max_lr (try 3x, 2x)

## Pitfall 7: Forgetting to Adjust T_max After Adding Warmup

**WRONG:**
```python
# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=100)  # ❌ Should be 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
```

**RIGHT:**
```python
# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)  # ✅ 100 - 5 = 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
```

**Why It Matters:**
- Total training is warmup + main schedule
- If warmup is 5 epochs and cosine is 100, total is 105 epochs
- T_max should be (total_epochs - warmup_epochs)

**How to Calculate:**
```python
total_epochs = 100
warmup_epochs = 5
T_max = total_epochs - warmup_epochs  # 95
```

## Pitfall 8: Using Same LR for All Param Groups

**SUBOPTIMAL:**
```python
# Fine-tuning: applying same LR to all layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ❌ Backbone and head both use 1e-3
```

**BETTER:**
```python
# Fine-tuning: lower LR for pretrained backbone, higher for new head
optimizer = torch.optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-4},  # Lower LR for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher LR for random init
])

scheduler = CosineAnnealingLR(optimizer, T_max=100)
# ✅ Scheduler applies to all param groups proportionally
```

**Why It Matters:**
- Pretrained layers need smaller LR (already trained)
- New layers need higher LR (random initialization)
- Schedulers work with param groups automatically

**Note:** Schedulers multiply all param groups by same factor, preserving relative ratios
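That "same factor" behavior is easy to verify with a dummy two-group optimizer; this standalone sketch (a hypothetical backbone/head split, no real training) prints both group LRs under cosine decay and shows the 10x ratio is preserved:

```python
import torch

backbone, head = torch.nn.Linear(8, 8), torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 1e-4},  # pretrained part
    {'params': head.parameters(), 'lr': 1e-3},      # newly initialized part
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    lr_backbone, lr_head = (group['lr'] for group in optimizer.param_groups)
    print(f"epoch {epoch}: backbone={lr_backbone:.2e} head={lr_head:.2e}")  # ratio stays 10x
    optimizer.step()   # dummy step (no gradients) just to keep the call order valid
    scheduler.step()
```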
""" # Create dummy model and optimizer model = torch.nn.Linear(1, 1) optimizer = torch.optim.SGD(model.parameters(), lr=0.1) scheduler = scheduler_fn(optimizer) lrs = [] for epoch in range(num_epochs): lrs.append(optimizer.param_groups[0]['lr']) optimizer.step() # Dummy step scheduler.step() # Plot plt.figure(figsize=(10, 6)) plt.plot(lrs) plt.xlabel('Epoch') plt.ylabel('Learning Rate') plt.title('LR Schedule') plt.grid(True, alpha=0.3) plt.show() # Usage def my_scheduler(opt): warmup = LinearLR(opt, start_factor=0.01, total_iters=5) cosine = CosineAnnealingLR(opt, T_max=95) return SequentialLR(opt, [warmup, cosine], [5]) plot_schedule(my_scheduler, num_epochs=100) # Verify plot looks correct BEFORE training ``` **Best Practice:** - Plot schedule before every major training run - Verify warmup duration, decay shape, final LR - Catch mistakes early (T_max wrong, step placement, etc.) ### 8. Modern Best Practices (2024-2025) ## Vision Models (CNNs, ResNets, ConvNeXt) **Standard Recipe:** ```python # Optimizer optimizer = torch.optim.SGD( model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4 ) # Scheduler: MultiStepLR or CosineAnnealing # Option 1: MultiStepLR (classical) scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1) # Option 2: CosineAnnealing (modern) warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5) cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5) scheduler = SequentialLR(optimizer, [warmup, cosine], [5]) # Training epochs = 100 for epoch in range(epochs): train_one_epoch(model, train_loader, optimizer) scheduler.step() ``` **Key Points:** - SGD with momentum (0.9) is standard for CNNs - LR = 0.1 for batch size 256 (scale linearly for other batch sizes) - Warmup optional but beneficial (5 epochs) - CosineAnnealing increasingly preferred over MultiStepLR ## Vision Transformers (ViT, Swin, DeiT) **Standard Recipe:** ```python # Optimizer optimizer = torch.optim.AdamW( model.parameters(), lr=1e-3, weight_decay=0.05, betas=(0.9, 0.999) ) # Scheduler: MUST include warmup warmup_epochs = 10 cosine_epochs = 290 warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs) cosine = CosineAnnealingLR(optimizer, T_max=cosine_epochs, eta_min=1e-5) scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs]) # Training epochs = 300 for epoch in range(epochs): train_one_epoch(model, train_loader, optimizer) scheduler.step() ``` **Key Points:** - AdamW optimizer (not SGD) - Warmup is MANDATORY (10-20 epochs) - Long training (300 epochs typical) - LR = 1e-3 for batch size 512 (scale for other sizes) - Cosine decay to very small LR (1e-5) **Why Warmup is Critical for ViT:** - Self-attention layers highly sensitive to initialization - High LR at start causes gradient explosion - Warmup allows attention patterns to stabilize ## NLP Transformers (BERT, GPT, T5) **Standard Recipe:** ```python # Optimizer optimizer = torch.optim.AdamW( model.parameters(), lr=5e-4, weight_decay=0.01, betas=(0.9, 0.999) ) # Scheduler: Linear warmup + linear decay (or inverse sqrt) total_steps = len(train_loader) * epochs warmup_steps = int(0.1 * total_steps) # 10% warmup def lr_lambda(step): if step < warmup_steps: return step / warmup_steps else: return max(0.0, (total_steps - step) / (total_steps - warmup_steps)) scheduler = LambdaLR(optimizer, lr_lambda) # Training: step EVERY BATCH for epoch in range(epochs): for batch in train_loader: train_step(model, batch, optimizer) optimizer.step() scheduler.step() # Step every batch, not epoch ``` **Key 
**Key Points:**
- AdamW optimizer
- Warmup is MANDATORY (typically 10% of training)
- Linear warmup + linear decay (BERT, GPT-2 style)
- Step scheduler EVERY BATCH (not every epoch)
- LR typically 1e-4 to 5e-4

**Alternative: Inverse Square Root (Original Transformer):**
```python
d_model = 512  # model width

def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = LambdaLR(optimizer, transformer_schedule)
```

## Object Detection (Faster R-CNN, YOLO, DETR)

**Standard Recipe (Two-stage detectors):**
```python
# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: MultiStepLR with short schedule
scheduler = MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)

# Training
epochs = 26  # Shorter than classification
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

**Standard Recipe (Transformer detectors like DETR):**
```python
# Optimizer
optimizer = torch.optim.AdamW(
    [
        {'params': model.backbone.parameters(), 'lr': 1e-5},     # Lower for backbone
        {'params': model.transformer.parameters(), 'lr': 1e-4}   # Higher for transformer
    ],
    weight_decay=1e-4
)

# Scheduler: Step decay
scheduler = MultiStepLR(optimizer, milestones=[200], gamma=0.1)

# Training: Long schedule for DETR
epochs = 300
```

**Key Points:**
- Detection typically shorter training than classification
- Lower LR (0.02 vs 0.1) due to task difficulty
- DETR needs very long training (300 epochs)

## Semantic Segmentation (U-Net, DeepLab, SegFormer)

**Standard Recipe (CNN-based):**
```python
# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: Polynomial decay (common in segmentation)
total_epochs = 100

def poly_lr_lambda(epoch):
    return (1 - epoch / total_epochs) ** 0.9

scheduler = LambdaLR(optimizer, poly_lr_lambda)

# Training
for epoch in range(total_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

**Key Points:**
- Polynomial decay common in segmentation (DeepLab papers)
- Lower initial LR (0.01) than classification
- Power of 0.9 standard

## Fast Training / Limited Compute (FastAI Style)

**OneCycle Recipe:**
```python
# Step 1: Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10  # e.g., 0.1

# Step 2: OneCycleLR
optimizer = torch.optim.SGD(model.parameters(), lr=max_lr, momentum=0.9)
scheduler = OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3,         # 30% warmup, 70% cooldown
    anneal_strategy='cos'
)

# Step 3: Train (step every batch)
for epoch in range(20):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Every batch
```

**Key Points:**
- Use LR finder to tune max_lr (CRITICAL)
- Train for fewer epochs (10-30)
- Step scheduler every batch
- Often achieves 90-95% of full training performance in 20-30% of time

## Fine-Tuning Pretrained Models

**Standard Recipe:**
```python
# Optimizer: Different LRs for backbone vs head
optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Very low for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher for new head
])

# Scheduler: Simple cosine or even constant
# Option 1: Constant LR (fine-tuning often doesn't need scheduling)
scheduler = None
# Option 2: Gentle cosine decay
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6)

# Training: Short duration
epochs = 10  # Fine-tuning is quick
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    if scheduler:
        scheduler.step()
```

**Key Points:**
- Much lower LR for pretrained parts (1e-5)
- Higher LR for new/random parts (1e-3)
- Short training (3-10 epochs)
- Scheduling often optional (constant LR works)
- No warmup needed (weights already good)

## Large Batch Training (Batch Size > 1024)

**Standard Recipe:**
```python
# Linear LR scaling rule: LR scales with batch size
base_lr = 0.1       # For batch size 256
batch_size = 2048
scaled_lr = base_lr * (batch_size / 256)  # 0.8 for batch 2048

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

# Scheduler: MUST include warmup (critical for large batch)
warmup_epochs = 5
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])

# Training
epochs = 100
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

**Key Points:**
- Scale LR linearly with batch size (LR = base_lr * batch_size / base_batch_size)
- Warmup is MANDATORY for large batch (5-10 epochs minimum)
- Longer warmup for very large batches (>4096: use 10-20 epochs)

**Why Warmup Critical for Large Batch:**
- Large batch = larger effective LR
- High effective LR at start causes instability
- Warmup prevents divergence

## Modern Defaults by Domain (2025)

| Domain | Optimizer | Scheduler | Warmup | Epochs |
|--------|-----------|-----------|--------|--------|
| Vision (CNN) | SGD (0.9) | Cosine or MultiStep | Optional (5) | 100-200 |
| Vision (ViT) | AdamW | Cosine | MANDATORY (10-20) | 300 |
| NLP (BERT/GPT) | AdamW | Linear | MANDATORY (10%) | Varies |
| Detection | SGD | MultiStep | Optional | 26-300 |
| Segmentation | SGD | Polynomial | Optional | 100 |
| Fast/OneCycle | SGD | OneCycle | Built-in | 10-30 |
| Fine-tuning | AdamW | Constant/Cosine | No | 3-10 |
| Large Batch | SGD | Cosine | MANDATORY (5-20) | 100-200 |

### 9. Debugging Scheduler Issues

## Issue: Training Unstable / Loss Spikes

**Symptoms:**
- Loss increases suddenly during training
- NaN or Inf loss
- Training was stable, then becomes unstable

**Likely Causes:**
1. **No warmup (transformers, large models)**
   - Solution: Add 5-10 epoch warmup
2. **LR too high at start**
   - Solution: Lower initial LR or extend warmup
3. **LR drop too sharp (MultiStepLR)**
   - Solution: Use gentler scheduler (Cosine) or smaller gamma

**Debugging Steps:**
```python
# 1. Print LR every epoch
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")

    # 2. Check if loss spike correlates with LR change
    loss = train_one_epoch(model, train_loader, optimizer)
    print(f"  Loss = {loss:.4f}")

    scheduler.step()
# 3. Plot LR and loss together
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(lr_history)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.subplot(1, 2, 2)
plt.plot(loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
```

**Solutions:**
- Add/extend warmup: `LinearLR(optimizer, start_factor=0.01, total_iters=10)`
- Lower initial LR: `lr = 0.01` instead of `lr = 0.1`
- Gentler scheduler: `CosineAnnealingLR` instead of `MultiStepLR`
- Gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`

## Issue: Training Plateaus Too Early

**Symptoms:**
- Loss stops decreasing after 20-30 epochs
- Validation accuracy flat
- Training seems stuck

**Likely Causes:**
1. **Not using scheduler (constant LR too high for current regime)**
   - Solution: Add scheduler (CosineAnnealing or ReduceLROnPlateau)
2. **Scheduler reducing LR too early**
   - Solution: Push back milestones or increase patience
3. **LR already too low**
   - Solution: Check current LR, may need to restart with higher initial LR

**Debugging Steps:**
```python
# Check current LR
current_lr = optimizer.param_groups[0]['lr']
print(f"Current LR: {current_lr:.6e}")

# If LR very low (<1e-6), plateau might be due to other issues (architecture, data, etc.)
# If LR still high (>1e-3), should reduce LR to break plateau
```

**Solutions:**
- Add ReduceLROnPlateau: Automatically reduces when plateau detected
  ```python
  scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
  ```
- Manual LR reduction: If at epoch 30 and plateaued, reduce LR by 10x now
  ```python
  for param_group in optimizer.param_groups:
      param_group['lr'] *= 0.1
  ```
- Use scheduler from start next time:
  ```python
  scheduler = CosineAnnealingLR(optimizer, T_max=100)
  ```

## Issue: Poor Final Performance (Train > Val Gap)

**Symptoms:**
- Training accuracy high (95%), validation lower (88%)
- Model overfitting
- Test performance disappointing

**Likely Causes (Scheduling Related):**
1. **LR not low enough at end**
   - Solution: Lower eta_min or extend training
2. **Not using scheduler (constant LR doesn't fine-tune)**
   - Solution: Add scheduler to reduce LR in late training
3. **Scheduler ending too early**
   - Solution: Extend training or adjust T_max

**Debugging Steps:**
```python
# Check final LR
final_lr = optimizer.param_groups[0]['lr']
print(f"Final LR: {final_lr:.6e}")

# Final LR should be very low (1e-5 to 1e-6)
# If final LR still high (>1e-3), model didn't fine-tune properly
```

**Solutions:**
- Lower eta_min: `CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)`
- Extend training: Train for more epochs to allow LR to decay further
- Add late-stage fine-tuning:
  ```python
  # After main training, do 10 more epochs with very low LR
  for param_group in optimizer.param_groups:
      param_group['lr'] = 1e-5

  for epoch in range(10):
      train_one_epoch(model, train_loader, optimizer)
  ```

**Note:** If train-val gap large, may also need regularization (not scheduling issue)

## Issue: LR Decays Too Fast

**Symptoms:**
- LR reaches minimum in first few epochs
- Training very slow after initial epochs
- Looks like constant very low LR

**Likely Causes:**
1. **scheduler.step() called every batch instead of epoch**
   - Solution: Move scheduler.step() outside batch loop
2. **T_max too small (e.g., T_max=10 but training for 100 epochs)**
   - Solution: Set T_max = total_epochs
## Issue: Training Plateaus Too Early

**Symptoms:**
- Loss stops decreasing after 20-30 epochs
- Validation accuracy flat
- Training seems stuck

**Likely Causes:**
1. **Not using scheduler (constant LR too high for current regime)**
   - Solution: Add scheduler (CosineAnnealing or ReduceLROnPlateau)
2. **Scheduler reducing LR too early**
   - Solution: Push back milestones or increase patience
3. **LR already too low**
   - Solution: Check current LR, may need to restart with higher initial LR

**Debugging Steps:**

```python
# Check current LR
current_lr = optimizer.param_groups[0]['lr']
print(f"Current LR: {current_lr:.6e}")

# If LR very low (<1e-6), plateau might be due to other issues (architecture, data, etc.)
# If LR still high (>1e-3), should reduce LR to break plateau
```

**Solutions:**
- Add ReduceLROnPlateau: automatically reduces when plateau detected
  ```python
  scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
  ```
- Manual LR reduction: if at epoch 30 and plateaued, reduce LR by 10x now
  ```python
  for param_group in optimizer.param_groups:
      param_group['lr'] *= 0.1
  ```
- Use scheduler from start next time:
  ```python
  scheduler = CosineAnnealingLR(optimizer, T_max=100)
  ```

## Issue: Poor Final Performance (Train > Val Gap)

**Symptoms:**
- Training accuracy high (95%), validation lower (88%)
- Model overfitting
- Test performance disappointing

**Likely Causes (Scheduling Related):**
1. **LR not low enough at end**
   - Solution: Lower eta_min or extend training
2. **Not using scheduler (constant LR doesn't fine-tune)**
   - Solution: Add scheduler to reduce LR in late training
3. **Scheduler ending too early**
   - Solution: Extend training or adjust T_max

**Debugging Steps:**

```python
# Check final LR
final_lr = optimizer.param_groups[0]['lr']
print(f"Final LR: {final_lr:.6e}")

# Final LR should be very low (1e-5 to 1e-6)
# If final LR still high (>1e-3), model didn't fine-tune properly
```

**Solutions:**
- Lower eta_min: `CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)`
- Extend training: train for more epochs to allow LR to decay further
- Add late-stage fine-tuning:
  ```python
  # After main training, do 10 more epochs with very low LR
  for param_group in optimizer.param_groups:
      param_group['lr'] = 1e-5
  for epoch in range(10):
      train_one_epoch(model, train_loader, optimizer)
  ```

**Note:** If train-val gap is large, may also need regularization (not a scheduling issue)

## Issue: LR Decays Too Fast

**Symptoms:**
- LR reaches minimum in first few epochs
- Training very slow after initial epochs
- Looks like constant very low LR

**Likely Causes:**
1. **scheduler.step() called every batch instead of every epoch**
   - Solution: Move scheduler.step() outside batch loop (if per-batch stepping is intentional, see the sketch below)
2. **T_max too small (e.g., T_max=10 but training for 100 epochs)**
   - Solution: Set T_max = total_epochs
3. **Using OneCycle unintentionally**
   - Solution: Verify scheduler type

**Debugging Steps:**

```python
# Print LR first few epochs
for epoch in range(10):
    print(f"Epoch {epoch}: LR = {optimizer.param_groups[0]['lr']:.6e}")
    for batch in train_loader:
        train_step(model, batch, optimizer)
        # scheduler.step()  # ❌ If this is here, that's the bug
    scheduler.step()  # ✅ Should be here
```

**Solutions:**
- Move scheduler.step() to correct location (after epoch, not after batch)
- Fix T_max: `T_max = total_epochs` or `T_max = total_epochs - warmup_epochs`
- Verify scheduler type: `print(type(scheduler))`
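If you do want per-batch stepping for a cosine schedule (some recipes anneal per iteration on purpose), size `T_max` in optimizer steps rather than epochs. A minimal sketch under the usual assumptions (`optimizer`, `train_loader`, `train_step` already defined):

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

epochs = 100
steps_per_epoch = len(train_loader)
# T_max counts scheduler.step() calls, which are batches here, not epochs
scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch, eta_min=1e-5)

for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()  # per batch, matching T_max expressed in batches
```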
## Issue: OneCycleLR Not Working

**Symptoms:**
- Training with OneCycle becomes unstable around peak LR
- Loss increases during ramp-up phase
- Worse performance than expected

**Likely Causes:**
1. **max_lr too high**
   - Solution: Run LR finder, use lower max_lr
2. **scheduler.step() placement wrong (should be per batch)**
   - Solution: Call scheduler.step() every batch
3. **Not tuning max_lr**
   - Solution: Use LR finder to find the optimal LR, use 5-10x that as max_lr

**Debugging Steps:**

```python
import matplotlib.pyplot as plt

# Plot the LR schedule (dry run: re-create the scheduler before real training)
lrs = []
for epoch in range(epochs):
    for batch in train_loader:
        lrs.append(optimizer.param_groups[0]['lr'])
        scheduler.step()

plt.plot(lrs)
plt.xlabel('Batch')
plt.ylabel('Learning Rate')
plt.title('OneCycle LR Schedule')
plt.show()

# Should see: ramp up to max_lr, then ramp down
# If it doesn't look like that, scheduler.step() placement is wrong
```

**Solutions:**
- Run LR finder first:
  ```python
  optimal_lr = find_lr(model, train_loader, optimizer, loss_fn, device)
  max_lr = optimal_lr * 10  # Or try 5x, 3x if 10x unstable
  ```
- Lower max_lr manually:
  ```python
  # If max_lr=0.1 unstable, try 0.03 or 0.01
  scheduler = OneCycleLR(optimizer, max_lr=0.03, ...)
  ```
- Verify step() every batch (a complete OneCycle setup is sketched below):
  ```python
  for epoch in range(epochs):
      for batch in train_loader:
          train_step(model, batch, optimizer)
          scheduler.step()  # ✅ Every batch
  ```
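For reference, a minimal full OneCycleLR setup that ties the cycle length to the loader, under the usual assumptions (`optimizer`, `train_loader`, `train_step` defined); `max_lr=0.03` is an illustrative value that should come from the LR finder:

```python
from torch.optim.lr_scheduler import OneCycleLR

epochs = 20
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.03,                        # from LR finder (e.g., 5-10x its suggestion)
    epochs=epochs,
    steps_per_epoch=len(train_loader),  # so the cycle spans the whole run
    pct_start=0.3,                      # fraction of steps spent ramping up
)

for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()  # OneCycle steps every batch
```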
## Issue: Warmup Not Working

**Symptoms:**
- Training still unstable in first few epochs despite warmup
- Loss spikes even with warmup
- NaN loss at start

**Likely Causes:**
1. **Warmup too short (need longer ramp-up)**
   - Solution: Extend warmup from 5 to 10-20 epochs
2. **start_factor too high (not starting low enough)**
   - Solution: Use start_factor=0.001 instead of 0.01
3. **Warmup not actually being used (SequentialLR misconfigured)**
   - Solution: Verify the warmup scheduler is active early

**Debugging Steps:**

```python
# Print LR first 10 epochs
for epoch in range(10):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")

    # Should see gradual increase from low to high
    # If it jumps immediately to high, warmup is not working
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```

**Solutions:**
- Extend warmup:
  ```python
  warmup = LinearLR(optimizer, start_factor=0.01, total_iters=20)  # 20 epochs
  ```
- Lower start_factor:
  ```python
  warmup = LinearLR(optimizer, start_factor=0.001, total_iters=5)  # Start at 0.1% of base LR
  ```
- Verify SequentialLR milestone:
  ```python
  # Milestone should match warmup duration
  scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[20])
  ```
- Add gradient clipping as an additional safeguard:
  ```python
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  ```

## Issue: ReduceLROnPlateau Never Reduces LR

**Symptoms:**
- Using ReduceLROnPlateau for 50+ epochs
- Validation loss clearly plateaued
- Learning rate never reduces

**Debugging Steps:**

**1. Verify the metric is being passed:**

```python
val_loss = validate(model, val_loader)
print(f"Epoch {epoch}: val_loss = {val_loss:.6f}")  # Print the metric
scheduler.step(val_loss)  # Ensure the metric is passed
```

**2. Check the mode is correct:**

```python
# For loss (want to minimize):
scheduler = ReduceLROnPlateau(optimizer, mode='min')

# For accuracy (want to maximize):
scheduler = ReduceLROnPlateau(optimizer, mode='max')
```

The wrong mode means the scheduler waits for the opposite direction (loss increasing instead of decreasing).

**3. Check the threshold isn't too lenient:**

```python
# Default threshold=1e-4 with threshold_mode='rel' = 0.01% relative improvement
# If val_loss goes 0.5000 → 0.4999 (0.02% improvement), that still counts as an
# improvement, resets the patience counter, and the LR never drops even though
# training is effectively flat.
# Solution: raise the threshold so only meaningful improvements count
scheduler = ReduceLROnPlateau(optimizer, threshold=1e-3)  # Require a 0.1% relative improvement
```

**4. Enable verbose logging:**

```python
scheduler = ReduceLROnPlateau(optimizer, verbose=True)
# Prints: "Epoch 00042: reducing learning rate of group 0 to 1.0000e-04"
# when it reduces (note: `verbose` is deprecated in recent PyTorch versions;
# printing the LR yourself each epoch works everywhere)
```

**5. Verify the plateau is real:**

```python
# Plot validation loss over time (val_losses: the list you append to each epoch)
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(val_losses)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Validation Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()

# Check: Is loss truly flat, or still slowly improving?
# Tiny improvements (0.4500 → 0.4499) count as progress with the default threshold
```

**6. Check cooldown isn't preventing reduction:**

```python
# Default cooldown=0, but if set higher, it blocks reduction right after a recent one
scheduler = ReduceLROnPlateau(optimizer, cooldown=0)  # No cooldown
```

**Common Causes Table:**

| Problem | Symptom | Solution |
|---------|---------|----------|
| Not passing metric | Error or no reduction | `scheduler.step(val_loss)` |
| Wrong mode | Never reduces | `mode='min'` for loss, `mode='max'` for accuracy |
| Threshold too lenient | Tiny improvements keep resetting patience | Raise to `threshold=1e-3` |
| Metric still improving | Not actually plateaued | Raise threshold or accept the slow progress |
| Cooldown active | Reducing but waiting | Set `cooldown=0` |
| min_lr reached | Can't reduce further | Check current LR, may be at min_lr |

**Example Fix:**

```python
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',            # For loss minimization
    factor=0.1,            # Reduce by 10x
    patience=10,           # Wait 10 epochs without meaningful improvement
    threshold=1e-3,        # Require a 0.1% relative improvement to reset patience
    threshold_mode='rel',
    cooldown=0,            # No cooldown period
    min_lr=1e-6,           # Minimum LR allowed
    verbose=True           # Print when reducing (older PyTorch versions)
)

# Training loop
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

    scheduler.step(val_loss)  # Pass validation loss

    # Print current LR
    current_lr = optimizer.param_groups[0]['lr']
    print(f"  Current LR: {current_lr:.6e}")
```

**Advanced Debugging:**

If it still does not reduce, manually check the scheduler's state:

```python
# Get scheduler state
print(f"Best metric so far: {scheduler.best}")
print(f"Epochs without improvement: {scheduler.num_bad_epochs}")
print(f"Patience: {scheduler.patience}")

# If num_bad_epochs < patience, it's still waiting
# If num_bad_epochs >= patience, it should reduce on the next step
```
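The plateau test itself is simple. Roughly, for `mode='min'` with `threshold_mode='rel'`, a new value only counts as an improvement if it beats the best value by the relative threshold (a sketch of the logic, not the library source):

```python
def is_improvement(current, best, threshold=1e-4):
    # mode='min', threshold_mode='rel': must beat the best value by a relative margin
    return current < best * (1.0 - threshold)

# 0.4999 vs best 0.5000 with the default threshold: counts as "improvement",
# so num_bad_epochs resets and the LR is never reduced
print(is_improvement(0.4999, 0.5000, threshold=1e-4))  # True
# With a larger threshold the same tiny change no longer resets patience
print(is_improvement(0.4999, 0.5000, threshold=1e-3))  # False
```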
### 10. Rationalization Table

When users rationalize away proper LR scheduling, counter with:

| Rationalization | Reality | Counter-Argument |
|-----------------|---------|------------------|
| "Constant LR is simpler" | Leaves 2-5% performance on table | "One line of code for 2-5% better accuracy is excellent ROI" |
| "Warmup seems optional" | MANDATORY for transformers | "Without warmup, transformers diverge or train unstably" |
| "I don't know which scheduler to use" | CosineAnnealing is a great default | "CosineAnnealingLR works well for most cases, zero tuning" |
| "Scheduling is too complicated" | Modern frameworks make it trivial | "scheduler = CosineAnnealingLR(optimizer, T_max=100) - that's it" |
| "Papers don't mention scheduling" | They do, in implementation details | "Check the paper's code repo or appendix - scheduling is always there" |
| "My model is too small to need scheduling" | Even small models benefit | "Scheduling helps all models converge to better minima" |
| "Just use Adam, it adapts automatically" | Adam still benefits from scheduling | "SOTA transformers use AdamW + scheduling (BERT, GPT, ViT)" |
| "I'll tune it later" | Scheduling should be there from the start | "Scheduling is a core hyperparameter, not an optional add-on" |
| "OneCycle always best" | Only for specific scenarios | "OneCycle is great for fast training (<30 epochs), not long training" |
| "I don't have time to run LR finder" | Takes 5 minutes, saves hours | "LR finder runs in minutes, prevents wasted training runs" |
| "Warmup adds complexity" | One extra line of code | "SequentialLR([warmup, cosine], [5]) - that's the complexity" |
| "My training is already good enough" | Could be 2-5% better | "SOTA papers all use scheduling - it's standard practice" |
| "Reducing LR will slow training" | Reduces LR only once high LR hurts | "High LR early (fast), low LR late (fine-tune) = best of both" |
| "I don't know what T_max to use" | T_max = total_epochs | "Just set T_max to your total training epochs" |

### 11. Red Flags Checklist

Watch for these warning signs that indicate scheduling problems:

**Critical Red Flags (Fix Immediately):**

🚨 Training a transformer without warmup
- **Impact:** High risk of divergence, NaN loss
- **Fix:** Add 5-10 epoch warmup immediately

🚨 Loss NaN or exploding in the first few epochs
- **Impact:** Training failed
- **Fix:** Add warmup, lower initial LR, gradient clipping

🚨 scheduler.step() called every batch for Cosine/Step schedulers
- **Impact:** LR decays steps_per_epoch times too fast (often 100x or more)
- **Fix:** Move scheduler.step() outside the batch loop

🚨 Not passing a metric to ReduceLROnPlateau
- **Impact:** Scheduler doesn't work at all
- **Fix:** scheduler.step(val_loss)

**Important Red Flags (Should Fix):**

⚠️ Training >30 epochs without a scheduler
- **Impact:** Leaving 2-5% performance on the table
- **Fix:** Add CosineAnnealingLR or MultiStepLR

⚠️ OneCycle with random max_lr (not tuned)
- **Impact:** Unstable training or suboptimal performance
- **Fix:** Run LR finder, tune max_lr

⚠️ Large batch (>512) without warmup
- **Impact:** Training instability
- **Fix:** Add 5-10 epoch warmup

⚠️ Vision transformer with constant LR
- **Impact:** Poor convergence, unstable training
- **Fix:** Add warmup + cosine schedule

⚠️ Training plateaus but no scheduler to reduce LR
- **Impact:** Stuck at a local minimum
- **Fix:** Add ReduceLROnPlateau or manually reduce LR

**Minor Red Flags (Consider Fixing):**

⚡ CNN training without any scheduling
- **Impact:** Missing 1-3% accuracy
- **Fix:** Add MultiStepLR or CosineAnnealingLR

⚡ Not monitoring LR during training
- **Impact:** Hard to debug schedule issues
- **Fix:** Log LR every epoch

⚡ T_max doesn't match training duration
- **Impact:** Schedule ends too early/late
- **Fix:** Set T_max = total_epochs - warmup_epochs

⚡ Using the same LR for pretrained and new layers (fine-tuning)
- **Impact:** Suboptimal fine-tuning
- **Fix:** Use different LRs for param groups

⚡ Not validating the schedule before full training
- **Impact:** Risk wasting compute on a wrong schedule
- **Fix:** Plot a schedule dry-run before training (see the sketch below)
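To dry-run a schedule without touching your real training state, drive it with a throwaway optimizer and plot the LRs (a minimal sketch; swap in whatever scheduler you actually plan to use):

```python
import matplotlib.pyplot as plt
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Throwaway optimizer with one dummy parameter, just to drive the scheduler
dummy_opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
warmup = LinearLR(dummy_opt, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(dummy_opt, T_max=95, eta_min=1e-5)
schedule = SequentialLR(dummy_opt, schedulers=[warmup, cosine], milestones=[5])

lrs = []
for epoch in range(100):
    lrs.append(dummy_opt.param_groups[0]['lr'])
    schedule.step()

plt.plot(lrs)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Schedule dry-run: 5-epoch warmup + cosine')
plt.show()
```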
### 12. Quick Reference

## Scheduler Selection Cheatsheet

```
Q: What should I use for...

Vision CNN (100 epochs)?
→ CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

Vision Transformer?
→ LinearLR(warmup 5) + CosineAnnealingLR(T_max=95)  [WARMUP MANDATORY]

NLP Transformer?
→ LinearLR(warmup 10%) + LinearLR(decay)  [WARMUP MANDATORY]

Fast training (<30 epochs)?
→ OneCycleLR(max_lr=tune_with_LR_finder)

Don't know optimal schedule?
→ ReduceLROnPlateau(mode='min', patience=10)

Training plateaued?
→ Add ReduceLROnPlateau or manually reduce LR by 10x now

Following paper recipe?
→ Use paper's exact schedule (usually MultiStepLR)

Fine-tuning pretrained model?
→ Constant low LR (1e-5) or gentle CosineAnnealing

Large batch (>512)?
→ LinearLR(warmup 5-10) + CosineAnnealingLR  [WARMUP MANDATORY]
```

## Step Placement Quick Reference

```python
# Most schedulers (Step, Cosine, Exponential)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    scheduler.step()  # AFTER epoch

# OneCycleLR (EXCEPTION)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
        scheduler.step()  # AFTER each batch

# ReduceLROnPlateau (pass metric)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    val_loss = validate(...)
    scheduler.step(val_loss)  # Pass metric
```

## Warmup Quick Reference

```python
# Pattern: Warmup + Cosine (most common)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

# When warmup is MANDATORY:
# ✅ Transformers (ViT, BERT, GPT)
# ✅ Large batch (>512)
# ✅ High initial LR
# ✅ Training from scratch

# When warmup is optional:
# ❌ Fine-tuning
# ❌ Small LR (<1e-4)
# ❌ Small models
```

## LR Finder Quick Reference

```python
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)

# Find optimal (steepest descent)
optimal_lr = suggest_lr_from_finder(lrs, losses)

# Use cases:
# - Direct use: optimizer = SGD(params, lr=optimal_lr)
# - OneCycle: max_lr = optimal_lr * 10
# - Conservative: base_lr = optimal_lr * 0.1
```

## Summary

Learning rate scheduling is CRITICAL for competitive model performance:

**Key Takeaways:**
1. **Scheduling improves final accuracy by 2-5%** - not optional for SOTA
2. **Warmup is MANDATORY for transformers** - prevents divergence
3. **CosineAnnealingLR is the best default** - works well, zero tuning
4. **Use LR finder for new problems** - finds optimal initial LR in minutes
5. **OneCycleLR needs max_lr tuning** - run LR finder first
6. **Watch scheduler.step() placement** - most per epoch, OneCycle per batch
7. **Always monitor LR during training** - log to console or TensorBoard
8. **Plot schedule before training** - catch mistakes early

**Modern Defaults (2025):**
- **Vision CNNs:** SGD + CosineAnnealingLR (optional warmup)
- **Vision Transformers:** AdamW + Warmup + CosineAnnealingLR (warmup mandatory)
- **NLP Transformers:** AdamW + Warmup + Linear decay (warmup mandatory)
- **Fast Training:** SGD + OneCycleLR (tune max_lr with LR finder)

**When In Doubt:**
- Use CosineAnnealingLR with T_max = total_epochs
- Add 5-epoch warmup for large models
- Run LR finder if unsure about initial LR
- Log LR every epoch to monitor schedule

Learning rate scheduling is one of the highest-ROI hyperparameters - master it for significantly better model performance.