# Learning Rate Scheduling Skill
## When to Use This Skill
Use this skill when:
- User asks "should I use a learning rate scheduler?"
- Training plateaus or loss stops improving
- Training transformers or large models (warmup critical)
- User wants to implement OneCycleLR or specific scheduler
- Training is unstable in early epochs
- User asks "what learning rate should I use?"
- Deciding between constant LR and scheduled LR
- User is copying a paper's training recipe
- Implementing modern training pipelines (vision, NLP, RL)
- User suggests "just use constant LR" (rationalization)
Do NOT use when:
- User has specific bugs unrelated to scheduling
- Only discussing optimizer choice (no schedule questions)
- Training already working well and no LR questions asked
## Core Principles
### 1. Why Learning Rate Scheduling Matters
Learning rate scheduling is one of the MOST IMPACTFUL hyperparameters:
**High LR Early (Exploration):**
- Fast initial progress through parameter space
- Escape poor local minima
- Rapid loss reduction in early epochs
**Low LR Late (Exploitation):**
- Fine-tune to sharper, better minima
- Improve generalization (test accuracy)
- Stable convergence without oscillation
**Quantitative Impact:**
- Proper scheduling improves final test accuracy by 2-5% (SIGNIFICANT)
- Standard practice in all SOTA papers (ResNet, EfficientNet, ViT, BERT, GPT)
- Not optional for competitive performance
**When Constant LR Fails:**
- Can't explore quickly AND converge precisely
- Either too high (never converges) or too low (too slow)
- Leaves 2-5% performance on the table
### 2. Decision Framework: When to Schedule vs Constant LR
## Use Scheduler When:
**Long training (>30 epochs)**
- Scheduling essential for multi-stage training
- Different LR regimes needed across training
- Example: 90-epoch ImageNet training
**Large model on large dataset**
- Training from scratch on ImageNet, COCO, etc.
- Benefits from exploration → exploitation strategy
- Example: ResNet-50 on ImageNet
**Training plateaus or loss stops improving**
- Current LR too high for current parameter regime
- Reducing LR breaks plateau
- Example: Validation loss stuck for 10+ epochs
**Following established training recipes**
- Papers publish schedules for reproducibility
- Vision models typically use MultiStepLR or Cosine
- Example: ResNet paper specifies drop at epochs 30, 60, 90
**Want competitive SOTA performance**
- Squeezing out last 2-5% accuracy
- Required for benchmarks and competitions
- Example: Targeting SOTA on CIFAR-10
## Maybe Don't Need Scheduler When:
**Very short training (<10 epochs)**
- Not enough time for multi-stage scheduling
- Constant LR or simple linear decay sufficient
- Example: Quick fine-tuning for 5 epochs
**OneCycle is the strategy itself**
- OneCycleLR IS the training strategy (not separate)
- Don't combine OneCycle with another scheduler
- Example: FastAI-style 20-epoch training
**Hyperparameter search phase**
- Constant LR simpler to compare across runs
- Add scheduling after finding good architecture/optimizer
- Example: Running 50 architecture trials
**Transfer learning fine-tuning**
- Small number of epochs on pretrained model
- Constant small LR often sufficient
- Example: Fine-tuning BERT for 3 epochs
**Reinforcement learning**
- RL typically uses constant LR (exploration/exploitation balance different)
- Some exceptions (PPO sometimes uses linear decay)
- Example: DQN, A3C usually constant LR
## Default Recommendation:
**For >30 epoch training:** USE A SCHEDULER (typically CosineAnnealingLR)
**For <10 epoch training:** Constant LR usually fine
**For 10-30 epochs:** Try both, scheduler usually wins
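The duration rule above can be captured in a tiny helper for quick reference (a rough heuristic sketch, not a library function):
```python
# Hypothetical helper encoding the duration rule of thumb above
def default_schedule_choice(total_epochs: int) -> str:
    if total_epochs < 10:
        return "constant LR (or simple linear decay)"
    if total_epochs <= 30:
        return "try OneCycleLR and CosineAnnealingLR; scheduler usually wins"
    return "CosineAnnealingLR (plus warmup for large models)"

print(default_schedule_choice(90))  # -> CosineAnnealingLR (plus warmup for large models)
```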
### 3. Major Scheduler Types - Complete Comparison
## StepLR / MultiStepLR (Classic Vision)
**Use When:**
- Training CNNs (ResNet, VGG, etc.)
- Following established recipe from paper
- Want simple, interpretable schedule
**How It Works:**
- Drop LR by constant factor at specific epochs
- StepLR: every N epochs
- MultiStepLR: at specified milestone epochs
**Implementation:**
```python
# StepLR: Drop every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=30,  # Drop every 30 epochs
    gamma=0.1      # Multiply LR by 0.1 (10x reduction)
)

# MultiStepLR: Drop at specific milestones (more control)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],  # Drop at these epochs
    gamma=0.1                 # Multiply by 0.1 each time
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch
```
**Example Schedule (initial_lr=0.1):**
- Epochs 0-29: LR = 0.1
- Epochs 30-59: LR = 0.01 (dropped by 10x)
- Epochs 60-89: LR = 0.001 (dropped by 10x again)
- Epochs 90-99: LR = 0.0001
**Pros:**
- Simple and interpretable
- Well-established in papers (easy to reproduce)
- Works well for vision models
**Cons:**
- Requires manual milestone selection
- Sharp LR drops can cause temporary instability
- Need to know total training epochs in advance
**Best For:** Classical CNN training (ResNet, VGG) following paper recipes
## CosineAnnealingLR (Modern Default)
**Use When:**
- Training modern vision models (ViT, EfficientNet)
- Want smooth decay without manual milestones
- Don't want to tune milestone positions
**How It Works:**
- Smooth cosine curve from initial_lr to eta_min
- Gradual decay, no sharp drops
- LR = eta_min + (initial_lr - eta_min) * (1 + cos(π * epoch / T_max)) / 2
**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=100,    # Total epochs (LR reaches eta_min at epoch 100)
    eta_min=1e-5  # Minimum LR (default: 0)
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # Call AFTER each epoch
```
**Example Schedule (initial_lr=0.1, eta_min=1e-5):**
- Epoch 0: LR = 0.1
- Epoch 25: LR ≈ 0.085
- Epoch 50: LR ≈ 0.05
- Epoch 75: LR ≈ 0.015
- Epoch 100: LR = 0.00001
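These values follow directly from the cosine formula above; a minimal sketch to evaluate it at any epoch:
```python
import math

def cosine_lr(epoch, base_lr=0.1, eta_min=1e-5, T_max=100):
    # Closed-form CosineAnnealingLR value at a given epoch
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2

for e in (0, 25, 50, 75, 100):
    print(e, f"{cosine_lr(e):.5f}")  # 0.1, ~0.085, ~0.05, ~0.015, 0.00001
```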
**Pros:**
- No milestone tuning required
- Smooth decay (no instability from sharp drops)
- Widely used in modern papers
- Works well across many domains
**Cons:**
- Must know total epochs in advance
- Can't adjust schedule during training
**Best Practice: ALWAYS COMBINE WITH WARMUP for large models:**
```python
# Warmup for 5 epochs, then cosine for 95 epochs
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # Ramp to 100%
    total_iters=5       # Over 5 epochs
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,  # 95 epochs after warmup
    eta_min=1e-5
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after 5 epochs
)
```
**Best For:** Modern vision models, transformers, default choice for most problems
## ReduceLROnPlateau (Adaptive)
**Use When:**
- Don't know optimal schedule in advance
- Want adaptive approach based on validation performance
- Training plateaus and you want automatic LR reduction
**How It Works:**
- Monitors validation metric (loss or accuracy)
- Reduces LR when metric stops improving
- Requires passing metric to scheduler.step()
**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',            # 'min' for loss, 'max' for accuracy
    factor=0.1,            # Reduce LR by 10x when plateau detected
    patience=10,           # Wait 10 epochs before reducing
    threshold=1e-4,        # Minimum change to count as improvement
    threshold_mode='rel',  # 'rel' or 'abs'
    cooldown=0,            # Epochs to wait after LR reduction
    min_lr=1e-6,           # Don't reduce below this
    verbose=True           # Print when LR reduced (deprecated in PyTorch >= 2.2)
)

# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    # IMPORTANT: Pass validation metric to step()
    scheduler.step(val_loss)  # NOT scheduler.step() alone!
```
**Example Behavior (patience=10, factor=0.1):**
- Epochs 0-30: Val loss improving, LR = 0.001
- Epochs 31-40: Val loss plateaus at 0.15, patience counting
- Epoch 41: Plateau detected, LR reduced to 0.0001
- Epochs 42-60: Val loss improving again with lower LR
- Epoch 61: Plateau again, LR reduced to 0.00001
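To see the patience mechanics concretely, here is a minimal simulation with a dummy model and synthetic validation losses (a sketch, not real training):
```python
import torch

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=10
)

# Synthetic metric: improves for 30 epochs, then plateaus
val_losses = [1.0 - 0.02 * e for e in range(30)] + [0.42] * 20
for epoch, val_loss in enumerate(val_losses):
    scheduler.step(val_loss)
    print(epoch, optimizer.param_groups[0]['lr'])  # drops 0.001 -> 0.0001 once patience runs out
```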
**Pros:**
- Adaptive - no manual tuning required
- Based on actual training progress
- Good for unknown optimal schedule
**Cons:**
- Can be too conservative (waits long before reducing)
- Requires validation metric (can't use train loss alone)
- May reduce LR too late or not enough
**Tuning Tips:**
- Smaller patience (5-10) for faster adaptation
- Larger patience (10-20) for more conservative
- Factor of 0.1 (10x) is standard, but 0.5 (2x) more gradual
**Best For:** Exploratory training, unknown optimal schedule, adaptive pipelines
## OneCycleLR (Fast Training)
**Use When:**
- Limited compute budget (want fast convergence)
- Training for relatively few epochs (10-30)
- Following FastAI-style training
- Want aggressive schedule for quick results
**How It Works:**
- Ramps UP from low LR to max_lr (first 30% by default)
- Ramps DOWN from max_lr to very low LR (remaining 70%)
- Steps EVERY BATCH (not every epoch) - CRITICAL DIFFERENCE
**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                         # Peak learning rate (TUNE THIS!)
    steps_per_epoch=len(train_loader),  # Batches per epoch
    epochs=20,                          # Total epochs
    pct_start=0.3,                      # Ramp up for first 30%
    anneal_strategy='cos',              # 'cos' or 'linear'
    div_factor=25,                      # initial_lr = max_lr / 25
    final_div_factor=10000              # final_lr = max_lr / 10000
)

# Training loop - NOTE: step() EVERY BATCH
for epoch in range(20):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # CALL EVERY BATCH, NOT EVERY EPOCH!
```
**Example Schedule (max_lr=0.1, 20 epochs, 400 batches/epoch):**
- Batches 0-2400 (epochs 0-6): LR ramps from 0.004 → 0.1
- Batches 2400-8000 (epochs 6-20): LR ramps from 0.1 → 0.00001
**CRITICAL: Tuning max_lr:**
OneCycleLR is VERY sensitive to max_lr choice. Too high = instability.
**Method 1 - LR Finder (RECOMMENDED):**
```python
# Run LR finder first (see LR Finder section)
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10  # Use 10x optimal as max_lr
```
**Method 2 - Manual tuning:**
- Start with max_lr = 0.1
- If training unstable, try 0.03, 0.01
- If training too slow, try 0.3, 1.0
**Pros:**
- Very fast convergence (fewer epochs needed)
- Strong final performance
- Popular in FastAI community
**Cons:**
- Sensitive to max_lr (requires tuning)
- Steps every batch (easy to mess up)
- Not ideal for very long training (>50 epochs)
**Common Mistakes:**
1. Calling scheduler.step() per epoch instead of per batch
2. Not tuning max_lr (using default blindly)
3. Using for very long training (OneCycle designed for shorter cycles)
**Best For:** FastAI-style training, limited compute budget, 10-30 epoch training
## Advanced OneCycleLR Tuning
If lowering max_lr doesn't resolve instability, try these advanced tuning options:
**1. Adjust pct_start (warmup fraction):**
```python
# Default: 0.3 (30% warmup, 70% cooldown)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.3)  # Default

# If unstable at peak: Increase to 0.4 or 0.5 (longer warmup)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.5)  # Gentler ramp to peak

# If unstable in cooldown: Decrease to 0.2 (shorter warmup, gentler descent)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       pct_start=0.2)
```
**2. Adjust div_factor (initial LR):**
```python
# Default: 25 (initial_lr = max_lr / 25)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       div_factor=25)  # Start at 0.004

# If unstable at start: Increase to 50 or 100 (start even lower)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       div_factor=100)  # Start at 0.001
```
**3. Adjust final_div_factor (final LR):**
```python
# Default: 10000 (final_lr = max_lr / 10000)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       final_div_factor=10000)  # End at 0.00001

# If unstable at end: Decrease to 1000 (end at higher LR)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20,
                       steps_per_epoch=len(train_loader),
                       final_div_factor=1000)  # End at 0.0001
```
**4. Add gradient clipping:**
```python
# In the training loop (train_step here is assumed to return the loss
# from the forward pass, without calling backward itself)
for batch in train_loader:
    optimizer.zero_grad()
    loss = train_step(model, batch, optimizer)
    loss.backward()
    # Clip gradients to prevent instability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```
**5. Consider OneCycle may not be right for your problem:**
- **Very deep networks (>100 layers):** May be too unstable for OneCycle's aggressive schedule
- **Large models (>100M params):** May need gentler schedule (Cosine + warmup)
- **Sensitive architectures (some transformers):** OneCycle's rapid LR changes can destabilize
**Alternative:** Use CosineAnnealing + warmup for more stable training:
```python
# More stable alternative to OneCycle
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
```
## LinearLR (Warmup)
**Use When:**
- Need warmup at training start
- Ramping up LR gradually over first few epochs
- Combining with another scheduler (SequentialLR)
**How It Works:**
- Linearly interpolates LR from start_factor to end_factor
- Typically used for warmup: start_factor=0.01, end_factor=1.0
**Implementation:**
```python
# Standalone linear warmup
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# More common: Combine with main scheduler
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)
main = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, main],
    milestones=[5]  # Switch after 5 epochs
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Example Schedule (base_lr=0.1):**
- Epoch 0: LR = 0.001 (1%)
- Epoch 1: LR = 0.0208 (20.8%)
- Epoch 2: LR = 0.0406 (40.6%)
- Epoch 3: LR = 0.0604 (60.4%)
- Epoch 4: LR = 0.0802 (80.2%)
- Epoch 5: LR = 0.1 (100%, then switch to main scheduler)
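These numbers can be reproduced with a dummy optimizer (a quick sketch):
```python
import torch

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # base_lr = 0.1
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, end_factor=1.0, total_iters=5
)
for epoch in range(6):
    print(epoch, f"{optimizer.param_groups[0]['lr']:.4f}")  # 0.0010, 0.0208, ..., 0.1000
    optimizer.step()  # dummy step so the scheduler sees the expected call order
    scheduler.step()
```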
**Best For:** Warmup phase for transformers and large models
## ExponentialLR (Continuous Decay)
**Use When:**
- Want smooth, continuous decay
- Simpler alternative to Cosine
- Prefer exponential over linear decay
**How It Works:**
- Multiply LR by gamma every epoch
- LR(epoch) = initial_lr * gamma^epoch
**Implementation:**
```python
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95  # Multiply by 0.95 each epoch
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Example Schedule (initial_lr=0.1, gamma=0.95):**
- Epoch 0: LR = 0.1
- Epoch 10: LR ≈ 0.0599
- Epoch 50: LR ≈ 0.0077
- Epoch 100: LR ≈ 0.00059
**Tuning gamma:**
- Want 10x decay over 100 epochs: gamma = 0.977 (0.1^(1/100))
- Want 100x decay over 100 epochs: gamma = 0.955 (0.01^(1/100))
- General formula: gamma = (target_lr / initial_lr)^(1/epochs)
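Applying the formula (a quick worked example):
```python
# Solve for gamma given a desired total decay over a known number of epochs
initial_lr = 0.1
target_lr = 0.001  # 100x decay
epochs = 100
gamma = (target_lr / initial_lr) ** (1 / epochs)
print(round(gamma, 3))  # 0.955
```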
**Pros:**
- Very smooth decay
- Simple to implement
**Cons:**
- Hard to intuit gamma value for desired final LR
- Less popular than Cosine (Cosine is better default)
**Best For:** Cases where you want exponential decay specifically
## LambdaLR (Custom Schedules)
**Use When:**
- Need custom schedule not provided by standard schedulers
- Implementing paper-specific schedule
- Advanced use cases (e.g., transformer inverse sqrt schedule)
**How It Works:**
- Provide function that computes LR multiplier for each epoch
- LR(epoch) = initial_lr * lambda(epoch)
**Implementation:**
```python
# Example: Warmup then constant
def warmup_lambda(epoch):
    if epoch < 5:
        return (epoch + 1) / 5  # Linear warmup
    else:
        return 1.0  # Constant after warmup

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=warmup_lambda
)

# Example: Transformer inverse square root schedule
# (step-based: call scheduler.step() every batch, not every epoch)
def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return min(step ** (-0.5), step * warmup_steps ** (-1.5))

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=transformer_schedule
)

# Example: Polynomial decay
def polynomial_decay(epoch):
    return (1 - epoch / 100) ** 0.9  # Decay to 0 at epoch 100

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=polynomial_decay
)
```
**Best For:** Custom schedules, implementing specific papers, advanced users
### 4. Warmup Strategies - CRITICAL FOR TRANSFORMERS
## Why Warmup is Essential
**Problem at Training Start:**
- Weights are randomly initialized
- Gradients can be very large and unstable
- BatchNorm statistics are uninitialized
- High LR can cause immediate divergence (NaN loss)
**Solution: Gradual LR Increase**
- Start with very low LR (1% of target)
- Linearly increase to target LR over first few epochs
- Allows model to stabilize before aggressive learning
**Quantitative Impact:**
- Transformers WITHOUT warmup: Often diverge or train very unstably
- Transformers WITH warmup: Stable training, better final performance
- Vision models: Warmup improves stability, sometimes +0.5-1% accuracy
## When Warmup is MANDATORY
**ALWAYS use warmup when:**
**Training transformers (ViT, BERT, GPT, T5, etc.)**
- Transformers REQUIRE warmup - not optional
- Without warmup, training often diverges
- Standard practice in all transformer papers
**Large batch sizes (>512)**
- Large batches → larger effective learning rate
- Warmup prevents early instability
- Standard for distributed training
**High initial learning rates**
- If starting with LR > 0.001, use warmup
- Warmup allows higher peak LR safely
**Training from scratch (not fine-tuning)**
- Random initialization needs gentle start
- Fine-tuning can often skip warmup (weights already good)
**Usually use warmup when:**
✅ Large models (>100M parameters)
✅ Using AdamW optimizer (common with transformers)
✅ Following modern training recipes
**May skip warmup when:**
❌ Fine-tuning pretrained models (weights already trained)
❌ Small learning rates (< 0.0001)
❌ Small models (<10M parameters)
❌ Established recipe without warmup (e.g., some CNN papers)
## Warmup Implementation Patterns
### Pattern 1: Linear Warmup + Cosine Decay (Most Common)
```python
import torch.optim.lr_scheduler as lr_scheduler

# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,  # Start at 1% of base LR
    end_factor=1.0,     # End at 100% of base LR
    total_iters=5       # Over 5 epochs
)

# Cosine decay for remaining 95 epochs
cosine = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=95,     # 95 epochs after warmup
    eta_min=1e-5  # Final LR
)

# Combine sequentially
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, cosine],
    milestones=[5]  # Switch to cosine after epoch 5
)

# Training loop
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Schedule Visualization (base_lr=0.001):**
- Epochs 0-4: Linear ramp from 0.00001 → 0.001 (warmup)
- Epochs 5-99: Cosine decay from 0.001 → 0.00001
**Use For:** Vision transformers, modern CNNs, most large-scale training
### Pattern 2: Linear Warmup + MultiStepLR
```python
# Warmup for 5 epochs
warmup = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.01,
    total_iters=5
)

# Step decay at 30, 60, 90
steps = lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],
    gamma=0.1
)

# Combine
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup, steps],
    milestones=[5]
)
```
**Use For:** Classical CNN training with warmup
### Pattern 3: Manual Warmup (More Control)
```python
import math

def get_lr_schedule(epoch, total_epochs, base_lr, warmup_epochs=5):
    """
    Custom schedule with warmup and cosine decay.
    """
    if epoch < warmup_epochs:
        # Linear warmup
        return base_lr * (epoch + 1) / warmup_epochs
    else:
        # Cosine decay
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Training loop
for epoch in range(100):
    lr = get_lr_schedule(epoch, total_epochs=100, base_lr=0.001)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    train_one_epoch(model, train_loader, optimizer)
```
**Use For:** Custom schedules, research, maximum control
### Pattern 4: Transformer-Style Warmup (Inverse Square Root)
```python
def transformer_lr_schedule(step, d_model, warmup_steps):
    """
    Transformer schedule from "Attention is All You Need".
    LR increases during warmup, then decreases proportionally to inverse sqrt of step.
    """
    step = step + 1  # 1-indexed
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: transformer_lr_schedule(step, d_model=512, warmup_steps=4000)
)

# Training loop - NOTE: step every BATCH for this schedule
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Step every batch
```
**Use For:** Transformer models (BERT, GPT), following original papers
## Warmup Duration Guidelines
**How many warmup epochs?**
- **Transformers:** 5-20 epochs (or 5-10% of total training)
- **Vision models:** 5-10 epochs
- **Very large models (>1B params):** 10-20 epochs
- **Small models:** 3-5 epochs
**Rule of thumb:** 5-10% of total training epochs
**Examples:**
- 100-epoch training: 5-10 epoch warmup
- 20-epoch training: 2-3 epoch warmup
- 300-epoch training: 15-30 epoch warmup
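A small helper encoding the 5-10% rule (a rough heuristic sketch, not a library function):
```python
def pick_warmup_epochs(total_epochs, frac=0.05, minimum=2):
    # Hypothetical helper: 5-10% of training, with a small floor for short runs
    return max(minimum, round(frac * total_epochs))

print(pick_warmup_epochs(100))  # 5
print(pick_warmup_epochs(20))   # 2
print(pick_warmup_epochs(300))  # 15
```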
## "But My Transformer Trained Fine Without Warmup"
Some users report training transformers without warmup successfully. Here's the reality:
**What "fine" actually means:**
- Training didn't diverge (NaN loss) - that's a low bar
- Got reasonable accuracy - but NOT optimal accuracy
- One successful run doesn't mean it's optimal or reliable
**What you're missing without warmup:**
**1. Performance gap (1-3% accuracy):**
```
Without warmup: Training works, achieves 85% accuracy
With warmup: Same model achieves 87-88% accuracy
```
That 2-3% is SIGNIFICANT:
- Difference between competitive and SOTA
- Difference between accepted and rejected paper
- Difference between passing and failing business metrics
**2. Training stability:**
```
Without warmup:
- Some runs diverge → need to restart with lower LR
- Sensitive to initialization seed
- Requires careful LR tuning
- Success rate: 60-80% of runs
With warmup:
- Stable training → consistent results
- Robust to initialization
- Wider stable LR range
- Success rate: 95-100% of runs
```
**3. Hyperparameter sensitivity:**
Without warmup:
- Very sensitive to initial LR choice (0.001 works, 0.0015 diverges)
- Sensitive to batch size
- Sensitive to optimizer settings
With warmup:
- More forgiving LR range (0.0005-0.002 all work)
- Less sensitive to batch size
- Robust optimizer configuration
**Empirical Evidence - Published Papers:**
Check transformer papers - ALL use warmup:
| Model | Paper | Warmup |
|-------|-------|--------|
| ViT | Dosovitskiy et al., 2020 | ✅ Linear, 10k steps |
| DeiT | Touvron et al., 2021 | ✅ Linear, 5 epochs |
| Swin | Liu et al., 2021 | ✅ Linear, 20 epochs |
| BERT | Devlin et al., 2018 | ✅ Linear, 10k steps |
| GPT-2 | Radford et al., 2019 | ✅ Linear warmup |
| GPT-3 | Brown et al., 2020 | ✅ Linear warmup |
| T5 | Raffel et al., 2020 | ✅ Inverse sqrt warmup |
**Every competitive transformer model uses warmup - there's a reason.**
**"But I got 85% accuracy without warmup!"**
Great! Now try with warmup and see if you get 87-88%. You probably will.
**The cost-benefit analysis:**
```python
# Cost: two lines of code
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
scheduler = SequentialLR(optimizer, [warmup, main], [5])
# Benefit:
# - 1-3% better accuracy
# - More stable training
# - Higher success rate
# - Wider stable hyperparameter range
```
**Recommendation:**
1. Run ablation study: Train your model with and without warmup
2. Compare: Final test accuracy, training stability, number of failed runs
3. You'll find warmup gives better results with minimal complexity
**Bottom line:** Just because something "works" doesn't mean it's optimal. Warmup is standard practice for transformers because it consistently improves results.
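A minimal ablation harness for that comparison might look like this (build_model and train_and_evaluate are assumed project helpers, not library functions):
```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

results = {}
for use_warmup in (False, True):
    model = build_model()  # assumed helper
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    if use_warmup:
        warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
        cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
        scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])
    else:
        scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)
    results[use_warmup] = train_and_evaluate(model, optimizer, scheduler)  # assumed helper
print(results)  # compare final accuracy and run-to-run stability
```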
### 5. LR Finder - Finding Optimal Initial LR
## What is LR Finder?
**Method from Leslie Smith (2015):** Cyclical Learning Rates paper
**Core Idea:**
1. Start with very small LR (1e-8)
2. Gradually increase LR (multiply by ~1.1 each batch)
3. Train for a few hundred steps, record loss at each LR
4. Plot loss vs LR
5. Choose LR where loss decreases fastest (steepest descent)
**Why It Works:**
- Too low LR: Loss decreases very slowly
- Optimal LR: Loss decreases rapidly (steepest slope)
- Too high LR: Loss plateaus or increases (instability)
**Typical Findings:**
- Loss decreases fastest at some LR (e.g., 0.01)
- Loss starts increasing at higher LR (e.g., 0.1)
- Choose LR slightly below fastest descent point (e.g., 0.003-0.01)
## LR Finder Implementation
```python
import copy

import torch
import matplotlib.pyplot as plt
import numpy as np

def find_lr(model, train_loader, optimizer, loss_fn, device,
            start_lr=1e-8, end_lr=10, num_iter=100, smooth_f=0.05):
    """
    LR Finder: Sweep learning rates and plot loss curve.

    Args:
        model: PyTorch model
        train_loader: Training data loader
        optimizer: Optimizer (will be modified)
        loss_fn: Loss function
        device: Device to train on
        start_lr: Starting learning rate (default: 1e-8)
        end_lr: Ending learning rate (default: 10)
        num_iter: Number of iterations (default: 100)
        smooth_f: Smoothing factor for loss (default: 0.05)

    Returns:
        lrs: List of learning rates tested
        losses: List of losses at each LR
    """
    # Save initial model state to restore later
    # (deepcopy: state_dict() returns references that training would mutate)
    model.train()
    initial_state = copy.deepcopy(model.state_dict())

    # Calculate LR multiplier for exponential increase
    lr_mult = (end_lr / start_lr) ** (1 / num_iter)

    lrs = []
    losses = []
    best_loss = float('inf')
    avg_loss = 0
    lr = start_lr

    # Iterate through training data
    iterator = iter(train_loader)
    for iteration in range(num_iter):
        try:
            data, target = next(iterator)
        except StopIteration:
            # Restart iterator if we run out of data
            iterator = iter(train_loader)
            data, target = next(iterator)

        # Set learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Forward pass
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)

        # Compute smoothed loss (exponential moving average)
        if iteration == 0:
            avg_loss = loss.item()
        else:
            avg_loss = smooth_f * loss.item() + (1 - smooth_f) * avg_loss

        # Record
        lrs.append(lr)
        losses.append(avg_loss)

        # Track best loss
        if avg_loss < best_loss:
            best_loss = avg_loss

        # Stop if loss explodes (>4x best loss)
        if avg_loss > 4 * best_loss:
            print(f"Stopping early at iteration {iteration}: loss exploded")
            break

        # Backward pass
        loss.backward()
        optimizer.step()

        # Increase learning rate
        lr *= lr_mult
        if lr > end_lr:
            break

    # Restore model to initial state
    model.load_state_dict(initial_state)

    # Plot results
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('LR Finder')
    plt.grid(True, alpha=0.3)

    # Mark suggested LR (a bit before the loss minimum)
    min_loss_idx = np.argmin(losses)
    suggested_lr = lrs[max(0, min_loss_idx - 5)]
    plt.axvline(suggested_lr, color='red', linestyle='--',
                label=f'Suggested LR: {suggested_lr:.2e}')
    plt.legend()
    plt.show()

    print(f"\nLR Finder Results:")
    print(f"  Minimum loss at LR: {lrs[min_loss_idx]:.2e}")
    print(f"  Suggested starting LR: {suggested_lr:.2e}")
    print(f"  (Choose LR where loss decreases fastest, before minimum)")
    return lrs, losses

def suggest_lr_from_finder(lrs, losses):
    """
    Suggest optimal learning rate from LR finder results.
    Strategy: Find LR where loss gradient is steepest (fastest decrease).
    """
    # Compute gradient of loss w.r.t. log(LR)
    log_lrs = np.log10(lrs)
    gradients = np.gradient(losses, log_lrs)

    # Find steepest descent (most negative gradient)
    steepest_idx = np.argmin(gradients)

    # Suggested LR is at steepest point or slightly before
    suggested_lr = lrs[steepest_idx]
    return suggested_lr
```
## Using LR Finder
### Basic Usage:
```python
# Setup model, optimizer, loss
model = YourModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1) # LR will be overridden
loss_fn = torch.nn.CrossEntropyLoss()
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
# Manually inspect plot and choose LR
# Look for: steepest descent point (fastest loss decrease)
# Typically: 10x lower than loss minimum
# Example: If minimum is at 0.1, choose 0.01 as starting LR
base_lr = 0.01 # Based on plot inspection
```
### Automated LR Selection:
```python
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
# Get suggested LR
suggested_lr = suggest_lr_from_finder(lrs, losses)
# Use suggested LR
optimizer = torch.optim.SGD(model.parameters(), lr=suggested_lr)
```
### Using with OneCycleLR:
```python
# Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses) # e.g., 0.01
# OneCycleLR: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10 # e.g., 0.1
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20
)
```
## Interpreting LR Finder Results
**Typical Plot Patterns:**
```
Loss
|
| X <-- Loss explodes (LR too high)
| X
| X
| X <-- Loss minimum (still too high)
| X
| X <-- CHOOSE HERE (steepest descent)
| X
| X
| X
|X___________
1e-8 1e-4 1e-2 0.1 1.0 10
Learning Rate
```
**How to Choose:**
1. **Steepest Descent (BEST):**
- Find where loss decreases fastest (steepest downward slope)
- This is optimal LR for rapid convergence
- Example: If steepest at 0.01, choose 0.01
2. **Before Minimum (SAFE):**
- Find minimum loss LR (e.g., 0.1)
- Choose 10x lower (e.g., 0.01)
- More conservative, safer choice
3. **Avoid:**
- Don't choose minimum itself (often too high)
- Don't choose where loss is flat (too low, slow progress)
- Don't choose where loss increases (way too high)
**Guidelines:**
- For SGD: Choose at steepest descent
- For Adam: Choose 10x below steepest (Adam more sensitive)
- For OneCycle: Use steepest as optimal, 5-10x as max_lr
## When to Use LR Finder
**Use LR Finder When:**
✅ Starting new project (unknown optimal LR)
✅ New architecture or dataset
✅ Tuning OneCycleLR (finding max_lr)
✅ Transitioning between optimizers
✅ Having training instability issues
**Can Skip When:**
❌ Following established paper recipe (LR already known)
❌ Fine-tuning (small LR like 1e-5 typically works)
❌ Very constrained time/resources
❌ Using adaptive methods (ReduceLROnPlateau)
**Best Practice:**
- Run LR finder once at project start
- Use found LR for all subsequent runs
- Re-run if changing optimizer, architecture, or batch size significantly
### 6. Scheduler Selection Guide
## Selection Flowchart
**1. What's your training duration?**
- **<10 epochs:** Constant LR or simple linear decay
- **10-30 epochs:** OneCycleLR (fast) or CosineAnnealingLR
- **>30 epochs:** CosineAnnealingLR or MultiStepLR
**2. What's your model type?**
- **Transformer (ViT, BERT, GPT):** CosineAnnealing + WARMUP (mandatory)
- **CNN (ResNet, EfficientNet):** MultiStepLR or CosineAnnealing + optional warmup
- **Small model:** Simpler schedulers (StepLR) or constant LR
**3. Do you know optimal schedule?**
- **Yes (from paper):** Use paper's schedule (MultiStepLR usually)
- **No (exploring):** ReduceLROnPlateau or CosineAnnealing
- **Want fast results:** OneCycleLR + LR finder
**4. What's your compute budget?**
- **High budget (100+ epochs):** CosineAnnealing or MultiStepLR
- **Low budget (10-20 epochs):** OneCycleLR
- **Adaptive budget:** ReduceLROnPlateau (reduces LR whenever progress stalls)
## Paper Recipe vs Modern Best Practices
**If goal is EXACT REPRODUCTION:**
Use paper's exact schedule (down to every detail):
```python
# Example: Reproducing ResNet paper (He et al., 2015)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# No warmup (paper didn't use it)
# Train for 100 epochs
```
**Rationale:**
- Reproduce results exactly
- Enable apples-to-apples comparison
- Validate paper's claims
- Establish baseline before improvements
**If goal is BEST PERFORMANCE:**
Use modern recipe (benefit from years of community learning):
```python
# Modern equivalent: ResNet with modern practices
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs
```
**Rationale:**
- Typically +0.5-2% better accuracy than original paper
- More stable training
- Reflects 5-10 years of community improvements
- SOTA competitive performance
**Evolution of LR Scheduling Practices:**
**Early Deep Learning (2012-2016):**
- Scheduler: StepLR with manual milestones
- Warmup: Not used (not yet discovered)
- Optimizer: SGD with momentum
- Examples: AlexNet, VGG, ResNet, Inception
**Mid Period (2017-2019):**
- Scheduler: CosineAnnealing introduced, OneCycleLR popular
- Warmup: Starting to be used for large batch training
- Optimizer: SGD still dominant, Adam increasingly common
- Examples: ResNeXt, DenseNet, MobileNet
**Modern Era (2020-2025):**
- Scheduler: CosineAnnealing default, OneCycle for fast training
- Warmup: Standard practice (mandatory for transformers)
- Optimizer: AdamW increasingly preferred for transformers
- Examples: ViT, EfficientNet, ConvNeXt, Swin, DeiT
**Practical Workflow:**
**Step 1: Reproduce paper recipe**
```python
# Use exact paper settings
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Should match paper's reported accuracy (e.g., 76.5%)
```
**Step 2: Validate reproduction**
- If you get 76.5% (matches paper): ✅ Reproduction successful
- If you get 74% (2% worse): ❌ Implementation bug, fix first
- If you get 78% (2% better): ✅ Great! Proceed to modern recipe
**Step 3: Try modern recipe**
```python
# Add warmup + cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Expect +0.5-2% improvement (e.g., 77-78.5%)
```
**Step 4: Compare results**
| Version | Accuracy | Notes |
|---------|----------|-------|
| Paper recipe | 76.5% | Baseline (reproduces paper) |
| Modern recipe | 78.0% | +1.5% from warmup + cosine |
**When to Use Which:**
**Use Paper Recipe:**
- Publishing reproduction study
- Comparing to paper's baseline
- Validating implementation correctness
- Research requiring exact reproducibility
**Use Modern Recipe:**
- Building production system (want best performance)
- Competing in benchmark (need SOTA results)
- Publishing new method (should use modern baseline)
- Limited compute (modern practices more efficient)
**Trade-off Table:**
| Aspect | Paper Recipe | Modern Recipe |
|--------|--------------|---------------|
| Reproducibility | ✅ Exact | ⚠️ Better but different |
| Performance | ⚠️ Good (for its time) | ✅ Better (+0.5-2%) |
| Comparability | ✅ To paper | ✅ To SOTA |
| Compute efficiency | ⚠️ May be suboptimal | ✅ Modern optimizations |
| Training stability | ⚠️ Variable | ✅ More stable (warmup) |
**Bottom Line:**
Both are valid depending on your goal:
- **Research/reproduction:** Start with paper recipe
- **Production/competition:** Use modern recipe
- **Best practice:** Validate with paper recipe, deploy with modern recipe
## Domain-Specific Recommendations
### Image Classification (CNNs)
**Standard Recipe (ResNet, VGG):**
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Train for 100 epochs
```
**Modern Recipe (EfficientNet, RegNet):**
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# Train for 100 epochs
```
### Vision Transformers (ViT, Swin, DeiT)
**Standard Recipe:**
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=290, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# Train for 300 epochs
# WARMUP IS MANDATORY
```
### NLP Transformers (BERT, GPT, T5)
**Standard Recipe:**
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
# Linear warmup + linear decay
def lr_lambda(step):
    warmup_steps = 10000
    total_steps = 100000
    if step < warmup_steps:
        return step / warmup_steps
    else:
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Step every batch, not epoch
# WARMUP IS MANDATORY
```
### Object Detection (Faster R-CNN, YOLO)
**Standard Recipe:**
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)
# Train for 26 epochs
```
### Fast Training (Limited Compute)
**FastAI Recipe:**
```python
# Run LR finder first
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)
max_lr = optimal_lr * 10
optimizer = torch.optim.SGD(model.parameters(), lr=max_lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3
)
# Train for 20 epochs
# Step every batch
```
### 7. Common Scheduling Pitfalls
## Pitfall 1: No Warmup for Transformers
**WRONG:**
```python
# Training Vision Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# ❌ No warmup - training will be very unstable or diverge
```
**RIGHT:**
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# ✅ Warmup prevents early instability
```
**Why It Matters:**
- Transformers with high LR at start → NaN loss, divergence
- Random initialization needs gradual LR ramp
- 5-10 epoch warmup is STANDARD practice
**How to Detect:**
- Loss is NaN or explodes in first few epochs
- Training very unstable early, stabilizes later
- Gradients extremely large at start
## Pitfall 2: Wrong scheduler.step() Placement
**WRONG (Most Schedulers):**
```python
for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ❌ Stepping every batch, not every epoch
```
**RIGHT:**
```python
for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
    scheduler.step()  # ✅ Step AFTER each epoch
```
**EXCEPTION (OneCycleLR):**
```python
for epoch in range(epochs):
    for batch in train_loader:
        loss = train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # ✅ OneCycle steps EVERY BATCH
```
**Why It Matters:**
- CosineAnnealing with T_max=100 expects 100 steps (epochs)
- Stepping every batch: If 390 batches/epoch, LR decays in <1 epoch
- LR reaches minimum way too fast
**How to Detect:**
- LR decays to minimum in first epoch
- Print LR each step: `print(optimizer.param_groups[0]['lr'])`
- Check if LR changes every batch (wrong) vs every epoch (right)
**Rule:**
- **Most schedulers (Step, Cosine, Exponential):** Step per epoch
- **OneCycleLR only:** Step per batch
- **ReduceLROnPlateau:** Step per epoch with validation metric
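The mistake is easy to demonstrate with a dummy optimizer: step a per-epoch scheduler once per batch and the schedule is exhausted within the first epoch (a sketch):
```python
import torch

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for batch in range(100):  # mistakenly stepping once per batch
    optimizer.step()
    scheduler.step()
print(optimizer.param_groups[0]['lr'])  # ~0: hit the minimum after 100 *batches*, not 100 epochs
```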
## Pitfall 3: scheduler.step() Before optimizer.step()
**WRONG:**
```python
loss.backward()
scheduler.step() # ❌ Wrong order
optimizer.step()
```
**RIGHT:**
```python
loss.backward()
optimizer.step() # ✅ Update weights first
scheduler.step() # Then update LR
```
**Why It Matters:**
- Scheduler updates LR based on current epoch/step
- Should update weights with current LR, THEN move to next LR
- Wrong order = off-by-one error in schedule
**How to Detect:**
- Usually subtle, hard to notice
- Best practice: always optimizer.step() then scheduler.step()
## Pitfall 4: Not Passing Metric to ReduceLROnPlateau
**WRONG:**
```python
scheduler = ReduceLROnPlateau(optimizer)
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # ❌ No metric passed
```
**RIGHT:**
```python
scheduler = ReduceLROnPlateau(optimizer, mode='min')
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    scheduler.step(val_loss)  # ✅ Pass validation metric
```
**Why It Matters:**
- ReduceLROnPlateau NEEDS metric to detect plateau
- Without metric, scheduler doesn't know when to reduce LR
- Will get error or incorrect behavior
**How to Detect:**
- Error message: "ReduceLROnPlateau needs a metric"
- LR never reduces even when training plateaus
## Pitfall 5: Using OneCycle for Long Training
**SUBOPTIMAL:**
```python
# Training for 200 epochs
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=200, steps_per_epoch=len(train_loader))
# ❌ OneCycle designed for shorter training (10-30 epochs)
```
**BETTER:**
```python
# For long training, use Cosine
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=10)
cosine = CosineAnnealingLR(optimizer, T_max=190, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [10])
# ✅ Cosine better suited for long training
```
**Why It Matters:**
- OneCycle's aggressive up-then-down profile works for short training
- For long training, gentler cosine decay more stable
- OneCycle typically used for 10-30 epochs in FastAI style
**When to Use Each:**
- **OneCycle:** 10-30 epochs, limited compute, want fast results
- **Cosine:** 50+ epochs, full training, want best final performance
## Pitfall 6: Not Tuning max_lr for OneCycle
**WRONG:**
```python
# Just guessing max_lr
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=20, steps_per_epoch=len(train_loader))
# ❌ Random max_lr without tuning
# Might be too high (unstable) or too low (slow)
```
**RIGHT:**
```python
# Step 1: Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses) # e.g., 0.01
# Step 2: Use 5-10x optimal as max_lr
max_lr = optimal_lr * 10 # e.g., 0.1
scheduler = OneCycleLR(optimizer, max_lr=max_lr, epochs=20, steps_per_epoch=len(train_loader))
# ✅ Tuned max_lr based on LR finder
```
**Why It Matters:**
- OneCycle is VERY sensitive to max_lr
- Too high: Training unstable, loss explodes
- Too low: Slow training, underperforms
- LR finder finds optimal, use 5-10x as max_lr
**How to Tune:**
1. Run LR finder (see LR Finder section)
2. Find optimal LR (steepest descent point)
3. Use 5-10x optimal as max_lr for OneCycle
4. If still unstable, reduce max_lr (try 3x, 2x)
## Pitfall 7: Forgetting to Adjust T_max After Adding Warmup
**WRONG:**
```python
# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=100) # ❌ Should be 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
```
**RIGHT:**
```python
# Want 100 epoch training
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95) # ✅ 100 - 5 = 95
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
```
**Why It Matters:**
- Total training is warmup + main schedule
- If warmup is 5 epochs and cosine is 100, total is 105 epochs
- T_max should be (total_epochs - warmup_epochs)
**How to Calculate:**
```python
total_epochs = 100
warmup_epochs = 5
T_max = total_epochs - warmup_epochs # 95
```
## Pitfall 8: Using Same LR for All Param Groups
**SUBOPTIMAL:**
```python
# Fine-tuning: applying same LR to all layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ❌ Backbone and head both use 1e-3
```
**BETTER:**
```python
# Fine-tuning: lower LR for pretrained backbone, higher for new head
optimizer = torch.optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-4},  # Lower LR for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher LR for random init
])
scheduler = CosineAnnealingLR(optimizer, T_max=100)
# ✅ Scheduler applies to all param groups proportionally
```
**Why It Matters:**
- Pretrained layers need smaller LR (already trained)
- New layers need higher LR (random initialization)
- Schedulers work with param groups automatically
**Note:** Schedulers multiply all param groups by same factor, preserving relative ratios
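A quick check of that note with a dummy model and two param groups (a sketch):
```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam([
    {'params': [model.weight], 'lr': 1e-4},  # stands in for a pretrained backbone
    {'params': [model.bias], 'lr': 1e-3},    # stands in for a fresh head
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for _ in range(50):
    optimizer.step()
    scheduler.step()
lrs = [g['lr'] for g in optimizer.param_groups]
print(lrs, lrs[1] / lrs[0])  # both decayed by the same cosine factor; ratio stays 10.0
```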
## Pitfall 9: Not Monitoring LR During Training
**PROBLEM:**
- Schedule not behaving as expected
- Hard to debug without visibility into LR
**SOLUTION:**
```python
# Log LR every epoch
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6f}")
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

# Or use TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    writer.add_scalar('Learning Rate', current_lr, epoch)
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Best Practice:**
- Always log LR to console or TensorBoard
- Plot LR schedule before training (see next section)
- Verify schedule matches expectations
## Pitfall 10: Not Validating Schedule Before Training
**PROBLEM:**
- Run full training, discover schedule was wrong
- Waste compute on incorrect schedule
**SOLUTION: Dry-run the schedule:**
```python
def plot_schedule(scheduler_fn, num_epochs):
    """
    Plot LR schedule before training to verify it's correct.
    """
    # Create dummy model and optimizer
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = scheduler_fn(optimizer)
    lrs = []
    for epoch in range(num_epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        optimizer.step()  # Dummy step
        scheduler.step()
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(lrs)
    plt.xlabel('Epoch')
    plt.ylabel('Learning Rate')
    plt.title('LR Schedule')
    plt.grid(True, alpha=0.3)
    plt.show()

# Usage
def my_scheduler(opt):
    warmup = LinearLR(opt, start_factor=0.01, total_iters=5)
    cosine = CosineAnnealingLR(opt, T_max=95)
    return SequentialLR(opt, [warmup, cosine], [5])

plot_schedule(my_scheduler, num_epochs=100)
# Verify plot looks correct BEFORE training
```
**Best Practice:**
- Plot schedule before every major training run
- Verify warmup duration, decay shape, final LR
- Catch mistakes early (T_max wrong, step placement, etc.)
### 8. Modern Best Practices (2024-2025)
## Vision Models (CNNs, ResNets, ConvNeXt)
**Standard Recipe:**
```python
# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: MultiStepLR or CosineAnnealing
# Option 1: MultiStepLR (classical)
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# Option 2: CosineAnnealing (modern)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])

# Training
epochs = 100
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Key Points:**
- SGD with momentum (0.9) is standard for CNNs
- LR = 0.1 for batch size 256 (scale linearly for other batch sizes)
- Warmup optional but beneficial (5 epochs)
- CosineAnnealing increasingly preferred over MultiStepLR
## Vision Transformers (ViT, Swin, DeiT)
**Standard Recipe:**
```python
# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.05,
    betas=(0.9, 0.999)
)

# Scheduler: MUST include warmup
warmup_epochs = 10
cosine_epochs = 290
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=cosine_epochs, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])

# Training
epochs = 300
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Key Points:**
- AdamW optimizer (not SGD)
- Warmup is MANDATORY (10-20 epochs)
- Long training (300 epochs typical)
- LR = 1e-3 for batch size 512 (scale for other sizes)
- Cosine decay to very small LR (1e-5)
**Why Warmup is Critical for ViT:**
- Self-attention layers highly sensitive to initialization
- High LR at start causes gradient explosion
- Warmup allows attention patterns to stabilize
## NLP Transformers (BERT, GPT, T5)
**Standard Recipe:**
```python
# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999)
)

# Scheduler: Linear warmup + linear decay (or inverse sqrt)
total_steps = len(train_loader) * epochs
warmup_steps = int(0.1 * total_steps)  # 10% warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps
    else:
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Training: step EVERY BATCH
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Step every batch, not epoch
```
**Key Points:**
- AdamW optimizer
- Warmup is MANDATORY (typically 10% of training)
- Linear warmup + linear decay (BERT, GPT-2 style)
- Step scheduler EVERY BATCH (not every epoch)
- LR typically 1e-4 to 5e-4
**Alternative: Inverse Square Root (Original Transformer):**
```python
d_model = 512  # model hidden size (must be defined for this schedule)

def transformer_schedule(step):
    warmup_steps = 4000
    step = step + 1
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = LambdaLR(optimizer, transformer_schedule)
```
## Object Detection (Faster R-CNN, YOLO, DETR)
**Standard Recipe (Two-stage detectors):**
```python
# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: MultiStepLR with short schedule
scheduler = MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)

# Training
epochs = 26  # Shorter than classification
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Standard Recipe (Transformer detectors like DETR):**
```python
# Optimizer
optimizer = torch.optim.AdamW(
    [
        {'params': model.backbone.parameters(), 'lr': 1e-5},    # Lower for backbone
        {'params': model.transformer.parameters(), 'lr': 1e-4}  # Higher for transformer
    ],
    weight_decay=1e-4
)

# Scheduler: Step decay
scheduler = MultiStepLR(optimizer, milestones=[200], gamma=0.1)

# Training: Long schedule for DETR
epochs = 300
```
**Key Points:**
- Detection typically shorter training than classification
- Lower LR (0.02 vs 0.1) due to task difficulty
- DETR needs very long training (300 epochs)
## Semantic Segmentation (U-Net, DeepLab, SegFormer)
**Standard Recipe (CNN-based):**
```python
# Optimizer
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4
)

# Scheduler: Polynomial decay (common in segmentation)
total_epochs = 100

def poly_lr_lambda(epoch):
    return (1 - epoch / total_epochs) ** 0.9

scheduler = LambdaLR(optimizer, poly_lr_lambda)

# Training
for epoch in range(total_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Key Points:**
- Polynomial decay common in segmentation (DeepLab papers)
- Lower initial LR (0.01) than classification
- Power of 0.9 standard
## Fast Training / Limited Compute (FastAI Style)
**OneCycle Recipe:**
```python
# Step 1: Find optimal LR
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)  # e.g., 0.01
max_lr = optimal_lr * 10  # e.g., 0.1

# Step 2: OneCycleLR
optimizer = torch.optim.SGD(model.parameters(), lr=max_lr, momentum=0.9)
scheduler = OneCycleLR(
    optimizer,
    max_lr=max_lr,
    steps_per_epoch=len(train_loader),
    epochs=20,
    pct_start=0.3,  # 30% warmup, 70% cooldown
    anneal_strategy='cos'
)

# Step 3: Train (step every batch)
for epoch in range(20):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        optimizer.step()
        scheduler.step()  # Every batch
```
**Key Points:**
- Use LR finder to tune max_lr (CRITICAL)
- Train for fewer epochs (10-30)
- Step scheduler every batch
- Often achieves 90-95% of full training performance in 20-30% of time
## Fine-Tuning Pretrained Models
**Standard Recipe:**
```python
# Optimizer: Different LRs for backbone vs head
optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # Very low for pretrained
    {'params': model.head.parameters(), 'lr': 1e-3}       # Higher for new head
])

# Scheduler: Simple cosine or even constant
# Option 1: Constant LR (fine-tuning often doesn't need scheduling)
scheduler = None
# Option 2: Gentle cosine decay
scheduler = CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6)

# Training: Short duration
epochs = 10  # Fine-tuning is quick
for epoch in range(epochs):
    train_one_epoch(model, train_loader, optimizer)
    if scheduler:
        scheduler.step()
```
**Key Points:**
- Much lower LR for pretrained parts (1e-5)
- Higher LR for new/random parts (1e-3)
- Short training (3-10 epochs)
- Scheduling often optional (constant LR works)
- No warmup needed (weights already good)
## Large Batch Training (Batch Size > 1024)
**Standard Recipe:**
```python
# Linear LR scaling rule: LR scales with batch size
base_lr = 0.1 # For batch size 256
batch_size = 2048
scaled_lr = base_lr * (batch_size / 256) # 0.8 for batch 2048
# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
# Scheduler: MUST include warmup (critical for large batch)
warmup_epochs = 5
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [warmup_epochs])
# Training
epochs = 100
for epoch in range(epochs):
train_one_epoch(model, train_loader, optimizer)
scheduler.step()
```
**Key Points:**
- Scale LR linearly with batch size (LR = base_lr * batch_size / base_batch_size)
- Warmup is MANDATORY for large batch (5-10 epochs minimum)
- Longer warmup for very large batches (>4096: use 10-20 epochs)
**Why Warmup Critical for Large Batch:**
- Large batch = larger effective LR
- High effective LR at start causes instability
- Warmup prevents divergence
## Modern Defaults by Domain (2025)
| Domain | Optimizer | Scheduler | Warmup | Epochs |
|--------|-----------|-----------|--------|--------|
| Vision (CNN) | SGD (0.9) | Cosine or MultiStep | Optional (5) | 100-200 |
| Vision (ViT) | AdamW | Cosine | MANDATORY (10-20) | 300 |
| NLP (BERT/GPT) | AdamW | Linear | MANDATORY (10%) | Varies |
| Detection | SGD | MultiStep | Optional | 26-300 |
| Segmentation | SGD | Polynomial | Optional | 100 |
| Fast/OneCycle | SGD | OneCycle | Built-in | 10-30 |
| Fine-tuning | AdamW | Constant/Cosine | No | 3-10 |
| Large Batch | SGD | Cosine | MANDATORY (5-20) | 100-200 |
### 9. Debugging Scheduler Issues
## Issue: Training Unstable / Loss Spikes
**Symptoms:**
- Loss increases suddenly during training
- NaN or Inf loss
- Training was stable, then becomes unstable
**Likely Causes:**
1. **No warmup (transformers, large models)**
- Solution: Add 5-10 epoch warmup
2. **LR too high at start**
- Solution: Lower initial LR or extend warmup
3. **LR drop too sharp (MultiStepLR)**
- Solution: Use gentler scheduler (Cosine) or smaller gamma
**Debugging Steps:**
```python
# 1. Print LR every epoch
for epoch in range(epochs):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")
    # 2. Check if loss spike correlates with LR change
    loss = train_one_epoch(model, train_loader, optimizer)
    print(f"  Loss = {loss:.4f}")
    scheduler.step()

# 3. Plot LR and loss together
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(lr_history)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.subplot(1, 2, 2)
plt.plot(loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
```
**Solutions:**
- Add/extend warmup: `LinearLR(optimizer, start_factor=0.01, total_iters=10)`
- Lower initial LR: `lr = 0.01` instead of `lr = 0.1`
- Gentler scheduler: `CosineAnnealingLR` instead of `MultiStepLR`
- Gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` (placement sketch below)
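Clipping only helps if it runs between `loss.backward()` and `optimizer.step()`. A minimal placement sketch, assuming a dict-style batch and a `loss_fn` (both hypothetical stand-ins for this document's `train_step`):
```python
import torch

for batch in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch['x']), batch['y'])  # assumed batch format
    loss.backward()
    # Clip AFTER backward() (gradients now exist) and BEFORE optimizer.step()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```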
## Issue: Training Plateaus Too Early
**Symptoms:**
- Loss stops decreasing after 20-30 epochs
- Validation accuracy flat
- Training seems stuck
**Likely Causes:**
1. **Not using scheduler (constant LR too high for current regime)**
- Solution: Add scheduler (CosineAnnealing or ReduceLROnPlateau)
2. **Scheduler reducing LR too early**
- Solution: Push back milestones or increase patience
3. **LR already too low**
- Solution: Check current LR, may need to restart with higher initial LR
**Debugging Steps:**
```python
# Check current LR
current_lr = optimizer.param_groups[0]['lr']
print(f"Current LR: {current_lr:.6e}")
# If LR very low (<1e-6), plateau might be due to other issues (architecture, data, etc.)
# If LR still high (>1e-3), should reduce LR to break plateau
```
**Solutions:**
- Add ReduceLROnPlateau: Automatically reduces when plateau detected
```python
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
```
- Manual LR reduction: If at epoch 30 and plateaued, reduce LR by 10x now
```python
for param_group in optimizer.param_groups:
    param_group['lr'] *= 0.1
```
- Use scheduler from start next time:
```python
scheduler = CosineAnnealingLR(optimizer, T_max=100)
```
## Issue: Poor Final Performance (Train > Val Gap)
**Symptoms:**
- Training accuracy high (95%), validation lower (88%)
- Model overfitting
- Test performance disappointing
**Likely Causes (Scheduling Related):**
1. **LR not low enough at end**
- Solution: Lower eta_min or extend training
2. **Not using scheduler (constant LR doesn't fine-tune)**
- Solution: Add scheduler to reduce LR in late training
3. **Scheduler ending too early**
- Solution: Extend training or adjust T_max
**Debugging Steps:**
```python
# Check final LR
final_lr = optimizer.param_groups[0]['lr']
print(f"Final LR: {final_lr:.6e}")
# Final LR should be very low (1e-5 to 1e-6)
# If final LR still high (>1e-3), model didn't fine-tune properly
```
**Solutions:**
- Lower eta_min: `CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)`
- Extend training: Train for more epochs to allow LR to decay further
- Add late-stage fine-tuning:
```python
# After main training, do 10 more epochs with very low LR
for param_group in optimizer.param_groups:
    param_group['lr'] = 1e-5
for epoch in range(10):
    train_one_epoch(model, train_loader, optimizer)
```
**Note:** If train-val gap large, may also need regularization (not scheduling issue)
## Issue: LR Decays Too Fast
**Symptoms:**
- LR reaches minimum in first few epochs
- Training very slow after initial epochs
- Looks like constant very low LR
**Likely Causes:**
1. **scheduler.step() called every batch instead of epoch**
- Solution: Move scheduler.step() outside batch loop
2. **T_max too small (e.g., T_max=10 but training for 100 epochs)**
- Solution: Set T_max = total_epochs
3. **Using OneCycle unintentionally**
- Solution: Verify scheduler type
**Debugging Steps:**
```python
# Print LR for the first few epochs
for epoch in range(10):
    print(f"Epoch {epoch}: LR = {optimizer.param_groups[0]['lr']:.6e}")
    for batch in train_loader:
        train_step(model, batch, optimizer)
        # scheduler.step()  # ❌ If this is here, that's the bug
    scheduler.step()  # ✅ Should be here
```
**Solutions:**
- Move scheduler.step() to correct location (after epoch, not after batch)
- Fix T_max: `T_max = total_epochs` or `T_max = total_epochs - warmup_epochs`
- Verify scheduler type: `print(type(scheduler))`; a dry-run plotting sketch follows below
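A cheap way to catch all three causes before burning compute is to dry-run the schedule on a throwaway optimizer and plot it. A minimal sketch; `plot_schedule` is a hypothetical helper and `CosineAnnealingLR` is just an example schedule:
```python
import torch
import matplotlib.pyplot as plt
from torch.optim.lr_scheduler import CosineAnnealingLR

def plot_schedule(make_scheduler, lr=0.1, steps=100):
    # Throwaway optimizer over one dummy parameter, so no model or data is needed
    dummy_opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=lr)
    sched = make_scheduler(dummy_opt)
    lrs = []
    for _ in range(steps):
        lrs.append(dummy_opt.param_groups[0]['lr'])
        sched.step()  # PyTorch may warn about stepping before optimizer.step(); harmless in a dry run
    plt.plot(lrs)
    plt.xlabel('step() calls')
    plt.ylabel('Learning Rate')
    plt.show()

plot_schedule(lambda opt: CosineAnnealingLR(opt, T_max=100), steps=100)
```
Pass the exact constructor you plan to train with and set `steps` to your real epoch (or batch) count; a schedule that hits its minimum in the first few steps is immediately visible.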
## Issue: OneCycleLR Not Working
**Symptoms:**
- Training with OneCycle becomes unstable around peak LR
- Loss increases during ramp-up phase
- Worse performance than expected
**Likely Causes:**
1. **max_lr too high**
- Solution: Run LR finder, use lower max_lr
2. **scheduler.step() placement wrong (should be per batch)**
- Solution: Call scheduler.step() every batch
3. **Not tuning max_lr**
- Solution: Use LR finder to find optimal, use 5-10x as max_lr
**Debugging Steps:**
```python
# Plot the LR schedule (a dry run: just step the scheduler, no training)
import matplotlib.pyplot as plt

lrs = []
for epoch in range(epochs):
    for batch in train_loader:
        lrs.append(optimizer.param_groups[0]['lr'])
        scheduler.step()

plt.plot(lrs)
plt.xlabel('Batch')
plt.ylabel('Learning Rate')
plt.title('OneCycle LR Schedule')
plt.show()
# Should see: ramp up to max_lr, then ramp down
# If it doesn't look like that, the scheduler.step() placement is wrong
```
**Solutions:**
- Run LR finder first:
```python
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
optimal_lr = suggest_lr_from_finder(lrs, losses)
max_lr = optimal_lr * 10  # or try 5x or 3x if 10x is unstable
```
- Lower max_lr manually:
```python
# If max_lr=0.1 unstable, try 0.03 or 0.01
scheduler = OneCycleLR(optimizer, max_lr=0.03, ...)
```
- Verify step() every batch:
```python
for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)  # forward/backward + optimizer.step()
        scheduler.step()  # ✅ every batch
```
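One more OneCycle pitfall worth checking: the scheduler's total step count must match what the loop actually runs. A minimal construction sketch using the `steps_per_epoch` and `epochs` arguments:
```python
from torch.optim.lr_scheduler import OneCycleLR

scheduler = OneCycleLR(
    optimizer,
    max_lr=0.03,
    steps_per_epoch=len(train_loader),  # one scheduler.step() per batch
    epochs=epochs,                      # total steps = epochs * steps_per_epoch
)
# Stepping past the total raises a ValueError, a common symptom of a
# mismatch between these arguments and the actual loop.
```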
## Issue: Warmup Not Working
**Symptoms:**
- Training still unstable in first few epochs despite warmup
- Loss spikes even with warmup
- NaN loss at start
**Likely Causes:**
1. **Warmup too short (need longer ramp-up)**
- Solution: Extend warmup from 5 to 10-20 epochs
2. **start_factor too high (not starting low enough)**
- Solution: Use start_factor=0.001 instead of 0.01
3. **Warmup not actually being used (SequentialLR bug)**
- Solution: Verify warmup scheduler is active early
**Debugging Steps:**
```python
# Print LR for the first 10 epochs
for epoch in range(10):
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.6e}")
    # Should see a gradual increase from low to high;
    # if it jumps straight to the peak, warmup isn't active
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```
**Solutions:**
- Extend warmup (or switch to per-step warmup; see the sketch after this list):
```python
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=20) # 20 epochs
```
- Lower start_factor:
```python
warmup = LinearLR(optimizer, start_factor=0.001, total_iters=5) # Start at 0.1%
```
- Verify SequentialLR milestone:
```python
# Milestone should match warmup duration
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[20])
```
- Add gradient clipping as additional safeguard:
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
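If even one epoch at full LR is too coarse, warmup can be done per batch instead of per epoch. A minimal per-step warmup sketch using `LambdaLR`; the 1000-step length is an assumption to tune per problem:
```python
from torch.optim.lr_scheduler import LambdaLR

warmup_steps = 1000  # assumption: roughly 1-10% of total training steps

# Scales LR linearly from ~0 up to the optimizer's base LR, then holds it there
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for epoch in range(epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()  # per batch, because this warmup is measured in steps
```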
## Issue: ReduceLROnPlateau Never Reduces LR
**Symptoms:**
- Using ReduceLROnPlateau for 50+ epochs
- Validation loss clearly plateaued
- Learning rate never reduces
**Debugging Steps:**
**1. Verify metric is being passed:**
```python
val_loss = validate(model, val_loader)
print(f"Epoch {epoch}: val_loss = {val_loss:.6f}") # Print metric
scheduler.step(val_loss) # Ensure passing metric
```
**2. Check mode is correct:**
```python
# For loss (want to minimize):
scheduler = ReduceLROnPlateau(optimizer, mode='min')
# For accuracy (want to maximize):
scheduler = ReduceLROnPlateau(optimizer, mode='max')
```
With the wrong mode, the scheduler treats movement in the wrong direction as improvement: random upticks on a noisy, plateaued loss keep resetting the patience counter, so the LR never drops.
**3. Check the threshold isn't too permissive:**
```python
# Default threshold=1e-4 with threshold_mode='rel': an improvement must
# exceed 0.01% to count as progress.
# Example: val_loss 0.5000 -> 0.4999 is a 0.02% improvement, so it still
# counts, resets the patience counter, and postpones any LR reduction.
# If the metric keeps creeping down by tiny amounts, the scheduler never fires.
# Solution: RAISE the threshold so negligible improvements are ignored
scheduler = ReduceLROnPlateau(optimizer, threshold=1e-3)  # require >0.1% improvement
```
**4. Enable verbose logging:**
```python
scheduler = ReduceLROnPlateau(optimizer, verbose=True)
# Prints "Epoch 00042: reducing learning rate of group 0 to 1.0000e-04"
# whenever it reduces.
# Note: `verbose` is deprecated in recent PyTorch releases; printing
# optimizer.param_groups[0]['lr'] each epoch is the portable alternative.
```
**5. Verify plateau is real:**
```python
# Plot validation loss over time
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(val_losses)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Validation Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()
# Check: Is loss truly flat, or still slowly improving?
# Tiny improvements (0.4500 → 0.4499) count as progress
```
**6. Check cooldown isn't preventing reduction:**
```python
# Default cooldown=0, but if set higher, prevents reduction after recent reduction
scheduler = ReduceLROnPlateau(optimizer, cooldown=0) # No cooldown
```
**Common Causes Table:**
| Problem | Symptom | Solution |
|---------|---------|----------|
| Not passing metric | Error or no reduction | `scheduler.step(val_loss)` |
| Wrong mode | Never reduces | `mode='min'` for loss, `mode='max'` for accuracy |
| Threshold too permissive | Tiny improvements keep resetting patience | Raise to `threshold=1e-3` |
| Metric still improving | Not actually plateaued | Increase patience or accept slow progress |
| Cooldown active | Reducing but waiting | Set `cooldown=0` |
| Min_lr reached | Can't reduce further | Check current LR, may be at min_lr |
**Example Fix:**
```python
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',            # for loss minimization
    factor=0.1,            # reduce by 10x
    patience=10,           # wait 10 epochs without meaningful improvement
    threshold=1e-3,        # ignore improvements smaller than 0.1%
    threshold_mode='rel',
    cooldown=0,            # no cooldown period
    min_lr=1e-6,           # minimum LR allowed
)

# Training loop
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validate(model, val_loader)
    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")
    scheduler.step(val_loss)  # pass the validation metric
    # Print current LR (works on every PyTorch version)
    current_lr = optimizer.param_groups[0]['lr']
    print(f"  Current LR: {current_lr:.6e}")
```
**Advanced Debugging:**
If still not reducing, manually check scheduler logic:
```python
# Get scheduler state
print(f"Best metric so far: {scheduler.best}")
print(f"Epochs without improvement: {scheduler.num_bad_epochs}")
print(f"Patience: {scheduler.patience}")
# If num_bad_epochs < patience, it's still waiting
# If num_bad_epochs >= patience, should reduce next step
```
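To sanity-check a patience/threshold combination without a real training run, feed the scheduler a synthetic, nearly flat metric and watch when it fires. A minimal sketch:
```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
sched = ReduceLROnPlateau(opt, mode='min', factor=0.1, patience=5, threshold=1e-3)

# Synthetic validation losses: real progress for 3 epochs, then a 0.02% creep
fake_val_losses = [0.60, 0.55, 0.50] + [0.4999] * 15
for epoch, loss in enumerate(fake_val_losses):
    sched.step(loss)
    print(f"epoch {epoch}: val_loss={loss:.4f}, lr={opt.param_groups[0]['lr']:.1e}")
# With threshold=1e-3 (relative), the 0.02% creep is ignored, so the LR
# drops once `patience` bad epochs accumulate, which is what you want here.
```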
### 10. Rationalization Table
When users rationalize away proper LR scheduling, counter with:
| Rationalization | Reality | Counter-Argument |
|-----------------|---------|------------------|
| "Constant LR is simpler" | Leaves 2-5% performance on table | "One line of code for 2-5% better accuracy is excellent ROI" |
| "Warmup seems optional" | MANDATORY for transformers | "Without warmup, transformers diverge or train unstably" |
| "I don't know which scheduler to use" | CosineAnnealing is great default | "CosineAnnealingLR works well for most cases, zero tuning" |
| "Scheduling is too complicated" | Modern frameworks make it trivial | "scheduler = CosineAnnealingLR(optimizer, T_max=100) - that's it" |
| "Papers don't mention scheduling" | They do, in implementation details | "Check paper's code repo or appendix - scheduling always there" |
| "My model is too small to need scheduling" | Even small models benefit | "Scheduling helps all models converge to better minima" |
| "Just use Adam, it adapts automatically" | Adam still benefits from scheduling | "SOTA transformers use AdamW + scheduling (BERT, GPT, ViT)" |
| "I'll tune it later" | Scheduling should be from start | "Scheduling is core hyperparameter, not optional add-on" |
| "OneCycle always best" | Only for specific scenarios | "OneCycle great for fast training (<30 epochs), not long training" |
| "I don't have time to run LR finder" | Takes 5 minutes, saves hours | "LR finder runs in minutes, prevents wasted training runs" |
| "Warmup adds complexity" | One extra line of code | "SequentialLR([warmup, cosine], [5]) - that's the complexity" |
| "My training is already good enough" | Could be 2-5% better | "SOTA papers all use scheduling - it's standard practice" |
| "Reducing LR will slow training" | Reduces LR when high LR hurts | "High LR early (fast), low LR late (fine-tune) = best of both" |
| "I don't know what T_max to use" | T_max = total_epochs | "Just set T_max to your total training epochs" |
### 11. Red Flags Checklist
Watch for these warning signs that indicate scheduling problems:
**Critical Red Flags (Fix Immediately):**
🚨 Training transformer without warmup
- **Impact:** High risk of divergence, NaN loss
- **Fix:** Add 5-10 epoch warmup immediately
🚨 Loss NaN or exploding in first few epochs
- **Impact:** Training failed
- **Fix:** Add warmup, lower initial LR, gradient clipping
🚨 scheduler.step() called every batch for Cosine/Step schedulers
- **Impact:** LR decays 100x too fast
- **Fix:** Move scheduler.step() outside batch loop
🚨 Not passing metric to ReduceLROnPlateau
- **Impact:** Scheduler doesn't work at all
- **Fix:** scheduler.step(val_loss)
**Important Red Flags (Should Fix):**
⚠️ Training >30 epochs without scheduler
- **Impact:** Leaving 2-5% performance on table
- **Fix:** Add CosineAnnealingLR or MultiStepLR
⚠️ OneCycle with random max_lr (not tuned)
- **Impact:** Unstable training or suboptimal performance
- **Fix:** Run LR finder, tune max_lr
⚠️ Large batch (>512) without warmup
- **Impact:** Training instability
- **Fix:** Add 5-10 epoch warmup
⚠️ Vision transformer with constant LR
- **Impact:** Poor convergence, unstable training
- **Fix:** Add warmup + cosine schedule
⚠️ Training plateaus but no scheduler to reduce LR
- **Impact:** Stuck at local minimum
- **Fix:** Add ReduceLROnPlateau or manually reduce LR
**Minor Red Flags (Consider Fixing):**
⚡ CNN training without any scheduling
- **Impact:** Missing 1-3% accuracy
- **Fix:** Add MultiStepLR or CosineAnnealingLR
⚡ Not monitoring LR during training
- **Impact:** Hard to debug schedule issues
- **Fix:** Log LR every epoch
⚡ T_max doesn't match training duration
- **Impact:** Schedule ends too early/late
- **Fix:** Set T_max = total_epochs - warmup_epochs
⚡ Using same LR for pretrained and new layers (fine-tuning)
- **Impact:** Suboptimal fine-tuning
- **Fix:** Use different LRs for param groups
⚡ Not validating schedule before full training
- **Impact:** Risk wasting compute on wrong schedule
- **Fix:** Plot schedule dry-run before training
### 12. Quick Reference
## Scheduler Selection Cheatsheet
```
Q: What should I use for...

Vision CNN (100 epochs)?
  → CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

Vision Transformer?
  → LinearLR(warmup 5) + CosineAnnealingLR(T_max=95)   [WARMUP MANDATORY]

NLP Transformer?
  → LinearLR(warmup 10%) + LinearLR(decay)             [WARMUP MANDATORY]

Fast training (<30 epochs)?
  → OneCycleLR(max_lr=tune_with_LR_finder)

Don't know optimal schedule?
  → ReduceLROnPlateau(mode='min', patience=10)

Training plateaued?
  → Add ReduceLROnPlateau or manually reduce LR by 10x now

Following paper recipe?
  → Use paper's exact schedule (usually MultiStepLR)

Fine-tuning pretrained model?
  → Constant low LR (1e-5) or gentle CosineAnnealing

Large batch (>512)?
  → LinearLR(warmup 5-10) + CosineAnnealingLR          [WARMUP MANDATORY]
```
## Step Placement Quick Reference
```python
# Most schedulers (Step, Cosine, Exponential)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    scheduler.step()  # AFTER epoch

# OneCycleLR (EXCEPTION)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
        scheduler.step()  # AFTER each batch

# ReduceLROnPlateau (pass metric)
for epoch in range(epochs):
    for batch in train_loader:
        train_step(...)
    val_loss = validate(...)
    scheduler.step(val_loss)  # pass metric
```
## Warmup Quick Reference
```python
# Pattern: Warmup + Cosine (most common)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=95, eta_min=1e-5)
scheduler = SequentialLR(optimizer, [warmup, cosine], [5])
# When warmup is MANDATORY:
# ✅ Transformers (ViT, BERT, GPT)
# ✅ Large batch (>512)
# ✅ High initial LR
# ✅ Training from scratch
# When warmup is optional:
# ❌ Fine-tuning
# ❌ Small LR (<1e-4)
# ❌ Small models
```
## LR Finder Quick Reference
```python
# Run LR finder
lrs, losses = find_lr(model, train_loader, optimizer, loss_fn, device)
# Find optimal (steepest descent)
optimal_lr = suggest_lr_from_finder(lrs, losses)
# Use cases:
# - Direct use: optimizer = SGD(params, lr=optimal_lr)
# - OneCycle: max_lr = optimal_lr * 10
# - Conservative: base_lr = optimal_lr * 0.1
```
## Summary
Learning rate scheduling is CRITICAL for competitive model performance:
**Key Takeaways:**
1. **Scheduling improves final accuracy by 2-5%** - not optional for SOTA
2. **Warmup is MANDATORY for transformers** - prevents divergence
3. **CosineAnnealingLR is best default** - works well, zero tuning
4. **Use LR finder for new problems** - finds optimal initial LR in minutes
5. **OneCycleLR needs max_lr tuning** - run LR finder first
6. **Watch scheduler.step() placement** - most per epoch, OneCycle per batch
7. **Always monitor LR during training** - log to console or TensorBoard
8. **Plot schedule before training** - catch mistakes early
**Modern Defaults (2025):**
- **Vision CNNs:** SGD + CosineAnnealingLR (optional warmup)
- **Vision Transformers:** AdamW + Warmup + CosineAnnealingLR (warmup mandatory)
- **NLP Transformers:** AdamW + Warmup + Linear decay (warmup mandatory)
- **Fast Training:** SGD + OneCycleLR (tune max_lr with LR finder)
**When In Doubt:**
- Use CosineAnnealingLR with T_max = total_epochs
- Add 5-epoch warmup for large models
- Run LR finder if unsure about initial LR
- Log LR every epoch to monitor schedule
Learning rate scheduling is one of the highest-ROI hyperparameters - master it for significantly better model performance.