# LLM Fine-Tuning Strategies
## Context
You're considering fine-tuning an LLM or debugging a fine-tuning process. Common mistakes:
- **Fine-tuning when prompts would work** (unnecessary cost/time)
- **Full fine-tuning instead of LoRA** (100× less efficient)
- **Poor dataset quality** (garbage in, garbage out)
- **Wrong hyperparameters** (catastrophic forgetting)
- **No validation strategy** (overfitting undetected)
**This skill provides effective fine-tuning strategies: when to fine-tune, efficient methods (LoRA), data quality, hyperparameters, and evaluation.**
## Decision Tree: Prompt Engineering vs Fine-Tuning
**Start with prompt engineering. Fine-tuning is last resort.**
### Step 1: Try Prompt Engineering
```python
# System message + few-shot examples
system = """
You are a {role} with {characteristics}.
{guidelines}
"""
few_shot = [
    # 3-5 examples of desired behavior
]
# Test quality (evaluate(): your own harness that scores outputs against test_set)
quality = evaluate(system, few_shot, test_set)
```
**If quality ≥ 90%:** ✅ STOP. Use prompts (no fine-tuning needed)
**If quality < 90%:** Continue to Step 2
### Step 2: Optimize Prompts
- Add more examples (5-10)
- Add chain-of-thought
- Specify output format more clearly
- Try different system messages
- Use temperature=0 for consistency
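For reference, an optimized prompt that combines these tactics might look like the sketch below (model name, example content, and OpenAI SDK usage are illustrative assumptions, not a prescription):
```python
from openai import OpenAI

client = OpenAI()

system = """You are a support triage analyst. Think step by step before answering.
Return JSON exactly as: {"category": "...", "confidence": 0.0}"""

few_shot = [
    {"role": "user", "content": "My card was charged twice."},
    {"role": "assistant", "content": '{"category": "billing", "confidence": 0.95}'},
    # ...5-10 examples covering the tricky cases
]

response = client.chat.completions.create(
    model="gpt-4o-mini",   # any capable chat model
    temperature=0,         # consistency
    messages=[{"role": "system", "content": system}]
             + few_shot
             + [{"role": "user", "content": "I can't log in to my account."}],
)
print(response.choices[0].message.content)
```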
**If quality ≥ 90%:** ✅ STOP. Use optimized prompts
**If quality < 90%:** Continue to Step 3
### Step 3: Consider Fine-Tuning
**Fine-tune when:**
- ✅ **Prompts fail** (quality < 90% after optimization)
- ✅ **Have 1000+ examples** (minimum for meaningful fine-tuning)
- ✅ **Need consistency** (can't rely on prompt variations)
- ✅ **Reduce latency** (shorter prompts → faster inference)
- ✅ **Teach new capability** (not in base model)
**Don't fine-tune for:**
- ❌ **Tone/style matching** (use system message)
- ❌ **Output formatting** (use format specification in prompt)
- ❌ **Few examples** (< 100 examples insufficient)
- ❌ **Quick experiments** (prompts iterate faster)
- ❌ **Recent information** (use RAG, not fine-tuning)
## When to Fine-Tune: Detailed Criteria
### Criterion 1: Task Complexity
**Simple tasks (prompt engineering):**
- Classification (sentiment, category)
- Extraction (entities, dates, names)
- Formatting (JSON, CSV conversion)
- Tone matching (company voice)
**Complex tasks (consider fine-tuning):**
- Multi-step reasoning (not in base model)
- Domain-specific language (medical, legal)
- Consistent complex behavior (100+ edge cases)
- New capabilities (teach entirely new skill)
### Criterion 2: Dataset Size
```
< 100 examples: Prompts only (insufficient for fine-tuning)
100-1000: Prompts preferred (fine-tuning risky - overfitting)
1000-10k: Fine-tuning viable if prompts fail
> 10k: Fine-tuning effective
```
### Criterion 3: Cost-Benefit
**Prompt engineering:**
- Cost: $0 (just dev time)
- Time: Minutes to hours (fast iteration)
- Maintenance: Easy (just update prompt)
**Fine-tuning:**
- Cost: $100-1000+ (compute + data prep)
- Time: Days to weeks (data prep + training + eval)
- Maintenance: Hard (need retraining for updates)
**ROI calculation:**
```python
# Prompt engineering cost
prompt_dev_hours = 4
hourly_rate = 100
prompt_cost = prompt_dev_hours * hourly_rate  # $400
# Fine-tuning cost
data_prep_hours = 40
training_cost = 500
total_ft_cost = data_prep_hours * hourly_rate + training_cost  # $4,500
# Cost ratio: Fine-tuning is 11× more expensive
# Only worth it if quality improvement > 10%
```
### Criterion 4: Performance Requirements
**Quality:**
- Need 90-95%: Prompts usually sufficient
- Need 95-98%: Fine-tuning may help
- Need 98%+: Fine-tuning + careful data curation
**Latency:**
- > 1 second acceptable: Prompts fine (long prompts OK)
- 200-1000ms: Fine-tuning may help (reduce prompt size)
- < 200ms: Fine-tuning + optimization required
**Consistency:**
- Variable outputs acceptable: Prompts OK (temperature > 0)
- High consistency needed: Prompts (temperature=0) or fine-tuning
- Perfect consistency: Fine-tuning + validation
## Fine-Tuning Methods
### 1. Full Fine-Tuning
**Updates all model parameters.**
**Pros:**
- Maximum flexibility (can change any behavior)
- Best quality (when you have massive data)
**Cons:**
- Expensive (7B model = 28GB memory for weights alone)
- Slow (hours to days)
- Risk of catastrophic forgetting
- Hard to merge multiple fine-tunes
**When to use:**
- Massive dataset (100k+ examples)
- Fundamental behavior change needed
- Have large compute resources (multi-GPU)
**Memory requirements:**
```python
# 7B parameter model (FP32)
params = 7e9                      # 7B parameters
weights_gb = params * 4 / 1e9     # FP32 weights: 28 GB
gradients_gb = weights_gb         # 28 GB
optimizer_gb = 2 * weights_gb     # Adam states: 56 GB (2× weights)
activations_gb = 8                # ~8 GB at batch_size=8
total_gb = weights_gb + gradients_gb + optimizer_gb + activations_gb  # ≈120 GB → multi-GPU!
```
### 2. LoRA (Low-Rank Adaptation)
**Freezes base model, trains small adapter matrices.**
**How it works:**
```
Original linear layer: W (d × k)
LoRA: W + (A × B)
where A (d × r), B (r × k), r << d,k
Example:
W: 4096 × 4096 = 16.7M parameters
A: 4096 × 8 = 32K parameters
B: 8 × 4096 = 32K parameters
A + B = 64K parameters (0.4% of original!)
```
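To make these numbers concrete, here is a minimal PyTorch sketch of the idea (illustrative only; in practice the `peft` library builds these adapters for you, as shown in the configuration below):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update scaled by alpha/r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # d × r
        self.B = nn.Parameter(torch.zeros(r, k))         # r × k (zero init → no change at start)
        self.scaling = alpha / r

    def forward(self, x):
        delta_w = self.A @ self.B                        # d × k low-rank update
        return self.base(x) + F.linear(x, delta_w) * self.scaling
```
Only `A` and `B` receive gradients, which is why memory and compute drop so sharply and why the frozen base weights cannot be overwritten.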
**Pros:**
- Extremely efficient (1% of parameters)
- Fast training (10× faster than full FT)
- Low memory (fits single GPU)
- Easy to merge multiple LoRAs
- No catastrophic forgetting (base model frozen)
**Cons:**
- Slightly lower capacity than full FT (99% quality usually)
- Need to keep base model + adapters
**When to use:**
- 99% of fine-tuning cases
- Limited compute (single GPU)
- Fast iteration needed
- Multiple tasks (train separate LoRAs, swap as needed)
**Configuration:**
```python
from peft import LoraConfig, get_peft_model
config = LoraConfig(
r=8, # Rank (4-16 typical, higher = more capacity)
lora_alpha=32, # Scaling (usually 2× rank)
target_modules=["q_proj", "v_proj"], # Which layers
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)  # base_model: an already-loaded causal LM
model.print_trainable_parameters()
# trainable params: 8.4M || all params: 7B || trainable%: 0.12%
```
**Rank selection:**
```
r=4: Minimal (fast, low capacity) - simple tasks
r=8: Standard (balanced) - most tasks
r=16: High capacity (slower, better quality) - complex tasks
r=32+: Approaching full FT quality (diminishing returns)
Start with r=8, increase only if quality insufficient
```
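Because the base weights never change, each task can get its own adapter that is saved and swapped at load time. A sketch using `peft` (the adapter directories and names are made up for illustration):
```python
from peft import PeftModel

# After training: save only the adapter weights (a few MB, not the full model)
model.save_pretrained("adapters/support-bot")

# Later: attach one or more adapters to the same frozen base model
inference_model = PeftModel.from_pretrained(base_model, "adapters/support-bot")
inference_model.load_adapter("adapters/sql-generator", adapter_name="sql")
inference_model.set_adapter("sql")  # route subsequent requests through the SQL adapter
```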
### 3. QLoRA (Quantized LoRA)
**LoRA + 4-bit quantization of base model.**
**Pros:**
- Extremely memory efficient (4× less than LoRA)
- 7B model fits on 16GB GPU
- Same quality as LoRA
**Cons:**
- Slower than LoRA (quantization overhead)
- More complex setup
**When to use:**
- Limited GPU memory (< 24GB)
- Large models on consumer GPUs
- Cost optimization (cheaper GPUs)
**Setup:**
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# Then add LoRA as usual (lora_config: the LoraConfig from the LoRA section above)
model = get_peft_model(model, lora_config)
```
**Memory comparison:**
```
Method | 7B Model | 13B Model | 70B Model
---------------|----------|-----------|----------
Full FT | 120 GB | 200 GB | 1000 GB
LoRA | 40 GB | 60 GB | 300 GB
QLoRA | 12 GB | 20 GB | 80 GB
```
### Method Selection:
```python
if gpu_memory < 24:
use_qlora()
elif gpu_memory < 80:
use_lora()
elif have_massive_data and multi_gpu_cluster:
use_full_finetuning()
else:
use_lora() # Default choice
```
## Dataset Preparation
**Quality > Quantity. 1,000 clean examples > 10,000 noisy examples.**
### 1. Data Collection
**Good sources:**
- Human-labeled data (gold standard)
- Curated conversations (high-quality)
- Expert-written examples
- Validated user interactions
**Bad sources:**
- Raw logs (errors, incomplete, noise)
- Scraped data (quality varies wildly)
- Automated generation (may have artifacts)
- Untested user inputs (edge cases, adversarial)
### 2. Data Cleaning
```python
def clean_dataset(raw_data):
    clean = []
    for example in raw_data:
        text = (example['input'] + ' ' + example['output']).lower()
        # Filter 1: Remove obvious errors
        if any(err in text for err in ['error', 'exception', 'failed']):
            continue
        # Filter 2: Length checks
        if len(example['input']) < 10 or len(example['output']) < 10:
            continue  # Too short
        if len(example['input']) > 2000 or len(example['output']) > 2000:
            continue  # Too long (may be malformed)
        # Filter 3: Completeness
        if not example['output'].strip().endswith(('.', '!', '?')):
            continue  # Incomplete response
        # Filter 4: Language check (is_valid_language: your own validator)
        if not is_valid_language(example['output']):
            continue  # Gibberish or wrong language
        # Filter 5: Duplicates (is_duplicate: your own similarity check)
        if is_duplicate(example, clean):
            continue
        clean.append(example)
    return clean

cleaned = clean_dataset(raw_data)
print(f"Filtered: {len(raw_data)} → {len(cleaned)}")
# Example: 10,000 → 3,000 (but high quality!)
```
### 3. Manual Validation
**Critical step: Spot check 100+ random examples.**
```python
import random
sample = random.sample(cleaned, min(100, len(cleaned)))
for i, ex in enumerate(sample):
    print(f"\n--- Example {i+1}/{len(sample)} ---")
    print(f"Input: {ex['input']}")
    print(f"Output: {ex['output']}")
    response = input("Quality (good/bad/skip)? ")
    if response == 'bad':
        # Investigate the pattern, then add a filtering rule
        reason = input("Why bad? ")
        # Update the filtering logic based on recurring reasons
```
**What to check:**
- ☐ Output is correct and complete
- ☐ Output matches desired format/style
- ☐ No errors or hallucinations
- ☐ Appropriate length
- ☐ Natural language (not robotic)
- ☐ Consistent with other examples
### 4. Dataset Format
**OpenAI format (for GPT fine-tuning):**
```json
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."}
]
}
```
**Hugging Face format:**
```python
from datasets import Dataset
data = {
'input': ["question 1", "question 2", ...],
'output': ["answer 1", "answer 2", ...]
}
dataset = Dataset.from_dict(data)
```
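If your cleaned data lives as input/output pairs (as produced by the cleaning step above), a small helper can emit the OpenAI chat format as JSONL; the system text here is a placeholder:
```python
import json

def to_openai_jsonl(examples, path, system="You are a helpful assistant."):
    """Write one chat-format JSON object per line, ready for upload."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": ex["input"]},
                    {"role": "assistant", "content": ex["output"]},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

to_openai_jsonl(cleaned, "train.jsonl")
```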
### 5. Train/Val/Test Split
```python
from sklearn.model_selection import train_test_split
# 70% train, 15% val, 15% test
train, temp = train_test_split(data, test_size=0.3, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)
print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
# Example: Train: 2100, Val: 450, Test: 450
# Stratified split for imbalanced data
train, temp = train_test_split(
data, test_size=0.3, stratify=data['label'], random_state=42
)
```
**Split guidelines:**
- Minimum validation: 100 examples
- Minimum test: 100 examples
- Large datasets (> 10k): 80/10/10 split
- Small datasets (< 5k): 70/15/15 split
### 6. Data Augmentation (Optional)
**When you need more data:**
```python
# Paraphrasing
"What's the weather?" "How's the weather today?"
# Back-translation
English French English (introduces variation)
# Synthetic generation (use carefully!)
few_shot_examples = [...]
new_examples = llm.generate(
f"Generate 10 examples similar to: {few_shot_examples}"
)
# ALWAYS manually validate synthetic data!
```
**Warning:** Synthetic data can introduce artifacts. Always validate!
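One cheap guardrail before manual review is to drop synthetic examples that are near-duplicates of data you already have. A rough sketch using only the standard library, assuming `new_examples` has been parsed into the same input/output dict format (the 0.9 cutoff is an arbitrary choice):
```python
from difflib import SequenceMatcher

def is_near_duplicate(candidate, existing, threshold=0.9):
    """True if the candidate's input closely matches any existing input."""
    for ex in existing:
        if SequenceMatcher(None, candidate["input"], ex["input"]).ratio() >= threshold:
            return True
    return False

vetted = [ex for ex in new_examples if not is_near_duplicate(ex, cleaned)]
# Still manually review `vetted` before adding it to the training set!
```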
## Hyperparameters
### Learning Rate
**Most critical hyperparameter.**
```python
# Pre-training LR: 1e-3 to 3e-4
# Fine-tuning LR: 100-1000× smaller!
training_args = TrainingArguments(
    learning_rate=1e-5,     # Start here for 7B models
    # learning_rate=1e-6,   # Even more conservative: larger models or small datasets
)
```
**Guidelines:**
```
Model size | Pre-train LR | Fine-tune LR
---------------|--------------|-------------
1B params | 3e-4 | 3e-5 to 1e-5
7B params | 3e-4 | 1e-5 to 1e-6
13B params | 2e-4 | 5e-6 to 1e-6
70B+ params | 1e-4 | 1e-6 to 1e-7
Rule: Fine-tune LR ≈ Pre-train LR / 100
```
**LR scheduling:**
```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,             # Gradual LR increase (~10% of training)
    num_training_steps=total_steps,   # total_steps = steps_per_epoch × num_epochs
)
```
**Signs of wrong LR:**
**Too high (LR > 1e-4):**
- Training loss oscillates wildly
- Model generates gibberish
- Catastrophic forgetting (fails on general tasks)
**Too low (LR < 1e-7):**
- Training loss barely decreases
- Model doesn't adapt to new data
- Very slow convergence
### Epochs
```python
training_args = TrainingArguments(
num_train_epochs=3, # Standard: 3-5 epochs
)
```
**Guidelines:**
```
Dataset size | Epochs
-------------|-------
< 1k | 5-10 (more passes needed)
1k-5k | 3-5 (standard)
5k-10k | 2-3
> 10k | 1-2 (large dataset, fewer passes)
Rule: Smaller dataset → more epochs (but watch for overfitting!)
```
**Too many epochs:**
- Training loss → 0 but val loss increases (overfitting)
- Model memorizes training data
- Catastrophic forgetting
**Too few epochs:**
- Model hasn't fully adapted
- Training and val loss still decreasing
### Batch Size
```python
training_args = TrainingArguments(
per_device_train_batch_size=8, # Depends on GPU memory
gradient_accumulation_steps=4, # Effective batch = 8 × 4 = 32
)
```
**Guidelines:**
```
GPU Memory | Batch Size (7B model)
-----------|----------------------
16 GB | 1-2 (use gradient accumulation!)
24 GB | 2-4
40 GB | 4-8
80 GB | 8-16
Effective batch size (with accumulation): 16-64 typical
```
**Gradient accumulation:**
```python
# Simulate batch_size=32 with only 8 examples fitting in memory:
per_device_train_batch_size=8
gradient_accumulation_steps=4
# Effective batch = 8 × 4 = 32
```
### Weight Decay
```python
training_args = TrainingArguments(
weight_decay=0.01, # L2 regularization (prevent overfitting)
)
```
**Guidelines:**
- Standard: 0.01
- Strong regularization: 0.1 (small dataset, high overfitting risk)
- Light regularization: 0.001 (large dataset)
### Warmup
```python
training_args = TrainingArguments(
warmup_steps=100, # Or warmup_ratio=0.1 (10% of training)
)
```
**Why warmup:**
- Prevents initial instability (large gradients early)
- Gradual LR increase: 0 → target_LR over warmup steps
**Guidelines:**
- Warmup: 5-10% of total training steps
- Longer warmup for larger models
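To turn the 5-10% rule into an actual number, derive the step counts from dataset size, effective batch size, and epochs (illustrative values):
```python
num_examples = 3000
effective_batch = 8 * 4                              # per-device batch × grad accumulation
num_epochs = 3

steps_per_epoch = num_examples // effective_batch    # 93
total_steps = steps_per_epoch * num_epochs           # 279
warmup_steps = int(0.1 * total_steps)                # 27 (~10% of training)
```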
## Training
### Basic Training Loop
```python
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
# Hyperparameters
learning_rate=1e-5,
num_train_epochs=3,
per_device_train_batch_size=8,
gradient_accumulation_steps=4,
weight_decay=0.01,
warmup_steps=100,
# Evaluation
evaluation_strategy="steps",
eval_steps=100,
save_strategy="steps",
save_steps=100,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
# Logging
logging_steps=10,
logging_dir="./logs",
# Optimization
fp16=True, # Mixed precision (faster, less memory)
gradient_checkpointing=True, # Trade compute for memory
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
tokenizer=tokenizer,
)
trainer.train()
```
### Monitoring Training
**Key metrics to watch:**
```python
# 1. Training loss (should decrease steadily)
# 2. Validation loss (should decrease, then plateau)
# 3. Validation metrics (accuracy, F1, BLEU, etc.)
# Warning signs:
# - Train loss → 0 but val loss increasing: Overfitting
# - Train loss oscillating: LR too high
# - Train loss not decreasing: LR too low or data issues
```
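With the Hugging Face `Trainer`, these curves can be read back from `trainer.state.log_history` after (or during) training. A minimal sketch of a crude overfitting check:
```python
history = trainer.state.log_history   # list of dicts logged by the Trainer
train_loss = [(h["step"], h["loss"]) for h in history if "loss" in h]
eval_loss = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

# Crude check: did validation loss rise on the last evaluation?
if len(eval_loss) >= 2 and eval_loss[-1][1] > eval_loss[-2][1]:
    print("Warning: validation loss increased on the last eval - possible overfitting")
```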
**Logging:**
```python
import wandb
wandb.init(project="fine-tuning")
training_args = TrainingArguments(
report_to="wandb", # Log to Weights & Biases
logging_steps=10,
)
```
### Early Stopping
```python
from transformers import EarlyStoppingCallback
trainer = Trainer(
    ...,  # model, args, datasets, tokenizer as above
    callbacks=[EarlyStoppingCallback(
        early_stopping_patience=3,      # Stop if no improvement for 3 evals
        early_stopping_threshold=0.01,  # Minimum improvement that counts
    )],
)
```
**Why early stopping:**
- Prevents overfitting (stops before val loss increases)
- Saves compute (don't train unnecessary epochs)
- Automatically finds optimal epoch count
## Evaluation
### 1. Validation During Training
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Logits → token ids; replace the -100 used to mask padded label positions
    predictions = np.argmax(predictions, axis=-1)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode to text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Exact-match accuracy and weighted F1 over the decoded strings
    accuracy = accuracy_score(decoded_labels, decoded_preds)
    f1 = f1_score(decoded_labels, decoded_preds, average='weighted')
    return {'accuracy': accuracy, 'f1': f1}

trainer = Trainer(
    ...,  # model, args, datasets as above
    compute_metrics=compute_metrics,
)
```
### 2. Test Set Evaluation (Final)
```python
# After training completes, evaluate on the held-out test set ONCE
test_results = trainer.evaluate(test_dataset, metric_key_prefix="test")
print(f"Test accuracy: {test_results['test_accuracy']:.2%}")
print(f"Test F1: {test_results['test_f1']:.2%}")
```
### 3. Qualitative Evaluation
**Critical: Manually test on real examples!**
```python
def test_model(model, tokenizer, test_examples):
    for ex in test_examples:
        prompt = ex['input']
        expected = ex['output']
        # Generate
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=100)
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Input: {prompt}")
        print(f"Expected: {expected}")
        print(f"Generated: {generated}")
        print(f"Match: {'✓' if generated == expected else '✗'}")
        print("-" * 80)

# Test on 20-50 examples (including edge cases)
test_model(model, tokenizer, test_examples)
```
### 4. A/B Testing (Production)
```python
# Route 50% traffic to base model, 50% to fine-tuned
import random
def get_model():
    if random.random() < 0.5:
        return base_model
    else:
        return finetuned_model
# Measure:
# - User satisfaction (thumbs up/down)
# - Task success rate
# - Response time
# - Cost per request
# After 1000+ requests, analyze results
```
### 5. Catastrophic Forgetting Check
**Critical: Ensure fine-tuning didn't break base capabilities!**
```python
# Test on general knowledge tasks
general_tasks = [
"What is the capital of France?", # Basic knowledge
"Translate to Spanish: Hello", # Translation
"2 + 2 = ?", # Basic math
"Who wrote Hamlet?", # Literature
]
for task in general_tasks:
    # "generate" here stands in for tokenize → model.generate → decode (see test_model above)
    before = base_model.generate(task)
    after = finetuned_model.generate(task)
    print(f"Task: {task}")
    print(f"Before: {before}")
    print(f"After: {after}")
    print(f"Preserved: {'✓' if before == after else '✗ (review manually)'}")
```
## Common Issues and Solutions
### Issue 1: Overfitting
**Symptoms:**
- Train loss → 0, val loss increases
- Perfect on training data, poor on test data
**Solutions:**
```python
# 1. Reduce epochs
num_train_epochs=3 # Instead of 10
# 2. Increase regularization
weight_decay=0.1 # Instead of 0.01
# 3. Early stopping
early_stopping_patience=3
# 4. Collect more data
# 5. Data augmentation
# 6. Use LoRA (less prone to overfitting than full FT)
```
### Issue 2: Catastrophic Forgetting
**Symptoms:**
- Fine-tuned model fails on general tasks
- Lost pre-trained knowledge
**Solutions:**
```python
# 1. Lower learning rate (most important!)
learning_rate=1e-6 # Instead of 1e-4
# 2. Fewer epochs
num_train_epochs=2 # Instead of 10
# 3. Use LoRA (base model frozen, can't forget)
# 4. Add general examples to training set (10-20% general data)
```
### Issue 3: Poor Quality
**Symptoms:**
- Model output is low quality (incorrect, incoherent)
**Solutions:**
```python
# 1. Check dataset quality (most common cause!)
# - Manual validation
# - Remove noise
# - Fix labels
# 2. Increase model size
# - 7B → 13B → 70B
# 3. Increase training data
# - Need 1000+ high-quality examples
# 4. Adjust hyperparameters
# - Try higher LR (1e-5 → 3e-5) if underfit
# - Train longer (3 → 5 epochs)
# 5. Check if base model has capability
# - If base model can't do task, fine-tuning won't help
```
### Issue 4: Slow Training
**Symptoms:**
- Training takes days/weeks
**Solutions:**
```python
# 1. Use LoRA (10× faster than full FT)
# 2. Mixed precision
fp16=True # 2× faster
# 3. Gradient checkpointing (trade speed for memory)
gradient_checkpointing=True
# 4. Smaller batch size + gradient accumulation
per_device_train_batch_size=2
gradient_accumulation_steps=16
# 5. Use multiple GPUs
# 6. Use faster GPU (A100 > V100 > T4)
```
### Issue 5: Out of Memory
**Symptoms:**
- CUDA out of memory error
**Solutions:**
```python
# 1. Use QLoRA (4× less memory)
# 2. Reduce batch size
per_device_train_batch_size=1
gradient_accumulation_steps=32
# 3. Gradient checkpointing
gradient_checkpointing=True
# 4. Use smaller model
# 7B → 3B → 1B
# 5. Reduce sequence length
max_seq_length=512 # Instead of 2048
```
## Best Practices Summary
### Before Fine-Tuning:
1. ☐ Try prompt engineering first (90% of cases, prompts work!)
2. ☐ Have 1000+ high-quality examples
3. ☐ Clean and validate dataset (quality > quantity)
4. ☐ Create train/val/test split (70/15/15)
5. ☐ Define success metrics (what does "good" mean?)
### During Fine-Tuning:
6. ☐ Use LoRA (unless specific reason for full FT)
7. ☐ Set tiny learning rate (1e-5 to 1e-6 for 7B models)
8. ☐ Train for 3-5 epochs (not 50!)
9. ☐ Monitor val loss (stop when it stops improving)
10. ☐ Log everything (wandb, tensorboard)
### After Fine-Tuning:
11. ☐ Evaluate on test set (quantitative metrics)
12. ☐ Manual testing (qualitative, 20-50 examples)
13. ☐ Check for catastrophic forgetting (general tasks)
14. ☐ A/B test in production (before full rollout)
15. ☐ Document hyperparameters (for reproducibility)
## Quick Reference
| Task | Method | Dataset | LR | Epochs |
|------|--------|---------|----|----|
| Tone matching | Prompts | N/A | N/A | N/A |
| Simple classification | Prompts | N/A | N/A | N/A |
| Complex domain task | LoRA | 1k-10k | 1e-5 | 3-5 |
| Fundamental change | Full FT | 100k+ | 1e-5 | 1-3 |
| Limited GPU | QLoRA | 1k-10k | 1e-5 | 3-5 |
**Default recommendation:** Try prompts first. If that fails, use LoRA with LR=1e-5, epochs=3, and high-quality dataset.
## Summary
**Core principles:**
1. **Prompt engineering first**: 90% of tasks don't need fine-tuning
2. **LoRA by default**: trains ~1% of the parameters of full fine-tuning with near-identical quality
3. **Data quality matters**: 1,000 clean examples > 10,000 noisy examples
4. **Tiny learning rate**: Fine-tune LR = Pre-train LR / 100 to / 1000
5. **Validation essential**: Train/val/test split + early stopping + catastrophic forgetting check
**Decision tree:**
1. Try prompts (system message + few-shot)
2. If quality < 90%, optimize prompts
3. If still < 90% and have 1000+ examples, consider fine-tuning
4. Use LoRA (default), QLoRA (limited GPU), or full FT (rare)
5. Set LR = 1e-5, epochs = 3-5, monitor val loss
6. Evaluate on test set + manual testing + general tasks
**Key insight**: Fine-tuning is powerful but expensive and slow. Start with prompts, fine-tune only when prompts demonstrably fail and you have high-quality data.