Initial commit

2025-11-30 08:59:54 +08:00
commit 725c187d17
11 changed files with 8174 additions and 0 deletions
--- a/skills/using-llm-specialist/llm-finetuning-strategies.md
+++ b/skills/using-llm-specialist/llm-finetuning-strategies.md
@@ -0,0 +1,969 @@
+
+# LLM Fine-Tuning Strategies
+
+## Context
+
+You're considering fine-tuning an LLM or debugging a fine-tuning process. Common mistakes:
+- **Fine-tuning when prompts would work** (unnecessary cost/time)
+- **Full fine-tuning instead of LoRA** (100× less efficient)
+- **Poor dataset quality** (garbage in, garbage out)
+- **Wrong hyperparameters** (catastrophic forgetting)
+- **No validation strategy** (overfitting undetected)
+
+**This skill provides effective fine-tuning strategies: when to fine-tune, efficient methods (LoRA), data quality, hyperparameters, and evaluation.**
+
+
+## Decision Tree: Prompt Engineering vs Fine-Tuning
+
+**Start with prompt engineering. Fine-tuning is last resort.**
+
+### Step 1: Try Prompt Engineering
+
+```python
+# System message + few-shot examples
+system = """
+You are a {role} with {characteristics}.
+{guidelines}
+"""
+
+few_shot = [
+    # 3-5 examples of desired behavior
+]
+
+# Test quality
+quality = evaluate(system, few_shot, test_set)
+```
+
+**If quality ≥ 90%:** ✅ STOP. Use prompts (no fine-tuning needed)
+
+**If quality < 90%:** Continue to Step 2
+
+### Step 2: Optimize Prompts
+
+- Add more examples (5-10)
+- Add chain-of-thought
+- Specify output format more clearly
+- Try different system messages
+- Use temperature=0 for consistency
+
+**If quality ≥ 90%:** ✅ STOP. Use optimized prompts
+
+**If quality < 90%:** Continue to Step 3
+
+### Step 3: Consider Fine-Tuning
+
+**Fine-tune when:**
+
+✅ **Prompts fail** (quality < 90% after optimization)
+✅ **Have 1000+ examples** (minimum for meaningful fine-tuning)
+✅ **Need consistency** (can't rely on prompt variations)
+✅ **Reduce latency** (shorter prompts → faster inference)
+✅ **Teach new capability** (not in base model)
+
+**Don't fine-tune for:**
+
+❌ **Tone/style matching** (use system message)
+❌ **Output formatting** (use format specification in prompt)
+❌ **Few examples** (< 100 examples insufficient)
+❌ **Quick experiments** (prompts iterate faster)
+❌ **Recent information** (use RAG, not fine-tuning)
+
+
+## When to Fine-Tune: Detailed Criteria
+
+### Criterion 1: Task Complexity
+
+**Simple tasks (prompt engineering):**
+- Classification (sentiment, category)
+- Extraction (entities, dates, names)
+- Formatting (JSON, CSV conversion)
+- Tone matching (company voice)
+
+**Complex tasks (consider fine-tuning):**
+- Multi-step reasoning (not in base model)
+- Domain-specific language (medical, legal)
+- Consistent complex behavior (100+ edge cases)
+- New capabilities (teach entirely new skill)
+
+### Criterion 2: Dataset Size
+
+```
+< 100 examples: Prompts only (insufficient for fine-tuning)
+100-1000: Prompts preferred (fine-tuning risky - overfitting)
+1000-10k: Fine-tuning viable if prompts fail
+> 10k: Fine-tuning effective
+```
+
+### Criterion 3: Cost-Benefit
+
+**Prompt engineering:**
+- Cost: $0 (just dev time)
+- Time: Minutes to hours (fast iteration)
+- Maintenance: Easy (just update prompt)
+
+**Fine-tuning:**
+- Cost: $100-1000+ (compute + data prep)
+- Time: Days to weeks (data prep + training + eval)
+- Maintenance: Hard (need retraining for updates)
+
+**ROI calculation:**
+```python
+# Prompt engineering cost
+prompt_dev_hours = 4
+hourly_rate = 100
+prompt_cost = 4 * 100 = $400
+
+# Fine-tuning cost
+data_prep_hours = 40
+training_cost = 500
+total_ft_cost = 40 * 100 + 500 = $4,500
+
+# Cost ratio: Fine-tuning is 11× more expensive
+# Only worth it if quality improvement > 10%
+```
+
+### Criterion 4: Performance Requirements
+
+**Quality:**
+- Need 90-95%: Prompts usually sufficient
+- Need 95-98%: Fine-tuning may help
+- Need 98%+: Fine-tuning + careful data curation
+
+**Latency:**
+- > 1 second acceptable: Prompts fine (long prompts OK)
+- 200-1000ms: Fine-tuning may help (reduce prompt size)
+- < 200ms: Fine-tuning + optimization required
+
+**Consistency:**
+- Variable outputs acceptable: Prompts OK (temperature > 0)
+- High consistency needed: Prompts (temperature=0) or fine-tuning
+- Perfect consistency: Fine-tuning + validation
+
+
+## Fine-Tuning Methods
+
+### 1. Full Fine-Tuning
+
+**Updates all model parameters.**
+
+**Pros:**
+- Maximum flexibility (can change any behavior)
+- Best quality (when you have massive data)
+
+**Cons:**
+- Expensive (7B model = 28GB memory for weights alone)
+- Slow (hours to days)
+- Risk of catastrophic forgetting
+- Hard to merge multiple fine-tunes
+
+**When to use:**
+- Massive dataset (100k+ examples)
+- Fundamental behavior change needed
+- Have large compute resources (multi-GPU)
+
+**Memory requirements:**
+```python
+# 7B parameter model (FP32)
+weights = 7B * 4 bytes = 28 GB
+gradients = 28 GB
+optimizer_states = 56 GB (Adam: 2× weights)
+activations = ~8 GB (batch_size=8)
+total = 120 GB  # Need multi-GPU!
+```
+
+### 2. LoRA (Low-Rank Adaptation)
+
+**Freezes base model, trains small adapter matrices.**
+
+**How it works:**
+```
+Original linear layer: W (d × k)
+LoRA: W + (A × B)
+  where A (d × r), B (r × k), r << d,k
+
+Example:
+W: 4096 × 4096 = 16.7M parameters
+A: 4096 × 8 = 32K parameters
+B: 8 × 4096 = 32K parameters
+A + B = 64K parameters (0.4% of original!)
+```
+
+**Pros:**
+- Extremely efficient (1% of parameters)
+- Fast training (10× faster than full FT)
+- Low memory (fits single GPU)
+- Easy to merge multiple LoRAs
+- No catastrophic forgetting (base model frozen)
+
+**Cons:**
+- Slightly lower capacity than full FT (99% quality usually)
+- Need to keep base model + adapters
+
+**When to use:**
+- 99% of fine-tuning cases
+- Limited compute (single GPU)
+- Fast iteration needed
+- Multiple tasks (train separate LoRAs, swap as needed)
+
+**Configuration:**
+```python
+from peft import LoraConfig, get_peft_model
+
+config = LoraConfig(
+    r=8,  # Rank (4-16 typical, higher = more capacity)
+    lora_alpha=32,  # Scaling (usually 2× rank)
+    target_modules=["q_proj", "v_proj"],  # Which layers
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM"
+)
+
+model = get_peft_model(base_model, config)
+print(model.print_trainable_parameters())
+# trainable params: 8.4M || all params: 7B || trainable%: 0.12%
+```
+
+**Rank selection:**
+```
+r=4: Minimal (fast, low capacity) - simple tasks
+r=8: Standard (balanced) - most tasks
+r=16: High capacity (slower, better quality) - complex tasks
+r=32+: Approaching full FT quality (diminishing returns)
+
+Start with r=8, increase only if quality insufficient
+```
+
+### 3. QLoRA (Quantized LoRA)
+
+**LoRA + 4-bit quantization of base model.**
+
+**Pros:**
+- Extremely memory efficient (4× less than LoRA)
+- 7B model fits on 16GB GPU
+- Same quality as LoRA
+
+**Cons:**
+- Slower than LoRA (quantization overhead)
+- More complex setup
+
+**When to use:**
+- Limited GPU memory (< 24GB)
+- Large models on consumer GPUs
+- Cost optimization (cheaper GPUs)
+
+**Setup:**
+```python
+from transformers import BitsAndBytesConfig
+
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_use_double_quant=True,
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-2-7b-hf",
+    quantization_config=bnb_config,
+    device_map="auto"
+)
+
+# Then add LoRA as usual
+model = get_peft_model(model, lora_config)
+```
+
+**Memory comparison:**
+```
+Method         | 7B Model | 13B Model | 70B Model
+---------------|----------|-----------|----------
+Full FT        | 120 GB   | 200 GB    | 1000 GB
+LoRA           | 40 GB    | 60 GB     | 300 GB
+QLoRA          | 12 GB    | 20 GB     | 80 GB
+```
+
+### Method Selection:
+
+```python
+if gpu_memory < 24:
+    use_qlora()
+elif gpu_memory < 80:
+    use_lora()
+elif have_massive_data and multi_gpu_cluster:
+    use_full_finetuning()
+else:
+    use_lora()  # Default choice
+```
+
+
+## Dataset Preparation
+
+**Quality > Quantity. 1,000 clean examples > 10,000 noisy examples.**
+
+### 1. Data Collection
+
+**Good sources:**
+- Human-labeled data (gold standard)
+- Curated conversations (high-quality)
+- Expert-written examples
+- Validated user interactions
+
+**Bad sources:**
+- Raw logs (errors, incomplete, noise)
+- Scraped data (quality varies wildly)
+- Automated generation (may have artifacts)
+- Untested user inputs (edge cases, adversarial)
+
+### 2. Data Cleaning
+
+```python
+def clean_dataset(raw_data):
+    clean = []
+
+    for example in raw_data:
+        # Filter 1: Remove errors
+        if any(err in example for err in ['error', 'exception', 'failed']):
+            continue
+
+        # Filter 2: Length checks
+        if len(example['input']) < 10 or len(example['output']) < 10:
+            continue  # Too short
+        if len(example['input']) > 2000 or len(example['output']) > 2000:
+            continue  # Too long (may be malformed)
+
+        # Filter 3: Completeness
+        if not example['output'].strip().endswith(('.', '!', '?')):
+            continue  # Incomplete response
+
+        # Filter 4: Language check
+        if not is_valid_language(example['output']):
+            continue  # Gibberish or wrong language
+
+        # Filter 5: Duplicates
+        if is_duplicate(example, clean):
+            continue
+
+        clean.append(example)
+
+    return clean
+
+cleaned = clean_dataset(raw_data)
+print(f"Filtered: {len(raw_data)} → {len(cleaned)}")
+# Example: 10,000 → 3,000 (but high quality!)
+```
+
+### 3. Manual Validation
+
+**Critical step: Spot check 100+ random examples.**
+
+```python
+import random
+
+sample = random.sample(cleaned, min(100, len(cleaned)))
+
+for i, ex in enumerate(sample):
+    print(f"\n--- Example {i+1}/100 ---")
+    print(f"Input: {ex['input']}")
+    print(f"Output: {ex['output']}")
+
+    response = input("Quality (good/bad/skip)? ")
+    if response == 'bad':
+        # Investigate pattern, add filtering rule
+        print("Why bad?")
+        reason = input()
+        # Update filtering logic
+```
+
+**What to check:**
+- ☐ Output is correct and complete
+- ☐ Output matches desired format/style
+- ☐ No errors or hallucinations
+- ☐ Appropriate length
+- ☐ Natural language (not robotic)
+- ☐ Consistent with other examples
+
+### 4. Dataset Format
+
+**OpenAI format (for GPT fine-tuning):**
+```json
+{
+  "messages": [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "What is the capital of France?"},
+    {"role": "assistant", "content": "The capital of France is Paris."}
+  ]
+}
+```
+
+**Hugging Face format:**
+```python
+from datasets import Dataset
+
+data = {
+    'input': ["question 1", "question 2", ...],
+    'output': ["answer 1", "answer 2", ...]
+}
+
+dataset = Dataset.from_dict(data)
+```
+
+### 5. Train/Val/Test Split
+
+```python
+from sklearn.model_selection import train_test_split
+
+# 70% train, 15% val, 15% test
+train, temp = train_test_split(data, test_size=0.3, random_state=42)
+val, test = train_test_split(temp, test_size=0.5, random_state=42)
+
+print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
+# Example: Train: 2100, Val: 450, Test: 450
+
+# Stratified split for imbalanced data
+train, temp = train_test_split(
+    data, test_size=0.3, stratify=data['label'], random_state=42
+)
+```
+
+**Split guidelines:**
+- Minimum validation: 100 examples
+- Minimum test: 100 examples
+- Large datasets (> 10k): 80/10/10 split
+- Small datasets (< 5k): 70/15/15 split
+
+### 6. Data Augmentation (Optional)
+
+**When you need more data:**
+
+```python
+# Paraphrasing
+"What's the weather?" → "How's the weather today?"
+
+# Back-translation
+English → French → English (introduces variation)
+
+# Synthetic generation (use carefully!)
+few_shot_examples = [...]
+new_examples = llm.generate(
+    f"Generate 10 examples similar to: {few_shot_examples}"
+)
+# ALWAYS manually validate synthetic data!
+```
+
+**Warning:** Synthetic data can introduce artifacts. Always validate!
+
+
+## Hyperparameters
+
+### Learning Rate
+
+**Most critical hyperparameter.**
+
+```python
+# Pre-training LR: 1e-3 to 3e-4
+# Fine-tuning LR: 100-1000× smaller!
+
+training_args = TrainingArguments(
+    learning_rate=1e-5,  # Start here for 7B models
+    # Or even more conservative:
+    learning_rate=1e-6,  # For larger models or small datasets
+)
+```
+
+**Guidelines:**
+```
+Model size     | Pre-train LR | Fine-tune LR
+---------------|--------------|-------------
+1B params      | 3e-4         | 3e-5 to 1e-5
+7B params      | 3e-4         | 1e-5 to 1e-6
+13B params     | 2e-4         | 5e-6 to 1e-6
+70B+ params    | 1e-4         | 1e-6 to 1e-7
+
+Rule: Fine-tune LR ≈ Pre-train LR / 100
+```
+
+**LR scheduling:**
+```python
+from transformers import get_linear_schedule_with_warmup
+
+optimizer = AdamW(model.parameters(), lr=1e-5)
+scheduler = get_linear_schedule_with_warmup(
+    optimizer,
+    num_warmup_steps=100,  # Gradual LR increase (10% of training)
+    num_training_steps=total_steps
+)
+```
+
+**Signs of wrong LR:**
+
+Too high (LR > 1e-4):
+- Training loss oscillates wildly
+- Model generates gibberish
+- Catastrophic forgetting (fails on general tasks)
+
+Too low (LR < 1e-7):
+- Training loss barely decreases
+- Model doesn't adapt to new data
+- Very slow convergence
+
+### Epochs
+
+```python
+training_args = TrainingArguments(
+    num_train_epochs=3,  # Standard: 3-5 epochs
+)
+```
+
+**Guidelines:**
+```
+Dataset size | Epochs
+-------------|-------
+< 1k         | 5-10 (more passes needed)
+1k-5k        | 3-5 (standard)
+5k-10k       | 2-3
+> 10k        | 1-2 (large dataset, fewer passes)
+
+Rule: Smaller dataset → more epochs (but watch for overfitting!)
+```
+
+**Too many epochs:**
+- Training loss → 0 but val loss increases (overfitting)
+- Model memorizes training data
+- Catastrophic forgetting
+
+**Too few epochs:**
+- Model hasn't fully adapted
+- Training and val loss still decreasing
+
+### Batch Size
+
+```python
+training_args = TrainingArguments(
+    per_device_train_batch_size=8,  # Depends on GPU memory
+    gradient_accumulation_steps=4,   # Effective batch = 8 × 4 = 32
+)
+```
+
+**Guidelines:**
+```
+GPU Memory | Batch Size (7B model)
+-----------|----------------------
+16 GB      | 1-2 (use gradient accumulation!)
+24 GB      | 2-4
+40 GB      | 4-8
+80 GB      | 8-16
+
+Effective batch size (with accumulation): 16-64 typical
+```
+
+**Gradient accumulation:**
+```python
+# Simulate batch_size=32 with only 8 examples fitting in memory:
+per_device_train_batch_size=8
+gradient_accumulation_steps=4
+# Effective batch = 8 × 4 = 32
+```
+
+### Weight Decay
+
+```python
+training_args = TrainingArguments(
+    weight_decay=0.01,  # L2 regularization (prevent overfitting)
+)
+```
+
+**Guidelines:**
+- Standard: 0.01
+- Strong regularization: 0.1 (small dataset, high overfitting risk)
+- Light regularization: 0.001 (large dataset)
+
+### Warmup
+
+```python
+training_args = TrainingArguments(
+    warmup_steps=100,  # Or warmup_ratio=0.1 (10% of training)
+)
+```
+
+**Why warmup:**
+- Prevents initial instability (large gradients early)
+- Gradual LR increase: 0 → target_LR over warmup steps
+
+**Guidelines:**
+- Warmup: 5-10% of total training steps
+- Longer warmup for larger models
+
+
+## Training
+
+### Basic Training Loop
+
+```python
+from transformers import Trainer, TrainingArguments
+
+training_args = TrainingArguments(
+    output_dir="./results",
+
+    # Hyperparameters
+    learning_rate=1e-5,
+    num_train_epochs=3,
+    per_device_train_batch_size=8,
+    gradient_accumulation_steps=4,
+    weight_decay=0.01,
+    warmup_steps=100,
+
+    # Evaluation
+    evaluation_strategy="steps",
+    eval_steps=100,
+    save_strategy="steps",
+    save_steps=100,
+    load_best_model_at_end=True,
+    metric_for_best_model="eval_loss",
+
+    # Logging
+    logging_steps=10,
+    logging_dir="./logs",
+
+    # Optimization
+    fp16=True,  # Mixed precision (faster, less memory)
+    gradient_checkpointing=True,  # Trade compute for memory
+)
+
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=val_dataset,
+    tokenizer=tokenizer,
+)
+
+trainer.train()
+```
+
+### Monitoring Training
+
+**Key metrics to watch:**
+
+```python
+# 1. Training loss (should decrease steadily)
+# 2. Validation loss (should decrease, then plateau)
+# 3. Validation metrics (accuracy, F1, BLEU, etc.)
+
+# Warning signs:
+# - Train loss → 0 but val loss increasing: Overfitting
+# - Train loss oscillating: LR too high
+# - Train loss not decreasing: LR too low or data issues
+```
+
+**Logging:**
+```python
+import wandb
+
+wandb.init(project="fine-tuning")
+
+training_args = TrainingArguments(
+    report_to="wandb",  # Log to Weights & Biases
+    logging_steps=10,
+)
+```
+
+### Early Stopping
+
+```python
+from transformers import EarlyStoppingCallback
+
+trainer = Trainer(
+    ...
+    callbacks=[EarlyStoppingCallback(
+        early_stopping_patience=3,  # Stop if no improvement for 3 evals
+        early_stopping_threshold=0.01,  # Minimum improvement
+    )]
+)
+```
+
+**Why early stopping:**
+- Prevents overfitting (stops before val loss increases)
+- Saves compute (don't train unnecessary epochs)
+- Automatically finds optimal epoch count
+
+
+## Evaluation
+
+### 1. Validation During Training
+
+```python
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+
+    # Decode predictions
+    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
+    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
+
+    # Compute metrics
+    from sklearn.metrics import accuracy_score, f1_score
+    accuracy = accuracy_score(decoded_labels, decoded_preds)
+    f1 = f1_score(decoded_labels, decoded_preds, average='weighted')
+
+    return {'accuracy': accuracy, 'f1': f1}
+
+trainer = Trainer(
+    ...
+    compute_metrics=compute_metrics,
+)
+```
+
+### 2. Test Set Evaluation (Final)
+
+```python
+# After training completes, evaluate on held-out test set ONCE
+test_results = trainer.evaluate(test_dataset)
+
+print(f"Test accuracy: {test_results['accuracy']:.2%}")
+print(f"Test F1: {test_results['f1']:.2%}")
+```
+
+### 3. Qualitative Evaluation
+
+**Critical: Manually test on real examples!**
+
+```python
+def test_model(model, tokenizer, test_examples):
+    for ex in test_examples:
+        prompt = ex['input']
+        expected = ex['output']
+
+        # Generate
+        inputs = tokenizer(prompt, return_tensors="pt")
+        outputs = model.generate(**inputs, max_length=100)
+        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+        print(f"Input: {prompt}")
+        print(f"Expected: {expected}")
+        print(f"Generated: {generated}")
+        print(f"Match: {'✓' if generated == expected else '✗'}")
+        print("-" * 80)
+
+# Test on 20-50 examples (including edge cases)
+test_model(model, tokenizer, test_examples)
+```
+
+### 4. A/B Testing (Production)
+
+```python
+# Route 50% traffic to base model, 50% to fine-tuned
+import random
+
+def get_model():
+    if random.random() < 0.5:
+        return base_model
+    else:
+        return finetuned_model
+
+# Measure:
+# - User satisfaction (thumbs up/down)
+# - Task success rate
+# - Response time
+# - Cost per request
+
+# After 1000+ requests, analyze results
+```
+
+### 5. Catastrophic Forgetting Check
+
+**Critical: Ensure fine-tuning didn't break base capabilities!**
+
+```python
+# Test on general knowledge tasks
+general_tasks = [
+    "What is the capital of France?",  # Basic knowledge
+    "Translate to Spanish: Hello",    # Translation
+    "2 + 2 = ?",                       # Basic math
+    "Who wrote Hamlet?",               # Literature
+]
+
+for task in general_tasks:
+    before = base_model.generate(task)
+    after = finetuned_model.generate(task)
+
+    print(f"Task: {task}")
+    print(f"Before: {before}")
+    print(f"After: {after}")
+    print(f"Preserved: {'✓' if before == after else '✗'}")
+```
+
+
+## Common Issues and Solutions
+
+### Issue 1: Overfitting
+
+**Symptoms:**
+- Train loss → 0, val loss increases
+- Perfect on training data, poor on test data
+
+**Solutions:**
+```python
+# 1. Reduce epochs
+num_train_epochs=3  # Instead of 10
+
+# 2. Increase regularization
+weight_decay=0.1  # Instead of 0.01
+
+# 3. Early stopping
+early_stopping_patience=3
+
+# 4. Collect more data
+# 5. Data augmentation
+
+# 6. Use LoRA (less prone to overfitting than full FT)
+```
+
+### Issue 2: Catastrophic Forgetting
+
+**Symptoms:**
+- Fine-tuned model fails on general tasks
+- Lost pre-trained knowledge
+
+**Solutions:**
+```python
+# 1. Lower learning rate (most important!)
+learning_rate=1e-6  # Instead of 1e-4
+
+# 2. Fewer epochs
+num_train_epochs=2  # Instead of 10
+
+# 3. Use LoRA (base model frozen, can't forget)
+
+# 4. Add general examples to training set (10-20% general data)
+```
+
+### Issue 3: Poor Quality
+
+**Symptoms:**
+- Model output is low quality (incorrect, incoherent)
+
+**Solutions:**
+```python
+# 1. Check dataset quality (most common cause!)
+# - Manual validation
+# - Remove noise
+# - Fix labels
+
+# 2. Increase model size
+# - 7B → 13B → 70B
+
+# 3. Increase training data
+# - Need 1000+ high-quality examples
+
+# 4. Adjust hyperparameters
+# - Try higher LR (1e-5 → 3e-5) if underfit
+# - Train longer (3 → 5 epochs)
+
+# 5. Check if base model has capability
+# - If base model can't do task, fine-tuning won't help
+```
+
+### Issue 4: Slow Training
+
+**Symptoms:**
+- Training takes days/weeks
+
+**Solutions:**
+```python
+# 1. Use LoRA (10× faster than full FT)
+
+# 2. Mixed precision
+fp16=True  # 2× faster
+
+# 3. Gradient checkpointing (trade speed for memory)
+gradient_checkpointing=True
+
+# 4. Smaller batch size + gradient accumulation
+per_device_train_batch_size=2
+gradient_accumulation_steps=16
+
+# 5. Use multiple GPUs
+# 6. Use faster GPU (A100 > V100 > T4)
+```
+
+### Issue 5: Out of Memory
+
+**Symptoms:**
+- CUDA out of memory error
+
+**Solutions:**
+```python
+# 1. Use QLoRA (4× less memory)
+
+# 2. Reduce batch size
+per_device_train_batch_size=1
+gradient_accumulation_steps=32
+
+# 3. Gradient checkpointing
+gradient_checkpointing=True
+
+# 4. Use smaller model
+# 7B → 3B → 1B
+
+# 5. Reduce sequence length
+max_seq_length=512  # Instead of 2048
+```
+
+
+## Best Practices Summary
+
+### Before Fine-Tuning:
+
+1. ☐ Try prompt engineering first (90% of cases, prompts work!)
+2. ☐ Have 1000+ high-quality examples
+3. ☐ Clean and validate dataset (quality > quantity)
+4. ☐ Create train/val/test split (70/15/15)
+5. ☐ Define success metrics (what does "good" mean?)
+
+### During Fine-Tuning:
+
+6. ☐ Use LoRA (unless specific reason for full FT)
+7. ☐ Set tiny learning rate (1e-5 to 1e-6 for 7B models)
+8. ☐ Train for 3-5 epochs (not 50!)
+9. ☐ Monitor val loss (stop when it stops improving)
+10. ☐ Log everything (wandb, tensorboard)
+
+### After Fine-Tuning:
+
+11. ☐ Evaluate on test set (quantitative metrics)
+12. ☐ Manual testing (qualitative, 20-50 examples)
+13. ☐ Check for catastrophic forgetting (general tasks)
+14. ☐ A/B test in production (before full rollout)
+15. ☐ Document hyperparameters (for reproducibility)
+
+
+## Quick Reference
+
+| Task | Method | Dataset | LR | Epochs |
+|------|--------|---------|----|----|
+| Tone matching | Prompts | N/A | N/A | N/A |
+| Simple classification | Prompts | N/A | N/A | N/A |
+| Complex domain task | LoRA | 1k-10k | 1e-5 | 3-5 |
+| Fundamental change | Full FT | 100k+ | 1e-5 | 1-3 |
+| Limited GPU | QLoRA | 1k-10k | 1e-5 | 3-5 |
+
+**Default recommendation:** Try prompts first. If that fails, use LoRA with LR=1e-5, epochs=3, and high-quality dataset.
+
+
+## Summary
+
+**Core principles:**
+
+1. **Prompt engineering first**: 90% of tasks don't need fine-tuning
+2. **LoRA by default**: 100× more efficient than full fine-tuning, same quality
+3. **Data quality matters**: 1,000 clean examples > 10,000 noisy examples
+4. **Tiny learning rate**: Fine-tune LR = Pre-train LR / 100 to / 1000
+5. **Validation essential**: Train/val/test split + early stopping + catastrophic forgetting check
+
+**Decision tree:**
+1. Try prompts (system message + few-shot)
+2. If quality < 90%, optimize prompts
+3. If still < 90% and have 1000+ examples, consider fine-tuning
+4. Use LoRA (default), QLoRA (limited GPU), or full FT (rare)
+5. Set LR = 1e-5, epochs = 3-5, monitor val loss
+6. Evaluate on test set + manual testing + general tasks
+
+**Key insight**: Fine-tuning is powerful but expensive and slow. Start with prompts, fine-tune only when prompts demonstrably fail and you have high-quality data.