# Quantization for Inference Skill
## When to Use This Skill
Use this skill when you observe these symptoms:
**Performance Symptoms:**
- Model inference too slow on CPU (e.g., >10ms when you need <5ms)
- Batch processing taking too long (low throughput)
- Need to serve more requests per second with same hardware
**Size Symptoms:**
- Model too large for edge devices (e.g., 100MB+ for mobile)
- Want to fit more models in GPU memory
- Memory-constrained deployment environment
**Deployment Symptoms:**
- Deploying to CPU servers (quantization gives 2-4× CPU speedup)
- Deploying to edge devices (mobile, IoT, embedded systems)
- Cost-sensitive deployment (smaller models = lower hosting costs)
**When NOT to use this skill:**
- Model already fast enough and small enough (no problem to solve)
- Deploying exclusively on GPU with no memory constraints (modest benefit)
- Prototyping phase where optimization is premature
- Model so small that quantization overhead not worth it (e.g., <5MB)
## Core Principle
**Quantization trades precision for performance.**
Quantization converts high-precision numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits or INT4: 4 bits). This provides:
- **4-8× smaller model size** (fewer bits per parameter)
- **2-4× faster inference on CPU** (INT8 operations faster than FP32)
- **Small accuracy loss** (typically 0.5-1% for INT8)
**Formula:** Lower precision (FP32 → INT8 → INT4) = Smaller size + Faster inference + More accuracy loss
The skill is choosing the **right precision for your accuracy tolerance**.
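**Worked example (illustrative):** INT8 quantization typically uses an affine mapping `x ≈ scale × (q − zero_point)`. A minimal sketch with made-up values shows where the size win and the rounding error come from:
```python
import torch

# Minimal sketch of affine INT8 quantization: x ≈ scale * (q - zero_point)
x = torch.tensor([-1.3, 0.0, 0.7, 2.5])             # FP32 values (illustrative)
qmin, qmax = -128, 127                               # signed INT8 range
scale = (x.max() - x.min()) / (qmax - qmin)          # map observed range onto 256 levels
zero_point = int(qmin - torch.round(x.min() / scale))

q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
x_hat = scale * (q.to(torch.float32) - zero_point)   # dequantized approximation
print(q, x_hat)  # each value now costs 8 bits instead of 32, at the price of rounding error
```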
## Quantization Framework
```
┌──────────────────────────────────────────────┐
│ 1. Recognize Quantization Need               │
│    CPU/Edge + (Slow OR Large)                │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 2. Choose Quantization Type                  │
│    Dynamic → Static → QAT (increasing cost)  │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 3. Calibrate (if Static/QAT)                 │
│    100-1000 representative samples           │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 4. Validate Accuracy Trade-offs              │
│    Baseline vs Quantized accuracy            │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 5. Decide: Accept or Iterate                 │
│    <2% loss → Deploy                         │
│    >2% loss → Try QAT or different precision │
└──────────────────────────────────────────────┘
```
## Part 1: Quantization Types
### Type 1: Dynamic Quantization
**What it does:** Quantizes weights to INT8, keeps activations in FP32.
**When to use:**
- Simplest quantization (no calibration needed)
- Primary goal is size reduction
- Batch processing where latency less critical
- Quick experiment to see if quantization helps
**Benefits:**
- ✅ 4× size reduction (weights are 75% of model size)
- ✅ 1.2-1.5× CPU speedup (modest, because activations still FP32)
- ✅ Minimal accuracy loss (~0.2-0.5%)
- ✅ No calibration data needed
**Limitations:**
- ⚠️ Limited CPU speedup (activations still FP32)
- ⚠️ Not optimal for edge devices needing maximum performance
**PyTorch implementation:**
```python
import os

import torch
import torch.quantization

# WHY: Dynamic quantization is simplest - just one function call
# No calibration data needed because activations stay FP32
model = torch.load('model.pth')
model.eval()  # WHY: Must be in eval mode (no batchnorm updates)

# WHY: Specify which layers to quantize (Linear, LSTM, etc.)
# These layers benefit most from quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},  # WHY: Quantize Linear layers only
    dtype=torch.qint8                # WHY: INT8 is standard precision
)

# Save quantized model
torch.save(quantized_model.state_dict(), 'model_quantized_dynamic.pth')

# Verify size reduction
original_size = os.path.getsize('model.pth') / (1024 ** 2)  # MB
quantized_size = os.path.getsize('model_quantized_dynamic.pth') / (1024 ** 2)
print(f"Original: {original_size:.1f}MB → Quantized: {quantized_size:.1f}MB")
print(f"Size reduction: {original_size / quantized_size:.1f}×")
```
**Example use case:** BERT classification model where primary goal is reducing size from 440MB to 110MB for easier deployment.
### Type 2: Static Quantization (Post-Training Quantization)
**What it does:** Quantizes both weights and activations to INT8.
**When to use:**
- Need maximum CPU speedup (2-4×)
- Deploying to CPU servers or edge devices
- Can afford calibration step (5-10 minutes)
- Primary goal is inference speed
**Benefits:**
- ✅ 4× size reduction (same as dynamic)
- ✅ 2-4× CPU speedup (both weights and activations INT8)
- ✅ No retraining required (post-training)
- ✅ Acceptable accuracy loss (~0.5-1%)
**Requirements:**
- ⚠️ Needs calibration data (100-1000 samples from validation set)
- ⚠️ Slightly more complex setup than dynamic
**PyTorch implementation:**
```python
import time

import torch
import torch.quantization

def calibrate_model(model, calibration_loader):
    """
    Calibrate model by running representative data through it.

    WHY: Static quantization needs to know activation ranges.
    Calibration finds min/max values for each activation layer.

    Args:
        model: Model in eval mode with quantization stubs
        calibration_loader: DataLoader with 100-1000 samples
    """
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(calibration_loader):
            model(data)
            if batch_idx >= 100:  # WHY: 100 batches usually sufficient
                break
    return model

# Step 1: Prepare model for quantization
model = torch.load('model.pth')
model.eval()

# WHY: Insert quantization/dequantization stubs at boundaries
# This tells PyTorch where to convert between FP32 and INT8
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Step 2: Calibrate with representative data
# WHY: Must use data from training/validation set, not random data
# Calibration finds activation ranges - needs real distribution
calibration_dataset = torch.utils.data.Subset(
    val_dataset,
    indices=range(1000)  # WHY: 1000 samples sufficient for most models
)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)
model = calibrate_model(model, calibration_loader)

# Step 3: Convert to quantized model
torch.quantization.convert(model, inplace=True)

# Save quantized model
torch.save(model.state_dict(), 'model_quantized_static.pth')

# Benchmark speed improvement
def benchmark(model, data, num_iterations=100):
    """WHY: Warm up model first, then measure average latency."""
    model.eval()
    # Warm up (first few iterations slower)
    for _ in range(10):
        model(data)
    start = time.time()
    with torch.no_grad():
        for _ in range(num_iterations):
            model(data)
    end = time.time()
    return (end - start) / num_iterations * 1000  # ms per inference

# WHY: Keep an unquantized copy for comparison (the model above was converted in-place)
original_model = torch.load('model.pth')
original_model.eval()

test_data = torch.randn(1, 3, 224, 224)  # Example input
baseline_latency = benchmark(original_model, test_data)
quantized_latency = benchmark(model, test_data)
print(f"Baseline: {baseline_latency:.2f}ms")
print(f"Quantized: {quantized_latency:.2f}ms")
print(f"Speedup: {baseline_latency / quantized_latency:.2f}×")
```
**Example use case:** ResNet50 image classifier for CPU inference - need <5ms latency, achieve 4ms with static quantization (vs 15ms baseline).
### Type 3: Quantization-Aware Training (QAT)
**What it does:** Simulates quantization during training to minimize accuracy loss.
**When to use:**
- Static quantization accuracy loss too large (>2%)
- Need best possible accuracy with INT8
- Can afford retraining (hours to days)
- Critical production system with strict accuracy requirements
**Benefits:**
- ✅ Best accuracy (~0.1-0.3% loss vs 0.5-1% for static)
- ✅ 4× size reduction (same as dynamic/static)
- ✅ 2-4× CPU speedup (same as static)
**Limitations:**
- ⚠️ Requires retraining (most expensive option)
- ⚠️ Takes hours to days depending on model size
- ⚠️ More complex implementation
**PyTorch implementation:**
```python
import torch
import torch.quantization

def train_one_epoch_qat(model, train_loader, optimizer, criterion):
    """
    Train one epoch with quantization-aware training.

    WHY: QAT inserts fake quantization ops during training.
    Model learns to be robust to quantization errors.
    """
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    return model

# Step 1: Prepare model for QAT
model = torch.load('model.pth')
model.train()

# WHY: QAT config includes fake quantization ops
# These simulate quantization during forward pass
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Step 2: Train with quantization-aware training
# WHY: Model learns to compensate for quantization errors
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)  # WHY: Low LR for fine-tuning
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 5  # WHY: Usually 5-10 epochs sufficient for QAT fine-tuning
for epoch in range(num_epochs):
    model = train_one_epoch_qat(model, train_loader, optimizer, criterion)
    print(f"Epoch {epoch+1}/{num_epochs} complete")

# Step 3: Convert to quantized model
model.eval()
torch.quantization.convert(model, inplace=True)

# Save QAT quantized model
torch.save(model.state_dict(), 'model_quantized_qat.pth')
```
**Example use case:** Medical imaging model where accuracy is critical - static quantization gives 2% accuracy loss, QAT reduces to 0.3%.
## Part 2: Quantization Type Decision Matrix
| Type | Complexity | Calibration | Retraining | Size Reduction | CPU Speedup | Accuracy Loss |
|------|-----------|-------------|------------|----------------|-------------|---------------|
| **Dynamic** | Low | No | No | 4× | 1.2-1.5× | ~0.2-0.5% |
| **Static** | Medium | Yes | No | 4× | 2-4× | ~0.5-1% |
| **QAT** | High | Yes | Yes | 4× | 2-4× | ~0.1-0.3% |
**Decision flow:**
1. Start with **dynamic quantization**: Simplest, verify quantization helps
2. Upgrade to **static quantization**: If need more speedup, can afford calibration
3. Use **QAT**: Only if accuracy loss from static too large (rare)
**Why this order?** Incremental cost. Dynamic is free (5 minutes), static is cheap (15 minutes), QAT is expensive (hours/days). Don't pay for QAT unless you need it.
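The flow above can be captured as a small helper. This is an illustrative sketch only - the function name, arguments, and thresholds are assumptions, not a library API:
```python
def choose_quantization_type(needs_cpu_speedup: bool,
                             accuracy_loss_budget_pct: float,
                             can_retrain: bool) -> str:
    """Illustrative mapping of the decision flow above to code (names are assumptions)."""
    if not needs_cpu_speedup:
        return "dynamic"   # size reduction only: cheapest option first
    if accuracy_loss_budget_pct >= 1.0 or not can_retrain:
        return "static"    # calibration-based PTQ covers most CPU deployments
    return "qat"           # pay for retraining only when the accuracy budget is tight

print(choose_quantization_type(True, 0.5, True))  # -> "qat"
```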
## Part 3: Calibration Best Practices
### What is Calibration?
**Purpose:** Find min/max ranges for each activation layer.
**Why needed:** Static quantization needs to know activation ranges to map FP32 → INT8. Without calibration, ranges are wrong → accuracy collapses.
**How it works:**
1. Run representative data through model
2. Record min/max activation values per layer
3. Use these ranges to quantize activations at inference time
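A minimal sketch of what a calibration pass records, using forward hooks to track per-layer min/max ranges (illustrative - PyTorch's built-in observers do this automatically when you call `prepare`):
```python
import torch
import torch.nn as nn

def attach_minmax_observers(model: nn.Module):
    """Record per-layer activation ranges during a calibration pass (illustrative)."""
    ranges = {}

    def make_hook(name):
        def hook(module, inputs, output):
            lo, hi = ranges.get(name, (float("inf"), float("-inf")))
            ranges[name] = (min(lo, output.min().item()), max(hi, output.max().item()))
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d, nn.ReLU)):
            module.register_forward_hook(make_hook(name))
    return ranges

# Usage sketch: run 100-1000 calibration batches, then inspect the recorded ranges
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
ranges = attach_minmax_observers(model)
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(32, 8))
print(ranges)  # per-layer (min, max) pairs - these ranges drive the FP32 → INT8 mapping
```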
### Calibration Data Requirements
**Data source:**
- ✅ **Use validation set samples** (matches training distribution)
- ❌ Don't use random images from internet (different distribution)
- ❌ Don't use single image repeated (insufficient coverage)
- ❌ Don't use training set that doesn't match deployment (distribution shift)
**Data size:**
- **Minimum:** 100 samples (sufficient for simple models)
- **Recommended:** 500-1000 samples (better coverage)
- **Maximum:** Full validation set is overkill (slow, no benefit)
**Data characteristics:**
- Must cover range of inputs model sees in production
- Include edge cases (bright/dark images, long/short text)
- Distribution should match deployment, not just training
- Class balance less important than input diversity
**Example calibration data selection:**
```python
import numpy as np
import torch

def select_calibration_data(val_dataset, num_samples=1000):
    """
    Select diverse calibration samples from validation set.

    WHY: Want samples that cover range of activation values.
    Random selection from validation set usually sufficient.

    Args:
        val_dataset: Full validation dataset
        num_samples: Number of calibration samples (default 1000)

    Returns:
        Calibration dataset subset
    """
    # WHY: Random selection ensures diversity
    # Stratified sampling can help ensure class coverage
    indices = np.random.choice(len(val_dataset), num_samples, replace=False)
    calibration_dataset = torch.utils.data.Subset(val_dataset, indices)
    return calibration_dataset

# Example: Select 1000 random samples from validation set
calibration_dataset = select_calibration_data(val_dataset, num_samples=1000)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)
```
### Common Calibration Pitfalls
**Pitfall 1: Using wrong data distribution**
- ❌ "Random images from internet" for ImageNet-trained model
- ✅ Use ImageNet validation set samples
**Pitfall 2: Too few samples**
- ❌ 10 samples (insufficient coverage of activation ranges)
- ✅ 100-1000 samples (good coverage)
**Pitfall 3: Using training data that doesn't match deployment**
- ❌ Calibrate on sunny outdoor images, deploy on indoor images
- ✅ Calibrate on data matching deployment distribution
**Pitfall 4: Skipping calibration validation**
- ❌ Calibrate once, assume it works
- ✅ Validate accuracy after calibration to verify ranges are good
## Part 4: Precision Selection (INT8 vs INT4 vs FP16)
### Precision Spectrum
| Precision | Bits | Size vs FP32 | Speedup (CPU) | Typical Accuracy Loss |
|-----------|------|--------------|---------------|----------------------|
| **FP32** | 32 | 1× | 1× | 0% (baseline) |
| **FP16** | 16 | 2× | 1.5× | <0.1% |
| **INT8** | 8 | 4× | 2-4× | 0.5-1% |
| **INT4** | 4 | 8× | 4-8× | 1-3% |
**Trade-off:** Lower precision = Smaller size + Faster inference + More accuracy loss
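**Sanity check on the size column:** expected model size is roughly `parameter_count × bits_per_parameter / 8`, ignoring the small overhead for scales/zero-points. A quick back-of-the-envelope sketch:
```python
def approx_model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Rough size estimate: parameters × bits, ignoring scale/zero-point overhead."""
    return num_params * bits_per_param / 8 / (1024 ** 2)

resnet50_params = 25_600_000  # ~25.6M parameters
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{approx_model_size_mb(resnet50_params, bits):.0f} MB")
# FP32 ≈ 98 MB, FP16 ≈ 49 MB, INT8 ≈ 24 MB, INT4 ≈ 12 MB
```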
### When to Use Each Precision
**FP16 (Half Precision):**
- GPU inference (Tensor Cores optimized for FP16)
- Need minimal accuracy loss (<0.1%)
- Size reduction secondary concern
- **Example:** Large language models on GPU
**INT8 (Standard Quantization):**
- CPU inference (INT8 operations fast on CPU)
- Edge device deployment
- Good balance of size/speed/accuracy
- **Most common choice** for production deployment
- **Example:** Image classification on mobile devices
**INT4 (Aggressive Quantization):**
- Extremely memory-constrained (e.g., 1GB mobile devices)
- Can tolerate larger accuracy loss (1-3%)
- Need maximum size reduction (8×)
- **Use sparingly** - accuracy risk high
- **Example:** Large language models (LLaMA-7B: 13GB → 3.5GB)
### Decision Flow
```python
def choose_precision(accuracy_tolerance, deployment_target):
    """
    Choose quantization precision based on requirements.

    WHY: Different precisions for different constraints.
    INT8 is default, FP16 for GPU, INT4 for extreme memory constraints.
    """
    if accuracy_tolerance < 0.1:
        return "FP16"  # Minimal accuracy loss required
    elif deployment_target == "GPU":
        return "FP16"  # GPU optimized for FP16
    elif deployment_target in ["CPU", "edge"]:
        return "INT8"  # CPU optimized for INT8
    elif deployment_target == "extreme_edge" and accuracy_tolerance > 1:
        return "INT4"  # Only if can tolerate 1-3% loss
    else:
        return "INT8"  # Default safe choice
```
## Part 5: ONNX Quantization (Cross-Framework)
**When to use:** Deploying to ONNX Runtime (CPU/edge devices) or need cross-framework compatibility.
### ONNX Static Quantization
```python
import time

import numpy as np
import torch
import onnxruntime
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    quantize_static,
)

class CalibrationDataReaderWrapper(CalibrationDataReader):
    """
    WHY: ONNX requires custom calibration data reader.
    This class feeds calibration data to ONNX quantization engine.
    """
    def __init__(self, calibration_data):
        self.calibration_data = calibration_data
        self.iterator = iter(calibration_data)

    def get_next(self):
        """WHY: Called by ONNX to get next calibration batch."""
        try:
            data, _ = next(self.iterator)
            return {"input": data.numpy()}  # WHY: Return dict of input name → data
        except StopIteration:
            return None

# Step 1: Export PyTorch model to ONNX
model = torch.load('model.pth')
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    opset_version=13  # WHY: ONNX opset 13+ supports quantization ops
)

# Step 2: Prepare calibration data
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=1,  # WHY: ONNX calibration uses batch size 1
    shuffle=False
)
calibration_reader = CalibrationDataReaderWrapper(calibration_loader)

# Step 3: Quantize ONNX model
quantize_static(
    'model.onnx',
    'model_quantized.onnx',
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ  # WHY: QDQ format compatible with most backends
)

# Step 4: Benchmark ONNX quantized model
session = onnxruntime.InferenceSession('model_quantized.onnx')
input_name = session.get_inputs()[0].name
test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warm up
for _ in range(10):
    session.run(None, {input_name: test_data})

# Benchmark
start = time.time()
for _ in range(100):
    session.run(None, {input_name: test_data})
end = time.time()
latency = (end - start) / 100 * 1000  # ms per inference
print(f"ONNX Quantized latency: {latency:.2f}ms")
```
**ONNX advantages:**
- Cross-framework (works with PyTorch, TensorFlow, etc.)
- Optimized ONNX Runtime for CPU inference
- Good hardware backend support (x86, ARM)
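Before trusting the quantized ONNX model, compare its outputs against the FP32 export on a few inputs. A minimal sketch (file names follow the example above):
```python
import numpy as np
import onnxruntime

# Quick sanity check: FP32 vs INT8 ONNX outputs should stay close on sample inputs
fp32_session = onnxruntime.InferenceSession('model.onnx')
int8_session = onnxruntime.InferenceSession('model_quantized.onnx')
input_name = fp32_session.get_inputs()[0].name

x = np.random.randn(1, 3, 224, 224).astype(np.float32)
fp32_out = fp32_session.run(None, {input_name: x})[0]
int8_out = int8_session.run(None, {input_name: x})[0]

max_abs_diff = np.abs(fp32_out - int8_out).max()
print(f"Max absolute output difference: {max_abs_diff:.4f}")
# Large differences here usually point to bad calibration data or unsupported ops
```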
## Part 6: Accuracy Validation (Critical Step)
### Why Accuracy Validation Matters
Quantization is **lossy compression**. Must measure accuracy impact:
- Some models tolerate quantization well (<0.5% loss)
- Some models sensitive to quantization (>2% loss)
- Some layers more sensitive than others
- **Can't assume quantization is safe without measuring**
### Validation Methodology
```python
import torch

def validate_quantization(original_model, quantized_model, val_loader):
    """
    Validate quantization by comparing accuracy.

    WHY: Quantization is lossy - must measure impact.
    Compare baseline vs quantized on same validation set.

    Returns:
        dict with baseline_acc, quantized_acc, accuracy_loss
    """
    def evaluate(model, data_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in data_loader:
                output = model(data)
                pred = output.argmax(dim=1)
                correct += (pred == target).sum().item()
                total += target.size(0)
        return 100.0 * correct / total

    baseline_acc = evaluate(original_model, val_loader)
    quantized_acc = evaluate(quantized_model, val_loader)
    accuracy_loss = baseline_acc - quantized_acc

    return {
        'baseline_acc': baseline_acc,
        'quantized_acc': quantized_acc,
        'accuracy_loss': accuracy_loss,
        'acceptable': accuracy_loss < 2.0  # WHY: <2% loss usually acceptable
    }

# Example validation
results = validate_quantization(original_model, quantized_model, val_loader)
print(f"Baseline accuracy: {results['baseline_acc']:.2f}%")
print(f"Quantized accuracy: {results['quantized_acc']:.2f}%")
print(f"Accuracy loss: {results['accuracy_loss']:.2f}%")
print(f"Acceptable: {results['acceptable']}")

# Decision logic
if results['acceptable']:
    print("✅ Quantization acceptable - deploy quantized model")
else:
    print("❌ Accuracy loss too large - try QAT or reconsider quantization")
```
### Acceptable Accuracy Thresholds
**General guidelines:**
- **<1% loss:** Excellent quantization result
- **1-2% loss:** Acceptable for most applications
- **2-3% loss:** Consider QAT to reduce loss
- **>3% loss:** Quantization may not be suitable for this model
**Task-specific thresholds:**
- Image classification: 1-2% top-1 accuracy loss acceptable
- Object detection: 1-2% mAP loss acceptable
- NLP classification: 0.5-1% accuracy loss acceptable
- Medical/safety-critical: <0.5% loss required (use QAT)
## Part 7: LLM Quantization (GPTQ, AWQ)
**Note:** This skill covers general quantization. For LLM-specific optimization (GPTQ, AWQ, KV cache, etc.), see the `llm-inference-optimization` skill in the llm-specialist pack.
### LLM Quantization Overview
**Why LLMs need quantization:**
- Very large (7B parameters = 13GB in FP16)
- Memory-bound inference (limited by VRAM)
- INT4 quantization: 13GB → 3.5GB (fits in consumer GPUs)
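The memory arithmetic behind these numbers is straightforward (weights only, ignoring KV cache and scale overhead). A quick sketch:
```python
def llm_weight_memory_gb(num_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory only (excludes KV cache and activation overhead)."""
    return num_params_billion * 1e9 * bits_per_weight / 8 / (1024 ** 3)

print(f"7B @ FP16: ~{llm_weight_memory_gb(7, 16):.1f} GB")  # ≈ 13 GB
print(f"7B @ INT4: ~{llm_weight_memory_gb(7, 4):.1f} GB")   # ≈ 3.3 GB (+ scales ≈ 3.5 GB)
```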
**LLM-specific quantization methods:**
- **GPTQ:** Post-training quantization optimized for LLMs
- **AWQ:** Activation-aware weight quantization (better quality than GPTQ)
- **Both:** Achieve INT4 with <0.5 perplexity increase
### When to Use LLM Quantization
**Use when:**
- Deploying LLMs locally (consumer GPUs)
- Memory-constrained (need to fit in 12GB/24GB VRAM)
- Cost-sensitive (smaller models cheaper to host)
- Latency-sensitive (smaller models faster to load)
**Don't use when:**
- Have sufficient GPU memory for FP16
- Accuracy critical (medical, legal applications)
- Already using API (OpenAI, Anthropic) - they handle optimization
### LLM Quantization References
For detailed LLM quantization:
- **See skill:** `llm-inference-optimization` (llm-specialist pack)
- **Covers:** GPTQ, AWQ, KV cache optimization, token streaming
- **Tools:** llama.cpp, vLLM, text-generation-inference
**Quick reference (defer to llm-specialist for details):**
```python
# GPTQ quantization (example - see llm-specialist for full details)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# WHY: GPTQ optimizes layer-wise for minimal perplexity increase
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)
# Result: 13GB → 3.5GB, <0.5 perplexity increase
```
## Part 8: When NOT to Quantize
### Scenario 1: Already Fast Enough
**Example:** MobileNetV2 (14MB, 3ms CPU latency)
- Quantization: 14MB → 4MB, 3ms → 2ms
- **Benefit:** 10MB saved, 1ms faster
- **Cost:** Calibration, validation, testing, debugging
- **Decision:** Not worth effort unless specific requirement
**Rule:** If current performance meets requirements, don't optimize.
### Scenario 2: GPU-Only Deployment with No Memory Constraints
**Example:** ResNet50 on Tesla V100 with 32GB VRAM
- Quantization: 1.5-2× GPU speedup (modest)
- FP32 already fast on GPU (Tensor Cores optimized)
- No memory pressure (plenty of VRAM)
- **Decision:** Focus on other bottlenecks (data loading, I/O)
**Rule:** Quantization is most beneficial for CPU inference and memory-constrained GPU.
### Scenario 3: Accuracy-Critical Applications
**Example:** Medical diagnosis model where misdiagnosis has severe consequences
- Quantization introduces accuracy loss (even if small)
- Risk not worth benefit
- **Decision:** Keep FP32, optimize other parts (batching, caching)
**Rule:** Safety-critical systems should avoid lossy compression unless thoroughly validated.
### Scenario 4: Prototyping Phase
**Example:** Early development, trying different architectures
- Quantization is optimization - premature at prototype stage
- Focus on getting model working first
- **Decision:** Defer quantization until production deployment
**Rule:** Don't optimize until you need to (Knuth: "Premature optimization is the root of all evil").
## Part 9: Quantization Benchmarks (Expected Results)
### Image Classification (ResNet50, ImageNet)
| Metric | FP32 Baseline | Dynamic INT8 | Static INT8 | QAT INT8 |
|--------|---------------|--------------|-------------|----------|
| Size | 98MB | 25MB (4×) | 25MB (4×) | 25MB (4×) |
| CPU Latency | 15ms | 12ms (1.25×) | 4ms (3.75×) | 4ms (3.75×) |
| Top-1 Accuracy | 76.1% | 75.9% (0.2% loss) | 75.3% (0.8% loss) | 75.9% (0.2% loss) |
**Insight:** Static quantization gives 3.75× speedup with acceptable 0.8% accuracy loss.
### Object Detection (YOLOv5s, COCO)
| Metric | FP32 Baseline | Static INT8 | QAT INT8 |
|--------|---------------|-------------|----------|
| Size | 14MB | 4MB (3.5×) | 4MB (3.5×) |
| CPU Latency | 45ms | 15ms (3×) | 15ms (3×) |
| mAP@0.5 | 37.4% | 36.8% (0.6% loss) | 37.2% (0.2% loss) |
**Insight:** QAT gives better accuracy (0.2% vs 0.6% loss) with same speedup.
### NLP Classification (BERT-base, GLUE)
| Metric | FP32 Baseline | Dynamic INT8 | Static INT8 |
|--------|---------------|--------------|-------------|
| Size | 440MB | 110MB (4×) | 110MB (4×) |
| CPU Latency | 35ms | 28ms (1.25×) | 12ms (2.9×) |
| Accuracy | 93.5% | 93.2% (0.3% loss) | 92.8% (0.7% loss) |
**Insight:** Static quantization gives 2.9× speedup but dynamic sufficient if speedup not critical.
### LLM Inference (LLaMA-7B)
| Metric | FP16 Baseline | GPTQ INT4 | AWQ INT4 |
|--------|---------------|-----------|----------|
| Size | 13GB | 3.5GB (3.7×) | 3.5GB (3.7×) |
| First Token Latency | 800ms | 250ms (3.2×) | 230ms (3.5×) |
| Perplexity | 5.68 | 5.82 (0.14 increase) | 5.77 (0.09 increase) |
**Insight:** AWQ gives better quality than GPTQ with similar speedup.
## Part 10: Common Pitfalls and Solutions
### Pitfall 1: Skipping Accuracy Validation
**Issue:** Deploy quantized model without measuring accuracy impact.
**Risk:** Discover accuracy degradation in production (too late).
**Solution:** Always validate accuracy on representative data before deployment.
```python
# ❌ WRONG: Deploy without validation
quantized_model = quantize(model)
deploy(quantized_model)  # Hope it works!

# ✅ RIGHT: Validate before deployment
quantized_model = quantize(model)
results = validate_accuracy(original_model, quantized_model, val_loader)
if results['acceptable']:
    deploy(quantized_model)
else:
    print("Accuracy loss too large - try QAT")
```
### Pitfall 2: Using Wrong Calibration Data
**Issue:** Calibrate with random/unrepresentative data.
**Risk:** Activation ranges wrong → accuracy collapses.
**Solution:** Use 100-1000 samples from validation set matching deployment distribution.
```python
# ❌ WRONG: Random images from internet
calibration_data = download_random_images()
# ✅ RIGHT: Samples from validation set
calibration_data = torch.utils.data.Subset(val_dataset, range(1000))
```
### Pitfall 3: Choosing Wrong Quantization Type
**Issue:** Use dynamic quantization when you need static-level speedup.
**Risk:** Get 1.2× speedup instead of 3× speedup.
**Solution:** Match quantization type to requirements (dynamic for size, static for speed).
```python
# ❌ WRONG: Use dynamic when need speed
if need_fast_cpu_inference:
    quantized_model = torch.quantization.quantize_dynamic(model)  # Only 1.2× speedup

# ✅ RIGHT: Use static for speed
if need_fast_cpu_inference:
    model = prepare_and_calibrate(model, calibration_data)
    quantized_model = torch.quantization.convert(model)  # 2-4× speedup
```
### Pitfall 4: Quantizing GPU-Only Deployments
**Issue:** Quantize model for GPU inference without memory pressure.
**Risk:** Effort not worth modest 1.5-2× GPU speedup.
**Solution:** Only quantize GPU if memory-constrained (multiple models in VRAM).
```python
# ❌ WRONG: Quantize for GPU with no memory issue
if deployment_target == "GPU" and have_plenty_of_memory:
    quantized_model = quantize(model)  # Wasted effort

# ✅ RIGHT: Skip quantization if not needed
if deployment_target == "GPU" and have_plenty_of_memory:
    deploy(model)  # Keep FP32, focus on other optimizations
```
### Pitfall 5: Over-Quantizing (INT4 When INT8 Sufficient)
**Issue:** Use aggressive INT4 quantization when INT8 would suffice.
**Risk:** Larger accuracy loss than necessary.
**Solution:** Start with INT8 (standard), only use INT4 if extreme memory constraints.
```python
# ❌ WRONG: Jump to INT4 without trying INT8
quantized_model = quantize(model, precision="INT4")  # 2-3% accuracy loss

# ✅ RIGHT: Start with INT8, only use INT4 if needed
quantized_model_int8 = quantize(model, precision="INT8")  # 0.5-1% accuracy loss
if model_still_too_large:
    quantized_model_int4 = quantize(model, precision="INT4")
```
### Pitfall 6: Assuming All Layers Quantize Equally
**Issue:** Quantize all layers uniformly, but some layers more sensitive.
**Risk:** Accuracy loss dominated by few sensitive layers.
**Solution:** Use mixed precision - keep sensitive layers in FP32/INT8, quantize others to INT4.
```python
# ✅ ADVANCED: Mixed precision quantization (FX graph mode)
# Keep first/last layers in higher precision, quantize middle layers aggressively
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = QConfigMapping()
qconfig_mapping.set_global(get_default_qconfig('fbgemm'))  # INT8 default
qconfig_mapping.set_module_name('model.layer1', None)      # Keep first layer FP32
qconfig_mapping.set_module_name('model.layer10', None)     # Keep last layer FP32

example_inputs = (torch.randn(1, 3, 224, 224),)  # adjust shape to your model
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
# ... run calibration data through `prepared` here ...
model = convert_fx(prepared)
```
## Part 11: Decision Framework Summary
### Step 1: Recognize Quantization Need
**Symptoms:**
- Model too slow on CPU (>10ms when you need <5ms)
- Model too large for edge devices (>50MB)
- Deploying to CPU/edge (not GPU)
- Need to reduce hosting costs
**If YES → Proceed to Step 2**
**If NO → Don't quantize, focus on other optimizations**
### Step 2: Choose Quantization Type
```
Start with Dynamic:
├─ Sufficient? (meets latency/size requirements)
│ ├─ YES → Deploy dynamic quantized model
│ └─ NO → Proceed to Static
Static Quantization:
├─ Sufficient? (meets latency/size + accuracy acceptable)
│ ├─ YES → Deploy static quantized model
│ └─ NO → Accuracy loss >2%
│ │
│ └─ Proceed to QAT
QAT:
├─ Train with quantization awareness
└─ Achieves <1% accuracy loss → Deploy
```
### Step 3: Calibrate (if Static/QAT)
**Calibration data:**
- Source: Validation set (representative samples)
- Size: 100-1000 samples
- Characteristics: Match deployment distribution
**Calibration process:**
1. Select samples from validation set
2. Run through model to collect activation ranges
3. Validate accuracy after calibration
4. If accuracy loss >2%, try different calibration data or QAT
### Step 4: Validate Accuracy
**Required measurements:**
- Baseline accuracy (FP32)
- Quantized accuracy (INT8/INT4)
- Accuracy loss (baseline - quantized)
- Acceptable threshold (typically <2%)
**Decision:**
- If accuracy loss <2% → Deploy
- If accuracy loss >2% → Try QAT or reconsider quantization
### Step 5: Benchmark Performance
**Required measurements:**
- Model size (MB): baseline vs quantized
- Inference latency (ms): baseline vs quantized
- Throughput (requests/sec): baseline vs quantized
**Verify expected results** (see the measurement sketch below):
- Size: 4× reduction (FP32 → INT8)
- CPU speedup: 2-4× (static quantization)
- GPU speedup: 1.5-2× (if applicable)
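A minimal measurement sketch (the `benchmark_deployment` helper is defined here for illustration, not a library function; file names follow the earlier examples):
```python
import os
import time

import torch

def benchmark_deployment(model, example_input, model_path, seconds=5.0):
    """Collect size (MB), mean latency (ms), and throughput (inferences/sec)."""
    size_mb = os.path.getsize(model_path) / (1024 ** 2)

    model.eval()
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(example_input)
        count, start = 0, time.time()
        while time.time() - start < seconds:   # measure over a fixed wall-clock window
            model(example_input)
            count += 1
    elapsed = time.time() - start
    return {
        "size_mb": size_mb,
        "latency_ms": elapsed / count * 1000,
        "throughput_per_sec": count / elapsed,
    }

# Usage sketch: run once for FP32 and once for the quantized model, then compare
# stats_fp32 = benchmark_deployment(original_model, torch.randn(1, 3, 224, 224), 'model.pth')
# stats_int8 = benchmark_deployment(quantized_model, torch.randn(1, 3, 224, 224), 'model_quantized_static.pth')
```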
## Part 12: Production Deployment Checklist
Before deploying quantized model to production:
**✅ Accuracy Validated**
- [ ] Baseline accuracy measured on validation set
- [ ] Quantized accuracy measured on same validation set
- [ ] Accuracy loss within acceptable threshold (<2%)
- [ ] Validated on representative production data
**✅ Performance Benchmarked**
- [ ] Size reduction measured (expect 4× for INT8)
- [ ] Latency improvement measured (expect 2-4× CPU)
- [ ] Throughput improvement measured
- [ ] Performance meets requirements
**✅ Calibration Verified** (if static/QAT)
- [ ] Used representative samples from validation set (not random data)
- [ ] Used sufficient calibration data (100-1000 samples)
- [ ] Calibration data matches deployment distribution
**✅ Edge Cases Tested**
- [ ] Tested on diverse inputs (bright/dark images, long/short text)
- [ ] Validated numerical stability (no NaN/Inf outputs; see the sketch after this checklist)
- [ ] Tested inference on target hardware (CPU/GPU/edge device)
**✅ Rollback Plan**
- [ ] Can easily revert to FP32 model if issues found
- [ ] Monitoring in place to detect accuracy degradation
- [ ] A/B testing plan to compare FP32 vs quantized
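For the numerical-stability item above, a minimal sanity-check sketch (loader and model names are placeholders):
```python
import torch

def check_numerical_stability(model, data_loader, max_batches=20):
    """Flag NaN/Inf in quantized model outputs on a few representative batches."""
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(data_loader):
            output = model(data)
            if not torch.isfinite(output).all():
                return False  # NaN or Inf detected - investigate before deploying
            if batch_idx + 1 >= max_batches:
                break
    return True

# Usage sketch: assert check_numerical_stability(quantized_model, val_loader)
```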
## Skill Mastery Checklist
You have mastered quantization for inference when you can:
- [ ] Recognize when quantization is appropriate (CPU/edge deployment, size/speed issues)
- [ ] Choose correct quantization type (dynamic vs static vs QAT) based on requirements
- [ ] Implement dynamic quantization in PyTorch (5 lines of code)
- [ ] Implement static quantization with proper calibration (20 lines of code)
- [ ] Select appropriate calibration data (validation set, 100-1000 samples)
- [ ] Validate accuracy trade-offs systematically (baseline vs quantized)
- [ ] Benchmark performance improvements (size, latency, throughput)
- [ ] Decide when NOT to quantize (GPU-only, already fast, accuracy-critical)
- [ ] Debug quantization issues (accuracy collapse, wrong speedup, numerical instability)
- [ ] Deploy quantized models to production with confidence
**Key insight:** Quantization is not magic - it's a systematic trade-off of precision for performance. The skill is matching the right quantization approach to your specific requirements.