# Quantization for Inference Skill
## When to Use This Skill
Use this skill when you observe these symptoms:
**Performance Symptoms:**
- Model inference too slow on CPU (e.g., >10ms when you need <5ms)
- Batch processing taking too long (low throughput)
- Need to serve more requests per second with same hardware
**Size Symptoms:**
- Model too large for edge devices (e.g., 100MB+ for mobile)
- Want to fit more models in GPU memory
- Memory-constrained deployment environment
**Deployment Symptoms:**
- Deploying to CPU servers (quantization gives 2-4× CPU speedup)
- Deploying to edge devices (mobile, IoT, embedded systems)
- Cost-sensitive deployment (smaller models = lower hosting costs)
**When NOT to use this skill:**
- Model already fast enough and small enough (no problem to solve)
- Deploying exclusively on GPU with no memory constraints (modest benefit)
- Prototyping phase where optimization is premature
- Model so small that quantization overhead not worth it (e.g., <5MB)
## Core Principle
**Quantization trades precision for performance.**
Quantization converts high-precision numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits or INT4: 4 bits). This provides:
- **4-8× smaller model size** (fewer bits per parameter)
- **2-4× faster inference on CPU** (INT8 operations faster than FP32)
- **Small accuracy loss** (typically 0.5-1% for INT8)
**Formula:** Lower precision (FP32 → INT8 → INT4) = Smaller size + Faster inference + More accuracy loss
The skill is choosing the **right precision for your accuracy tolerance**.
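To make the trade-off concrete, here is a minimal NumPy sketch of the affine mapping behind INT8 quantization (illustrative only, not a framework API): quantize a tensor to 8-bit with a scale and zero point, dequantize it, and inspect the rounding error and size.

```python
import numpy as np

# Minimal affine quantization round trip (sketch, not a library API).
x = np.random.randn(1000).astype(np.float32)  # FP32 tensor
x_min, x_max = x.min(), x.max()

scale = (x_max - x_min) / 255.0                       # map the FP32 range onto 256 levels
zero_point = int(round(-x_min / scale))               # which level represents 0.0

x_int8 = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
x_dequant = (x_int8.astype(np.float32) - zero_point) * scale

print(f"Max rounding error: {np.abs(x - x_dequant).max():.5f}")
print(f"Bytes: FP32={x.nbytes}, INT8={x_int8.nbytes} (4x smaller)")
```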
## Quantization Framework
```
┌──────────────────────────────────────────────┐
│ 1. Recognize Quantization Need               │
│    CPU/Edge + (Slow OR Large)                │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 2. Choose Quantization Type                  │
│    Dynamic → Static → QAT (increasing cost)  │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 3. Calibrate (if Static/QAT)                 │
│    100-1000 representative samples           │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 4. Validate Accuracy Trade-offs              │
│    Baseline vs Quantized accuracy            │
└──────────────┬───────────────────────────────┘
               ▼
┌──────────────────────────────────────────────┐
│ 5. Decide: Accept or Iterate                 │
│    <2% loss → Deploy                         │
│    >2% loss → Try QAT or different precision │
└──────────────────────────────────────────────┘
```
## Part 1: Quantization Types
### Type 1: Dynamic Quantization
**What it does:** Quantizes weights to INT8, keeps activations in FP32.
**When to use:**
- Simplest quantization (no calibration needed)
- Primary goal is size reduction
- Batch processing where latency less critical
- Quick experiment to see if quantization helps
**Benefits:**
- ✅ 4× size reduction (weights dominate model size, and they are what gets quantized)
- ✅ 1.2-1.5× CPU speedup (modest, because activations still FP32)
- ✅ Minimal accuracy loss (~0.2-0.5%)
- ✅ No calibration data needed
**Limitations:**
- ⚠️ Limited CPU speedup (activations still FP32)
- ⚠️ Not optimal for edge devices needing maximum performance
**PyTorch implementation:**
```python
import os

import torch
import torch.quantization

# WHY: Dynamic quantization is simplest - just one function call
# No calibration data needed because activations stay FP32
model = torch.load('model.pth')
model.eval()  # WHY: Must be in eval mode (no batchnorm updates)

# WHY: Specify which layers to quantize (Linear, LSTM, etc.)
# These layers benefit most from quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},  # WHY: Quantize Linear layers only
    dtype=torch.qint8                # WHY: INT8 is standard precision
)

# Save quantized model
torch.save(quantized_model.state_dict(), 'model_quantized_dynamic.pth')

# Verify size reduction
original_size = os.path.getsize('model.pth') / (1024 ** 2)  # MB
quantized_size = os.path.getsize('model_quantized_dynamic.pth') / (1024 ** 2)
print(f"Original: {original_size:.1f}MB → Quantized: {quantized_size:.1f}MB")
print(f"Size reduction: {original_size / quantized_size:.1f}×")
```
**Example use case:** BERT classification model where primary goal is reducing size from 440MB to 110MB for easier deployment.
### Type 2: Static Quantization (Post-Training Quantization)
**What it does:** Quantizes both weights and activations to INT8.
**When to use:**
- Need maximum CPU speedup (2-4×)
- Deploying to CPU servers or edge devices
- Can afford calibration step (5-10 minutes)
- Primary goal is inference speed
**Benefits:**
- ✅ 4× size reduction (same as dynamic)
- ✅ 2-4× CPU speedup (both weights and activations INT8)
- ✅ No retraining required (post-training)
- ✅ Acceptable accuracy loss (~0.5-1%)
**Requirements:**
- ⚠️ Needs calibration data (100-1000 samples from validation set)
- ⚠️ Slightly more complex setup than dynamic
**PyTorch implementation:**
```python
import time

import torch
import torch.quantization


def calibrate_model(model, calibration_loader):
    """
    Calibrate model by running representative data through it.

    WHY: Static quantization needs to know activation ranges.
    Calibration finds min/max values for each activation layer.

    Args:
        model: Model in eval mode, prepared with observers
        calibration_loader: DataLoader with 100-1000 samples
    """
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(calibration_loader):
            model(data)
            if batch_idx >= 100:  # WHY: 100 batches usually sufficient
                break
    return model


# Step 1: Prepare model for quantization
# NOTE: The model definition must include QuantStub/DeQuantStub at its
# input/output boundaries so PyTorch knows where to convert FP32 ↔ INT8
model = torch.load('model.pth')
model.eval()

# WHY: Attach a quantization config and insert observers that will
# record activation ranges during calibration
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Step 2: Calibrate with representative data
# WHY: Must use data from training/validation set, not random data
# Calibration finds activation ranges - needs real distribution
calibration_dataset = torch.utils.data.Subset(
    val_dataset,
    indices=range(1000)  # WHY: 1000 samples sufficient for most models
)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)
model = calibrate_model(model, calibration_loader)

# Step 3: Convert to quantized model
torch.quantization.convert(model, inplace=True)

# Save quantized model
torch.save(model.state_dict(), 'model_quantized_static.pth')


# Benchmark speed improvement
def benchmark(model, data, num_iterations=100):
    """WHY: Warm up model first, then measure average latency."""
    model.eval()
    with torch.no_grad():
        # Warm up (first few iterations slower)
        for _ in range(10):
            model(data)
        start = time.time()
        for _ in range(num_iterations):
            model(data)
        end = time.time()
    return (end - start) / num_iterations * 1000  # ms per inference


original_model = torch.load('model.pth')  # WHY: Fresh FP32 copy for the baseline
original_model.eval()

test_data = torch.randn(1, 3, 224, 224)  # Example input
baseline_latency = benchmark(original_model, test_data)
quantized_latency = benchmark(model, test_data)
print(f"Baseline: {baseline_latency:.2f}ms")
print(f"Quantized: {quantized_latency:.2f}ms")
print(f"Speedup: {baseline_latency / quantized_latency:.2f}×")
```
**Example use case:** ResNet50 image classifier for CPU inference - need <5ms latency, achieve 4ms with static quantization (vs 15ms baseline).
### Type 3: Quantization-Aware Training (QAT)
**What it does:** Simulates quantization during training to minimize accuracy loss.
**When to use:**
- Static quantization accuracy loss too large (>2%)
- Need best possible accuracy with INT8
- Can afford retraining (hours to days)
- Critical production system with strict accuracy requirements
**Benefits:**
- ✅ Best accuracy (~0.1-0.3% loss vs 0.5-1% for static)
- ✅ 4× size reduction (same as dynamic/static)
- ✅ 2-4× CPU speedup (same as static)
**Limitations:**
- ⚠️ Requires retraining (most expensive option)
- ⚠️ Takes hours to days depending on model size
- ⚠️ More complex implementation
**PyTorch implementation:**
```python
import torch
import torch.quantization


def train_one_epoch_qat(model, train_loader, optimizer, criterion):
    """
    Train one epoch with quantization-aware training.

    WHY: QAT inserts fake quantization ops during training.
    Model learns to be robust to quantization errors.
    """
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    return model


# Step 1: Prepare model for QAT
model = torch.load('model.pth')
model.train()

# WHY: QAT config includes fake quantization ops
# These simulate quantization during forward pass
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Step 2: Train with quantization-aware training
# WHY: Model learns to compensate for quantization errors
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)  # WHY: Low LR for fine-tuning
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 5  # WHY: Usually 5-10 epochs sufficient for QAT fine-tuning
for epoch in range(num_epochs):
    model = train_one_epoch_qat(model, train_loader, optimizer, criterion)
    print(f"Epoch {epoch+1}/{num_epochs} complete")

# Step 3: Convert to quantized model
model.eval()
torch.quantization.convert(model, inplace=True)

# Save QAT quantized model
torch.save(model.state_dict(), 'model_quantized_qat.pth')
```
**Example use case:** Medical imaging model where accuracy is critical - static quantization gives 2% accuracy loss, QAT reduces to 0.3%.
## Part 2: Quantization Type Decision Matrix
| Type | Complexity | Calibration | Retraining | Size Reduction | CPU Speedup | Accuracy Loss |
|------|-----------|-------------|------------|----------------|-------------|---------------|
| **Dynamic** | Low | No | No | 4× | 1.2-1.5× | ~0.2-0.5% |
| **Static** | Medium | Yes | No | 4× | 2-4× | ~0.5-1% |
| **QAT** | High | Yes | Yes | 4× | 2-4× | ~0.1-0.3% |
**Decision flow:**
1. Start with **dynamic quantization**: Simplest, verify quantization helps
2. Upgrade to **static quantization**: If need more speedup, can afford calibration
3. Use **QAT**: Only if accuracy loss from static too large (rare)
**Why this order?** Incremental cost. Dynamic is free (5 minutes), static is cheap (15 minutes), QAT is expensive (hours/days). Don't pay for QAT unless you need it.
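As code, the escalation looks roughly like the sketch below, where `apply_dynamic`, `apply_static`, `apply_qat`, and `meets_requirements` are hypothetical wrappers around the Part 1 recipes and your own latency/size/accuracy checks.

```python
# Hypothetical helpers: apply_dynamic, apply_static, apply_qat wrap the
# Part 1 recipes; meets_requirements checks latency, size, and accuracy.
def quantize_incrementally(model, calibration_loader, train_loader, val_loader, requirements):
    candidate = apply_dynamic(model)                      # ~5 minutes, no calibration
    if meets_requirements(candidate, val_loader, requirements):
        return candidate

    candidate = apply_static(model, calibration_loader)   # ~15 minutes, needs calibration
    if meets_requirements(candidate, val_loader, requirements):
        return candidate

    # Only pay for QAT when static accuracy loss is still too large (>2%)
    return apply_qat(model, train_loader)
```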
## Part 3: Calibration Best Practices
### What is Calibration?
**Purpose:** Find min/max ranges for each activation layer.
**Why needed:** Static quantization needs to know activation ranges to map FP32 → INT8. Without calibration, ranges are wrong → accuracy collapses.
**How it works:**
1. Run representative data through model
2. Record min/max activation values per layer
3. Use these ranges to quantize activations at inference time
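Conceptually, step 2 amounts to the sketch below for a single activation tensor; PyTorch's observers do this per layer, with more options (histograms, moving averages).

```python
import torch

# Conceptual sketch of a calibration observer for one activation tensor.
class MinMaxTracker:
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def update(self, activation: torch.Tensor):
        # Track the running min/max seen over calibration batches
        self.min_val = min(self.min_val, activation.min().item())
        self.max_val = max(self.max_val, activation.max().item())

    def quant_params(self):
        # Map [min_val, max_val] onto the 256 levels of unsigned INT8
        scale = (self.max_val - self.min_val) / 255.0
        zero_point = int(round(-self.min_val / scale))
        return scale, zero_point


tracker = MinMaxTracker()
for _ in range(100):                      # stand-in for calibration batches
    tracker.update(torch.randn(32, 128))  # stand-in for one layer's activations
print(tracker.quant_params())
```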
### Calibration Data Requirements
**Data source:**
- ✅ **Use validation set samples** (matches training distribution)
- ❌ Don't use random images from internet (different distribution)
- ❌ Don't use single image repeated (insufficient coverage)
- ❌ Don't use training set that doesn't match deployment (distribution shift)
**Data size:**
- **Minimum:** 100 samples (sufficient for simple models)
- **Recommended:** 500-1000 samples (better coverage)
- **Maximum:** Full validation set is overkill (slow, no benefit)
**Data characteristics:**
- Must cover range of inputs model sees in production
- Include edge cases (bright/dark images, long/short text)
- Distribution should match deployment, not just training
- Class balance less important than input diversity
**Example calibration data selection:**
```python
import numpy as np
import torch


def select_calibration_data(val_dataset, num_samples=1000):
    """
    Select diverse calibration samples from validation set.

    WHY: Want samples that cover range of activation values.
    Random selection from validation set usually sufficient.

    Args:
        val_dataset: Full validation dataset
        num_samples: Number of calibration samples (default 1000)

    Returns:
        Calibration dataset subset
    """
    # WHY: Random selection ensures diversity
    # Stratified sampling can help ensure class coverage
    indices = np.random.choice(len(val_dataset), num_samples, replace=False)
    calibration_dataset = torch.utils.data.Subset(val_dataset, indices)
    return calibration_dataset


# Example: Select 1000 random samples from validation set
calibration_dataset = select_calibration_data(val_dataset, num_samples=1000)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)
```
### Common Calibration Pitfalls
**Pitfall 1: Using wrong data distribution**
- ❌ "Random images from internet" for ImageNet-trained model
- ✅ Use ImageNet validation set samples
**Pitfall 2: Too few samples**
- ❌ 10 samples (insufficient coverage of activation ranges)
- ✅ 100-1000 samples (good coverage)
**Pitfall 3: Using training data that doesn't match deployment**
- ❌ Calibrate on sunny outdoor images, deploy on indoor images
- ✅ Calibrate on data matching deployment distribution
**Pitfall 4: Skipping calibration validation**
- ❌ Calibrate once, assume it works
- ✅ Validate accuracy after calibration to verify ranges are good
## Part 4: Precision Selection (INT8 vs INT4 vs FP16)
### Precision Spectrum
| Precision | Bits | Size vs FP32 | Speedup (CPU) | Typical Accuracy Loss |
|-----------|------|--------------|---------------|----------------------|
| **FP32** | 32 | 1× | 1× | 0% (baseline) |
| **FP16** | 16 | 2× | 1.5× | <0.1% |
| **INT8** | 8 | 4× | 2-4× | 0.5-1% |
| **INT4** | 4 | 8× | 4-8× | 1-3% |
**Trade-off:** Lower precision = Smaller size + Faster inference + More accuracy loss
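The size column follows from simple arithmetic: parameter count × bytes per parameter. A quick sketch (the ResNet50 parameter count is approximate):

```python
# Rough size arithmetic behind the table above: parameters × bits / 8.
def model_size_mb(num_params: int, bits: int) -> float:
    return num_params * bits / 8 / 1024 ** 2

resnet50_params = 25_600_000  # ~25.6M parameters (approximate)
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_mb(resnet50_params, bits):.0f} MB")
# FP32 ≈ 98 MB, FP16 ≈ 49 MB, INT8 ≈ 24 MB, INT4 ≈ 12 MB
```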
### When to Use Each Precision
**FP16 (Half Precision):**
- GPU inference (Tensor Cores optimized for FP16)
- Need minimal accuracy loss (<0.1%)
- Size reduction secondary concern
- **Example:** Large language models on GPU
**INT8 (Standard Quantization):**
- CPU inference (INT8 operations fast on CPU)
- Edge device deployment
- Good balance of size/speed/accuracy
- **Most common choice** for production deployment
- **Example:** Image classification on mobile devices
**INT4 (Aggressive Quantization):**
- Extremely memory-constrained (e.g., 1GB mobile devices)
- Can tolerate larger accuracy loss (1-3%)
- Need maximum size reduction (8×)
- **Use sparingly** - accuracy risk high
- **Example:** Large language models (LLaMA-7B: 13GB → 3.5GB)
### Decision Flow
```python
def choose_precision(accuracy_tolerance, deployment_target):
    """
    Choose quantization precision based on requirements.

    WHY: Different precisions for different constraints.
    INT8 is default, FP16 for GPU, INT4 for extreme memory constraints.
    """
    if accuracy_tolerance < 0.1:
        return "FP16"  # Minimal accuracy loss required
    elif deployment_target == "GPU":
        return "FP16"  # GPU optimized for FP16
    elif deployment_target in ["CPU", "edge"]:
        return "INT8"  # CPU optimized for INT8
    elif deployment_target == "extreme_edge" and accuracy_tolerance > 1:
        return "INT4"  # Only if can tolerate 1-3% loss
    else:
        return "INT8"  # Default safe choice
```
## Part 5: ONNX Quantization (Cross-Framework)
**When to use:** Deploying to ONNX Runtime (CPU/edge devices) or need cross-framework compatibility.
### ONNX Static Quantization
```python
import time

import numpy as np
import onnxruntime
import torch
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    quantize_static,
)


class CalibrationDataReaderWrapper(CalibrationDataReader):
    """
    WHY: ONNX requires custom calibration data reader.
    This class feeds calibration data to ONNX quantization engine.
    """

    def __init__(self, calibration_data):
        self.calibration_data = calibration_data
        self.iterator = iter(calibration_data)

    def get_next(self):
        """WHY: Called by ONNX to get next calibration batch."""
        try:
            data, _ = next(self.iterator)
            return {"input": data.numpy()}  # WHY: Return dict of input name → data
        except StopIteration:
            return None


# Step 1: Export PyTorch model to ONNX
model = torch.load('model.pth')
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    opset_version=13  # WHY: ONNX opset 13+ supports quantization ops
)

# Step 2: Prepare calibration data
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=1,  # WHY: ONNX calibration uses batch size 1
    shuffle=False
)
calibration_reader = CalibrationDataReaderWrapper(calibration_loader)

# Step 3: Quantize ONNX model
quantize_static(
    'model.onnx',
    'model_quantized.onnx',
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ  # WHY: QDQ format compatible with most backends
)

# Step 4: Benchmark ONNX quantized model
session = onnxruntime.InferenceSession('model_quantized.onnx')
input_name = session.get_inputs()[0].name
test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warm up
for _ in range(10):
    session.run(None, {input_name: test_data})

# Benchmark
start = time.time()
for _ in range(100):
    session.run(None, {input_name: test_data})
end = time.time()

latency = (end - start) / 100 * 1000  # ms per inference
print(f"ONNX Quantized latency: {latency:.2f}ms")
```
**ONNX advantages:**
- Cross-framework (works with PyTorch, TensorFlow, etc.)
- Optimized ONNX Runtime for CPU inference
- Good hardware backend support (x86, ARM)
## Part 6: Accuracy Validation (Critical Step)
### Why Accuracy Validation Matters
Quantization is **lossy compression**. Must measure accuracy impact:
- Some models tolerate quantization well (<0.5% loss)
- Some models sensitive to quantization (>2% loss)
- Some layers more sensitive than others
- **Can't assume quantization is safe without measuring**
### Validation Methodology
```python
import torch


def validate_quantization(original_model, quantized_model, val_loader):
    """
    Validate quantization by comparing accuracy.

    WHY: Quantization is lossy - must measure impact.
    Compare baseline vs quantized on same validation set.

    Returns:
        dict with baseline_acc, quantized_acc, accuracy_loss
    """
    def evaluate(model, data_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in data_loader:
                output = model(data)
                pred = output.argmax(dim=1)
                correct += (pred == target).sum().item()
                total += target.size(0)
        return 100.0 * correct / total

    baseline_acc = evaluate(original_model, val_loader)
    quantized_acc = evaluate(quantized_model, val_loader)
    accuracy_loss = baseline_acc - quantized_acc

    return {
        'baseline_acc': baseline_acc,
        'quantized_acc': quantized_acc,
        'accuracy_loss': accuracy_loss,
        'acceptable': accuracy_loss < 2.0  # WHY: <2% loss usually acceptable
    }


# Example validation
results = validate_quantization(original_model, quantized_model, val_loader)
print(f"Baseline accuracy: {results['baseline_acc']:.2f}%")
print(f"Quantized accuracy: {results['quantized_acc']:.2f}%")
print(f"Accuracy loss: {results['accuracy_loss']:.2f}%")
print(f"Acceptable: {results['acceptable']}")

# Decision logic
if results['acceptable']:
    print("✅ Quantization acceptable - deploy quantized model")
else:
    print("❌ Accuracy loss too large - try QAT or reconsider quantization")
```
### Acceptable Accuracy Thresholds
**General guidelines:**
- **<1% loss:** Excellent quantization result
- **1-2% loss:** Acceptable for most applications
- **2-3% loss:** Consider QAT to reduce loss
- **>3% loss:** Quantization may not be suitable for this model
**Task-specific thresholds:**
- Image classification: 1-2% top-1 accuracy loss acceptable
- Object detection: 1-2% mAP loss acceptable
- NLP classification: 0.5-1% accuracy loss acceptable
- Medical/safety-critical: <0.5% loss required (use QAT)
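If you want these thresholds in code, a small lookup like the sketch below works; the values are the rules of thumb above (not universal constants) and the task keys are placeholders.

```python
# Encodes the guideline thresholds above; values are rules of thumb, not constants.
ACCEPTABLE_LOSS = {
    "image_classification": 2.0,   # top-1 accuracy points
    "object_detection": 2.0,       # mAP points
    "nlp_classification": 1.0,     # accuracy points
    "safety_critical": 0.5,        # use QAT if exceeded
}

def quantization_acceptable(task: str, accuracy_loss: float) -> bool:
    return accuracy_loss <= ACCEPTABLE_LOSS.get(task, 2.0)

print(quantization_acceptable("nlp_classification", 0.7))  # True
print(quantization_acceptable("safety_critical", 0.7))     # False
```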
## Part 7: LLM Quantization (GPTQ, AWQ)
**Note:** This skill covers general quantization. For LLM-specific optimization (GPTQ, AWQ, KV cache, etc.), see the `llm-inference-optimization` skill in the llm-specialist pack.
### LLM Quantization Overview
**Why LLMs need quantization:**
- Very large (7B parameters = 13GB in FP16)
- Memory-bound inference (limited by VRAM)
- INT4 quantization: 13GB → 3.5GB (fits in consumer GPUs)
**LLM-specific quantization methods:**
- **GPTQ:** Post-training quantization optimized for LLMs
- **AWQ:** Activation-aware weight quantization (better quality than GPTQ)
- **Both:** Achieve INT4 with <0.5 perplexity increase
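The memory figures above follow from weight-count arithmetic; the sketch below counts weights only (KV cache, activations, and quantization metadata add overhead on top).

```python
# Back-of-the-envelope weight memory for a 7B-parameter model.
params = 7_000_000_000
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1024 ** 3
    print(f"{name}: {gb:.1f} GB")
# FP16 ≈ 13.0 GB, INT8 ≈ 6.5 GB, INT4 ≈ 3.3 GB for the weights alone
```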
### When to Use LLM Quantization
**Use when:**
- Deploying LLMs locally (consumer GPUs)
- Memory-constrained (need to fit in 12GB/24GB VRAM)
- Cost-sensitive (smaller models cheaper to host)
- Latency-sensitive (smaller models faster to load)
**Don't use when:**
- Have sufficient GPU memory for FP16
- Accuracy critical (medical, legal applications)
- Already using API (OpenAI, Anthropic) - they handle optimization
### LLM Quantization References
For detailed LLM quantization:
- **See skill:** `llm-inference-optimization` (llm-specialist pack)
- **Covers:** GPTQ, AWQ, KV cache optimization, token streaming
- **Tools:** llama.cpp, vLLM, text-generation-inference
**Quick reference (defer to llm-specialist for details):**
```python
# GPTQ quantization (example - see llm-specialist for full details)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# WHY: GPTQ optimizes layer-wise for minimal perplexity increase
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)
# Result: 13GB → 3.5GB, <0.5 perplexity increase
```
## Part 8: When NOT to Quantize
### Scenario 1: Already Fast Enough
**Example:** MobileNetV2 (14MB, 3ms CPU latency)
- Quantization: 14MB → 4MB, 3ms → 2ms
- **Benefit:** 10MB saved, 1ms faster
- **Cost:** Calibration, validation, testing, debugging
- **Decision:** Not worth effort unless specific requirement
**Rule:** If current performance meets requirements, don't optimize.
### Scenario 2: GPU-Only Deployment with No Memory Constraints
**Example:** ResNet50 on Tesla V100 with 32GB VRAM
- Quantization: 1.5-2× GPU speedup (modest)
- FP32 already fast on GPU (Tensor Cores optimized)
- No memory pressure (plenty of VRAM)
- **Decision:** Focus on other bottlenecks (data loading, I/O)
**Rule:** Quantization is most beneficial for CPU inference and memory-constrained GPU.
### Scenario 3: Accuracy-Critical Applications
**Example:** Medical diagnosis model where misdiagnosis has severe consequences
- Quantization introduces accuracy loss (even if small)
- Risk not worth benefit
- **Decision:** Keep FP32, optimize other parts (batching, caching)
**Rule:** Safety-critical systems should avoid lossy compression unless thoroughly validated.
### Scenario 4: Prototyping Phase
**Example:** Early development, trying different architectures
- Quantization is optimization - premature at prototype stage
- Focus on getting model working first
- **Decision:** Defer quantization until production deployment
**Rule:** Don't optimize until you need to (Knuth: "Premature optimization is the root of all evil").
## Part 9: Quantization Benchmarks (Expected Results)
### Image Classification (ResNet50, ImageNet)
| Metric | FP32 Baseline | Dynamic INT8 | Static INT8 | QAT INT8 |
|--------|---------------|--------------|-------------|----------|
| Size | 98MB | 25MB (4×) | 25MB (4×) | 25MB (4×) |
| CPU Latency | 15ms | 12ms (1.25×) | 4ms (3.75×) | 4ms (3.75×) |
| Top-1 Accuracy | 76.1% | 75.9% (0.2% loss) | 75.3% (0.8% loss) | 75.9% (0.2% loss) |
**Insight:** Static quantization gives 3.75× speedup with acceptable 0.8% accuracy loss.
### Object Detection (YOLOv5s, COCO)
| Metric | FP32 Baseline | Static INT8 | QAT INT8 |
|--------|---------------|-------------|----------|
| Size | 14MB | 4MB (3.5×) | 4MB (3.5×) |
| CPU Latency | 45ms | 15ms (3×) | 15ms (3×) |
| mAP@0.5 | 37.4% | 36.8% (0.6% loss) | 37.2% (0.2% loss) |
**Insight:** QAT gives better accuracy (0.2% vs 0.6% loss) with same speedup.
### NLP Classification (BERT-base, GLUE)
| Metric | FP32 Baseline | Dynamic INT8 | Static INT8 |
|--------|---------------|--------------|-------------|
| Size | 440MB | 110MB (4×) | 110MB (4×) |
| CPU Latency | 35ms | 28ms (1.25×) | 12ms (2.9×) |
| Accuracy | 93.5% | 93.2% (0.3% loss) | 92.8% (0.7% loss) |
**Insight:** Static quantization gives 2.9× speedup but dynamic sufficient if speedup not critical.
### LLM Inference (LLaMA-7B)
| Metric | FP16 Baseline | GPTQ INT4 | AWQ INT4 |
|--------|---------------|-----------|----------|
| Size | 13GB | 3.5GB (3.7×) | 3.5GB (3.7×) |
| First Token Latency | 800ms | 250ms (3.2×) | 230ms (3.5×) |
| Perplexity | 5.68 | 5.82 (0.14 increase) | 5.77 (0.09 increase) |
**Insight:** AWQ gives better quality than GPTQ with similar speedup.
## Part 10: Common Pitfalls and Solutions
### Pitfall 1: Skipping Accuracy Validation
**Issue:** Deploy quantized model without measuring accuracy impact.
**Risk:** Discover accuracy degradation in production (too late).
**Solution:** Always validate accuracy on representative data before deployment.
```python
# ❌ WRONG: Deploy without validation
quantized_model = quantize(model)
deploy(quantized_model)  # Hope it works!

# ✅ RIGHT: Validate before deployment
quantized_model = quantize(model)
results = validate_accuracy(original_model, quantized_model, val_loader)
if results['acceptable']:
    deploy(quantized_model)
else:
    print("Accuracy loss too large - try QAT")
```
### Pitfall 2: Using Wrong Calibration Data
**Issue:** Calibrate with random/unrepresentative data.
**Risk:** Activation ranges wrong → accuracy collapses.
**Solution:** Use 100-1000 samples from validation set matching deployment distribution.
```python
# ❌ WRONG: Random images from internet
calibration_data = download_random_images()
# ✅ RIGHT: Samples from validation set
calibration_data = torch.utils.data.Subset(val_dataset, range(1000))
```
### Pitfall 3: Choosing Wrong Quantization Type
**Issue:** Use dynamic quantization when need static speedup.
**Risk:** Get 1.2× speedup instead of 3× speedup.
**Solution:** Match quantization type to requirements (dynamic for size, static for speed).
```python
# ❌ WRONG: Use dynamic when need speed
if need_fast_cpu_inference:
    quantized_model = torch.quantization.quantize_dynamic(model)  # Only 1.2× speedup

# ✅ RIGHT: Use static for speed
if need_fast_cpu_inference:
    model = prepare_and_calibrate(model, calibration_data)
    quantized_model = torch.quantization.convert(model)  # 2-4× speedup
```
### Pitfall 4: Quantizing GPU-Only Deployments
**Issue:** Quantize model for GPU inference without memory pressure.
**Risk:** Effort not worth modest 1.5-2× GPU speedup.
**Solution:** Only quantize GPU if memory-constrained (multiple models in VRAM).
```python
# ❌ WRONG: Quantize for GPU with no memory issue
if deployment_target == "GPU" and have_plenty_of_memory:
    quantized_model = quantize(model)  # Wasted effort

# ✅ RIGHT: Skip quantization if not needed
if deployment_target == "GPU" and have_plenty_of_memory:
    deploy(model)  # Keep FP32, focus on other optimizations
```
### Pitfall 5: Over-Quantizing (INT4 When INT8 Sufficient)
**Issue:** Use aggressive INT4 quantization when INT8 would suffice.
**Risk:** Larger accuracy loss than necessary.
**Solution:** Start with INT8 (standard), only use INT4 if extreme memory constraints.
```python
# ❌ WRONG: Jump to INT4 without trying INT8
quantized_model = quantize(model, precision="INT4")  # 2-3% accuracy loss

# ✅ RIGHT: Start with INT8, only use INT4 if needed
quantized_model_int8 = quantize(model, precision="INT8")  # 0.5-1% accuracy loss
if model_still_too_large:
    quantized_model_int4 = quantize(model, precision="INT4")
```
### Pitfall 6: Assuming All Layers Quantize Equally
**Issue:** Quantize all layers uniformly, but some layers more sensitive.
**Risk:** Accuracy loss dominated by few sensitive layers.
**Solution:** Use mixed precision - keep the most sensitive layers (often the first and last) in higher precision and quantize the rest.
```python
# ✅ ADVANCED: Mixed precision quantization
# Keep first/last layers in higher precision, quantize middle layers aggressively
from torch.ao.quantization import QConfigMapping, get_default_qconfig

qconfig_mapping = QConfigMapping()
qconfig_mapping.set_global(get_default_qconfig('fbgemm'))  # INT8 default
qconfig_mapping.set_module_name('model.layer1', None)      # Keep first layer FP32
qconfig_mapping.set_module_name('model.layer10', None)     # Keep last layer FP32

# quantize_with_qconfig is a placeholder for the FX graph-mode flow
# (prepare_fx → calibrate → convert_fx) that consumes this mapping
model = quantize_with_qconfig(model, qconfig_mapping)
```
## Part 11: Decision Framework Summary
### Step 1: Recognize Quantization Need
**Symptoms:**
- Model too slow on CPU (>10ms when you need <5ms)
- Model too large for edge devices (>50MB)
- Deploying to CPU/edge (not GPU)
- Need to reduce hosting costs
**If YES → Proceed to Step 2**
**If NO → Don't quantize, focus on other optimizations**
### Step 2: Choose Quantization Type
```
Start with Dynamic:
├─ Sufficient? (meets latency/size requirements)
│  ├─ YES → Deploy dynamic quantized model
│  └─ NO  → Proceed to Static

Static Quantization:
├─ Sufficient? (meets latency/size + accuracy acceptable)
│  ├─ YES → Deploy static quantized model
│  └─ NO (accuracy loss >2%) → Proceed to QAT

QAT:
├─ Train with quantization awareness
└─ Achieves <1% accuracy loss → Deploy
```
### Step 3: Calibrate (if Static/QAT)
**Calibration data:**
- Source: Validation set (representative samples)
- Size: 100-1000 samples
- Characteristics: Match deployment distribution
**Calibration process:**
1. Select samples from validation set
2. Run through model to collect activation ranges
3. Validate accuracy after calibration
4. If accuracy loss >2%, try different calibration data or QAT
### Step 4: Validate Accuracy
**Required measurements:**
- Baseline accuracy (FP32)
- Quantized accuracy (INT8/INT4)
- Accuracy loss (baseline - quantized)
- Acceptable threshold (typically <2%)
**Decision:**
- If accuracy loss <2% → Deploy
- If accuracy loss >2% → Try QAT or reconsider quantization
### Step 5: Benchmark Performance
**Required measurements:**
- Model size (MB): baseline vs quantized
- Inference latency (ms): baseline vs quantized
- Throughput (requests/sec): baseline vs quantized
**Verify expected results:**
- Size: 4× reduction (FP32 → INT8)
- CPU speedup: 2-4× (static quantization)
- GPU speedup: 1.5-2× (if applicable)
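A minimal sketch that collects all three measurements for one model; the paths and models are placeholders, and single-stream throughput here is just the inverse of latency.

```python
import os
import time

import torch


def benchmark_model(model_path, model, sample_input, num_iters=100):
    """Collect the three measurements above: size (MB), latency (ms), throughput (req/s)."""
    size_mb = os.path.getsize(model_path) / 1024 ** 2

    model.eval()
    with torch.no_grad():
        for _ in range(10):            # warm-up iterations
            model(sample_input)
        start = time.time()
        for _ in range(num_iters):
            model(sample_input)
        elapsed = time.time() - start

    latency_ms = elapsed / num_iters * 1000
    throughput = num_iters / elapsed   # single-stream requests/sec
    return {"size_mb": size_mb, "latency_ms": latency_ms, "throughput_rps": throughput}


# Compare baseline vs quantized (paths and models are placeholders):
# print(benchmark_model("model.pth", original_model, torch.randn(1, 3, 224, 224)))
# print(benchmark_model("model_quantized_static.pth", quantized_model, torch.randn(1, 3, 224, 224)))
```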
## Part 12: Production Deployment Checklist
Before deploying quantized model to production:
**✅ Accuracy Validated**
- [ ] Baseline accuracy measured on validation set
- [ ] Quantized accuracy measured on same validation set
- [ ] Accuracy loss within acceptable threshold (<2%)
- [ ] Validated on representative production data
**✅ Performance Benchmarked**
- [ ] Size reduction measured (expect 4× for INT8)
- [ ] Latency improvement measured (expect 2-4× CPU)
- [ ] Throughput improvement measured
- [ ] Performance meets requirements
**✅ Calibration Verified** (if static/QAT)
- [ ] Used representative samples from validation set (not random data)
- [ ] Used sufficient calibration data (100-1000 samples)
- [ ] Calibration data matches deployment distribution
**✅ Edge Cases Tested**
- [ ] Tested on diverse inputs (bright/dark images, long/short text)
- [ ] Validated numerical stability (no NaN/Inf outputs; see the sketch after this checklist)
- [ ] Tested inference on target hardware (CPU/GPU/edge device)
**✅ Rollback Plan**
- [ ] Can easily revert to FP32 model if issues found
- [ ] Monitoring in place to detect accuracy degradation
- [ ] A/B testing plan to compare FP32 vs quantized
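For the numerical-stability item, a quick check might look like this sketch (it assumes a `quantized_model` and `val_loader` like those used earlier):

```python
import torch

def check_numerical_stability(model, data_loader, max_batches=50):
    """Flag NaN/Inf in model outputs (the stability item in the checklist above)."""
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(data_loader):
            output = model(data)
            if torch.isnan(output).any() or torch.isinf(output).any():
                return False, batch_idx  # unstable output found in this batch
            if batch_idx >= max_batches:
                break
    return True, None

# stable, bad_batch = check_numerical_stability(quantized_model, val_loader)
```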
## Skill Mastery Checklist
You have mastered quantization for inference when you can:
- [ ] Recognize when quantization is appropriate (CPU/edge deployment, size/speed issues)
- [ ] Choose correct quantization type (dynamic vs static vs QAT) based on requirements
- [ ] Implement dynamic quantization in PyTorch (5 lines of code)
- [ ] Implement static quantization with proper calibration (20 lines of code)
- [ ] Select appropriate calibration data (validation set, 100-1000 samples)
- [ ] Validate accuracy trade-offs systematically (baseline vs quantized)
- [ ] Benchmark performance improvements (size, latency, throughput)
- [ ] Decide when NOT to quantize (GPU-only, already fast, accuracy-critical)
- [ ] Debug quantization issues (accuracy collapse, wrong speedup, numerical instability)
- [ ] Deploy quantized models to production with confidence
**Key insight:** Quantization is not magic - it's a systematic trade-off of precision for performance. The skill is matching the right quantization approach to your specific requirements.