
Quantization for Inference Skill

When to Use This Skill

Use this skill when you observe these symptoms:

Performance Symptoms:

  • Model inference too slow on CPU (e.g., >10ms when you need <5ms)
  • Batch processing taking too long (low throughput)
  • Need to serve more requests per second with same hardware

Size Symptoms:

  • Model too large for edge devices (e.g., 100MB+ for mobile)
  • Want to fit more models in GPU memory
  • Memory-constrained deployment environment

Deployment Symptoms:

  • Deploying to CPU servers (quantization gives 2-4× CPU speedup)
  • Deploying to edge devices (mobile, IoT, embedded systems)
  • Cost-sensitive deployment (smaller models = lower hosting costs)

When NOT to use this skill:

  • Model already fast enough and small enough (no problem to solve)
  • Deploying exclusively on GPU with no memory constraints (modest benefit)
  • Prototyping phase where optimization is premature
  • Model so small that quantization overhead not worth it (e.g., <5MB)

Core Principle

Quantization trades precision for performance.

Quantization converts high-precision numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits or INT4: 4 bits). This provides:

  • 4-8× smaller model size (fewer bits per parameter)
  • 2-4× faster inference on CPU (INT8 operations faster than FP32)
  • Small accuracy loss (typically 0.5-1% for INT8)

Formula: Lower precision (FP32 → INT8 → INT4) = Smaller size + Faster inference + More accuracy loss

The skill is choosing the right precision for your accuracy tolerance.
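
To make the precision trade-off concrete, here is a minimal sketch of the affine FP32 → INT8 mapping most quantization backends use. The helper names and the signed [-128, 127] range are illustrative, not taken from a specific framework.

import numpy as np

def quantize_int8(x, x_min, x_max):
    """Affine FP32 -> INT8 mapping: q = round(x / scale) + zero_point."""
    scale = (x_max - x_min) / 255.0               # spread the observed range over 256 buckets
    zero_point = int(np.round(-x_min / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Approximate recovery of the original FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(8).astype(np.float32)
q, scale, zp = quantize_int8(x, x.min(), x.max())
x_hat = dequantize_int8(q, scale, zp)
print(np.abs(x - x_hat).max())                    # rounding error on the order of `scale`

Every parameter is stored in 8 bits instead of 32, and the error introduced is bounded by the quantization step size - which is why accuracy loss grows as precision drops.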

Quantization Framework

┌────────────────────────────────────────────┐
│   1. Recognize Quantization Need           │
│   CPU/Edge + (Slow OR Large)               │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   2. Choose Quantization Type              │
│   Dynamic → Static → QAT (increasing cost) │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   3. Calibrate (if Static/QAT)             │
│   100-1000 representative samples          │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   4. Validate Accuracy Trade-offs          │
│   Baseline vs Quantized accuracy           │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   5. Decide: Accept or Iterate             │
│   <2% loss → Deploy                        │
│   >2% loss → Try QAT or different precision│
└────────────────────────────────────────────┘

Part 1: Quantization Types

Type 1: Dynamic Quantization

What it does: Quantizes weights to INT8, keeps activations in FP32.

When to use:

  • Simplest quantization (no calibration needed)
  • Primary goal is size reduction
  • Batch processing where latency less critical
  • Quick experiment to see if quantization helps

Benefits:

  • 4× size reduction (weights, which dominate model size, shrink from 32 bits to 8 bits)
  • 1.2-1.5× CPU speedup (modest, because activations still FP32)
  • Minimal accuracy loss (~0.2-0.5%)
  • No calibration data needed

Limitations:

  • ⚠️ Limited CPU speedup (activations still FP32)
  • ⚠️ Not optimal for edge devices needing maximum performance

PyTorch implementation:

import os

import torch
import torch.quantization

# WHY: Dynamic quantization is simplest - just one function call
# No calibration data needed because activations stay FP32
model = torch.load('model.pth')
model.eval()  # WHY: Must be in eval mode (no batchnorm updates)

# WHY: Specify which layers to quantize (Linear, LSTM, etc.)
# These layers benefit most from quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},  # WHY: Quantize Linear layers only
    dtype=torch.qint8  # WHY: INT8 is standard precision
)

# Save quantized model
torch.save(quantized_model.state_dict(), 'model_quantized_dynamic.pth')

# Verify size reduction
original_size = os.path.getsize('model.pth') / (1024 ** 2)  # MB
quantized_size = os.path.getsize('model_quantized_dynamic.pth') / (1024 ** 2)
print(f"Original: {original_size:.1f}MB → Quantized: {quantized_size:.1f}MB")
print(f"Size reduction: {original_size / quantized_size:.1f}×")

Example use case: BERT classification model where primary goal is reducing size from 440MB to 110MB for easier deployment.

Type 2: Static Quantization (Post-Training Quantization)

What it does: Quantizes both weights and activations to INT8.

When to use:

  • Need maximum CPU speedup (2-4×)
  • Deploying to CPU servers or edge devices
  • Can afford calibration step (5-10 minutes)
  • Primary goal is inference speed

Benefits:

  • 4× size reduction (same as dynamic)
  • 2-4× CPU speedup (both weights and activations INT8)
  • No retraining required (post-training)
  • Acceptable accuracy loss (~0.5-1%)

Requirements:

  • ⚠️ Needs calibration data (100-1000 samples from validation set)
  • ⚠️ Slightly more complex setup than dynamic

PyTorch implementation:

import copy

import torch
import torch.quantization

def calibrate_model(model, calibration_loader):
    """
    Calibrate model by running representative data through it.

    WHY: Static quantization needs to know activation ranges.
    Calibration finds min/max values for each activation layer.

    Args:
        model: Model in eval mode with quantization stubs
        calibration_loader: DataLoader with 100-1000 samples
    """
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(calibration_loader):
            model(data)
            if batch_idx >= 100:  # WHY: 100 batches usually sufficient
                break
    return model

# Step 1: Prepare model for quantization
model = torch.load('model.pth')
model.eval()
original_model = copy.deepcopy(model)  # WHY: Keep an FP32 copy for the benchmark below

# WHY: Attach a quantization config and insert observers that record activation ranges
# NOTE: Eager-mode static quantization expects the model to define QuantStub/DeQuantStub
# at its input/output boundaries so PyTorch knows where to convert FP32 <-> INT8
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Step 2: Calibrate with representative data
# WHY: Must use data from training/validation set, not random data
# Calibration finds activation ranges - needs real distribution
calibration_dataset = torch.utils.data.Subset(
    val_dataset,
    indices=range(1000)  # WHY: 1000 samples sufficient for most models
)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)

model = calibrate_model(model, calibration_loader)

# Step 3: Convert to quantized model
torch.quantization.convert(model, inplace=True)

# Save quantized model
torch.save(model.state_dict(), 'model_quantized_static.pth')

# Benchmark speed improvement
import time

def benchmark(model, data, num_iterations=100):
    """WHY: Warm up model first, then measure average latency."""
    model.eval()
    # Warm up (first few iterations slower)
    for _ in range(10):
        model(data)

    start = time.time()
    with torch.no_grad():
        for _ in range(num_iterations):
            model(data)
    end = time.time()
    return (end - start) / num_iterations * 1000  # ms per inference

test_data = torch.randn(1, 3, 224, 224)  # Example input

baseline_latency = benchmark(original_model, test_data)
quantized_latency = benchmark(model, test_data)

print(f"Baseline: {baseline_latency:.2f}ms")
print(f"Quantized: {quantized_latency:.2f}ms")
print(f"Speedup: {baseline_latency / quantized_latency:.2f}×")

Example use case: ResNet50 image classifier for CPU inference - need <5ms latency, achieve 4ms with static quantization (vs 15ms baseline).

Type 3: Quantization-Aware Training (QAT)

What it does: Simulates quantization during training to minimize accuracy loss.

When to use:

  • Static quantization accuracy loss too large (>2%)
  • Need best possible accuracy with INT8
  • Can afford retraining (hours to days)
  • Critical production system with strict accuracy requirements

Benefits:

  • Best accuracy (~0.1-0.3% loss vs 0.5-1% for static)
  • 4× size reduction (same as dynamic/static)
  • 2-4× CPU speedup (same as static)

Limitations:

  • ⚠️ Requires retraining (most expensive option)
  • ⚠️ Takes hours to days depending on model size
  • ⚠️ More complex implementation

PyTorch implementation:

import torch
import torch.quantization

def train_one_epoch_qat(model, train_loader, optimizer, criterion):
    """
    Train one epoch with quantization-aware training.

    WHY: QAT inserts fake quantization ops during training.
    Model learns to be robust to quantization errors.
    """
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    return model

# Step 1: Prepare model for QAT
model = torch.load('model.pth')
model.train()

# WHY: QAT config includes fake quantization ops
# These simulate quantization during forward pass
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Step 2: Train with quantization-aware training
# WHY: Model learns to compensate for quantization errors
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)  # WHY: Low LR for fine-tuning
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 5  # WHY: Usually 5-10 epochs sufficient for QAT fine-tuning
for epoch in range(num_epochs):
    model = train_one_epoch_qat(model, train_loader, optimizer, criterion)
    print(f"Epoch {epoch+1}/{num_epochs} complete")

# Step 3: Convert to quantized model
model.eval()
torch.quantization.convert(model, inplace=True)

# Save QAT quantized model
torch.save(model.state_dict(), 'model_quantized_qat.pth')

Example use case: Medical imaging model where accuracy is critical - static quantization gives 2% accuracy loss, QAT reduces to 0.3%.

Part 2: Quantization Type Decision Matrix

| Type    | Complexity | Calibration | Retraining | Size Reduction | CPU Speedup | Accuracy Loss |
|---------|------------|-------------|------------|----------------|-------------|---------------|
| Dynamic | Low        | No          | No         | 4×             | 1.2-1.5×    | ~0.2-0.5%     |
| Static  | Medium     | Yes         | No         | 4×             | 2-4×        | ~0.5-1%       |
| QAT     | High       | Yes         | Yes        | 4×             | 2-4×        | ~0.1-0.3%     |

Decision flow:

  1. Start with dynamic quantization: Simplest, verify quantization helps
  2. Upgrade to static quantization: If need more speedup, can afford calibration
  3. Use QAT: Only if accuracy loss from static too large (rare)

Why this order? Incremental cost. Dynamic is free (5 minutes), static is cheap (15 minutes), QAT is expensive (hours/days). Don't pay for QAT unless you need it.
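
The same order can be written as a tiny helper. This is a sketch: the function name is illustrative and the 2% threshold mirrors the framework diagram above.

def choose_quantization_type(meets_requirements_with_dynamic, static_accuracy_loss_pct=None):
    """Walk the incremental-cost ladder: dynamic -> static -> QAT."""
    if meets_requirements_with_dynamic:
        return "dynamic"  # free: no calibration, no retraining
    if static_accuracy_loss_pct is None or static_accuracy_loss_pct <= 2.0:
        return "static"   # cheap: ~15 minutes with 100-1000 calibration samples
    return "qat"          # expensive: retraining, only when static loses >2%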

Part 3: Calibration Best Practices

What is Calibration?

Purpose: Find min/max ranges for each activation layer.

Why needed: Static quantization needs to know activation ranges to map FP32 → INT8. Without calibration, ranges are wrong → accuracy collapses.

How it works:

  1. Run representative data through model
  2. Record min/max activation values per layer
  3. Use these ranges to quantize activations at inference time
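
A minimal sketch of step 3, assuming the common affine scheme where a layer's observed range is mapped onto an unsigned 8-bit grid; the helper name is illustrative.

import numpy as np

def calibration_params(observed_min, observed_max, num_bits=8):
    """Turn a layer's observed activation range into (scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1        # unsigned INT8 range for activations
    scale = (observed_max - observed_min) / (qmax - qmin)
    zero_point = int(np.clip(np.round(qmin - observed_min / scale), qmin, qmax))
    return scale, zero_point

# e.g. a ReLU layer whose activations ranged over [0.0, 6.2] during calibration
print(calibration_params(0.0, 6.2))          # -> (~0.0243, 0)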

Calibration Data Requirements

Data source:

  • Use validation set samples (matches training distribution)
  • Don't use random images from internet (different distribution)
  • Don't use single image repeated (insufficient coverage)
  • Don't use training set that doesn't match deployment (distribution shift)

Data size:

  • Minimum: 100 samples (sufficient for simple models)
  • Recommended: 500-1000 samples (better coverage)
  • Maximum: Full validation set is overkill (slow, no benefit)

Data characteristics:

  • Must cover range of inputs model sees in production
  • Include edge cases (bright/dark images, long/short text)
  • Distribution should match deployment, not just training
  • Class balance less important than input diversity

Example calibration data selection:

import torch
import numpy as np

def select_calibration_data(val_dataset, num_samples=1000):
    """
    Select diverse calibration samples from validation set.

    WHY: Want samples that cover range of activation values.
    Random selection from validation set usually sufficient.

    Args:
        val_dataset: Full validation dataset
        num_samples: Number of calibration samples (default 1000)

    Returns:
        Calibration dataset subset
    """
    # WHY: Random selection ensures diversity
    # Stratified sampling can help ensure class coverage
    indices = np.random.choice(len(val_dataset), num_samples, replace=False)
    calibration_dataset = torch.utils.data.Subset(val_dataset, indices)

    return calibration_dataset

# Example: Select 1000 random samples from validation set
calibration_dataset = select_calibration_data(val_dataset, num_samples=1000)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)

Common Calibration Pitfalls

Pitfall 1: Using wrong data distribution

  • ❌ "Random images from internet" for an ImageNet-trained model
  • ✅ Use ImageNet validation set samples

Pitfall 2: Too few samples

  • ❌ 10 samples (insufficient coverage of activation ranges)
  • ✅ 100-1000 samples (good coverage)

Pitfall 3: Using training data that doesn't match deployment

  • ❌ Calibrate on sunny outdoor images, deploy on indoor images
  • ✅ Calibrate on data matching the deployment distribution

Pitfall 4: Skipping calibration validation

  • ❌ Calibrate once, assume it works
  • ✅ Validate accuracy after calibration to verify the ranges are good

Part 4: Precision Selection (INT8 vs INT4 vs FP16)

Precision Spectrum

| Precision | Bits | Size Reduction vs FP32 | Speedup (CPU) | Typical Accuracy Loss |
|-----------|------|------------------------|---------------|-----------------------|
| FP32      | 32   | 1×                     | 1×            | 0% (baseline)         |
| FP16      | 16   | 2×                     | 1.5×          | <0.1%                 |
| INT8      | 8    | 4×                     | 2-4×          | 0.5-1%                |
| INT4      | 4    | 8×                     | 4-8×          | 1-3%                  |

Trade-off: Lower precision = Smaller size + Faster inference + More accuracy loss

When to Use Each Precision

FP16 (Half Precision):

  • GPU inference (Tensor Cores optimized for FP16)
  • Need minimal accuracy loss (<0.1%)
  • Size reduction secondary concern
  • Example: Large language models on GPU
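
A minimal FP16 sketch for GPU inference, assuming a CUDA device and the same 'model.pth' checkpoint used in the earlier examples:

import torch

model = torch.load('model.pth')
model.eval().half().cuda()                     # cast weights to FP16 on the GPU

x = torch.randn(1, 3, 224, 224, device='cuda', dtype=torch.float16)
with torch.no_grad():
    output = model(x)                          # FP16 inference (uses Tensor Cores on recent GPUs)

# Alternative: torch.autocast(device_type='cuda', dtype=torch.float16) keeps FP32
# weights on disk and casts per-op at runtime.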

INT8 (Standard Quantization):

  • CPU inference (INT8 operations fast on CPU)
  • Edge device deployment
  • Good balance of size/speed/accuracy
  • Most common choice for production deployment
  • Example: Image classification on mobile devices

INT4 (Aggressive Quantization):

  • Extremely memory-constrained (e.g., 1GB mobile devices)
  • Can tolerate larger accuracy loss (1-3%)
  • Need maximum size reduction (8×)
  • Use sparingly - accuracy risk high
  • Example: Large language models (LLaMA-7B: 13GB → 3.5GB)

Decision Flow

def choose_precision(accuracy_tolerance, deployment_target):
    """
    Choose quantization precision based on requirements.

    WHY: Different precisions for different constraints.
    INT8 is default, FP16 for GPU, INT4 for extreme memory constraints.
    """
    if accuracy_tolerance < 0.1:
        return "FP16"  # Minimal accuracy loss required
    elif deployment_target == "GPU":
        return "FP16"  # GPU optimized for FP16
    elif deployment_target in ["CPU", "edge"]:
        return "INT8"  # CPU optimized for INT8
    elif deployment_target == "extreme_edge" and accuracy_tolerance > 1:
        return "INT4"  # Only if can tolerate 1-3% loss
    else:
        return "INT8"  # Default safe choice

Part 5: ONNX Quantization (Cross-Framework)

When to use: Deploying to ONNX Runtime (CPU/edge devices) or need cross-framework compatibility.

ONNX Static Quantization

import numpy as np
import torch

import onnxruntime
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, quantize_static

class CalibrationDataReaderWrapper(CalibrationDataReader):
    """
    WHY: ONNX requires custom calibration data reader.
    This class feeds calibration data to ONNX quantization engine.
    """
    def __init__(self, calibration_data):
        self.calibration_data = calibration_data
        self.iterator = iter(calibration_data)

    def get_next(self):
        """WHY: Called by ONNX to get next calibration batch."""
        try:
            data, _ = next(self.iterator)
            return {"input": data.numpy()}  # WHY: Return dict of input name → data
        except StopIteration:
            return None

# Step 1: Export PyTorch model to ONNX
model = torch.load('model.pth')
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    opset_version=13  # WHY: ONNX opset 13+ supports quantization ops
)

# Step 2: Prepare calibration data
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=1,  # WHY: ONNX calibration uses batch size 1
    shuffle=False
)
calibration_reader = CalibrationDataReaderWrapper(calibration_loader)

# Step 3: Quantize ONNX model
quantize_static(
    'model.onnx',
    'model_quantized.onnx',
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ  # WHY: QDQ format compatible with most backends
)

# Step 4: Benchmark ONNX quantized model
import time

session = onnxruntime.InferenceSession('model_quantized.onnx')
input_name = session.get_inputs()[0].name

test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warm up
for _ in range(10):
    session.run(None, {input_name: test_data})

# Benchmark
start = time.time()
for _ in range(100):
    session.run(None, {input_name: test_data})
end = time.time()

latency = (end - start) / 100 * 1000  # ms per inference
print(f"ONNX Quantized latency: {latency:.2f}ms")

ONNX advantages:

  • Cross-framework (works with PyTorch, TensorFlow, etc.)
  • Optimized ONNX Runtime for CPU inference
  • Good hardware backend support (x86, ARM)
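
If assembling calibration data is a blocker, ONNX Runtime also offers weight-only dynamic quantization. A minimal sketch, reusing the 'model.onnx' file exported above; as with PyTorch dynamic quantization, expect the size reduction but not the full static speedup.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Weight-only INT8 quantization: no calibration reader needed
quantize_dynamic(
    'model.onnx',
    'model_quantized_dynamic.onnx',
    weight_type=QuantType.QInt8
)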

Part 6: Accuracy Validation (Critical Step)

Why Accuracy Validation Matters

Quantization is lossy compression. Must measure accuracy impact:

  • Some models tolerate quantization well (<0.5% loss)
  • Some models sensitive to quantization (>2% loss)
  • Some layers more sensitive than others
  • Can't assume quantization is safe without measuring

Validation Methodology

def validate_quantization(original_model, quantized_model, val_loader):
    """
    Validate quantization by comparing accuracy.

    WHY: Quantization is lossy - must measure impact.
    Compare baseline vs quantized on same validation set.

    Returns:
        dict with baseline_acc, quantized_acc, accuracy_loss
    """
    def evaluate(model, data_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in data_loader:
                output = model(data)
                pred = output.argmax(dim=1)
                correct += (pred == target).sum().item()
                total += target.size(0)
        return 100.0 * correct / total

    baseline_acc = evaluate(original_model, val_loader)
    quantized_acc = evaluate(quantized_model, val_loader)
    accuracy_loss = baseline_acc - quantized_acc

    return {
        'baseline_acc': baseline_acc,
        'quantized_acc': quantized_acc,
        'accuracy_loss': accuracy_loss,
        'acceptable': accuracy_loss < 2.0  # WHY: <2% loss usually acceptable
    }

# Example validation
results = validate_quantization(original_model, quantized_model, val_loader)
print(f"Baseline accuracy: {results['baseline_acc']:.2f}%")
print(f"Quantized accuracy: {results['quantized_acc']:.2f}%")
print(f"Accuracy loss: {results['accuracy_loss']:.2f}%")
print(f"Acceptable: {results['acceptable']}")

# Decision logic
if results['acceptable']:
    print("✅ Quantization acceptable - deploy quantized model")
else:
    print("❌ Accuracy loss too large - try QAT or reconsider quantization")

Acceptable Accuracy Thresholds

General guidelines:

  • <1% loss: Excellent quantization result
  • 1-2% loss: Acceptable for most applications
  • 2-3% loss: Consider QAT to reduce loss
  • >3% loss: Quantization may not be suitable for this model

Task-specific thresholds:

  • Image classification: 1-2% top-1 accuracy loss acceptable
  • Object detection: 1-2% mAP loss acceptable
  • NLP classification: 0.5-1% accuracy loss acceptable
  • Medical/safety-critical: <0.5% loss required (use QAT)

Part 7: LLM Quantization (GPTQ, AWQ)

Note: This skill covers general quantization. For LLM-specific optimization (GPTQ, AWQ, KV cache, etc.), see the llm-inference-optimization skill in the llm-specialist pack.

LLM Quantization Overview

Why LLMs need quantization:

  • Very large (7B parameters = 13GB in FP16)
  • Memory-bound inference (limited by VRAM)
  • INT4 quantization: 13GB → 3.5GB (fits in consumer GPUs)

LLM-specific quantization methods:

  • GPTQ: Post-training quantization optimized for LLMs
  • AWQ: Activation-aware weight quantization (better quality than GPTQ)
  • Both: Achieve INT4 with <0.5 perplexity increase
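
The size figures above follow directly from parameter count × bits per weight; a quick back-of-the-envelope check:

params = 7e9                                   # LLaMA-7B parameter count

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30            # bytes -> GiB
    print(f"{name}: ~{gib:.1f} GB of weights")

# FP16: ~13.0 GB, INT8: ~6.5 GB, INT4: ~3.3 GB
# (INT4 checkpoints land closer to 3.5 GB once scales/zero-points are stored)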

When to Use LLM Quantization

Use when:

  • Deploying LLMs locally (consumer GPUs)
  • Memory-constrained (need to fit in 12GB/24GB VRAM)
  • Cost-sensitive (smaller models cheaper to host)
  • Latency-sensitive (smaller models faster to load)

Don't use when:

  • Have sufficient GPU memory for FP16
  • Accuracy critical (medical, legal applications)
  • Already using API (OpenAI, Anthropic) - they handle optimization

LLM Quantization References

For detailed LLM quantization:

  • See skill: llm-inference-optimization (llm-specialist pack)
  • Covers: GPTQ, AWQ, KV cache optimization, token streaming
  • Tools: llama.cpp, vLLM, text-generation-inference

Quick reference (defer to llm-specialist for details):

# GPTQ quantization (example - see llm-specialist for full details)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# WHY: GPTQ optimizes layer-wise for minimal perplexity increase
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Result: 13GB → 3.5GB, <0.5 perplexity increase

Part 8: When NOT to Quantize

Scenario 1: Already Fast Enough

Example: MobileNetV2 (14MB, 3ms CPU latency)

  • Quantization: 14MB → 4MB, 3ms → 2ms
  • Benefit: 10MB saved, 1ms faster
  • Cost: Calibration, validation, testing, debugging
  • Decision: Not worth effort unless specific requirement

Rule: If current performance meets requirements, don't optimize.

Scenario 2: GPU-Only Deployment with No Memory Constraints

Example: ResNet50 on Tesla V100 with 32GB VRAM

  • Quantization: 1.5-2× GPU speedup (modest)
  • FP32 already fast on GPU (Tensor Cores optimized)
  • No memory pressure (plenty of VRAM)
  • Decision: Focus on other bottlenecks (data loading, I/O)

Rule: Quantization is most beneficial for CPU inference and memory-constrained GPU.

Scenario 3: Accuracy-Critical Applications

Example: Medical diagnosis model where misdiagnosis has severe consequences

  • Quantization introduces accuracy loss (even if small)
  • Risk not worth benefit
  • Decision: Keep FP32, optimize other parts (batching, caching)

Rule: Safety-critical systems should avoid lossy compression unless thoroughly validated.

Scenario 4: Prototyping Phase

Example: Early development, trying different architectures

  • Quantization is optimization - premature at prototype stage
  • Focus on getting model working first
  • Decision: Defer quantization until production deployment

Rule: Don't optimize until you need to (Knuth: "Premature optimization is the root of all evil").

Part 9: Quantization Benchmarks (Expected Results)

Image Classification (ResNet50, ImageNet)

| Metric         | FP32 Baseline | Dynamic INT8      | Static INT8       | QAT INT8          |
|----------------|---------------|-------------------|-------------------|-------------------|
| Size           | 98MB          | 25MB (4×)         | 25MB (4×)         | 25MB (4×)         |
| CPU Latency    | 15ms          | 12ms (1.25×)      | 4ms (3.75×)       | 4ms (3.75×)       |
| Top-1 Accuracy | 76.1%         | 75.9% (0.2% loss) | 75.3% (0.8% loss) | 75.9% (0.2% loss) |

Insight: Static quantization gives 3.75× speedup with acceptable 0.8% accuracy loss.

Object Detection (YOLOv5s, COCO)

| Metric      | FP32 Baseline | Static INT8       | QAT INT8          |
|-------------|---------------|-------------------|-------------------|
| Size        | 14MB          | 4MB (3.5×)        | 4MB (3.5×)        |
| CPU Latency | 45ms          | 15ms (3×)         | 15ms (3×)         |
| mAP@0.5     | 37.4%         | 36.8% (0.6% loss) | 37.2% (0.2% loss) |

Insight: QAT gives better accuracy (0.2% vs 0.6% loss) with same speedup.

NLP Classification (BERT-base, GLUE)

| Metric      | FP32 Baseline | Dynamic INT8      | Static INT8       |
|-------------|---------------|-------------------|-------------------|
| Size        | 440MB         | 110MB (4×)        | 110MB (4×)        |
| CPU Latency | 35ms          | 28ms (1.25×)      | 12ms (2.9×)       |
| Accuracy    | 93.5%         | 93.2% (0.3% loss) | 92.8% (0.7% loss) |

Insight: Static quantization gives a 2.9× speedup; dynamic is sufficient when size reduction, not speed, is the main goal.

LLM Inference (LLaMA-7B)

| Metric              | FP16 Baseline | GPTQ INT4            | AWQ INT4             |
|---------------------|---------------|----------------------|----------------------|
| Size                | 13GB          | 3.5GB (3.7×)         | 3.5GB (3.7×)         |
| First Token Latency | 800ms         | 250ms (3.2×)         | 230ms (3.5×)         |
| Perplexity          | 5.68          | 5.82 (0.14 increase) | 5.77 (0.09 increase) |

Insight: AWQ gives better quality than GPTQ with similar speedup.

Part 10: Common Pitfalls and Solutions

Pitfall 1: Skipping Accuracy Validation

Issue: Deploy quantized model without measuring accuracy impact. Risk: Discover accuracy degradation in production (too late). Solution: Always validate accuracy on representative data before deployment.

# ❌ WRONG: Deploy without validation
quantized_model = quantize(model)
deploy(quantized_model)  # Hope it works!

# ✅ RIGHT: Validate before deployment
quantized_model = quantize(model)
results = validate_accuracy(original_model, quantized_model, val_loader)
if results['acceptable']:
    deploy(quantized_model)
else:
    print("Accuracy loss too large - try QAT")

Pitfall 2: Using Wrong Calibration Data

Issue: Calibrate with random/unrepresentative data. Risk: Activation ranges wrong → accuracy collapses. Solution: Use 100-1000 samples from validation set matching deployment distribution.

# ❌ WRONG: Random images from internet
calibration_data = download_random_images()

# ✅ RIGHT: Samples from validation set
calibration_data = torch.utils.data.Subset(val_dataset, range(1000))

Pitfall 3: Choosing Wrong Quantization Type

Issue: Use dynamic quantization when need static speedup. Risk: Get 1.2× speedup instead of 3× speedup. Solution: Match quantization type to requirements (dynamic for size, static for speed).

# ❌ WRONG: Use dynamic when need speed
if need_fast_cpu_inference:
    quantized_model = torch.quantization.quantize_dynamic(model)  # Only 1.2× speedup

# ✅ RIGHT: Use static for speed
if need_fast_cpu_inference:
    model = prepare_and_calibrate(model, calibration_data)
    quantized_model = torch.quantization.convert(model)  # 2-4× speedup

Pitfall 4: Quantizing GPU-Only Deployments

Issue: Quantize model for GPU inference without memory pressure. Risk: Effort not worth modest 1.5-2× GPU speedup. Solution: Only quantize GPU if memory-constrained (multiple models in VRAM).

# ❌ WRONG: Quantize for GPU with no memory issue
if deployment_target == "GPU" and have_plenty_of_memory:
    quantized_model = quantize(model)  # Wasted effort

# ✅ RIGHT: Skip quantization if not needed
if deployment_target == "GPU" and have_plenty_of_memory:
    deploy(model)  # Keep FP32, focus on other optimizations

Pitfall 5: Over-Quantizing (INT4 When INT8 Sufficient)

Issue: Use aggressive INT4 quantization when INT8 would suffice. Risk: Larger accuracy loss than necessary. Solution: Start with INT8 (standard), only use INT4 if extreme memory constraints.

# ❌ WRONG: Jump to INT4 without trying INT8
quantized_model = quantize(model, precision="INT4")  # 2-3% accuracy loss

# ✅ RIGHT: Start with INT8, only use INT4 if needed
quantized_model_int8 = quantize(model, precision="INT8")  # 0.5-1% accuracy loss
if model_still_too_large:
    quantized_model_int4 = quantize(model, precision="INT4")

Pitfall 6: Assuming All Layers Quantize Equally

Issue: Quantize all layers uniformly, but some layers more sensitive. Risk: Accuracy loss dominated by few sensitive layers. Solution: Use mixed precision - keep sensitive layers in FP32/INT8, quantize others to INT4.

# ✅ ADVANCED: Mixed precision quantization (FX graph mode, sketch)
# Keep sensitive first/last layers in FP32, quantize the rest to INT8
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

qconfig_mapping = QConfigMapping()
qconfig_mapping.set_global(get_default_qconfig('fbgemm'))  # INT8 default
qconfig_mapping.set_module_name('layer1', None)  # None = keep this layer FP32
qconfig_mapping.set_module_name('fc', None)      # module names depend on your model

example_inputs = (torch.randn(1, 3, 224, 224),)
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared = calibrate_model(prepared, calibration_loader)  # reuse calibration from Part 1
quantized_model = convert_fx(prepared)

Part 11: Decision Framework Summary

Step 1: Recognize Quantization Need

Symptoms:

  • Model too slow on CPU (>10ms when you need <5ms)
  • Model too large for edge devices (>50MB)
  • Deploying to CPU/edge (not GPU)
  • Need to reduce hosting costs

If YES → Proceed to Step 2
If NO → Don't quantize, focus on other optimizations

Step 2: Choose Quantization Type

Start with Dynamic:
├─ Sufficient? (meets latency/size requirements)
│  ├─ YES → Deploy dynamic quantized model
│  └─ NO → Proceed to Static
│
Static Quantization:
├─ Sufficient? (meets latency/size + accuracy acceptable)
│  ├─ YES → Deploy static quantized model
│  └─ NO → Accuracy loss >2%
│     │
│     └─ Proceed to QAT
│
QAT:
├─ Train with quantization awareness
└─ Achieves <1% accuracy loss → Deploy

Step 3: Calibrate (if Static/QAT)

Calibration data:

  • Source: Validation set (representative samples)
  • Size: 100-1000 samples
  • Characteristics: Match deployment distribution

Calibration process:

  1. Select samples from validation set
  2. Run through model to collect activation ranges
  3. Validate accuracy after calibration
  4. If accuracy loss >2%, try different calibration data or QAT

Step 4: Validate Accuracy

Required measurements:

  • Baseline accuracy (FP32)
  • Quantized accuracy (INT8/INT4)
  • Accuracy loss (baseline - quantized)
  • Acceptable threshold (typically <2%)

Decision:

  • If accuracy loss <2% → Deploy
  • If accuracy loss >2% → Try QAT or reconsider quantization

Step 5: Benchmark Performance

Required measurements:

  • Model size (MB): baseline vs quantized
  • Inference latency (ms): baseline vs quantized
  • Throughput (requests/sec): baseline vs quantized

Verify expected results:

  • Size: 4× reduction (FP32 → INT8)
  • CPU speedup: 2-4× (static quantization)
  • GPU speedup: 1.5-2× (if applicable)
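
A small helper that gathers all three measurements in one place. It is a sketch: it reuses the benchmark() latency function from the static-quantization example and assumes the model files and original_model/quantized_model objects from the earlier snippets.

import os

import torch

def benchmark_report(fp32_path, int8_path, fp32_model, int8_model, sample_input):
    """Print size, latency, and single-stream throughput for both models."""
    for name, path, model in [("FP32", fp32_path, fp32_model),
                              ("INT8", int8_path, int8_model)]:
        size_mb = os.path.getsize(path) / 1024**2
        latency_ms = benchmark(model, sample_input)  # benchmark() from the static example
        throughput = 1000.0 / latency_ms             # requests/sec at batch size 1
        print(f"{name}: {size_mb:.1f}MB | {latency_ms:.2f}ms | {throughput:.0f} req/s")

benchmark_report('model.pth', 'model_quantized_static.pth',
                 original_model, quantized_model,
                 torch.randn(1, 3, 224, 224))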

Part 12: Production Deployment Checklist

Before deploying quantized model to production:

Accuracy Validated

  • Baseline accuracy measured on validation set
  • Quantized accuracy measured on same validation set
  • Accuracy loss within acceptable threshold (<2%)
  • Validated on representative production data

Performance Benchmarked

  • Size reduction measured (expect 4× for INT8)
  • Latency improvement measured (expect 2-4× CPU)
  • Throughput improvement measured
  • Performance meets requirements

Calibration Verified (if static/QAT)

  • Used representative samples from validation set (not random data)
  • Used sufficient calibration data (100-1000 samples)
  • Calibration data matches deployment distribution

Edge Cases Tested

  • Tested on diverse inputs (bright/dark images, long/short text)
  • Validated numerical stability (no NaN/Inf outputs)
  • Tested inference on target hardware (CPU/GPU/edge device)
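
A quick sanity check for the NaN/Inf item. The edge-case tensors below are illustrative stand-ins for real extreme inputs, and quantized_model is the converted model from the earlier examples.

import torch

def check_numerical_stability(model, edge_case_batches):
    """Fail loudly if the quantized model produces NaN or Inf on extreme inputs."""
    model.eval()
    with torch.no_grad():
        for batch in edge_case_batches:
            output = model(batch)
            assert torch.isfinite(output).all(), "NaN/Inf in quantized model output"
    print("✅ No NaN/Inf on edge-case inputs")

edge_cases = [
    torch.zeros(1, 3, 224, 224),          # all-black image
    torch.ones(1, 3, 224, 224),           # all-white image
    torch.randn(1, 3, 224, 224) * 5,      # unusually high-contrast input
]
check_numerical_stability(quantized_model, edge_cases)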

Rollback Plan

  • Can easily revert to FP32 model if issues found
  • Monitoring in place to detect accuracy degradation
  • A/B testing plan to compare FP32 vs quantized

Skill Mastery Checklist

You have mastered quantization for inference when you can:

  • Recognize when quantization is appropriate (CPU/edge deployment, size/speed issues)
  • Choose correct quantization type (dynamic vs static vs QAT) based on requirements
  • Implement dynamic quantization in PyTorch (5 lines of code)
  • Implement static quantization with proper calibration (20 lines of code)
  • Select appropriate calibration data (validation set, 100-1000 samples)
  • Validate accuracy trade-offs systematically (baseline vs quantized)
  • Benchmark performance improvements (size, latency, throughput)
  • Decide when NOT to quantize (GPU-only, already fast, accuracy-critical)
  • Debug quantization issues (accuracy collapse, wrong speedup, numerical instability)
  • Deploy quantized models to production with confidence

Key insight: Quantization is not magic - it's a systematic trade-off of precision for performance. The skill is matching the right quantization approach to your specific requirements.