
Quantization for Inference Skill

When to Use This Skill

Use this skill when you observe these symptoms:

Performance Symptoms:

  • Model inference too slow on CPU (e.g., >10ms when you need <5ms)
  • Batch processing taking too long (low throughput)
  • Need to serve more requests per second with same hardware

Size Symptoms:

  • Model too large for edge devices (e.g., 100MB+ for mobile)
  • Want to fit more models in GPU memory
  • Memory-constrained deployment environment

Deployment Symptoms:

  • Deploying to CPU servers (quantization gives 2-4× CPU speedup)
  • Deploying to edge devices (mobile, IoT, embedded systems)
  • Cost-sensitive deployment (smaller models = lower hosting costs)

When NOT to use this skill:

  • Model already fast enough and small enough (no problem to solve)
  • Deploying exclusively on GPU with no memory constraints (modest benefit)
  • Prototyping phase where optimization is premature
  • Model so small that quantization overhead not worth it (e.g., <5MB)

Core Principle

Quantization trades precision for performance.

Quantization converts high-precision numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits or INT4: 4 bits). This provides:

  • 4-8× smaller model size (fewer bits per parameter)
  • 2-4× faster inference on CPU (INT8 operations faster than FP32)
  • Small accuracy loss (typically 0.5-1% for INT8)

Formula: Lower precision (FP32 → INT8 → INT4) = Smaller size + Faster inference + More accuracy loss

The skill is choosing the right precision for your accuracy tolerance.
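
To make the precision trade-off concrete, here is a minimal sketch of the affine FP32 → INT8 mapping most quantization backends use. The helper names and the signed [-128, 127] range are illustrative, not taken from a specific framework.

import numpy as np

def quantize_int8(x, x_min, x_max):
    """Affine FP32 -> INT8 mapping: q = round(x / scale) + zero_point."""
    scale = (x_max - x_min) / 255.0               # spread the observed range over 256 buckets
    zero_point = int(np.round(-x_min / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Approximate recovery of the original FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(8).astype(np.float32)
q, scale, zp = quantize_int8(x, x.min(), x.max())
x_hat = dequantize_int8(q, scale, zp)
print(np.abs(x - x_hat).max())                    # rounding error on the order of `scale`

Every parameter is stored in 8 bits instead of 32, and the error introduced is bounded by the quantization step size - which is why accuracy loss grows as precision drops.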

Quantization Framework

┌────────────────────────────────────────────┐
│   1. Recognize Quantization Need           │
│   CPU/Edge + (Slow OR Large)               │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   2. Choose Quantization Type              │
│   Dynamic → Static → QAT (increasing cost) │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   3. Calibrate (if Static/QAT)             │
│   100-1000 representative samples          │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   4. Validate Accuracy Trade-offs          │
│   Baseline vs Quantized accuracy           │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   5. Decide: Accept or Iterate             │
│   <2% loss → Deploy                        │
│   >2% loss → Try QAT or different precision│
└────────────────────────────────────────────┘

Part 1: Quantization Types

Type 1: Dynamic Quantization

What it does: Quantizes weights to INT8, keeps activations in FP32.

When to use:

  • Simplest quantization (no calibration needed)
  • Primary goal is size reduction
  • Batch processing where latency less critical
  • Quick experiment to see if quantization helps

Benefits:

  • 4× size reduction (weights, which dominate model size, shrink from 32 bits to 8 bits)
  • 1.2-1.5× CPU speedup (modest, because activations still FP32)
  • Minimal accuracy loss (~0.2-0.5%)
  • No calibration data needed

Limitations:

  • ⚠️ Limited CPU speedup (activations still FP32)
  • ⚠️ Not optimal for edge devices needing maximum performance

PyTorch implementation:

import os

import torch
import torch.quantization

# WHY: Dynamic quantization is simplest - just one function call
# No calibration data needed because activations stay FP32
model = torch.load('model.pth')
model.eval()  # WHY: Must be in eval mode (no batchnorm updates)

# WHY: Specify which layers to quantize (Linear, LSTM, etc.)
# These layers benefit most from quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},  # WHY: Quantize Linear layers only
    dtype=torch.qint8  # WHY: INT8 is standard precision
)

# Save quantized model
torch.save(quantized_model.state_dict(), 'model_quantized_dynamic.pth')

# Verify size reduction
original_size = os.path.getsize('model.pth') / (1024 ** 2)  # MB
quantized_size = os.path.getsize('model_quantized_dynamic.pth') / (1024 ** 2)
print(f"Original: {original_size:.1f}MB → Quantized: {quantized_size:.1f}MB")
print(f"Size reduction: {original_size / quantized_size:.1f}×")

Example use case: BERT classification model where primary goal is reducing size from 440MB to 110MB for easier deployment.

Type 2: Static Quantization (Post-Training Quantization)

What it does: Quantizes both weights and activations to INT8.

When to use:

  • Need maximum CPU speedup (2-4×)
  • Deploying to CPU servers or edge devices
  • Can afford calibration step (5-10 minutes)
  • Primary goal is inference speed

Benefits:

  • 4× size reduction (same as dynamic)
  • 2-4× CPU speedup (both weights and activations INT8)
  • No retraining required (post-training)
  • Acceptable accuracy loss (~0.5-1%)

Requirements:

  • ⚠️ Needs calibration data (100-1000 samples from validation set)
  • ⚠️ Slightly more complex setup than dynamic

PyTorch implementation:

import copy

import torch
import torch.quantization

def calibrate_model(model, calibration_loader):
    """
    Calibrate model by running representative data through it.

    WHY: Static quantization needs to know activation ranges.
    Calibration finds min/max values for each activation layer.

    Args:
        model: Model in eval mode with quantization stubs
        calibration_loader: DataLoader with 100-1000 samples
    """
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(calibration_loader):
            model(data)
            if batch_idx >= 100:  # WHY: 100 batches usually sufficient
                break
    return model

# Step 1: Prepare model for quantization
model = torch.load('model.pth')
model.eval()
original_model = copy.deepcopy(model)  # WHY: Keep an FP32 copy for the benchmark below

# WHY: Attach a quantization config and insert observers that record activation ranges
# NOTE: Eager-mode static quantization expects the model to define QuantStub/DeQuantStub
# at its input/output boundaries so PyTorch knows where to convert FP32 <-> INT8
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Step 2: Calibrate with representative data
# WHY: Must use data from training/validation set, not random data
# Calibration finds activation ranges - needs real distribution
calibration_dataset = torch.utils.data.Subset(
    val_dataset,
    indices=range(1000)  # WHY: 1000 samples sufficient for most models
)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)

model = calibrate_model(model, calibration_loader)

# Step 3: Convert to quantized model
torch.quantization.convert(model, inplace=True)

# Save quantized model
torch.save(model.state_dict(), 'model_quantized_static.pth')

# Benchmark speed improvement
import time

def benchmark(model, data, num_iterations=100):
    """WHY: Warm up model first, then measure average latency."""
    model.eval()
    # Warm up (first few iterations slower)
    for _ in range(10):
        model(data)

    start = time.time()
    with torch.no_grad():
        for _ in range(num_iterations):
            model(data)
    end = time.time()
    return (end - start) / num_iterations * 1000  # ms per inference

test_data = torch.randn(1, 3, 224, 224)  # Example input

baseline_latency = benchmark(original_model, test_data)
quantized_latency = benchmark(model, test_data)

print(f"Baseline: {baseline_latency:.2f}ms")
print(f"Quantized: {quantized_latency:.2f}ms")
print(f"Speedup: {baseline_latency / quantized_latency:.2f}×")

Example use case: ResNet50 image classifier for CPU inference - need <5ms latency, achieve 4ms with static quantization (vs 15ms baseline).

Type 3: Quantization-Aware Training (QAT)

What it does: Simulates quantization during training to minimize accuracy loss.

When to use:

  • Static quantization accuracy loss too large (>2%)
  • Need best possible accuracy with INT8
  • Can afford retraining (hours to days)
  • Critical production system with strict accuracy requirements

Benefits:

  • Best accuracy (~0.1-0.3% loss vs 0.5-1% for static)
  • 4× size reduction (same as dynamic/static)
  • 2-4× CPU speedup (same as static)

Limitations:

  • ⚠️ Requires retraining (most expensive option)
  • ⚠️ Takes hours to days depending on model size
  • ⚠️ More complex implementation

PyTorch implementation:

import torch
import torch.quantization

def train_one_epoch_qat(model, train_loader, optimizer, criterion):
    """
    Train one epoch with quantization-aware training.

    WHY: QAT inserts fake quantization ops during training.
    Model learns to be robust to quantization errors.
    """
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    return model

# Step 1: Prepare model for QAT
model = torch.load('model.pth')
model.train()

# WHY: QAT config includes fake quantization ops
# These simulate quantization during forward pass
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Step 2: Train with quantization-aware training
# WHY: Model learns to compensate for quantization errors
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)  # WHY: Low LR for fine-tuning
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 5  # WHY: Usually 5-10 epochs sufficient for QAT fine-tuning
for epoch in range(num_epochs):
    model = train_one_epoch_qat(model, train_loader, optimizer, criterion)
    print(f"Epoch {epoch+1}/{num_epochs} complete")

# Step 3: Convert to quantized model
model.eval()
torch.quantization.convert(model, inplace=True)

# Save QAT quantized model
torch.save(model.state_dict(), 'model_quantized_qat.pth')

Example use case: Medical imaging model where accuracy is critical - static quantization gives 2% accuracy loss, QAT reduces to 0.3%.

Part 2: Quantization Type Decision Matrix

| Type    | Complexity | Calibration | Retraining | Size Reduction | CPU Speedup | Accuracy Loss |
|---------|------------|-------------|------------|----------------|-------------|---------------|
| Dynamic | Low        | No          | No         | 4×             | 1.2-1.5×    | ~0.2-0.5%     |
| Static  | Medium     | Yes         | No         | 4×             | 2-4×        | ~0.5-1%       |
| QAT     | High       | Yes         | Yes        | 4×             | 2-4×        | ~0.1-0.3%     |

Decision flow:

  1. Start with dynamic quantization: Simplest, verify quantization helps
  2. Upgrade to static quantization: If need more speedup, can afford calibration
  3. Use QAT: Only if accuracy loss from static too large (rare)

Why this order? Incremental cost. Dynamic is free (5 minutes), static is cheap (15 minutes), QAT is expensive (hours/days). Don't pay for QAT unless you need it.
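
The same order can be written as a tiny helper. This is a sketch: the function name is illustrative and the 2% threshold mirrors the framework diagram above.

def choose_quantization_type(meets_requirements_with_dynamic, static_accuracy_loss_pct=None):
    """Walk the incremental-cost ladder: dynamic -> static -> QAT."""
    if meets_requirements_with_dynamic:
        return "dynamic"  # free: no calibration, no retraining
    if static_accuracy_loss_pct is None or static_accuracy_loss_pct <= 2.0:
        return "static"   # cheap: ~15 minutes with 100-1000 calibration samples
    return "qat"          # expensive: retraining, only when static loses >2%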

Part 3: Calibration Best Practices

What is Calibration?

Purpose: Find min/max ranges for each activation layer.

Why needed: Static quantization needs to know activation ranges to map FP32 → INT8. Without calibration, ranges are wrong → accuracy collapses.

How it works:

  1. Run representative data through model
  2. Record min/max activation values per layer
  3. Use these ranges to quantize activations at inference time
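
A minimal sketch of step 3, assuming the common affine scheme where a layer's observed range is mapped onto an unsigned 8-bit grid; the helper name is illustrative.

import numpy as np

def calibration_params(observed_min, observed_max, num_bits=8):
    """Turn a layer's observed activation range into (scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1        # unsigned INT8 range for activations
    scale = (observed_max - observed_min) / (qmax - qmin)
    zero_point = int(np.clip(np.round(qmin - observed_min / scale), qmin, qmax))
    return scale, zero_point

# e.g. a ReLU layer whose activations ranged over [0.0, 6.2] during calibration
print(calibration_params(0.0, 6.2))          # -> (~0.0243, 0)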

Calibration Data Requirements

Data source:

  • Use validation set samples (matches training distribution)
  • Don't use random images from internet (different distribution)
  • Don't use single image repeated (insufficient coverage)
  • Don't use training set that doesn't match deployment (distribution shift)

Data size:

  • Minimum: 100 samples (sufficient for simple models)
  • Recommended: 500-1000 samples (better coverage)
  • Maximum: Full validation set is overkill (slow, no benefit)

Data characteristics:

  • Must cover range of inputs model sees in production
  • Include edge cases (bright/dark images, long/short text)
  • Distribution should match deployment, not just training
  • Class balance less important than input diversity

Example calibration data selection:

import torch
import numpy as np

def select_calibration_data(val_dataset, num_samples=1000):
    """
    Select diverse calibration samples from validation set.

    WHY: Want samples that cover range of activation values.
    Random selection from validation set usually sufficient.

    Args:
        val_dataset: Full validation dataset
        num_samples: Number of calibration samples (default 1000)

    Returns:
        Calibration dataset subset
    """
    # WHY: Random selection ensures diversity
    # Stratified sampling can help ensure class coverage
    indices = np.random.choice(len(val_dataset), num_samples, replace=False)
    calibration_dataset = torch.utils.data.Subset(val_dataset, indices)

    return calibration_dataset

# Example: Select 1000 random samples from validation set
calibration_dataset = select_calibration_data(val_dataset, num_samples=1000)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)

Common Calibration Pitfalls

Pitfall 1: Using wrong data distribution

  • ❌ "Random images from internet" for an ImageNet-trained model
  • ✅ Use ImageNet validation set samples

Pitfall 2: Too few samples

  • ❌ 10 samples (insufficient coverage of activation ranges)
  • ✅ 100-1000 samples (good coverage)

Pitfall 3: Using training data that doesn't match deployment

  • ❌ Calibrate on sunny outdoor images, deploy on indoor images
  • ✅ Calibrate on data matching the deployment distribution

Pitfall 4: Skipping calibration validation

  • ❌ Calibrate once, assume it works
  • ✅ Validate accuracy after calibration to verify the ranges are good

Part 4: Precision Selection (INT8 vs INT4 vs FP16)

Precision Spectrum

| Precision | Bits | Size Reduction vs FP32 | Speedup (CPU) | Typical Accuracy Loss |
|-----------|------|------------------------|---------------|-----------------------|
| FP32      | 32   | 1×                     | 1×            | 0% (baseline)         |
| FP16      | 16   | 2×                     | 1.5×          | <0.1%                 |
| INT8      | 8    | 4×                     | 2-4×          | 0.5-1%                |
| INT4      | 4    | 8×                     | 4-8×          | 1-3%                  |

Trade-off: Lower precision = Smaller size + Faster inference + More accuracy loss

When to Use Each Precision

FP16 (Half Precision):

  • GPU inference (Tensor Cores optimized for FP16)
  • Need minimal accuracy loss (<0.1%)
  • Size reduction secondary concern
  • Example: Large language models on GPU
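
A minimal FP16 sketch for GPU inference, assuming a CUDA device and the same 'model.pth' checkpoint used in the earlier examples:

import torch

model = torch.load('model.pth')
model.eval().half().cuda()                     # cast weights to FP16 on the GPU

x = torch.randn(1, 3, 224, 224, device='cuda', dtype=torch.float16)
with torch.no_grad():
    output = model(x)                          # FP16 inference (uses Tensor Cores on recent GPUs)

# Alternative: torch.autocast(device_type='cuda', dtype=torch.float16) keeps FP32
# weights on disk and casts per-op at runtime.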

INT8 (Standard Quantization):

  • CPU inference (INT8 operations fast on CPU)
  • Edge device deployment
  • Good balance of size/speed/accuracy
  • Most common choice for production deployment
  • Example: Image classification on mobile devices

INT4 (Aggressive Quantization):

  • Extremely memory-constrained (e.g., 1GB mobile devices)
  • Can tolerate larger accuracy loss (1-3%)
  • Need maximum size reduction (8×)
  • Use sparingly - accuracy risk high
  • Example: Large language models (LLaMA-7B: 13GB → 3.5GB)

Decision Flow

def choose_precision(accuracy_tolerance, deployment_target):
    """
    Choose quantization precision based on requirements.

    WHY: Different precisions for different constraints.
    INT8 is default, FP16 for GPU, INT4 for extreme memory constraints.
    """
    if accuracy_tolerance < 0.1:
        return "FP16"  # Minimal accuracy loss required
    elif deployment_target == "GPU":
        return "FP16"  # GPU optimized for FP16
    elif deployment_target in ["CPU", "edge"]:
        return "INT8"  # CPU optimized for INT8
    elif deployment_target == "extreme_edge" and accuracy_tolerance > 1:
        return "INT4"  # Only if can tolerate 1-3% loss
    else:
        return "INT8"  # Default safe choice

Part 5: ONNX Quantization (Cross-Framework)

When to use: Deploying to ONNX Runtime (CPU/edge devices) or need cross-framework compatibility.

ONNX Static Quantization

import numpy as np
import torch

import onnxruntime
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, quantize_static

class CalibrationDataReaderWrapper(CalibrationDataReader):
    """
    WHY: ONNX requires custom calibration data reader.
    This class feeds calibration data to ONNX quantization engine.
    """
    def __init__(self, calibration_data):
        self.calibration_data = calibration_data
        self.iterator = iter(calibration_data)

    def get_next(self):
        """WHY: Called by ONNX to get next calibration batch."""
        try:
            data, _ = next(self.iterator)
            return {"input": data.numpy()}  # WHY: Return dict of input name → data
        except StopIteration:
            return None

# Step 1: Export PyTorch model to ONNX
model = torch.load('model.pth')
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    opset_version=13  # WHY: ONNX opset 13+ supports quantization ops
)

# Step 2: Prepare calibration data
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=1,  # WHY: ONNX calibration uses batch size 1
    shuffle=False
)
calibration_reader = CalibrationDataReaderWrapper(calibration_loader)

# Step 3: Quantize ONNX model
quantize_static(
    'model.onnx',
    'model_quantized.onnx',
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ  # WHY: QDQ format compatible with most backends
)

# Step 4: Benchmark ONNX quantized model
import time

session = onnxruntime.InferenceSession('model_quantized.onnx')
input_name = session.get_inputs()[0].name

test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warm up
for _ in range(10):
    session.run(None, {input_name: test_data})

# Benchmark
start = time.time()
for _ in range(100):
    session.run(None, {input_name: test_data})
end = time.time()

latency = (end - start) / 100 * 1000  # ms per inference
print(f"ONNX Quantized latency: {latency:.2f}ms")

ONNX advantages:

  • Cross-framework (works with PyTorch, TensorFlow, etc.)
  • Optimized ONNX Runtime for CPU inference
  • Good hardware backend support (x86, ARM)
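
If assembling calibration data is a blocker, ONNX Runtime also offers weight-only dynamic quantization. A minimal sketch, reusing the 'model.onnx' file exported above; as with PyTorch dynamic quantization, expect the size reduction but not the full static speedup.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Weight-only INT8 quantization: no calibration reader needed
quantize_dynamic(
    'model.onnx',
    'model_quantized_dynamic.onnx',
    weight_type=QuantType.QInt8
)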

Part 6: Accuracy Validation (Critical Step)

Why Accuracy Validation Matters

Quantization is lossy compression. Must measure accuracy impact:

  • Some models tolerate quantization well (<0.5% loss)
  • Some models sensitive to quantization (>2% loss)
  • Some layers more sensitive than others
  • Can't assume quantization is safe without measuring

Validation Methodology

def validate_quantization(original_model, quantized_model, val_loader):
    """
    Validate quantization by comparing accuracy.

    WHY: Quantization is lossy - must measure impact.
    Compare baseline vs quantized on same validation set.

    Returns:
        dict with baseline_acc, quantized_acc, accuracy_loss
    """
    def evaluate(model, data_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in data_loader:
                output = model(data)
                pred = output.argmax(dim=1)
                correct += (pred == target).sum().item()
                total += target.size(0)
        return 100.0 * correct / total

    baseline_acc = evaluate(original_model, val_loader)
    quantized_acc = evaluate(quantized_model, val_loader)
    accuracy_loss = baseline_acc - quantized_acc

    return {
        'baseline_acc': baseline_acc,
        'quantized_acc': quantized_acc,
        'accuracy_loss': accuracy_loss,
        'acceptable': accuracy_loss < 2.0  # WHY: <2% loss usually acceptable
    }

# Example validation
results = validate_quantization(original_model, quantized_model, val_loader)
print(f"Baseline accuracy: {results['baseline_acc']:.2f}%")
print(f"Quantized accuracy: {results['quantized_acc']:.2f}%")
print(f"Accuracy loss: {results['accuracy_loss']:.2f}%")
print(f"Acceptable: {results['acceptable']}")

# Decision logic
if results['acceptable']:
    print("✅ Quantization acceptable - deploy quantized model")
else:
    print("❌ Accuracy loss too large - try QAT or reconsider quantization")

Acceptable Accuracy Thresholds

General guidelines:

  • <1% loss: Excellent quantization result
  • 1-2% loss: Acceptable for most applications
  • 2-3% loss: Consider QAT to reduce loss
  • >3% loss: Quantization may not be suitable for this model

Task-specific thresholds:

  • Image classification: 1-2% top-1 accuracy loss acceptable
  • Object detection: 1-2% mAP loss acceptable
  • NLP classification: 0.5-1% accuracy loss acceptable
  • Medical/safety-critical: <0.5% loss required (use QAT)

Part 7: LLM Quantization (GPTQ, AWQ)

Note: This skill covers general quantization. For LLM-specific optimization (GPTQ, AWQ, KV cache, etc.), see the llm-inference-optimization skill in the llm-specialist pack.

LLM Quantization Overview

Why LLMs need quantization:

  • Very large (7B parameters = 13GB in FP16)
  • Memory-bound inference (limited by VRAM)
  • INT4 quantization: 13GB → 3.5GB (fits in consumer GPUs)

LLM-specific quantization methods:

  • GPTQ: Post-training quantization optimized for LLMs
  • AWQ: Activation-aware weight quantization (better quality than GPTQ)
  • Both: Achieve INT4 with <0.5 perplexity increase
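
The size figures above follow directly from parameter count × bits per weight; a quick back-of-the-envelope check:

params = 7e9                                   # LLaMA-7B parameter count

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30            # bytes -> GiB
    print(f"{name}: ~{gib:.1f} GB of weights")

# FP16: ~13.0 GB, INT8: ~6.5 GB, INT4: ~3.3 GB
# (INT4 checkpoints land closer to 3.5 GB once scales/zero-points are stored)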

When to Use LLM Quantization

Use when:

  • Deploying LLMs locally (consumer GPUs)
  • Memory-constrained (need to fit in 12GB/24GB VRAM)
  • Cost-sensitive (smaller models cheaper to host)
  • Latency-sensitive (smaller models faster to load)

Don't use when:

  • Have sufficient GPU memory for FP16
  • Accuracy critical (medical, legal applications)
  • Already using API (OpenAI, Anthropic) - they handle optimization

LLM Quantization References

For detailed LLM quantization:

  • See skill: llm-inference-optimization (llm-specialist pack)
  • Covers: GPTQ, AWQ, KV cache optimization, token streaming
  • Tools: llama.cpp, vLLM, text-generation-inference

Quick reference (defer to llm-specialist for details):

# GPTQ quantization (example - see llm-specialist for full details)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# WHY: GPTQ optimizes layer-wise for minimal perplexity increase
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Result: 13GB → 3.5GB, <0.5 perplexity increase

Part 8: When NOT to Quantize

Scenario 1: Already Fast Enough

Example: MobileNetV2 (14MB, 3ms CPU latency)

  • Quantization: 14MB → 4MB, 3ms → 2ms
  • Benefit: 10MB saved, 1ms faster
  • Cost: Calibration, validation, testing, debugging
  • Decision: Not worth effort unless specific requirement

Rule: If current performance meets requirements, don't optimize.

Scenario 2: GPU-Only Deployment with No Memory Constraints

Example: ResNet50 on Tesla V100 with 32GB VRAM

  • Quantization: 1.5-2× GPU speedup (modest)
  • FP32 already fast on GPU (Tensor Cores optimized)
  • No memory pressure (plenty of VRAM)
  • Decision: Focus on other bottlenecks (data loading, I/O)

Rule: Quantization is most beneficial for CPU inference and memory-constrained GPU.

Scenario 3: Accuracy-Critical Applications

Example: Medical diagnosis model where misdiagnosis has severe consequences

  • Quantization introduces accuracy loss (even if small)
  • Risk not worth benefit
  • Decision: Keep FP32, optimize other parts (batching, caching)

Rule: Safety-critical systems should avoid lossy compression unless thoroughly validated.

Scenario 4: Prototyping Phase

Example: Early development, trying different architectures

  • Quantization is optimization - premature at prototype stage
  • Focus on getting model working first
  • Decision: Defer quantization until production deployment

Rule: Don't optimize until you need to (Knuth: "Premature optimization is the root of all evil").

Part 9: Quantization Benchmarks (Expected Results)

Image Classification (ResNet50, ImageNet)

| Metric         | FP32 Baseline | Dynamic INT8      | Static INT8       | QAT INT8          |
|----------------|---------------|-------------------|-------------------|-------------------|
| Size           | 98MB          | 25MB (4×)         | 25MB (4×)         | 25MB (4×)         |
| CPU Latency    | 15ms          | 12ms (1.25×)      | 4ms (3.75×)       | 4ms (3.75×)       |
| Top-1 Accuracy | 76.1%         | 75.9% (0.2% loss) | 75.3% (0.8% loss) | 75.9% (0.2% loss) |

Insight: Static quantization gives 3.75× speedup with acceptable 0.8% accuracy loss.

Object Detection (YOLOv5s, COCO)

| Metric      | FP32 Baseline | Static INT8       | QAT INT8          |
|-------------|---------------|-------------------|-------------------|
| Size        | 14MB          | 4MB (3.5×)        | 4MB (3.5×)        |
| CPU Latency | 45ms          | 15ms (3×)         | 15ms (3×)         |
| mAP@0.5     | 37.4%         | 36.8% (0.6% loss) | 37.2% (0.2% loss) |

Insight: QAT gives better accuracy (0.2% vs 0.6% loss) with same speedup.

NLP Classification (BERT-base, GLUE)

| Metric      | FP32 Baseline | Dynamic INT8      | Static INT8       |
|-------------|---------------|-------------------|-------------------|
| Size        | 440MB         | 110MB (4×)        | 110MB (4×)        |
| CPU Latency | 35ms          | 28ms (1.25×)      | 12ms (2.9×)       |
| Accuracy    | 93.5%         | 93.2% (0.3% loss) | 92.8% (0.7% loss) |

Insight: Static quantization gives a 2.9× speedup; dynamic is sufficient when size reduction, not speed, is the main goal.

LLM Inference (LLaMA-7B)

| Metric              | FP16 Baseline | GPTQ INT4            | AWQ INT4             |
|---------------------|---------------|----------------------|----------------------|
| Size                | 13GB          | 3.5GB (3.7×)         | 3.5GB (3.7×)         |
| First Token Latency | 800ms         | 250ms (3.2×)         | 230ms (3.5×)         |
| Perplexity          | 5.68          | 5.82 (0.14 increase) | 5.77 (0.09 increase) |

Insight: AWQ gives better quality than GPTQ with similar speedup.

Part 10: Common Pitfalls and Solutions

Pitfall 1: Skipping Accuracy Validation

Issue: Deploy quantized model without measuring accuracy impact. Risk: Discover accuracy degradation in production (too late). Solution: Always validate accuracy on representative data before deployment.

# ❌ WRONG: Deploy without validation
quantized_model = quantize(model)
deploy(quantized_model)  # Hope it works!

# ✅ RIGHT: Validate before deployment
quantized_model = quantize(model)
results = validate_accuracy(original_model, quantized_model, val_loader)
if results['acceptable']:
    deploy(quantized_model)
else:
    print("Accuracy loss too large - try QAT")

Pitfall 2: Using Wrong Calibration Data

Issue: Calibrate with random/unrepresentative data. Risk: Activation ranges wrong → accuracy collapses. Solution: Use 100-1000 samples from validation set matching deployment distribution.

# ❌ WRONG: Random images from internet
calibration_data = download_random_images()

# ✅ RIGHT: Samples from validation set
calibration_data = torch.utils.data.Subset(val_dataset, range(1000))

Pitfall 3: Choosing Wrong Quantization Type

Issue: Use dynamic quantization when need static speedup. Risk: Get 1.2× speedup instead of 3× speedup. Solution: Match quantization type to requirements (dynamic for size, static for speed).

# ❌ WRONG: Use dynamic when need speed
if need_fast_cpu_inference:
    quantized_model = torch.quantization.quantize_dynamic(model)  # Only 1.2× speedup

# ✅ RIGHT: Use static for speed
if need_fast_cpu_inference:
    model = prepare_and_calibrate(model, calibration_data)
    quantized_model = torch.quantization.convert(model)  # 2-4× speedup

Pitfall 4: Quantizing GPU-Only Deployments

Issue: Quantize model for GPU inference without memory pressure. Risk: Effort not worth modest 1.5-2× GPU speedup. Solution: Only quantize GPU if memory-constrained (multiple models in VRAM).

# ❌ WRONG: Quantize for GPU with no memory issue
if deployment_target == "GPU" and have_plenty_of_memory:
    quantized_model = quantize(model)  # Wasted effort

# ✅ RIGHT: Skip quantization if not needed
if deployment_target == "GPU" and have_plenty_of_memory:
    deploy(model)  # Keep FP32, focus on other optimizations

Pitfall 5: Over-Quantizing (INT4 When INT8 Sufficient)

Issue: Use aggressive INT4 quantization when INT8 would suffice. Risk: Larger accuracy loss than necessary. Solution: Start with INT8 (standard), only use INT4 if extreme memory constraints.

# ❌ WRONG: Jump to INT4 without trying INT8
quantized_model = quantize(model, precision="INT4")  # 2-3% accuracy loss

# ✅ RIGHT: Start with INT8, only use INT4 if needed
quantized_model_int8 = quantize(model, precision="INT8")  # 0.5-1% accuracy loss
if model_still_too_large:
    quantized_model_int4 = quantize(model, precision="INT4")

Pitfall 6: Assuming All Layers Quantize Equally

Issue: Quantize all layers uniformly, but some layers more sensitive. Risk: Accuracy loss dominated by few sensitive layers. Solution: Use mixed precision - keep sensitive layers in FP32/INT8, quantize others to INT4.

# ✅ ADVANCED: Mixed precision quantization (FX graph mode, sketch)
# Keep sensitive first/last layers in FP32, quantize the rest to INT8
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

qconfig_mapping = QConfigMapping()
qconfig_mapping.set_global(get_default_qconfig('fbgemm'))  # INT8 default
qconfig_mapping.set_module_name('layer1', None)  # None = keep this layer FP32
qconfig_mapping.set_module_name('fc', None)      # module names depend on your model

example_inputs = (torch.randn(1, 3, 224, 224),)
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared = calibrate_model(prepared, calibration_loader)  # reuse calibration from Part 1
quantized_model = convert_fx(prepared)

Part 11: Decision Framework Summary

Step 1: Recognize Quantization Need

Symptoms:

  • Model too slow on CPU (>10ms when you need <5ms)
  • Model too large for edge devices (>50MB)
  • Deploying to CPU/edge (not GPU)
  • Need to reduce hosting costs

If YES → Proceed to Step 2
If NO → Don't quantize, focus on other optimizations

Step 2: Choose Quantization Type

Start with Dynamic:
├─ Sufficient? (meets latency/size requirements)
│  ├─ YES → Deploy dynamic quantized model
│  └─ NO → Proceed to Static
│
Static Quantization:
├─ Sufficient? (meets latency/size + accuracy acceptable)
│  ├─ YES → Deploy static quantized model
│  └─ NO → Accuracy loss >2%
│     │
│     └─ Proceed to QAT
│
QAT:
├─ Train with quantization awareness
└─ Achieves <1% accuracy loss → Deploy

Step 3: Calibrate (if Static/QAT)

Calibration data:

  • Source: Validation set (representative samples)
  • Size: 100-1000 samples
  • Characteristics: Match deployment distribution

Calibration process:

  1. Select samples from validation set
  2. Run through model to collect activation ranges
  3. Validate accuracy after calibration
  4. If accuracy loss >2%, try different calibration data or QAT

Step 4: Validate Accuracy

Required measurements:

  • Baseline accuracy (FP32)
  • Quantized accuracy (INT8/INT4)
  • Accuracy loss (baseline - quantized)
  • Acceptable threshold (typically <2%)

Decision:

  • If accuracy loss <2% → Deploy
  • If accuracy loss >2% → Try QAT or reconsider quantization

Step 5: Benchmark Performance

Required measurements:

  • Model size (MB): baseline vs quantized
  • Inference latency (ms): baseline vs quantized
  • Throughput (requests/sec): baseline vs quantized

Verify expected results:

  • Size: 4× reduction (FP32 → INT8)
  • CPU speedup: 2-4× (static quantization)
  • GPU speedup: 1.5-2× (if applicable)
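
A small helper that gathers all three measurements in one place. It is a sketch: it reuses the benchmark() latency function from the static-quantization example and assumes the model files and original_model/quantized_model objects from the earlier snippets.

import os

import torch

def benchmark_report(fp32_path, int8_path, fp32_model, int8_model, sample_input):
    """Print size, latency, and single-stream throughput for both models."""
    for name, path, model in [("FP32", fp32_path, fp32_model),
                              ("INT8", int8_path, int8_model)]:
        size_mb = os.path.getsize(path) / 1024**2
        latency_ms = benchmark(model, sample_input)  # benchmark() from the static example
        throughput = 1000.0 / latency_ms             # requests/sec at batch size 1
        print(f"{name}: {size_mb:.1f}MB | {latency_ms:.2f}ms | {throughput:.0f} req/s")

benchmark_report('model.pth', 'model_quantized_static.pth',
                 original_model, quantized_model,
                 torch.randn(1, 3, 224, 224))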

Part 12: Production Deployment Checklist

Before deploying quantized model to production:

Accuracy Validated

  • Baseline accuracy measured on validation set
  • Quantized accuracy measured on same validation set
  • Accuracy loss within acceptable threshold (<2%)
  • Validated on representative production data

Performance Benchmarked

  • Size reduction measured (expect 4× for INT8)
  • Latency improvement measured (expect 2-4× CPU)
  • Throughput improvement measured
  • Performance meets requirements

Calibration Verified (if static/QAT)

  • Used representative samples from validation set (not random data)
  • Used sufficient calibration data (100-1000 samples)
  • Calibration data matches deployment distribution

Edge Cases Tested

  • Tested on diverse inputs (bright/dark images, long/short text)
  • Validated numerical stability (no NaN/Inf outputs)
  • Tested inference on target hardware (CPU/GPU/edge device)
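
A quick sanity check for the NaN/Inf item. The edge-case tensors below are illustrative stand-ins for real extreme inputs, and quantized_model is the converted model from the earlier examples.

import torch

def check_numerical_stability(model, edge_case_batches):
    """Fail loudly if the quantized model produces NaN or Inf on extreme inputs."""
    model.eval()
    with torch.no_grad():
        for batch in edge_case_batches:
            output = model(batch)
            assert torch.isfinite(output).all(), "NaN/Inf in quantized model output"
    print("✅ No NaN/Inf on edge-case inputs")

edge_cases = [
    torch.zeros(1, 3, 224, 224),          # all-black image
    torch.ones(1, 3, 224, 224),           # all-white image
    torch.randn(1, 3, 224, 224) * 5,      # unusually high-contrast input
]
check_numerical_stability(quantized_model, edge_cases)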

Rollback Plan

  • Can easily revert to FP32 model if issues found
  • Monitoring in place to detect accuracy degradation
  • A/B testing plan to compare FP32 vs quantized

Skill Mastery Checklist

You have mastered quantization for inference when you can:

  • Recognize when quantization is appropriate (CPU/edge deployment, size/speed issues)
  • Choose correct quantization type (dynamic vs static vs QAT) based on requirements
  • Implement dynamic quantization in PyTorch (5 lines of code)
  • Implement static quantization with proper calibration (20 lines of code)
  • Select appropriate calibration data (validation set, 100-1000 samples)
  • Validate accuracy trade-offs systematically (baseline vs quantized)
  • Benchmark performance improvements (size, latency, throughput)
  • Decide when NOT to quantize (GPU-only, already fast, accuracy-critical)
  • Debug quantization issues (accuracy collapse, wrong speedup, numerical instability)
  • Deploy quantized models to production with confidence

Key insight: Quantization is not magic - it's a systematic trade-off of precision for performance. The skill is matching the right quantization approach to your specific requirements.