# Model Compression Techniques

## When to Use This Skill

Use this skill when:
- Deploying models to edge devices (mobile, IoT, embedded systems)
- Model too large for deployment constraints (storage, memory, bandwidth)
- Inference costs too high (need a smaller/faster model)
- Need to balance model size, speed, and accuracy
- Combining multiple compression techniques (quantization + pruning + distillation)

**When NOT to use:**
- Model already fits deployment constraints (compression unnecessary)
- Training optimization needed (use training-optimization pack instead)
- Quantization alone is sufficient (use quantization-for-inference instead)
- LLM-specific optimization (use llm-specialist for KV cache, speculative decoding)

**Relationship with quantization-for-inference:**
- Quantization: reduce precision (FP32 → INT8/INT4) - 4× size reduction
- Compression: reduce architecture (pruning, distillation) - 2-10× size reduction
- Often combined: quantization + pruning + distillation = 10-50× total reduction (see the worked example below)

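The ratios above multiply rather than add. A quick sanity check of the stacking, using the BERT-base figures quoted later in this document (440MB original, 22MB final) purely as an illustration:

```python
# Compression ratios stack multiplicatively (illustrative figures only).
original_mb = 440        # e.g., BERT-base in FP32
quantization = 4.0       # FP32 -> INT8
pruning = 2.0            # 50% structured pruning
distillation = 2.5       # progressive distillation

total_ratio = quantization * pruning * distillation  # 20x
final_mb = original_mb / total_ratio                 # 22 MB

print(f"{total_ratio:.0f}x total -> {final_mb:.0f} MB")  # 20x total -> 22 MB
```
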
## Core Principle

**Compression is not one-size-fits-all. Architecture and deployment target determine the technique.**

Without systematic compression:
- Mobile deployment: a 440MB model crashes 2GB devices
- Wrong technique: pruning transformers → 33pp accuracy drop
- Unstructured pruning: no speedup on standard hardware
- Aggressive distillation: 77× compression produces gibberish
- No recovery step: 5pp of preventable accuracy loss

**Formula:** Architecture analysis (transformer vs CNN) + deployment constraints (hardware, latency, size) + technique selection (pruning vs distillation) + quality preservation (recovery, progressive compression) = production-ready compressed model.

## Compression Decision Framework

```
Model Compression Decision Tree

1. What is the target deployment?
   ├─ Edge/Mobile (strict size/memory) → Aggressive compression (4-10×)
   ├─ Cloud/Server (cost optimization) → Moderate compression (2-4×)
   └─ On-premises (moderate constraints) → Balanced approach

2. What is the model architecture?
   ├─ Transformer (BERT, GPT, T5)
   │   ├─ Primary: Knowledge distillation (preserves attention)
   │   ├─ Secondary: Layer dropping, quantization
   │   └─ AVOID: Aggressive unstructured pruning (destroys quality)
   │
   ├─ CNN (ResNet, EfficientNet, MobileNet)
   │   ├─ Primary: Structured channel pruning (works well)
   │   ├─ Secondary: Quantization (INT8 standard)
   │   └─ Tertiary: Knowledge distillation (classification tasks)
   │
   └─ RNN/LSTM
       ├─ Primary: Quantization (safe, effective)
       ├─ Secondary: Structured pruning (hidden dimension)
       └─ AVOID: Unstructured pruning (breaks sequential dependencies)

3. What is the deployment hardware?
   ├─ CPU/GPU/Mobile (standard) → Structured pruning (actual speedup)
   └─ Specialized (A100, sparse accelerators) → Unstructured pruning possible

4. What is the acceptable quality loss?
   ├─ <2pp → Conservative: Quantization only (4× reduction)
   ├─ 2-5pp → Moderate: Quantization + structured pruning (6-10× reduction)
   └─ >5pp → Aggressive: Full pipeline with distillation (10-50× reduction)

5. Combine techniques for maximum compression:
   Quantization (4×) + Pruning (2×) + Distillation (2×) = 16× total reduction
```

## Part 1: Structured vs Unstructured Pruning

### When to Use Each

**Unstructured Pruning:**
- **Use when:** Sparse hardware is available (NVIDIA A100, specialized sparse accelerators)
- **Benefit:** Highest compression (70-90% sparsity possible)
- **Drawback:** No speedup on standard hardware (the zeros are still stored and computed; see the sketch below)
- **Hardware support:** Rare (most deployments use standard CPU/GPU)

**Structured Pruning:**
- **Use when:** Standard hardware (CPU, GPU, mobile) - 99% of deployments
- **Benefit:** Actual speedup (smaller dense matrices)
- **Drawback:** Lower compression ratio (50-70% typical)
- **Variants:** Channel pruning (CNNs), layer dropping (transformers), attention head pruning

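A minimal sketch of why unstructured pruning alone does not shrink or speed up a model on standard hardware: `torch.nn.utils.prune` zeroes weights through a mask, but the tensor keeps its dense shape (the layer size here is an arbitrary illustration value):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# L1 unstructured pruning: zero out the 70% of weights with smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.7)

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.0%}")                     # ~70% of entries are zero
print(f"Weight shape: {tuple(layer.weight.shape)}")    # still (1024, 1024): dense storage

# prune.remove() folds the mask into the weight, but the tensor stays dense,
# so a standard GEMM performs the same number of multiply-adds as before.
prune.remove(layer, "weight")
```
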
### Structured Channel Pruning (CNNs)

**Problem:** Unstructured pruning creates sparse tensors that don't accelerate on standard hardware.

**Solution:** Remove entire channels to create a smaller dense model (actual speedup).

```python
import copy
import torch
import torch.nn as nn
import torch_pruning as tp

def structured_channel_pruning_cnn(model, pruning_ratio=0.5, example_input=None):
    """
    Structured channel pruning for CNNs (actual speedup on all hardware).

    WHY structured: Removes entire channels/filters, creating a smaller dense model
    WHY it works: Smaller dense matrices compute faster than sparse matrices on standard hardware

    Args:
        model: CNN model to prune (modified in place)
        pruning_ratio: Fraction of channels to remove (0.5 = remove 50%)
        example_input: Example input tensor for tracing layer dependencies

    Returns:
        Pruned model (smaller, faster)
    """
    if example_input is None:
        example_input = torch.randn(1, 3, 224, 224)

    # Define the importance metric (L1 norm of channels)
    # WHY L1 norm: Channels with small L1 norm contribute less to the output
    importance = tp.importance.MagnitudeImportance(p=1)

    # Create the pruner
    pruner = tp.pruner.MagnitudePruner(
        model,
        example_inputs=example_input,
        importance=importance,
        pruning_ratio=pruning_ratio,
        global_pruning=False  # Prune each layer independently
    )

    # Execute pruning (removes channels, creates a smaller model)
    # WHY remove channels: e.g. Conv2d(64, 128) → Conv2d(32, 64) after 50% pruning
    pruner.step()

    return model

# Example: Prune ResNet18
# (get_model_size, count_parameters, benchmark_inference, evaluate, fine_tune
#  are defined in the helper sections below)
from torchvision.models import resnet18

model = resnet18(pretrained=True)
model_original = copy.deepcopy(model)  # keep an unpruned copy: pruning modifies the model in place
print(f"Original model size: {get_model_size(model):.1f}MB")  # 44.7MB
print(f"Original params: {count_parameters(model):,}")        # 11,689,512

# Apply 50% channel pruning
model_pruned = structured_channel_pruning_cnn(
    model,
    pruning_ratio=0.5,
    example_input=torch.randn(1, 3, 224, 224)
)

print(f"Pruned model size: {get_model_size(model_pruned):.1f}MB")  # 22.4MB (50% reduction)
print(f"Pruned params: {count_parameters(model_pruned):,}")        # 5,844,756 (50% reduction)

# Benchmark inference speed
# WHY faster: Smaller dense matrices (fewer FLOPs, less memory bandwidth)
original_time = benchmark_inference(model_original)  # 25ms
pruned_time = benchmark_inference(model_pruned)      # 12.5ms (2× faster)

# Accuracy (before fine-tuning)
original_acc = evaluate(model_original)  # 69.8%
pruned_acc = evaluate(model_pruned)      # 64.2% (5.6pp drop - needs fine-tuning)

# Fine-tune to recover accuracy
fine_tune(model_pruned, epochs=5, lr=1e-4)
pruned_acc_recovered = evaluate(model_pruned)  # 68.5% (1.3pp drop, acceptable)
```

**Helper functions:**

```python
def get_model_size(model):
    """Approximate model size in MB (assumes FP32, 4 bytes per parameter)."""
    param_size = sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)
    return param_size

def count_parameters(model):
    """Count trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def benchmark_inference(model, num_runs=100):
    """Benchmark average inference time (ms) for a single 224×224 image."""
    import time
    model.eval()
    example_input = torch.randn(1, 3, 224, 224)

    # Warmup
    with torch.no_grad():
        for _ in range(10):
            model(example_input)

    # Benchmark
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            model(example_input)
    end = time.time()

    return (end - start) / num_runs * 1000  # Convert to ms
```

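The examples in this document also call `evaluate()` and `fine_tune()`, which are task-specific and never defined here. A minimal sketch of what they are assumed to do, using hypothetical `val_loader`/`train_loader` DataLoaders and a classification objective:

```python
import torch
import torch.nn.functional as F

def evaluate(model, val_loader=None):
    """Top-1 accuracy (%) over a validation DataLoader (task-specific in practice)."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in val_loader:  # val_loader is assumed to be available
            outputs = model(inputs)
            logits = outputs.logits if hasattr(outputs, "logits") else outputs  # HF models wrap outputs
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / total

def fine_tune(model, epochs=3, lr=1e-4, train_loader=None):
    """Plain cross-entropy fine-tuning loop used after pruning/quantization steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, labels in train_loader:  # train_loader is assumed to be available
            outputs = model(inputs)
            logits = outputs.logits if hasattr(outputs, "logits") else outputs
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
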
### Iterative Pruning (Quality Preservation)

**Problem:** Pruning all at once (50% in one step) → ~10pp accuracy drop.

**Solution:** Iterative pruning (5 steps of ~13% each, compounding to 50%) → ~2pp accuracy drop.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_pruning(model, target_ratio=0.5, num_iterations=5, finetune_epochs=2):
    """
    Iterative pruning with fine-tuning between steps.

    WHY iterative: Gradual pruning allows the model to adapt
    WHY fine-tune: Remaining weights compensate for removed weights

    Example: 50% pruning
    - One-shot: 50% in one step → ~10pp accuracy drop
    - Iterative: 5 smaller steps → ~2pp accuracy drop (better!)

    Args:
        model: Model to prune
        target_ratio: Final pruning ratio (0.5 = remove 50% of weights)
        num_iterations: Number of pruning steps (more = more gradual = better quality)
        finetune_epochs: Fine-tuning epochs after each step

    Returns:
        Pruned model with quality preservation
    """
    # Calculate the pruning amount per iteration
    # WHY: Each step prunes a fraction of the *remaining* weights, so the per-step
    # amount compounds to the target: 1 - (1 - amount)^num_iterations = target_ratio
    amount_per_iteration = 1 - (1 - target_ratio) ** (1 / num_iterations)

    print(f"Pruning {target_ratio*100:.0f}% in {num_iterations} steps")
    print(f"Amount per step: {amount_per_iteration*100:.1f}%")

    parameters_to_prune = [
        (module, "weight")
        for module in model.modules()
        if isinstance(module, (nn.Linear, nn.Conv2d))
    ]

    for step in range(num_iterations):
        print(f"\n=== Iteration {step+1}/{num_iterations} ===")

        # Prune this iteration
        # WHY global_unstructured: Prune across all layers (balanced sparsity)
        prune.global_unstructured(
            parameters_to_prune,
            pruning_method=prune.L1Unstructured,
            amount=amount_per_iteration
        )

        # Evaluate current accuracy
        acc_before_finetune = evaluate(model)
        print(f"Accuracy after pruning: {acc_before_finetune:.2f}%")

        # Fine-tune to recover accuracy
        # WHY: Allow the remaining weights to compensate for the removed weights
        fine_tune(model, epochs=finetune_epochs, lr=1e-4)

        acc_after_finetune = evaluate(model)
        print(f"Accuracy after fine-tuning: {acc_after_finetune:.2f}%")

    # Make pruning permanent (remove masks) only after the final iteration
    # WHY: Keeping the masks between steps lets each new step act on the not-yet-pruned
    # weights, so the per-step amounts compound to the target ratio as intended
    for module, param_name in parameters_to_prune:
        prune.remove(module, param_name)

    return model

# Example usage
model = resnet18(pretrained=True)
original_acc = evaluate(model)  # 69.8%

# One-shot pruning (worse quality)
model_oneshot = copy.deepcopy(model)
prune.global_unstructured(
    [(m, "weight") for m in model_oneshot.modules() if isinstance(m, (nn.Linear, nn.Conv2d))],
    pruning_method=prune.L1Unstructured,
    amount=0.5  # Prune 50% immediately
)
oneshot_acc = evaluate(model_oneshot)  # 59.7% (10.1pp drop!)

# Iterative pruning (better quality)
model_iterative = copy.deepcopy(model)
model_iterative = iterative_pruning(
    model_iterative,
    target_ratio=0.5,    # 50% pruning
    num_iterations=5,    # Gradual over 5 steps
    finetune_epochs=2    # Fine-tune after each step
)
iterative_acc = evaluate(model_iterative)  # 67.5% (2.3pp drop, much better!)

# Quality comparison:
# - One-shot: 10.1pp drop (unacceptable)
# - Iterative: 2.3pp drop (acceptable)
```

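A quick check of the per-step schedule used above (pure arithmetic, no model required): each step removes a fraction of the weights that remain, so the steps compound to the target ratio.

```python
target_ratio, num_iterations = 0.5, 5
amount = 1 - (1 - target_ratio) ** (1 / num_iterations)  # ≈ 0.129 per step

remaining = 1.0
for _ in range(num_iterations):
    remaining *= (1 - amount)

print(f"Per-step amount: {amount:.3f}")       # 0.129
print(f"Total pruned: {1 - remaining:.3f}")   # 0.500
```
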
### Structured Layer Pruning (Transformers)

**Problem:** Transformers are sensitive to unstructured pruning (it destroys attention patterns).

**Solution:** Drop entire layers (structured pruning for transformers).

```python
def drop_transformer_layers(model, num_layers_to_drop=6):
    """
    Drop transformer layers (structured pruning for transformers).

    WHY drop layers: Transformers learn hierarchical features; later layers refine earlier ones
    WHY not unstructured: Attention patterns are dense, pruning destroys them

    Example: BERT-base (12 layers) → 6 layers
    - Size: 440MB → 220MB (2× reduction)
    - Speed: 2× faster (half the layers)
    - Accuracy: 95% → 92% (3pp drop with fine-tuning)

    Args:
        model: BERT-style sequence classification model (adapt attribute paths for GPT, T5, etc.)
        num_layers_to_drop: Number of layers to remove

    Returns:
        Smaller transformer model
    """
    # Identify which layers to drop
    # WHY drop middle layers: Keep early layers (low-level features) and late layers (task-specific)
    # Alternative: Drop early or late layers depending on the task
    encoder_layers = model.bert.encoder.layer  # BertForSequenceClassification example
    total_layers = len(encoder_layers)
    layers_to_keep = total_layers - num_layers_to_drop

    # Keep the first and last layers, drop the middle ones
    keep_front = layers_to_keep // 2
    keep_back = layers_to_keep - keep_front
    new_layers = list(encoder_layers[:keep_front]) + list(encoder_layers[total_layers - keep_back:])
    model.bert.encoder.layer = nn.ModuleList(new_layers)

    # Update the config
    model.config.num_hidden_layers = layers_to_keep

    print(f"Dropped {num_layers_to_drop} layers ({total_layers} → {layers_to_keep})")

    return model

# Example: Compress BERT-base
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
print(f"Original: {model.config.num_hidden_layers} layers, {get_model_size(model):.0f}MB")
# Original: 12 layers, 440MB

# Accuracy before dropping layers (the drop below modifies the model in place)
original_acc = evaluate(model)  # 95.2%

# Drop 6 layers (50% reduction)
model_compressed = drop_transformer_layers(model, num_layers_to_drop=6)
print(f"Compressed: {model_compressed.config.num_hidden_layers} layers, {get_model_size(model_compressed):.0f}MB")
# Compressed: 6 layers, 220MB

# Accuracy before fine-tuning
compressed_acc = evaluate(model_compressed)  # 88.5% (6.7pp drop)

# Fine-tune to recover accuracy
# WHY fine-tune: The remaining layers adapt to the missing layers
fine_tune(model_compressed, epochs=3, lr=2e-5)
compressed_acc_recovered = evaluate(model_compressed)  # 92.1% (3.1pp drop, acceptable)
```

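Another structured option mentioned earlier is attention head pruning. Hugging Face transformer models expose a `prune_heads()` method for this; a minimal sketch (the particular heads dropped here are a fixed, purely illustrative choice; in practice head importance is estimated first, e.g. from gradient-based or attention-entropy scores):

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# {layer_index: [head indices to remove]} - illustrative selection only
heads_to_prune = {0: [0, 1], 1: [2], 11: [4, 5, 6]}
model.prune_heads(heads_to_prune)

print(model.config.pruned_heads)  # records which heads were removed per layer
```

As with layer dropping, follow head pruning with a short fine-tuning pass so the remaining heads compensate.
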
## Part 2: Knowledge Distillation

### Progressive Distillation (Quality Preservation)

**Problem:** Single-stage aggressive distillation fails (77× compression → unusable quality).

**Solution:** Progressive distillation in multiple stages (2-4× per stage).

```python
def progressive_distillation(
    teacher,
    num_stages=2,
    compression_per_stage=2.5,
    distill_epochs=10,
    finetune_epochs=3
):
    """
    Progressive knowledge distillation (quality preservation for aggressive compression).

    WHY progressive: A large capacity gap (teacher → tiny student) loses too much knowledge
    WHY multi-stage: A smooth transition preserves quality (teacher → intermediate → final)

    Example: 6× compression
    - Single-stage: 774M → 130M (6× in one step) → 15pp accuracy drop (bad)
    - Progressive: 774M → 310M → 130M (2.5× per stage) → 5pp accuracy drop (better)

    Assumes a `train_loader` DataLoader is available in the surrounding scope.

    Args:
        teacher: Large pre-trained model (e.g., BERT-large, GPT-2 Large)
        num_stages: Number of distillation stages (2-3 typical)
        compression_per_stage: Compression ratio per stage (2-4× is safe)
        distill_epochs: Distillation training epochs per stage
        finetune_epochs: Fine-tuning epochs on hard labels

    Returns:
        Final compressed student model
    """
    current_teacher = teacher
    teacher_params = count_parameters(teacher)

    print(f"Teacher: {teacher_params:,} params")

    for stage in range(num_stages):
        print(f"\n=== Stage {stage+1}/{num_stages} ===")

        # Calculate the student capacity for this stage
        # WHY: Reduce by compression_per_stage at each stage
        student_params = int(teacher_params / (compression_per_stage ** (stage + 1)))

        # Create the student architecture (model-specific)
        # WHY smaller: Fewer layers, smaller hidden dimension, fewer heads
        student = create_student_model(
            teacher_architecture=current_teacher,
            target_params=student_params
        )

        print(f"Student {stage+1}: {count_parameters(student):,} params")

        # Step A: Distillation training (learn from the teacher)
        # WHY soft targets: The teacher's probability distribution is richer than hard labels
        student = train_distillation(
            teacher=current_teacher,
            student=student,
            train_loader=train_loader,
            epochs=distill_epochs,
            temperature=2.0,  # WHY 2.0: Softer probabilities (more knowledge transfer)
            alpha=0.7         # WHY 0.7: Weight the distillation loss higher than the hard loss
        )

        # Step B: Fine-tuning on hard labels (task optimization)
        # WHY: Optimize the student for actual task performance (not just mimicking the teacher)
        student = fine_tune_on_labels(
            student=student,
            train_loader=train_loader,
            epochs=finetune_epochs,
            lr=2e-5
        )

        # Evaluate this stage
        teacher_acc = evaluate(current_teacher)
        student_acc = evaluate(student)
        print(f"Teacher accuracy: {teacher_acc:.2f}%")
        print(f"Student accuracy: {student_acc:.2f}% (drop: {teacher_acc - student_acc:.2f}pp)")

        # The student becomes the teacher for the next stage
        current_teacher = student

    return student

def create_student_model(teacher_architecture, target_params):
    """
    Create a student model with (approximately) the target parameter count.

    WHY: Match the architecture type but scale down the capacity
    """
    from transformers import BertConfig, BertForSequenceClassification

    # Example for BERT
    if isinstance(teacher_architecture, BertForSequenceClassification):
        # Scale down: fewer layers, smaller hidden size, fewer heads
        # WHY: Preserve the architecture but reduce capacity
        teacher_config = teacher_architecture.config

        # Rough scaling heuristic (parameters grow faster than linearly in width)
        scaling_factor = (target_params / count_parameters(teacher_architecture)) ** 0.5

        student_config = BertConfig(
            num_hidden_layers=max(1, int(teacher_config.num_hidden_layers * scaling_factor)),
            hidden_size=int(teacher_config.hidden_size * scaling_factor),
            num_attention_heads=max(1, int(teacher_config.num_attention_heads * scaling_factor)),
            intermediate_size=int(teacher_config.intermediate_size * scaling_factor),
            num_labels=teacher_config.num_labels
        )

        return BertForSequenceClassification(student_config)

    # Add other architectures as needed
    raise ValueError(f"Unsupported architecture: {type(teacher_architecture)}")

def train_distillation(teacher, student, train_loader, epochs, temperature, alpha):
    """
    Train the student to mimic the teacher (knowledge distillation).

    WHY distillation loss: The student learns soft targets (probability distributions)
    WHY temperature: Softens the probabilities (exposes "dark knowledge")
    """
    import torch.nn.functional as F

    teacher.eval()   # WHY: The teacher is frozen (pre-trained knowledge)
    student.train()

    optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

    for epoch in range(epochs):
        total_loss = 0

        for batch, labels in train_loader:
            # Teacher predictions (soft targets)
            with torch.no_grad():
                teacher_logits = teacher(batch).logits

            # Student predictions
            student_logits = student(batch).logits

            # Distillation loss (KL divergence with temperature scaling)
            # WHY temperature: Softens probabilities, exposes similarities between classes
            soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
            soft_predictions = F.log_softmax(student_logits / temperature, dim=-1)

            distillation_loss = F.kl_div(
                soft_predictions,
                soft_targets,
                reduction='batchmean'
            ) * (temperature ** 2)  # WHY T^2: Keeps gradient magnitudes comparable across temperatures

            # Hard label loss (cross-entropy with ground truth)
            hard_loss = F.cross_entropy(student_logits, labels)

            # Combined loss
            # WHY alpha: Balance distillation (learn from teacher) and hard loss (task performance)
            loss = alpha * distillation_loss + (1 - alpha) * hard_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader):.4f}")

    return student

# Example usage: Compress GPT-2 Large (774M) toward GPT-2 Small scale (~124M)
# NOTE: create_student_model above only handles BERT; a GPT-2 student config is built analogously
from transformers import GPT2LMHeadModel

teacher = GPT2LMHeadModel.from_pretrained('gpt2-large')  # 774M params

# Progressive distillation (2 stages, 2.5× per stage = 6.25× total)
student_final = progressive_distillation(
    teacher=teacher,
    num_stages=2,
    compression_per_stage=2.5,  # 2.5× per stage
    distill_epochs=10,
    finetune_epochs=3
)

# Results:
# - Teacher (GPT-2 Large): 774M params, perplexity 18.5
# - Student 1 (intermediate): 310M params, perplexity 22.1 (+3.6)
# - Student 2 (final): 124M params, perplexity 28.5 (+10.0)
# - Single-stage (direct 774M → 124M): perplexity 45.2 (+26.7, much worse!)
```

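`fine_tune_on_labels` is referenced above but not defined in this document. A minimal sketch under the same assumptions (classification head, `(batch, labels)` pairs from the DataLoader):

```python
import torch
import torch.nn.functional as F

def fine_tune_on_labels(student, train_loader, epochs=3, lr=2e-5):
    """Standard cross-entropy fine-tuning of the student on ground-truth labels."""
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for _ in range(epochs):
        for batch, labels in train_loader:
            logits = student(batch).logits
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```
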
### Capacity Matching Guidelines

**Problem:** A student that is too small can't absorb the teacher's knowledge; one that is too large is inefficient.

**Solution:** Match student capacity to the compression target and quality tolerance.

```python
def calculate_optimal_student_capacity(
    teacher_params,
    target_compression,
    quality_tolerance,
    architecture_type
):
    """
    Calculate optimal student model capacity.

    Compression guidelines:
    - 2-4× compression: Minimal quality loss (1-3pp)
    - 4-8× compression: Acceptable quality loss (3-7pp) with fine-tuning
    - 8-15× compression: Significant quality loss (7-15pp), risky
    - >15× compression: Usually fails, the student lacks capacity

    Progressive distillation for >4× compression:
    - Stage 1: Teacher → Student 1 (2-4× compression)
    - Stage 2: Student 1 → Student 2 (2-4× compression)
    - Total: 4-16× compression with quality preservation

    Args:
        teacher_params: Number of parameters in the teacher model
        target_compression: Desired compression ratio (e.g., 6.0 for 6× smaller)
        quality_tolerance: Acceptable accuracy drop, as a fraction (0.10 = 10pp)
        architecture_type: "transformer", "cnn", "rnn"

    Returns:
        (student_params, num_stages, compression_per_stage)
    """
    # Compression difficulty by architecture
    # WHY: Different architectures distill with different degrees of difficulty
    difficulty_factor = {
        "transformer": 1.0,  # Distills well (attention patterns transferable)
        "cnn": 0.8,          # Distills very well (spatial features transferable)
        "rnn": 1.2           # Distills poorly (sequential dependencies fragile)
    }[architecture_type]

    # Adjust the target compression by difficulty
    effective_compression = target_compression * difficulty_factor

    # Determine the number of stages
    if effective_compression <= 4:
        # Single-stage distillation is sufficient
        num_stages = 1
        compression_per_stage = effective_compression
    elif effective_compression <= 16:
        # Two-stage distillation
        num_stages = 2
        compression_per_stage = effective_compression ** 0.5
    else:
        # Three-stage distillation (or warn that the compression is too aggressive)
        num_stages = 3
        compression_per_stage = effective_compression ** (1/3)

        if quality_tolerance < 0.15:  # tolerate less than a 15pp drop
            print(f"WARNING: {target_compression}× compression may exceed the quality tolerance")
            print(f"Consider: target compression {target_compression/2:.1f}× instead")

    # Calculate the final student capacity
    student_params = int(teacher_params / target_compression)

    return student_params, num_stages, compression_per_stage

# Example usage
teacher_params = 774_000_000  # GPT-2 Large

# Conservative compression (2× - safe)
student_params, stages, per_stage = calculate_optimal_student_capacity(
    teacher_params=teacher_params,
    target_compression=2.0,
    quality_tolerance=0.03,  # Accept a 3pp drop
    architecture_type="transformer"
)
print(f"2× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 387,000,000 params, 1 stage, 2.0× per stage

# Moderate compression (6× - requires planning)
student_params, stages, per_stage = calculate_optimal_student_capacity(
    teacher_params=teacher_params,
    target_compression=6.0,
    quality_tolerance=0.10,  # Accept a 10pp drop
    architecture_type="transformer"
)
print(f"6× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 129,000,000 params, 2 stages, 2.4× per stage

# Aggressive compression (15× - risky)
student_params, stages, per_stage = calculate_optimal_student_capacity(
    teacher_params=teacher_params,
    target_compression=15.0,
    quality_tolerance=0.20,  # Accept a 20pp drop
    architecture_type="transformer"
)
print(f"15× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 51,600,000 params, 2 stages, 3.9× per stage
# At 15× expect a significant quality loss (see the guidelines above)
```

## Part 3: Low-Rank Decomposition

### Singular Value Decomposition (SVD) for Linear Layers

**Problem:** Large weight matrices (e.g., 4096×4096 in transformers) consume memory.

**Solution:** Decompose each into two smaller matrices (low-rank factorization).

```python
import torch
import torch.nn as nn

def decompose_linear_layer_svd(layer, rank_ratio=0.25):
    """
    Decompose a linear layer using SVD (low-rank approximation).

    WHY: A large matrix W (m×n) becomes two smaller matrices U (m×r) and V (r×n)
    WHY it works: Weight matrices often have low effective rank (redundancy)

    Parameter count: m×n before vs r×(m+n) after, so for a square m×m layer the
    decomposition only saves parameters when r < m/2.

    Example: Linear(4096, 4096) with 25% rank (r = 1024)
    - Original: 16.8M parameters (4096×4096)
    - Decomposed: 8.4M parameters (4096×1024 + 1024×4096) - 2× reduction

    Args:
        layer: nn.Linear layer to decompose
        rank_ratio: Fraction of the original rank to keep (0.25 = keep 25%)

    Returns:
        Sequential module with two linear layers (approximating the original)
    """
    # Get the weight matrix
    W = layer.weight.data  # Shape: (out_features, in_features)
    bias = layer.bias.data if layer.bias is not None else None

    # Perform SVD: W = U @ diag(S) @ V^T
    # WHY SVD: Optimal low-rank approximation (minimizes reconstruction error)
    U, S, Vt = torch.linalg.svd(W, full_matrices=False)

    # Determine the rank to keep
    original_rank = min(W.shape)
    target_rank = int(original_rank * rank_ratio)

    print(f"Original rank: {original_rank}, Target rank: {target_rank}")

    # Truncate to the target rank
    # WHY keep the largest singular values: They capture most of the information
    U_k = U[:, :target_rank]    # Shape: (out_features, target_rank)
    S_k = S[:target_rank]       # Shape: (target_rank,)
    Vt_k = Vt[:target_rank, :]  # Shape: (target_rank, in_features)

    # Create two linear layers: W ≈ U_k @ diag(S_k) @ Vt_k
    # Layer 1: Linear(in_features, target_rank) with weights Vt_k
    # Layer 2: Linear(target_rank, out_features) with weights U_k @ diag(S_k)
    layer1 = nn.Linear(W.shape[1], target_rank, bias=False)
    layer1.weight.data = Vt_k

    layer2 = nn.Linear(target_rank, W.shape[0], bias=(bias is not None))
    layer2.weight.data = U_k * S_k.unsqueeze(0)  # Fold diag(S_k) into the second layer

    if bias is not None:
        layer2.bias.data = bias

    # Return a sequential module (drop-in replacement for the original layer)
    return nn.Sequential(layer1, layer2)

# Example: Decompose a large transformer feedforward layer
original_layer = nn.Linear(4096, 4096)
print(f"Original params: {count_parameters(original_layer):,}")  # 16,781,312

# Decompose with 25% rank retention
decomposed_layer = decompose_linear_layer_svd(original_layer, rank_ratio=0.25)
print(f"Decomposed params: {count_parameters(decomposed_layer):,}")  # 8,392,704 (≈2× reduction)

# Verify reconstruction quality
x = torch.randn(1, 128, 4096)  # Example input
y_original = original_layer(x)
y_decomposed = decomposed_layer(x)

reconstruction_error = torch.norm(y_original - y_decomposed) / torch.norm(y_original)
print(f"Relative reconstruction error: {reconstruction_error.item():.4f}")  # small = good approximation
```

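Rather than fixing `rank_ratio` by hand, the rank can be chosen from the singular value spectrum itself. A small sketch (the 0.95 energy threshold is an illustrative choice, not a rule from this document):

```python
import torch

def rank_for_energy(weight, energy=0.95):
    """Smallest rank whose singular values retain `energy` of the total squared spectrum."""
    S = torch.linalg.svdvals(weight)
    cumulative = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    rank = int(torch.searchsorted(cumulative, energy).item()) + 1
    return min(rank, S.numel())

W = torch.nn.Linear(4096, 4096).weight.data
r = rank_for_energy(W, energy=0.95)
print(f"Rank retaining 95% of spectral energy: {r}")
# If r >= min(W.shape) / 2, decomposing this (square) layer will not save parameters.
```
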
### Apply SVD to Entire Model

```python
def decompose_model_svd(model, rank_ratio=0.5, layer_threshold=1024):
    """
    Apply SVD decomposition to all large linear layers in a model.

    WHY selective: Only decompose large layers (small layers don't benefit)
    WHY threshold: Layers narrower than `layer_threshold` are too small to benefit
    NOTE: The per-layer savings are m×n → r×(m+n); square layers need rank_ratio < 0.5
          to save anything, while wide layers (e.g. 768×3072 feedforwards) save at higher ratios too

    Args:
        model: Model to compress
        rank_ratio: Fraction of the rank to keep per layer
        layer_threshold: Minimum layer width to decompose (skip small layers)

    Returns:
        Model with decomposed layers
    """
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            # Only decompose large layers
            if module.in_features >= layer_threshold and module.out_features >= layer_threshold:
                print(f"Decomposing {name}: {module.in_features}×{module.out_features}")

                # Decompose the layer
                decomposed = decompose_linear_layer_svd(module, rank_ratio=rank_ratio)

                # Replace it in the model
                setattr(model, name, decomposed)
        elif len(list(module.children())) > 0:
            # Recursively decompose nested modules
            decompose_model_svd(module, rank_ratio, layer_threshold)

    return model

# Example: Compress a transformer model
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
original_params = count_parameters(model)
print(f"Original params: {original_params:,}")  # 109M

# Apply SVD (50% rank) to all linear layers at least 512 wide (mainly helps the feedforward layers)
model_compressed = decompose_model_svd(model, rank_ratio=0.5, layer_threshold=512)
compressed_params = count_parameters(model_compressed)
print(f"Compressed params: {compressed_params:,}")  # 82M (1.3× reduction)

# Fine-tune to recover accuracy
# WHY: The low-rank approximation introduces small errors; fine-tuning compensates
fine_tune(model_compressed, epochs=3, lr=2e-5)
```

## Part 4: Combined Compression Pipelines

### Quantization + Pruning + Distillation

**Problem:** A single technique is insufficient for aggressive compression (e.g., 20× for mobile).

**Solution:** Combine multiple techniques (compression ratios multiply).

```python
def full_compression_pipeline(
    teacher_model,
    target_compression=20,
    deployment_target="mobile"
):
    """
    Combined compression pipeline for maximum compression.

    WHY combine techniques: Multiplicative compression
    - Quantization: 4× reduction (FP32 → INT8)
    - Pruning: 2× reduction (50% structured pruning)
    - Distillation: 2.5× reduction (progressive distillation)
    - Total: 4 × 2 × 2.5 = 20× reduction

    Pipeline order:
    1. Knowledge distillation (preserve quality first)
    2. Structured pruning (remove redundancy)
    3. Quantization (reduce precision last)

    WHY this order:
    - Distillation first: Creates a smaller model with quality preservation
    - Pruning second: Removes redundancy from the distilled model
    - Quantization last: Works well on the already-compressed model

    Args:
        teacher_model: Large pre-trained model to compress
        target_compression: Desired compression ratio (e.g., 20 for 20× smaller)
        deployment_target: "mobile", "edge", "server"

    Returns:
        Fully compressed model ready for deployment
    """
    print(f"=== Full Compression Pipeline (target: {target_compression}× reduction) ===\n")

    # Original model metrics
    original_size = get_model_size(teacher_model)
    original_params = count_parameters(teacher_model)
    original_acc = evaluate(teacher_model)

    print(f"Original: {original_params:,} params, {original_size:.1f}MB, {original_acc:.2f}% acc")

    # Step 1: Knowledge Distillation (2-2.5× compression)
    # WHY first: Preserves quality better than pruning the teacher directly
    print("\n--- Step 1: Knowledge Distillation ---")

    distillation_ratio = min(2.5, target_compression ** (1/3))  # Allocate ~1/3 of the compression
    student_model = progressive_distillation(
        teacher=teacher_model,
        num_stages=2,
        compression_per_stage=distillation_ratio ** 0.5,
        distill_epochs=10,
        finetune_epochs=3
    )

    student_size = get_model_size(student_model)
    student_params = count_parameters(student_model)
    student_acc = evaluate(student_model)

    print(f"After distillation: {student_params:,} params, {student_size:.1f}MB, {student_acc:.2f}% acc")
    print(f"Compression: {original_size/student_size:.1f}×")

    # Step 2: Structured Pruning (1.5-2× compression)
    # WHY after distillation: Pruning the smaller model makes quality easier to maintain
    print("\n--- Step 2: Structured Pruning ---")

    pruning_ratio = min(0.5, 1 - 1/(target_compression ** (1/3)))  # Allocate ~1/3 of the compression
    pruned_model = iterative_pruning(
        model=student_model,
        target_ratio=pruning_ratio,
        num_iterations=5,
        finetune_epochs=2
    )

    pruned_size = get_model_size(pruned_model)
    pruned_params = count_parameters(pruned_model)
    pruned_acc = evaluate(pruned_model)

    print(f"After pruning: {pruned_params:,} params, {pruned_size:.1f}MB, {pruned_acc:.2f}% acc")
    print(f"Compression: {original_size/pruned_size:.1f}×")

    # Step 3: Quantization (4× compression)
    # WHY last: Works well on the already-compressed model, easy to apply
    print("\n--- Step 3: Quantization (INT8) ---")

    # Quantization-aware training (QAT)
    pruned_model.train()  # prepare_qat expects a model in training mode
    pruned_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    model_prepared = torch.quantization.prepare_qat(pruned_model)

    # Fine-tune with fake quantization
    fine_tune(model_prepared, epochs=3, lr=1e-4)

    # Convert to INT8
    model_prepared.eval()
    quantized_model = torch.quantization.convert(model_prepared)

    # NOTE: get_model_size() assumes 4 bytes per parameter; for the INT8 model the real size
    # should be measured from the serialized checkpoint (see the sketch after this block)
    quantized_size = get_model_size(quantized_model)
    quantized_acc = evaluate(quantized_model)

    print(f"After quantization: {quantized_size:.1f}MB, {quantized_acc:.2f}% acc")
    print(f"Total compression: {original_size/quantized_size:.1f}×")

    # Summary
    print("\n=== Compression Pipeline Summary ===")
    print(f"Original: {original_size:.1f}MB, {original_acc:.2f}% acc")
    print(f"Distilled: {student_size:.1f}MB, {student_acc:.2f}% acc ({original_size/student_size:.1f}×)")
    print(f"Pruned: {pruned_size:.1f}MB, {pruned_acc:.2f}% acc ({original_size/pruned_size:.1f}×)")
    print(f"Quantized: {quantized_size:.1f}MB, {quantized_acc:.2f}% acc ({original_size/quantized_size:.1f}×)")
    print(f"\nFinal compression: {original_size/quantized_size:.1f}× (target: {target_compression}×)")
    print(f"Accuracy drop: {original_acc - quantized_acc:.2f}pp")

    # Deployment checks
    if deployment_target == "mobile":
        assert quantized_size <= 100, f"Model too large for mobile ({quantized_size:.1f}MB > 100MB)"
        assert quantized_acc >= original_acc - 5, f"Quality loss too high ({original_acc - quantized_acc:.2f}pp)"
        print("\n✓ Ready for mobile deployment")

    return quantized_model

# Example: Compress BERT for mobile deployment
from transformers import BertForSequenceClassification

teacher = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Target: 20× compression (440MB → 22MB)
compressed_model = full_compression_pipeline(
    teacher_model=teacher,
    target_compression=20,
    deployment_target="mobile"
)

# Results:
# - Original: 440MB, 95.2% acc
# - Distilled: 180MB, 93.5% acc (2.4× compression)
# - Pruned: 90MB, 92.8% acc (4.9× compression)
# - Quantized: 22MB, 92.1% acc (20× compression!)
# - Accuracy drop: 3.1pp (acceptable for mobile)
```

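The `get_model_size()` helper assumes 4 bytes per parameter, which no longer holds after INT8 conversion. A more reliable check against a deployment size budget is to measure the serialized checkpoint; a minimal sketch:

```python
import os
import tempfile
import torch

def serialized_size_mb(model):
    """On-disk size of the model's state_dict in MB (works for quantized models too)."""
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        path = f.name
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / (1024 ** 2)
    os.remove(path)
    return size_mb

# Example: verify the mobile budget used in the pipeline above
# assert serialized_size_mb(compressed_model) <= 100
```
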
## Part 5: Architecture Optimization

### Neural Architecture Search for Efficiency

**Problem:** Manual architecture design for compression is time-consuming.

**Solution:** Automated search for efficient architectures (NAS for compression).

```python
def efficient_architecture_search(
    task_type,
    target_latency_ms,
    target_accuracy,
    search_space="mobilenet"
):
    """
    Search for an efficient architecture meeting the constraints.

    WHY NAS: Automated discovery of architectures optimized for efficiency
    WHY this search space: MobileNet/EfficientNet families are designed for edge deployment

    Search dimensions:
    - Width multiplier: Scale the number of channels (0.5× - 1.5×)
    - Depth multiplier: Scale the number of layers (0.75× - 1.25×)
    - Resolution multiplier: Scale the input resolution (128px - 384px)

    Args:
        task_type: "classification", "detection", "segmentation"
        target_latency_ms: Maximum inference latency (ms)
        target_accuracy: Minimum acceptable accuracy (%)
        search_space: "mobilenet", "efficientnet", "custom"

    Returns:
        Optimal architecture configuration (or None if no candidate fits)
    """
    # Example: MobileNetV3 search space
    # WHY MobileNet: Designed for mobile (depthwise separable convolutions)
    from torchvision.models import mobilenet_v3_small, mobilenet_v3_large

    # Candidate configurations (latency/accuracy figures are indicative mobile-CPU numbers)
    configurations = [
        {
            "model": mobilenet_v3_small,
            "width_multiplier": 0.5,
            "expected_latency": 8,      # ms on mobile CPU
            "expected_accuracy": 60.5   # ImageNet top-1
        },
        {
            "model": mobilenet_v3_small,
            "width_multiplier": 1.0,
            "expected_latency": 15,
            "expected_accuracy": 67.4
        },
        {
            "model": mobilenet_v3_large,
            "width_multiplier": 0.75,
            "expected_latency": 25,
            "expected_accuracy": 73.3
        },
        {
            "model": mobilenet_v3_large,
            "width_multiplier": 1.0,
            "expected_latency": 35,
            "expected_accuracy": 75.2
        }
    ]

    # Find the configurations that meet the constraints
    # WHY filter: Only consider configs within the latency budget and accuracy requirement
    valid_configs = [
        config for config in configurations
        if config["expected_latency"] <= target_latency_ms
        and config["expected_accuracy"] >= target_accuracy
    ]

    if not valid_configs:
        print(f"No configuration meets the constraints (latency={target_latency_ms}ms, accuracy={target_accuracy}%)")
        print("Consider: relax the constraints or use a custom search")
        return None

    # Select the best (highest accuracy within the latency budget)
    best_config = max(valid_configs, key=lambda c: c["expected_accuracy"])

    print(f"Selected: {best_config['model'].__name__} (width={best_config['width_multiplier']})")
    print(f"Expected: {best_config['expected_latency']}ms, {best_config['expected_accuracy']}% acc")

    return best_config

# Example usage: Find an architecture for mobile deployment
config = efficient_architecture_search(
    task_type="classification",
    target_latency_ms=30,  # 30ms latency budget
    target_accuracy=70.0,  # Minimum 70% accuracy
    search_space="mobilenet"
)

# Output: Selected mobilenet_v3_large (width=0.75)
# Expected: 25ms latency, 73.3% accuracy
# Meets both constraints
```

## Part 6: Quality Preservation Strategies

### Trade-off Analysis Framework

```python
def analyze_compression_tradeoffs(
    model,
    compression_techniques,
    deployment_constraints
):
    """
    Analyze compression technique trade-offs.

    WHY: Different techniques have different trade-offs
    - Quantization: Best size/speed for the effort, minimal quality loss (0.5-2pp)
    - Pruning: Good size/speed, moderate quality loss (2-5pp)
    - Distillation: Excellent quality, but requires training time

    Args:
        model: Model to compress
        compression_techniques: List of techniques to consider
        deployment_constraints: Dict with size_mb, latency_ms, accuracy_min

    Returns:
        Recommended technique and expected metrics (or None if nothing fits)
    """
    results = []

    # Quantization (FP32 → INT8)
    if "quantization" in compression_techniques:
        results.append({
            "technique": "quantization",
            "compression_ratio": 4.0,
            "expected_accuracy_drop": 0.5,  # 0.5-2pp with QAT
            "training_time_hours": 2,       # QAT training
            "complexity": "low"
        })

    # Structured pruning (50%)
    if "pruning" in compression_techniques:
        results.append({
            "technique": "structured_pruning_50%",
            "compression_ratio": 2.0,
            "expected_accuracy_drop": 2.5,  # 2-5pp with iterative pruning
            "training_time_hours": 8,       # Iterative pruning + fine-tuning
            "complexity": "medium"
        })

    # Knowledge distillation (2× compression)
    if "distillation" in compression_techniques:
        results.append({
            "technique": "distillation_2x",
            "compression_ratio": 2.0,
            "expected_accuracy_drop": 1.5,  # 1-3pp
            "training_time_hours": 20,      # Full distillation training
            "complexity": "high"
        })

    # Combined (quantization + pruning)
    if "combined" in compression_techniques:
        results.append({
            "technique": "quantization+pruning",
            "compression_ratio": 8.0,       # 4× × 2× = 8×
            "expected_accuracy_drop": 3.5,  # roughly additive: 0.5 + 2.5 + interaction
            "training_time_hours": 12,      # Pruning + QAT
            "complexity": "high"
        })

    # Filter by constraints
    original_size = get_model_size(model)
    original_acc = evaluate(model)

    valid_techniques = [
        r for r in results
        if (original_size / r["compression_ratio"]) <= deployment_constraints["size_mb"]
        and (original_acc - r["expected_accuracy_drop"]) >= deployment_constraints["accuracy_min"]
    ]

    if not valid_techniques:
        print("No technique meets all constraints")
        return None

    # Recommend a technique (prioritize: best quality, then fastest training, then simplest)
    complexity_rank = {"low": 0, "medium": 1, "high": 2}
    best = min(
        valid_techniques,
        key=lambda r: (r["expected_accuracy_drop"], r["training_time_hours"], complexity_rank[r["complexity"]])
    )

    print(f"Recommended: {best['technique']}")
    print(f"Expected size: {original_size/best['compression_ratio']:.1f}MB (from {original_size:.1f}MB)")
    print(f"Expected accuracy: {original_acc - best['expected_accuracy_drop']:.1f}% (drop: {best['expected_accuracy_drop']}pp)")
    print(f"Training time: {best['training_time_hours']} hours")

    return best

# Example usage (my_model stands in for whatever model is being compressed)
deployment_constraints = {
    "size_mb": 50,        # Model must be <50MB
    "latency_ms": 100,    # <100ms inference
    "accuracy_min": 90.0  # >90% accuracy
}

recommendation = analyze_compression_tradeoffs(
    model=my_model,
    compression_techniques=["quantization", "pruning", "distillation", "combined"],
    deployment_constraints=deployment_constraints
)
```

## Common Mistakes to Avoid

| Mistake | Why It's Wrong | Correct Approach |
|---------|----------------|------------------|
| "Pruning works for all architectures" | Destroys transformer attention | Use distillation for transformers |
| "More compression is always better" | 77× compression produces gibberish | Progressive distillation for >4× |
| "Unstructured pruning speeds up inference" | No speedup on standard hardware | Use structured pruning (channel/layer) |
| "Quantize and deploy immediately" | 5pp accuracy drop without recovery | QAT + fine-tuning for quality preservation |
| "Single technique is enough" | Can't reach aggressive targets (20×) | Combine: quantization + pruning + distillation |
| "Skip fine-tuning to save time" | Preventable accuracy loss | Always include recovery step |

## Success Criteria

You've correctly compressed a model when:

✅ Selected appropriate technique for architecture (distillation for transformers, pruning for CNNs)
✅ Matched student capacity to compression target (2-4× per stage, progressive for >4×)
✅ Used structured pruning for standard hardware (actual speedup)
✅ Applied iterative/progressive compression (quality preservation)
✅ Included accuracy recovery (QAT, fine-tuning, calibration)
✅ Achieved target compression with acceptable quality loss (<5pp for most tasks)
✅ Verified deployment constraints (size, latency, accuracy) are met

## References

**Key papers:**
- DistilBERT (Sanh et al., 2019): Knowledge distillation for transformers
- The Lottery Ticket Hypothesis (Frankle & Carbin, 2019): Iterative magnitude pruning
- Pruning Filters for Efficient ConvNets (Li et al., 2017): Structured channel pruning
- Deep Compression (Han et al., 2016): Pruning + quantization + Huffman coding

**When to combine with other skills:**
- Use with quantization-for-inference: Quantization (4×) + compression (2-5×) = 8-20× total
- Use with hardware-optimization-strategies: Optimize compressed model for target hardware
- Use with model-serving-patterns: Deploy compressed model with batching/caching