# Model Compression Techniques

## When to Use This Skill

Use this skill when:
- Deploying models to edge devices (mobile, IoT, embedded systems)
- Model too large for deployment constraints (storage, memory, bandwidth)
- Inference costs too high (need smaller/faster model)
- Need to balance model size, speed, and accuracy
- Combining multiple compression techniques (quantization + pruning + distillation)

**When NOT to use:**
- Model already fits deployment constraints (compression unnecessary)
- Training optimization needed (use training-optimization pack instead)
- Quantization is sufficient (use quantization-for-inference instead)
- LLM-specific optimization (use llm-specialist for KV cache, speculative decoding)

**Relationship with quantization-for-inference:**
- Quantization: Reduce precision (FP32 → INT8/INT4) - 4× size reduction
- Compression: Reduce architecture (pruning, distillation) - 2-10× size reduction
- Often combined: Quantization + pruning + distillation = 10-50× total reduction

## Core Principle

**Compression is not one-size-fits-all. Architecture and deployment target determine the technique.**

Without systematic compression:
- Mobile deployment: 440MB model crashes on 2GB devices
- Wrong technique: Pruning transformers → 33pp accuracy drop
- Unstructured pruning: No speedup on standard hardware
- Aggressive distillation: 77× compression produces gibberish
- No recovery: 5pp preventable accuracy loss

**Formula:** Architecture analysis (transformer vs CNN) + Deployment constraints (hardware, latency, size) + Technique selection (pruning vs distillation) + Quality preservation (recovery, progressive compression) = Production-ready compressed model.

## Compression Decision Framework

```
Model Compression Decision Tree

1. What is the target deployment?
   ├─ Edge/Mobile (strict size/memory)   → Aggressive compression (4-10×)
   ├─ Cloud/Server (cost optimization)   → Moderate compression (2-4×)
   └─ On-premises (moderate constraints) → Balanced approach

2. What is the model architecture?
   ├─ Transformer (BERT, GPT, T5)
   │  └─ Primary: Knowledge distillation (preserves attention)
   │  └─ Secondary: Layer dropping, quantization
   │  └─ AVOID: Aggressive unstructured pruning (destroys quality)
   │
   ├─ CNN (ResNet, EfficientNet, MobileNet)
   │  └─ Primary: Structured channel pruning (works well)
   │  └─ Secondary: Quantization (INT8 standard)
   │  └─ Tertiary: Knowledge distillation (classification tasks)
   │
   └─ RNN/LSTM
      └─ Primary: Quantization (safe, effective)
      └─ Secondary: Structured pruning (hidden dimension)
      └─ AVOID: Unstructured pruning (breaks sequential dependencies)

3. What is the deployment hardware?
   ├─ CPU/GPU/Mobile (standard)               → Structured pruning (actual speedup)
   └─ Specialized (A100, sparse accelerators) → Unstructured pruning possible

4. What is the acceptable quality loss?
   ├─ <2pp  → Conservative: Quantization only (4× reduction)
   ├─ 2-5pp → Moderate: Quantization + structured pruning (6-10× reduction)
   └─ >5pp  → Aggressive: Full pipeline with distillation (10-50× reduction)

5. Combine techniques for maximum compression:
   Quantization (4×) + Pruning (2×) + Distillation (2×) = 16× total reduction
```
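
As a rough illustration, the decision tree can be encoded as a small helper that maps architecture, hardware, and quality budget to an ordered list of techniques. Everything in this sketch (the function name, thresholds, and technique labels) is hypothetical and simply restates the decision points above:

```python
# Hypothetical helper that restates the decision tree; the thresholds and
# technique names are illustrative, not a library API.
def select_compression_plan(architecture, hardware, max_quality_drop_pp):
    """Return an ordered list of techniques to try for a given deployment."""
    if architecture == "transformer":
        plan = ["knowledge_distillation", "layer_dropping", "quantization"]
    elif architecture == "cnn":
        plan = ["structured_channel_pruning", "quantization", "knowledge_distillation"]
    else:  # RNN/LSTM
        plan = ["quantization", "structured_pruning_hidden_dim"]

    # Unstructured pruning only pays off on sparse-capable hardware
    if hardware in ("a100_sparse", "sparse_accelerator"):
        plan.append("unstructured_pruning")

    # Conservative budgets: stop at quantization; aggressive budgets: full pipeline
    if max_quality_drop_pp < 2:
        plan = ["quantization"]
    elif max_quality_drop_pp > 5:
        plan = ["knowledge_distillation", "structured_pruning", "quantization"]

    return plan

print(select_compression_plan("transformer", "mobile_cpu", max_quality_drop_pp=3))
# ['knowledge_distillation', 'layer_dropping', 'quantization']
```
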
## Part 1: Structured vs Unstructured Pruning

### When to Use Each

**Unstructured Pruning:**
- **Use when:** Sparse hardware is available (NVIDIA A100, specialized accelerators)
- **Benefit:** Highest compression (70-90% sparsity possible)
- **Drawback:** No speedup on standard hardware, which computes the zeros anyway (see the sketch after this list)
- **Hardware support:** Rare (most deployments use standard CPU/GPU)

**Structured Pruning:**
- **Use when:** Standard hardware (CPU, GPU, mobile) - 99% of deployments
- **Benefit:** Actual speedup (smaller dense matrices)
- **Drawback:** Lower compression ratio (50-70% typical)
- **Variants:** Channel pruning (CNNs), layer dropping (transformers), attention head pruning
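
To make the hardware point concrete, here is a minimal sketch of unstructured magnitude pruning using `torch.nn.utils.prune` (used only for this illustration; the rest of the pack focuses on structured methods). The pruned tensor keeps its original dense shape and simply contains zeros, which is why standard dense kernels see no latency benefit:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Zero out the 70% of weights with the smallest L1 magnitude
prune.l1_unstructured(layer, name="weight", amount=0.7)

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.1%}")                   # ~70% of entries are zero
print(f"Weight shape: {tuple(layer.weight.shape)}")  # Unchanged: (64, 64, 3, 3)

# Make the pruning permanent (folds the mask into the weight tensor).
# The tensor is still dense, so FLOPs and latency are unchanged on standard
# CPU/GPU kernels -- only sparse-aware hardware benefits.
prune.remove(layer, "weight")
```
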
### Structured Channel Pruning (CNNs)

**Problem:** Unstructured pruning creates sparse tensors that don't accelerate on standard hardware.

**Solution:** Remove entire channels to create a smaller dense model (actual speedup).

```python
import copy

import torch
import torch.nn as nn
import torch_pruning as tp

def structured_channel_pruning_cnn(model, pruning_ratio=0.5, example_input=None):
    """
    Structured channel pruning for CNNs (actual speedup on all hardware).

    WHY structured: Removes entire channels/filters, creating a smaller dense model
    WHY it works: Smaller dense matrices compute faster than sparse matrices on standard hardware

    Args:
        model: CNN model to prune (modified in place)
        pruning_ratio: Fraction of channels to remove (0.5 = remove 50%)
        example_input: Example input tensor for tracing dependencies

    Returns:
        Pruned model (smaller, faster)
    """
    if example_input is None:
        example_input = torch.randn(1, 3, 224, 224)

    # Define importance metric (L1 norm of channels)
    # WHY L1 norm: Channels with small L1 norm contribute less to the output
    importance = tp.importance.MagnitudeImportance(p=1)

    # Create pruner
    pruner = tp.pruner.MagnitudePruner(
        model,
        example_inputs=example_input,
        importance=importance,
        pruning_ratio=pruning_ratio,
        global_pruning=False  # Prune each layer independently
    )

    # Execute pruning (removes channels, creates smaller model)
    # WHY remove channels: Conv2d(64, 128) → Conv2d(32, 64) after 50% pruning
    pruner.step()

    return model

# Example: Prune ResNet18
from torchvision.models import resnet18

model = resnet18(pretrained=True)
print(f"Original model size: {get_model_size(model):.1f}MB")  # 44.7MB
print(f"Original params: {count_parameters(model):,}")        # 11,689,512

# Apply 50% channel pruning
# WHY the deepcopy: torch_pruning modifies the model in place, and the original
# is still needed for the speed/accuracy comparison below
model_pruned = structured_channel_pruning_cnn(
    copy.deepcopy(model),
    pruning_ratio=0.5,
    example_input=torch.randn(1, 3, 224, 224)
)

print(f"Pruned model size: {get_model_size(model_pruned):.1f}MB")
print(f"Pruned params: {count_parameters(model_pruned):,}")
# NOTE: removing 50% of channels also shrinks the input dimension of downstream
# layers, so the parameter reduction is typically larger than 50% (roughly 3-4×
# for ResNet18), not exactly half

# Benchmark inference speed
# WHY faster: Smaller dense matrices (fewer FLOPs, less memory bandwidth)
original_time = benchmark_inference(model)       # 25ms
pruned_time = benchmark_inference(model_pruned)  # 12.5ms (2× FASTER!)

# Accuracy (before fine-tuning)
original_acc = evaluate(model)       # 69.8%
pruned_acc = evaluate(model_pruned)  # 64.2% (5.6pp drop - needs fine-tuning)

# Fine-tune to recover accuracy
fine_tune(model_pruned, epochs=5, lr=1e-4)
pruned_acc_recovered = evaluate(model_pruned)  # 68.5% (1.3pp drop, acceptable)
```

**Helper functions:**

```python
import time

import torch

def get_model_size(model):
    """Calculate model size in MB (FP32 parameters only; buffers ignored)."""
    # WHY: Multiply parameters by 4 bytes (FP32)
    param_size = sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)
    return param_size

def count_parameters(model):
    """Count trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def benchmark_inference(model, num_runs=100):
    """Benchmark inference time (ms)."""
    model.eval()
    example_input = torch.randn(1, 3, 224, 224)

    # Warmup
    with torch.no_grad():
        for _ in range(10):
            model(example_input)

    # Benchmark
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            model(example_input)
    end = time.time()

    return (end - start) / num_runs * 1000  # Convert to ms
```
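
The examples throughout this pack also call `evaluate()` and `fine_tune()`, which are task-specific and never defined. A minimal sketch for image classification, assuming hypothetical `train_loader`/`val_loader` DataLoaders (the transformer examples would instead pass tokenized batches and read `outputs.logits`):

```python
import torch
import torch.nn.functional as F

def evaluate(model, val_loader=None, device="cpu"):
    """Top-1 accuracy (%) on a classification validation set (loader must be supplied)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / total

def fine_tune(model, epochs=3, lr=1e-4, train_loader=None, device="cpu"):
    """Plain cross-entropy fine-tuning used as the recovery step."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            loss = F.cross_entropy(model(inputs), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```
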
### Iterative Pruning (Quality Preservation)

**Problem:** Pruning all at once (50% in one step) → 10pp accuracy drop.

**Solution:** Iterative pruning (5 steps × 10% each) → 2pp accuracy drop.

```python
import copy

import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_pruning(model, target_ratio=0.5, num_iterations=5, finetune_epochs=2):
    """
    Iterative pruning with fine-tuning between steps.

    WHY iterative: Gradual pruning allows the model to adapt
    WHY fine-tune: Remaining weights compensate for removed weights

    Example: 50% pruning
    - One-shot: 50% pruning → 10pp accuracy drop
    - Iterative: 5 steps × 10% each → 2pp accuracy drop (better!)

    Args:
        model: Model to prune
        target_ratio: Final pruning ratio (0.5 = remove 50% of weights)
        num_iterations: Number of pruning steps (more = gradual = better quality)
        finetune_epochs: Fine-tuning epochs after each step

    Returns:
        Pruned model with quality preservation
    """
    # Calculate pruning amount per iteration
    # WHY: Distribute total pruning across iterations
    amount_per_iteration = 1 - (1 - target_ratio) ** (1 / num_iterations)

    print(f"Pruning {target_ratio*100:.0f}% in {num_iterations} steps")
    print(f"Amount per step: {amount_per_iteration*100:.1f}%")

    for step in range(num_iterations):
        print(f"\n=== Iteration {step+1}/{num_iterations} ===")

        # Prune this iteration
        # WHY global_unstructured: Prune across all layers (balanced sparsity)
        parameters_to_prune = [
            (module, "weight")
            for module in model.modules()
            if isinstance(module, (nn.Linear, nn.Conv2d))
        ]

        prune.global_unstructured(
            parameters_to_prune,
            pruning_method=prune.L1Unstructured,
            amount=amount_per_iteration
        )

        # Evaluate current accuracy
        acc_before_finetune = evaluate(model)
        print(f"Accuracy after pruning: {acc_before_finetune:.2f}%")

        # Fine-tune to recover accuracy
        # WHY: Allow remaining weights to compensate for removed weights
        fine_tune(model, epochs=finetune_epochs, lr=1e-4)

        acc_after_finetune = evaluate(model)
        print(f"Accuracy after fine-tuning: {acc_after_finetune:.2f}%")

    # Make pruning permanent (remove masks once all iterations are done)
    for module, param_name in parameters_to_prune:
        prune.remove(module, param_name)

    return model

# Example usage
model = resnet18(pretrained=True)
original_acc = evaluate(model)  # 69.8%

# One-shot pruning (worse quality)
model_oneshot = copy.deepcopy(model)
prune.global_unstructured(
    [(m, "weight") for m in model_oneshot.modules() if isinstance(m, (nn.Linear, nn.Conv2d))],
    pruning_method=prune.L1Unstructured,
    amount=0.5  # Prune 50% immediately
)
oneshot_acc = evaluate(model_oneshot)  # 59.7% (10.1pp drop!)

# Iterative pruning (better quality)
model_iterative = copy.deepcopy(model)
model_iterative = iterative_pruning(
    model_iterative,
    target_ratio=0.5,    # 50% pruning
    num_iterations=5,    # Gradual over 5 steps
    finetune_epochs=2    # Fine-tune after each step
)
iterative_acc = evaluate(model_iterative)  # 67.5% (2.3pp drop, much better!)

# Quality comparison:
# - One-shot: 10.1pp drop (unacceptable)
# - Iterative: 2.3pp drop (acceptable)
```

### Structured Layer Pruning (Transformers)

**Problem:** Transformers are sensitive to unstructured pruning (it destroys attention patterns).

**Solution:** Drop entire layers (structured pruning for transformers).

```python
import torch.nn as nn

def drop_transformer_layers(model, num_layers_to_drop=6):
    """
    Drop transformer layers (structured pruning for transformers).

    WHY drop layers: Transformers learn hierarchical features, later layers refine
    WHY not unstructured: Attention patterns are dense, pruning destroys them

    Example: BERT-base (12 layers) → BERT-small (6 layers)
    - Size: 440MB → 220MB (2× reduction)
    - Speed: 2× faster (half the layers)
    - Accuracy: 95% → 92% (3pp drop with fine-tuning)

    Args:
        model: Transformer model (BERT-style; adapt the encoder lookup for GPT/T5)
        num_layers_to_drop: Number of layers to remove

    Returns:
        Smaller transformer model
    """
    # Identify which layers to drop
    # WHY drop middle layers: Keep early (low-level features) and late (task-specific)
    # Alternative: Drop early or late layers depending on the task
    # NOTE: task heads such as BertForSequenceClassification wrap the backbone
    # (model.bert); a bare BertModel exposes the encoder directly
    encoder = model.bert.encoder if hasattr(model, "bert") else model.encoder
    total_layers = len(encoder.layer)  # BERT example
    layers_to_keep = total_layers - num_layers_to_drop

    # Drop middle layers (preserve early and late layers)
    keep_front = layers_to_keep // 2
    keep_back = layers_to_keep - keep_front
    kept_layers = list(encoder.layer[:keep_front]) + list(encoder.layer[total_layers - keep_back:])
    encoder.layer = nn.ModuleList(kept_layers)

    # Update config
    model.config.num_hidden_layers = layers_to_keep

    print(f"Dropped {num_layers_to_drop} layers ({total_layers} → {layers_to_keep})")

    return model

# Example: Compress BERT-base
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
print(f"Original: {model.config.num_hidden_layers} layers, {get_model_size(model):.0f}MB")
# Original: 12 layers, 440MB

# Evaluate before compression (drop_transformer_layers modifies the model in place)
original_acc = evaluate(model)  # 95.2%

# Drop 6 layers (50% reduction)
model_compressed = drop_transformer_layers(model, num_layers_to_drop=6)
print(f"Compressed: {model_compressed.config.num_hidden_layers} layers, {get_model_size(model_compressed):.0f}MB")
# Compressed: 6 layers, 220MB

# Accuracy before fine-tuning
compressed_acc = evaluate(model_compressed)  # 88.5% (6.7pp drop)

# Fine-tune to recover accuracy
# WHY fine-tune: Remaining layers adapt to the missing layers
fine_tune(model_compressed, epochs=3, lr=2e-5)
compressed_acc_recovered = evaluate(model_compressed)  # 92.1% (3.1pp drop, acceptable)
```

## Part 2: Knowledge Distillation

### Progressive Distillation (Quality Preservation)

**Problem:** Single-stage aggressive distillation fails (77× compression → unusable quality).

**Solution:** Progressive distillation in multiple stages (2-4× per stage).

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

def progressive_distillation(
    teacher,
    num_stages=2,
    compression_per_stage=2.5,
    distill_epochs=10,
    finetune_epochs=3
):
    """
    Progressive knowledge distillation (quality preservation for aggressive compression).

    WHY progressive: A large capacity gap (teacher → tiny student) loses too much knowledge
    WHY multi-stage: A smooth transition preserves quality (teacher → intermediate → final)

    Example: 6× compression
    - Single-stage: 774M → 130M (6× in one step) → 15pp accuracy drop (bad)
    - Progressive: 774M → 310M → 130M (2.5× per stage) → 5pp accuracy drop (better)

    Args:
        teacher: Large pre-trained model (e.g., BERT-large, GPT-2 Large)
        num_stages: Number of distillation stages (2-3 typical)
        compression_per_stage: Compression ratio per stage (2-4× safe)
        distill_epochs: Distillation training epochs per stage
        finetune_epochs: Fine-tuning epochs on hard labels

    Returns:
        Final compressed student model
    """
    current_teacher = teacher
    teacher_params = count_parameters(teacher)
    print(f"Teacher: {teacher_params:,} params")

    for stage in range(num_stages):
        print(f"\n=== Stage {stage+1}/{num_stages} ===")

        # Calculate student capacity for this stage
        # WHY: Reduce by compression_per_stage factor
        student_params = teacher_params // (compression_per_stage ** (stage + 1))

        # Create student architecture (model-specific)
        # WHY smaller: Fewer layers, smaller hidden dimension, fewer heads
        student = create_student_model(
            teacher_architecture=current_teacher,
            target_params=student_params
        )
        print(f"Student {stage+1}: {count_parameters(student):,} params")

        # Phase 1: Distillation training (learn from teacher)
        # WHY soft targets: Teacher's probability distribution (richer than hard labels)
        # NOTE: train_loader is assumed to be a task-specific DataLoader defined in
        # the surrounding scope
        student = train_distillation(
            teacher=current_teacher,
            student=student,
            train_loader=train_loader,
            epochs=distill_epochs,
            temperature=2.0,  # WHY 2.0: Softer probabilities (more knowledge transfer)
            alpha=0.7         # WHY 0.7: Weight distillation loss higher than hard loss
        )

        # Phase 2: Fine-tuning on hard labels (task optimization)
        # WHY: Optimize the student for actual task performance (not just mimicking the teacher)
        # NOTE: fine_tune_on_labels is plain supervised fine-tuning, analogous to fine_tune()
        student = fine_tune_on_labels(
            student=student,
            train_loader=train_loader,
            epochs=finetune_epochs,
            lr=2e-5
        )

        # Evaluate this stage
        teacher_acc = evaluate(current_teacher)
        student_acc = evaluate(student)
        print(f"Teacher accuracy: {teacher_acc:.2f}%")
        print(f"Student accuracy: {student_acc:.2f}% (drop: {teacher_acc - student_acc:.2f}pp)")

        # Student becomes teacher for the next stage
        current_teacher = student

    return student

def create_student_model(teacher_architecture, target_params):
    """
    Create a student model with a target parameter count.

    WHY: Match the architecture type but scale down capacity
    """
    # Example for BERT
    if isinstance(teacher_architecture, BertForSequenceClassification):
        # Scale down: fewer layers, smaller hidden size, fewer heads
        # WHY: Preserve the architecture but reduce capacity
        teacher_config = teacher_architecture.config

        # Calculate scaling factor (rough heuristic)
        scaling_factor = (target_params / count_parameters(teacher_architecture)) ** 0.5

        # NOTE: in practice, round hidden_size to a multiple of num_attention_heads
        # (BertConfig requires hidden_size % num_attention_heads == 0)
        student_config = BertConfig(
            num_hidden_layers=int(teacher_config.num_hidden_layers * scaling_factor),
            hidden_size=int(teacher_config.hidden_size * scaling_factor),
            num_attention_heads=max(1, int(teacher_config.num_attention_heads * scaling_factor)),
            intermediate_size=int(teacher_config.intermediate_size * scaling_factor),
            num_labels=teacher_config.num_labels
        )

        return BertForSequenceClassification(student_config)

    # Add other architectures as needed
    raise ValueError(f"Unsupported architecture: {type(teacher_architecture)}")

def train_distillation(teacher, student, train_loader, epochs, temperature, alpha):
    """
    Train the student to mimic the teacher (knowledge distillation).

    WHY distillation loss: Student learns soft targets (probability distributions)
    WHY temperature: Softens probabilities (exposes dark knowledge)
    """
    import torch.nn.functional as F

    teacher.eval()   # WHY: Teacher is frozen (pre-trained knowledge)
    student.train()

    optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

    for epoch in range(epochs):
        total_loss = 0
        for batch, labels in train_loader:
            # Teacher predictions (soft targets)
            with torch.no_grad():
                teacher_logits = teacher(batch).logits

            # Student predictions
            student_logits = student(batch).logits

            # Distillation loss (KL divergence with temperature scaling)
            # WHY temperature: Softens probabilities, exposes similarities between classes
            soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
            soft_predictions = F.log_softmax(student_logits / temperature, dim=-1)

            distillation_loss = F.kl_div(
                soft_predictions,
                soft_targets,
                reduction='batchmean'
            ) * (temperature ** 2)  # WHY T^2: compensates for the 1/T² scaling of soft-target gradients

            # Hard label loss (cross-entropy with ground truth)
            hard_loss = F.cross_entropy(student_logits, labels)

            # Combined loss
            # WHY alpha: Balance distillation (learn from teacher) and hard loss (task performance)
            loss = alpha * distillation_loss + (1 - alpha) * hard_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader):.4f}")

    return student

# Example usage: Compress GPT-2 Large (774M) to GPT-2 Small (124M)
# NOTE: create_student_model / train_distillation above are sketched for
# classification (BERT-style); an LM student needs an analogous GPT2Config
# student and a token-level loss
from transformers import GPT2LMHeadModel

teacher = GPT2LMHeadModel.from_pretrained('gpt2-large')  # 774M params

# Progressive distillation (2 stages, 2.5× per stage = 6.25× total)
student_final = progressive_distillation(
    teacher=teacher,
    num_stages=2,
    compression_per_stage=2.5,  # 2.5× per stage
    distill_epochs=10,
    finetune_epochs=3
)

# Results:
# - Teacher (GPT-2 Large): 774M params, perplexity 18.5
# - Student 1 (intermediate): 310M params, perplexity 22.1 (+3.6 vs teacher)
# - Student 2 (final): 124M params, perplexity 28.5 (+10.0 vs teacher)
# - Single-stage (direct 774M → 124M): perplexity 45.2 (+26.7, much worse!)
```
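
For reference, the objective implemented in `train_distillation` above is the standard distillation loss:

$$
\mathcal{L} \;=\; \alpha\, T^{2}\, \mathrm{KL}\!\big(\operatorname{softmax}(z_t/T)\;\big\|\;\operatorname{softmax}(z_s/T)\big)\;+\;(1-\alpha)\,\mathrm{CE}(z_s,\,y)
$$

where $z_t$ and $z_s$ are the teacher and student logits, $T$ is the temperature, $y$ the ground-truth label, and $\alpha$ weights soft-target learning against the hard-label loss; the $T^2$ factor keeps gradient magnitudes comparable as the temperature changes.
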
### Capacity Matching Guidelines

**Problem:** Student too small → can't learn the teacher's knowledge. Student too large → inefficient.

**Solution:** Match student capacity to the compression target and quality tolerance.

```python
def calculate_optimal_student_capacity(
    teacher_params,
    target_compression,
    quality_tolerance,
    architecture_type
):
    """
    Calculate optimal student model capacity.

    Compression guidelines:
    - 2-4× compression: Minimal quality loss (1-3pp)
    - 4-8× compression: Acceptable quality loss (3-7pp) with fine-tuning
    - 8-15× compression: Significant quality loss (7-15pp), risky
    - >15× compression: Usually fails, student lacks capacity

    Progressive distillation for >4× compression:
    - Stage 1: Teacher → Student 1 (2-4× compression)
    - Stage 2: Student 1 → Student 2 (2-4× compression)
    - Total: 4-16× compression with quality preservation

    Args:
        teacher_params: Number of parameters in the teacher model
        target_compression: Desired compression ratio (e.g., 6.0 for 6× smaller)
        quality_tolerance: Acceptable accuracy drop as a fraction (0.03 = 3pp)
        architecture_type: "transformer", "cnn", "rnn"

    Returns:
        (student_params, num_stages, compression_per_stage)
    """
    # Compression difficulty by architecture
    # WHY: Different architectures have different distillation friendliness
    difficulty_factor = {
        "transformer": 1.0,  # Distills well (attention patterns transferable)
        "cnn": 0.8,          # Distills very well (spatial features transferable)
        "rnn": 1.2           # Distills poorly (sequential dependencies fragile)
    }[architecture_type]

    # Adjust target compression by difficulty
    effective_compression = target_compression * difficulty_factor

    # Determine number of stages
    if effective_compression <= 4:
        # Single-stage distillation sufficient
        num_stages = 1
        compression_per_stage = effective_compression
    elif effective_compression <= 16:
        # Two-stage distillation
        num_stages = 2
        compression_per_stage = effective_compression ** 0.5
    else:
        # Three-stage distillation (or warn that compression is too aggressive)
        num_stages = 3
        compression_per_stage = effective_compression ** (1/3)

        if quality_tolerance < 0.15:  # <15pp drop
            print(f"WARNING: {target_compression}× compression may exceed quality tolerance")
            print(f"Consider: Target compression {target_compression/2:.1f}× instead")

    # Calculate final student capacity
    student_params = int(teacher_params / target_compression)

    return student_params, num_stages, compression_per_stage

# Example usage
teacher_params = 774_000_000  # GPT-2 Large

# Conservative compression (2× - safe)
student_params, stages, per_stage = calculate_optimal_student_capacity(
    teacher_params=teacher_params,
    target_compression=2.0,
    quality_tolerance=0.03,  # Accept 3pp drop
    architecture_type="transformer"
)
print(f"2× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 387M params, 1 stage, 2.0× per stage

# Moderate compression (6× - requires planning)
student_params, stages, per_stage = calculate_optimal_student_capacity(
    teacher_params=teacher_params,
    target_compression=6.0,
    quality_tolerance=0.10,  # Accept 10pp drop
    architecture_type="transformer"
)
print(f"6× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 129M params, 2 stages, 2.4× per stage

# Aggressive compression (15× - risky)
student_params, stages, per_stage = calculate_optimal_student_capacity(
    teacher_params=teacher_params,
    target_compression=15.0,
    quality_tolerance=0.20,  # Accept 20pp drop
    architecture_type="transformer"
)
print(f"15× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 52M params, 2 stages, 3.9× per stage (aggressive; expect a significant quality drop)
```

## Part 3: Low-Rank Decomposition

### Singular Value Decomposition (SVD) for Linear Layers

**Problem:** Large weight matrices (e.g., 4096×4096 in transformers) consume memory.

**Solution:** Decompose each into two smaller matrices (low-rank factorization).
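
**Parameter arithmetic.** A weight matrix $W \in \mathbb{R}^{m \times n}$ stores $mn$ values; its rank-$r$ factorization $W \approx (U_r\Sigma_r)\,V_r^{\top}$ stores $r(m+n)$. The factorization therefore only saves memory when

$$
r(m+n) < mn \quad\Longleftrightarrow\quad r < \frac{mn}{m+n}.
$$

For a square 4096×4096 layer, the kept rank must be below 2048 just to break even, and a 4× parameter reduction needs $r \approx 512$ — which is why the example below keeps 12.5% of the full rank rather than 50%.
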
```python
import torch
import torch.nn as nn

def decompose_linear_layer_svd(layer, rank_ratio=0.5):
    """
    Decompose a linear layer using SVD (low-rank approximation).

    WHY: Large matrix W (m×n) → two smaller matrices U (m×r) and V (r×n)
    WHY it works: Weight matrices often have low effective rank (redundancy)

    Example: Linear(4096, 4096) keeping rank 512 (rank_ratio=0.125)
    - Original: 16.8M parameters (4096×4096)
    - Decomposed: ~4.2M parameters (4096×512 + 512×4096) - 4× reduction!
    - NOTE: a square n×n layer only shrinks when the kept rank is below n/2
      (see the parameter arithmetic above)

    Args:
        layer: nn.Linear layer to decompose
        rank_ratio: Fraction of the original rank to keep (0.5 = keep 50%)

    Returns:
        Sequential module with two linear layers (approximates the original)
    """
    # Get weight matrix
    W = layer.weight.data  # Shape: (out_features, in_features)
    bias = layer.bias.data if layer.bias is not None else None

    # Perform SVD: W = U @ S @ V^T
    # WHY SVD: Optimal low-rank approximation (minimizes reconstruction error)
    U, S, Vt = torch.linalg.svd(W, full_matrices=False)

    # Determine rank to keep
    original_rank = min(W.shape)
    target_rank = int(original_rank * rank_ratio)

    print(f"Original rank: {original_rank}, Target rank: {target_rank}")

    # Truncate to target rank
    # WHY keep the largest singular values: They capture most of the information
    U_k = U[:, :target_rank]    # Shape: (out_features, target_rank)
    S_k = S[:target_rank]       # Shape: (target_rank,)
    Vt_k = Vt[:target_rank, :]  # Shape: (target_rank, in_features)

    # Create two linear layers: W ≈ U_k @ diag(S_k) @ Vt_k
    # Layer 1: Linear(in_features, target_rank) with weights Vt_k
    # Layer 2: Linear(target_rank, out_features) with weights U_k @ diag(S_k)
    layer1 = nn.Linear(W.shape[1], target_rank, bias=False)
    layer1.weight.data = Vt_k

    layer2 = nn.Linear(target_rank, W.shape[0], bias=(bias is not None))
    layer2.weight.data = U_k * S_k.unsqueeze(0)  # Incorporate S into the second layer
    if bias is not None:
        layer2.bias.data = bias

    # Return sequential module (approximates the original layer)
    return nn.Sequential(layer1, layer2)

# Example: Decompose a large transformer feedforward layer
original_layer = nn.Linear(4096, 4096)
print(f"Original params: {count_parameters(original_layer):,}")  # 16,781,312

# Keep 12.5% of the rank (512 of 4096) for a 4× parameter reduction
decomposed_layer = decompose_linear_layer_svd(original_layer, rank_ratio=0.125)
print(f"Decomposed params: {count_parameters(decomposed_layer):,}")  # 4,198,400 (4× reduction!)

# Verify reconstruction quality
x = torch.randn(1, 128, 4096)  # Example input
y_original = original_layer(x)
y_decomposed = decomposed_layer(x)

reconstruction_error = torch.norm(y_original - y_decomposed) / torch.norm(y_original)
print(f"Reconstruction error: {reconstruction_error.item():.4f}")
# NOTE: a freshly initialized layer is close to full-rank, so this error will be
# sizeable; trained weight matrices are usually far more compressible
```

### Apply SVD to the Entire Model

```python
def decompose_model_svd(model, rank_ratio=0.5, layer_threshold=1024):
    """
    Apply SVD decomposition to all large linear layers in a model.

    WHY selective: Only decompose large layers (small layers don't benefit)
    WHY threshold: Layers with <1024 input/output features are too small to benefit

    Args:
        model: Model to compress
        rank_ratio: Fraction of the rank to keep (smaller = more compression; a layer
            only shrinks when kept rank < in·out / (in + out))
        layer_threshold: Minimum layer size to decompose (skip small layers)

    Returns:
        Model with decomposed layers
    """
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            # Only decompose large layers
            if module.in_features >= layer_threshold and module.out_features >= layer_threshold:
                print(f"Decomposing {name}: {module.in_features}×{module.out_features}")

                # Decompose layer
                decomposed = decompose_linear_layer_svd(module, rank_ratio=rank_ratio)

                # Replace in model
                setattr(model, name, decomposed)

        elif len(list(module.children())) > 0:
            # Recursively decompose nested modules
            decompose_model_svd(module, rank_ratio, layer_threshold)

    return model

# Example: Compress a transformer model
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
original_params = count_parameters(model)
print(f"Original params: {original_params:,}")  # 109M

# Apply SVD (50% rank) to feedforward layers
model_compressed = decompose_model_svd(model, rank_ratio=0.5, layer_threshold=512)
compressed_params = count_parameters(model_compressed)
print(f"Compressed params: {compressed_params:,}")
# Roughly 85-90M (~1.25× reduction): at 50% rank only the rectangular 768×3072
# feedforward layers shrink; the square 768×768 attention projections break even

# Fine-tune to recover accuracy
# WHY: Low-rank approximation introduces small errors, fine-tuning compensates
fine_tune(model_compressed, epochs=3, lr=2e-5)
```

## Part 4: Combined Compression Pipelines

### Quantization + Pruning + Distillation

**Problem:** A single technique is insufficient for aggressive compression (e.g., 20× for mobile).

**Solution:** Combine multiple techniques (multiplicative compression).

```python
def full_compression_pipeline(
    teacher_model,
    target_compression=20,
    deployment_target="mobile"
):
    """
    Combined compression pipeline for maximum compression.

    WHY combine techniques: Multiplicative compression
    - Quantization: 4× reduction (FP32 → INT8)
    - Pruning: 2× reduction (50% structured pruning)
    - Distillation: 2.5× reduction (progressive distillation)
    - Total: 4 × 2 × 2.5 = 20× reduction!

    Pipeline order:
    1. Knowledge distillation (preserve quality first)
    2. Structured pruning (remove redundancy)
    3. Quantization (reduce precision last)

    WHY this order:
    - Distillation first: Creates a smaller model with quality preservation
    - Pruning second: Removes redundancy from the distilled model
    - Quantization last: Works well on an already-compressed model

    Args:
        teacher_model: Large pre-trained model to compress
        target_compression: Desired compression ratio (e.g., 20 for 20× smaller)
        deployment_target: "mobile", "edge", "server"

    Returns:
        Fully compressed model ready for deployment
    """
    print(f"=== Full Compression Pipeline (target: {target_compression}× reduction) ===\n")

    # Original model metrics
    original_size = get_model_size(teacher_model)
    original_params = count_parameters(teacher_model)
    original_acc = evaluate(teacher_model)
    print(f"Original: {original_params:,} params, {original_size:.1f}MB, {original_acc:.2f}% acc")

    # Step 1: Knowledge Distillation (2-2.5× compression)
    # WHY first: Preserves quality better than pruning the teacher directly
    print("\n--- Step 1: Knowledge Distillation ---")

    distillation_ratio = min(2.5, target_compression ** (1/3))  # Allocate ~1/3 of compression
    student_model = progressive_distillation(
        teacher=teacher_model,
        num_stages=2,
        compression_per_stage=distillation_ratio ** 0.5,
        distill_epochs=10,
        finetune_epochs=3
    )

    student_size = get_model_size(student_model)
    student_params = count_parameters(student_model)
    student_acc = evaluate(student_model)
    print(f"After distillation: {student_params:,} params, {student_size:.1f}MB, {student_acc:.2f}% acc")
    print(f"Compression: {original_size/student_size:.1f}×")

    # Step 2: Structured Pruning (1.5-2× compression)
    # WHY after distillation: Prune the smaller model (easier to maintain quality)
    # NOTE: iterative_pruning above uses unstructured magnitude pruning; for the
    # size and latency gains claimed here, substitute a structured variant
    # (e.g., the torch_pruning channel pruner from Part 1) in the same iterative loop
    print("\n--- Step 2: Structured Pruning ---")

    pruning_ratio = min(0.5, 1 - 1/(target_compression ** (1/3)))  # Allocate ~1/3 of compression
    pruned_model = iterative_pruning(
        model=student_model,
        target_ratio=pruning_ratio,
        num_iterations=5,
        finetune_epochs=2
    )

    pruned_size = get_model_size(pruned_model)
    pruned_params = count_parameters(pruned_model)
    pruned_acc = evaluate(pruned_model)
    print(f"After pruning: {pruned_params:,} params, {pruned_size:.1f}MB, {pruned_acc:.2f}% acc")
    print(f"Compression: {original_size/pruned_size:.1f}×")

    # Step 3: Quantization (4× compression)
    # WHY last: Works well on an already-compressed model, easy to apply
    # NOTE: eager-mode QAT as sketched here suits CNN-style models; for transformers,
    # dynamic quantization (torch.quantization.quantize_dynamic) is often the simpler path
    print("\n--- Step 3: Quantization (INT8) ---")

    # Quantization-aware training (model must be in train mode for prepare_qat)
    pruned_model.train()
    pruned_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    model_prepared = torch.quantization.prepare_qat(pruned_model)

    # Fine-tune with fake quantization
    fine_tune(model_prepared, epochs=3, lr=1e-4)

    # Convert to INT8
    model_prepared.eval()
    quantized_model = torch.quantization.convert(model_prepared)

    # NOTE: get_model_size assumes FP32 parameters; for an INT8 model, compare
    # serialized checkpoint sizes to see the ~4× reduction
    quantized_size = get_model_size(quantized_model)
    quantized_acc = evaluate(quantized_model)
    print(f"After quantization: {quantized_size:.1f}MB, {quantized_acc:.2f}% acc")
    print(f"Total compression: {original_size/quantized_size:.1f}×")

    # Summary
    print("\n=== Compression Pipeline Summary ===")
    print(f"Original:  {original_size:.1f}MB, {original_acc:.2f}% acc")
    print(f"Distilled: {student_size:.1f}MB, {student_acc:.2f}% acc ({original_size/student_size:.1f}×)")
    print(f"Pruned:    {pruned_size:.1f}MB, {pruned_acc:.2f}% acc ({original_size/pruned_size:.1f}×)")
    print(f"Quantized: {quantized_size:.1f}MB, {quantized_acc:.2f}% acc ({original_size/quantized_size:.1f}×)")
    print(f"\nFinal compression: {original_size/quantized_size:.1f}× (target: {target_compression}×)")
    print(f"Accuracy drop: {original_acc - quantized_acc:.2f}pp")

    # Deployment checks
    if deployment_target == "mobile":
        assert quantized_size <= 100, f"Model too large for mobile ({quantized_size:.1f}MB > 100MB)"
        assert quantized_acc >= original_acc - 5, f"Quality loss too high ({original_acc - quantized_acc:.2f}pp)"
        print("\n✓ Ready for mobile deployment")

    return quantized_model

# Example: Compress BERT for mobile deployment
from transformers import BertForSequenceClassification

teacher = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Target: 20× compression (440MB → 22MB)
compressed_model = full_compression_pipeline(
    teacher_model=teacher,
    target_compression=20,
    deployment_target="mobile"
)

# Results:
# - Original:  440MB, 95.2% acc
# - Distilled: 180MB, 93.5% acc (2.4× compression)
# - Pruned:    90MB, 92.8% acc (4.9× compression)
# - Quantized: 22MB, 92.1% acc (20× compression!)
# - Accuracy drop: 3.1pp (acceptable for mobile)
```

## Part 5: Architecture Optimization

### Neural Architecture Search for Efficiency

**Problem:** Manual architecture design for compression is time-consuming.

**Solution:** Automated search for efficient architectures (NAS for compression).

```python
def efficient_architecture_search(
    task_type,
    target_latency_ms,
    target_accuracy,
    search_space="mobilenet"
):
    """
    Search for an efficient architecture meeting constraints.

    WHY NAS: Automated discovery of architectures optimized for efficiency
    WHY this search space: MobileNet and EfficientNet are designed for edge deployment

    Search strategies:
    - Width multiplier: Scale number of channels (0.5× - 1.5×)
    - Depth multiplier: Scale number of layers (0.75× - 1.25×)
    - Resolution multiplier: Scale input resolution (128px - 384px)

    Args:
        task_type: "classification", "detection", "segmentation"
        target_latency_ms: Maximum inference latency (ms)
        target_accuracy: Minimum acceptable accuracy
        search_space: "mobilenet", "efficientnet", "custom"

    Returns:
        Optimal architecture configuration
    """
    # Example: MobileNetV3 search space
    # WHY MobileNet: Designed for mobile (depthwise separable convolutions)
    from torchvision.models import mobilenet_v3_small, mobilenet_v3_large

    configurations = [
        {
            "model": mobilenet_v3_small,
            "width_multiplier": 0.5,
            "expected_latency": 8,      # ms on mobile CPU
            "expected_accuracy": 60.5   # ImageNet top-1
        },
        {
            "model": mobilenet_v3_small,
            "width_multiplier": 1.0,
            "expected_latency": 15,
            "expected_accuracy": 67.4
        },
        {
            "model": mobilenet_v3_large,
            "width_multiplier": 0.75,
            "expected_latency": 25,
            "expected_accuracy": 73.3
        },
        {
            "model": mobilenet_v3_large,
            "width_multiplier": 1.0,
            "expected_latency": 35,
            "expected_accuracy": 75.2
        }
    ]

    # Find configurations meeting constraints
    # WHY filter: Only consider configs within the latency budget and accuracy requirement
    valid_configs = [
        config for config in configurations
        if config["expected_latency"] <= target_latency_ms
        and config["expected_accuracy"] >= target_accuracy
    ]

    if not valid_configs:
        print(f"No configuration meets constraints (latency={target_latency_ms}ms, accuracy={target_accuracy}%)")
        print("Consider: Relax constraints or use custom search")
        return None

    # Select best (highest accuracy within latency budget)
    best_config = max(valid_configs, key=lambda c: c["expected_accuracy"])

    print(f"Selected: {best_config['model'].__name__} (width={best_config['width_multiplier']})")
    print(f"Expected: {best_config['expected_latency']}ms, {best_config['expected_accuracy']}% acc")

    return best_config

# Example usage: Find an architecture for mobile deployment
config = efficient_architecture_search(
    task_type="classification",
    target_latency_ms=30,    # 30ms latency budget
    target_accuracy=70.0,    # Minimum 70% accuracy
    search_space="mobilenet"
)
# Output: Selected MobileNetV3-Large (width=0.75)
# Expected: 25ms latency, 73.3% accuracy
# Meets both constraints!
```

## Part 6: Quality Preservation Strategies

### Trade-off Analysis Framework

```python
def analyze_compression_tradeoffs(
    model,
    compression_techniques,
    deployment_constraints
):
    """
    Analyze compression technique trade-offs.

    WHY: Different techniques have different trade-offs
    - Quantization: Best size/speed, minimal quality loss (0.5-2pp)
    - Pruning: Good size/speed, moderate quality loss (2-5pp)
    - Distillation: Excellent quality, requires training time

    Args:
        model: Model to compress
        compression_techniques: List of techniques to try
        deployment_constraints: Dict with size_mb, latency_ms, accuracy_min

    Returns:
        Recommended technique and expected metrics
    """
    results = []

    # Quantization (FP32 → INT8)
    if "quantization" in compression_techniques:
        results.append({
            "technique": "quantization",
            "compression_ratio": 4.0,
            "expected_accuracy_drop": 0.5,  # 0.5-2pp with QAT
            "training_time_hours": 2,       # QAT training
            "complexity": "low"
        })

    # Structured pruning (50%)
    if "pruning" in compression_techniques:
        results.append({
            "technique": "structured_pruning_50%",
            "compression_ratio": 2.0,
            "expected_accuracy_drop": 2.5,  # 2-5pp with iterative pruning
            "training_time_hours": 8,       # Iterative pruning + fine-tuning
            "complexity": "medium"
        })

    # Knowledge distillation (2× compression)
    if "distillation" in compression_techniques:
        results.append({
            "technique": "distillation_2x",
            "compression_ratio": 2.0,
            "expected_accuracy_drop": 1.5,  # 1-3pp
            "training_time_hours": 20,      # Full distillation training
            "complexity": "high"
        })

    # Combined (quantization + pruning)
    if "combined" in compression_techniques:
        results.append({
            "technique": "quantization+pruning",
            "compression_ratio": 8.0,       # 4× × 2× = 8×
            "expected_accuracy_drop": 3.5,  # Additive: 0.5 + 2.5 + interaction
            "training_time_hours": 12,      # Pruning + QAT
            "complexity": "high"
        })

    # Filter by constraints
    original_size = get_model_size(model)
    original_acc = evaluate(model)

    valid_techniques = [
        r for r in results
        if (original_size / r["compression_ratio"]) <= deployment_constraints["size_mb"]
        and (original_acc - r["expected_accuracy_drop"]) >= deployment_constraints["accuracy_min"]
    ]

    if not valid_techniques:
        print("No technique meets all constraints")
        return None

    # Recommend technique (prioritize: best quality, then fastest training, then simplest)
    complexity_order = {"low": 0, "medium": 1, "high": 2}
    best = min(
        valid_techniques,
        key=lambda r: (r["expected_accuracy_drop"], r["training_time_hours"], complexity_order[r["complexity"]])
    )

    print(f"Recommended: {best['technique']}")
    print(f"Expected: {original_size/best['compression_ratio']:.1f}MB (from {original_size:.1f}MB)")
    print(f"Accuracy: {original_acc - best['expected_accuracy_drop']:.1f}% (drop: {best['expected_accuracy_drop']}pp)")
    print(f"Training time: {best['training_time_hours']} hours")

    return best

# Example usage
deployment_constraints = {
    "size_mb": 50,        # Model must be <50MB
    "latency_ms": 100,    # <100ms inference
    "accuracy_min": 90.0  # >90% accuracy
}

recommendation = analyze_compression_tradeoffs(
    model=my_model,  # my_model: the trained FP32 model you want to compress
    compression_techniques=["quantization", "pruning", "distillation", "combined"],
    deployment_constraints=deployment_constraints
)
```

## Common Mistakes to Avoid

| Mistake | Why It's Wrong | Correct Approach |
|---------|----------------|------------------|
| "Pruning works for all architectures" | Destroys transformer attention | Use distillation for transformers |
better" | 77× compression produces gibberish | Progressive distillation for >4× | | "Unstructured pruning speeds up inference" | No speedup on standard hardware | Use structured pruning (channel/layer) | | "Quantize and deploy immediately" | 5pp accuracy drop without recovery | QAT + fine-tuning for quality preservation | | "Single technique is enough" | Can't reach aggressive targets (20×) | Combine: quantization + pruning + distillation | | "Skip fine-tuning to save time" | Preventable accuracy loss | Always include recovery step | ## Success Criteria You've correctly compressed a model when: ✅ Selected appropriate technique for architecture (distillation for transformers, pruning for CNNs) ✅ Matched student capacity to compression target (2-4× per stage, progressive for >4×) ✅ Used structured pruning for standard hardware (actual speedup) ✅ Applied iterative/progressive compression (quality preservation) ✅ Included accuracy recovery (QAT, fine-tuning, calibration) ✅ Achieved target compression with acceptable quality loss (<5pp for most tasks) ✅ Verified deployment constraints (size, latency, accuracy) are met ## References **Key papers:** - DistilBERT (Sanh et al., 2019): Knowledge distillation for transformers - The Lottery Ticket Hypothesis (Frankle & Carbin, 2019): Iterative magnitude pruning - Pruning Filters for Efficient ConvNets (Li et al., 2017): Structured channel pruning - Deep Compression (Han et al., 2016): Pruning + quantization + Huffman coding **When to combine with other skills:** - Use with quantization-for-inference: Quantization (4×) + compression (2-5×) = 8-20× total - Use with hardware-optimization-strategies: Optimize compressed model for target hardware - Use with model-serving-patterns: Deploy compressed model with batching/caching