# Hardware Optimization Strategies

## Overview

This skill provides a systematic methodology for optimizing ML model inference performance on specific hardware platforms. It covers GPU optimization (CUDA, TensorRT), CPU optimization (threading, SIMD), and edge optimization (ARM, quantization), with an emphasis on profiling-driven optimization and hardware-appropriate technique selection.

**Core Principle**: Profile first to identify bottlenecks, then apply hardware-specific optimizations. Different hardware requires different strategies - GPUs benefit from larger batch sizes and operator fusion, CPUs from threading and SIMD, edge devices from quantization and smaller model architectures.

## When to Use

Use this skill when:

- Model inference performance depends on hardware utilization (not just model architecture)
- You need to optimize for specific hardware: NVIDIA GPU, Intel/AMD CPU, ARM edge devices
- Profiling shows the model itself is the serving bottleneck (vs. data loading or preprocessing)
- You want to maximize throughput or minimize latency on given hardware
- You are deploying to resource-constrained edge devices
- User mentions: "optimize for GPU", "CPU inference slow", "edge deployment", "TensorRT", "ONNX Runtime", "batch size tuning"

**Don't use for**:

- Training optimization → use `training-optimization` pack
- Model architecture selection → use `neural-architectures`
- Model compression (pruning, distillation) → use `model-compression-techniques`
- Quantization specifically → use `quantization-for-inference`
- Serving infrastructure → use `model-serving-patterns`

**Boundary with quantization-for-inference**:

- This skill covers hardware-aware quantization deployment (INT8 on CPU vs GPU, ARM NEON)
- `quantization-for-inference` covers quantization techniques (PTQ, QAT, calibration)
- Use both when quantization is part of the hardware optimization strategy

## Core Methodology

### Step 1: Profile to Identify Bottlenecks

**ALWAYS profile before optimizing**. Don't guess where time is spent.

#### PyTorch Profiler (Comprehensive)

```python
import torch
from torch.profiler import profile, ProfilerActivity, record_function

model = load_model().cuda().eval()

# Profile inference
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("inference"):
        with torch.no_grad():
            output = model(input_tensor.cuda())

# Print results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))

# Export for visualization
prof.export_chrome_trace("trace.json")  # View in chrome://tracing
```

**What to look for**:

- **CPU time high**: Data preprocessing, Python overhead, CPU-bound ops
- **CUDA time high**: Model compute is the bottleneck; optimize model inference
- **Memory**: Check for out-of-memory issues or unnecessary allocations
- **Operator breakdown**: Which layers/ops are slowest?
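To turn the table output into a quick bottleneck decision, a short summary like the sketch below can help. It only reuses the `prof` object from the example above; the exact event attribute names can vary slightly across PyTorch versions, so treat this as an illustrative check rather than a fixed recipe.

```python
# Sketch: summarize self CPU vs self CUDA time from the profile captured above.
events = prof.key_averages()
total_cpu_ms = sum(e.self_cpu_time_total for e in events) / 1000
total_cuda_ms = sum(e.self_cuda_time_total for e in events) / 1000

print(f"Self CPU time:  {total_cpu_ms:.1f} ms")
print(f"Self CUDA time: {total_cuda_ms:.1f} ms")

if total_cpu_ms > total_cuda_ms:
    print("CPU-side work dominates: check preprocessing, Python overhead, and host-device transfers.")
else:
    print("GPU kernels dominate: focus on model-level optimizations (batching, FP16, TensorRT).")
```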
#### NVIDIA Profiling Tools

```bash
# Nsight Systems - high-level timeline
nsys profile -o output python inference.py

# Nsight Compute - kernel-level profiling
ncu --set full -o kernel_profile python inference.py

# Simple nvidia-smi monitoring
nvidia-smi dmon -s u -i 0  # Monitor GPU utilization
```

#### Intel VTune (CPU Profiling)

```bash
# Profile CPU bottlenecks
vtune -collect hotspots -r vtune_results -- python inference.py

# Analyze results
vtune-gui vtune_results
```

#### Simple Timing

```python
import time
import torch

def profile_pipeline(model, input_data, device='cuda'):
    """Profile each stage of the inference pipeline.

    preprocess() and postprocess() are placeholders for your pipeline's own steps.
    """
    # Warmup
    for _ in range(10):
        with torch.no_grad():
            _ = model(preprocess(input_data).to(device))
    if device == 'cuda':
        torch.cuda.synchronize()  # Critical for accurate GPU timing

    # Profile preprocessing
    t0 = time.time()
    preprocessed = preprocess(input_data)
    t1 = time.time()

    # Profile model inference
    preprocessed = preprocessed.to(device)
    if device == 'cuda':
        torch.cuda.synchronize()
    t2 = time.time()
    with torch.no_grad():
        output = model(preprocessed)
    if device == 'cuda':
        torch.cuda.synchronize()
    t3 = time.time()

    # Profile postprocessing
    result = postprocess(output.cpu())
    t4 = time.time()

    print(f"Preprocessing:   {(t1-t0)*1000:.2f}ms")
    print(f"Model Inference: {(t3-t2)*1000:.2f}ms")
    print(f"Postprocessing:  {(t4-t3)*1000:.2f}ms")
    print(f"Total:           {(t4-t0)*1000:.2f}ms")

    return {
        'preprocess': (t1-t0)*1000,
        'inference': (t3-t2)*1000,
        'postprocess': (t4-t3)*1000,
    }
```

**Critical**: Always call `torch.cuda.synchronize()` before reading the clock around GPU operations; otherwise you measure kernel launch time, not execution time.

### Step 2: Select Hardware-Appropriate Optimizations

Based on the profiling results and the target hardware, select appropriate optimization strategies from the sections below.

## GPU Optimization (NVIDIA CUDA)

### Strategy 1: TensorRT (2-5x Speedup for CNNs/Transformers)

**When to use**:

- NVIDIA GPU (T4, V100, A100, RTX series)
- Model architecture supported (CNN, Transformer, RNN)
- Inference-only workload (not training)
- Want automatic optimization (fusion, precision, kernels)

**Best for**: Production deployment on NVIDIA GPUs, predictable performance gains

```python
import torch
import torch_tensorrt

# Load PyTorch model
model = load_model().eval().cuda()

# Compile to TensorRT
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input(
        min_shape=[1, 3, 224, 224],   # Minimum batch size
        opt_shape=[8, 3, 224, 224],   # Optimal batch size
        max_shape=[32, 3, 224, 224],  # Maximum batch size
        dtype=torch.float16
    )],
    enabled_precisions={torch.float16},  # Use FP16
    workspace_size=1 << 30,              # 1GB workspace for optimization
    truncate_long_and_double=True
)

# Save compiled model
torch.jit.save(trt_model, "model_trt.ts")

# Inference (same API as PyTorch)
with torch.no_grad():
    output = trt_model(input_tensor.cuda())
```

**What TensorRT does**:

1. **Operator fusion**: Combines conv + bn + relu into a single kernel
2. **Precision calibration**: Automatic mixed precision (FP16/INT8)
3. **Kernel auto-tuning**: Selects the fastest CUDA kernel for each op
4. **Memory optimization**: Reduces memory transfers
5. **Graph optimization**: Removes unnecessary operations

**Limitations**:

- Only supports NVIDIA GPUs
- Some custom ops may not be supported
- Compilation time (minutes for large models)
- Fixed input shapes (or a min/opt/max range)

**Troubleshooting**:

```python
# If compilation fails, try:

# 1. Enable verbose logging
import logging
logging.getLogger("torch_tensorrt").setLevel(logging.DEBUG)

# 2. Allow unsupported layers to fall back to PyTorch
trt_model = torch_tensorrt.compile(
    model,
    inputs=[...],
    enabled_precisions={torch.float16},
    require_full_compilation=False  # Unsupported ops stay in PyTorch instead of failing
)

# 3. Surface warnings about unsupported ops
torch_tensorrt.logging.set_reportable_log_level(torch_tensorrt.logging.Level.Warning)
```
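Whichever backend you choose, verify the claimed speedup on your own model and inputs. The helper below is a minimal sketch (not part of any library) that times a baseline model against its optimized counterpart; it assumes both run on CUDA with identical input shapes.

```python
import time
import torch

def compare_latency(baseline, optimized, example_input, num_runs=100):
    """Sketch: measure the speedup of a compiled/optimized model over the original."""
    def bench(m):
        with torch.no_grad():
            for _ in range(10):      # warmup (also triggers compilation/engine caching)
                m(example_input)
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(num_runs):
                m(example_input)
            torch.cuda.synchronize()
        return (time.time() - start) / num_runs * 1000  # ms per call

    base_ms, opt_ms = bench(baseline), bench(optimized)
    print(f"baseline: {base_ms:.2f}ms, optimized: {opt_ms:.2f}ms, speedup: {base_ms / opt_ms:.2f}x")

# Example usage: compare_latency(model, trt_model, input_tensor.cuda())
```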
### Strategy 2: torch.compile() (PyTorch 2.0+ - Easy 1.5-2x Speedup)

**When to use**:

- PyTorch 2.0+ available
- Want easy optimization without complexity
- Model has custom operations (TensorRT may not support them)
- Rapid prototyping (faster than TensorRT compilation)

**Best for**: Quick wins, development iteration, custom models

```python
import torch

model = load_model().eval().cuda()

# Compile with the default backend (inductor)
compiled_model = torch.compile(model)

# Compile with a specific mode
compiled_model = torch.compile(
    model,
    mode="reduce-overhead",  # Options: "default", "reduce-overhead", "max-autotune"
    fullgraph=True,          # Compile the entire graph (vs subgraphs)
)

# First run compiles (slow), subsequent runs are fast
with torch.no_grad():
    output = compiled_model(input_tensor.cuda())
```

**Modes**:

- `default`: Balanced compilation time and runtime performance
- `reduce-overhead`: Minimize Python overhead (best for small models)
- `max-autotune`: Maximum optimization (long compilation, best runtime)

**What torch.compile() does**:

1. **Operator fusion**: Similar to TensorRT
2. **Python overhead reduction**: Removes Python interpreter overhead
3. **Memory optimization**: Reduces allocations
4. **CUDA graph generation**: For fixed-size models

**Advantages over TensorRT**:

- Easier to use (one line of code)
- Supports custom operations
- Faster compilation
- No fixed input shapes

**Disadvantages vs TensorRT**:

- Smaller speedup (1.5-2x vs 2-5x)
- Less mature (newer feature)

### Strategy 3: Mixed Precision (FP16 - Easy 2x Speedup)

**When to use**:

- NVIDIA GPU with Tensor Cores (V100, A100, T4, RTX)
- Model doesn't require FP32 precision
- Want a simple optimization (minimal code change)
- Memory-bound models (FP16 uses half the memory)

```python
import torch
from torch.cuda.amp import autocast

# Option A: convert the weights to FP16 and feed FP16 inputs
model = load_model().eval().cuda().half()
with torch.no_grad():
    output = model(input_tensor.cuda().half())

# Option B: keep FP32 weights and let autocast run eligible ops in FP16
model_fp32 = load_model().eval().cuda()
with torch.no_grad(), autocast():
    output = model_fp32(input_tensor.cuda())
```

**Caution**: Some models lose accuracy with FP16. Test accuracy before deploying.

```python
import copy
import torch

# Validate FP16 accuracy
def validate_fp16_accuracy(model, test_loader, tolerance=0.01):
    # Use deep copies: .float()/.half() modify a module in place,
    # so converting the same object twice would compare FP16 with itself.
    model_fp32 = copy.deepcopy(model).float().cuda().eval()
    model_fp16 = copy.deepcopy(model).half().cuda().eval()

    diffs = []
    for inputs, _ in test_loader:
        with torch.no_grad():
            output_fp32 = model_fp32(inputs.cuda().float())
            output_fp16 = model_fp16(inputs.cuda().half())
        diff = (output_fp32 - output_fp16.float()).abs().mean().item()
        diffs.append(diff)

    avg_diff = sum(diffs) / len(diffs)
    print(f"Average FP32-FP16 difference: {avg_diff:.6f}")
    if avg_diff > tolerance:
        print(f"WARNING: FP16 accuracy loss exceeds tolerance ({tolerance})")
        return False
    return True
```

### Strategy 4: Batch Size Tuning

**When to use**: Always! Batch size is the most important parameter for GPU throughput.
**Trade-off**:

- **Larger batch** = higher throughput, higher latency, more memory
- **Smaller batch** = lower latency, lower throughput, less memory

#### Find Optimal Batch Size

```python
import torch

def find_optimal_batch_size(model, input_shape, device='cuda', max_memory_pct=0.9):
    """Double the batch size until GPU memory runs out (or gets close)."""
    model = model.to(device).eval()
    total_memory = torch.cuda.get_device_properties(device).total_memory

    batch_size = 1
    max_batch = 1024  # Upper bound
    last_good = 1

    while batch_size <= max_batch:
        try:
            torch.cuda.empty_cache()
            test_batch = torch.randn(batch_size, *input_shape).to(device)
            with torch.no_grad():
                _ = model(test_batch)

            # Fraction of total GPU memory currently allocated
            mem_frac = torch.cuda.memory_allocated(device) / total_memory
            print(f"Batch size {batch_size}: OK ({mem_frac*100:.1f}% memory)")
            last_good = batch_size

            if mem_frac > max_memory_pct:
                print(f"Batch size {batch_size}: near memory limit, stopping")
                break

            batch_size *= 2

        except RuntimeError as e:
            if "out of memory" in str(e):
                print(f"Batch size {batch_size}: OOM")
                break
            raise

    print(f"\nOptimal batch size: {last_good}")
    return last_good
```

#### Measure Latency vs Throughput

```python
import time
import torch

def benchmark_batch_sizes(model, input_shape, batch_sizes=(1, 4, 8, 16, 32, 64),
                          device='cuda', num_runs=100):
    """Measure latency and throughput for different batch sizes."""
    model = model.to(device).eval()
    results = []

    for batch_size in batch_sizes:
        try:
            test_batch = torch.randn(batch_size, *input_shape).to(device)

            # Warmup
            for _ in range(10):
                with torch.no_grad():
                    _ = model(test_batch)
            torch.cuda.synchronize()

            # Benchmark
            start = time.time()
            for _ in range(num_runs):
                with torch.no_grad():
                    _ = model(test_batch)
            torch.cuda.synchronize()
            elapsed = time.time() - start

            latency_per_batch = (elapsed / num_runs) * 1000      # ms
            throughput = (batch_size * num_runs) / elapsed       # samples/sec
            latency_per_sample = latency_per_batch / batch_size  # ms/sample

            results.append({
                'batch_size': batch_size,
                'latency_per_batch_ms': latency_per_batch,
                'latency_per_sample_ms': latency_per_sample,
                'throughput_samples_per_sec': throughput,
            })
            print(f"Batch {batch_size:3d}: {latency_per_batch:6.2f}ms/batch, "
                  f"{latency_per_sample:6.2f}ms/sample, {throughput:8.1f} samples/sec")

        except RuntimeError as e:
            if "out of memory" in str(e):
                print(f"Batch {batch_size:3d}: OOM")
                break
            raise

    return results
```

**Decision criteria**:

- **Online serving (real-time API)**: Use a small batch (1-8) for low latency
- **Batch serving**: Use a large batch (32-128) for high throughput
- **Dynamic batching**: Let the serving framework accumulate requests (TorchServe, Triton)
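To make that choice concrete, the sketch below (a hypothetical helper, not from any framework) picks the highest-throughput batch size whose per-batch latency still meets a latency SLA, using the result dictionaries returned by `benchmark_batch_sizes` above.

```python
def select_batch_size(results, latency_sla_ms):
    """Pick the highest-throughput batch size that still meets the latency SLA."""
    feasible = [r for r in results if r['latency_per_batch_ms'] <= latency_sla_ms]
    if not feasible:
        return None  # even batch size 1 violates the SLA; optimize the model itself
    best = max(feasible, key=lambda r: r['throughput_samples_per_sec'])
    return best['batch_size']

# Example: results = benchmark_batch_sizes(model, (3, 224, 224))
#          batch = select_batch_size(results, latency_sla_ms=50)
```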
### Strategy 5: CUDA Graphs (Fixed-Size Inputs - 20-30% Speedup)

**When to use**:

- Fixed input size (no dynamic shapes)
- Small models with many kernel launches
- Already optimized but want the last 20% of speedup

**What CUDA graphs do**: Record a sequence of CUDA operations once, then replay it without per-kernel CPU launch overhead.

```python
import torch

model = load_model().eval().cuda()

# Static input buffer (fixed size)
static_input = torch.randn(8, 3, 224, 224).cuda()

# Warmup
for _ in range(10):
    with torch.no_grad():
        _ = model(static_input)
torch.cuda.synchronize()

# Capture graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# Replay graph (very fast)
def inference_with_graph(input_tensor):
    # Copy input into the static buffer
    static_input.copy_(input_tensor)
    # Replay graph
    graph.replay()
    # Copy output out of the static buffer
    return static_output.clone()

# Benchmark
input_tensor = torch.randn(8, 3, 224, 224).cuda()
output = inference_with_graph(input_tensor)
```

**Limitations**:

- Fixed input/output shapes (no dynamic batching)
- No data-dependent control flow (if/else) in the model
- Adds complexity (static buffer management)

## CPU Optimization (Intel/AMD)

### Strategy 1: Threading Configuration (Critical for Multi-Core)

**When to use**: Always, for CPU inference on multi-core machines

**Problem**: Default thread settings (or container/OMP environment overrides) often leave cores idle.

```python
import os
import torch

# Check current configuration
print(f"Intra-op threads: {torch.get_num_threads()}")
print(f"Inter-op threads: {torch.get_num_interop_threads()}")

# Set to the number of physical cores (not hyperthreads)
num_cores = os.cpu_count() // 2   # Divide by 2 if hyperthreading is enabled
torch.set_num_threads(num_cores)  # Intra-op parallelism (within operations)
torch.set_num_interop_threads(1)  # Inter-op parallelism (between operations); disable to avoid oversubscription

# Verify
print(f"Set intra-op threads: {torch.get_num_threads()}")
```

**Intra-op vs inter-op**:

- **Intra-op**: Parallelizes a single operation (e.g., one matrix multiply uses 32 cores)
- **Inter-op**: Parallelizes independent operations (e.g., run conv1 and conv2 simultaneously)

**Best practice**:

- **Intra-op threads** = number of physical cores (lets each op use all cores)
- **Inter-op threads** = 1 (disable to avoid oversubscription and context switching)

**Warning**: If using a DataLoader with workers, account for those threads:

```python
num_cores = os.cpu_count() // 2
num_dataloader_workers = 4
torch.set_num_threads(num_cores - num_dataloader_workers)  # Leave cores for the DataLoader
```

### Strategy 2: MKLDNN/OneDNN Backend (Intel-Optimized Operations)

**When to use**: Intel CPUs (Xeon, Core i7/i9)

**What it does**: Uses Intel's optimized math libraries (AVX, AVX-512)

```python
import torch

# Enable MKLDNN
torch.backends.mkldnn.enabled = True

# Check if available
print(f"MKLDNN available: {torch.backends.mkldnn.is_available()}")

# Inference (automatically uses MKLDNN when beneficial)
model = load_model().eval()
with torch.no_grad():
    output = model(input_tensor)
```

**For maximum performance**: Use the channels-last memory format (better cache locality)

```python
model = model.eval()

# Convert to channels-last format (NHWC instead of NCHW)
model = model.to(memory_format=torch.channels_last)

# Input must be channels-last as well
input_tensor = input_tensor.to(memory_format=torch.channels_last)

with torch.no_grad():
    output = model(input_tensor)
```

**Speedup**: 1.5-2x on Intel CPUs with AVX-512
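Whether channels-last actually helps depends on the model and CPU, so it is worth a quick A/B measurement before committing to it. The function below is a minimal, illustrative comparison (not a library utility) that times the same model in both memory formats.

```python
import time
import torch

def compare_memory_formats(model, input_shape=(1, 3, 224, 224), num_runs=50):
    """Rough comparison of NCHW (contiguous) vs NHWC (channels_last) CPU latency."""
    model = model.eval()
    x = torch.randn(*input_shape)

    def bench(m, inp):
        with torch.no_grad():
            for _ in range(10):  # warmup
                m(inp)
            start = time.time()
            for _ in range(num_runs):
                m(inp)
        return (time.time() - start) / num_runs * 1000  # ms per call

    nchw_ms = bench(model, x)
    model_cl = model.to(memory_format=torch.channels_last)
    x_cl = x.to(memory_format=torch.channels_last)
    nhwc_ms = bench(model_cl, x_cl)

    print(f"contiguous (NCHW):    {nchw_ms:.2f} ms")
    print(f"channels_last (NHWC): {nhwc_ms:.2f} ms")
```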
### Strategy 3: ONNX Runtime (Best CPU Performance)

**When to use**:

- Dedicated CPU inference deployment
- Want the best possible CPU performance
- Model is fully supported by ONNX

**Advantages**:

- Optimized for CPU (MLAS, DNNL, OpenMP)
- Graph optimizations (fusion, constant folding)
- Quantization support (INT8)

```python
import os

import torch
import onnxruntime as ort

# Export PyTorch model to ONNX
model = load_model().eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=14,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

# Optional (transformer models only): apply the transformer-specific graph optimizer.
# num_heads/hidden_size must match your architecture; skip this step for CNNs.
import onnxruntime.transformers.optimizer as optimizer
optimized_model = optimizer.optimize_model("model.onnx", model_type='bert',
                                           num_heads=8, hidden_size=512)
optimized_model.save_model_to_file("model_optimized.onnx")

# Create inference session with optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = os.cpu_count() // 2
sess_options.inter_op_num_threads = 1

session = ort.InferenceSession(
    "model_optimized.onnx",
    sess_options,
    providers=['CPUExecutionProvider']  # Use CPU
)

# Inference
input_data = input_tensor.numpy()
output = session.run(None, {'input': input_data})[0]
```

**Expected speedup**: 2-3x over PyTorch CPU inference

### Strategy 4: OpenVINO (Intel-Specific - Best Performance)

**When to use**: Intel CPUs (Xeon, Core), want the absolute best CPU performance

**Advantages**:

- Intel-specific optimizations (AVX, AVX-512, VNNI)
- Best-in-class CPU inference performance
- Integrated optimization tools

```python
# Convert PyTorch to OpenVINO IR:
# first export to ONNX (as above), then run the Model Optimizer.
#
# Command-line conversion:
#   mo --input_model model.onnx --output_dir openvino_model --data_type FP16

# Python API
import numpy as np
from openvino.runtime import Core

# Load model
ie = Core()
model = ie.read_model(model="openvino_model/model.xml")
compiled_model = ie.compile_model(model=model, device_name="CPU")

# Inference
input_tensor = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = compiled_model([input_tensor])[0]
```

**Expected speedup**: 3-4x over PyTorch CPU inference on Intel CPUs

### Strategy 5: Batch Size for CPU

**Different from GPU**: Smaller batches are often better for CPU.

**Why**:

- CPU caches are much smaller than GPU memory
- Large batches may not fit in cache → cache misses → slower
- Diminishing returns from batching on CPU

**Recommendation**:

- Start with batch size 1-4
- Profile to find the optimum
- Don't assume large batches are better (unlike GPU)

```python
import time
import torch

# CPU batch size tuning
def find_optimal_cpu_batch(model, input_shape, max_batch=32):
    model = model.eval()
    results = []

    for batch_size in [1, 2, 4, 8, 16, 32]:
        if batch_size > max_batch:
            break

        test_input = torch.randn(batch_size, *input_shape)

        # Warmup
        for _ in range(10):
            with torch.no_grad():
                _ = model(test_input)

        # Benchmark
        start = time.time()
        for _ in range(100):
            with torch.no_grad():
                _ = model(test_input)
        elapsed = time.time() - start

        throughput = (batch_size * 100) / elapsed
        latency = (elapsed / 100) * 1000  # ms

        results.append({
            'batch_size': batch_size,
            'throughput': throughput,
            'latency_ms': latency,
        })
        print(f"Batch {batch_size}: {throughput:.1f} samples/sec, {latency:.2f}ms latency")

    return results
```

## Edge/ARM Optimization

### Strategy 1: INT8 Quantization (2-4x Speedup on ARM)

**When to use**: ARM CPU deployment (Raspberry Pi, mobile, edge devices)

**Why INT8 on ARM**:

- ARM NEON instructions accelerate INT8 operations
- 2-4x faster than FP32 on ARM CPUs
- 4x smaller model size (critical for edge devices)

```python
import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (easiest, no calibration data needed)
model = load_model().eval()
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # Dynamic quantization covers Linear/LSTM layers;
                        # Conv2d requires static quantization instead
    dtype=torch.qint8
)

# Save quantized model
torch.save(quantized_model.state_dict(), 'model_int8.pth')

# Inference (same API)
with torch.no_grad():
    output = quantized_model(input_tensor)
```

**For better accuracy**: Use static quantization with calibration (see the `quantization-for-inference` skill)
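To confirm what dynamic quantization actually changed, it can help to list which submodules were swapped for quantized implementations. The loop below is a small illustrative check (the exact quantized class names and module paths vary across PyTorch versions).

```python
# List modules that were replaced by quantized implementations.
for name, module in quantized_model.named_modules():
    if 'quantized' in type(module).__module__:
        print(f"{name}: {type(module).__name__}")

# Typically only Linear (and LSTM/GRU) layers show up here; Conv2d stays in FP32
# under dynamic quantization and needs static quantization to become INT8.
```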
### Strategy 2: TensorFlow Lite (Best for ARM/Mobile)

**When to use**:

- ARM edge devices (Raspberry Pi, Coral, mobile)
- Need maximum ARM performance
- Can convert the model to TensorFlow Lite

**Advantages**:

- XNNPACK backend (ARM NEON optimizations)
- Highly optimized for edge devices
- Delegate support (GPU, NPU on mobile)

```python
import numpy as np
import torch
import tensorflow as tf

# Convert PyTorch → ONNX → TensorFlow → TFLite

# Step 1: PyTorch → ONNX
torch.onnx.export(model, dummy_input, "model.onnx")

# Step 2: ONNX → TensorFlow (use onnx-tf)
import onnx
from onnx_tf.backend import prepare

onnx_model = onnx.load("model.onnx")
tf_rep = prepare(onnx_model)
tf_rep.export_graph("model_tf")

# Step 3: TensorFlow → TFLite with optimizations
converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Provide a representative dataset for calibration
def representative_dataset():
    for _ in range(100):
        yield [np.random.randn(1, 3, 224, 224).astype(np.float32)]

converter.representative_dataset = representative_dataset

tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
```

**Inference with TFLite**:

```python
import numpy as np
import tensorflow as tf

# Load TFLite model
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    num_threads=4  # Use all 4 cores on a Raspberry Pi
)
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Inference (input dtype must match the model, e.g. int8 for a fully quantized model)
input_data = np.random.randn(1, 3, 224, 224).astype(input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
```

**Expected speedup**: 3-5x over PyTorch on a Raspberry Pi

### Strategy 3: ONNX Runtime for ARM

**When to use**: ARM Linux (Raspberry Pi, Jetson Nano), simpler than TFLite

```python
import onnxruntime as ort

# Export to ONNX (as above)
torch.onnx.export(model, dummy_input, "model.onnx")

# Inference session with ARM optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4  # Raspberry Pi has 4 cores
sess_options.inter_op_num_threads = 1

session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=['CPUExecutionProvider']
)

# Inference
output = session.run(None, {'input': input_data.numpy()})[0]
```

**Quantize ONNX for ARM**:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
```

### Strategy 4: Model Architecture for Edge

**When to use**: Inference is still too slow even after quantization

**Consider smaller architectures**:

- MobileNetV3-Small instead of MobileNetV2
- EfficientNet-Lite instead of EfficientNet
- TinyBERT instead of BERT

**Trade-off**: Accuracy vs speed. Profile to find an acceptable balance.
```python
import time

import torch
from torch.quantization import quantize_dynamic
from torchvision import models

# Compare architectures on the edge device
architectures = [
    ('MobileNetV2', models.mobilenet_v2(pretrained=True)),
    ('MobileNetV3-Small', models.mobilenet_v3_small(pretrained=True)),
    ('EfficientNet-B0', models.efficientnet_b0(pretrained=True)),
]

for name, model in architectures:
    model = model.eval()

    # Quantize (dynamic quantization affects the Linear classifier head; convolutions stay FP32)
    quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

    # Benchmark
    input_tensor = torch.randn(1, 3, 224, 224)
    start = time.time()
    for _ in range(100):
        with torch.no_grad():
            _ = quantized_model(input_tensor)
    elapsed = time.time() - start

    print(f"{name}: {elapsed/100*1000:.2f}ms per inference")
```

## Hardware-Specific Decision Tree

### GPU (NVIDIA)

```
1. Profile with PyTorch profiler or nvidia-smi
   ↓
2. Is GPU utilization low (<50%)?  (a programmatic check is sketched after these trees)
   YES → Problem:
       - Batch size too small → Increase batch size
       - CPU preprocessing bottleneck → Move preprocessing to GPU or parallelize
       - CPU-GPU transfers → Minimize .cuda()/.cpu() calls
   NO → GPU is the bottleneck, optimize the model
   ↓
3. Apply optimizations in order:
   a. Increase batch size (measure latency/throughput trade-off)
   b. Mixed precision FP16 (easy 2x speedup if Tensor Cores available)
   c. torch.compile() (easy 1.5-2x speedup, PyTorch 2.0+)
   d. TensorRT (2-5x speedup, more effort)
   e. CUDA graphs (20-30% speedup for small models)
   ↓
4. Measure after each optimization
   ↓
5. If still not meeting requirements:
   - Consider quantization (INT8) → see quantization-for-inference skill
   - Consider model compression → see model-compression-techniques skill
   - Scale horizontally → add more GPU instances
```

### CPU (Intel/AMD)

```
1. Profile with PyTorch profiler or perf
   ↓
2. Check threading configuration
   - torch.get_num_threads() == num physical cores?
   - If not, set torch.set_num_threads(num_cores)
   ↓
3. Apply optimizations in order:
   a. Set intra-op threads to the number of physical cores
   b. Enable MKLDNN (Intel CPUs)
   c. Use channels-last memory format
   d. Try ONNX Runtime with graph optimizations
   e. If Intel CPU: try OpenVINO (best performance)
   ↓
4. Measure the batch size trade-off (smaller may be better for CPU)
   ↓
5. If still not meeting requirements:
   - Quantize to INT8 → 2-3x speedup on CPU
   - Consider model compression
   - Scale horizontally
```

### Edge/ARM

```
1. Profile on the target device (Raspberry Pi, etc.)
   ↓
2. Is inference >100ms per sample?
   YES → Model too large for the device
       - Try a smaller architecture (MobileNetV3-Small, EfficientNet-Lite)
       - If accuracy allows, use a smaller model
   NO → Optimize the current model
   ↓
3. Apply optimizations in order:
   a. Quantize to INT8 (2-4x speedup on ARM, critical!)
   b. Set num_threads to the device's CPU cores
   c. Convert to TensorFlow Lite with XNNPACK (best ARM performance)
      OR use ONNX Runtime with INT8
   ↓
4. Measure memory usage
   - Does the model fit in RAM?
   - If not, use a smaller model or offload to storage
   ↓
5. If still not meeting requirements:
   - Use a smaller model architecture
   - Consider model pruning
   - Hardware accelerator (Coral TPU, Jetson GPU)
```
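For step 2 of the GPU tree, utilization can also be checked from Python rather than by watching `nvidia-smi`. The sketch below assumes the `pynvml` bindings (the `nvidia-ml-py` package) are installed; adjust the device index for multi-GPU machines.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"GPU utilization: {util.gpu}%")
print(f"Memory used: {mem.used / mem.total * 100:.1f}%")

if util.gpu < 50:
    print("GPU underutilized: suspect small batch size, CPU preprocessing, or host-device transfers.")

pynvml.nvmlShutdown()
```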
## Common Patterns

### Pattern 1: Latency-Critical Online Serving

**Requirements**: <50ms latency, moderate throughput (100-500 req/s)

**Strategy**:

```python
# 1. Small batch size for low latency
batch_size = 1  # or dynamic batching in the serving framework

# 2. Use torch.compile() or TensorRT
model = torch.compile(model, mode="reduce-overhead")

# 3. FP16 for speed (if accuracy allows)
model = model.half()

# 4. Profile to ensure <50ms
# 5. If CPU: ensure threading is configured correctly
```

### Pattern 2: Throughput-Critical Batch Serving

**Requirements**: High throughput (>1000 samples/sec), flexible latency (100-500ms OK)

**Strategy**:

```python
# 1. Large batch size for throughput
batch_size = 64  # or the maximum that fits in memory

# 2. Use TensorRT for maximum optimization
trt_model = torch_tensorrt.compile(model, inputs=[...], enabled_precisions={torch.float16})

# 3. FP16 or INT8 for speed
# 4. Profile to maximize throughput
# 5. Consider CUDA graphs for fixed-size batches
```

### Pattern 3: Edge Deployment (Raspberry Pi)

**Requirements**: <500ms latency, limited memory (1-2GB), ARM CPU

**Strategy**:

```python
# 1. Quantize to INT8 (critical for ARM)
#    (dynamic quantization covers Linear/LSTM; use static quantization for conv layers)
quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# 2. Convert to TensorFlow Lite with XNNPACK
#    (see the TFLite section above)

# 3. Set threads to the device cores (4 for Raspberry Pi 4)
# 4. Profile on the device (not on the development machine!)
# 5. If too slow, use a smaller architecture (MobileNetV3-Small)
```

### Pattern 4: Multi-GPU Inference

**Requirements**: Very high throughput, multiple GPUs available

**Strategy**:

```python
# Option 1: DataParallel (simple, less efficient)
model = torch.nn.DataParallel(model)

# Option 2: Pipeline parallelism (large models)
# Split the model across GPUs
model.layer1.to('cuda:0')
model.layer2.to('cuda:1')

# Option 3: Model replication with a load balancer (best throughput)
# Run a separate inference server per GPU
# Use NGINX or the serving framework to distribute requests
```

## Memory vs Compute Trade-offs

### Memory-Constrained Scenarios

**Symptoms**: OOM errors, model barely fits in memory

**Optimizations** (trade compute for memory):

1. **Reduce precision**: FP16 (2x memory reduction) or INT8 (4x reduction)
2. **Reduce batch size**: Smaller batches use less memory
3. **Gradient checkpointing**: (Training only) Recompute activations during backward
4. **Model pruning**: Remove unnecessary parameters
5. **Offload to CPU**: Store some layers/activations on CPU, transfer to GPU when needed

```python
import torch.nn as nn

# Example: Reduce precision
model = model.half()  # FP32 → FP16 (2x memory reduction)

# Example: Offload to CPU
class OffloadWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.model.layer1.to('cuda')
        self.model.layer2.to('cpu')   # Offload to CPU
        self.model.layer3.to('cuda')

    def forward(self, x):
        x = self.model.layer1(x)
        x = self.model.layer2(x.cpu()).cuda()  # Transfer CPU → GPU
        x = self.model.layer3(x)
        return x
```

### Compute-Constrained Scenarios

**Symptoms**: Low throughput, long latency, GPU/CPU underutilized

**Optimizations** (trade memory for compute):

1. **Increase batch size**: Use available memory for larger batches (higher throughput)
2. **Operator fusion**: Combine operations (TensorRT, torch.compile())
3. **Precision increase**: If accuracy suffers from FP16/INT8, use FP32 (slower but accurate)
4. **Larger model**: If accuracy requirements aren't met, use a larger (slower) model

```python
# Example: Increase batch size
# Find the maximum batch size that fits in memory
optimal_batch_size = find_optimal_batch_size(model, input_shape)

# Example: Operator fusion (TensorRT)
trt_model = torch_tensorrt.compile(model, inputs=[...], enabled_precisions={torch.float16})
```

## Profiling Checklist

Before optimizing, profile to answer:

### GPU Profiling Questions

- [ ] What is GPU utilization? (nvidia-smi)
- [ ] What is memory utilization?
- [ ] What are the slowest operations? (PyTorch profiler)
- [ ] Is there CPU-GPU transfer overhead? (.cuda()/.cpu() calls)
- [ ] Is batch size optimal? (measure latency/throughput)
- [ ] Are Tensor Cores being used? (FP16/INT8 operations)

### CPU Profiling Questions

- [ ] What is CPU utilization? (all cores used?)
- [ ] What is the threading configuration? (torch.get_num_threads())
- [ ] What are the slowest operations? (PyTorch profiler)
- [ ] Is MKLDNN enabled? (Intel CPUs)
- [ ] Is batch size optimal? (may be smaller for CPU)

### Edge Profiling Questions

- [ ] What is inference latency on the target device? (not the development machine!)
- [ ] What is memory usage? (fits in device RAM?)
- [ ] Is the model quantized to INT8? (critical for ARM)
- [ ] Is threading configured for the device's cores?
- [ ] Is the model architecture appropriate for the device? (too large?)

## Common Pitfalls

### Pitfall 1: Optimizing Without Profiling

**Mistake**: Applying optimizations blindly without measuring the bottleneck

**Example**:

```python
# Wrong: Apply TensorRT without profiling
trt_model = torch_tensorrt.compile(model, ...)

# Right: Profile first
with torch.profiler.profile() as prof:
    output = model(input)
print(prof.key_averages().table())
# Then optimize based on findings
```

**Why wrong**: May optimize the wrong part of the pipeline (e.g., the model is fast, preprocessing is slow)

### Pitfall 2: GPU Optimization for CPU Deployment

**Mistake**: Using GPU-specific optimizations for CPU deployment

**Example**:

```python
# Wrong: TensorRT for CPU deployment
trt_model = torch_tensorrt.compile(model, ...)  # TensorRT requires an NVIDIA GPU!

# Right: Use a CPU-optimized framework
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
```

### Pitfall 3: Ignoring torch.cuda.synchronize() in GPU Timing

**Mistake**: Measuring GPU time without synchronization (measures kernel launch, not execution)

**Example**:

```python
# Wrong: Inaccurate timing
start = time.time()
output = model(input.cuda())
elapsed = time.time() - start  # Only measures kernel launch!

# Right: Synchronize before measuring
torch.cuda.synchronize()
start = time.time()
output = model(input.cuda())
torch.cuda.synchronize()  # Wait for the GPU to finish
elapsed = time.time() - start  # Accurate GPU execution time
```

### Pitfall 4: Batch Size "Bigger is Better"

**Mistake**: Using the largest possible batch size without considering latency

**Example**:

```python
# Wrong: Maximum batch size without measuring latency
batch_size = 256  # May violate the latency SLA!

# Right: Measure the latency vs throughput trade-off
benchmark_batch_sizes(model, input_shape, batch_sizes=[1, 4, 8, 16, 32, 64])
# Select the batch size that meets the latency requirement
```

**Why wrong**: Large batches increase latency (queue time + compute time) and may violate the SLA

### Pitfall 5: Not Validating Accuracy After Optimization

**Mistake**: Deploying an FP16/INT8 model without checking accuracy

**Example**:

```python
# Wrong: Deploy a quantized model without validation
quantized_model = quantize_dynamic(model, ...)
# Deploy immediately

# Right: Validate accuracy first
if validate_fp16_accuracy(model, test_loader, tolerance=0.01):
    deploy(quantized_model)
```

### Pitfall 6: Over-Optimizing When Requirements Are Already Met

**Mistake**: Spending effort optimizing when you already meet requirements

**Example**:

```python
# Current: 20ms latency, requirement is <50ms
# Wrong: Spend days optimizing to 10ms (unnecessary)

# Right: Check whether requirements are met first
if current_latency < required_latency:
    print("Requirements met, skip optimization")
```

### Pitfall 7: Wrong Threading Configuration (CPU)

**Mistake**: Not setting intra-op threads, or oversubscribing cores

**Example**:

```python
# Wrong: leave threading to defaults/environment (may use only a few of 32 cores)
# (no torch.set_num_threads() call)

# Wrong: Oversubscription
torch.set_num_threads(32)          # Intra-op threads
torch.set_num_interop_threads(32)  # Inter-op threads (total 64 threads on 32 cores!)

# Right: Set intra-op to the number of cores, disable inter-op
torch.set_num_threads(32)
torch.set_num_interop_threads(1)
```

## When NOT to Optimize

**Skip hardware optimization when**:

1. **Requirements already met**: Current performance satisfies the latency/throughput SLA
2. **Model is not the bottleneck**: Profiling shows preprocessing or postprocessing is slow
3. **Development phase**: Still iterating on the model architecture (optimize after finalizing)
4. **Accuracy degradation**: Optimization (FP16/INT8) causes unacceptable accuracy loss
5. **Rare inference**: The model runs infrequently (e.g., once per hour), so the effort isn't justified

**Red flag**: Spending days optimizing when requirements are already met or infrastructure scaling is cheaper.

## Integration with Other Skills

### With quantization-for-inference

- **This skill**: Hardware-aware quantization deployment (INT8 on CPU vs GPU vs ARM)
- **quantization-for-inference**: Quantization techniques (PTQ, QAT, calibration)
- **Use both**: When quantization is part of the hardware optimization strategy

### With model-compression-techniques

- **This skill**: Hardware optimization (batching, frameworks, profiling)
- **model-compression-techniques**: Model size reduction (pruning, distillation)
- **Use both**: When both hardware optimization and model compression are needed

### With model-serving-patterns

- **This skill**: Optimize model inference on hardware
- **model-serving-patterns**: Serve the optimized model via API/container
- **Sequential**: Optimize the model first (this skill), then serve it (model-serving-patterns)

### With production-monitoring-and-alerting

- **This skill**: Optimize for target latency/throughput
- **production-monitoring-and-alerting**: Monitor actual latency/throughput in production
- **Feedback loop**: Monitor performance, re-optimize if it degrades

## Success Criteria

You've succeeded when:

- ✅ Profiled before optimizing (identified the actual bottleneck)
- ✅ Selected hardware-appropriate optimizations (GPU vs CPU vs edge)
- ✅ Measured performance before/after each optimization
- ✅ Met latency/throughput/memory requirements
- ✅ Validated accuracy after optimization (if using FP16/INT8)
- ✅ Considered cost vs benefit (optimization effort vs infrastructure scaling)
- ✅ Documented optimization choices and trade-offs
- ✅ Avoided premature optimization (requirements already met)

## References

**Profiling**:

- PyTorch Profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- NVIDIA Nsight Systems: https://developer.nvidia.com/nsight-systems
- Intel VTune: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html

**GPU Optimization**:

- TensorRT: https://developer.nvidia.com/tensorrt
- torch.compile(): https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
- CUDA Graphs: https://pytorch.org/docs/stable/notes/cuda.html#cuda-graphs

**CPU Optimization**:

- ONNX Runtime: https://onnxruntime.ai/docs/performance/tune-performance.html
- OpenVINO: https://docs.openvino.ai/latest/index.html
- MKLDNN: https://github.com/oneapi-src/oneDNN

**Edge Optimization**:

- TensorFlow Lite: https://www.tensorflow.org/lite/performance/best_practices
- ONNX Runtime Mobile: https://onnxruntime.ai/docs/tutorials/mobile/

**Batch Size Tuning**:

- Dynamic Batching: https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md
- TorchServe Batching: https://pytorch.org/serve/batch_inference.html