# Performance Profiling and Bottleneck Analysis

## Overview

**Core Principle:** Optimization without measurement is guessing. Profile systematically (whole → component → operation) using the right tools to identify actual bottlenecks before attempting fixes. 90% of runtime is usually in 10% of code - find that 10% with profiling, not intuition.

Performance issues stem from: data loading bottlenecks (CPU-bound), inefficient operations (GPU-bound), memory bandwidth limits (memory-bound), or I/O bottlenecks. Profiling reveals which category applies. Guessing leads to optimizing the wrong thing, wasting hours on marginal improvements while the real bottleneck remains.

## When to Use

**Use this skill when:**
- Training or inference is slower than expected
- Need to identify a performance bottleneck in PyTorch code
- High GPU memory usage, need to understand what's using memory
- Evaluating whether an optimization actually improved performance
- Debugging low GPU utilization issues
- Comparing performance of different implementations
- Need to profile specific operations or model components

**Don't use when:**
- Performance is already acceptable (no problem to solve)
- Architecture design questions (use module-design-patterns)
- Debugging correctness issues (use debugging-techniques)
- Memory leaks (use tensor-operations-and-memory)

**Symptoms triggering this skill:**
- "Training is slower than expected"
- "Low GPU utilization but training still slow"
- "Which part of my model is the bottleneck?"
- "Does this optimization actually help?"
- "Memory usage is high, what's using it?"
- "First iteration much slower than subsequent ones"

## Systematic Profiling Methodology

### The Four-Phase Framework

**Phase 1: Establish Baseline**
- Define metric (throughput, latency, memory)
- Measure end-to-end performance
- Set improvement target
- Document measurement conditions

**Phase 2: Identify Bottleneck Type**
- CPU-bound vs GPU-bound vs I/O-bound vs memory-bound
- Check GPU utilization (nvidia-smi)
- Profile data loading separately from computation
- Determine which component to investigate

**Phase 3: Narrow to Component**
- Profile at coarse granularity
- Identify which phase is slow (forward/backward/optimizer/data loading)
- Focus profiling on bottleneck component
- Use iterative narrowing

**Phase 4: Identify Operation**
- Profile bottleneck component in detail
- Examine both table view and trace view
- Find specific operation or pattern
- Measure improvement after fix

**Critical Rule:** ALWAYS work through phases in order. Don't jump to Phase 4 without Phases 1-3.

### Phase 1: Establish Baseline

**Step 1: Define Performance Metric**

```python
import torch
import numpy as np

# Choose the right metric for your use case.
# Note: `criterion` and `optimizer` are assumed to be defined in scope here
# (and throughout the examples in this document).

# Throughput (samples/second) - for training
def measure_throughput(model, dataloader, num_batches=100):
    model.train()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    total_samples = 0
    start.record()

    for i, (data, target) in enumerate(dataloader):
        if i >= num_batches:
            break
        data, target = data.cuda(), target.cuda()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_samples += data.size(0)

    end.record()
    torch.cuda.synchronize()
    elapsed_ms = start.elapsed_time(end)

    throughput = total_samples / (elapsed_ms / 1000.0)
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Time per batch: {elapsed_ms / num_batches:.2f} ms")
    return throughput

# Latency (time per sample) - for inference
def measure_latency(model, sample_input, num_iterations=100, warmup=10):
    model.eval()
    sample_input = sample_input.cuda()

    # Warmup (CRITICAL - don't skip!)
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(sample_input)

    # Measure
    latencies = []
    with torch.no_grad():
        for _ in range(num_iterations):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)

            start.record()
            output = model(sample_input)
            end.record()

            torch.cuda.synchronize()
            latencies.append(start.elapsed_time(end))

    # Report statistics (not just average!)
    latencies = np.array(latencies)
    print(f"Latency - Mean: {latencies.mean():.2f} ms, "
          f"Std: {latencies.std():.2f} ms, "
          f"Median: {np.median(latencies):.2f} ms, "
          f"P95: {np.percentile(latencies, 95):.2f} ms, "
          f"P99: {np.percentile(latencies, 99):.2f} ms")
    return latencies

# Memory usage (peak GB)
def measure_memory(model, sample_batch, target):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    # Run one iteration
    output = model(sample_batch)
    loss = criterion(output, target)
    loss.backward()

    torch.cuda.synchronize()
    peak_memory = torch.cuda.max_memory_allocated() / 1e9
    print(f"Peak memory: {peak_memory:.2f} GB")
    return peak_memory
```

**Why this matters:**
- Without a baseline, you can't measure improvement
- Need statistics (mean, std, percentiles), not just average
- Must use CUDA Events for GPU timing (not `time.time()`)
- Warmup is critical to exclude JIT compilation overhead

**Step 2: Document Measurement Conditions**

```python
import json

# Record all relevant configuration
# (assumes `model`, `dataloader`, and a `using_amp` flag are defined)
profiling_config = {
    'model': model.__class__.__name__,
    'batch_size': 32,
    'input_shape': (3, 224, 224),
    'device': 'cuda:0',
    'dtype': 'float16' if using_amp else 'float32',
    'mode': 'train',  # or 'eval'
    'num_workers': dataloader.num_workers,
    'cudnn_benchmark': torch.backends.cudnn.benchmark,
    'gpu': torch.cuda.get_device_name(0),
}

print(json.dumps(profiling_config, indent=2))
```

**Why this matters:**
- Performance changes with configuration
- Need to reproduce results
- Comparing different runs requires the same conditions
- Document before optimizing, re-measure after

### Phase 2: Identify Bottleneck Type

**Step 1: Check GPU Utilization**

```bash
# In terminal, monitor GPU utilization in real-time
nvidia-smi dmon -s u -i 0 -d 1
```

```python
# Or query from within Python
import subprocess

result = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
                         '--format=csv,noheader,nounits'],
                        capture_output=True, text=True)
gpu_util, mem_used = result.stdout.strip().split(',')
print(f"GPU Utilization: {gpu_util}%, Memory: {mem_used} MB")
```

**Interpretation:**

| GPU Utilization | Likely Bottleneck | Next Step |
|-----------------|-------------------|-----------|
| < 70% | CPU-bound (data loading, preprocessing) | Profile data loading |
| 70-90% | Mixed or memory-bound | Check memory bandwidth |
| > 90% | GPU-bound (computation) | Profile model operations |

**Why this matters:** GPU utilization tells you WHERE to look. If the GPU isn't saturated, optimizing GPU operations won't help.

**Step 2: Profile Data Loading vs Computation**

```python
import time
import numpy as np

def profile_dataloader_vs_model(model, dataloader, num_batches=50):
    """Separate data loading time from model computation time"""
    model.train()

    data_times = []
    compute_times = []

    batch_iterator = iter(dataloader)

    for i in range(num_batches):
        # Time data loading
        data_start = time.time()
        data, target = next(batch_iterator)
        data_end = time.time()
        data_times.append(data_end - data_start)

        # Time computation
        data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
        torch.cuda.synchronize()

        compute_start = torch.cuda.Event(enable_timing=True)
        compute_end = torch.cuda.Event(enable_timing=True)

        compute_start.record()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        compute_end.record()

        torch.cuda.synchronize()
        compute_times.append(compute_start.elapsed_time(compute_end))

    avg_data_time = np.mean(data_times) * 1000  # to ms
    avg_compute_time = np.mean(compute_times)

    print(f"Avg data loading time: {avg_data_time:.2f} ms")
    print(f"Avg computation time: {avg_compute_time:.2f} ms")
    print(f"Data loading is {avg_data_time/avg_compute_time:.1f}x "
          f"{'slower' if avg_data_time > avg_compute_time else 'faster'} than compute")

    if avg_data_time > avg_compute_time:
        print("⚠️ BOTTLENECK: Data loading (CPU-bound)")
        print("   Solutions: Increase num_workers, use pin_memory=True, "
              "move preprocessing to GPU")
    else:
        print("✅ Data loading is fast enough. Bottleneck is in model computation.")

    return avg_data_time, avg_compute_time
```

**Why this matters:**
- If data loading > computation time, the GPU is starved (increase workers)
- If computation > data loading, the GPU is the bottleneck (optimize model)
- Common mistake: Optimizing the model when data loading is the bottleneck

**Step 3: Determine Bottleneck Category**

```python
from torch.profiler import profile, ProfilerActivity

def diagnose_bottleneck_type(model, dataloader):
    """Systematic bottleneck categorization"""

    # 1. Check GPU utilization
    print("=== GPU Utilization Check ===")
    # Run training for a bit while monitoring GPU (see Step 1 above)
    # If GPU util < 70% → CPU-bound
    # If GPU util > 90% → GPU-bound

    # 2. Check memory bandwidth
    print("\n=== Memory Bandwidth Check ===")

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for i, (data, target) in enumerate(dataloader):
            if i >= 5:
                break
            data, target = data.cuda(), target.cuda()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()

    # Look for high memory-bound ops
    events = prof.key_averages()
    for evt in events:
        if evt.cuda_time_total > 0:
            # If many large tensor ops with low FLOPS → memory-bound
            pass

    # 3. Profile phases. Data loading is timed separately, since
    #    profile_training_phases (defined in Phase 3, Step 2 below)
    #    only covers the GPU-side phases.
    print("\n=== Phase Profiling ===")
    data_time, _ = profile_dataloader_vs_model(model, dataloader)
    data, target = next(iter(dataloader))
    times = profile_training_phases(model, data, target)

    # Interpret results
    print("\n=== Diagnosis ===")
    if data_time > times['forward'] + times['backward']:
        print("BOTTLENECK: CPU-bound (data loading)")
        print("Action: Increase num_workers, enable pin_memory, cache data")
    elif times['forward'] > times['backward'] * 2:
        print("BOTTLENECK: GPU-bound (forward pass)")
        print("Action: Profile forward pass operations")
    elif times['backward'] > times['forward'] * 2:
        print("BOTTLENECK: GPU-bound (backward pass)")
        print("Action: Profile backward pass, check gradient checkpointing")
    else:
        print("BOTTLENECK: Mixed or memory-bound")
        print("Action: Deep profiling needed")
```

### Phase 3: Narrow to Component

**Step 1: Coarse-Grained Profiling**

```python
from torch.profiler import profile, ProfilerActivity, schedule

def profile_training_step(model, dataloader, num_steps=10):
    """Profile one training step to identify bottleneck phase"""

    # Use schedule to reduce profiling overhead
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:

        for step, (data, target) in enumerate(dataloader):
            if step >= num_steps:
                break

            data, target = data.cuda(), target.cuda()

            optimizer.zero_grad()

            # Forward
            output = model(data)
            loss = criterion(output, target)

            # Backward
            loss.backward()

            # Optimizer
            optimizer.step()

            prof.step()  # Notify profiler of step boundary

    # Print summary
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=20,
        max_src_column_width=80
    ))

    # Export trace for visualization
    print("\n✅ Trace exported to ./profiler_logs")
    print("   View in Chrome: chrome://tracing (load trace.json)")
    print("   Or TensorBoard: tensorboard --logdir=./profiler_logs")

    return prof
```

**Understanding the schedule:**
- `wait=1`: Skip first iteration (cold start)
- `warmup=2`: Next 2 iterations for warmup (no profiling overhead)
- `active=5`: Profile these 5 iterations
- `repeat=1`: Do this cycle once

**Why this matters:**
- Profiling has overhead - don't profile every iteration
- Schedule controls when profiling is active
- Warmup prevents including JIT compilation in measurements

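One practical consequence of the schedule parameters is how many iterations the loop must run before a trace exists at all; a small sketch of the arithmetic:

```python
from torch.profiler import schedule

sched = schedule(wait=1, warmup=2, active=5, repeat=1)
# One full cycle = wait + warmup + active = 1 + 2 + 5 = 8 iterations,
# so the training loop must call prof.step() at least 8 times;
# data is recorded only during the 5 "active" steps.
```
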
**Step 2: Phase-Level Timing**

```python
from torch.profiler import record_function

def profile_training_phases(model, batch, target):
    """Time each phase of training separately"""

    data, target = batch.cuda(), target.cuda()
    optimizer.zero_grad()

    torch.cuda.synchronize()

    # Profile each phase
    phases = {}

    # Forward pass
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with record_function("forward_pass"):
        start.record()
        output = model(data)
        loss = criterion(output, target)
        end.record()
    torch.cuda.synchronize()
    phases['forward'] = start.elapsed_time(end)

    # Backward pass
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with record_function("backward_pass"):
        start.record()
        loss.backward()
        end.record()
    torch.cuda.synchronize()
    phases['backward'] = start.elapsed_time(end)

    # Optimizer step
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with record_function("optimizer_step"):
        start.record()
        optimizer.step()
        end.record()
    torch.cuda.synchronize()
    phases['optimizer'] = start.elapsed_time(end)

    # Print breakdown
    total = sum(phases.values())
    print("Phase Breakdown:")
    for phase, time_ms in phases.items():
        print(f"  {phase:15s}: {time_ms:7.2f} ms ({time_ms/total*100:5.1f}%)")

    return phases
```

**Why this matters:**
- Identifies which phase is slowest
- Focuses subsequent profiling on the bottleneck phase
- Uses `record_function` to add custom markers in trace view

**Step 3: Module-Level Profiling**

```python
from torch.profiler import profile, record_function, ProfilerActivity

def profile_model_modules(model, sample_input):
    """Profile time spent in each leaf module"""

    model.eval()
    sample_input = sample_input.cuda()

    # Approach 1: hooks with CUDA events - record a start event just before
    # each leaf module runs and an end event right after it returns
    module_events = {}

    def make_pre_hook(name):
        def pre_hook(module, inputs):
            start = torch.cuda.Event(enable_timing=True)
            start.record()
            module_events[name] = [start, None]
        return pre_hook

    def make_post_hook(name):
        def post_hook(module, inputs, output):
            end = torch.cuda.Event(enable_timing=True)
            end.record()
            module_events[name][1] = end
        return post_hook

    hooks = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # Leaf modules only
            hooks.append(module.register_forward_pre_hook(make_pre_hook(name)))
            hooks.append(module.register_forward_hook(make_post_hook(name)))

    with torch.no_grad():
        _ = model(sample_input)
    torch.cuda.synchronize()

    for name, (start, end) in module_events.items():
        print(f"{name:40s}: {start.elapsed_time(end):7.3f} ms")

    # Clean up
    for hook in hooks:
        hook.remove()

    # Approach 2: wrap modules so each shows up as a named range in the trace
    class ProfilingModule(torch.nn.Module):
        def __init__(self, module, name):
            super().__init__()
            self.module = module
            self.name = name

        def forward(self, *args, **kwargs):
            with record_function(f"module_{self.name}"):
                return self.module(*args, **kwargs)

    # Approach 3: just use torch.profiler with record_shapes=True -
    # it automatically attributes time to operators
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        with record_function("model_forward"):
            output = model(sample_input)

    print(prof.key_averages(group_by_input_shape=True).table(
        sort_by="cuda_time_total", row_limit=20
    ))
```

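A usage sketch for the `ProfilingModule` wrapper above (hypothetical; assumes `model` and `sample_input` from the surrounding example): wrapping top-level children in place makes each one appear as a labeled range in the trace view.

```python
# Wrap each top-level child so it shows up as "module_<name>" in the trace
for name, child in list(model.named_children()):
    setattr(model, name, ProfilingModule(child, name))

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    output = model(sample_input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
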
**Why this matters:**
- Identifies which model component is slowest
- Guides optimization efforts to specific layers
- Reveals unexpected bottlenecks (e.g., LayerNorm taking 30% of time)

### Phase 4: Identify Operation

**Step 1: Detailed Operation Profiling**

```python
def profile_operations_detailed(model, sample_input):
    """Get detailed breakdown of all operations"""

    model.eval()
    sample_input = sample_input.cuda()

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        with_stack=True,
        profile_memory=True
    ) as prof:
        output = model(sample_input)

    # Group by operation type
    print("\n=== Top Operations by CUDA Time ===")
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=30,
        max_src_column_width=100
    ))

    print("\n=== Top Operations by Memory ===")
    print(prof.key_averages().table(
        sort_by="self_cuda_memory_usage",
        row_limit=20,
        max_src_column_width=100
    ))

    print("\n=== Grouped by Input Shape ===")
    print(prof.key_averages(group_by_input_shape=True).table(
        sort_by="cuda_time_total",
        row_limit=20
    ))

    # Export for trace view
    prof.export_chrome_trace("detailed_trace.json")
    print("\n✅ Exported detailed_trace.json - view in chrome://tracing")

    return prof
```

**Step 2: Reading Profiler Output**

```python
# Example profiler output:
"""
---------------------------------  ------------  ------------  ------------  ------------
Name                               Self CPU %    Self CPU      CPU total %   CPU total
---------------------------------  ------------  ------------  ------------  ------------
aten::conv2d                       0.5%          124ms         45.2%         11.234s
aten::convolution                  1.2%          298ms         44.7%         11.110s
aten::_convolution                 2.3%          571ms         43.5%         10.812s
aten::cudnn_convolution            40.1%         9.967s        41.2%         10.241s
aten::batch_norm                   0.3%          74ms          25.8%         6.412s
aten::_batch_norm                  1.1%          273ms         25.5%         6.338s
aten::cudnn_batch_norm             23.2%         5.765s        24.4%         6.065s
aten::relu                         8.2%          2.038s        8.2%          2.038s
"""
```

**How to interpret:**

| Column | Meaning | When to Look |
|--------|---------|--------------|
| Name | Operation name | Identify what operation |
| Self CPU % | Time in this op only (no children) | Find leaf operations |
| CPU total % | Time in op + children | Find expensive subtrees |
| Self CUDA time | GPU execution time | Main metric for GPU ops |
| Call count | How many times called | High count = optimization target |

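The same numbers are also available programmatically: `key_averages()` yields per-operation records whose attributes mirror these columns (a sketch; attribute names as in recent PyTorch releases):

```python
for evt in prof.key_averages():
    if evt.cuda_time_total > 0:
        # profiler times are reported in microseconds
        print(f"{evt.key:40s} calls={evt.count:6d} "
              f"cuda={evt.cuda_time_total / 1e3:8.2f} ms")
```
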
**Common patterns:**

```python
# Pattern 1: High aten::copy_ time (40%+)
# → Device transfer issue (CPU ↔ GPU)
# Action: Check device placement, reduce transfers

# Pattern 2: High cudaLaunchKernel overhead
# → Too many small kernel launches
# Action: Increase batch size, fuse operations

# Pattern 3: High cudnn_convolution time
# → Convolutions are bottleneck (expected for CNNs)
# Action: Check input dimensions for Tensor Core alignment

# Pattern 4: High CPU time, low CUDA time
# → CPU bottleneck (data loading, preprocessing)
# Action: Increase num_workers, move ops to GPU

# Pattern 5: Many small operations
# → Operation fusion opportunity
# Action: Use torch.compile or fuse manually
```

**Step 3: Trace View Analysis**

```python
# After exporting trace: prof.export_chrome_trace("trace.json")
# Open in chrome://tracing

"""
Trace view shows:
1. Timeline of GPU kernels
2. CPU → GPU synchronization points
3. Parallel vs sequential execution
4. GPU idle time (gaps between kernels)

What to look for:
- Large gaps between GPU kernels → GPU underutilized
- Many thin bars → Too many small operations
- Thick bars → Few large operations (good for GPU)
- Yellow/red bars → CPU activity (should be minimal during GPU work)
- Overlapping bars → Concurrent execution (good)
"""
```

**Reading trace view:**

```
GPU Stream 0: ████░░░░████░░░░████  ← Gaps = idle GPU (bad)
GPU Stream 0: ███████████████████   ← Continuous = good utilization
CPU:          ░░░░████░░░░████░░░░  ← CPU peaks = data loading

Timeline:
[Data Load]──→[GPU Forward]──→[Data Load]──→[GPU Forward]
              ↑ Gap here = GPU waiting for data
```

## Memory Profiling

### Memory Tracking Methodology

**Step 1: Basic Memory Tracking**

```python
import torch

def track_memory(stage_name):
    """Print current memory usage"""
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{stage_name:30s} - Allocated: {allocated:6.2f} GB, "
          f"Reserved: {reserved:6.2f} GB")

# Track at each training phase
track_memory("Start")

data, target = next(iter(dataloader))
data, target = data.cuda(), target.cuda()
track_memory("After data to GPU")

output = model(data)
track_memory("After forward")

loss = criterion(output, target)
track_memory("After loss")

loss.backward()
track_memory("After backward")

optimizer.step()
track_memory("After optimizer step")

optimizer.zero_grad()
track_memory("After zero_grad")
```

**Output interpretation:**

```
Start                - Allocated:   2.50 GB, Reserved:   2.75 GB
After data to GPU    - Allocated:   2.62 GB, Reserved:   2.75 GB
After forward        - Allocated:   4.80 GB, Reserved:   5.00 GB  ← Activations
After loss           - Allocated:   4.81 GB, Reserved:   5.00 GB
After backward       - Allocated:   7.20 GB, Reserved:   7.50 GB  ← Gradients
After optimizer step - Allocated:   7.20 GB, Reserved:   7.50 GB
After zero_grad      - Allocated:   4.70 GB, Reserved:   7.50 GB  ← Gradients freed
```

**Key insights:**
- Allocated = actual memory used
- Reserved = memory held by allocator (may be > allocated due to caching)
- Large jump after forward = activations (consider gradient checkpointing)
- Large jump after backward = gradients (same size as parameters)
- Reserved stays high = memory fragmentation or caching

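A minimal sketch that makes the allocated vs. reserved distinction visible (the ~1 GB tensor is arbitrary; exact numbers depend on allocator state):

```python
import torch

x = torch.randn(256, 1024, 1024, device='cuda')  # ~1 GB of float32
print(f"alloc={torch.cuda.memory_allocated() / 1e9:.2f} GB, "
      f"reserved={torch.cuda.memory_reserved() / 1e9:.2f} GB")

del x  # allocated drops immediately...
print(f"alloc={torch.cuda.memory_allocated() / 1e9:.2f} GB, "
      f"reserved={torch.cuda.memory_reserved() / 1e9:.2f} GB")

torch.cuda.empty_cache()  # ...reserved only drops once the cache is released
print(f"alloc={torch.cuda.memory_allocated() / 1e9:.2f} GB, "
      f"reserved={torch.cuda.memory_reserved() / 1e9:.2f} GB")
```
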
**Step 2: Peak Memory Analysis**

```python
def analyze_peak_memory(model, batch, target):
    """Find peak memory usage and what causes it"""

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    # Run one iteration
    data, target = batch.cuda(), target.cuda()

    output = model(data)
    forward_peak = torch.cuda.max_memory_allocated() / 1e9

    loss = criterion(output, target)
    loss_peak = torch.cuda.max_memory_allocated() / 1e9

    loss.backward()
    backward_peak = torch.cuda.max_memory_allocated() / 1e9

    optimizer.step()
    optimizer_peak = torch.cuda.max_memory_allocated() / 1e9

    optimizer.zero_grad()
    final_peak = torch.cuda.max_memory_allocated() / 1e9

    print(f"Peak after forward:   {forward_peak:.2f} GB")
    print(f"Peak after loss:      {loss_peak:.2f} GB")
    print(f"Peak after backward:  {backward_peak:.2f} GB")
    print(f"Peak after optimizer: {optimizer_peak:.2f} GB")
    print(f"Overall peak:         {final_peak:.2f} GB")

    # Identify bottleneck
    if forward_peak > backward_peak * 0.8:
        print("\n⚠️ Activations dominate memory usage")
        print("   Consider: Gradient checkpointing, smaller batch size")
    elif backward_peak > forward_peak * 1.5:
        print("\n⚠️ Gradients dominate memory usage")
        print("   Consider: Gradient accumulation, mixed precision")
    else:
        print("\n✅ Memory usage balanced across phases")

    return {
        'forward': forward_peak,
        'backward': backward_peak,
        'optimizer': optimizer_peak,
        'peak': final_peak
    }
```

**Step 3: Detailed Memory Summary**

```python
def print_memory_summary():
    """Print detailed memory breakdown"""
    print(torch.cuda.memory_summary())

"""
Example output:
|===========================================================================|
|                        PyTorch CUDA memory summary                        |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0                                                              |
| Metric         | Cur Usage  | Peak Usage | Alloc Retries | # Allocs      |
|----------------|------------|------------|---------------|---------------|
| Allocated      | 4.50 GB    | 7.20 GB    | 0             | 15234         |
| Reserved       | 7.50 GB    | 7.50 GB    | 0             | 1523          |
| Active         | 4.50 GB    | 7.20 GB    |               |               |
| Inactive       | 3.00 GB    | 3.00 GB    |               |               |
|===========================================================================|

Allocated memory:      4.50 GB  ← Actual tensors
Reserved memory:       7.50 GB  ← Memory held by allocator
Active allocations:    4.50 GB  ← Currently in use
Inactive allocations:  3.00 GB  ← Cached for reuse (fragmentation)
"""

# If Inactive >> 0, memory fragmentation is occurring
# Periodic torch.cuda.empty_cache() may help
```

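Beyond periodic `empty_cache()`, the caching allocator itself can be tuned through the `PYTORCH_CUDA_ALLOC_CONF` environment variable; `max_split_size_mb` limits the block splitting that drives fragmentation. It must take effect before the first CUDA allocation - a sketch:

```python
import os

# Set before torch initializes CUDA (ideally before `import torch`)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```
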
**Step 4: Memory Snapshot (PyTorch 2.0+)**

```python
import torch

def capture_memory_snapshot(filename="memory_snapshot.pickle"):
    """Capture detailed memory snapshot for analysis"""

    # Enable memory history tracking
    torch.cuda.memory._record_memory_history(max_entries=100000)

    try:
        # Run your training code here
        for i, (data, target) in enumerate(dataloader):
            if i >= 5:
                break
            data, target = data.cuda(), target.cuda()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Capture snapshot
        torch.cuda.memory._dump_snapshot(filename)
        print(f"✅ Memory snapshot saved to {filename}")

    finally:
        # Disable tracking
        torch.cuda.memory._record_memory_history(enabled=None)

    print(f"\nAnalyze with:")
    print(f"  python -m torch.cuda._memory_viz trace_plot {filename}")
    print(f"  # Opens interactive visualization in browser")
```

**Memory snapshot visualization shows:**
- Allocation timeline
- Stack traces for each allocation
- Memory leaks (allocations never freed)
- Fragmentation patterns
- Peak memory events

## GPU Timing Best Practices

### CUDA Synchronization and Events

**❌ WRONG: Using time.time() for GPU operations**

```python
import time

# WRONG - measures CPU time, not GPU time!
start = time.time()
output = model(data)  # Kernel launches, returns immediately
end = time.time()
print(f"Time: {end - start:.4f}s")  # ❌ This is kernel launch overhead (~microseconds)

# Problem: CUDA operations are asynchronous
# time.time() measures CPU time (when kernel was launched)
# Not GPU time (when kernel actually executed)
```

**Why this is wrong:**
- CUDA kernel launches are asynchronous (return immediately to CPU)
- `time.time()` measures CPU wall-clock time
- Actual GPU execution happens later, in parallel with CPU
- Measured time is kernel launch overhead (microseconds), not execution time

**✅ CORRECT: Using CUDA Events**

```python
# CORRECT - measures actual GPU execution time
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
output = model(data)
end_event.record()

# Wait for GPU to finish
torch.cuda.synchronize()

# Get elapsed time in milliseconds
elapsed_time_ms = start_event.elapsed_time(end_event)
print(f"GPU Time: {elapsed_time_ms:.2f} ms")
```

**Why this is correct:**
- CUDA Events are GPU-native timing
- `record()` inserts timing markers into the GPU stream
- `synchronize()` waits for the GPU to complete
- `elapsed_time()` returns actual GPU execution time

**Alternative: Using torch.profiler**

```python
# For comprehensive profiling, use torch.profiler instead of manual timing
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    output = model(data)

print(prof.key_averages().table(sort_by="cuda_time_total"))
# This automatically handles synchronization and provides detailed breakdown
```

### Warmup Iterations

**Why warmup is critical:**

```python
# First iteration includes:
# 1. CUDA kernel JIT compilation
# 2. cuDNN algorithm selection (benchmark mode)
# 3. Memory pool allocation
# 4. CPU→GPU transfer of model weights (first time)

# Example timing without warmup:
model.eval()
with torch.no_grad():
    for i in range(10):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        output = model(sample_input)
        end.record()
        torch.cuda.synchronize()

        print(f"Iteration {i}: {start.elapsed_time(end):.2f} ms")

"""
Output:
Iteration 0: 1234.56 ms  ← JIT compilation, cuDNN benchmarking
Iteration 1:  987.43 ms  ← Still some overhead
Iteration 2:  102.34 ms  ← Stabilized
Iteration 3:  101.89 ms  ← Stable
Iteration 4:  102.12 ms  ← Stable
...
"""
```

**✅ Correct warmup methodology:**

```python
import numpy as np

def benchmark_with_warmup(model, sample_input, warmup=5, iterations=100):
    """Proper benchmarking with warmup"""

    model.eval()
    sample_input = sample_input.cuda()

    # Warmup iterations (CRITICAL!)
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(sample_input)

    # Ensure warmup completed
    torch.cuda.synchronize()

    # Actual measurement
    times = []
    with torch.no_grad():
        for _ in range(iterations):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)

            start.record()
            output = model(sample_input)
            end.record()

            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))

    # Report statistics
    times = np.array(times)

    print(f"Mean:   {times.mean():.2f} ms")
    print(f"Std:    {times.std():.2f} ms")
    print(f"Median: {np.median(times):.2f} ms")
    print(f"Min:    {times.min():.2f} ms")
    print(f"Max:    {times.max():.2f} ms")
    print(f"P95:    {np.percentile(times, 95):.2f} ms")
    print(f"P99:    {np.percentile(times, 99):.2f} ms")

    return times
```

**Warmup rules:**
- Minimum 3 iterations, recommend 5-10
- More complex models need more warmup
- Dynamic control flow needs extra warmup
- Report statistics (mean, std, percentiles), not just average

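PyTorch also ships `torch.utils.benchmark.Timer`, which handles warmup and CUDA synchronization internally; a minimal sketch (assumes `model` and `sample_input` are defined):

```python
import torch.utils.benchmark as benchmark

timer = benchmark.Timer(
    stmt="model(x)",
    globals={"model": model, "x": sample_input},
)
print(timer.timeit(100))  # runs the statement 100 times, prints timing stats
```
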
## Bottleneck Identification Patterns

### CPU-Bound Bottlenecks

**Symptoms:**
- Low GPU utilization (<70%)
- High CPU usage
- Data loading time > computation time
- `nvidia-smi` shows low GPU usage

**Diagnostic code:**

```python
import os
import subprocess

def diagnose_cpu_bottleneck(model, dataloader):
    """Check if training is CPU-bound"""

    # Check GPU utilization
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    gpu_util = int(result.stdout.strip())

    print(f"GPU Utilization: {gpu_util}%")

    if gpu_util < 70:
        print("⚠️ LOW GPU UTILIZATION - likely CPU-bound")

        # Profile data loading vs compute
        data_time, compute_time = profile_dataloader_vs_model(model, dataloader)

        if data_time > compute_time:
            print("\n🎯 BOTTLENECK: Data loading")
            print("Solutions:")
            print("  1. Increase num_workers in DataLoader")
            print(f"     Current: {dataloader.num_workers}, try: {os.cpu_count()}")
            print("  2. Enable pin_memory=True")
            print("  3. Move data augmentation to GPU (use kornia)")
            print("  4. Cache preprocessed data if dataset is small")
            print("  5. Use faster storage (SSD instead of HDD)")

        else:
            print("\n🎯 BOTTLENECK: CPU preprocessing")
            print("Solutions:")
            print("  1. Move preprocessing to GPU")
            print("  2. Reduce preprocessing complexity")
            print("  3. Batch preprocessing operations")

    else:
        print("✅ GPU utilization is healthy")
        print("   Bottleneck is likely in GPU computation")

    return gpu_util
```

**Common solutions:**

```python
import torch.nn as nn
from torch.utils.data import DataLoader
import kornia

# Solution 1: Increase num_workers
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # Increase from default 0
    pin_memory=True,          # Enable for faster GPU transfer
    persistent_workers=True   # Keep workers alive between epochs
)

# Solution 2: Move augmentation to GPU
class GPUAugmentation(nn.Module):
    def __init__(self):
        super().__init__()
        self.augment = nn.Sequential(
            kornia.augmentation.RandomHorizontalFlip(p=0.5),
            kornia.augmentation.ColorJitter(0.2, 0.2, 0.2, 0.1),
            kornia.augmentation.RandomResizedCrop((224, 224)),
        )

    def forward(self, x):
        return self.augment(x)

# Apply on GPU
gpu_augment = GPUAugmentation().cuda()
for data, target in dataloader:
    data = data.cuda()
    data = gpu_augment(data)  # Augment on GPU
    output = model(data)
```

### GPU-Bound Bottlenecks

**Symptoms:**
- High GPU utilization (>90%)
- Computation time > data loading time
- High CUDA time in profiler

**Diagnostic code:**

```python
def diagnose_gpu_bottleneck(model, sample_input, target):
    """Identify GPU bottleneck operations"""

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        output = model(sample_input)
        loss = criterion(output, target)
        loss.backward()

    # Find top GPU operations
    events = prof.key_averages()
    cuda_events = [(evt.key, evt.cuda_time_total) for evt in events
                   if evt.cuda_time_total > 0]
    cuda_events.sort(key=lambda x: x[1], reverse=True)

    print("Top 10 GPU operations:")
    total_time = sum(time for _, time in cuda_events)
    for i, (name, time) in enumerate(cuda_events[:10], 1):
        percentage = (time / total_time) * 100
        print(f"{i:2d}. {name:40s} {time/1000:8.2f} ms ({percentage:5.1f}%)")

    # Check for optimization opportunities
    top_op = cuda_events[0][0]
    if 'conv' in top_op.lower():
        print("\n🎯 Bottleneck: Convolution operations")
        print("Solutions:")
        print("  1. Check input dimensions for Tensor Core alignment (multiples of 8)")
        print("  2. Use mixed precision (torch.cuda.amp)")
        print("  3. Consider depthwise separable convolutions")
        print("  4. Profile with different batch sizes")

    elif 'mm' in top_op.lower() or 'matmul' in top_op.lower():
        print("\n🎯 Bottleneck: Matrix multiplication")
        print("Solutions:")
        print("  1. Ensure dimensions are multiples of 8 (FP16) or 16 (BF16)")
        print("  2. Use mixed precision")
        print("  3. Check for unnecessary transposes")

    elif 'copy' in top_op.lower():
        print("\n🎯 Bottleneck: Memory copies")
        print("Solutions:")
        print("  1. Check device placement (CPU ↔ GPU transfers)")
        print("  2. Ensure tensors are contiguous")
        print("  3. Reduce explicit .cuda() or .cpu() calls")

    return cuda_events
```

**Common solutions:**

```python
# Solution 1: Mixed precision
from torch.cuda.amp import autocast

with autocast():
    output = model(data)
    loss = criterion(output, target)
# 2-3x speedup for large models

# Solution 2: Tensor Core alignment
# Ensure dimensions are multiples of 8 (FP16) or 16 (BF16)
# BAD:  (batch=31, seq_len=127, hidden=509)
# GOOD: (batch=32, seq_len=128, hidden=512)

# Solution 3: torch.compile (PyTorch 2.0+)
model = torch.compile(model)
# Automatic kernel fusion and optimization
```

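To make the alignment check in Solution 2 mechanical, here is a small hypothetical helper (name and rule are illustrative; FP16 Tensor Cores want multiples of 8) that flags offending layers:

```python
import torch.nn as nn

def check_tensor_core_alignment(model, multiple=8):
    """Warn about Linear/Conv2d layers whose channel dims break alignment."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            dims = (module.in_features, module.out_features)
        elif isinstance(module, nn.Conv2d):
            dims = (module.in_channels, module.out_channels)
        else:
            continue
        if any(d % multiple != 0 for d in dims):
            print(f"⚠️ {name}: dims {dims} are not multiples of {multiple} - "
                  f"may miss Tensor Cores")
```
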
### Memory-Bound Bottlenecks

**Symptoms:**
- Low GPU utilization despite high memory usage
- Large tensor operations dominating time
- Memory bandwidth saturated

**Diagnostic code:**

```python
def diagnose_memory_bottleneck(model, sample_input):
    """Check if operations are memory-bandwidth limited"""

    # Profile memory and compute
    with profile(
        activities=[ProfilerActivity.CUDA],
        profile_memory=True
    ) as prof:
        output = model(sample_input)

    # Analyze operations
    for evt in prof.key_averages():
        if evt.cuda_time_total > 0 and evt.self_cuda_memory_usage > 0:
            # Rough achieved-bandwidth estimate:
            # memory-bound ops run close to the GPU's peak bandwidth,
            # compute-bound ops achieve high FLOP/s instead

            memory_gb = evt.self_cuda_memory_usage / 1e9
            time_s = evt.cuda_time_total / 1e6  # µs to s

            if memory_gb > 1.0 and time_s > 0.01:
                bandwidth = memory_gb / time_s  # GB/s
                print(f"{evt.key:40s}: {bandwidth:.1f} GB/s")

    print("\nIf achieved bandwidth approaches your GPU's peak "
          "(e.g., ~900 GB/s on V100), the op is memory-bound")
    print("Solutions:")
    print("  1. Reduce intermediate tensor sizes")
    print("  2. Use in-place operations where safe")
    print("  3. Tile large operations")
    print("  4. Increase arithmetic intensity (more compute per byte)")
```

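As a tiny illustration of the in-place suggestion above (safe only when the original value is not needed again, e.g. not saved for autograd):

```python
x = torch.relu(x)  # out-of-place: allocates a new tensor for the result
x.relu_()          # in-place: reuses x's storage, no new allocation
```
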
### I/O-Bound Bottlenecks

**Symptoms:**
- Low CPU and GPU utilization
- Long pauses between batches
- Slow disk I/O

**Solutions:**

```python
from torch.utils.data import Dataset, DataLoader

# Solution 1: Cache dataset in RAM
class CachedDataset(Dataset):
    def __init__(self, dataset):
        self.cache = [dataset[i] for i in range(len(dataset))]

    def __getitem__(self, idx):
        return self.cache[idx]

    def __len__(self):
        return len(self.cache)

# Solution 2: Use SSD storage or a RAM disk

# Solution 3: Prefetch data
dataloader = DataLoader(
    dataset,
    num_workers=8,
    prefetch_factor=4,  # Prefetch 4 batches per worker
    pin_memory=True
)
```

## Common Profiling Mistakes

### Mistake 1: Profiling Too Many Iterations

**❌ WRONG:**

```python
# Profiling 100 epochs - output is massive, unusable
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for epoch in range(100):
        for batch in dataloader:
            # ... training ...
            pass
```

**✅ CORRECT:**

```python
# Profile just a few iterations
with profile(
    activities=[ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3, repeat=1)
) as prof:
    for epoch in range(1):
        for step, batch in enumerate(dataloader):
            if step >= 10:
                break
            # ... training ...
            prof.step()
```

### Mistake 2: No Warmup Before Timing

**❌ WRONG:**

```python
# Including JIT compilation in timing
start = time.time()
output = model(data)  # First call - includes JIT overhead
end = time.time()
```

**✅ CORRECT:**

```python
# Warmup first
for _ in range(5):
    _ = model(data)

torch.cuda.synchronize()

# Now measure
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(data)
end.record()
torch.cuda.synchronize()
```

### Mistake 3: Synchronization Overhead

**❌ WRONG:**

```python
# Synchronizing in training loop - kills performance!
for batch in dataloader:
    output = model(batch)
    torch.cuda.synchronize()  # ❌ Breaks pipelining!
    loss = criterion(output, target)
    torch.cuda.synchronize()  # ❌ Unnecessary!
    loss.backward()
```

**✅ CORRECT:**

```python
# Only synchronize for timing/profiling, not in production
for batch in dataloader:
    output = model(batch)
    loss = criterion(output, target)
    loss.backward()
    # No synchronization - let the GPU pipeline work
```

**When to synchronize:**
- Profiling/timing measurements
- Before memory measurements
- Debugging CUDA errors
- NEVER in production training loop

### Mistake 4: Wrong Profiling Granularity

**❌ WRONG:**

```python
# Profiling entire model - too coarse, can't identify bottleneck
with profile() as prof:
    output = model(data)
# "Model takes 100ms" - not actionable!
```

**✅ CORRECT:**

```python
# Iterative narrowing:
# 1. Profile whole step
# 2. Identify slow phase (forward, backward, optimizer)
# 3. Profile that phase in detail
# 4. Identify specific operation

# Phase 1: Coarse
with record_function("forward"):
    output = model(data)
with record_function("backward"):
    loss.backward()

# Phase 2: Found forward is slow, profile in detail
with profile() as prof:
    output = model(data)
# Now see which layer is slow

# Phase 3: Found layer X is slow, profile that layer
with profile() as prof:
    output = model.layer_x(data)
# Now see which operation in layer X is slow
```

### Mistake 5: Ignoring Memory While Profiling Compute

**❌ WRONG:**

```python
# Only looking at time, ignoring memory
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    output = model(data)
```

**✅ CORRECT:**

```python
# Profile both compute AND memory
with profile(
    activities=[ProfilerActivity.CUDA],
    profile_memory=True
) as prof:
    output = model(data)

# Check both time and memory
print(prof.key_averages().table(sort_by="cuda_time_total"))
print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))
```

### Mistake 6: Profiling in Wrong Mode

**❌ WRONG:**

```python
# Profiling in eval mode when you care about training speed
model.eval()
with torch.no_grad():
    with profile() as prof:
        output = model(data)
# ❌ This doesn't include the backward pass!
```

**✅ CORRECT:**

```python
# Profile in the mode you actually use
model.train()
with profile() as prof:
    output = model(data)
    loss = criterion(output, target)
    loss.backward()  # ✅ Include backward if profiling training
```

## Red Flags - Stop and Profile Systematically

**If you catch yourself thinking ANY of these, STOP and follow the methodology:**

| Red Flag Thought | Reality | What to Do Instead |
|------------------|---------|-------------------|
| "I can see the bottleneck" | 90% of the time your guess is wrong | Profile to confirm, don't guess |
| "User says X is slow, so X is the bottleneck" | User might be wrong about the cause | Verify with profiling |
| "This loop looks inefficient" | Intuition about performance is often wrong | Measure it, don't assume |
| "Profiling takes too long" | Profiling saves hours of guessing | 10 minutes of profiling > hours of guessing |
| "Let me just try this optimization" | Premature optimization wastes time | Measure first, optimize second |
| "It's obviously a GPU problem" | Could be CPU, data loading, or I/O | Check GPU utilization first |
| "I'll reduce batch size" | Doesn't address root cause | Diagnose memory bottleneck first |
| "Skip warmup, it's just one iteration" | First iterations have 10-100x overhead | Always warmup, no exceptions |

**Critical rules:**
1. NEVER optimize before profiling
2. ALWAYS use warmup iterations
3. ALWAYS check GPU utilization before assuming a GPU bottleneck
4. ALWAYS profile data loading separately from computation
5. ALWAYS report statistics (mean, std, percentiles), not just average
6. ALWAYS use CUDA Events for GPU timing, never `time.time()`

## Common Rationalizations (Don't Do These)

| Excuse | What Really Happens | Correct Approach |
|--------|-------------------|------------------|
| "User seems rushed, skip profiling" | Guessing wastes MORE time than profiling | 10 min profiling saves hours |
| "I already profiled once" | Might have used wrong tool or granularity | Re-profile with systematic methodology |
| "Profiling overhead will skew results" | Use schedule to minimize overhead | `schedule(wait=1, warmup=2, active=3)` |
| "This worked on another model" | Different models have different bottlenecks | Profile THIS model, not assumptions |
| "Documentation says X is slow" | Depends on context, hardware, data | Verify with profiling on YOUR setup |
| "Just trust the profiler output" | Must interpret correctly | Understand what metrics mean |
| "The model is the bottleneck" | Often it's data loading | Always check data loading vs compute |

## Complete Profiling Example

```python
import subprocess
import time

import numpy as np
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, schedule
from torch.utils.data import DataLoader

def comprehensive_profiling(model, dataloader, device='cuda'):
    """
    Complete profiling workflow following the systematic methodology.
    Assumes `criterion` and `optimizer` are defined for `model`.
    """

    print("=" * 80)
    print("PHASE 1: ESTABLISH BASELINE")
    print("=" * 80)

    # Step 1: Measure baseline performance
    model = model.to(device)
    model.train()

    # Warmup
    print("\nWarming up (5 iterations)...")
    for i, (data, target) in enumerate(dataloader):
        if i >= 5:
            break
        data, target = data.to(device), target.to(device)
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    torch.cuda.synchronize()

    # Measure baseline
    print("\nMeasuring baseline (10 iterations)...")
    times = []
    for i, (data, target) in enumerate(dataloader):
        if i >= 10:
            break

        data, target = data.to(device), target.to(device)

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        end.record()

        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    times = np.array(times)
    print(f"\nBaseline Performance:")
    print(f"  Mean:   {times.mean():.2f} ms/iteration")
    print(f"  Std:    {times.std():.2f} ms")
    print(f"  Median: {np.median(times):.2f} ms")
    print(f"  P95:    {np.percentile(times, 95):.2f} ms")

    # -------------------------------------------------------------------------
    print("\n" + "=" * 80)
    print("PHASE 2: IDENTIFY BOTTLENECK TYPE")
    print("=" * 80)

    # Check GPU utilization
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    gpu_util, mem_used = result.stdout.strip().split(',')
    print(f"\nGPU Utilization: {gpu_util}%")
    print(f"GPU Memory Used: {mem_used} MB")

    if int(gpu_util) < 70:
        print("⚠️ LOW GPU UTILIZATION - likely CPU-bound")
    else:
        print("✅ GPU utilization healthy - likely GPU-bound")

    # Profile data loading vs computation
    print("\nProfiling data loading vs computation...")
    data_times = []
    compute_times = []

    batch_iter = iter(dataloader)
    for i in range(20):
        # Data loading time
        data_start = time.time()
        data, target = next(batch_iter)
        data_time = time.time() - data_start
        data_times.append(data_time * 1000)  # to ms

        # Computation time
        data, target = data.to(device), target.to(device)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        end.record()

        torch.cuda.synchronize()
        compute_times.append(start.elapsed_time(end))

    avg_data = np.mean(data_times)
    avg_compute = np.mean(compute_times)

    print(f"\nData loading: {avg_data:.2f} ms")
    print(f"Computation:  {avg_compute:.2f} ms")

    if avg_data > avg_compute:
        print("🎯 BOTTLENECK: Data loading (CPU-bound)")
    else:
        print("🎯 BOTTLENECK: Model computation (GPU-bound)")

    # -------------------------------------------------------------------------
    print("\n" + "=" * 80)
    print("PHASE 3: NARROW TO COMPONENT")
    print("=" * 80)

    # Profile training phases
    print("\nProfiling training phases...")

    data, target = next(iter(dataloader))
    data, target = data.to(device), target.to(device)

    # Forward
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    output = model(data)
    loss = criterion(output, target)
    end.record()
    torch.cuda.synchronize()
    forward_time = start.elapsed_time(end)

    # Backward
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    loss.backward()
    end.record()
    torch.cuda.synchronize()
    backward_time = start.elapsed_time(end)

    # Optimizer
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    optimizer.step()
    optimizer.zero_grad()
    end.record()
    torch.cuda.synchronize()
    optimizer_time = start.elapsed_time(end)

    total = forward_time + backward_time + optimizer_time

    print(f"\nPhase breakdown:")
    print(f"  Forward:   {forward_time:7.2f} ms ({forward_time/total*100:5.1f}%)")
    print(f"  Backward:  {backward_time:7.2f} ms ({backward_time/total*100:5.1f}%)")
    print(f"  Optimizer: {optimizer_time:7.2f} ms ({optimizer_time/total*100:5.1f}%)")

    # -------------------------------------------------------------------------
    print("\n" + "=" * 80)
    print("PHASE 4: IDENTIFY OPERATION")
    print("=" * 80)

    # Detailed profiling
    print("\nRunning detailed profiler...")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        for step, (data, target) in enumerate(dataloader):
            if step >= 10:
                break

            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()

            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            prof.step()

    # Print summary
    print("\nTop operations by CUDA time:")
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=15,
        max_src_column_width=60
    ))

    print("\nTop operations by memory:")
    print(prof.key_averages().table(
        sort_by="self_cuda_memory_usage",
        row_limit=10,
        max_src_column_width=60
    ))

    prof.export_chrome_trace("detailed_trace.json")

    print("\n" + "=" * 80)
    print("PROFILING COMPLETE")
    print("=" * 80)
    print("\nNext steps:")
    print("  1. Review top operations in table above")
    print("  2. Open chrome://tracing and load detailed_trace.json")
    print("  3. Or view in TensorBoard: tensorboard --logdir=./profiler_logs")
    print("  4. Focus optimization on identified bottleneck")
    print("  5. Re-run this profiling after optimization to verify improvement")

    return {
        'baseline_ms': times.mean(),
        'gpu_utilization': int(gpu_util),
        'data_loading_ms': avg_data,
        'computation_ms': avg_compute,
        'forward_ms': forward_time,
        'backward_ms': backward_time,
        'optimizer_ms': optimizer_time,
    }
```

## Memory Profiling Complete Example

```python
def profile_memory_usage(model, sample_batch, sample_target):
    """Comprehensive memory profiling"""

    print("=" * 80)
    print("MEMORY PROFILING")
    print("=" * 80)

    device = next(model.parameters()).device

    # Reset memory stats
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.empty_cache()

    def print_memory(stage):
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        peak = torch.cuda.max_memory_allocated() / 1e9
        print(f"{stage:30s} | Alloc: {allocated:5.2f} GB | "
              f"Reserved: {reserved:5.2f} GB | Peak: {peak:5.2f} GB")

    print("\nMemory tracking:")
    print("-" * 80)

    print_memory("Initial")

    # Move data to GPU
    data = sample_batch.to(device)
    target = sample_target.to(device)
    print_memory("After data to GPU")

    # Forward pass
    output = model(data)
    print_memory("After forward")

    # Loss
    loss = criterion(output, target)
    print_memory("After loss")

    # Backward
    loss.backward()
    print_memory("After backward")

    # Optimizer
    optimizer.step()
    print_memory("After optimizer step")

    # Zero grad
    optimizer.zero_grad()
    print_memory("After zero_grad")

    # Final peak
    peak_memory = torch.cuda.max_memory_allocated() / 1e9
    print("-" * 80)
    print(f"\nPeak memory usage: {peak_memory:.2f} GB")

    # Detailed summary
    print("\n" + "=" * 80)
    print("DETAILED MEMORY SUMMARY")
    print("=" * 80)
    print(torch.cuda.memory_summary())

    # Memory breakdown
    print("\n" + "=" * 80)
    print("MEMORY OPTIMIZATION SUGGESTIONS")
    print("=" * 80)

    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9

    if reserved > allocated * 1.5:
        print("⚠️ Memory fragmentation detected")
        print(f"   Reserved: {reserved:.2f} GB, Allocated: {allocated:.2f} GB")
        print("   Suggestion: Call torch.cuda.empty_cache() periodically")

    # Estimate memory components
    param_memory = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9
    print(f"\nModel parameters:        {param_memory:.2f} GB")

    # Gradients are the same size as the parameters
    print(f"Gradients (estimate):    {param_memory:.2f} GB")

    # Optimizer states (Adam keeps two extra buffers per parameter)
    optimizer_memory = 0.0
    if isinstance(optimizer, torch.optim.Adam):
        optimizer_memory = param_memory * 2
        print(f"Optimizer states (Adam): {optimizer_memory:.2f} GB")

    # Activations (peak - parameters - gradients - optimizer states)
    activation_memory = peak_memory - 2 * param_memory - optimizer_memory
    if activation_memory > 0:
        print(f"Activations (estimate):  {activation_memory:.2f} GB")

        if activation_memory > peak_memory * 0.5:
            print("\n🎯 Activations dominate memory usage")
            print("   Suggestions:")
            print("   1. Use gradient checkpointing")
            print("   2. Reduce batch size")
            print("   3. Use mixed precision (FP16/BF16)")

    return {
        'peak_gb': peak_memory,
        'parameters_gb': param_memory,
        'fragmentation_ratio': reserved / allocated if allocated > 0 else 1.0
    }
```

## Profiling Checklist

Before claiming you've profiled the code, verify:

- [ ] **Baseline established**
  - [ ] Defined performance metric (throughput/latency/memory)
  - [ ] Measured with CUDA Events (not time.time())
  - [ ] Used 5+ warmup iterations
  - [ ] Reported statistics (mean, std, percentiles)
  - [ ] Documented measurement conditions

- [ ] **Bottleneck type identified**
  - [ ] Checked GPU utilization (nvidia-smi)
  - [ ] Profiled data loading vs computation separately
  - [ ] Categorized as CPU-bound, GPU-bound, memory-bound, or I/O-bound
  - [ ] Verified category with profiling data (not guessing)

- [ ] **Component identified**
  - [ ] Profiled training phases (forward/backward/optimizer)
  - [ ] Identified which phase is slowest
  - [ ] Used iterative narrowing approach
  - [ ] Examined both table and trace view

- [ ] **Operation identified**
  - [ ] Profiled bottleneck component in detail
  - [ ] Found specific operation or pattern
  - [ ] Understand WHY it's slow (not just WHAT is slow)
  - [ ] Have actionable optimization target

- [ ] **Verification ready**
  - [ ] Saved baseline measurements
  - [ ] Know how to re-run profiling after optimization
  - [ ] Can verify if optimization actually helped
  - [ ] Have profiling artifacts (traces, summaries)

## References

**PyTorch Profiling Documentation:**
- torch.profiler: https://pytorch.org/docs/stable/profiler.html
- Profiling recipe: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- Performance tuning guide: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

**Related Skills:**
- tensor-operations-and-memory (memory leak debugging, operation optimization)
- mixed-precision-and-optimization (AMP profiling, Tensor Core utilization)
- distributed-training-strategies (multi-GPU profiling)

**Tools:**
- Chrome tracing: chrome://tracing
- TensorBoard profiler: `tensorboard --logdir=<path>`
- NVIDIA Nsight Systems: `nsys profile python train.py`
- PyTorch Memory Visualizer: `python -m torch.cuda._memory_viz`