# Performance Profiling and Bottleneck Analysis

## Overview

**Core Principle:** Optimization without measurement is guessing. Profile systematically (whole → component → operation) using the right tools to identify actual bottlenecks before attempting fixes. 90% of runtime is usually in 10% of code - find that 10% with profiling, not intuition.

Performance issues stem from: data loading bottlenecks (CPU-bound), inefficient operations (GPU-bound), memory bandwidth limits (memory-bound), or I/O bottlenecks. Profiling reveals which category applies. Guessing leads to optimizing the wrong thing, wasting hours on marginal improvements while the real bottleneck remains.

## When to Use

**Use this skill when:**
- Training or inference is slower than expected
- Need to identify a performance bottleneck in PyTorch code
- High GPU memory usage, need to understand what's using memory
- Evaluating whether an optimization actually improved performance
- Debugging low GPU utilization issues
- Comparing performance of different implementations
- Need to profile specific operations or model components

**Don't use when:**
- Performance is already acceptable (no problem to solve)
- Architecture design questions (use module-design-patterns)
- Debugging correctness issues (use debugging-techniques)
- Memory leaks (use tensor-operations-and-memory)

**Symptoms triggering this skill:**
- "Training is slower than expected"
- "Low GPU utilization but training still slow"
- "Which part of my model is the bottleneck?"
- "Does this optimization actually help?"
- "Memory usage is high, what's using it?"
- "First iteration much slower than subsequent ones"

## Systematic Profiling Methodology

### The Four-Phase Framework

**Phase 1: Establish Baseline**
- Define metric (throughput, latency, memory)
- Measure end-to-end performance
- Set improvement target
- Document measurement conditions

**Phase 2: Identify Bottleneck Type**
- CPU-bound vs GPU-bound vs I/O-bound vs memory-bound
- Check GPU utilization (nvidia-smi)
- Profile data loading separately from computation
- Determine which component to investigate

**Phase 3: Narrow to Component**
- Profile at coarse granularity
- Identify which phase is slow (forward/backward/optimizer/data loading)
- Focus profiling on bottleneck component
- Use iterative narrowing

**Phase 4: Identify Operation**
- Profile bottleneck component in detail
- Examine both table view and trace view
- Find specific operation or pattern
- Measure improvement after fix

**Critical Rule:** ALWAYS work through phases in order. Don't jump to Phase 4 without Phases 1-3.

### Phase 1: Establish Baseline

**Step 1: Define Performance Metric**

```python
import torch
import numpy as np

# Choose the right metric for your use case.
# Note: `criterion` and `optimizer` are assumed to be defined in scope here
# (and throughout the examples in this document).

# Throughput (samples/second) - for training
def measure_throughput(model, dataloader, num_batches=100):
    model.train()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    total_samples = 0
    start.record()

    for i, (data, target) in enumerate(dataloader):
        if i >= num_batches:
            break
        data, target = data.cuda(), target.cuda()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_samples += data.size(0)

    end.record()
    torch.cuda.synchronize()
    elapsed_ms = start.elapsed_time(end)

    throughput = total_samples / (elapsed_ms / 1000.0)
    print(f"Throughput: {throughput:.2f} samples/sec")
    print(f"Time per batch: {elapsed_ms / num_batches:.2f} ms")
    return throughput

# Latency (time per sample) - for inference
def measure_latency(model, sample_input, num_iterations=100, warmup=10):
    model.eval()
    sample_input = sample_input.cuda()

    # Warmup (CRITICAL - don't skip!)
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(sample_input)

    # Measure
    latencies = []
    with torch.no_grad():
        for _ in range(num_iterations):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)

            start.record()
            output = model(sample_input)
            end.record()

            torch.cuda.synchronize()
            latencies.append(start.elapsed_time(end))

    # Report statistics (not just average!)
    latencies = np.array(latencies)
    print(f"Latency - Mean: {latencies.mean():.2f} ms, "
          f"Std: {latencies.std():.2f} ms, "
          f"Median: {np.median(latencies):.2f} ms, "
          f"P95: {np.percentile(latencies, 95):.2f} ms, "
          f"P99: {np.percentile(latencies, 99):.2f} ms")
    return latencies

# Memory usage (peak GB)
def measure_memory(model, sample_batch, target):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    # Run one iteration
    output = model(sample_batch)
    loss = criterion(output, target)
    loss.backward()

    torch.cuda.synchronize()
    peak_memory = torch.cuda.max_memory_allocated() / 1e9
    print(f"Peak memory: {peak_memory:.2f} GB")
    return peak_memory
```

**Why this matters:**
- Without a baseline, you can't measure improvement
- Need statistics (mean, std, percentiles), not just average
- Must use CUDA Events for GPU timing (not `time.time()`)
- Warmup is critical to exclude JIT compilation overhead

**Step 2: Document Measurement Conditions**

```python
import json

# Record all relevant configuration
# (assumes `model`, `dataloader`, and a `using_amp` flag are defined)
profiling_config = {
    'model': model.__class__.__name__,
    'batch_size': 32,
    'input_shape': (3, 224, 224),
    'device': 'cuda:0',
    'dtype': 'float16' if using_amp else 'float32',
    'mode': 'train',  # or 'eval'
    'num_workers': dataloader.num_workers,
    'cudnn_benchmark': torch.backends.cudnn.benchmark,
    'gpu': torch.cuda.get_device_name(0),
}

print(json.dumps(profiling_config, indent=2))
```

**Why this matters:**
- Performance changes with configuration
- Need to reproduce results
- Comparing different runs requires the same conditions
- Document before optimizing, re-measure after

### Phase 2: Identify Bottleneck Type

**Step 1: Check GPU Utilization**

```bash
# In terminal, monitor GPU utilization in real-time
nvidia-smi dmon -s u -i 0 -d 1
```

```python
# Or query from within Python
import subprocess

result = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
                         '--format=csv,noheader,nounits'],
                        capture_output=True, text=True)
gpu_util, mem_used = result.stdout.strip().split(',')
print(f"GPU Utilization: {gpu_util}%, Memory: {mem_used} MB")
```

**Interpretation:**

| GPU Utilization | Likely Bottleneck | Next Step |
|-----------------|-------------------|-----------|
| < 70% | CPU-bound (data loading, preprocessing) | Profile data loading |
| 70-90% | Mixed or memory-bound | Check memory bandwidth |
| > 90% | GPU-bound (computation) | Profile model operations |

**Why this matters:** GPU utilization tells you WHERE to look. If the GPU isn't saturated, optimizing GPU operations won't help.

**Step 2: Profile Data Loading vs Computation**

```python
import time
import numpy as np

def profile_dataloader_vs_model(model, dataloader, num_batches=50):
    """Separate data loading time from model computation time"""
    model.train()

    data_times = []
    compute_times = []

    batch_iterator = iter(dataloader)

    for i in range(num_batches):
        # Time data loading
        data_start = time.time()
        data, target = next(batch_iterator)
        data_end = time.time()
        data_times.append(data_end - data_start)

        # Time computation
        data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
        torch.cuda.synchronize()

        compute_start = torch.cuda.Event(enable_timing=True)
        compute_end = torch.cuda.Event(enable_timing=True)

        compute_start.record()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        compute_end.record()

        torch.cuda.synchronize()
        compute_times.append(compute_start.elapsed_time(compute_end))

    avg_data_time = np.mean(data_times) * 1000  # to ms
    avg_compute_time = np.mean(compute_times)

    print(f"Avg data loading time: {avg_data_time:.2f} ms")
    print(f"Avg computation time: {avg_compute_time:.2f} ms")
    print(f"Data loading is {avg_data_time/avg_compute_time:.1f}x "
          f"{'slower' if avg_data_time > avg_compute_time else 'faster'} than compute")

    if avg_data_time > avg_compute_time:
        print("⚠️ BOTTLENECK: Data loading (CPU-bound)")
        print("   Solutions: Increase num_workers, use pin_memory=True, "
              "move preprocessing to GPU")
    else:
        print("✅ Data loading is fast enough. Bottleneck is in model computation.")

    return avg_data_time, avg_compute_time
```

**Why this matters:**
- If data loading > computation time, the GPU is starved (increase workers)
- If computation > data loading, the GPU is the bottleneck (optimize model)
- Common mistake: Optimizing the model when data loading is the bottleneck

**Step 3: Determine Bottleneck Category**

```python
from torch.profiler import profile, ProfilerActivity

def diagnose_bottleneck_type(model, dataloader):
    """Systematic bottleneck categorization"""

    # 1. Check GPU utilization
    print("=== GPU Utilization Check ===")
    # Run training for a bit while monitoring GPU (see Step 1 above)
    # If GPU util < 70% → CPU-bound
    # If GPU util > 90% → GPU-bound

    # 2. Check memory bandwidth
    print("\n=== Memory Bandwidth Check ===")

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for i, (data, target) in enumerate(dataloader):
            if i >= 5:
                break
            data, target = data.cuda(), target.cuda()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()

    # Look for high memory-bound ops
    events = prof.key_averages()
    for evt in events:
        if evt.cuda_time_total > 0:
            # If many large tensor ops with low FLOPS → memory-bound
            pass

    # 3. Profile phases. Data loading is timed separately, since
    #    profile_training_phases (defined in Phase 3, Step 2 below)
    #    only covers the GPU-side phases.
    print("\n=== Phase Profiling ===")
    data_time, _ = profile_dataloader_vs_model(model, dataloader)
    data, target = next(iter(dataloader))
    times = profile_training_phases(model, data, target)

    # Interpret results
    print("\n=== Diagnosis ===")
    if data_time > times['forward'] + times['backward']:
        print("BOTTLENECK: CPU-bound (data loading)")
        print("Action: Increase num_workers, enable pin_memory, cache data")
    elif times['forward'] > times['backward'] * 2:
        print("BOTTLENECK: GPU-bound (forward pass)")
        print("Action: Profile forward pass operations")
    elif times['backward'] > times['forward'] * 2:
        print("BOTTLENECK: GPU-bound (backward pass)")
        print("Action: Profile backward pass, check gradient checkpointing")
    else:
        print("BOTTLENECK: Mixed or memory-bound")
        print("Action: Deep profiling needed")
```

### Phase 3: Narrow to Component

**Step 1: Coarse-Grained Profiling**

```python
from torch.profiler import profile, ProfilerActivity, schedule

def profile_training_step(model, dataloader, num_steps=10):
    """Profile one training step to identify bottleneck phase"""

    # Use schedule to reduce profiling overhead
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:

        for step, (data, target) in enumerate(dataloader):
            if step >= num_steps:
                break

            data, target = data.cuda(), target.cuda()

            optimizer.zero_grad()

            # Forward
            output = model(data)
            loss = criterion(output, target)

            # Backward
            loss.backward()

            # Optimizer
            optimizer.step()

            prof.step()  # Notify profiler of step boundary

    # Print summary
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=20,
        max_src_column_width=80
    ))

    # Export trace for visualization
    print("\n✅ Trace exported to ./profiler_logs")
    print("   View in Chrome: chrome://tracing (load trace.json)")
    print("   Or TensorBoard: tensorboard --logdir=./profiler_logs")

    return prof
```

**Understanding the schedule:**
- `wait=1`: Skip first iteration (cold start)
- `warmup=2`: Next 2 iterations for warmup (no profiling overhead)
- `active=5`: Profile these 5 iterations
- `repeat=1`: Do this cycle once

**Why this matters:**
- Profiling has overhead - don't profile every iteration
- Schedule controls when profiling is active
- Warmup prevents including JIT compilation in measurements

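One practical consequence of the schedule parameters is how many iterations the loop must run before a trace exists at all; a small sketch of the arithmetic:

```python
from torch.profiler import schedule

sched = schedule(wait=1, warmup=2, active=5, repeat=1)
# One full cycle = wait + warmup + active = 1 + 2 + 5 = 8 iterations,
# so the training loop must call prof.step() at least 8 times;
# data is recorded only during the 5 "active" steps.
```
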
**Step 2: Phase-Level Timing**

```python
from torch.profiler import record_function

def profile_training_phases(model, batch, target):
    """Time each phase of training separately"""

    data, target = batch.cuda(), target.cuda()
    optimizer.zero_grad()

    torch.cuda.synchronize()

    # Profile each phase
    phases = {}

    # Forward pass
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with record_function("forward_pass"):
        start.record()
        output = model(data)
        loss = criterion(output, target)
        end.record()
    torch.cuda.synchronize()
    phases['forward'] = start.elapsed_time(end)

    # Backward pass
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with record_function("backward_pass"):
        start.record()
        loss.backward()
        end.record()
    torch.cuda.synchronize()
    phases['backward'] = start.elapsed_time(end)

    # Optimizer step
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with record_function("optimizer_step"):
        start.record()
        optimizer.step()
        end.record()
    torch.cuda.synchronize()
    phases['optimizer'] = start.elapsed_time(end)

    # Print breakdown
    total = sum(phases.values())
    print("Phase Breakdown:")
    for phase, time_ms in phases.items():
        print(f"  {phase:15s}: {time_ms:7.2f} ms ({time_ms/total*100:5.1f}%)")

    return phases
```

**Why this matters:**
- Identifies which phase is slowest
- Focuses subsequent profiling on the bottleneck phase
- Uses `record_function` to add custom markers in trace view

**Step 3: Module-Level Profiling**

```python
from torch.profiler import profile, record_function, ProfilerActivity

def profile_model_modules(model, sample_input):
    """Profile time spent in each leaf module"""

    model.eval()
    sample_input = sample_input.cuda()

    # Approach 1: hooks with CUDA events - record a start event just before
    # each leaf module runs and an end event right after it returns
    module_events = {}

    def make_pre_hook(name):
        def pre_hook(module, inputs):
            start = torch.cuda.Event(enable_timing=True)
            start.record()
            module_events[name] = [start, None]
        return pre_hook

    def make_post_hook(name):
        def post_hook(module, inputs, output):
            end = torch.cuda.Event(enable_timing=True)
            end.record()
            module_events[name][1] = end
        return post_hook

    hooks = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # Leaf modules only
            hooks.append(module.register_forward_pre_hook(make_pre_hook(name)))
            hooks.append(module.register_forward_hook(make_post_hook(name)))

    with torch.no_grad():
        _ = model(sample_input)
    torch.cuda.synchronize()

    for name, (start, end) in module_events.items():
        print(f"{name:40s}: {start.elapsed_time(end):7.3f} ms")

    # Clean up
    for hook in hooks:
        hook.remove()

    # Approach 2: wrap modules so each shows up as a named range in the trace
    class ProfilingModule(torch.nn.Module):
        def __init__(self, module, name):
            super().__init__()
            self.module = module
            self.name = name

        def forward(self, *args, **kwargs):
            with record_function(f"module_{self.name}"):
                return self.module(*args, **kwargs)

    # Approach 3: just use torch.profiler with record_shapes=True -
    # it automatically attributes time to operators
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        with record_function("model_forward"):
            output = model(sample_input)

    print(prof.key_averages(group_by_input_shape=True).table(
        sort_by="cuda_time_total", row_limit=20
    ))
```

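A usage sketch for the `ProfilingModule` wrapper above (hypothetical; assumes `model` and `sample_input` from the surrounding example): wrapping top-level children in place makes each one appear as a labeled range in the trace view.

```python
# Wrap each top-level child so it shows up as "module_<name>" in the trace
for name, child in list(model.named_children()):
    setattr(model, name, ProfilingModule(child, name))

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    output = model(sample_input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
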
**Why this matters:**
- Identifies which model component is slowest
- Guides optimization efforts to specific layers
- Reveals unexpected bottlenecks (e.g., LayerNorm taking 30% of time)

### Phase 4: Identify Operation

**Step 1: Detailed Operation Profiling**

```python
def profile_operations_detailed(model, sample_input):
    """Get detailed breakdown of all operations"""

    model.eval()
    sample_input = sample_input.cuda()

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        with_stack=True,
        profile_memory=True
    ) as prof:
        output = model(sample_input)

    # Group by operation type
    print("\n=== Top Operations by CUDA Time ===")
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=30,
        max_src_column_width=100
    ))

    print("\n=== Top Operations by Memory ===")
    print(prof.key_averages().table(
        sort_by="self_cuda_memory_usage",
        row_limit=20,
        max_src_column_width=100
    ))

    print("\n=== Grouped by Input Shape ===")
    print(prof.key_averages(group_by_input_shape=True).table(
        sort_by="cuda_time_total",
        row_limit=20
    ))

    # Export for trace view
    prof.export_chrome_trace("detailed_trace.json")
    print("\n✅ Exported detailed_trace.json - view in chrome://tracing")

    return prof
```

**Step 2: Reading Profiler Output**

```python
# Example profiler output:
"""
---------------------------------  ------------  ------------  ------------  ------------
Name                               Self CPU %    Self CPU      CPU total %   CPU total
---------------------------------  ------------  ------------  ------------  ------------
aten::conv2d                       0.5%          124ms         45.2%         11.234s
aten::convolution                  1.2%          298ms         44.7%         11.110s
aten::_convolution                 2.3%          571ms         43.5%         10.812s
aten::cudnn_convolution            40.1%         9.967s        41.2%         10.241s
aten::batch_norm                   0.3%          74ms          25.8%         6.412s
aten::_batch_norm                  1.1%          273ms         25.5%         6.338s
aten::cudnn_batch_norm             23.2%         5.765s        24.4%         6.065s
aten::relu                         8.2%          2.038s        8.2%          2.038s
"""
```

**How to interpret:**

| Column | Meaning | When to Look |
|--------|---------|--------------|
| Name | Operation name | Identify what operation |
| Self CPU % | Time in this op only (no children) | Find leaf operations |
| CPU total % | Time in op + children | Find expensive subtrees |
| Self CUDA time | GPU execution time | Main metric for GPU ops |
| Call count | How many times called | High count = optimization target |

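The same numbers are also available programmatically: `key_averages()` yields per-operation records whose attributes mirror these columns (a sketch; attribute names as in recent PyTorch releases):

```python
for evt in prof.key_averages():
    if evt.cuda_time_total > 0:
        # profiler times are reported in microseconds
        print(f"{evt.key:40s} calls={evt.count:6d} "
              f"cuda={evt.cuda_time_total / 1e3:8.2f} ms")
```
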
**Common patterns:**

```python
# Pattern 1: High aten::copy_ time (40%+)
# → Device transfer issue (CPU ↔ GPU)
# Action: Check device placement, reduce transfers

# Pattern 2: High cudaLaunchKernel overhead
# → Too many small kernel launches
# Action: Increase batch size, fuse operations

# Pattern 3: High cudnn_convolution time
# → Convolutions are bottleneck (expected for CNNs)
# Action: Check input dimensions for Tensor Core alignment

# Pattern 4: High CPU time, low CUDA time
# → CPU bottleneck (data loading, preprocessing)
# Action: Increase num_workers, move ops to GPU

# Pattern 5: Many small operations
# → Operation fusion opportunity
# Action: Use torch.compile or fuse manually
```

**Step 3: Trace View Analysis**

```python
# After exporting trace: prof.export_chrome_trace("trace.json")
# Open in chrome://tracing

"""
Trace view shows:
1. Timeline of GPU kernels
2. CPU → GPU synchronization points
3. Parallel vs sequential execution
4. GPU idle time (gaps between kernels)

What to look for:
- Large gaps between GPU kernels → GPU underutilized
- Many thin bars → Too many small operations
- Thick bars → Few large operations (good for GPU)
- Yellow/red bars → CPU activity (should be minimal during GPU work)
- Overlapping bars → Concurrent execution (good)
"""
```

**Reading trace view:**

```
GPU Stream 0: ████░░░░████░░░░████  ← Gaps = idle GPU (bad)
GPU Stream 0: ███████████████████   ← Continuous = good utilization
CPU:          ░░░░████░░░░████░░░░  ← CPU peaks = data loading

Timeline:
[Data Load]──→[GPU Forward]──→[Data Load]──→[GPU Forward]
              ↑ Gap here = GPU waiting for data
```

## Memory Profiling

### Memory Tracking Methodology

**Step 1: Basic Memory Tracking**

```python
import torch

def track_memory(stage_name):
    """Print current memory usage"""
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{stage_name:30s} - Allocated: {allocated:6.2f} GB, "
          f"Reserved: {reserved:6.2f} GB")

# Track at each training phase
track_memory("Start")

data, target = next(iter(dataloader))
data, target = data.cuda(), target.cuda()
track_memory("After data to GPU")

output = model(data)
track_memory("After forward")

loss = criterion(output, target)
track_memory("After loss")

loss.backward()
track_memory("After backward")

optimizer.step()
track_memory("After optimizer step")

optimizer.zero_grad()
track_memory("After zero_grad")
```

**Output interpretation:**

```
Start                - Allocated:   2.50 GB, Reserved:   2.75 GB
After data to GPU    - Allocated:   2.62 GB, Reserved:   2.75 GB
After forward        - Allocated:   4.80 GB, Reserved:   5.00 GB  ← Activations
After loss           - Allocated:   4.81 GB, Reserved:   5.00 GB
After backward       - Allocated:   7.20 GB, Reserved:   7.50 GB  ← Gradients
After optimizer step - Allocated:   7.20 GB, Reserved:   7.50 GB
After zero_grad      - Allocated:   4.70 GB, Reserved:   7.50 GB  ← Gradients freed
```

**Key insights:**
- Allocated = actual memory used
- Reserved = memory held by allocator (may be > allocated due to caching)
- Large jump after forward = activations (consider gradient checkpointing)
- Large jump after backward = gradients (same size as parameters)
- Reserved stays high = memory fragmentation or caching

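A minimal sketch that makes the allocated vs. reserved distinction visible (the ~1 GB tensor is arbitrary; exact numbers depend on allocator state):

```python
import torch

x = torch.randn(256, 1024, 1024, device='cuda')  # ~1 GB of float32
print(f"alloc={torch.cuda.memory_allocated() / 1e9:.2f} GB, "
      f"reserved={torch.cuda.memory_reserved() / 1e9:.2f} GB")

del x  # allocated drops immediately...
print(f"alloc={torch.cuda.memory_allocated() / 1e9:.2f} GB, "
      f"reserved={torch.cuda.memory_reserved() / 1e9:.2f} GB")

torch.cuda.empty_cache()  # ...reserved only drops once the cache is released
print(f"alloc={torch.cuda.memory_allocated() / 1e9:.2f} GB, "
      f"reserved={torch.cuda.memory_reserved() / 1e9:.2f} GB")
```
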
**Step 2: Peak Memory Analysis**

```python
def analyze_peak_memory(model, batch, target):
    """Find peak memory usage and what causes it"""

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    # Run one iteration
    data, target = batch.cuda(), target.cuda()

    output = model(data)
    forward_peak = torch.cuda.max_memory_allocated() / 1e9

    loss = criterion(output, target)
    loss_peak = torch.cuda.max_memory_allocated() / 1e9

    loss.backward()
    backward_peak = torch.cuda.max_memory_allocated() / 1e9

    optimizer.step()
    optimizer_peak = torch.cuda.max_memory_allocated() / 1e9

    optimizer.zero_grad()
    final_peak = torch.cuda.max_memory_allocated() / 1e9

    print(f"Peak after forward:   {forward_peak:.2f} GB")
    print(f"Peak after loss:      {loss_peak:.2f} GB")
    print(f"Peak after backward:  {backward_peak:.2f} GB")
    print(f"Peak after optimizer: {optimizer_peak:.2f} GB")
    print(f"Overall peak:         {final_peak:.2f} GB")

    # Identify bottleneck
    if forward_peak > backward_peak * 0.8:
        print("\n⚠️ Activations dominate memory usage")
        print("   Consider: Gradient checkpointing, smaller batch size")
    elif backward_peak > forward_peak * 1.5:
        print("\n⚠️ Gradients dominate memory usage")
        print("   Consider: Gradient accumulation, mixed precision")
    else:
        print("\n✅ Memory usage balanced across phases")

    return {
        'forward': forward_peak,
        'backward': backward_peak,
        'optimizer': optimizer_peak,
        'peak': final_peak
    }
```

**Step 3: Detailed Memory Summary**

```python
def print_memory_summary():
    """Print detailed memory breakdown"""
    print(torch.cuda.memory_summary())

"""
Example output:
|===========================================================================|
|                        PyTorch CUDA memory summary                        |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0                                                              |
| Metric         | Cur Usage  | Peak Usage | Alloc Retries | # Allocs      |
|----------------|------------|------------|---------------|---------------|
| Allocated      | 4.50 GB    | 7.20 GB    | 0             | 15234         |
| Reserved       | 7.50 GB    | 7.50 GB    | 0             | 1523          |
| Active         | 4.50 GB    | 7.20 GB    |               |               |
| Inactive       | 3.00 GB    | 3.00 GB    |               |               |
|===========================================================================|

Allocated memory:      4.50 GB  ← Actual tensors
Reserved memory:       7.50 GB  ← Memory held by allocator
Active allocations:    4.50 GB  ← Currently in use
Inactive allocations:  3.00 GB  ← Cached for reuse (fragmentation)
"""

# If Inactive >> 0, memory fragmentation is occurring
# Periodic torch.cuda.empty_cache() may help
```

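Beyond periodic `empty_cache()`, the caching allocator itself can be tuned through the `PYTORCH_CUDA_ALLOC_CONF` environment variable; `max_split_size_mb` limits the block splitting that drives fragmentation. It must take effect before the first CUDA allocation - a sketch:

```python
import os

# Set before torch initializes CUDA (ideally before `import torch`)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```
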
**Step 4: Memory Snapshot (PyTorch 2.0+)**

```python
import torch

def capture_memory_snapshot(filename="memory_snapshot.pickle"):
    """Capture detailed memory snapshot for analysis"""

    # Enable memory history tracking
    torch.cuda.memory._record_memory_history(max_entries=100000)

    try:
        # Run your training code here
        for i, (data, target) in enumerate(dataloader):
            if i >= 5:
                break
            data, target = data.cuda(), target.cuda()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Capture snapshot
        torch.cuda.memory._dump_snapshot(filename)
        print(f"✅ Memory snapshot saved to {filename}")

    finally:
        # Disable tracking
        torch.cuda.memory._record_memory_history(enabled=None)

    print(f"\nAnalyze with:")
    print(f"  python -m torch.cuda._memory_viz trace_plot {filename}")
    print(f"  # Opens interactive visualization in browser")
```

**Memory snapshot visualization shows:**
- Allocation timeline
- Stack traces for each allocation
- Memory leaks (allocations never freed)
- Fragmentation patterns
- Peak memory events

## GPU Timing Best Practices

### CUDA Synchronization and Events

**❌ WRONG: Using time.time() for GPU operations**

```python
import time

# WRONG - measures CPU time, not GPU time!
start = time.time()
output = model(data)  # Kernel launches, returns immediately
end = time.time()
print(f"Time: {end - start:.4f}s")  # ❌ This is kernel launch overhead (~microseconds)

# Problem: CUDA operations are asynchronous
# time.time() measures CPU time (when kernel was launched)
# Not GPU time (when kernel actually executed)
```

**Why this is wrong:**
- CUDA kernel launches are asynchronous (return immediately to CPU)
- `time.time()` measures CPU wall-clock time
- Actual GPU execution happens later, in parallel with CPU
- Measured time is kernel launch overhead (microseconds), not execution time

**✅ CORRECT: Using CUDA Events**

```python
# CORRECT - measures actual GPU execution time
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
output = model(data)
end_event.record()

# Wait for GPU to finish
torch.cuda.synchronize()

# Get elapsed time in milliseconds
elapsed_time_ms = start_event.elapsed_time(end_event)
print(f"GPU Time: {elapsed_time_ms:.2f} ms")
```

**Why this is correct:**
- CUDA Events are GPU-native timing
- `record()` inserts timing markers into the GPU stream
- `synchronize()` waits for the GPU to complete
- `elapsed_time()` returns actual GPU execution time

**Alternative: Using torch.profiler**

```python
# For comprehensive profiling, use torch.profiler instead of manual timing
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    output = model(data)

print(prof.key_averages().table(sort_by="cuda_time_total"))
# This automatically handles synchronization and provides detailed breakdown
```

### Warmup Iterations

**Why warmup is critical:**

```python
# First iteration includes:
# 1. CUDA kernel JIT compilation
# 2. cuDNN algorithm selection (benchmark mode)
# 3. Memory pool allocation
# 4. CPU→GPU transfer of model weights (first time)

# Example timing without warmup:
model.eval()
with torch.no_grad():
    for i in range(10):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        output = model(sample_input)
        end.record()
        torch.cuda.synchronize()

        print(f"Iteration {i}: {start.elapsed_time(end):.2f} ms")

"""
Output:
Iteration 0: 1234.56 ms  ← JIT compilation, cuDNN benchmarking
Iteration 1:  987.43 ms  ← Still some overhead
Iteration 2:  102.34 ms  ← Stabilized
Iteration 3:  101.89 ms  ← Stable
Iteration 4:  102.12 ms  ← Stable
...
"""
```

**✅ Correct warmup methodology:**

```python
import numpy as np

def benchmark_with_warmup(model, sample_input, warmup=5, iterations=100):
    """Proper benchmarking with warmup"""

    model.eval()
    sample_input = sample_input.cuda()

    # Warmup iterations (CRITICAL!)
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(sample_input)

    # Ensure warmup completed
    torch.cuda.synchronize()

    # Actual measurement
    times = []
    with torch.no_grad():
        for _ in range(iterations):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)

            start.record()
            output = model(sample_input)
            end.record()

            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))

    # Report statistics
    times = np.array(times)

    print(f"Mean:   {times.mean():.2f} ms")
    print(f"Std:    {times.std():.2f} ms")
    print(f"Median: {np.median(times):.2f} ms")
    print(f"Min:    {times.min():.2f} ms")
    print(f"Max:    {times.max():.2f} ms")
    print(f"P95:    {np.percentile(times, 95):.2f} ms")
    print(f"P99:    {np.percentile(times, 99):.2f} ms")

    return times
```

**Warmup rules:**
- Minimum 3 iterations, recommend 5-10
- More complex models need more warmup
- Dynamic control flow needs extra warmup
- Report statistics (mean, std, percentiles), not just average

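PyTorch also ships `torch.utils.benchmark.Timer`, which handles warmup and CUDA synchronization internally; a minimal sketch (assumes `model` and `sample_input` are defined):

```python
import torch.utils.benchmark as benchmark

timer = benchmark.Timer(
    stmt="model(x)",
    globals={"model": model, "x": sample_input},
)
print(timer.timeit(100))  # runs the statement 100 times, prints timing stats
```
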
## Bottleneck Identification Patterns

### CPU-Bound Bottlenecks

**Symptoms:**
- Low GPU utilization (<70%)
- High CPU usage
- Data loading time > computation time
- `nvidia-smi` shows low GPU usage

**Diagnostic code:**

```python
import os
import subprocess

def diagnose_cpu_bottleneck(model, dataloader):
    """Check if training is CPU-bound"""

    # Check GPU utilization
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    gpu_util = int(result.stdout.strip())

    print(f"GPU Utilization: {gpu_util}%")

    if gpu_util < 70:
        print("⚠️ LOW GPU UTILIZATION - likely CPU-bound")

        # Profile data loading vs compute
        data_time, compute_time = profile_dataloader_vs_model(model, dataloader)

        if data_time > compute_time:
            print("\n🎯 BOTTLENECK: Data loading")
            print("Solutions:")
            print("  1. Increase num_workers in DataLoader")
            print(f"     Current: {dataloader.num_workers}, try: {os.cpu_count()}")
            print("  2. Enable pin_memory=True")
            print("  3. Move data augmentation to GPU (use kornia)")
            print("  4. Cache preprocessed data if dataset is small")
            print("  5. Use faster storage (SSD instead of HDD)")

        else:
            print("\n🎯 BOTTLENECK: CPU preprocessing")
            print("Solutions:")
            print("  1. Move preprocessing to GPU")
            print("  2. Reduce preprocessing complexity")
            print("  3. Batch preprocessing operations")

    else:
        print("✅ GPU utilization is healthy")
        print("   Bottleneck is likely in GPU computation")

    return gpu_util
```

**Common solutions:**

```python
import torch.nn as nn
from torch.utils.data import DataLoader
import kornia

# Solution 1: Increase num_workers
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # Increase from default 0
    pin_memory=True,          # Enable for faster GPU transfer
    persistent_workers=True   # Keep workers alive between epochs
)

# Solution 2: Move augmentation to GPU
class GPUAugmentation(nn.Module):
    def __init__(self):
        super().__init__()
        self.augment = nn.Sequential(
            kornia.augmentation.RandomHorizontalFlip(p=0.5),
            kornia.augmentation.ColorJitter(0.2, 0.2, 0.2, 0.1),
            kornia.augmentation.RandomResizedCrop((224, 224)),
        )

    def forward(self, x):
        return self.augment(x)

# Apply on GPU
gpu_augment = GPUAugmentation().cuda()
for data, target in dataloader:
    data = data.cuda()
    data = gpu_augment(data)  # Augment on GPU
    output = model(data)
```

### GPU-Bound Bottlenecks

**Symptoms:**
- High GPU utilization (>90%)
- Computation time > data loading time
- High CUDA time in profiler

**Diagnostic code:**

```python
def diagnose_gpu_bottleneck(model, sample_input, target):
    """Identify GPU bottleneck operations"""

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        output = model(sample_input)
        loss = criterion(output, target)
        loss.backward()

    # Find top GPU operations
    events = prof.key_averages()
    cuda_events = [(evt.key, evt.cuda_time_total) for evt in events
                   if evt.cuda_time_total > 0]
    cuda_events.sort(key=lambda x: x[1], reverse=True)

    print("Top 10 GPU operations:")
    total_time = sum(time for _, time in cuda_events)
    for i, (name, time) in enumerate(cuda_events[:10], 1):
        percentage = (time / total_time) * 100
        print(f"{i:2d}. {name:40s} {time/1000:8.2f} ms ({percentage:5.1f}%)")

    # Check for optimization opportunities
    top_op = cuda_events[0][0]
    if 'conv' in top_op.lower():
        print("\n🎯 Bottleneck: Convolution operations")
        print("Solutions:")
        print("  1. Check input dimensions for Tensor Core alignment (multiples of 8)")
        print("  2. Use mixed precision (torch.cuda.amp)")
        print("  3. Consider depthwise separable convolutions")
        print("  4. Profile with different batch sizes")

    elif 'mm' in top_op.lower() or 'matmul' in top_op.lower():
        print("\n🎯 Bottleneck: Matrix multiplication")
        print("Solutions:")
        print("  1. Ensure dimensions are multiples of 8 (FP16) or 16 (BF16)")
        print("  2. Use mixed precision")
        print("  3. Check for unnecessary transposes")

    elif 'copy' in top_op.lower():
        print("\n🎯 Bottleneck: Memory copies")
        print("Solutions:")
        print("  1. Check device placement (CPU ↔ GPU transfers)")
        print("  2. Ensure tensors are contiguous")
        print("  3. Reduce explicit .cuda() or .cpu() calls")

    return cuda_events
```

**Common solutions:**

```python
# Solution 1: Mixed precision
from torch.cuda.amp import autocast

with autocast():
    output = model(data)
    loss = criterion(output, target)
# 2-3x speedup for large models

# Solution 2: Tensor Core alignment
# Ensure dimensions are multiples of 8 (FP16) or 16 (BF16)
# BAD:  (batch=31, seq_len=127, hidden=509)
# GOOD: (batch=32, seq_len=128, hidden=512)

# Solution 3: torch.compile (PyTorch 2.0+)
model = torch.compile(model)
# Automatic kernel fusion and optimization
```

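To make the alignment check in Solution 2 mechanical, here is a small hypothetical helper (name and rule are illustrative; FP16 Tensor Cores want multiples of 8) that flags offending layers:

```python
import torch.nn as nn

def check_tensor_core_alignment(model, multiple=8):
    """Warn about Linear/Conv2d layers whose channel dims break alignment."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            dims = (module.in_features, module.out_features)
        elif isinstance(module, nn.Conv2d):
            dims = (module.in_channels, module.out_channels)
        else:
            continue
        if any(d % multiple != 0 for d in dims):
            print(f"⚠️ {name}: dims {dims} are not multiples of {multiple} - "
                  f"may miss Tensor Cores")
```
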
### Memory-Bound Bottlenecks

**Symptoms:**
- Low GPU utilization despite high memory usage
- Large tensor operations dominating time
- Memory bandwidth saturated

**Diagnostic code:**

```python
def diagnose_memory_bottleneck(model, sample_input):
    """Check if operations are memory-bandwidth limited"""

    # Profile memory and compute
    with profile(
        activities=[ProfilerActivity.CUDA],
        profile_memory=True
    ) as prof:
        output = model(sample_input)

    # Analyze operations
    for evt in prof.key_averages():
        if evt.cuda_time_total > 0 and evt.self_cuda_memory_usage > 0:
            # Rough achieved-bandwidth estimate:
            # memory-bound ops run close to the GPU's peak bandwidth,
            # compute-bound ops achieve high FLOP/s instead

            memory_gb = evt.self_cuda_memory_usage / 1e9
            time_s = evt.cuda_time_total / 1e6  # µs to s

            if memory_gb > 1.0 and time_s > 0.01:
                bandwidth = memory_gb / time_s  # GB/s
                print(f"{evt.key:40s}: {bandwidth:.1f} GB/s")

    print("\nIf achieved bandwidth approaches your GPU's peak "
          "(e.g., ~900 GB/s on V100), the op is memory-bound")
    print("Solutions:")
    print("  1. Reduce intermediate tensor sizes")
    print("  2. Use in-place operations where safe")
    print("  3. Tile large operations")
    print("  4. Increase arithmetic intensity (more compute per byte)")
```

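As a tiny illustration of the in-place suggestion above (safe only when the original value is not needed again, e.g. not saved for autograd):

```python
x = torch.relu(x)  # out-of-place: allocates a new tensor for the result
x.relu_()          # in-place: reuses x's storage, no new allocation
```
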
### I/O-Bound Bottlenecks

**Symptoms:**
- Low CPU and GPU utilization
- Long pauses between batches
- Slow disk I/O

**Solutions:**

```python
from torch.utils.data import Dataset, DataLoader

# Solution 1: Cache dataset in RAM
class CachedDataset(Dataset):
    def __init__(self, dataset):
        self.cache = [dataset[i] for i in range(len(dataset))]

    def __getitem__(self, idx):
        return self.cache[idx]

    def __len__(self):
        return len(self.cache)

# Solution 2: Use SSD storage or a RAM disk

# Solution 3: Prefetch data
dataloader = DataLoader(
    dataset,
    num_workers=8,
    prefetch_factor=4,  # Prefetch 4 batches per worker
    pin_memory=True
)
```

## Common Profiling Mistakes

### Mistake 1: Profiling Too Many Iterations

**❌ WRONG:**

```python
# Profiling 100 epochs - output is massive, unusable
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for epoch in range(100):
        for batch in dataloader:
            # ... training ...
            pass
```

**✅ CORRECT:**

```python
# Profile just a few iterations
with profile(
    activities=[ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3, repeat=1)
) as prof:
    for epoch in range(1):
        for step, batch in enumerate(dataloader):
            if step >= 10:
                break
            # ... training ...
            prof.step()
```

### Mistake 2: No Warmup Before Timing

**❌ WRONG:**

```python
# Including JIT compilation in timing
start = time.time()
output = model(data)  # First call - includes JIT overhead
end = time.time()
```

**✅ CORRECT:**

```python
# Warmup first
for _ in range(5):
    _ = model(data)

torch.cuda.synchronize()

# Now measure
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(data)
end.record()
torch.cuda.synchronize()
```

### Mistake 3: Synchronization Overhead

**❌ WRONG:**

```python
# Synchronizing in training loop - kills performance!
for batch in dataloader:
    output = model(batch)
    torch.cuda.synchronize()  # ❌ Breaks pipelining!
    loss = criterion(output, target)
    torch.cuda.synchronize()  # ❌ Unnecessary!
    loss.backward()
```

**✅ CORRECT:**

```python
# Only synchronize for timing/profiling, not in production
for batch in dataloader:
    output = model(batch)
    loss = criterion(output, target)
    loss.backward()
    # No synchronization - let the GPU pipeline work
```

**When to synchronize:**
- Profiling/timing measurements
- Before memory measurements
- Debugging CUDA errors
- NEVER in production training loop

### Mistake 4: Wrong Profiling Granularity

**❌ WRONG:**

```python
# Profiling entire model - too coarse, can't identify bottleneck
with profile() as prof:
    output = model(data)
# "Model takes 100ms" - not actionable!
```

**✅ CORRECT:**

```python
# Iterative narrowing:
# 1. Profile whole step
# 2. Identify slow phase (forward, backward, optimizer)
# 3. Profile that phase in detail
# 4. Identify specific operation

# Phase 1: Coarse
with record_function("forward"):
    output = model(data)
with record_function("backward"):
    loss.backward()

# Phase 2: Found forward is slow, profile in detail
with profile() as prof:
    output = model(data)
# Now see which layer is slow

# Phase 3: Found layer X is slow, profile that layer
with profile() as prof:
    output = model.layer_x(data)
# Now see which operation in layer X is slow
```

### Mistake 5: Ignoring Memory While Profiling Compute

**❌ WRONG:**

```python
# Only looking at time, ignoring memory
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    output = model(data)
```

**✅ CORRECT:**

```python
# Profile both compute AND memory
with profile(
    activities=[ProfilerActivity.CUDA],
    profile_memory=True
) as prof:
    output = model(data)

# Check both time and memory
print(prof.key_averages().table(sort_by="cuda_time_total"))
print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))
```

### Mistake 6: Profiling in Wrong Mode

**❌ WRONG:**

```python
# Profiling in eval mode when you care about training speed
model.eval()
with torch.no_grad():
    with profile() as prof:
        output = model(data)
# ❌ This doesn't include the backward pass!
```

**✅ CORRECT:**

```python
# Profile in the mode you actually use
model.train()
with profile() as prof:
    output = model(data)
    loss = criterion(output, target)
    loss.backward()  # ✅ Include backward if profiling training
```

## Red Flags - Stop and Profile Systematically

**If you catch yourself thinking ANY of these, STOP and follow the methodology:**

| Red Flag Thought | Reality | What to Do Instead |
|------------------|---------|-------------------|
| "I can see the bottleneck" | 90% of the time your guess is wrong | Profile to confirm, don't guess |
| "User says X is slow, so X is the bottleneck" | User might be wrong about the cause | Verify with profiling |
| "This loop looks inefficient" | Intuition about performance is often wrong | Measure it, don't assume |
| "Profiling takes too long" | Profiling saves hours of guessing | 10 minutes of profiling > hours of guessing |
| "Let me just try this optimization" | Premature optimization wastes time | Measure first, optimize second |
| "It's obviously a GPU problem" | Could be CPU, data loading, or I/O | Check GPU utilization first |
| "I'll reduce batch size" | Doesn't address root cause | Diagnose memory bottleneck first |
| "Skip warmup, it's just one iteration" | First iterations have 10-100x overhead | Always warmup, no exceptions |

**Critical rules:**
1. NEVER optimize before profiling
2. ALWAYS use warmup iterations
3. ALWAYS check GPU utilization before assuming a GPU bottleneck
4. ALWAYS profile data loading separately from computation
5. ALWAYS report statistics (mean, std, percentiles), not just average
6. ALWAYS use CUDA Events for GPU timing, never `time.time()`

## Common Rationalizations (Don't Do These)

| Excuse | What Really Happens | Correct Approach |
|--------|-------------------|------------------|
| "User seems rushed, skip profiling" | Guessing wastes MORE time than profiling | 10 min profiling saves hours |
| "I already profiled once" | Might have used wrong tool or granularity | Re-profile with systematic methodology |
| "Profiling overhead will skew results" | Use schedule to minimize overhead | `schedule(wait=1, warmup=2, active=3)` |
| "This worked on another model" | Different models have different bottlenecks | Profile THIS model, not assumptions |
| "Documentation says X is slow" | Depends on context, hardware, data | Verify with profiling on YOUR setup |
| "Just trust the profiler output" | Must interpret correctly | Understand what metrics mean |
| "The model is the bottleneck" | Often it's data loading | Always check data loading vs compute |

## Complete Profiling Example

```python
import subprocess
import time

import numpy as np
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, schedule
from torch.utils.data import DataLoader

def comprehensive_profiling(model, dataloader, device='cuda'):
    """
    Complete profiling workflow following the systematic methodology.
    Assumes `criterion` and `optimizer` are defined for `model`.
    """

    print("=" * 80)
    print("PHASE 1: ESTABLISH BASELINE")
    print("=" * 80)

    # Step 1: Measure baseline performance
    model = model.to(device)
    model.train()

    # Warmup
    print("\nWarming up (5 iterations)...")
    for i, (data, target) in enumerate(dataloader):
        if i >= 5:
            break
        data, target = data.to(device), target.to(device)
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    torch.cuda.synchronize()

    # Measure baseline
    print("\nMeasuring baseline (10 iterations)...")
    times = []
    for i, (data, target) in enumerate(dataloader):
        if i >= 10:
            break

        data, target = data.to(device), target.to(device)

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        end.record()

        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    times = np.array(times)
    print(f"\nBaseline Performance:")
    print(f"  Mean:   {times.mean():.2f} ms/iteration")
    print(f"  Std:    {times.std():.2f} ms")
    print(f"  Median: {np.median(times):.2f} ms")
    print(f"  P95:    {np.percentile(times, 95):.2f} ms")

    # -------------------------------------------------------------------------
    print("\n" + "=" * 80)
    print("PHASE 2: IDENTIFY BOTTLENECK TYPE")
    print("=" * 80)

    # Check GPU utilization
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    gpu_util, mem_used = result.stdout.strip().split(',')
    print(f"\nGPU Utilization: {gpu_util}%")
    print(f"GPU Memory Used: {mem_used} MB")

    if int(gpu_util) < 70:
        print("⚠️ LOW GPU UTILIZATION - likely CPU-bound")
    else:
        print("✅ GPU utilization healthy - likely GPU-bound")

    # Profile data loading vs computation
    print("\nProfiling data loading vs computation...")
    data_times = []
    compute_times = []

    batch_iter = iter(dataloader)
    for i in range(20):
        # Data loading time
        data_start = time.time()
        data, target = next(batch_iter)
        data_time = time.time() - data_start
        data_times.append(data_time * 1000)  # to ms

        # Computation time
        data, target = data.to(device), target.to(device)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        end.record()

        torch.cuda.synchronize()
        compute_times.append(start.elapsed_time(end))

    avg_data = np.mean(data_times)
    avg_compute = np.mean(compute_times)

    print(f"\nData loading: {avg_data:.2f} ms")
    print(f"Computation:  {avg_compute:.2f} ms")

    if avg_data > avg_compute:
        print("🎯 BOTTLENECK: Data loading (CPU-bound)")
    else:
        print("🎯 BOTTLENECK: Model computation (GPU-bound)")

    # -------------------------------------------------------------------------
    print("\n" + "=" * 80)
    print("PHASE 3: NARROW TO COMPONENT")
    print("=" * 80)

    # Profile training phases
    print("\nProfiling training phases...")

    data, target = next(iter(dataloader))
    data, target = data.to(device), target.to(device)

    # Forward
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    output = model(data)
    loss = criterion(output, target)
    end.record()
    torch.cuda.synchronize()
    forward_time = start.elapsed_time(end)

    # Backward
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    loss.backward()
    end.record()
    torch.cuda.synchronize()
    backward_time = start.elapsed_time(end)

    # Optimizer
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    optimizer.step()
    optimizer.zero_grad()
    end.record()
    torch.cuda.synchronize()
    optimizer_time = start.elapsed_time(end)

    total = forward_time + backward_time + optimizer_time

    print(f"\nPhase breakdown:")
    print(f"  Forward:   {forward_time:7.2f} ms ({forward_time/total*100:5.1f}%)")
    print(f"  Backward:  {backward_time:7.2f} ms ({backward_time/total*100:5.1f}%)")
    print(f"  Optimizer: {optimizer_time:7.2f} ms ({optimizer_time/total*100:5.1f}%)")

    # -------------------------------------------------------------------------
    print("\n" + "=" * 80)
    print("PHASE 4: IDENTIFY OPERATION")
    print("=" * 80)

    # Detailed profiling
    print("\nRunning detailed profiler...")

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        for step, (data, target) in enumerate(dataloader):
            if step >= 10:
                break

            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()

            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            prof.step()

    # Print summary
    print("\nTop operations by CUDA time:")
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=15,
        max_src_column_width=60
    ))

    print("\nTop operations by memory:")
    print(prof.key_averages().table(
        sort_by="self_cuda_memory_usage",
        row_limit=10,
        max_src_column_width=60
    ))

    prof.export_chrome_trace("detailed_trace.json")

    print("\n" + "=" * 80)
    print("PROFILING COMPLETE")
    print("=" * 80)
    print("\nNext steps:")
    print("  1. Review top operations in table above")
    print("  2. Open chrome://tracing and load detailed_trace.json")
    print("  3. Or view in TensorBoard: tensorboard --logdir=./profiler_logs")
    print("  4. Focus optimization on identified bottleneck")
    print("  5. Re-run this profiling after optimization to verify improvement")

    return {
        'baseline_ms': times.mean(),
        'gpu_utilization': int(gpu_util),
        'data_loading_ms': avg_data,
        'computation_ms': avg_compute,
        'forward_ms': forward_time,
        'backward_ms': backward_time,
        'optimizer_ms': optimizer_time,
    }
```

## Memory Profiling Complete Example

```python
def profile_memory_usage(model, sample_batch, sample_target):
    """Comprehensive memory profiling"""

    print("=" * 80)
    print("MEMORY PROFILING")
    print("=" * 80)

    device = next(model.parameters()).device

    # Reset memory stats
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.empty_cache()

    def print_memory(stage):
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        peak = torch.cuda.max_memory_allocated() / 1e9
        print(f"{stage:30s} | Alloc: {allocated:5.2f} GB | "
              f"Reserved: {reserved:5.2f} GB | Peak: {peak:5.2f} GB")

    print("\nMemory tracking:")
    print("-" * 80)

    print_memory("Initial")

    # Move data to GPU
    data = sample_batch.to(device)
    target = sample_target.to(device)
    print_memory("After data to GPU")

    # Forward pass
    output = model(data)
    print_memory("After forward")

    # Loss
    loss = criterion(output, target)
    print_memory("After loss")

    # Backward
    loss.backward()
    print_memory("After backward")

    # Optimizer
    optimizer.step()
    print_memory("After optimizer step")

    # Zero grad
    optimizer.zero_grad()
    print_memory("After zero_grad")

    # Final peak
    peak_memory = torch.cuda.max_memory_allocated() / 1e9
    print("-" * 80)
    print(f"\nPeak memory usage: {peak_memory:.2f} GB")

    # Detailed summary
    print("\n" + "=" * 80)
    print("DETAILED MEMORY SUMMARY")
    print("=" * 80)
    print(torch.cuda.memory_summary())

    # Memory breakdown
    print("\n" + "=" * 80)
    print("MEMORY OPTIMIZATION SUGGESTIONS")
    print("=" * 80)

    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9

    if reserved > allocated * 1.5:
        print("⚠️ Memory fragmentation detected")
        print(f"   Reserved: {reserved:.2f} GB, Allocated: {allocated:.2f} GB")
        print("   Suggestion: Call torch.cuda.empty_cache() periodically")

    # Estimate memory components
    param_memory = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9
    print(f"\nModel parameters:        {param_memory:.2f} GB")

    # Gradients are the same size as the parameters
    print(f"Gradients (estimate):    {param_memory:.2f} GB")

    # Optimizer states (Adam keeps two extra buffers per parameter)
    optimizer_memory = 0.0
    if isinstance(optimizer, torch.optim.Adam):
        optimizer_memory = param_memory * 2
        print(f"Optimizer states (Adam): {optimizer_memory:.2f} GB")

    # Activations (peak - parameters - gradients - optimizer states)
    activation_memory = peak_memory - 2 * param_memory - optimizer_memory
    if activation_memory > 0:
        print(f"Activations (estimate):  {activation_memory:.2f} GB")

        if activation_memory > peak_memory * 0.5:
            print("\n🎯 Activations dominate memory usage")
            print("   Suggestions:")
            print("   1. Use gradient checkpointing")
            print("   2. Reduce batch size")
            print("   3. Use mixed precision (FP16/BF16)")

    return {
        'peak_gb': peak_memory,
        'parameters_gb': param_memory,
        'fragmentation_ratio': reserved / allocated if allocated > 0 else 1.0
    }
```

## Profiling Checklist

Before claiming you've profiled the code, verify:

- [ ] **Baseline established**
  - [ ] Defined performance metric (throughput/latency/memory)
  - [ ] Measured with CUDA Events (not time.time())
  - [ ] Used 5+ warmup iterations
  - [ ] Reported statistics (mean, std, percentiles)
  - [ ] Documented measurement conditions

- [ ] **Bottleneck type identified**
  - [ ] Checked GPU utilization (nvidia-smi)
  - [ ] Profiled data loading vs computation separately
  - [ ] Categorized as CPU-bound, GPU-bound, memory-bound, or I/O-bound
  - [ ] Verified category with profiling data (not guessing)

- [ ] **Component identified**
  - [ ] Profiled training phases (forward/backward/optimizer)
  - [ ] Identified which phase is slowest
  - [ ] Used iterative narrowing approach
  - [ ] Examined both table and trace view

- [ ] **Operation identified**
  - [ ] Profiled bottleneck component in detail
  - [ ] Found specific operation or pattern
  - [ ] Understand WHY it's slow (not just WHAT is slow)
  - [ ] Have actionable optimization target

- [ ] **Verification ready**
  - [ ] Saved baseline measurements
  - [ ] Know how to re-run profiling after optimization
  - [ ] Can verify if optimization actually helped
  - [ ] Have profiling artifacts (traces, summaries)

## References

**PyTorch Profiling Documentation:**
- torch.profiler: https://pytorch.org/docs/stable/profiler.html
- Profiling recipe: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- Performance tuning guide: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

**Related Skills:**
- tensor-operations-and-memory (memory leak debugging, operation optimization)
- mixed-precision-and-optimization (AMP profiling, Tensor Core utilization)
- distributed-training-strategies (multi-GPU profiling)

**Tools:**
- Chrome tracing: chrome://tracing
- TensorBoard profiler: `tensorboard --logdir=<path>`
- NVIDIA Nsight Systems: `nsys profile python train.py`
- PyTorch Memory Visualizer: `python -m torch.cuda._memory_viz`