# Debugging and Profiling
## Overview
**Core Principle:** Profile before optimizing. Humans are terrible at guessing where code is slow. Always measure before making changes.
Python debugging and profiling enables systematic problem diagnosis and performance optimization. Use debugpy/pdb for step-through debugging, cProfile for CPU profiling, memory_profiler for memory analysis. The biggest mistake: optimizing code without profiling first—you'll likely optimize the wrong thing.
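For a quick first measurement, the standard library's profiler can be run straight from the command line before touching any code (`script.py` below is a placeholder for your entry point):
```bash
# Profile the whole script and print functions sorted by cumulative time
python -m cProfile -s cumulative script.py
```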
## When to Use
**Use this skill when:**
- "Code is slow"
- "How to profile Python?"
- "Memory leak"
- "Debugging not working"
- "Find bottleneck"
- "Optimize performance"
- "Step through code"
- "Where is my code spending time?"
**Don't use when:**
- Setting up project (use project-structure-and-tooling)
- Already know what to optimize (but still profile to verify!)
- Algorithm selection (different skill domain)
**Symptoms triggering this skill:**
- Code runs slower than expected
- Memory usage growing over time
- Need to understand execution flow
- Performance degraded after changes
## Debugging Fundamentals
### Using debugpy with VS Code
```python
# ✅ CORRECT: debugpy for remote debugging
import debugpy
# Allow VS Code to attach
debugpy.listen(5678)
print("Waiting for debugger to attach...")
debugpy.wait_for_client()
# Your code here
def process_data(data):
    result = []
    for item in data:
        # Set breakpoint in VS Code on this line
        transformed = transform(item)
        result.append(transformed)
    return result

# VS Code launch.json configuration:
"""
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Attach",
            "type": "python",
            "request": "attach",
            "connect": {
                "host": "localhost",
                "port": 5678
            }
        }
    ]
}
"""
```
### Using pdb (Python Debugger)
```python
# ✅ CORRECT: pdb for interactive debugging
import pdb
def buggy_function(data):
    result = []
    for i, item in enumerate(data):
        # Drop into debugger
        pdb.set_trace()  # Or: breakpoint() in Python 3.7+
        processed = item * 2
        result.append(processed)
    return result
# pdb commands:
# n (next): Execute next line
# s (step): Step into function
# c (continue): Continue execution
# p variable: Print variable
# pp variable: Pretty print variable
# l (list): Show current location in code
# w (where): Show stack trace
# q (quit): Quit debugger
```
### Conditional Breakpoints
```python
# ❌ WRONG: Breaking on every iteration
def process_items(items):
    for item in items:
        pdb.set_trace()  # Breaks 10000 times!
        process(item)
# ✅ CORRECT: Conditional breakpoint
def process_items(items):
    for i, item in enumerate(items):
        if i == 5000:  # Only break on specific iteration
            breakpoint()
        process(item)
# ✅ BETTER: Break only when the data is problematic
def process_items(items):
    for item in items:
        if item.value < 0:  # Break only when problematic
            breakpoint()
        process(item)
```
### Post-Mortem Debugging
```python
# ✅ CORRECT: Debug after exception
import pdb
def main():
    try:
        # Code that might raise exception
        result = risky_operation()
    except Exception:
        # Drop into debugger at exception point
        pdb.post_mortem()
# ✅ CORRECT: Auto post-mortem for unhandled exceptions
import sys
def custom_excepthook(exc_type, exc_value, exc_traceback):
    pdb.post_mortem(exc_traceback)
sys.excepthook = custom_excepthook
# Now unhandled exceptions drop into pdb automatically
```
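The same post-mortem behaviour is available from the shell without editing the code; a minimal sketch (`script.py` is a placeholder):
```bash
# Run under pdb; "-c continue" executes normally until an unhandled
# exception, then drops into the post-mortem debugger at the crash site
python -m pdb -c continue script.py
```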
**Why this matters**: Breakpoints let you inspect state at exact point of failure. Conditional breakpoints avoid noise. Post-mortem debugging examines crashes.
## CPU Profiling
### cProfile for Function-Level Profiling
```python
import cProfile
import pstats
# ❌ WRONG: Guessing which function is slow
def slow_program():
    # "I think this loop is the problem..."
    for i in range(1000):
        process_data(i)
# ✅ CORRECT: Profile to find actual bottleneck
def slow_program():
    for i in range(1000):
        process_data(i)
# Profile the function
cProfile.run('slow_program()', 'profile_stats')
# Analyze results
stats = pstats.Stats('profile_stats')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions by cumulative time
# ✅ CORRECT: Profile with context manager
from contextlib import contextmanager
@contextmanager
def profiled():
    pr = cProfile.Profile()
    pr.enable()
    try:
        yield
    finally:
        pr.disable()
        stats = pstats.Stats(pr)
        stats.strip_dirs()
        stats.sort_stats('cumulative')
        stats.print_stats(20)
# Usage
with profiled():
    slow_program()
```
### Profiling Specific Code Blocks
```python
# ✅ CORRECT: Profile specific section
import cProfile
pr = cProfile.Profile()
# Normal code
setup_data()
# Profile this section
pr.enable()
expensive_operation()
pr.disable()
# More normal code
cleanup()
# View results
pr.print_stats(sort='cumulative')
```
### Line-Level Profiling with line_profiler
```python
# Install: pip install line_profiler
# ✅ CORRECT: Line-by-line profiling
from line_profiler import LineProfiler
@profile  # kernprof injects @profile; remove it when profiling programmatically
def slow_function():
    total = 0
    for i in range(10000):
        total += i ** 2
    return total
# Run with kernprof:
# kernprof -l -v script.py
# Or programmatically (without the @profile decorator):
lp = LineProfiler()
lp.add_function(slow_function)
lp.enable()
slow_function()
lp.disable()
lp.print_stats()
# Output shows time spent per line:
# Line #      Hits         Time  Per Hit   % Time  Line Contents
# ==============================================================
#      1                                           def slow_function():
#      2         1          2.0      2.0      0.0      total = 0
#      3     10001      15234.0      1.5     20.0      for i in range(10000):
#      4     10000      60123.0      6.0     80.0          total += i ** 2
#      5         1          1.0      1.0      0.0      return total
```
**Why this matters**: cProfile shows which functions are slow. line_profiler shows which lines within functions. Both essential for optimization.
### Visualizing Profiles with SnakeViz
```bash
# Install: pip install snakeviz
# Profile code
python -m cProfile -o program.prof script.py
# Visualize
snakeviz program.prof
# Opens browser with interactive visualization:
# - Sunburst chart showing call hierarchy
# - Icicle chart showing time distribution
# - Click functions to zoom in
```
## Memory Profiling
### Memory Usage with memory_profiler
```python
# Install: pip install memory_profiler
from memory_profiler import profile
# ✅ CORRECT: Track memory usage per line
@profile
def memory_hungry_function():
    # Line-by-line memory usage shown
    big_list = [i for i in range(1000000)]        # Allocates ~40MB
    big_dict = {i: i**2 for i in range(1000000)}  # Another ~40MB
    return len(big_list), len(big_dict)
# Run with:
# python -m memory_profiler script.py
# Output:
# Line #    Mem usage    Increment   Line Contents
# ================================================
#      3     38.3 MiB     38.3 MiB   @profile
#      4                             def memory_hungry_function():
#      5     45.2 MiB      6.9 MiB       big_list = [i for i in range(1000000)]
#      6     83.1 MiB     37.9 MiB       big_dict = {i: i**2 for i in range(1000000)}
#      7     83.1 MiB      0.0 MiB       return len(big_list), len(big_dict)
```
### Finding Memory Leaks
```python
# ✅ CORRECT: Detect memory leaks with tracemalloc
import tracemalloc
# Start tracing
tracemalloc.start()
# Take snapshot before
snapshot1 = tracemalloc.take_snapshot()
# Run code that might leak
problematic_function()
# Take snapshot after
snapshot2 = tracemalloc.take_snapshot()
# Compare snapshots
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("Top 10 memory increases:")
for stat in top_stats[:10]:
print(stat)
tracemalloc.stop()
# ✅ CORRECT: Track specific objects
import gc
import sys
def find_memory_leak():
    # Force garbage collection
    gc.collect()
    # Track objects before
    before = len(gc.get_objects())
    # Run potentially leaky code
    for _ in range(100):
        leaky_operation()
    # Force GC again
    gc.collect()
    # Track objects after
    after = len(gc.get_objects())
    if after > before:
        print(f"Potential leak: {after - before} objects not collected")
    # Find what's keeping objects alive
    for obj in gc.get_objects():
        if isinstance(obj, MyClass):  # Suspect class
            print(f"Found {type(obj)}: {sys.getrefcount(obj)} references")
            print(gc.get_referrers(obj))
```
### Profiling Memory with objgraph
```python
# Install: pip install objgraph
import objgraph
# ✅ CORRECT: Find most common objects
def analyze_memory():
    objgraph.show_most_common_types()
# Output:
# dict        12453
# function     8234
# list         6789
# ...
# ✅ CORRECT: Track object growth
objgraph.show_growth()
potentially_leaky_function()
objgraph.show_growth() # Shows objects that increased
# ✅ CORRECT: Visualize object references
objgraph.show_refs([my_object], filename='refs.png')
# Creates graph showing what references my_object
```
**Why this matters**: Memory leaks cause gradual performance degradation. tracemalloc and memory_profiler help find exactly where memory is allocated.
## Profiling Async Code
### Profiling Async Functions
```python
import asyncio
import cProfile
import pstats
# ❌ WRONG: cProfile doesn't work well with async
async def slow_async():
    await asyncio.sleep(1)
    await process_data()
cProfile.run('asyncio.run(slow_async())')  # Misleading results
# ✅ CORRECT: Use yappi for async profiling
# Install: pip install yappi
import yappi
async def slow_async():
    await asyncio.sleep(1)
    await process_data()
yappi.set_clock_type("wall") # Use wall time, not CPU time
yappi.start()
asyncio.run(slow_async())
yappi.stop()
# Print stats
stats = yappi.get_func_stats()
stats.sort("totaltime", "desc")
stats.print_all()
# ✅ CORRECT: Profile coroutines specifically
stats = yappi.get_func_stats(filter_callback=lambda x: 'coroutine' in x.name)
stats.print_all()
```
### Detecting Blocking Code in Async
```python
# ✅ CORRECT: Detect event loop blocking
import asyncio
import time
class LoopMonitor:
    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold

    async def monitor(self):
        while True:
            start = time.monotonic()
            await asyncio.sleep(0.01)  # Very short sleep
            elapsed = time.monotonic() - start
            if elapsed > self.threshold:
                print(f"WARNING: Event loop blocked for {elapsed:.3f}s")

async def main():
    # Start monitor
    monitor = LoopMonitor(threshold=0.1)
    monitor_task = asyncio.create_task(monitor.monitor())
    # Run your async code
    await your_async_function()
    monitor_task.cancel()
# ✅ CORRECT: Use asyncio debug mode
asyncio.run(main(), debug=True)
# Warns about slow callbacks (>100ms)
```
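The 100ms threshold that debug mode uses is exposed on the running loop as `slow_callback_duration`; a small sketch of tightening it (the awaited coroutine is illustrative):
```python
import asyncio

async def main():
    # Warn about steps longer than 50ms instead of the default 100ms
    asyncio.get_running_loop().slow_callback_duration = 0.05
    await your_async_function()

asyncio.run(main(), debug=True)
```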
## Performance Optimization Strategies
### Optimization Workflow
```python
# ✅ CORRECT: Systematic optimization approach
# 1. Profile to find bottleneck
import cProfile
import pstats
cProfile.run('main()', 'profile_stats')
# 2. Analyze results
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative')
stats.print_stats(10)  # Focus on top 10
# 3. Identify specific slow function
def slow_function(data):
    # Original implementation
    result = []
    for item in data:
        if is_valid(item):
            result.append(transform(item))
    return result
# 4. Create benchmark
import timeit
data = create_test_data(10000)
time_taken = timeit.timeit(
    lambda: slow_function(data),
    number=100
)
print(f"Average time: {time_taken / 100:.4f}s")  # Baseline: 0.1234s
# 5. Optimize
def optimized_function(data):
    # Use list comprehension (faster)
    return [transform(item) for item in data if is_valid(item)]
# 6. Benchmark again
time_taken = timeit.timeit(
    lambda: optimized_function(data),
    number=100
)
print(f"Average time: {time_taken / 100:.4f}s")  # 0.0789s - 36% faster!
# 7. Verify correctness
assert slow_function(data) == optimized_function(data)
# 8. Re-profile entire program to verify improvement
cProfile.run('main()', 'profile_stats_optimized')
```
**Why this matters**: Without profiling, you might optimize code that takes 1% of runtime, ignoring the 90% bottleneck. Always measure.
### Common Optimizations
```python
import re

# ❌ WRONG: Repeated expensive operations
def process_items(items):
    for item in items:
        # Regex compiled every iteration!
        pattern = re.compile(r'\d+')
        match = pattern.search(item)
# ✅ CORRECT: Move expensive operations outside loop
def process_items(items):
    pattern = re.compile(r'\d+')  # Compile once
    for item in items:
        match = pattern.search(item)
# ❌ WRONG: Growing list with repeated concatenation
def build_large_list():
    result = []
    for i in range(100000):
        result = result + [i]  # Creates new list each time! O(n²)
# ✅ CORRECT: Use append
def build_large_list():
    result = []
    for i in range(100000):
        result.append(i)  # O(n)
# ❌ WRONG: Checking membership in list
def filter_items(items, blacklist):
    return [item for item in items if item not in blacklist]
# O(n * m) if blacklist is a list
# ✅ CORRECT: Use set for membership checks
def filter_items(items, blacklist):
    blacklist_set = set(blacklist)  # O(m)
    return [item for item in items if item not in blacklist_set]
# O(n) for iteration + O(1) per lookup = O(n)
```
### Caching Results
```python
from functools import lru_cache
# ❌ WRONG: Recomputing expensive results
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
# O(2^n) - recalculates same values repeatedly
# ✅ CORRECT: Cache results
@lru_cache(maxsize=None)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
# O(n) - each value computed once
# ✅ CORRECT: Custom caching for unhashable arguments
import hashlib
from functools import wraps
def cache_dataframe_results(func):
    cache = {}
    @wraps(func)
    def wrapper(df):
        # Use hash of dataframe content as key
        key = hashlib.md5(df.to_csv(index=False).encode()).hexdigest()
        if key not in cache:
            cache[key] = func(df)
        return cache[key]
    return wrapper
@cache_dataframe_results
def expensive_dataframe_operation(df):
    # Complex computation
    return df.groupby('category').agg({'value': 'sum'})
```
## Systematic Diagnosis
### Performance Degradation Diagnosis
```python
# ✅ CORRECT: Diagnose performance regression
import cProfile
import pstats
def diagnose_slowdown():
    """Compare current vs baseline performance."""
    # Profile current code
    cProfile.run('main()', 'current_profile.prof')
    # Load baseline profile (from git history or previous run)
    # git show main:profile.prof > baseline_profile.prof
    current = pstats.Stats('current_profile.prof')
    baseline = pstats.Stats('baseline_profile.prof')
    print("=== CURRENT ===")
    current.sort_stats('cumulative')
    current.print_stats(10)
    print("\n=== BASELINE ===")
    baseline.sort_stats('cumulative')
    baseline.print_stats(10)
    # Look for functions that got slower
    # Compare cumulative times
```
### Memory Leak Diagnosis
```python
# ✅ CORRECT: Systematic memory leak detection
import tracemalloc
import gc
def diagnose_memory_leak():
    """Run function multiple times and check memory growth."""
    gc.collect()
    tracemalloc.start()
    # Baseline
    snapshot1 = tracemalloc.take_snapshot()
    # Run 100 times
    for _ in range(100):
        potentially_leaky_function()
    gc.collect()
    # Check memory
    snapshot2 = tracemalloc.take_snapshot()
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')
    print("Top 10 memory allocations:")
    for stat in top_stats[:10]:
        print(f"{stat.traceback}: +{stat.size_diff / 1024:.1f} KB")
    tracemalloc.stop()
```
### I/O vs CPU Bound Diagnosis
```python
# ✅ CORRECT: Determine if I/O or CPU bound
import time

def diagnose_bottleneck():
    """Determine if program is I/O or CPU bound."""
    # Measure wall-clock and CPU time around the same run
    start_wall = time.perf_counter()
    start_cpu = time.process_time()
    main()
    wall_time = time.perf_counter() - start_wall
    cpu_time = time.process_time() - start_cpu
    print(f"Wall time: {wall_time:.2f}s")
    print(f"CPU time: {cpu_time:.2f}s")
    if cpu_time / wall_time > 0.9:
        print("CPU bound - optimize computation")
        # Consider: Cython, NumPy, multiprocessing
    else:
        print("I/O bound - optimize I/O")
        # Consider: async/await, caching, batching
```
## Common Bottlenecks and Solutions
### String Concatenation
```python
# ❌ WRONG: String concatenation in loop
def build_string(items):
    result = ""
    for item in items:
        result += str(item) + "\n"  # Creates new string each time
    return result
# O(n²) time complexity
# ✅ CORRECT: Use join
def build_string(items):
    return "\n".join(str(item) for item in items)
# O(n) time complexity
# Benchmark:
# 1000 items: 0.0015s (join) vs 0.0234s (concatenation) - 15x faster
# 10000 items: 0.015s (join) vs 2.341s (concatenation) - 156x faster
```
### List Comprehension vs Map/Filter
```python
import timeit
# ✅ CORRECT: List comprehension (usually fastest)
def with_list_comp(data):
    return [x * 2 for x in data if x > 0]
# ✅ CORRECT: Generator (memory efficient for large data)
def with_generator(data):
    return (x * 2 for x in data if x > 0)
# Map/filter (sometimes faster for simple operations)
def with_map_filter(data):
    return map(lambda x: x * 2, filter(lambda x: x > 0, data))
# Benchmark
data = list(range(1000000))
print(timeit.timeit(lambda: list(with_list_comp(data)), number=10))
print(timeit.timeit(lambda: list(with_generator(data)), number=10))
print(timeit.timeit(lambda: list(with_map_filter(data)), number=10))
# Results: List comprehension usually fastest for complex logic
# Generator best when you don't need all results at once
```
### Dictionary Lookups vs List Searches
```python
# ❌ WRONG: Searching in list
def find_users_list(user_ids, all_users_list):
    results = []
    for user_id in user_ids:
        for user in all_users_list:  # O(n) per lookup
            if user['id'] == user_id:
                results.append(user)
                break
    return results
# O(n * m) time complexity
# ✅ CORRECT: Use dictionary
def find_users_dict(user_ids, all_users_dict):
    return [all_users_dict[uid] for uid in user_ids if uid in all_users_dict]
# O(n) time complexity
# Benchmark:
# 1000 lookups in 10000 items:
# List: 1.234s
# Dict: 0.001s - 1234x faster!
```
### DataFrame Iteration Anti-Pattern
```python
import pandas as pd

# ❌ WRONG: Iterating over DataFrame rows
def process_rows_iterrows(df):
    results = []
    for idx, row in df.iterrows():  # VERY SLOW
        if row['value'] > 0:
            results.append(row['value'] * 2)
    return results
# ✅ CORRECT: Vectorized operations
def process_rows_vectorized(df):
    mask = df['value'] > 0
    return (df.loc[mask, 'value'] * 2).tolist()
# Benchmark with 100,000 rows:
# iterrows: 15.234s
# vectorized: 0.015s - 1000x faster!
```
## Profiling Tools Comparison
### When to Use Which Tool
| Tool | Use Case | Output |
|------|----------|--------|
| cProfile | Function-level CPU profiling | Which functions take most time |
| line_profiler | Line-level CPU profiling | Which lines within function slow |
| memory_profiler | Line-level memory profiling | Memory usage per line |
| tracemalloc | Memory allocation tracking | Where memory allocated |
| yappi | Async/multithreaded profiling | Profile concurrent code |
| py-spy | Sampling profiler (no code changes) | Profile running processes |
| scalene | CPU+GPU+memory profiling | Comprehensive profiling |
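scalene is the one tool in the table without an example elsewhere in this skill; a minimal invocation sketch (output format varies by version):
```bash
# Install: pip install scalene
# Line-level CPU + memory (and GPU, where available) profiling
scalene script.py
# Or via the module entry point
python -m scalene script.py
```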
### py-spy for Production Profiling
```bash
# Install: pip install py-spy
# Profile running process (no code changes needed!)
py-spy record -o profile.svg --pid 12345
# Profile for 60 seconds
py-spy record -o profile.svg --duration 60 -- python script.py
# Top-like view of running process
py-spy top --pid 12345
# Why use py-spy:
# - No code changes needed
# - Minimal overhead
# - Can attach to running process
# - Great for production debugging
```
## Anti-Patterns
### Premature Optimization
```python
# ❌ WRONG: Optimizing before measuring
def process_data(data):
    # "Let me make this fast with complex caching..."
    # Spend hours optimizing a function that takes 0.1% of runtime
    ...
# ✅ CORRECT: Profile first
cProfile.run('main()', 'profile.prof')
# Oh, process_data only takes 0.1% of time
# The real bottleneck is database queries (90% of time)
# Optimize database queries instead!
```
### Micro-Optimizations
```python
# ❌ WRONG: Micro-optimizing at expense of readability
def calculate(x, y):
    # "Using bit shift instead of multiply by 2 for speed!"
    return (x << 1) + (y << 1)
# Saved: ~0.0000001 seconds per call
# Cost: Unreadable code
# ✅ CORRECT: Clear code first
def calculate(x, y):
    return 2 * x + 2 * y
# The difference is negligible in practice
# Only optimize if the profiler shows this is the bottleneck
```
### Not Benchmarking Changes
```python
# ❌ WRONG: Assuming optimization worked
def slow_function():
    # Original code
    pass
def optimized_function():
    # "Optimized" code
    pass
# Assume optimized_function is faster without measuring
# ✅ CORRECT: Benchmark before and after
import timeit
before = timeit.timeit(slow_function, number=1000)
after = timeit.timeit(optimized_function, number=1000)
print(f"Before: {before:.4f}s")
print(f"After: {after:.4f}s")
print(f"Speedup: {before/after:.2f}x")
# Verify correctness
assert slow_function() == optimized_function()
```
## Decision Trees
### What Tool to Use for Profiling?
```
What do I need to profile?
├─ CPU time
│ ├─ Function-level → cProfile
│ ├─ Line-level → line_profiler
│ └─ Async code → yappi
├─ Memory usage
│ ├─ Line-level → memory_profiler
│ ├─ Allocation tracking → tracemalloc
│ └─ Object types → objgraph
└─ Running process (no code changes) → py-spy
```
### Optimization Strategy
```
Is code slow?
├─ Yes → Profile to find bottleneck
│ ├─ CPU bound → Profile with cProfile
│ │ └─ Optimize hot functions (vectorize, cache, algorithms)
│ └─ I/O bound → Profile with timing
│ └─ Use async/await, caching, batching
└─ No → Don't optimize (focus on features/correctness)
```
### Memory Issue Diagnosis
```
Is memory usage high?
├─ Yes → Profile with memory_profiler
│ ├─ Growing over time → Memory leak
│ │ └─ Use tracemalloc to find leak
│ └─ High but stable → Large data structures
│ └─ Optimize data structures (generators, efficient types)
└─ No → Monitor but don't optimize yet
```
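To make the last branch concrete, a quick size comparison of a list versus a generator over the same data (sizes are approximate and platform-dependent):
```python
import sys

nums_list = [i * i for i in range(1_000_000)]  # Materializes every value
nums_gen = (i * i for i in range(1_000_000))   # Produces values on demand

print(sys.getsizeof(nums_list))  # ~8 MB just for the list's pointer array
print(sys.getsizeof(nums_gen))   # ~200 bytes regardless of range size
```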
## Integration with Other Skills
**After using this skill:**
- If I/O bound → See @async-patterns-and-concurrency for async optimization
- If data processing slow → See @scientific-computing-foundations for vectorization
- If need to track improvements → See @ml-engineering-workflows for metrics
**Before using this skill:**
- If unsure code is slow → Use this skill to profile and confirm!
- If setting up profiling → See @project-structure-and-tooling for dependencies
## Quick Reference
### Essential Profiling Commands
```python
# CPU profiling
import cProfile
cProfile.run('main()', 'profile.prof')
# View results
import pstats
stats = pstats.Stats('profile.prof')
stats.sort_stats('cumulative')
stats.print_stats(20)
# Memory profiling
import tracemalloc
tracemalloc.start()
# ... code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
```
### Debugging Commands
```python
# Set breakpoint
breakpoint() # Python 3.7+
# or
import pdb; pdb.set_trace()
# pdb commands:
# n - next line
# s - step into
# c - continue
# p var - print variable
# l - list code
# w - where am I
# q - quit
```
### Optimization Checklist
- [ ] Profile before optimizing (use cProfile)
- [ ] Identify bottleneck (top 20% of time)
- [ ] Create benchmark for bottleneck
- [ ] Optimize bottleneck
- [ ] Benchmark again to verify improvement
- [ ] Re-profile entire program
- [ ] Verify correctness (tests still pass)
### Common Optimizations
| Problem | Solution | Speedup |
|---------|----------|---------|
| String concatenation in loop | Use str.join() | 10-100x |
| List membership checks | Use set | 100-1000x |
| DataFrame iteration | Vectorize with NumPy/pandas | 100-1000x |
| Repeated expensive computation | Cache with @lru_cache | ∞ (depends on cache hits) |
| I/O bound | Use async/await | 10-100x |
| CPU bound with parallelizable work | Use multiprocessing | ~number of cores |
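
For the last row of the table, a minimal multiprocessing sketch (the workload function and inputs are illustrative):
```python
from multiprocessing import Pool

def cpu_heavy(n: int) -> int:
    # Stand-in for a CPU-bound computation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [200_000] * 8
    with Pool() as pool:  # One worker per CPU core by default
        results = pool.map(cpu_heavy, inputs)
    print(results[:3])
```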
### Red Flags
If you find yourself:
- Optimizing before profiling → STOP, profile first
- Spending hours on micro-optimizations → Check if it's bottleneck
- Making code unreadable for speed → Benchmark the benefit
- Assuming what's slow → Profile to verify
**Always measure. Never assume.**