gh-jeanluciano-quaestor-src…/skills/optimizing-performance/languages/RUST.md

# Rust Performance Optimization

**Load this file when:** Optimizing performance in Rust projects

## Profiling Tools

### Benchmarking with Criterion
```bash
# Add to Cargo.toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "my_benchmark"
harness = false

# Run benchmarks
cargo bench

# Compare against baseline
cargo bench --baseline master
```

### CPU Profiling
```bash
# perf (Linux)
cargo build --release
perf record --call-graph dwarf ./target/release/myapp
perf report

# Instruments (macOS)
cargo instruments --release --template "Time Profiler"

# cargo-flamegraph
cargo install flamegraph
cargo flamegraph

# samply (cross-platform)
cargo install samply
samply record ./target/release/myapp
```

### Memory Profiling
```bash
# valgrind (memory leaks, cache performance)
cargo build
valgrind --tool=massif ./target/debug/myapp

# dhat (heap profiling)
# Add dhat crate to project

# cargo-bloat (binary size analysis)
cargo install cargo-bloat
cargo bloat --release
```

## Zero-Cost Abstractions

### Avoiding Unnecessary Allocations
```rust
// Bad: Allocates on every call
fn process_string(s: String) -> String {
    s.to_uppercase()
}

// Good: Borrows, no allocation
fn process_string(s: &str) -> String {
    s.to_uppercase()
}

// Best: In-place modification where possible
fn process_string_mut(s: &mut String) {
    *s = s.to_uppercase();
}
```

### Stack vs Heap Allocation
```rust
// Stack: Fast, known size at compile time
let numbers = [1, 2, 3, 4, 5];

// Heap: Flexible, runtime-sized data
let numbers = vec![1, 2, 3, 4, 5];

// Use Box<[T]> for fixed-size heap data (smaller than Vec)
let numbers: Box<[i32]> = vec![1, 2, 3, 4, 5].into_boxed_slice();
```

### Iterator Chains vs For Loops
```rust
// Good: Zero-cost iterator chains (compiled to efficient code)
let sum: i32 = numbers
    .iter()
    .filter(|&&n| n > 0)
    .map(|&n| n * 2)
    .sum();

// Also good: Manual loop (similar performance)
let mut sum = 0;
for &n in numbers.iter() {
    if n > 0 {
        sum += n * 2;
    }
}

// Choose iterators for readability, loops for complex logic
```

## Compilation Optimizations

### Release Profile Tuning
```toml
[profile.release]
opt-level = 3           # Maximum optimization
lto = "fat"             # Link-time optimization
codegen-units = 1       # Better optimization, slower compile
strip = true            # Strip symbols from binary
panic = "abort"         # Smaller binary, no stack unwinding

[profile.release-with-debug]
inherits = "release"
debug = true           # Keep debug symbols for profiling
```

### Target CPU Features
```bash
# Use native CPU features
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Or in .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```

## Memory Layout Optimization

### Struct Field Ordering
```rust
// Bad: Wasted padding (24 bytes)
struct BadLayout {
    a: u8,   // 1 byte + 7 padding
    b: u64,  // 8 bytes
    c: u8,   // 1 byte + 7 padding
}

// Good: Minimal padding (16 bytes)
struct GoodLayout {
    b: u64,  // 8 bytes
    a: u8,   // 1 byte
    c: u8,   // 1 byte + 6 padding
}

// Use #[repr(C)] for consistent layout
#[repr(C)]
struct FixedLayout {
    // Fields laid out in declaration order
}
```

### Enum Optimization
```rust
// Consider enum size (uses largest variant)
enum Large {
    Small(u8),
    Big([u8; 1000]),  // Entire enum is 1000+ bytes!
}

// Better: Box large variants
enum Optimized {
    Small(u8),
    Big(Box<[u8; 1000]>),  // Enum is now pointer-sized
}
```

## Concurrency Patterns

### Using Rayon for Data Parallelism
```rust
use rayon::prelude::*;

// Sequential
let sum: i32 = data.iter().map(|x| expensive(x)).sum();

// Parallel (automatic work stealing)
let sum: i32 = data.par_iter().map(|x| expensive(x)).sum();
```

### Async Runtime Optimization
```rust
// tokio - For I/O-heavy workloads
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    // Async I/O operations
}

// async-std - Alternative runtime
// Choose based on ecosystem compatibility
```

## Common Rust Performance Patterns

### String Handling
```rust
// Avoid unnecessary clones
// Bad
fn process(s: String) -> String {
    let upper = s.clone().to_uppercase();
    upper
}

// Good
fn process(s: &str) -> String {
    s.to_uppercase()
}

// Use Cow for conditional cloning
use std::borrow::Cow;

fn maybe_uppercase<'a>(s: &'a str, uppercase: bool) -> Cow<'a, str> {
    if uppercase {
        Cow::Owned(s.to_uppercase())
    } else {
        Cow::Borrowed(s)
    }
}
```

### Collection Preallocation
```rust
// Bad: Multiple reallocations
let mut vec = Vec::new();
for i in 0..1000 {
    vec.push(i);
}

// Good: Single allocation
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
    vec.push(i);
}

// Best: Use collect with size_hint
let vec: Vec<_> = (0..1000).collect();
```

### Minimize Clones
```rust
// Bad: Unnecessary clones in loop
for item in &items {
    let owned = item.clone();
    process(owned);
}

// Good: Borrow when possible
for item in &items {
    process_borrowed(item);
}

// Use Rc/Arc only when necessary
use std::rc::Rc;
let shared = Rc::new(expensive_data);
let clone1 = Rc::clone(&shared);  // Cheap pointer clone
```

## Performance Anti-Patterns

### Unnecessary Dynamic Dispatch
```rust
// Bad: Dynamic dispatch overhead
fn process(items: &[Box<dyn Trait>]) {
    for item in items {
        item.method();  // Virtual call
    }
}

// Good: Static dispatch via generics
fn process<T: Trait>(items: &[T]) {
    for item in items {
        item.method();  // Direct call, can be inlined
    }
}
```

### Lock Contention
```rust
// Bad: Holding lock during expensive operation
let data = mutex.lock().unwrap();
let result = expensive_computation(&data);
drop(data);

// Good: Release lock quickly
let cloned = {
    let data = mutex.lock().unwrap();
    data.clone()
};
let result = expensive_computation(&cloned);
```

## Benchmarking with Criterion

### Basic Benchmark
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| {
        b.iter(|| fibonacci(black_box(20)))
    });
}

criterion_group!(benches, fibonacci_benchmark);
criterion_main!(benches);
```

### Parameterized Benchmarks
```rust
fn bench_sizes(c: &mut Criterion) {
    let mut group = c.benchmark_group("process");

    for size in [10, 100, 1000, 10000].iter() {
        group.bench_with_input(
            BenchmarkId::from_parameter(size),
            size,
            |b, &size| {
                b.iter(|| process_data(black_box(size)))
            },
        );
    }

    group.finish();
}
```

## Performance Checklist

**Before Optimizing:**
- [ ] Profile with release build to identify bottlenecks
- [ ] Measure baseline with criterion benchmarks
- [ ] Use cargo-flamegraph to visualize hot paths

**Rust-Specific Optimizations:**
- [ ] Enable LTO in release profile
- [ ] Use target-cpu=native for CPU-specific features
- [ ] Preallocate collections with `with_capacity`
- [ ] Prefer borrowing (&T) over owned (T) in APIs
- [ ] Use iterators over manual loops
- [ ] Minimize clones - use Rc/Arc only when needed
- [ ] Order struct fields by size (largest first)
- [ ] Box large enum variants
- [ ] Use rayon for CPU-bound parallelism
- [ ] Avoid unnecessary dynamic dispatch

**After Optimizing:**
- [ ] Re-benchmark to verify improvements
- [ ] Check binary size with cargo-bloat
- [ ] Profile memory with valgrind/dhat
- [ ] Add regression tests with criterion baselines

## Tools and Crates

**Profiling:**
- `criterion` - Statistical benchmarking
- `flamegraph` - Flamegraph generation
- `cargo-instruments` - macOS profiling
- `perf` - Linux performance analysis
- `dhat` - Heap profiling

**Optimization:**
- `rayon` - Data parallelism
- `tokio` / `async-std` - Async runtime
- `parking_lot` - Faster mutex/rwlock
- `smallvec` - Stack-allocated vectors
- `once_cell` - Lazy static initialization

**Analysis:**
- `cargo-bloat` - Binary size analysis
- `cargo-udeps` - Find unused dependencies
- `twiggy` - Code size profiler

---

*Rust-specific performance optimization with zero-cost abstractions and profiling tools*