---
argument-hint: "[target] [--profiler=<tool>] [--benchmark] [--report]"
description: "Profile plugin performance with Instruments, perf, VTune, Tracy to identify bottlenecks and optimize DSP algorithms"
allowed-tools: Bash, Read, Write, AskUserQuestion
model: sonnet
---

# /analyze-performance - Performance Profiling and Optimization

Profile your JUCE plugin's performance to identify bottlenecks, optimize DSP algorithms, and ensure efficient CPU usage across different scenarios.

## Overview

This command guides you through comprehensive performance analysis using profiling tools, benchmarking, and optimization strategies. It helps identify hot paths in DSP code, memory bottlenecks, and inefficient algorithms.

## Syntax

```bash
/analyze-performance [target] [--profiler=<tool>] [--benchmark] [--report]
```

### Arguments

- `target` (optional): What to profile - `dsp`, `ui`, `full`, or `specific` (default: `dsp`)
- `--profiler=<tool>`: Profiler to use - `instruments`, `perf`, `vtune`, `tracy`, or `auto` (default: `auto`)
- `--benchmark`: Run performance benchmarks and compare against baseline
- `--report`: Generate detailed performance report with recommendations

### Examples

```bash
# Profile DSP code with platform default profiler
/analyze-performance dsp

# Profile entire plugin with Instruments (macOS)
/analyze-performance full --profiler=instruments

# Run benchmarks and generate report
/analyze-performance dsp --benchmark --report

# Profile UI rendering performance
/analyze-performance ui --profiler=instruments
```

## Instructions

## Step 1: Pre-Profiling Setup

**@build-engineer** - Prepare optimized build with profiling symbols.

1. **Build with Release optimization + debug symbols:**

   macOS (Xcode):
   ```cmake
   # CMakeLists.txt
   if(APPLE)
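       # Select the build type on the command line (-DCMAKE_BUILD_TYPE or --config);
       # multi-config generators such as Xcode ignore CMAKE_BUILD_TYPE
       # Disable stripping for profiling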
       set(CMAKE_XCODE_ATTRIBUTE_STRIP_INSTALLED_PRODUCT NO)
       set(CMAKE_XCODE_ATTRIBUTE_DEPLOYMENT_POSTPROCESSING NO)
   endif()
   ```

   Build:
   ```bash
   cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
   cmake --build build --config RelWithDebInfo
   ```

   Windows (Visual Studio):
   ```bash
   cmake -B build
   cmake --build build --config RelWithDebInfo
   ```

   Linux:
   ```bash
   cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-g -O2"
   cmake --build build
   ```

2. **Verify optimizations are enabled:**
   ```bash
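   # The optimization level is not recorded in the binary, so inspect the compile flags.
   # compile_commands.json requires -DCMAKE_EXPORT_COMPILE_COMMANDS=ON (Makefile or Ninja generators).
   grep -oE -- '-O[0-3sz]' build/compile_commands.json | sort | uniq -c

   # macOS: confirm symbols were not stripped
   nm build/MyPlugin.vst3/Contents/MacOS/MyPlugin | grep processBlock

   # Linux: confirm DWARF debug sections are present
   readelf -S build/MyPlugin.vst3/Contents/x86_64-linux/MyPlugin.so | grep debug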
   ```

3. **Install profiling tools:**

   **macOS:**
   ```bash
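   # Instruments ships with the full Xcode app (install Xcode from the App Store).
   # The Command Line Tools installed below do not include Instruments.
   xcode-select --install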

   # Optional: Tracy profiler
   brew install tracy
   ```

   **Windows:**
   ```powershell
   # Intel VTune (recommended)
   # Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html

   # Or Visual Studio Profiler (included with VS)
   ```

   **Linux:**
   ```bash
   # perf (Linux perf tool)
   sudo apt install linux-tools-common linux-tools-generic

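   # Tracy profiler (only if your distribution packages it)
   sudo apt install tracy-profiler

   # OR build from source; the profiler must match the client version (v0.10 in Step 7)
   git clone --branch v0.10 https://github.com/wolfpld/tracy
   make -C tracy/profiler/build/unix release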
   ```

---

## Step 2: DSP Performance Profiling

**@dsp-engineer** + **@test-automation-engineer** - Identify DSP bottlenecks.

### Profile Audio Processing

#### macOS - Using Instruments

1. **Launch Instruments with plugin:**
   ```bash
   # Open standalone plugin in Instruments
   open -a Instruments

   # Or profile in DAW:
   # 1. Launch Instruments
   # 2. Choose "Time Profiler" template
   # 3. Click Record
   # 4. In dropdown, select DAW process (e.g., "Logic Pro")
   # 5. Load plugin in DAW and play audio
   ```

2. **Time Profiler Configuration:**
   - Template: Time Profiler
   - Sampling interval: 1 ms (high resolution)
   - Record Waiting Threads: OFF (focus on CPU time)
   - High Frequency: ON

3. **Record profiling session:**
   - Click Record in Instruments
   - Load plugin in standalone app or DAW
   - Play test audio for 30-60 seconds
   - Include various parameter automations
   - Stop recording

4. **Analyze results:**
   - Switch to "Call Tree" view
   - Enable filters:
     - ✅ Separate by Thread
     - ✅ Invert Call Tree
     - ✅ Hide System Libraries
   - Look for hot functions in your code
   - **Focus on:** `processBlock()`, DSP algorithm functions

5. **Identify bottlenecks:**
   - Functions taking > 5% of CPU time are candidates for optimization
   - Look for:
     - Unexpected memory allocations (`malloc`, `new`)
     - Expensive math operations (use vectorization)
     - Inefficient loops
     - Cache misses (scattered memory access)

**Example Instruments Output:**
```
Symbol Name                               % Time
MyPlugin::processBlock()                   45.2%
  MyFilter::processSample()                28.3%
    std::pow()                             15.1%  ⚠️ Expensive!
    MyFilter::updateCoefficients()         13.2%
  MyDistortion::process()                  16.9%
```

**Red Flags:**
- `std::pow()`, `std::sin()`, `std::cos()` in inner loops → Use lookup tables or approximations
- Memory allocations → Pre-allocate in `prepareToPlay()`
- Virtual function calls in hot path → Consider static polymorphism

---

#### Windows - Using Visual Studio Profiler

1. **Start profiling:**
   - Open Visual Studio
   - Debug → Performance Profiler
   - Select: CPU Usage
   - Start profiling
   - Launch DAW and load plugin
   - Play audio for 60 seconds
   - Stop profiling

2. **Analyze:**
   - View "Hot Path"
   - Check "Functions" view sorted by "Total CPU %"
   - Drill into `processBlock()`

3. **Generate report:**
   - File → Export Report
   - Save as `performance-analysis-[date].diagsession`

---

#### Linux - Using perf

1. **Record performance data:**
   ```bash
   # Profile specific process (find PID of DAW or standalone)
   perf record -F 999 -g -p <PID>

   # Or profile command:
   perf record -F 999 -g ./build/MyPlugin_Standalone

   # Play audio for 60 seconds, then Ctrl+C to stop
   ```

2. **View results:**
   ```bash
   # Interactive TUI
   perf report

   # Generate flame graph (requires flamegraph tools)
   git clone https://github.com/brendangregg/FlameGraph
   perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg
   xdg-open flamegraph.svg
   ```

3. **Interpret flame graph:**
   - Width = CPU time
   - Look for wide sections in your code
   - Drill down into `processBlock()` stack frames

---

### Common Performance Issues and Fixes

#### Issue 1: Expensive Transcendental Functions

**Symptom:** `std::sin()`, `std::cos()`, `std::pow()` show up in profiler.

**Solution:** Use lookup tables or polynomial approximations.

**Example:**
```cpp
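#include <array>
#include <cmath>

// ❌ Slow: direct call per sample in processBlock (phase in radians)
float sine = std::sin(phase);

// ✅ Fast: lookup table with linear interpolation
class SineLUT {
    static constexpr int tableSize = 2048;
    static constexpr double twoPi = 6.283185307179586;  // avoids the non-standard M_PI
    std::array<float, tableSize> table;

public:
    SineLUT() {
        for (int i = 0; i < tableSize; ++i)
            table[i] = static_cast<float>(std::sin(twoPi * i / tableSize));
    }

    // phase is normalized: one cycle spans [0, 1)
    float lookup(float phase) const {
        float index = (phase - std::floor(phase)) * tableSize;  // wrap, including negative phase
        int i0 = static_cast<int>(index) % tableSize;
        int i1 = (i0 + 1) % tableSize;
        float frac = index - std::floor(index);
        return table[i0] + frac * (table[i1] - table[i0]);
    }
};

// Use in processBlock (normalized phase, not radians):
float sine = sineLUT.lookup(phase);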
```

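**Speedup:** 5-10x faster in favorable cases; the gain depends on table size, cache behavior, and the target CPU, so confirm it with a benchmark.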
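
The solution above also mentions polynomial approximations. A minimal sketch, assuming the input has already been wrapped to [-π, π]: `fastSin` is a hypothetical name, and the 0.225 refinement weight is the commonly quoted value for this parabolic approximation rather than something taken from this plugin. Its absolute error is roughly 0.001, which may or may not be acceptable for a given algorithm.

```cpp
#include <cmath>

// Parabolic sine approximation; valid only for x in [-pi, pi].
inline float fastSin(float x) {
    constexpr float pi = 3.14159265358979f;
    constexpr float b = 4.0f / pi;
    constexpr float c = -4.0f / (pi * pi);
    const float y = b * x + c * x * std::fabs(x);  // first pass: parabola through 0 and +/-pi
    return 0.225f * (y * std::fabs(y) - y) + y;    // second pass: reduces the peak error
}
```

Unlike a lookup table, this needs no memory, which can matter when many voices compete for cache.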

---

#### Issue 2: Inefficient Memory Access Patterns

**Symptom:** High cache miss rate, poor vectorization.

**Solution:** Structure-of-arrays instead of array-of-structures.

**Example:**
```cpp
// ❌ Poor cache locality (AoS)
struct Voice {
    float frequency, amplitude, phase;
};
std::vector<Voice> voices;

for (auto& voice : voices) {
    voice.phase += voice.frequency;
    output += voice.amplitude * std::sin(voice.phase);
}

// ✅ Better cache locality (SoA)
struct VoiceBank {
    std::vector<float> frequencies;
    std::vector<float> amplitudes;
    std::vector<float> phases;
};
VoiceBank bank;

for (std::size_t i = 0; i < bank.phases.size(); ++i) {
    bank.phases[i] += bank.frequencies[i];
    output += bank.amplitudes[i] * std::sin(bank.phases[i]);
}
```

**Benefit:** Better SIMD vectorization, fewer cache misses.

---

#### Issue 3: Unnecessary Branching

**Symptom:** Unpredictable branches in inner loops.

**Solution:** Branchless code or precompute decisions.

**Example:**
```cpp
// ❌ Branch in inner loop
for (int i = 0; i < numSamples; ++i) {
    if (bypassEnabled)
        output[i] = input[i];
    else
        output[i] = process(input[i]);
}

// ✅ Branchless or separate loops
if (bypassEnabled) {
    std::copy(input, input + numSamples, output);
} else {
    for (int i = 0; i < numSamples; ++i)
        output[i] = process(input[i]);
}
```
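
The example above hoists a loop-invariant branch. When the decision depends on each sample, the branch can sometimes be replaced by arithmetic instead. A minimal sketch with hypothetical helper names (`halfWaveRectify`, `hardClip`); whether this beats a well-predicted branch should be measured, not assumed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Half-wave rectifier written without an if: 0.5 * (x + |x|) is x for x > 0 and 0 otherwise.
inline void halfWaveRectify(float* samples, std::size_t numSamples) {
    for (std::size_t i = 0; i < numSamples; ++i)
        samples[i] = 0.5f * (samples[i] + std::fabs(samples[i]));
}

// Hard clip to [-limit, limit]; std::min/std::max typically compile to min/max instructions.
inline void hardClip(float* samples, std::size_t numSamples, float limit) {
    for (std::size_t i = 0; i < numSamples; ++i)
        samples[i] = std::max(-limit, std::min(samples[i], limit));
}
```

Loops of this shape also tend to be easier for the compiler to auto-vectorize than loops with data-dependent branches.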

---

#### Issue 4: Virtual Function Calls

**Symptom:** Virtual dispatch overhead in hot path.

**Solution:** Static polymorphism (templates) or function pointers.

**Example:**
```cpp
// ❌ Virtual function call per sample
class Filter {
public:
    virtual float process(float input) = 0;
};

// ✅ Template-based static polymorphism
template<typename FilterType>
class Processor {
    FilterType filter;
public:
    void processBlock(float* buffer, int numSamples) {
        for (int i = 0; i < numSamples; ++i)
            buffer[i] = filter.process(buffer[i]);  // Inlined!
    }
};
```
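
To show how the template approach is used, here is a self-contained sketch. `OnePoleLowpass`, `Gain`, and `StaticProcessor` are hypothetical types invented for illustration; the only requirement on a filter type is a non-virtual `process(float)` member.

```cpp
#include <cstddef>

// Hypothetical filter types: each exposes a non-virtual process(float).
struct OnePoleLowpass {
    float coeff = 0.5f;  // smoothing coefficient in (0, 1]
    float state = 0.0f;
    float process(float input) {
        state += coeff * (input - state);
        return state;
    }
};

struct Gain {
    float gain = 2.0f;
    float process(float input) const { return gain * input; }
};

// The filter type is fixed at compile time, so the call below is a direct call.
template <typename FilterType>
class StaticProcessor {
public:
    void processBlock(float* buffer, std::size_t numSamples) {
        for (std::size_t i = 0; i < numSamples; ++i)
            buffer[i] = filter.process(buffer[i]);  // eligible for inlining
    }

private:
    FilterType filter;
};
```

`StaticProcessor<Gain>` and `StaticProcessor<OnePoleLowpass>` are distinct types, so choosing between them at runtime needs one dispatch per block (for example a `switch`), not one per sample.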

---

## Step 3: SIMD Optimization

**@dsp-engineer** - Leverage SIMD instructions for maximum performance.

### Identify Vectorization Opportunities

1. **Check if compiler vectorized loops:**

   **macOS/Linux:**
   ```bash
   # GCC vectorization report
   cmake -B build -DCMAKE_CXX_FLAGS="-O3 -fopt-info-vec"
   # Clang equivalent: -DCMAKE_CXX_FLAGS="-O3 -Rpass=loop-vectorize"
   cmake --build build 2>&1 | grep vectorized
   ```

   **Windows:**
   ```powershell
   # MSVC vectorization report (cl.exe prepends options from the CL environment variable)
   $env:CL = "/Qvec-report:2"
   cmake -B build
   cmake --build build --config Release
   ```

2. **Manual SIMD with JUCE:**

   JUCE provides cross-platform SIMD abstractions:

   ```cpp
   #include <juce_dsp/juce_dsp.h>

   // Example: apply gain to several samples at once with SIMD
   void processBlock(juce::AudioBuffer<float>& buffer, float gain) {
       auto* channelData = buffer.getWritePointer(0);
       const int numSamples = buffer.getNumSamples();

       // Register width: 4 floats with SSE/NEON, more with wider instruction sets
       using SIMDFloat = juce::dsp::SIMDRegister<float>;
       constexpr int simdSize = static_cast<int>(SIMDFloat::size());

       int i = 0;
       // fromRawArray()/copyToRawArray() require SIMD-aligned memory;
       // unaligned buffers fall through to the scalar loop
       if (SIMDFloat::isSIMDAligned(channelData)) {
           const SIMDFloat simdGain(gain);
           for (; i <= numSamples - simdSize; i += simdSize) {  // <= so the last full chunk is included
               auto simdInput = SIMDFloat::fromRawArray(channelData + i);
               auto simdOutput = simdInput * simdGain;  // SIMD multiply
               simdOutput.copyToRawArray(channelData + i);
           }
       }

       // Handle remaining samples
       for (; i < numSamples; ++i) {
           channelData[i] *= gain;
       }
   }
   ```

3. **Benchmark SIMD vs scalar:**
   ```bash
   # Build with SIMD enabled (-march=native targets the build machine only; do not ship this binary)
   cmake -B build-simd -DCMAKE_CXX_FLAGS="-O3 -march=native"
   cmake --build build-simd

   # Compare performance (see Step 4 below)
   ```

**Expected Speedup:** 2-4x for SIMD-friendly code.

---

## Step 4: Performance Benchmarking

**@test-automation-engineer** - Quantify performance improvements.

### Create Benchmark Tests

```cpp
// Tests/PerformanceBenchmark.cpp
#include <benchmark/benchmark.h>  // Google Benchmark
#include "../Source/PluginProcessor.h"

static void BM_ProcessBlock(benchmark::State& state) {
    MyPluginProcessor processor;
    processor.setPlayConfigDetails(2, 2, 44100.0, 512);
    processor.prepareToPlay(44100.0, 512);

    juce::AudioBuffer<float> buffer(2, 512);
    juce::MidiBuffer midi;

    // Fill with test signal
    for (int ch = 0; ch < 2; ++ch)
        for (int i = 0; i < 512; ++i)
            buffer.setSample(ch, i, std::sin(2 * M_PI * 440 * i / 44100.0));

    for (auto _ : state) {
        processor.processBlock(buffer, midi);
        benchmark::DoNotOptimize(buffer.getReadPointer(0));
    }

    // Report throughput (samples processed per second)
    state.SetItemsProcessed(state.iterations() * 512);
}

BENCHMARK(BM_ProcessBlock)->Iterations(10000);

BENCHMARK_MAIN();
```

### Run Benchmarks

```bash
# Install Google Benchmark
git clone https://github.com/google/benchmark.git
cd benchmark
cmake -E make_directory "build"
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../
cmake --build "build" --config Release
sudo cmake --build "build" --config Release --target install
cd ..  # back to the plugin project root

# Build and run your benchmarks
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target PerformanceBenchmark
./build/Tests/PerformanceBenchmark --benchmark_out=results.json --benchmark_out_format=json
```

**Example Output:**
```
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_ProcessBlock              1.23 ms         1.22 ms        10000
```

**Interpretation:**
- 1.22 ms per block @ 512 samples = ~10.5% CPU at 44.1 kHz (one block lasts 512 / 44100 × 1000 = 11.61 ms, and 1.22 / 11.61 ≈ 0.105)
- Goal: < 5% CPU (< 0.58 ms/block)

### Compare Before/After Optimization

```bash
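# Save baseline
./build/Tests/PerformanceBenchmark --benchmark_out=baseline.json --benchmark_out_format=json

# Make optimizations
# ...

# Compare (compare.py ships in the Google Benchmark repository cloned above;
# it needs the Python packages listed in benchmark/tools/requirements.txt)
./build/Tests/PerformanceBenchmark --benchmark_out=optimized.json --benchmark_out_format=json
python3 benchmark/tools/compare.py benchmarks baseline.json optimized.json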
```
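
When the comparison feeds a report or a CI gate, it helps to turn the two timings into a single number. A minimal sketch, assuming times are in milliseconds; `improvementPercent` and `isRegression` are hypothetical helpers, and the tolerance is a project choice rather than a recommended value.

```cpp
// Percentage reduction in time (or CPU) going from baseline to optimized.
constexpr double improvementPercent(double baseline, double optimized) {
    return 100.0 * (baseline - optimized) / baseline;
}

// Regression gate: true when the current time exceeds the baseline
// by more than tolerancePercent.
constexpr bool isRegression(double baselineMs, double currentMs, double tolerancePercent) {
    return currentMs > baselineMs * (1.0 + tolerancePercent / 100.0);
}
```

A tolerance of a few percent absorbs run-to-run noise; the right value depends on how stable the benchmark machine is.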

---

## Step 5: UI Performance Profiling

**@ui-engineer** - Ensure UI doesn't impact audio performance.

### Profile UI Rendering

1. **Check UI frame rate:**
   ```cpp
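   // Add to Editor (measures the interval between timer-driven repaint requests)
   class MyEditor : public juce::AudioProcessorEditor, private juce::Timer {
   public:
       explicit MyEditor(juce::AudioProcessor& p) : juce::AudioProcessorEditor(p) {
           startTimerHz(60);  // request repaints at 60 Hz
       }

   private:
       void timerCallback() override {
           auto now = juce::Time::getMillisecondCounterHiRes();
           if (lastFrameTime > 0.0)
               DBG("FPS: " << 1000.0 / (now - lastFrameTime));  // Should be close to 60
           lastFrameTime = now;
           repaint();
       }

       double lastFrameTime = 0;
   };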
   ```

2. **Profile with Instruments (macOS):**
   - Use "Core Animation" template
   - Check for:
     - Dropped frames (should be 0)
     - Expensive drawing operations
     - Off-screen rendering

3. **Optimize UI:**
   - **Use `repaint()` only when needed** (not on every audio callback!)
   - **Coalesce repaints:**
     ```cpp
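     // ❌ Repaint on every parameter change (can run on the audio thread!)
     void parameterChanged(const juce::String& parameterID, float newValue) override {
         repaint();  // BAD
     }

     // ✅ Rate-limit repaints: set a flag, repaint from a 60 Hz timer
     std::atomic<bool> needsRepaint { false };

     void parameterChanged(const juce::String& parameterID, float newValue) override {
         needsRepaint.store(true);  // lock-free, safe from any thread
     }

     void timerCallback() override {  // timer started once with startTimerHz(60)
         if (needsRepaint.exchange(false))
             repaint();
     }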
     ```
   - **Cache rendered graphics:**
     ```cpp
     juce::Image cachedBackground;

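     void paint(juce::Graphics& g) override {
         // Rebuild the cache on first paint and after a resize
         if (cachedBackground.isNull()
             || cachedBackground.getWidth() != getWidth()
             || cachedBackground.getHeight() != getHeight()) {
             cachedBackground = juce::Image(juce::Image::ARGB, getWidth(), getHeight(), true);
             juce::Graphics cg(cachedBackground);
             drawComplexBackground(cg);
         }
         g.drawImageAt(cachedBackground, 0, 0);
     }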
     ```

---

## Step 6: Memory Profiling

**@test-automation-engineer** - Detect memory leaks and excessive allocations.

### macOS - Instruments Leaks

1. **Launch Instruments → Leaks template**
2. **Record session** (load/unload plugin multiple times)
3. **Check for leaks:**
   - Red flags = memory leaks
   - Click to see stack trace of allocation

### Linux - Valgrind

```bash
# Profile standalone plugin
valgrind --leak-check=full --track-origins=yes ./build/MyPlugin_Standalone

# Play audio, then quit
# Check report for leaks
```

### Windows - Visual Studio Memory Profiler

1. Debug → Performance Profiler → Memory Usage
2. Take snapshots before and after loading plugin
3. Compare snapshots for memory growth

### Check for Allocations in Audio Thread

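AddressSanitizer (`-fsanitize=address`, Clang/GCC) reports memory errors such as out-of-bounds access and use-after-free; it does not report allocations. To flag allocations on the audio thread, build with RealtimeSanitizer (`-fsanitize=realtime`, Clang 20 or later) and mark `processBlock()` with `[[clang::nonblocking]]`:
```bash
cmake -B build -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_FLAGS="-fsanitize=realtime -g"
cmake --build build
./build/Tests/MyPluginTests
```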

**Look for:** Allocations called from `processBlock()` - these are FORBIDDEN.

---

## Step 7: Advanced Profiling - Tracy

**@test-automation-engineer** - Use Tracy for frame-perfect profiling.

Tracy is a real-time profiler with nanosecond precision.

### Integrate Tracy

```cmake
# CMakeLists.txt
include(FetchContent)
FetchContent_Declare(
    tracy
    GIT_REPOSITORY https://github.com/wolfpld/tracy.git
    GIT_TAG v0.10
)
FetchContent_MakeAvailable(tracy)

target_link_libraries(MyPlugin PRIVATE TracyClient)
target_compile_definitions(MyPlugin PRIVATE TRACY_ENABLE)
```

### Add Tracy Zones

```cpp
#include <tracy/Tracy.hpp>

void processBlock(AudioBuffer<float>& buffer, MidiBuffer& midi) {
    ZoneScoped;  // Automatic profiling for this function

    {
        ZoneScopedN("Filter Processing");
        filter.process(buffer);
    }

    {
        ZoneScopedN("Distortion Processing");
        distortion.process(buffer);
    }
}
```

### Run Tracy

```bash
# Launch Tracy profiler GUI
tracy

# Run plugin in DAW
# Tracy will automatically connect and show real-time profiling
```

**Benefits:**
- Real-time visualization
- Frame-by-frame analysis
- Memory allocation tracking
- Lock contention detection

---

## Step 8: Generate Performance Report

**@support-engineer** - Document findings and recommendations.

### Report Template

```markdown
# Performance Analysis Report - MyPlugin v1.2.0

**Date:** 2024-05-15
**Analyst:** @dsp-engineer
**Platform:** macOS 14.5, Apple M1 Max

## Summary

CPU usage has been reduced from **8.2%** to **2.1%** (74% improvement) through targeted optimizations.

## Profiling Results

### Baseline (v1.1.0)
- Single instance CPU: 8.2% @ 44.1kHz, 512 samples
- 10 instances: 82% CPU (not sustainable)
- Hot path: `std::pow()` in saturation curve (45% of CPU time)

### Optimized (v1.2.0)
- Single instance CPU: 2.1%
- 10 instances: 21% CPU
- Hot path: Vectorized filter processing (18% of CPU time)

## Optimizations Applied

### 1. Replaced `std::pow()` with Lookup Table
**Impact:** 45% CPU reduction
**Location:** `Source/DSP/Saturation.cpp:42`

### 2. SIMD Vectorization of Filter
**Impact:** 15% CPU reduction
**Location:** `Source/DSP/SVFilter.cpp:87`

### 3. Removed Allocation in processBlock
**Impact:** Eliminated RT violations
**Location:** `Source/PluginProcessor.cpp:156`

## Benchmark Results

| Test | Baseline | Optimized | Improvement |
|------|----------|-----------|-------------|
| ProcessBlock (512 samples) | 1.85 ms | 0.48 ms | 74% faster |
| Single instance CPU | 8.2% | 2.1% | 74% reduction |
| 50 instances CPU | 410% | 105% | 74% reduction |

## Remaining Bottlenecks

1. **Reverb Algorithm** - Still using naive implementation (12% CPU)
   - Recommendation: Switch to FDN reverb or partitioned convolution
2. **UI Repaints** - Currently 120fps (unnecessary)
   - Recommendation: Rate-limit to 60fps

## Next Steps

- [ ] Optimize reverb algorithm (target: 5% CPU)
- [ ] Rate-limit UI repaints (target: 60fps)
- [ ] Profile on Windows (Intel CPU) to verify SIMD portability
- [ ] Run stress test with 100+ instances

## Flame Graphs

![Baseline Flame Graph](flamegraph-baseline.svg)
![Optimized Flame Graph](flamegraph-optimized.svg)

## Conclusion

Plugin now meets performance targets for release:
- ✅ Single instance < 5% CPU
- ✅ 20 instances < 50% CPU
- ✅ No RT violations detected
- ⚠️ Further optimization possible in reverb module
```

---

## Definition of Done

Performance analysis is complete when:

- ✅ Profiling data collected on target platforms
- ✅ Hot paths identified and documented
- ✅ Optimization opportunities prioritized
- ✅ Key optimizations implemented and benchmarked
- ✅ Performance regression tests added
- ✅ Report generated with flame graphs and recommendations
- ✅ CPU usage meets targets (< 5% single instance)
- ✅ No allocations or locks in audio thread

---

## Performance Targets

### CPU Usage Goals

| Scenario | Target | Acceptable | Poor |
|----------|--------|------------|------|
| Single instance @ 44.1kHz, 512 samples | < 2% | < 5% | > 10% |
| 10 instances | < 20% | < 40% | > 60% |
| 50 instances | < 50% | < 80% | > 100% |

### Latency Goals

| Plugin Type | Target Latency |
|-------------|----------------|
| Dynamics (compressor, gate) | 0 samples |
| EQ, filter | 0-64 samples |
| Modulation effects | 0-128 samples |
| Reverb, delay | 0-512 samples |

### Memory Usage

- **RAM:** < 50 MB per instance
- **Allocations:** 0 in `processBlock()`

---

## Quick Profiling Checklist

For rapid performance validation:

```bash
# 1. Build optimized
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build

# 2. Profile (macOS)
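# (the standalone `instruments` CLI was removed in Xcode 13; xctrace replaces it)
xcrun xctrace record --template 'Time Profiler' --launch -- ./build/MyPlugin_Standalone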

# 3. Check for RT violations
# (Look for malloc/new in processBlock stack traces)

# 4. Benchmark
./build/Tests/PerformanceBenchmark

# 5. Verify targets
# Single instance should be < 5% CPU
```

**Time Required:** 30 minutes

---

## Expert Help

Delegate performance tasks:

- **@dsp-engineer** - Optimize DSP algorithms, implement SIMD
- **@test-automation-engineer** - Set up benchmarks, run profilers
- **@ui-engineer** - Optimize UI rendering, fix repainting issues
- **@technical-lead** - Review architectural performance issues
- **@plugin-engineer** - Integrate optimizations into build system

---

## Related Documentation

- **TESTING_STRATEGY.md** - Performance testing in CI/CD
- **juce-best-practices** skill - Realtime safety guidelines
- **dsp-cookbook** skill - Optimized DSP algorithms
- `/run-pluginval` command - Validation includes performance tests

---

## Tools Reference

### macOS
- **Instruments** (Xcode) - Time Profiler, Allocations, Leaks
- **Activity Monitor** - Real-time CPU monitoring
- **sample** - Command-line profiler: `sample <PID> 10 -file output.txt`

### Windows
- **Visual Studio Profiler** - CPU Usage, Memory Usage
- **Intel VTune** - Advanced profiling, hardware counters
- **Windows Performance Analyzer (WPA)** - System-wide profiling

### Linux
- **perf** - Linux performance profiler
- **Valgrind** - Memory profiling, cache profiling
- **gprof** - GNU profiler
- **Tracy** - Real-time frame profiler

### Cross-Platform
- **Tracy Profiler** - Real-time, frame-perfect profiling
- **Google Benchmark** - Microbenchmarking library
- **Superluminal** - Commercial profiler (excellent for audio plugins)

---

**Remember:** "Premature optimization is the root of all evil" - but audio plugins are performance-critical. Profile first, optimize hot paths, and always measure the impact!