gh-yebot-rad-cc-plugins-plu…/commands/analyze-performance.md
2025-11-30 09:08:03 +08:00

argument-hint: [target] [--profiler=<tool>] [--benchmark] [--report]
description: Profile plugin performance with Instruments, perf, VTune, Tracy to identify bottlenecks and optimize DSP algorithms
allowed-tools: Bash, Read, Write, AskUserQuestion
model: sonnet

/analyze-performance - Performance Profiling and Optimization

Profile your JUCE plugin's performance to identify bottlenecks, optimize DSP algorithms, and ensure efficient CPU usage across different scenarios.

Overview

This command guides you through comprehensive performance analysis using profiling tools, benchmarking, and optimization strategies. It helps identify hot paths in DSP code, memory bottlenecks, and inefficient algorithms.

Syntax

/analyze-performance [target] [--profiler=<tool>] [--benchmark] [--report]

Arguments

  • target (optional): What to profile - dsp, ui, full, or specific (default: dsp)
  • --profiler=<tool>: Profiler to use - instruments, perf, vtune, tracy, or auto (default: auto)
  • --benchmark: Run performance benchmarks and compare against baseline
  • --report: Generate detailed performance report with recommendations

Examples

# Profile DSP code with platform default profiler
/analyze-performance dsp

# Profile entire plugin with Instruments (macOS)
/analyze-performance full --profiler=instruments

# Run benchmarks and generate report
/analyze-performance dsp --benchmark --report

# Profile UI rendering performance
/analyze-performance ui --profiler=instruments

Instructions

Step 1: Pre-Profiling Setup

@build-engineer - Prepare optimized build with profiling symbols.

  1. Build with Release optimization + debug symbols:

    macOS (Xcode):

    # CMakeLists.txt
    if(APPLE)
        set(CMAKE_BUILD_TYPE RelWithDebInfo)
        # Disable stripping for profiling
        set(CMAKE_XCODE_ATTRIBUTE_STRIP_INSTALLED_PRODUCT NO)
        set(CMAKE_XCODE_ATTRIBUTE_DEPLOYMENT_POSTPROCESSING NO)
    endif()
    

    Build:

    cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
    cmake --build build --config RelWithDebInfo
    

    Windows (Visual Studio):

    cmake -B build
    cmake --build build --config RelWithDebInfo
    

    Linux:

    cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-g -O2"
    cmake --build build
    
  2. Verify optimizations are enabled:

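    # All platforms: confirm the configured build type and its compiler flags
    grep -E "CMAKE_BUILD_TYPE|CMAKE_CXX_FLAGS_RELWITHDEBINFO" build/CMakeCache.txt
    
    # macOS: confirm the binary still carries symbols for your code
    xcrun nm build/MyPlugin.vst3/Contents/MacOS/MyPlugin | grep -i processBlock
    
    # Linux: confirm debug sections are present (point at the .so inside the bundle)
    readelf -S build/MyPlugin.vst3/Contents/x86_64-linux/MyPlugin.so | grep debug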
    
  3. Install profiling tools:

    macOS:

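    # Instruments ships with the full Xcode app (install Xcode from the App Store);
    # xcode-select --install only adds the Command Line Tools
    xcode-select --install
    
    # Optional: Tracy profiler
    brew install tracy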
    

    Windows:

    # Intel VTune (recommended)
    # Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
    
    # Or Visual Studio Profiler (included with VS)
    

    Linux:

    # perf (Linux perf tool)
    sudo apt install linux-tools-common linux-tools-generic
    
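    # Tracy profiler (package name and availability vary by distribution)
    sudo apt install tracy-profiler
    
    # OR build from source; check out the same tag as the client (v0.10 in Step 7)
    git clone --branch v0.10 https://github.com/wolfpld/tracy
    make -C tracy/profiler/build/unix release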
    

Step 2: DSP Performance Profiling

@dsp-engineer + @test-automation-engineer - Identify DSP bottlenecks.

Profile Audio Processing

macOS - Using Instruments

  1. Launch Instruments with plugin:

    # Open standalone plugin in Instruments
    open -a Instruments
    
    # Or profile in DAW:
    # 1. Launch Instruments
    # 2. Choose "Time Profiler" template
    # 3. Click Record
    # 4. In dropdown, select DAW process (e.g., "Logic Pro")
    # 5. Load plugin in DAW and play audio
    
  2. Time Profiler Configuration:

    • Template: Time Profiler
    • Sample Interval: 1 ms (high resolution)
    • Record Waiting Threads: OFF (focus on CPU time)
    • High Frequency: ON
  3. Record profiling session:

    • Click Record in Instruments
    • Load plugin in standalone app or DAW
    • Play test audio for 30-60 seconds
    • Include various parameter automations
    • Stop recording
  4. Analyze results:

    • Switch to "Call Tree" view
    • Enable filters:
      • Separate by Thread
      • Invert Call Tree
      • Hide System Libraries
    • Look for hot functions in your code
    • Focus on: processBlock(), DSP algorithm functions
  5. Identify bottlenecks:

    • Functions taking > 5% of CPU time are candidates for optimization
    • Look for:
      • Unexpected memory allocations (malloc, new)
      • Expensive math operations (use vectorization)
      • Inefficient loops
      • Cache misses (scattered memory access)

Example Instruments Output:

Symbol Name                               % Time
MyPlugin::processBlock()                   45.2%
  MyFilter::processSample()                28.3%
    std::pow()                             15.1%  ⚠️ Expensive!
    MyFilter::updateCoefficients()         13.2%
  MyDistortion::process()                  16.9%

Red Flags:

  • std::pow(), std::sin(), std::cos() in inner loops → Use lookup tables or approximations
  • Memory allocations → Pre-allocate in prepareToPlay()
  • Virtual function calls in hot path → Consider static polymorphism

Windows - Using Visual Studio Profiler

  1. Start profiling:

    • Open Visual Studio
    • Debug → Performance Profiler
    • Select: CPU Usage
    • Start profiling
    • Launch DAW and load plugin
    • Play audio for 60 seconds
    • Stop profiling
  2. Analyze:

    • View "Hot Path"
    • Check "Functions" view sorted by "Total CPU %"
    • Drill into processBlock()
  3. Generate report:

    • File → Export Report
    • Save as performance-analysis-[date].diagsession

Linux - Using perf

  1. Record performance data:

    # Profile specific process (find PID of DAW or standalone)
    perf record -F 999 -g -p <PID>
    
    # Or profile command:
    perf record -F 999 -g ./build/MyPlugin_Standalone
    
    # Play audio for 60 seconds, then Ctrl+C to stop
    
  2. View results:

    # Interactive TUI
    perf report
    
    # Generate flame graph (requires flamegraph tools)
    git clone https://github.com/brendangregg/FlameGraph
    perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg
    xdg-open flamegraph.svg  # or open it in any web browser
    
  3. Interpret flame graph:

    • Width = CPU time
    • Look for wide sections in your code
    • Drill down into processBlock() stack frames

Common Performance Issues and Fixes

Issue 1: Expensive Transcendental Functions

Symptom: std::sin(), std::cos(), std::pow() show up in profiler.

Solution: Use lookup tables or polynomial approximations.

Example:

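#include <array>
#include <cmath>

// ❌ Slow: libm call per sample in processBlock (phase in radians)
float sine = std::sin(phase);

// ✅ Fast: linearly interpolated lookup table
class SineLUT {
    static constexpr int tableSize = 2048;
    std::array<float, tableSize> table{};

public:
    SineLUT() {
        constexpr double twoPi = 6.283185307179586;  // avoids the non-standard M_PI
        for (int i = 0; i < tableSize; ++i)
            table[i] = static_cast<float>(std::sin(twoPi * i / tableSize));
    }

    // phase is normalized: one full cycle spans [0, 1), not [0, 2π)
    float lookup(float phase) const {
        phase -= std::floor(phase);  // wrap into [0, 1]; also handles negative phase
        float index = phase * tableSize;
        int i0 = static_cast<int>(index) % tableSize;
        int i1 = (i0 + 1) % tableSize;
        float frac = index - std::floor(index);
        return table[i0] + frac * (table[i1] - table[i0]);
    }
};

// Use in processBlock (normalizedPhase = phase in radians / 2π):
float sine = sineLUT.lookup(normalizedPhase);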

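Speedup: typically 5-10x in a tight loop, depending on the CPU and on whether the table stays in cache; confirm with a benchmark.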
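The solution above also mentions polynomial approximations. A minimal sketch of one common option follows: a parabolic fit to the sine curve with a single refinement pass. The names fastSin and fastSinWrapped are hypothetical, and the absolute error (on the order of 1e-3) should be checked against your own accuracy requirements before replacing std::sin().

```cpp
#include <cmath>

// Hypothetical helper (not part of JUCE): parabolic sine approximation,
// intended for x in [-pi, pi].
inline float fastSin(float x) {
    constexpr float pi = 3.14159265358979f;
    constexpr float B = 4.0f / pi;
    constexpr float C = -4.0f / (pi * pi);
    constexpr float P = 0.225f;  // weight of the refinement pass

    float y = B * x + C * x * std::fabs(x);  // parabola through 0, ±pi/2, ±pi
    return P * (y * std::fabs(y) - y) + y;   // one refinement pass
}

// Wraps an arbitrary phase in radians into [-pi, pi) before approximating.
inline float fastSinWrapped(float phase) {
    constexpr float twoPi = 6.28318530717959f;
    phase -= twoPi * std::floor(phase / twoPi + 0.5f);
    return fastSin(phase);
}
```

Unlike a lookup table, this uses no memory and no data-dependent loads, which may make it easier for the compiler to vectorize.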


Issue 2: Inefficient Memory Access Patterns

Symptom: High cache miss rate, poor vectorization.

Solution: Structure-of-arrays instead of array-of-structures.

Example:

// ❌ Poor cache locality (AoS)
struct Voice {
    float frequency, amplitude, phase;
};
std::vector<Voice> voices;

for (auto& voice : voices) {
    voice.phase += voice.frequency;
    output += voice.amplitude * std::sin(voice.phase);
}

// ✅ Better cache locality (SoA)
struct VoiceBank {
    std::vector<float> frequencies;
    std::vector<float> amplitudes;
    std::vector<float> phases;
};
VoiceBank bank;  // all three vectors must have the same length

for (size_t i = 0; i < bank.phases.size(); ++i) {
    bank.phases[i] += bank.frequencies[i];
    output += bank.amplitudes[i] * std::sin(bank.phases[i]);
}

Benefit: Better SIMD vectorization, fewer cache misses.


Issue 3: Unnecessary Branching

Symptom: Unpredictable branches in inner loops.

Solution: Branchless code or precompute decisions.

Example:

// ❌ Branch in inner loop
for (int i = 0; i < numSamples; ++i) {
    if (bypassEnabled)
        output[i] = input[i];
    else
        output[i] = process(input[i]);
}

// ✅ Branchless or separate loops
if (bypassEnabled) {
    std::copy(input, input + numSamples, output);
} else {
    for (int i = 0; i < numSamples; ++i)
        output[i] = process(input[i]);
}

Issue 4: Virtual Function Calls

Symptom: Virtual dispatch overhead in hot path.

Solution: Static polymorphism (templates) or function pointers.

Example:

// ❌ Virtual function call per sample
class Filter {
public:
    virtual float process(float input) = 0;
};

// ✅ Template-based static polymorphism
template<typename FilterType>
class Processor {
    FilterType filter;
public:
    void processBlock(float* buffer, int numSamples) {
        for (int i = 0; i < numSamples; ++i)
            buffer[i] = filter.process(buffer[i]);  // Inlined!
    }
};
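To make the template version concrete, here is a small self-contained sketch. OnePoleLowpass is a hypothetical filter invented for illustration; the Processor template is repeated so the block compiles on its own. The point is only that process() is non-virtual, so the call inside the loop is resolved at compile time.

```cpp
// Hypothetical filter type: one-pole lowpass with a non-virtual process().
struct OnePoleLowpass {
    float a = 0.5f;  // smoothing coefficient in (0, 1]
    float z = 0.0f;  // previous output

    float process(float input) {
        z += a * (input - z);
        return z;
    }
};

// Same shape as the Processor template above: the filter type is fixed at compile time.
template <typename FilterType>
class Processor {
    FilterType filter;
public:
    void processBlock(float* buffer, int numSamples) {
        for (int i = 0; i < numSamples; ++i)
            buffer[i] = filter.process(buffer[i]);  // direct call, eligible for inlining
    }
};
```

Usage would look like `Processor<OnePoleLowpass> p; p.processBlock(data, numSamples);`. The trade-off is that the filter type can no longer be swapped at runtime without an extra dispatch layer outside the per-sample loop.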

Step 3: SIMD Optimization

@dsp-engineer - Leverage SIMD instructions for maximum performance.

Identify Vectorization Opportunities

  1. Check if compiler vectorized loops:

    macOS/Linux:

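    # GCC vectorization report
    cmake -B build -DCMAKE_CXX_FLAGS="-O3 -fopt-info-vec"
    cmake --build build 2>&1 | grep vectorized
    
    # Clang (including Apple Clang) uses optimization remarks instead
    cmake -B build -DCMAKE_CXX_FLAGS="-O3 -Rpass=loop-vectorize"
    cmake --build build 2>&1 | grep vectorized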
    

    Windows:

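    # MSVC vectorization report (/EHsc is repeated because setting
    # CMAKE_CXX_FLAGS replaces CMake's default flags)
    cmake -B build -DCMAKE_CXX_FLAGS="/EHsc /Qvec-report:2"
    cmake --build build --config Release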
    
  2. Manual SIMD with JUCE:

    JUCE provides cross-platform SIMD abstractions:

    #include <juce_dsp/juce_dsp.h>
    
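    // Example: apply a gain with explicit SIMD registers
    // (4 floats per register on SSE/NEON, more on AVX)
    // gain is assumed to be a float member of the processor
    void processBlock(juce::AudioBuffer<float>& buffer) {
        auto* channelData = buffer.getWritePointer(0);
        const int numSamples = buffer.getNumSamples();
    
        using SIMDFloat = juce::dsp::SIMDRegister<float>;
        constexpr int simdSize = static_cast<int>(SIMDFloat::size());
    
        int i = 0;
        // fromRawArray/copyToRawArray require SIMD-aligned addresses, and
        // host buffers are not guaranteed to be aligned, so check first
        if (SIMDFloat::isSIMDAligned(channelData)) {
            const auto simdGain = SIMDFloat::expand(gain);
            for (; i + simdSize <= numSamples; i += simdSize) {
                auto simdInput = SIMDFloat::fromRawArray(channelData + i);
                (simdInput * simdGain).copyToRawArray(channelData + i);  // SIMD multiply
            }
        }
    
        // Handle remaining samples (or the whole buffer if it was unaligned)
        for (; i < numSamples; ++i)
            channelData[i] *= gain;
    }
    
    // For simple operations like this one,
    // juce::FloatVectorOperations::multiply(channelData, gain, numSamples)
    // already takes care of alignment and vectorization.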
    
  3. Benchmark SIMD vs scalar:

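    # Build with SIMD enabled (-march=native targets the build machine only;
    # use it for local benchmarks, not for binaries you ship)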
    cmake -B build-simd -DCMAKE_CXX_FLAGS="-O3 -march=native"
    cmake --build build-simd
    
    # Compare performance (see Step 4 below)
    

Expected Speedup: 2-4x for SIMD-friendly code.


Step 4: Performance Benchmarking

@test-automation-engineer - Quantify performance improvements.

Create Benchmark Tests

// Tests/PerformanceBenchmark.cpp
#include <benchmark/benchmark.h>  // Google Benchmark
#include <cmath>
#include "../Source/PluginProcessor.h"

static void BM_ProcessBlock(benchmark::State& state) {
    juce::ScopedJuceInitialiser_GUI juceInit;  // needed if the processor touches the MessageManager
    MyPluginProcessor processor;
    processor.setPlayConfigDetails(2, 2, 44100.0, 512);
    processor.prepareToPlay(44100.0, 512);

    // Reference input: 440 Hz sine, copied into the work buffer on every iteration
    juce::AudioBuffer<float> input(2, 512), buffer(2, 512);
    juce::MidiBuffer midi;

    for (int ch = 0; ch < 2; ++ch)
        for (int i = 0; i < 512; ++i)
            input.setSample(ch, i, static_cast<float>(
                std::sin(juce::MathConstants<double>::twoPi * 440.0 * i / 44100.0)));

    for (auto _ : state) {
        buffer.makeCopyOf(input);  // avoid feeding the previous output back in (copy is included in the timing)
        processor.processBlock(buffer, midi);
        benchmark::DoNotOptimize(buffer.getReadPointer(0));
    }

    // Reports throughput as samples (items) per second
    state.SetItemsProcessed(state.iterations() * 512);
}

BENCHMARK(BM_ProcessBlock);  // let the library choose the iteration count

BENCHMARK_MAIN();

Run Benchmarks

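# Install Google Benchmark
git clone https://github.com/google/benchmark.git
cd benchmark
cmake -E make_directory "build"
# BENCHMARK_DOWNLOAD_DEPENDENCIES fetches GoogleTest, which the default configuration needs
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ..
cmake --build "build" --config Release
sudo cmake --build "build" --config Release --target install
cd ..

# Build and run your benchmarks (from your plugin's project directory)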
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target PerformanceBenchmark
./build/Tests/PerformanceBenchmark --benchmark_out=results.json --benchmark_out_format=json

Example Output:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_ProcessBlock              1.23 ms         1.22 ms          571

Interpretation:

  • 1.22 ms per block @ 512 samples = ~10.5% CPU at 44.1kHz (1.22 ms / (512/44100 * 1000) ms = 1.22 / 11.61)
  • Goal: < 5% CPU (< 0.58 ms/block)
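The conversion used in this interpretation can be written as two small helpers. This is a sketch with hypothetical names (cpuLoadPercent, maxBlockTimeMs), and it assumes "CPU %" means the share of one block's real-time budget on a single core.

```cpp
// Hypothetical helper: share of the real-time budget used by one block, in percent.
constexpr double cpuLoadPercent(double blockTimeMs, int blockSize, double sampleRate) {
    const double budgetMs = 1000.0 * blockSize / sampleRate;  // audio time one block represents
    return 100.0 * blockTimeMs / budgetMs;
}

// Inverse: the longest block time that stays within a given CPU target.
constexpr double maxBlockTimeMs(double targetPercent, int blockSize, double sampleRate) {
    return targetPercent / 100.0 * 1000.0 * blockSize / sampleRate;
}
```

For example, cpuLoadPercent(1.22, 512, 44100.0) comes out near 10.5, and maxBlockTimeMs(5.0, 512, 44100.0) is about 0.58 ms.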

Compare Before/After Optimization

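# Save baseline
./build/Tests/PerformanceBenchmark --benchmark_out=baseline.json --benchmark_out_format=json

# Make optimizations
# ...

# Compare (timings vary between runs, so a plain diff always reports changes;
# tools/compare.py from the Google Benchmark checkout reports the relative difference)
./build/Tests/PerformanceBenchmark --benchmark_out=optimized.json --benchmark_out_format=json
python3 benchmark/tools/compare.py benchmarks baseline.json optimized.json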
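Because timings fluctuate from run to run, a before/after comparison is easier to act on when it uses a tolerance rather than an exact match. The sketch below shows one possible check; isRegression, percentChange, and the 5% default tolerance are assumptions for illustration, not part of Google Benchmark.

```cpp
// Hypothetical helper: flags a regression when the new time exceeds the
// baseline by more than the given relative tolerance (0.05 = 5%).
constexpr bool isRegression(double baselineMs, double currentMs, double tolerance = 0.05) {
    return currentMs > baselineMs * (1.0 + tolerance);
}

// Relative change in percent; negative means the new build is faster.
constexpr double percentChange(double baselineMs, double currentMs) {
    return 100.0 * (currentMs - baselineMs) / baselineMs;
}
```

A check like this could back the performance regression tests listed in the Definition of Done, with the tolerance tuned to the noise you observe on your CI machines.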

Step 5: UI Performance Profiling

@ui-engineer - Ensure UI doesn't impact audio performance.

Profile UI Rendering

  1. Check UI frame rate:

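    // Add to Editor
    class MyEditor : public juce::AudioProcessorEditor, private juce::Timer {
    public:
        explicit MyEditor(juce::AudioProcessor& p) : juce::AudioProcessorEditor(p) {
            startTimerHz(60);  // request ~60 callbacks per second
        }
    
    private:
        void timerCallback() override {
            auto now = juce::Time::getMillisecondCounterHiRes();
            if (lastFrameTime > 0.0)  // skip the first callback, which has no previous timestamp
                DBG("FPS: " << 1000.0 / (now - lastFrameTime));  // ~60 if the message thread keeps up
            lastFrameTime = now;
            repaint();
        }
    
        double lastFrameTime = 0.0;
    };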
    
  2. Profile with Instruments (macOS):

    • Use "Core Animation" template (named "Animation Hitches" in recent Xcode versions)
    • Check for:
      • Dropped frames (should be 0)
      • Expensive drawing operations
      • Off-screen rendering
  3. Optimize UI:

    • Use repaint() only when needed (not on every audio callback!)
    • Coalesce repaints:
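      // ❌ Repaint on every parameter change (may be called from the audio thread!)
      void parameterChanged(const juce::String& parameterID, float newValue) override {
          repaint();  // BAD
      }
      
      // ✅ Rate-limit repaints: set a flag and let a 60 Hz timer do the repaint
      std::atomic<bool> needsRepaint { false };
      
      void parameterChanged(const juce::String& parameterID, float newValue) override {
          needsRepaint.store(true);  // cheap and safe from any thread
      }
      
      void timerCallback() override {  // timer started once with startTimerHz(60)
          if (needsRepaint.exchange(false))
              repaint();
      }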
      
    • Cache rendered graphics:
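      juce::Image cachedBackground;
      
      void paint(juce::Graphics& g) override {
          if (cachedBackground.isNull()) {
              cachedBackground = juce::Image(juce::Image::ARGB, getWidth(), getHeight(), true);
              juce::Graphics cg(cachedBackground);
              drawComplexBackground(cg);
          }
          g.drawImageAt(cachedBackground, 0, 0);
      }
      
      void resized() override {
          cachedBackground = juce::Image();  // rebuild the cache at the new size on the next paint
      }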
      

Step 6: Memory Profiling

@test-automation-engineer - Detect memory leaks and excessive allocations.

macOS - Instruments Leaks

  1. Launch Instruments → Leaks template
  2. Record session (load/unload plugin multiple times)
  3. Check for leaks:
    • Red flags = memory leaks
    • Click to see stack trace of allocation

Linux - Valgrind

# Profile standalone plugin
valgrind --leak-check=full --track-origins=yes ./build/MyPlugin_Standalone

# Play audio, then quit
# Check report for leaks

Windows - Visual Studio Memory Profiler

  1. Debug → Performance Profiler → Memory Usage
  2. Take snapshots before and after loading plugin
  3. Compare snapshots for memory growth

Check for Allocations in Audio Thread

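Use -fsanitize=address (Clang/GCC) to catch memory errors such as buffer overflows and use-after-free:

cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address -g"
cmake --build build
./build/Tests/MyPluginTests

AddressSanitizer does not report allocations as such. To find them, look for malloc/new beneath processBlock() in profiler stack traces, or on Clang 20+ build with -fsanitize=realtime and mark processBlock() with [[clang::nonblocking]].

Look for: Allocations called from processBlock() - these are FORBIDDEN.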
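One way to catch such allocations in a test build is to replace the global operator new and count calls made while a thread-local flag is set. The following is a minimal sketch under that assumption; AudioThreadGuard, onAudioThread, and allocationsInAudioThread are hypothetical names, and the approach does not see direct malloc() calls or over-aligned allocations.

```cpp
#include <atomic>
#include <cstdlib>
#include <new>

// Hypothetical test-only instrumentation; keep it out of release builds.
thread_local bool onAudioThread = false;
std::atomic<int> allocationsInAudioThread { 0 };

// RAII guard: marks the current thread as "inside the audio callback".
struct AudioThreadGuard {
    AudioThreadGuard()  { onAudioThread = true; }
    ~AudioThreadGuard() { onAudioThread = false; }
};

// Replacement global operator new: counts allocations made under the guard.
void* operator new(std::size_t size) {
    if (onAudioThread)
        allocationsInAudioThread.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size ? size : 1))
        return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
```

In a unit test, create an AudioThreadGuard at the top of the code that calls processBlock() and assert afterwards that allocationsInAudioThread is still zero.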


Step 7: Advanced Profiling - Tracy

@test-automation-engineer - Use Tracy for frame-perfect profiling.

Tracy is a real-time profiler with nanosecond precision.

Integrate Tracy

# CMakeLists.txt
include(FetchContent)
FetchContent_Declare(
    tracy
    GIT_REPOSITORY https://github.com/wolfpld/tracy.git
    GIT_TAG v0.10
)
FetchContent_MakeAvailable(tracy)

target_link_libraries(MyPlugin PRIVATE TracyClient)
target_compile_definitions(MyPlugin PRIVATE TRACY_ENABLE)

Add Tracy Zones

#include <tracy/Tracy.hpp>

void processBlock(AudioBuffer<float>& buffer, MidiBuffer& midi) {
    ZoneScoped;  // Automatic profiling for this function

    {
        ZoneScopedN("Filter Processing");
        filter.process(buffer);
    }

    {
        ZoneScopedN("Distortion Processing");
        distortion.process(buffer);
    }
}
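Conceptually, a scoped zone is an RAII object that records a timestamp when it is constructed and reports the elapsed time when it goes out of scope. The sketch below illustrates that idea with std::chrono only; it is not Tracy's implementation, ZoneStats and ScopedZone are hypothetical names, and the accumulation is not thread-safe.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical accumulator for one named zone.
struct ZoneStats {
    std::int64_t totalNanoseconds = 0;
    std::int64_t callCount = 0;
};

// Measures the lifetime of the enclosing scope and adds it to a ZoneStats.
class ScopedZone {
    ZoneStats& stats;
    std::chrono::steady_clock::time_point start;
public:
    explicit ScopedZone(ZoneStats& s)
        : stats(s), start(std::chrono::steady_clock::now()) {}

    ~ScopedZone() {
        const auto elapsed = std::chrono::steady_clock::now() - start;
        stats.totalNanoseconds +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        ++stats.callCount;
    }

    ScopedZone(const ScopedZone&) = delete;
    ScopedZone& operator=(const ScopedZone&) = delete;
};
```

Tracy's macros do considerably more (source location capture, lock-free queues, streaming to the profiler), so treat this only as a mental model for what the braces around each ZoneScopedN block delimit.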

Run Tracy

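# Launch Tracy profiler GUI (the binary may be named tracy, tracy-profiler,
# or Tracy-release depending on how it was installed)
tracy

# Run plugin in DAW
# The instrumented plugin appears in Tracy's connection dialog; connect to see real-time profiling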

Benefits:

  • Real-time visualization
  • Frame-by-frame analysis
  • Memory allocation tracking
  • Lock contention detection

Step 8: Generate Performance Report

@support-engineer - Document findings and recommendations.

Report Template

# Performance Analysis Report - MyPlugin v1.2.0

**Date:** 2024-05-15
**Analyst:** @dsp-engineer
**Platform:** macOS 14.5, Apple M1 Max

## Summary

CPU usage has been reduced from **8.2%** to **2.1%** (74% improvement) through targeted optimizations.

## Profiling Results

### Baseline (v1.1.0)
- Single instance CPU: 8.2% @ 44.1kHz, 512 samples
- 10 instances: 82% CPU (not sustainable)
- Hot path: `std::pow()` in saturation curve (45% of CPU time)

### Optimized (v1.2.0)
- Single instance CPU: 2.1%
- 10 instances: 21% CPU
- Hot path: Vectorized filter processing (18% of CPU time)

## Optimizations Applied

### 1. Replaced `std::pow()` with Lookup Table
**Impact:** 45% CPU reduction
**Location:** `Source/DSP/Saturation.cpp:42`

### 2. SIMD Vectorization of Filter
**Impact:** 15% CPU reduction
**Location:** `Source/DSP/SVFilter.cpp:87`

### 3. Removed Allocation in processBlock
**Impact:** Eliminated RT violations
**Location:** `Source/PluginProcessor.cpp:156`

## Benchmark Results

| Test | Baseline | Optimized | Improvement |
|------|----------|-----------|-------------|
| ProcessBlock (512 samples) | 0.95 ms | 0.24 ms | 74% faster |
| Single instance CPU | 8.2% | 2.1% | 74% reduction |
| 50 instances CPU | 410% | 105% | 74% reduction |

## Remaining Bottlenecks

1. **Reverb Algorithm** - Still using naive implementation (12% CPU)
   - Recommendation: Switch to FDN reverb or partitioned convolution
2. **UI Repaints** - Currently 120fps (unnecessary)
   - Recommendation: Rate-limit to 60fps

## Next Steps

- [ ] Optimize reverb algorithm (target: 5% CPU)
- [ ] Rate-limit UI repaints (target: 60fps)
- [ ] Profile on Windows (Intel CPU) to verify SIMD portability
- [ ] Run stress test with 100+ instances

## Flame Graphs

![Baseline Flame Graph](flamegraph-baseline.svg)
![Optimized Flame Graph](flamegraph-optimized.svg)

## Conclusion

Plugin now meets performance targets for release:
- ✅ Single instance < 5% CPU
- ✅ 20 instances < 50% CPU
- ✅ No RT violations detected
- ⚠️ Further optimization possible in reverb module

Definition of Done

Performance analysis is complete when:

  • Profiling data collected on target platforms
  • Hot paths identified and documented
  • Optimization opportunities prioritized
  • Key optimizations implemented and benchmarked
  • Performance regression tests added
  • Report generated with flame graphs and recommendations
  • CPU usage meets targets (< 5% single instance)
  • No allocations or locks in audio thread

Performance Targets

CPU Usage Goals

| Scenario | Target | Acceptable | Poor |
|----------|--------|------------|------|
| Single instance @ 44.1kHz, 512 samples | < 2% | < 5% | > 10% |
| 10 instances | < 20% | < 40% | > 60% |
| 50 instances | < 50% | < 80% | > 100% |

Latency Goals

| Plugin Type | Target Latency |
|-------------|----------------|
| Dynamics (compressor, gate) | 0 samples |
| EQ, filter | 0-64 samples |
| Modulation effects | 0-128 samples |
| Reverb, delay | 0-512 samples |
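Latency targets in samples translate to time only once a sample rate is fixed. A one-line sketch of the conversion, using the hypothetical name latencyMs:

```cpp
// Hypothetical helper: converts a latency in samples to milliseconds.
constexpr double latencyMs(int latencySamples, double sampleRate) {
    return 1000.0 * latencySamples / sampleRate;
}
```

For instance, 512 samples correspond to roughly 11.6 ms at 44.1 kHz, and to about half that at 88.2 kHz.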

Memory Usage

  • RAM: < 50 MB per instance
  • Allocations: 0 in processBlock()

Quick Profiling Checklist

For rapid performance validation:

# 1. Build optimized
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build

# 2. Profile (macOS; xctrace replaces the instruments CLI removed in Xcode 13)
xcrun xctrace record --template "Time Profiler" --launch -- ./build/MyPlugin_Standalone

# 3. Check for RT violations
# (Look for malloc/new in processBlock stack traces)

# 4. Benchmark
./build/Tests/PerformanceBenchmark

# 5. Verify targets
# Single instance should be < 5% CPU

Time Required: 30 minutes


Expert Help

Delegate performance tasks:

  • @dsp-engineer - Optimize DSP algorithms, implement SIMD
  • @test-automation-engineer - Set up benchmarks, run profilers
  • @ui-engineer - Optimize UI rendering, fix repainting issues
  • @technical-lead - Review architectural performance issues
  • @plugin-engineer - Integrate optimizations into build system

Related Documentation

  • TESTING_STRATEGY.md - Performance testing in CI/CD
  • juce-best-practices skill - Realtime safety guidelines
  • dsp-cookbook skill - Optimized DSP algorithms
  • /run-pluginval command - Validation includes performance tests

Tools Reference

macOS

  • Instruments (Xcode) - Time Profiler, Allocations, Leaks
  • Activity Monitor - Real-time CPU monitoring
  • sample - Command-line profiler: sample <PID> 10 -f output.txt

Windows

  • Visual Studio Profiler - CPU Usage, Memory Usage
  • Intel VTune - Advanced profiling, hardware counters
  • Windows Performance Analyzer (WPA) - System-wide profiling

Linux

  • perf - Linux performance profiler
  • Valgrind - Memory profiling, cache profiling
  • gprof - GNU profiler
  • Tracy - Real-time frame profiler

Cross-Platform

  • Tracy Profiler - Real-time, frame-perfect profiling
  • Google Benchmark - Microbenchmarking library
  • Superluminal - Commercial profiler (excellent for audio plugins)

Remember: "Premature optimization is the root of all evil" - but audio plugins are performance-critical. Profile first, optimize hot paths, and always measure the impact!