zhongwei/gh-yebot-rad-cc-plugins-plugins-juce-dev-team

Zhongwei Li a886924d29 Initial commit

2025-11-30 09:08:03 +08:00

argument-hint	description	allowed-tools	model
[target] [--profiler=<tool>] [--benchmark] [--report]	Profile plugin performance with Instruments, perf, VTune, Tracy to identify bottlenecks and optimize DSP algorithms	Bash, Read, Write, AskUserQuestion	sonnet

/analyze-performance - Performance Profiling and Optimization

Profile your JUCE plugin's performance to identify bottlenecks, optimize DSP algorithms, and ensure efficient CPU usage across different scenarios.

Overview

This command guides you through comprehensive performance analysis using profiling tools, benchmarking, and optimization strategies. It helps identify hot paths in DSP code, memory bottlenecks, and inefficient algorithms.

Syntax

/analyze-performance [target] [--profiler=<tool>] [--benchmark] [--report]

Arguments

target (optional): What to profile - dsp, ui, full, or specific (default: dsp)
--profiler=<tool>: Profiler to use - instruments, perf, vtune, tracy, or auto (default: auto)
--benchmark: Run performance benchmarks and compare against baseline
--report: Generate detailed performance report with recommendations

Examples

# Profile DSP code with platform default profiler
/analyze-performance dsp

# Profile entire plugin with Instruments (macOS)
/analyze-performance full --profiler=instruments

# Run benchmarks and generate report
/analyze-performance dsp --benchmark --report

# Profile UI rendering performance
/analyze-performance ui --profiler=instruments

Instructions

Step 1: Pre-Profiling Setup

@build-engineer - Prepare optimized build with profiling symbols.

Build with Release optimization + debug symbols:

macOS (Xcode):

# CMakeLists.txt
if(APPLE)
    set(CMAKE_BUILD_TYPE RelWithDebInfo)
    # Disable stripping for profiling
    set(CMAKE_XCODE_ATTRIBUTE_STRIP_INSTALLED_PRODUCT NO)
    set(CMAKE_XCODE_ATTRIBUTE_DEPLOYMENT_POSTPROCESSING NO)
endif()

Build:

cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build --config RelWithDebInfo

Windows (Visual Studio):

cmake -B build
cmake --build build --config RelWithDebInfo

Linux:

cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-g -O2"
cmake --build build

Verify optimizations are enabled:

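# Any platform: Confirm the flags CMake uses for RelWithDebInfo (typically -O2 -g with GCC/Clang)
grep CMAKE_CXX_FLAGS_RELWITHDEBINFO build/CMakeCache.txt

# macOS: Confirm symbols were not stripped
xcrun nm build/MyPlugin.vst3/Contents/MacOS/MyPlugin | grep -i processBlock

# Linux: Confirm debug sections are present (adjust the path to the .so inside the bundle)
readelf -S build/MyPlugin.vst3/Contents/x86_64-linux/MyPlugin.so | grep debug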

Install profiling tools:

macOS:

# Instruments ships with the full Xcode install
# The command below installs only the Command Line Tools
xcode-select --install

# Optional: Tracy profiler
brew install tracy

Windows:

# Intel VTune (recommended)
# Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html

# Or Visual Studio Profiler (included with VS)

Linux:

# perf (Linux perf tool)
sudo apt install linux-tools-common linux-tools-generic

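# Tracy profiler (if your distribution packages it)
sudo apt install tracy-profiler

# OR build from source (v0.10 layout; newer releases build the profiler with CMake)
git clone --branch v0.10 https://github.com/wolfpld/tracy
cd tracy/profiler/build/unix
make release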

Step 2: DSP Performance Profiling

@dsp-engineer + @test-automation-engineer - Identify DSP bottlenecks.

Profile Audio Processing

macOS - Using Instruments

Launch Instruments with plugin:

# Open standalone plugin in Instruments
open -a Instruments

# Or profile in DAW:
# 1. Launch Instruments
# 2. Choose "Time Profiler" template
# 3. Click Record
# 4. In dropdown, select DAW process (e.g., "Logic Pro")
# 5. Load plugin in DAW and play audio

Time Profiler Configuration:
- Template: Time Profiler
- Sample Interval: 1 ms (high resolution)
- Record Waiting Threads: OFF (focus on CPU time)
- High Frequency: ON
Record profiling session:
- Click Record in Instruments
- Load plugin in standalone app or DAW
- Play test audio for 30-60 seconds
- Include various parameter automations
- Stop recording
Analyze results:
- Switch to "Call Tree" view
- Enable filters:
  - ✅ Separate by Thread
  - ✅ Invert Call Tree
  - ✅ Hide System Libraries
- Look for hot functions in your code
- Focus on: processBlock(), DSP algorithm functions
Identify bottlenecks:
- Functions taking > 5% of CPU time are candidates for optimization
- Look for:
  - Unexpected memory allocations (malloc, new)
  - Expensive math operations (use vectorization)
  - Inefficient loops
  - Cache misses (scattered memory access)

Example Instruments Output:

Symbol Name                               % Time
MyPlugin::processBlock()                   45.2%
  MyFilter::processSample()                28.3%
    std::pow()                             15.1%  ⚠️ Expensive!
    MyFilter::updateCoefficients()         13.2%
  MyDistortion::process()                  16.9%

Red Flags:

std::pow(), std::sin(), std::cos() in inner loops → Use lookup tables or approximations
Memory allocations → Pre-allocate in prepareToPlay()
Virtual function calls in hot path → Consider static polymorphism
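
The pre-allocation advice above can be sketched in plain C++. This is a minimal illustration rather than JUCE code: ScratchProcessor, prepare() and process() are hypothetical names standing in for a processor's prepareToPlay() and processBlock() pair, and the cubic blend is only a placeholder for real DSP.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical processor fragment; all names are illustrative.
class ScratchProcessor {
public:
    // Setup path (the role prepareToPlay() plays in JUCE).
    // Allocating here is fine because audio is not running yet.
    void prepare(std::size_t maxBlockSize) {
        scratch.assign(maxBlockSize, 0.0f);
    }

    // Audio path (the role of processBlock()).
    // Uses only storage sized in prepare(): no resize, no push_back, no new.
    void process(float* samples, std::size_t numSamples) {
        const std::size_t n = std::min(numSamples, scratch.size());
        std::copy(samples, samples + n, scratch.begin());  // keep a dry copy
        for (std::size_t i = 0; i < n; ++i) {
            const float dry = scratch[i];
            samples[i] = 0.5f * dry + 0.5f * dry * dry * dry;  // placeholder DSP
        }
    }

    const float* scratchData() const { return scratch.data(); }
    std::size_t scratchCapacity() const { return scratch.capacity(); }

private:
    std::vector<float> scratch;  // sized once in prepare()
};
```

Because process() never changes the vector's size, the storage pointer and capacity stay the same from block to block, which is the property to look for when reviewing audio-thread code.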

Windows - Using Visual Studio Profiler

Start profiling:
- Open Visual Studio
- Debug → Performance Profiler
- Select: CPU Usage
- Start profiling
- Launch DAW and load plugin
- Play audio for 60 seconds
- Stop profiling
Analyze:
- View "Hot Path"
- Check "Functions" view sorted by "Total CPU %"
- Drill into processBlock()
Generate report:
- File → Export Report
- Save as performance-analysis-[date].diagsession

Linux - Using perf

Record performance data:

# Profile specific process (find PID of DAW or standalone)
perf record -F 999 -g -p <PID>

# Or profile command:
perf record -F 999 -g ./build/MyPlugin_Standalone

# Play audio for 60 seconds, then Ctrl+C to stop

View results:

# Interactive TUI
perf report

# Generate flame graph (requires flamegraph tools)
git clone https://github.com/brendangregg/FlameGraph
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg
xdg-open flamegraph.svg  # opens the SVG in the default viewer

Interpret flame graph:
- Width = CPU time
- Look for wide sections in your code
- Drill down into processBlock() stack frames

Common Performance Issues and Fixes

Issue 1: Expensive Transcendental Functions

Symptom: std::sin(), std::cos(), std::pow() show up in profiler.

Solution: Use lookup tables or polynomial approximations.

Example:

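#include <array>
#include <cmath>

// ❌ Slow: Direct call in processBlock (phase in radians)
float sine = std::sin(phase);

// ✅ Fast: Lookup table with linear interpolation
class SineLUT {
    static constexpr int tableSize = 2048;
    static constexpr double twoPi = 6.283185307179586;
    std::array<float, tableSize> table;

public:
    SineLUT() {
        for (int i = 0; i < tableSize; ++i)
            table[i] = static_cast<float>(std::sin(twoPi * i / tableSize));
    }

    // phase is normalized: 0.0 = start of cycle, 1.0 = one full cycle
    float lookup(float phase) const {
        phase -= std::floor(phase);  // Wrap into [0, 1); also handles negative phase
        float index = phase * tableSize;
        int i0 = static_cast<int>(index) % tableSize;
        int i1 = (i0 + 1) % tableSize;
        float frac = index - std::floor(index);
        return table[i0] + frac * (table[i1] - table[i0]);
    }
};

// Use in processBlock (convert radians to normalized phase first if needed):
float sine = sineLUT.lookup(phase);

Speedup: often 5-10x, depending on CPU and table size; confirm with a benchmark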
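
The solution above also mentions polynomial approximations. One widely circulated option is a parabolic sine approximation with a single refinement pass; the sketch below assumes the input is already wrapped to [-pi, pi]. The names fastSin and fastSinMaxError are hypothetical, and the error (on the order of 1e-3) should be checked against your own accuracy requirements.

```cpp
#include <cmath>

// Parabolic sine approximation; valid for x in [-pi, pi].
// 0.225 is the commonly used refinement weight for this formula.
inline float fastSin(float x) {
    constexpr float pi = 3.14159265358979f;
    constexpr float B = 4.0f / pi;
    constexpr float C = -4.0f / (pi * pi);
    const float y = B * x + C * x * std::fabs(x);   // first-pass parabola
    return 0.225f * (y * std::fabs(y) - y) + y;     // refinement pass
}

// Largest absolute error against std::sin over [-pi, pi], sampled on a grid.
inline float fastSinMaxError(int steps = 10000) {
    constexpr float pi = 3.14159265358979f;
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        const float x = -pi + 2.0f * pi * static_cast<float>(i) / static_cast<float>(steps);
        worst = std::fmax(worst, std::fabs(fastSin(x) - std::sin(x)));
    }
    return worst;
}
```

Unlike a lookup table, this uses no memory beyond registers, so it does not compete with audio data for cache. Whether it beats std::sin() on a given target is something to measure rather than assume.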

Issue 2: Inefficient Memory Access Patterns

Symptom: High cache miss rate, poor vectorization.

Solution: Structure-of-arrays instead of array-of-structures.

Example:

// ❌ Poor cache locality (AoS)
struct Voice {
    float frequency, amplitude, phase;
};
std::vector<Voice> voices;

for (auto& voice : voices) {
    voice.phase += voice.frequency;
    output += voice.amplitude * std::sin(voice.phase);
}

// ✅ Better cache locality (SoA)
struct VoiceBank {
    std::vector<float> frequencies;
    std::vector<float> amplitudes;
    std::vector<float> phases;
};
VoiceBank bank;

for (size_t i = 0; i < bank.phases.size(); ++i) {
    bank.phases[i] += bank.frequencies[i];
    output += bank.amplitudes[i] * std::sin(bank.phases[i]);
}

Benefit: Better SIMD vectorization, fewer cache misses.

Issue 3: Unnecessary Branching

Symptom: Unpredictable branches in inner loops.

Solution: Branchless code or precompute decisions.

Example:

// ❌ Branch in inner loop
for (int i = 0; i < numSamples; ++i) {
    if (bypassEnabled)
        output[i] = input[i];
    else
        output[i] = process(input[i]);
}

// ✅ Branchless or separate loops
if (bypassEnabled) {
    std::copy(input, input + numSamples, output);
} else {
    for (int i = 0; i < numSamples; ++i)
        output[i] = process(input[i]);
}
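
The example above hoists a loop-invariant branch. When the condition depends on each sample, a branchless formulation is the other option the solution mentions. The sketch below is a hypothetical noise-gate kernel (gateBranchless is not a JUCE function): the comparison result converts to 0.0f or 1.0f and is used as a multiplier. Compilers often perform this transformation on their own, so treat it as something to verify in the profiler.

```cpp
#include <cmath>

// Hypothetical kernel: zero every sample whose magnitude is below threshold.
// The comparison yields false/true, which converts to 0.0f/1.0f,
// so the loop body contains no if/else.
void gateBranchless(const float* input, float* output, int numSamples, float threshold) {
    for (int i = 0; i < numSamples; ++i) {
        const float keep = static_cast<float>(std::fabs(input[i]) >= threshold);
        output[i] = input[i] * keep;
    }
}
```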

Issue 4: Virtual Function Calls

Symptom: Virtual dispatch overhead in hot path.

Solution: Static polymorphism (templates) or function pointers.

Example:

// ❌ Virtual function call per sample
class Filter {
public:
    virtual float process(float input) = 0;
};

// ✅ Template-based static polymorphism
template<typename FilterType>
class Processor {
    FilterType filter;
public:
    void processBlock(float* buffer, int numSamples) {
        for (int i = 0; i < numSamples; ++i)
            buffer[i] = filter.process(buffer[i]);  // Inlined!
    }
};

Step 3: SIMD Optimization

@dsp-engineer - Leverage SIMD instructions for maximum performance.

Identify Vectorization Opportunities

Check if compiler vectorized loops:

macOS/Linux:

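# GCC vectorization report
cmake -B build -DCMAKE_CXX_FLAGS="-O3 -fopt-info-vec"
# Clang (including Apple Clang) uses -Rpass instead
cmake -B build -DCMAKE_CXX_FLAGS="-O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize"
cmake --build build 2>&1 | grep vectorized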

Windows:

# MSVC vectorization report (cl.exe reads extra options from the CL environment variable)
set CL=/Qvec-report:2
cmake -B build
cmake --build build --config Release

Manual SIMD with JUCE:

JUCE provides cross-platform SIMD abstractions:

#include <juce_dsp/juce_dsp.h>

// Example: Apply gain with SIMD (4 floats per register on SSE/NEON, more on AVX)
// For a plain gain, juce::FloatVectorOperations::multiply() already does this
void processBlock(juce::AudioBuffer<float>& buffer, float gain) {
    auto* channelData = buffer.getWritePointer(0);
    const int numSamples = buffer.getNumSamples();

    using SIMDFloat = juce::dsp::SIMDRegister<float>;
    constexpr int simdSize = static_cast<int>(SIMDFloat::size());

    // fromRawArray() expects SIMD-aligned memory; fall back to scalar otherwise
    int i = 0;
    if (SIMDFloat::isSIMDAligned(channelData)) {
        for (; i + simdSize <= numSamples; i += simdSize) {
            auto simdInput = SIMDFloat::fromRawArray(channelData + i);
            auto simdOutput = simdInput * SIMDFloat(gain);  // SIMD multiply
            simdOutput.copyToRawArray(channelData + i);
        }
    }

    // Handle remaining samples
    for (; i < numSamples; ++i) {
        channelData[i] *= gain;
    }
}

Benchmark SIMD vs scalar:

# Build with SIMD enabled (-march=native targets the build machine's CPU: for local benchmarking, not release binaries)
cmake -B build-simd -DCMAKE_CXX_FLAGS="-O3 -march=native"
cmake --build build-simd

# Compare performance (see Step 4 below)

Expected Speedup: 2-4x for SIMD-friendly code.

Step 4: Performance Benchmarking

@test-automation-engineer - Quantify performance improvements.

Create Benchmark Tests

// Tests/PerformanceBenchmark.cpp
#include <benchmark/benchmark.h>  // Google Benchmark
#include <cmath>
#include "../Source/PluginProcessor.h"

static void BM_ProcessBlock(benchmark::State& state) {
    constexpr int blockSize = 512;
    constexpr double sampleRate = 44100.0;

    MyPluginProcessor processor;
    processor.setPlayConfigDetails(2, 2, sampleRate, blockSize);
    processor.prepareToPlay(sampleRate, blockSize);

    juce::AudioBuffer<float> buffer(2, blockSize);
    juce::MidiBuffer midi;

    // Fill with a 440 Hz test tone
    for (int ch = 0; ch < 2; ++ch)
        for (int i = 0; i < blockSize; ++i)
            buffer.setSample(ch, i, static_cast<float>(
                std::sin(juce::MathConstants<double>::twoPi * 440.0 * i / sampleRate)));

    // Note: the buffer is processed in place, so each iteration sees the previous output
    for (auto _ : state) {
        processor.processBlock(buffer, midi);
        benchmark::DoNotOptimize(buffer.getReadPointer(0));
    }

    // Report throughput in samples per second
    state.SetItemsProcessed(state.iterations() * blockSize);
}

// Report in milliseconds and let the library choose the iteration count
BENCHMARK(BM_ProcessBlock)->Unit(benchmark::kMillisecond);

BENCHMARK_MAIN();

Run Benchmarks

# Install Google Benchmark
git clone https://github.com/google/benchmark.git
cd benchmark
cmake -E make_directory "build"
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ..
cmake --build "build" --config Release
sudo cmake --build "build" --config Release --target install
cd ..

# Build and run your benchmarks (from your plugin project root)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target PerformanceBenchmark
./build/Tests/PerformanceBenchmark --benchmark_out=results.json --benchmark_out_format=json

Example Output:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_ProcessBlock              1.23 ms         1.22 ms          571

Interpretation:

1.22 ms per block @ 512 samples = ~10.5% CPU at 44.1kHz (1.22 ms / (512/44100 * 1000 ms) = 1.22 / 11.61)
Goal: < 5% CPU (< 0.58 ms/block)
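
The conversion between per-block time and CPU load can be written as two small helpers. This is a sketch with hypothetical names (cpuLoadPercent, maxBlockTimeMs); it assumes a single core and that the benchmark time is representative of real playback.

```cpp
// Share of the real-time budget used by one block, as a percentage.
// blockTimeMs is the measured processing time per block (e.g. from the benchmark).
inline double cpuLoadPercent(double blockTimeMs, int blockSize, double sampleRate) {
    const double blockDurationMs = 1000.0 * blockSize / sampleRate;  // audio time the block represents
    return 100.0 * blockTimeMs / blockDurationMs;
}

// Largest per-block processing time that stays within a CPU budget, in milliseconds.
inline double maxBlockTimeMs(double budgetPercent, int blockSize, double sampleRate) {
    return budgetPercent / 100.0 * 1000.0 * blockSize / sampleRate;
}
```

For a 1.22 ms block time at 512 samples and 44.1 kHz, cpuLoadPercent returns about 10.5, and a 5% budget corresponds to about 0.58 ms per block.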

Compare Before/After Optimization

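# Save baseline
./build/Tests/PerformanceBenchmark --benchmark_out=baseline.json --benchmark_out_format=json

# Make optimizations
# ...

# Compare (compare.py ships in the Google Benchmark repo under tools/; adjust the path to your clone)
./build/Tests/PerformanceBenchmark --benchmark_out=optimized.json --benchmark_out_format=json
python3 benchmark/tools/compare.py benchmarks baseline.json optimized.json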
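
Once baseline and optimized timings are available, a regression check reduces to a percentage comparison with a tolerance for run-to-run noise. The helpers below are a hypothetical sketch (percentChange and isRegression are not part of Google Benchmark); a suitable tolerance depends on how stable your measurements are.

```cpp
// Percentage change from baseline to current; negative means faster.
inline double percentChange(double baselineMs, double currentMs) {
    return 100.0 * (currentMs - baselineMs) / baselineMs;
}

// True when current is slower than baseline by more than tolerancePercent.
inline bool isRegression(double baselineMs, double currentMs, double tolerancePercent) {
    return percentChange(baselineMs, currentMs) > tolerancePercent;
}
```

A check like this can back a performance regression test: fail the build when isRegression() returns true for the per-block time.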

Step 5: UI Performance Profiling

@ui-engineer - Ensure UI doesn't impact audio performance.

Profile UI Rendering

Check UI frame rate:

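// Add to Editor
class MyEditor : public juce::AudioProcessorEditor, private juce::Timer {
public:
    explicit MyEditor(juce::AudioProcessor& p) : juce::AudioProcessorEditor(p) {
        startTimerHz(60);  // Request about 60 repaints per second
    }

    void paint(juce::Graphics& g) override {
        // Measure the interval between actual paint calls, not timer ticks
        auto now = juce::Time::getMillisecondCounterHiRes();
        if (lastFrameTime > 0.0)
            DBG("FPS: " << 1000.0 / (now - lastFrameTime));  // Should be close to 60
        lastFrameTime = now;
        // ... draw UI ...
    }

private:
    void timerCallback() override { repaint(); }

    double lastFrameTime = 0.0;
};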

Profile with Instruments (macOS):
- Use "Core Animation" template
- Check for:
  - Dropped frames (should be 0)
  - Expensive drawing operations
  - Off-screen rendering

Optimize UI:

Use repaint() only when needed (not on every audio callback!)

Coalesce repaints:

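// ❌ Repaint directly from the parameter callback (may run on the audio thread!)
void parameterChanged(const juce::String& parameterID, float newValue) override {
    repaint();  // BAD
}

// ✅ Set a flag and repaint from a fixed-rate timer on the message thread
void parameterChanged(const juce::String& parameterID, float newValue) override {
    needsRepaint.store(true);  // std::atomic<bool> member, safe from any thread
}

void timerCallback() override {  // Timer started once with startTimerHz(60)
    if (needsRepaint.exchange(false))
        repaint();
}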

Cache rendered graphics:

juce::Image cachedBackground;

void paint(juce::Graphics& g) override {
    if (cachedBackground.isNull()) {
        cachedBackground = juce::Image(juce::Image::ARGB, getWidth(), getHeight(), true);
        juce::Graphics cg(cachedBackground);
        drawComplexBackground(cg);
    }
    g.drawImageAt(cachedBackground, 0, 0);
}

void resized() override {
    cachedBackground = {};  // Invalidate the cache so it is redrawn at the new size
}

Step 6: Memory Profiling

@test-automation-engineer - Detect memory leaks and excessive allocations.

macOS - Instruments Leaks

Launch Instruments → Leaks template
Record session (load/unload plugin multiple times)
Check for leaks:
- Red flags = memory leaks
- Click to see stack trace of allocation

Linux - Valgrind

# Profile standalone plugin
valgrind --leak-check=full --track-origins=yes ./build/MyPlugin_Standalone

# Play audio, then quit
# Check report for leaks

Windows - Visual Studio Memory Profiler

Debug → Performance Profiler → Memory Usage
Take snapshots before and after loading plugin
Compare snapshots for memory growth

Check for Allocations in Audio Thread

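Use -fsanitize=address (Clang/GCC) to catch memory errors while the tests run:

cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address -g"
cmake --build build
./build/Tests/MyPluginTests

AddressSanitizer reports out-of-bounds access, use-after-free, and leaks; it does not flag allocations by itself. To find those, look for malloc or operator new beneath processBlock() in the profiler call trees from Step 2. Allocations called from processBlock() are FORBIDDEN.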
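
One way to catch audio-thread allocations in a test build is to replace the global operator new and count calls made while a thread-local flag is set. The sketch below is a hypothetical, test-only helper (the rtcheck namespace and its names are not part of JUCE or any library). It covers only plain operator new, so direct malloc calls and aligned allocations go unnoticed.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical test-only instrumentation; none of these names come from JUCE.
namespace rtcheck {
    static thread_local bool inAudioCallback = false;  // set only on the thread under test
    static std::atomic<int> allocationCount{0};        // allocations seen while the flag was set

    // RAII guard: place at the top of the function being checked.
    struct ScopedAudioCallback {
        ScopedAudioCallback()  { inAudioCallback = true; }
        ~ScopedAudioCallback() { inAudioCallback = false; }
    };
}

// Replacement global allocator: counts allocations made inside a guarded scope.
void* operator new(std::size_t size) {
    if (rtcheck::inAudioCallback)
        rtcheck::allocationCount.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size == 0 ? 1 : size))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
```

In a unit test, construct a ScopedAudioCallback around the call to processBlock() and assert that allocationCount did not change. Keep this out of release builds, since replacing the global allocator affects every allocation in the binary.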

Step 7: Advanced Profiling - Tracy

@test-automation-engineer - Use Tracy for frame-perfect profiling.

Tracy is a real-time profiler with nanosecond precision.

Integrate Tracy

# CMakeLists.txt
include(FetchContent)
FetchContent_Declare(
    tracy
    GIT_REPOSITORY https://github.com/wolfpld/tracy.git
    GIT_TAG v0.10
)
FetchContent_MakeAvailable(tracy)

target_link_libraries(MyPlugin PRIVATE TracyClient)
target_compile_definitions(MyPlugin PRIVATE TRACY_ENABLE)

Add Tracy Zones

#include <tracy/Tracy.hpp>

void processBlock(AudioBuffer<float>& buffer, MidiBuffer& midi) {
    ZoneScoped;  // Automatic profiling for this function

    {
        ZoneScopedN("Filter Processing");
        filter.process(buffer);
    }

    {
        ZoneScopedN("Distortion Processing");
        distortion.process(buffer);
    }
}
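
The idea behind a scoped zone, measuring from scope entry to scope exit, can be illustrated without Tracy. The sketch below is not Tracy's implementation and has none of its features; ZoneStats and ScopedZoneTimer are hypothetical names, and the accumulator is not thread-safe, so it suits only single-threaded experiments.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical accumulator for one named zone.
struct ZoneStats {
    std::int64_t totalNanoseconds = 0;
    std::int64_t calls = 0;
};

// RAII timer: measures from construction to destruction and adds the result to ZoneStats.
class ScopedZoneTimer {
public:
    explicit ScopedZoneTimer(ZoneStats& s)
        : stats(s), start(std::chrono::steady_clock::now()) {}

    ~ScopedZoneTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start;
        stats.totalNanoseconds +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        ++stats.calls;
    }

    ScopedZoneTimer(const ScopedZoneTimer&) = delete;
    ScopedZoneTimer& operator=(const ScopedZoneTimer&) = delete;

private:
    ZoneStats& stats;
    std::chrono::steady_clock::time_point start;
};
```

Placing a ScopedZoneTimer at the top of a block plays the same structural role as a scoped zone macro: the destructor runs on every exit path, so early returns are measured too.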

Run Tracy

# Launch Tracy profiler GUI
tracy

# Run plugin in DAW
# Tracy will automatically connect and show real-time profiling

Benefits:

Real-time visualization
Frame-by-frame analysis
Memory allocation tracking
Lock contention detection

Step 8: Generate Performance Report

@support-engineer - Document findings and recommendations.

Report Template

# Performance Analysis Report - MyPlugin v1.2.0

**Date:** 2024-05-15
**Analyst:** @dsp-engineer
**Platform:** macOS 14.5, Apple M1 Max

## Summary

CPU usage has been reduced from **8.2%** to **2.1%** (74% improvement) through targeted optimizations.

## Profiling Results

### Baseline (v1.1.0)
- Single instance CPU: 8.2% @ 44.1kHz, 512 samples
- 10 instances: 82% CPU (not sustainable)
- Hot path: `std::pow()` in saturation curve (45% of CPU time)

### Optimized (v1.2.0)
- Single instance CPU: 2.1%
- 10 instances: 21% CPU
- Hot path: Vectorized filter processing (18% of CPU time)

## Optimizations Applied

### 1. Replaced `std::pow()` with Lookup Table
**Impact:** 45% CPU reduction
**Location:** `Source/DSP/Saturation.cpp:42`

### 2. SIMD Vectorization of Filter
**Impact:** 15% CPU reduction
**Location:** `Source/DSP/SVFilter.cpp:87`

### 3. Removed Allocation in processBlock
**Impact:** Eliminated RT violations
**Location:** `Source/PluginProcessor.cpp:156`

## Benchmark Results

| Test | Baseline | Optimized | Improvement |
|------|----------|-----------|-------------|
| ProcessBlock (512 samples) | 0.95 ms | 0.24 ms | 75% faster |
| Single instance CPU | 8.2% | 2.1% | 74% reduction |
| 50 instances CPU | 410% | 105% | 74% reduction |

## Remaining Bottlenecks

1. **Reverb Algorithm** - Still using naive implementation (12% CPU)
   - Recommendation: Switch to FDN reverb or partitioned convolution
2. **UI Repaints** - Currently 120fps (unnecessary)
   - Recommendation: Rate-limit to 60fps

## Next Steps

- [ ] Optimize reverb algorithm (target: 5% CPU)
- [ ] Rate-limit UI repaints (target: 60fps)
- [ ] Profile on Windows (Intel CPU) to verify SIMD portability
- [ ] Run stress test with 100+ instances

## Flame Graphs

![Baseline Flame Graph](flamegraph-baseline.svg)
![Optimized Flame Graph](flamegraph-optimized.svg)

## Conclusion

Plugin now meets performance targets for release:
- ✅ Single instance < 5% CPU
- ✅ 20 instances < 50% CPU
- ✅ No RT violations detected
- ⚠️ Further optimization possible in reverb module

Definition of Done

Performance analysis is complete when:

✅ Profiling data collected on target platforms
✅ Hot paths identified and documented
✅ Optimization opportunities prioritized
✅ Key optimizations implemented and benchmarked
✅ Performance regression tests added
✅ Report generated with flame graphs and recommendations
✅ CPU usage meets targets (< 5% single instance)
✅ No allocations or locks in audio thread

Performance Targets

CPU Usage Goals

Scenario	Target	Acceptable	Poor
Single instance @ 44.1kHz, 512 samples	< 2%	< 5%	> 10%
10 instances	< 20%	< 40%	> 60%
50 instances	< 50%	< 80%	> 100%

Latency Goals

Plugin Type	Target Latency
Dynamics (compressor, gate)	0 samples
EQ, filter	0-64 samples
Modulation effects	0-128 samples
Reverb, delay	0-512 samples
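
Latency targets in samples translate to different times at different sample rates. A minimal conversion helper, with the hypothetical name latencyMs:

```cpp
// Convert a latency in samples to milliseconds at a given sample rate.
inline double latencyMs(int latencySamples, double sampleRate) {
    return 1000.0 * latencySamples / sampleRate;
}
```

For example, 512 samples is about 11.6 ms at 44.1 kHz and about 5.3 ms at 96 kHz.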

Memory Usage

RAM: < 50 MB per instance
Allocations: 0 in processBlock()

Quick Profiling Checklist

For rapid performance validation:

# 1. Build optimized
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build

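# 2. Profile (macOS; xctrace replaces the instruments CLI removed from recent Xcode releases)
xcrun xctrace record --template "Time Profiler" --launch -- ./build/MyPlugin_Standalone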

# 3. Check for RT violations
# (Look for malloc/new in processBlock stack traces)

# 4. Benchmark
./build/Tests/PerformanceBenchmark

# 5. Verify targets
# Single instance should be < 5% CPU

Time Required: 30 minutes

Expert Help

Delegate performance tasks:

@dsp-engineer - Optimize DSP algorithms, implement SIMD
@test-automation-engineer - Set up benchmarks, run profilers
@ui-engineer - Optimize UI rendering, fix repainting issues
@technical-lead - Review architectural performance issues
@plugin-engineer - Integrate optimizations into build system

Related Documentation

TESTING_STRATEGY.md - Performance testing in CI/CD
juce-best-practices skill - Realtime safety guidelines
dsp-cookbook skill - Optimized DSP algorithms
/run-pluginval command - Validation includes performance tests

Tools Reference

macOS

Instruments (Xcode) - Time Profiler, Allocations, Leaks
Activity Monitor - Real-time CPU monitoring
sample - Command-line profiler: sample <PID> 10 -file output.txt

Windows

Visual Studio Profiler - CPU Usage, Memory Usage
Intel VTune - Advanced profiling, hardware counters
Windows Performance Analyzer (WPA) - System-wide profiling

Linux

perf - Linux performance profiler
Valgrind - Memory profiling, cache profiling
gprof - GNU profiler
Tracy - Real-time frame profiler

Cross-Platform

Tracy Profiler - Real-time, frame-perfect profiling
Google Benchmark - Microbenchmarking library
Superluminal - Commercial profiler (excellent for audio plugins)

Remember: "Premature optimization is the root of all evil" - but audio plugins are performance-critical. Profile first, optimize hot paths, and always measure the impact!
