22 KiB
argument-hint, description, allowed-tools, model
| argument-hint | description | allowed-tools | model |
|---|---|---|---|
| [target] [--profiler=<tool>] [--benchmark] [--report] | Profile plugin performance with Instruments, perf, VTune, Tracy to identify bottlenecks and optimize DSP algorithms | Bash, Read, Write, AskUserQuestion | sonnet |
/analyze-performance - Performance Profiling and Optimization
Profile your JUCE plugin's performance to identify bottlenecks, optimize DSP algorithms, and ensure efficient CPU usage across different scenarios.
Overview
This command guides you through comprehensive performance analysis using profiling tools, benchmarking, and optimization strategies. It helps identify hot paths in DSP code, memory bottlenecks, and inefficient algorithms.
Syntax
/analyze-performance [target] [--profiler=<tool>] [--benchmark] [--report]
Arguments
target(optional): What to profile -dsp,ui,full, orspecific(default:dsp)--profiler=<tool>: Profiler to use -instruments,perf,vtune,tracy, orauto(default:auto)--benchmark: Run performance benchmarks and compare against baseline--report: Generate detailed performance report with recommendations
Examples
# Profile DSP code with platform default profiler
/analyze-performance dsp
# Profile entire plugin with Instruments (macOS)
/analyze-performance full --profiler=instruments
# Run benchmarks and generate report
/analyze-performance dsp --benchmark --report
# Profile UI rendering performance
/analyze-performance ui --profiler=instruments
Instructions
Step 1: Pre-Profiling Setup
@build-engineer - Prepare optimized build with profiling symbols.
-
Build with Release optimization + debug symbols:
macOS (Xcode):
# CMakeLists.txt if(APPLE) set(CMAKE_BUILD_TYPE RelWithDebInfo) # Disable stripping for profiling set(CMAKE_XCODE_ATTRIBUTE_STRIP_INSTALLED_PRODUCT NO) set(CMAKE_XCODE_ATTRIBUTE_DEPLOYMENT_POSTPROCESSING NO) endif()Build:
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo cmake --build build --config RelWithDebInfoWindows (Visual Studio):
cmake -B build cmake --build build --config RelWithDebInfoLinux:
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-g -O2" cmake --build build -
Verify optimizations are enabled:
# macOS: Check optimization flags xcrun otool -v -s __TEXT __text build/MyPlugin.vst3/Contents/MacOS/MyPlugin | grep -A 5 "optimization" # Linux: Check binary for debug symbols and optimization objdump -d build/MyPlugin.vst3 | head -50 -
Install profiling tools:
macOS:
# Instruments comes with Xcode xcode-select --install # Optional: Tracy profiler brew install tracyWindows:
# Intel VTune (recommended) # Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html # Or Visual Studio Profiler (included with VS)Linux:
# perf (Linux perf tool) sudo apt install linux-tools-common linux-tools-generic # Tracy profiler sudo apt install tracy-profiler # OR build from source git clone https://github.com/wolfpld/tracy cd tracy/profiler make
Step 2: DSP Performance Profiling
@dsp-engineer + @test-automation-engineer - Identify DSP bottlenecks.
Profile Audio Processing
macOS - Using Instruments
-
Launch Instruments with plugin:
# Open standalone plugin in Instruments open -a Instruments # Or profile in DAW: # 1. Launch Instruments # 2. Choose "Time Profiler" template # 3. Click Record # 4. In dropdown, select DAW process (e.g., "Logic Pro") # 5. Load plugin in DAW and play audio -
Time Profiler Configuration:
- Template: Time Profiler
- Sample Frequency: 1ms (high resolution)
- Record Waiting Threads: OFF (focus on CPU time)
- High Frequency: ON
-
Record profiling session:
- Click Record in Instruments
- Load plugin in standalone app or DAW
- Play test audio for 30-60 seconds
- Include various parameter automations
- Stop recording
-
Analyze results:
- Switch to "Call Tree" view
- Enable filters:
- ✅ Separate by Thread
- ✅ Invert Call Tree
- ✅ Hide System Libraries
- Look for hot functions in your code
- Focus on:
processBlock(), DSP algorithm functions
-
Identify bottlenecks:
- Functions taking > 5% of CPU time are candidates for optimization
- Look for:
- Unexpected memory allocations (
malloc,new) - Expensive math operations (use vectorization)
- Inefficient loops
- Cache misses (scattered memory access)
- Unexpected memory allocations (
Example Instruments Output:
Symbol Name % Time
MyPlugin::processBlock() 45.2%
MyFilter::processSample() 28.3%
std::pow() 15.1% ⚠️ Expensive!
MyFilter::updateCoefficients() 13.2%
MyDistortion::process() 16.9%
Red Flags:
std::pow(),std::sin(),std::cos()in inner loops → Use lookup tables or approximations- Memory allocations → Pre-allocate in
prepareToPlay() - Virtual function calls in hot path → Consider static polymorphism
Windows - Using Visual Studio Profiler
-
Start profiling:
- Open Visual Studio
- Debug → Performance Profiler
- Select: CPU Usage
- Start profiling
- Launch DAW and load plugin
- Play audio for 60 seconds
- Stop profiling
-
Analyze:
- View "Hot Path"
- Check "Functions" view sorted by "Total CPU %"
- Drill into
processBlock()
-
Generate report:
- File → Export Report
- Save as
performance-analysis-[date].diagsession
Linux - Using perf
-
Record performance data:
# Profile specific process (find PID of DAW or standalone) perf record -F 999 -g -p <PID> # Or profile command: perf record -F 999 -g ./build/MyPlugin_Standalone # Play audio for 60 seconds, then Ctrl+C to stop -
View results:
# Interactive TUI perf report # Generate flame graph (requires flamegraph tools) git clone https://github.com/brendangregg/FlameGraph perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg open flamegraph.svg # or xdg-open on Linux -
Interpret flame graph:
- Width = CPU time
- Look for wide sections in your code
- Drill down into
processBlock()stack frames
Common Performance Issues and Fixes
Issue 1: Expensive Transcendental Functions
Symptom: std::sin(), std::cos(), std::pow() show up in profiler.
Solution: Use lookup tables or polynomial approximations.
Example:
// ❌ Slow: Direct call in processBlock
float sine = std::sin(phase);
// ✅ Fast: Lookup table
class SineLUT {
static constexpr int tableSize = 2048;
std::array<float, tableSize> table;
public:
SineLUT() {
for (int i = 0; i < tableSize; ++i)
table[i] = std::sin(2.0 * M_PI * i / tableSize);
}
float lookup(float phase) const {
float index = phase * tableSize;
int i0 = static_cast<int>(index) % tableSize;
int i1 = (i0 + 1) % tableSize;
float frac = index - std::floor(index);
return table[i0] + frac * (table[i1] - table[i0]);
}
};
// Use in processBlock:
float sine = sineLUT.lookup(phase);
Speedup: 5-10x faster
Issue 2: Inefficient Memory Access Patterns
Symptom: High cache miss rate, poor vectorization.
Solution: Structure-of-arrays instead of array-of-structures.
Example:
// ❌ Poor cache locality (AoS)
struct Voice {
float frequency, amplitude, phase;
};
std::vector<Voice> voices;
for (auto& voice : voices) {
voice.phase += voice.frequency;
output += voice.amplitude * std::sin(voice.phase);
}
// ✅ Better cache locality (SoA)
struct VoiceBank {
std::vector<float> frequencies;
std::vector<float> amplitudes;
std::vector<float> phases;
};
for (int i = 0; i < voices.phases.size(); ++i) {
voices.phases[i] += voices.frequencies[i];
output += voices.amplitudes[i] * std::sin(voices.phases[i]);
}
Benefit: Better SIMD vectorization, fewer cache misses.
Issue 3: Unnecessary Branching
Symptom: Unpredictable branches in inner loops.
Solution: Branchless code or precompute decisions.
Example:
// ❌ Branch in inner loop
for (int i = 0; i < numSamples; ++i) {
if (bypassEnabled)
output[i] = input[i];
else
output[i] = process(input[i]);
}
// ✅ Branchless or separate loops
if (bypassEnabled) {
std::copy(input, input + numSamples, output);
} else {
for (int i = 0; i < numSamples; ++i)
output[i] = process(input[i]);
}
Issue 4: Virtual Function Calls
Symptom: Virtual dispatch overhead in hot path.
Solution: Static polymorphism (templates) or function pointers.
Example:
// ❌ Virtual function call per sample
class Filter {
public:
virtual float process(float input) = 0;
};
// ✅ Template-based static polymorphism
template<typename FilterType>
class Processor {
FilterType filter;
public:
void processBlock(float* buffer, int numSamples) {
for (int i = 0; i < numSamples; ++i)
buffer[i] = filter.process(buffer[i]); // Inlined!
}
};
Step 3: SIMD Optimization
@dsp-engineer - Leverage SIMD instructions for maximum performance.
Identify Vectorization Opportunities
-
Check if compiler vectorized loops:
macOS/Linux:
# GCC/Clang vectorization report cmake -B build -DCMAKE_CXX_FLAGS="-O3 -fopt-info-vec" cmake --build build 2>&1 | grep vectorizedWindows:
# MSVC vectorization report cmake -B build cmake --build build -- /p:CL="/Qvec-report:2" -
Manual SIMD with JUCE:
JUCE provides cross-platform SIMD abstractions:
#include <juce_dsp/juce_dsp.h> // Example: Process 4 samples at once with SIMD void processBlock(juce::AudioBuffer<float>& buffer) { auto* channelData = buffer.getWritePointer(0); int numSamples = buffer.getNumSamples(); // Process in chunks of 4 (SSE/NEON) using SIMDFloat = juce::dsp::SIMDRegister<float>; constexpr int simdSize = SIMDFloat::size(); int i = 0; for (; i < numSamples - simdSize; i += simdSize) { auto simdInput = SIMDFloat::fromRawArray(channelData + i); auto simdOutput = simdInput * SIMDFloat(gain); // SIMD multiply simdOutput.copyToRawArray(channelData + i); } // Handle remaining samples for (; i < numSamples; ++i) { channelData[i] *= gain; } } -
Benchmark SIMD vs scalar:
# Build with SIMD enabled cmake -B build-simd -DCMAKE_CXX_FLAGS="-O3 -march=native" cmake --build build-simd # Compare performance (see Step 4 below)
Expected Speedup: 2-4x for SIMD-friendly code.
Step 4: Performance Benchmarking
@test-automation-engineer - Quantify performance improvements.
Create Benchmark Tests
// Tests/PerformanceBenchmark.cpp
#include <benchmark/benchmark.h> // Google Benchmark
#include "../Source/PluginProcessor.h"
static void BM_ProcessBlock(benchmark::State& state) {
MyPluginProcessor processor;
processor.setPlayConfigDetails(2, 2, 44100.0, 512);
processor.prepareToPlay(44100.0, 512);
juce::AudioBuffer<float> buffer(2, 512);
juce::MidiBuffer midi;
// Fill with test signal
for (int ch = 0; ch < 2; ++ch)
for (int i = 0; i < 512; ++i)
buffer.setSample(ch, i, std::sin(2 * M_PI * 440 * i / 44100.0));
for (auto _ : state) {
processor.processBlock(buffer, midi);
benchmark::DoNotOptimize(buffer.getReadPointer(0));
}
// Report CPU usage metric
state.SetItemsProcessed(state.iterations() * 512);
}
BENCHMARK(BM_ProcessBlock)->Iterations(10000);
BENCHMARK_MAIN();
Run Benchmarks
# Install Google Benchmark
git clone https://github.com/google/benchmark.git
cd benchmark
cmake -E make_directory "build"
cmake -E chdir "build" cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build "build" --config Release
sudo cmake --build "build" --config Release --target install
# Build and run your benchmarks
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target PerformanceBenchmark
./build/Tests/PerformanceBenchmark --benchmark_out=results.json --benchmark_out_format=json
Example Output:
-----------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------
BM_ProcessBlock 1.23 ms 1.22 ms 571
Interpretation:
- 1.22 ms per block @ 512 samples = ~24% CPU at 44.1kHz (1.22ms / (512/44100 * 1000))
- Goal: < 5% CPU (< 0.29 ms/block)
Compare Before/After Optimization
# Save baseline
./build/Tests/PerformanceBenchmark > baseline.txt
# Make optimizations
# ...
# Compare
./build/Tests/PerformanceBenchmark > optimized.txt
diff baseline.txt optimized.txt
Step 5: UI Performance Profiling
@ui-engineer - Ensure UI doesn't impact audio performance.
Profile UI Rendering
-
Check UI frame rate:
// Add to Editor class MyEditor : public juce::AudioProcessorEditor, private juce::Timer { void timerCallback() override { auto now = juce::Time::getMillisecondCounterHiRes(); double fps = 1000.0 / (now - lastFrameTime); DBG("FPS: " << fps); // Should be 60fps lastFrameTime = now; repaint(); } double lastFrameTime = 0; }; -
Profile with Instruments (macOS):
- Use "Core Animation" template
- Check for:
- Dropped frames (should be 0)
- Expensive drawing operations
- Off-screen rendering
-
Optimize UI:
- Use
repaint()only when needed (not on every audio callback!) - Coalesce repaints:
// ❌ Repaint on every parameter change (60 times/sec from audio thread!) parameterChanged(parameter, newValue) { repaint(); // BAD } // ✅ Rate-limit repaints parameterChanged(parameter, newValue) { startTimer(16); // 60fps max } timerCallback() { stopTimer(); repaint(); } - Cache rendered graphics:
juce::Image cachedBackground; void paint(Graphics& g) { if (cachedBackground.isNull()) { cachedBackground = juce::Image(juce::Image::ARGB, getWidth(), getHeight(), true); Graphics cg(cachedBackground); drawComplexBackground(cg); } g.drawImageAt(cachedBackground, 0, 0); }
- Use
Step 6: Memory Profiling
@test-automation-engineer - Detect memory leaks and excessive allocations.
macOS - Instruments Leaks
- Launch Instruments → Leaks template
- Record session (load/unload plugin multiple times)
- Check for leaks:
- Red flags = memory leaks
- Click to see stack trace of allocation
Linux - Valgrind
# Profile standalone plugin
valgrind --leak-check=full --track-origins=yes ./build/MyPlugin_Standalone
# Play audio, then quit
# Check report for leaks
Windows - Visual Studio Memory Profiler
- Debug → Performance Profiler → Memory Usage
- Take snapshots before and after loading plugin
- Compare snapshots for memory growth
Check for Allocations in Audio Thread
Use -fsanitize=address (Clang/GCC):
cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address -g"
cmake --build build
./build/Tests/MyPluginTests
Look for: Allocations called from processBlock() - these are FORBIDDEN.
Step 7: Advanced Profiling - Tracy
@test-automation-engineer - Use Tracy for frame-perfect profiling.
Tracy is a real-time profiler with nanosecond precision.
Integrate Tracy
// CMakeLists.txt
include(FetchContent)
FetchContent_Declare(
tracy
GIT_REPOSITORY https://github.com/wolfpld/tracy.git
GIT_TAG v0.10
)
FetchContent_MakeAvailable(tracy)
target_link_libraries(MyPlugin PRIVATE TracyClient)
target_compile_definitions(MyPlugin PRIVATE TRACY_ENABLE)
Add Tracy Zones
#include <tracy/Tracy.hpp>
void processBlock(AudioBuffer<float>& buffer, MidiBuffer& midi) {
ZoneScoped; // Automatic profiling for this function
{
ZoneScopedN("Filter Processing");
filter.process(buffer);
}
{
ZoneScopedN("Distortion Processing");
distortion.process(buffer);
}
}
Run Tracy
# Launch Tracy profiler GUI
tracy
# Run plugin in DAW
# Tracy will automatically connect and show real-time profiling
Benefits:
- Real-time visualization
- Frame-by-frame analysis
- Memory allocation tracking
- Lock contention detection
Step 8: Generate Performance Report
@support-engineer - Document findings and recommendations.
Report Template
# Performance Analysis Report - MyPlugin v1.2.0
**Date:** 2024-05-15
**Analyst:** @dsp-engineer
**Platform:** macOS 14.5, Apple M1 Max
## Summary
CPU usage has been reduced from **8.2%** to **2.1%** (74% improvement) through targeted optimizations.
## Profiling Results
### Baseline (v1.1.0)
- Single instance CPU: 8.2% @ 44.1kHz, 512 samples
- 10 instances: 82% CPU (not sustainable)
- Hot path: `std::pow()` in saturation curve (45% of CPU time)
### Optimized (v1.2.0)
- Single instance CPU: 2.1%
- 10 instances: 21% CPU
- Hot path: Vectorized filter processing (18% of CPU time)
## Optimizations Applied
### 1. Replaced `std::pow()` with Lookup Table
**Impact:** 45% CPU reduction
**Location:** `Source/DSP/Saturation.cpp:42`
### 2. SIMD Vectorization of Filter
**Impact:** 15% CPU reduction
**Location:** `Source/DSP/SVFilter.cpp:87`
### 3. Removed Allocation in processBlock
**Impact:** Eliminated RT violations
**Location:** `Source/PluginProcessor.cpp:156`
## Benchmark Results
| Test | Baseline | Optimized | Improvement |
|------|----------|-----------|-------------|
| ProcessBlock (512 samples) | 1.85 ms | 0.48 ms | 74% faster |
| Single instance CPU | 8.2% | 2.1% | 74% reduction |
| 50 instances CPU | 410% | 105% | 74% reduction |
## Remaining Bottlenecks
1. **Reverb Algorithm** - Still using naive implementation (12% CPU)
- Recommendation: Switch to FDN reverb or partitioned convolution
2. **UI Repaints** - Currently 120fps (unnecessary)
- Recommendation: Rate-limit to 60fps
## Next Steps
- [ ] Optimize reverb algorithm (target: 5% CPU)
- [ ] Rate-limit UI repaints (target: 60fps)
- [ ] Profile on Windows (Intel CPU) to verify SIMD portability
- [ ] Run stress test with 100+ instances
## Flame Graphs


## Conclusion
Plugin now meets performance targets for release:
- ✅ Single instance < 5% CPU
- ✅ 20 instances < 50% CPU
- ✅ No RT violations detected
- ⚠️ Further optimization possible in reverb module
Definition of Done
Performance analysis is complete when:
- ✅ Profiling data collected on target platforms
- ✅ Hot paths identified and documented
- ✅ Optimization opportunities prioritized
- ✅ Key optimizations implemented and benchmarked
- ✅ Performance regression tests added
- ✅ Report generated with flame graphs and recommendations
- ✅ CPU usage meets targets (< 5% single instance)
- ✅ No allocations or locks in audio thread
Performance Targets
CPU Usage Goals
| Scenario | Target | Acceptable | Poor |
|---|---|---|---|
| Single instance @ 44.1kHz, 512 samples | < 2% | < 5% | > 10% |
| 10 instances | < 20% | < 40% | > 60% |
| 50 instances | < 50% | < 80% | > 100% |
Latency Goals
| Plugin Type | Target Latency |
|---|---|
| Dynamics (compressor, gate) | 0 samples |
| EQ, filter | 0-64 samples |
| Modulation effects | 0-128 samples |
| Reverb, delay | 0-512 samples |
Memory Usage
- RAM: < 50 MB per instance
- Allocations: 0 in
processBlock()
Quick Profiling Checklist
For rapid performance validation:
# 1. Build optimized
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build
# 2. Profile (macOS)
instruments -t "Time Profiler" ./build/MyPlugin_Standalone
# 3. Check for RT violations
# (Look for malloc/new in processBlock stack traces)
# 4. Benchmark
./build/Tests/PerformanceBenchmark
# 5. Verify targets
# Single instance should be < 5% CPU
Time Required: 30 minutes
Expert Help
Delegate performance tasks:
- @dsp-engineer - Optimize DSP algorithms, implement SIMD
- @test-automation-engineer - Set up benchmarks, run profilers
- @ui-engineer - Optimize UI rendering, fix repainting issues
- @technical-lead - Review architectural performance issues
- @plugin-engineer - Integrate optimizations into build system
Related Documentation
- TESTING_STRATEGY.md - Performance testing in CI/CD
- juce-best-practices skill - Realtime safety guidelines
- dsp-cookbook skill - Optimized DSP algorithms
/run-pluginvalcommand - Validation includes performance tests
Tools Reference
macOS
- Instruments (Xcode) - Time Profiler, Allocations, Leaks
- Activity Monitor - Real-time CPU monitoring
- sample - Command-line profiler:
sample <PID> 10 -f output.txt
Windows
- Visual Studio Profiler - CPU Usage, Memory Usage
- Intel VTune - Advanced profiling, hardware counters
- Windows Performance Analyzer (WPA) - System-wide profiling
Linux
- perf - Linux performance profiler
- Valgrind - Memory profiling, cache profiling
- gprof - GNU profiler
- Tracy - Real-time frame profiler
Cross-Platform
- Tracy Profiler - Real-time, frame-perfect profiling
- Google Benchmark - Microbenchmarking library
- Superluminal - Commercial profiler (excellent for audio plugins)
Remember: "Premature optimization is the root of all evil" - but audio plugins are performance-critical. Profile first, optimize hot paths, and always measure the impact!