--- argument-hint: "[target] [--profiler=] [--benchmark] [--report]" description: "Profile plugin performance with Instruments, perf, VTune, Tracy to identify bottlenecks and optimize DSP algorithms" allowed-tools: Bash, Read, Write, AskUserQuestion model: sonnet --- # /analyze-performance - Performance Profiling and Optimization Profile your JUCE plugin's performance to identify bottlenecks, optimize DSP algorithms, and ensure efficient CPU usage across different scenarios. ## Overview This command guides you through comprehensive performance analysis using profiling tools, benchmarking, and optimization strategies. It helps identify hot paths in DSP code, memory bottlenecks, and inefficient algorithms. ## Syntax ```bash /analyze-performance [target] [--profiler=] [--benchmark] [--report] ``` ### Arguments - `target` (optional): What to profile - `dsp`, `ui`, `full`, or `specific` (default: `dsp`) - `--profiler=`: Profiler to use - `instruments`, `perf`, `vtune`, `tracy`, or `auto` (default: `auto`) - `--benchmark`: Run performance benchmarks and compare against baseline - `--report`: Generate detailed performance report with recommendations ### Examples ```bash # Profile DSP code with platform default profiler /analyze-performance dsp # Profile entire plugin with Instruments (macOS) /analyze-performance full --profiler=instruments # Run benchmarks and generate report /analyze-performance dsp --benchmark --report # Profile UI rendering performance /analyze-performance ui --profiler=instruments ``` ## Instructions ### Step 1: Pre-Profiling Setup **@build-engineer** - Prepare optimized build with profiling symbols. 1. **Build with Release optimization + debug symbols:** macOS (Xcode): ```cmake # CMakeLists.txt if(APPLE) set(CMAKE_BUILD_TYPE RelWithDebInfo) # Disable stripping for profiling set(CMAKE_XCODE_ATTRIBUTE_STRIP_INSTALLED_PRODUCT NO) set(CMAKE_XCODE_ATTRIBUTE_DEPLOYMENT_POSTPROCESSING NO) endif() ``` Build: ```bash cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo cmake --build build --config RelWithDebInfo ``` Windows (Visual Studio): ```bash cmake -B build cmake --build build --config RelWithDebInfo ``` Linux: ```bash cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-g -O2" cmake --build build ``` 2. **Verify optimizations are enabled:** ```bash # macOS: Check optimization flags xcrun otool -v -s __TEXT __text build/MyPlugin.vst3/Contents/MacOS/MyPlugin | grep -A 5 "optimization" # Linux: Check binary for debug symbols and optimization objdump -d build/MyPlugin.vst3 | head -50 ``` 3. **Install profiling tools:** **macOS:** ```bash # Instruments comes with Xcode xcode-select --install # Optional: Tracy profiler brew install tracy ``` **Windows:** ```powershell # Intel VTune (recommended) # Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html # Or Visual Studio Profiler (included with VS) ``` **Linux:** ```bash # perf (Linux perf tool) sudo apt install linux-tools-common linux-tools-generic # Tracy profiler sudo apt install tracy-profiler # OR build from source git clone https://github.com/wolfpld/tracy cd tracy/profiler make ``` --- ## Step 2: DSP Performance Profiling **@dsp-engineer** + **@test-automation-engineer** - Identify DSP bottlenecks. ### Profile Audio Processing #### macOS - Using Instruments 1. **Launch Instruments with plugin:** ```bash # Open standalone plugin in Instruments open -a Instruments # Or profile in DAW: # 1. Launch Instruments # 2. Choose "Time Profiler" template # 3. Click Record # 4. In dropdown, select DAW process (e.g., "Logic Pro") # 5. Load plugin in DAW and play audio ``` 2. **Time Profiler Configuration:** - Template: Time Profiler - Sample Frequency: 1ms (high resolution) - Record Waiting Threads: OFF (focus on CPU time) - High Frequency: ON 3. **Record profiling session:** - Click Record in Instruments - Load plugin in standalone app or DAW - Play test audio for 30-60 seconds - Include various parameter automations - Stop recording 4. **Analyze results:** - Switch to "Call Tree" view - Enable filters: - ✅ Separate by Thread - ✅ Invert Call Tree - ✅ Hide System Libraries - Look for hot functions in your code - **Focus on:** `processBlock()`, DSP algorithm functions 5. **Identify bottlenecks:** - Functions taking > 5% of CPU time are candidates for optimization - Look for: - Unexpected memory allocations (`malloc`, `new`) - Expensive math operations (use vectorization) - Inefficient loops - Cache misses (scattered memory access) **Example Instruments Output:** ``` Symbol Name % Time MyPlugin::processBlock() 45.2% MyFilter::processSample() 28.3% std::pow() 15.1% ⚠️ Expensive! MyFilter::updateCoefficients() 13.2% MyDistortion::process() 16.9% ``` **Red Flags:** - `std::pow()`, `std::sin()`, `std::cos()` in inner loops → Use lookup tables or approximations - Memory allocations → Pre-allocate in `prepareToPlay()` - Virtual function calls in hot path → Consider static polymorphism --- #### Windows - Using Visual Studio Profiler 1. **Start profiling:** - Open Visual Studio - Debug → Performance Profiler - Select: CPU Usage - Start profiling - Launch DAW and load plugin - Play audio for 60 seconds - Stop profiling 2. **Analyze:** - View "Hot Path" - Check "Functions" view sorted by "Total CPU %" - Drill into `processBlock()` 3. **Generate report:** - File → Export Report - Save as `performance-analysis-[date].diagsession` --- #### Linux - Using perf 1. **Record performance data:** ```bash # Profile specific process (find PID of DAW or standalone) perf record -F 999 -g -p # Or profile command: perf record -F 999 -g ./build/MyPlugin_Standalone # Play audio for 60 seconds, then Ctrl+C to stop ``` 2. **View results:** ```bash # Interactive TUI perf report # Generate flame graph (requires flamegraph tools) git clone https://github.com/brendangregg/FlameGraph perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg open flamegraph.svg # or xdg-open on Linux ``` 3. **Interpret flame graph:** - Width = CPU time - Look for wide sections in your code - Drill down into `processBlock()` stack frames --- ### Common Performance Issues and Fixes #### Issue 1: Expensive Transcendental Functions **Symptom:** `std::sin()`, `std::cos()`, `std::pow()` show up in profiler. **Solution:** Use lookup tables or polynomial approximations. **Example:** ```cpp // ❌ Slow: Direct call in processBlock float sine = std::sin(phase); // ✅ Fast: Lookup table class SineLUT { static constexpr int tableSize = 2048; std::array table; public: SineLUT() { for (int i = 0; i < tableSize; ++i) table[i] = std::sin(2.0 * M_PI * i / tableSize); } float lookup(float phase) const { float index = phase * tableSize; int i0 = static_cast(index) % tableSize; int i1 = (i0 + 1) % tableSize; float frac = index - std::floor(index); return table[i0] + frac * (table[i1] - table[i0]); } }; // Use in processBlock: float sine = sineLUT.lookup(phase); ``` **Speedup:** 5-10x faster --- #### Issue 2: Inefficient Memory Access Patterns **Symptom:** High cache miss rate, poor vectorization. **Solution:** Structure-of-arrays instead of array-of-structures. **Example:** ```cpp // ❌ Poor cache locality (AoS) struct Voice { float frequency, amplitude, phase; }; std::vector voices; for (auto& voice : voices) { voice.phase += voice.frequency; output += voice.amplitude * std::sin(voice.phase); } // ✅ Better cache locality (SoA) struct VoiceBank { std::vector frequencies; std::vector amplitudes; std::vector phases; }; for (int i = 0; i < voices.phases.size(); ++i) { voices.phases[i] += voices.frequencies[i]; output += voices.amplitudes[i] * std::sin(voices.phases[i]); } ``` **Benefit:** Better SIMD vectorization, fewer cache misses. --- #### Issue 3: Unnecessary Branching **Symptom:** Unpredictable branches in inner loops. **Solution:** Branchless code or precompute decisions. **Example:** ```cpp // ❌ Branch in inner loop for (int i = 0; i < numSamples; ++i) { if (bypassEnabled) output[i] = input[i]; else output[i] = process(input[i]); } // ✅ Branchless or separate loops if (bypassEnabled) { std::copy(input, input + numSamples, output); } else { for (int i = 0; i < numSamples; ++i) output[i] = process(input[i]); } ``` --- #### Issue 4: Virtual Function Calls **Symptom:** Virtual dispatch overhead in hot path. **Solution:** Static polymorphism (templates) or function pointers. **Example:** ```cpp // ❌ Virtual function call per sample class Filter { public: virtual float process(float input) = 0; }; // ✅ Template-based static polymorphism template class Processor { FilterType filter; public: void processBlock(float* buffer, int numSamples) { for (int i = 0; i < numSamples; ++i) buffer[i] = filter.process(buffer[i]); // Inlined! } }; ``` --- ## Step 3: SIMD Optimization **@dsp-engineer** - Leverage SIMD instructions for maximum performance. ### Identify Vectorization Opportunities 1. **Check if compiler vectorized loops:** **macOS/Linux:** ```bash # GCC/Clang vectorization report cmake -B build -DCMAKE_CXX_FLAGS="-O3 -fopt-info-vec" cmake --build build 2>&1 | grep vectorized ``` **Windows:** ```powershell # MSVC vectorization report cmake -B build cmake --build build -- /p:CL="/Qvec-report:2" ``` 2. **Manual SIMD with JUCE:** JUCE provides cross-platform SIMD abstractions: ```cpp #include // Example: Process 4 samples at once with SIMD void processBlock(juce::AudioBuffer& buffer) { auto* channelData = buffer.getWritePointer(0); int numSamples = buffer.getNumSamples(); // Process in chunks of 4 (SSE/NEON) using SIMDFloat = juce::dsp::SIMDRegister; constexpr int simdSize = SIMDFloat::size(); int i = 0; for (; i < numSamples - simdSize; i += simdSize) { auto simdInput = SIMDFloat::fromRawArray(channelData + i); auto simdOutput = simdInput * SIMDFloat(gain); // SIMD multiply simdOutput.copyToRawArray(channelData + i); } // Handle remaining samples for (; i < numSamples; ++i) { channelData[i] *= gain; } } ``` 3. **Benchmark SIMD vs scalar:** ```bash # Build with SIMD enabled cmake -B build-simd -DCMAKE_CXX_FLAGS="-O3 -march=native" cmake --build build-simd # Compare performance (see Step 4 below) ``` **Expected Speedup:** 2-4x for SIMD-friendly code. --- ## Step 4: Performance Benchmarking **@test-automation-engineer** - Quantify performance improvements. ### Create Benchmark Tests ```cpp // Tests/PerformanceBenchmark.cpp #include // Google Benchmark #include "../Source/PluginProcessor.h" static void BM_ProcessBlock(benchmark::State& state) { MyPluginProcessor processor; processor.setPlayConfigDetails(2, 2, 44100.0, 512); processor.prepareToPlay(44100.0, 512); juce::AudioBuffer buffer(2, 512); juce::MidiBuffer midi; // Fill with test signal for (int ch = 0; ch < 2; ++ch) for (int i = 0; i < 512; ++i) buffer.setSample(ch, i, std::sin(2 * M_PI * 440 * i / 44100.0)); for (auto _ : state) { processor.processBlock(buffer, midi); benchmark::DoNotOptimize(buffer.getReadPointer(0)); } // Report CPU usage metric state.SetItemsProcessed(state.iterations() * 512); } BENCHMARK(BM_ProcessBlock)->Iterations(10000); BENCHMARK_MAIN(); ``` ### Run Benchmarks ```bash # Install Google Benchmark git clone https://github.com/google/benchmark.git cd benchmark cmake -E make_directory "build" cmake -E chdir "build" cmake -DCMAKE_BUILD_TYPE=Release .. cmake --build "build" --config Release sudo cmake --build "build" --config Release --target install # Build and run your benchmarks cmake -B build -DCMAKE_BUILD_TYPE=Release cmake --build build --target PerformanceBenchmark ./build/Tests/PerformanceBenchmark --benchmark_out=results.json --benchmark_out_format=json ``` **Example Output:** ``` ----------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------- BM_ProcessBlock 1.23 ms 1.22 ms 571 ``` **Interpretation:** - 1.22 ms per block @ 512 samples = ~24% CPU at 44.1kHz (1.22ms / (512/44100 * 1000)) - Goal: < 5% CPU (< 0.29 ms/block) ### Compare Before/After Optimization ```bash # Save baseline ./build/Tests/PerformanceBenchmark > baseline.txt # Make optimizations # ... # Compare ./build/Tests/PerformanceBenchmark > optimized.txt diff baseline.txt optimized.txt ``` --- ## Step 5: UI Performance Profiling **@ui-engineer** - Ensure UI doesn't impact audio performance. ### Profile UI Rendering 1. **Check UI frame rate:** ```cpp // Add to Editor class MyEditor : public juce::AudioProcessorEditor, private juce::Timer { void timerCallback() override { auto now = juce::Time::getMillisecondCounterHiRes(); double fps = 1000.0 / (now - lastFrameTime); DBG("FPS: " << fps); // Should be 60fps lastFrameTime = now; repaint(); } double lastFrameTime = 0; }; ``` 2. **Profile with Instruments (macOS):** - Use "Core Animation" template - Check for: - Dropped frames (should be 0) - Expensive drawing operations - Off-screen rendering 3. **Optimize UI:** - **Use `repaint()` only when needed** (not on every audio callback!) - **Coalesce repaints:** ```cpp // ❌ Repaint on every parameter change (60 times/sec from audio thread!) parameterChanged(parameter, newValue) { repaint(); // BAD } // ✅ Rate-limit repaints parameterChanged(parameter, newValue) { startTimer(16); // 60fps max } timerCallback() { stopTimer(); repaint(); } ``` - **Cache rendered graphics:** ```cpp juce::Image cachedBackground; void paint(Graphics& g) { if (cachedBackground.isNull()) { cachedBackground = juce::Image(juce::Image::ARGB, getWidth(), getHeight(), true); Graphics cg(cachedBackground); drawComplexBackground(cg); } g.drawImageAt(cachedBackground, 0, 0); } ``` --- ## Step 6: Memory Profiling **@test-automation-engineer** - Detect memory leaks and excessive allocations. ### macOS - Instruments Leaks 1. **Launch Instruments → Leaks template** 2. **Record session** (load/unload plugin multiple times) 3. **Check for leaks:** - Red flags = memory leaks - Click to see stack trace of allocation ### Linux - Valgrind ```bash # Profile standalone plugin valgrind --leak-check=full --track-origins=yes ./build/MyPlugin_Standalone # Play audio, then quit # Check report for leaks ``` ### Windows - Visual Studio Memory Profiler 1. Debug → Performance Profiler → Memory Usage 2. Take snapshots before and after loading plugin 3. Compare snapshots for memory growth ### Check for Allocations in Audio Thread Use `-fsanitize=address` (Clang/GCC): ```bash cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address -g" cmake --build build ./build/Tests/MyPluginTests ``` **Look for:** Allocations called from `processBlock()` - these are FORBIDDEN. --- ## Step 7: Advanced Profiling - Tracy **@test-automation-engineer** - Use Tracy for frame-perfect profiling. Tracy is a real-time profiler with nanosecond precision. ### Integrate Tracy ```cpp // CMakeLists.txt include(FetchContent) FetchContent_Declare( tracy GIT_REPOSITORY https://github.com/wolfpld/tracy.git GIT_TAG v0.10 ) FetchContent_MakeAvailable(tracy) target_link_libraries(MyPlugin PRIVATE TracyClient) target_compile_definitions(MyPlugin PRIVATE TRACY_ENABLE) ``` ### Add Tracy Zones ```cpp #include void processBlock(AudioBuffer& buffer, MidiBuffer& midi) { ZoneScoped; // Automatic profiling for this function { ZoneScopedN("Filter Processing"); filter.process(buffer); } { ZoneScopedN("Distortion Processing"); distortion.process(buffer); } } ``` ### Run Tracy ```bash # Launch Tracy profiler GUI tracy # Run plugin in DAW # Tracy will automatically connect and show real-time profiling ``` **Benefits:** - Real-time visualization - Frame-by-frame analysis - Memory allocation tracking - Lock contention detection --- ## Step 8: Generate Performance Report **@support-engineer** - Document findings and recommendations. ### Report Template ```markdown # Performance Analysis Report - MyPlugin v1.2.0 **Date:** 2024-05-15 **Analyst:** @dsp-engineer **Platform:** macOS 14.5, Apple M1 Max ## Summary CPU usage has been reduced from **8.2%** to **2.1%** (74% improvement) through targeted optimizations. ## Profiling Results ### Baseline (v1.1.0) - Single instance CPU: 8.2% @ 44.1kHz, 512 samples - 10 instances: 82% CPU (not sustainable) - Hot path: `std::pow()` in saturation curve (45% of CPU time) ### Optimized (v1.2.0) - Single instance CPU: 2.1% - 10 instances: 21% CPU - Hot path: Vectorized filter processing (18% of CPU time) ## Optimizations Applied ### 1. Replaced `std::pow()` with Lookup Table **Impact:** 45% CPU reduction **Location:** `Source/DSP/Saturation.cpp:42` ### 2. SIMD Vectorization of Filter **Impact:** 15% CPU reduction **Location:** `Source/DSP/SVFilter.cpp:87` ### 3. Removed Allocation in processBlock **Impact:** Eliminated RT violations **Location:** `Source/PluginProcessor.cpp:156` ## Benchmark Results | Test | Baseline | Optimized | Improvement | |------|----------|-----------|-------------| | ProcessBlock (512 samples) | 1.85 ms | 0.48 ms | 74% faster | | Single instance CPU | 8.2% | 2.1% | 74% reduction | | 50 instances CPU | 410% | 105% | 74% reduction | ## Remaining Bottlenecks 1. **Reverb Algorithm** - Still using naive implementation (12% CPU) - Recommendation: Switch to FDN reverb or partitioned convolution 2. **UI Repaints** - Currently 120fps (unnecessary) - Recommendation: Rate-limit to 60fps ## Next Steps - [ ] Optimize reverb algorithm (target: 5% CPU) - [ ] Rate-limit UI repaints (target: 60fps) - [ ] Profile on Windows (Intel CPU) to verify SIMD portability - [ ] Run stress test with 100+ instances ## Flame Graphs ![Baseline Flame Graph](flamegraph-baseline.svg) ![Optimized Flame Graph](flamegraph-optimized.svg) ## Conclusion Plugin now meets performance targets for release: - ✅ Single instance < 5% CPU - ✅ 20 instances < 50% CPU - ✅ No RT violations detected - ⚠️ Further optimization possible in reverb module ``` --- ## Definition of Done Performance analysis is complete when: - ✅ Profiling data collected on target platforms - ✅ Hot paths identified and documented - ✅ Optimization opportunities prioritized - ✅ Key optimizations implemented and benchmarked - ✅ Performance regression tests added - ✅ Report generated with flame graphs and recommendations - ✅ CPU usage meets targets (< 5% single instance) - ✅ No allocations or locks in audio thread --- ## Performance Targets ### CPU Usage Goals | Scenario | Target | Acceptable | Poor | |----------|--------|------------|------| | Single instance @ 44.1kHz, 512 samples | < 2% | < 5% | > 10% | | 10 instances | < 20% | < 40% | > 60% | | 50 instances | < 50% | < 80% | > 100% | ### Latency Goals | Plugin Type | Target Latency | |-------------|----------------| | Dynamics (compressor, gate) | 0 samples | | EQ, filter | 0-64 samples | | Modulation effects | 0-128 samples | | Reverb, delay | 0-512 samples | ### Memory Usage - **RAM:** < 50 MB per instance - **Allocations:** 0 in `processBlock()` --- ## Quick Profiling Checklist For rapid performance validation: ```bash # 1. Build optimized cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo cmake --build build # 2. Profile (macOS) instruments -t "Time Profiler" ./build/MyPlugin_Standalone # 3. Check for RT violations # (Look for malloc/new in processBlock stack traces) # 4. Benchmark ./build/Tests/PerformanceBenchmark # 5. Verify targets # Single instance should be < 5% CPU ``` **Time Required:** 30 minutes --- ## Expert Help Delegate performance tasks: - **@dsp-engineer** - Optimize DSP algorithms, implement SIMD - **@test-automation-engineer** - Set up benchmarks, run profilers - **@ui-engineer** - Optimize UI rendering, fix repainting issues - **@technical-lead** - Review architectural performance issues - **@plugin-engineer** - Integrate optimizations into build system --- ## Related Documentation - **TESTING_STRATEGY.md** - Performance testing in CI/CD - **juce-best-practices** skill - Realtime safety guidelines - **dsp-cookbook** skill - Optimized DSP algorithms - `/run-pluginval` command - Validation includes performance tests --- ## Tools Reference ### macOS - **Instruments** (Xcode) - Time Profiler, Allocations, Leaks - **Activity Monitor** - Real-time CPU monitoring - **sample** - Command-line profiler: `sample 10 -f output.txt` ### Windows - **Visual Studio Profiler** - CPU Usage, Memory Usage - **Intel VTune** - Advanced profiling, hardware counters - **Windows Performance Analyzer (WPA)** - System-wide profiling ### Linux - **perf** - Linux performance profiler - **Valgrind** - Memory profiling, cache profiling - **gprof** - GNU profiler - **Tracy** - Real-time frame profiler ### Cross-Platform - **Tracy Profiler** - Real-time, frame-perfect profiling - **Google Benchmark** - Microbenchmarking library - **Superluminal** - Commercial profiler (excellent for audio plugins) --- **Remember:** "Premature optimization is the root of all evil" - but audio plugins are performance-critical. Profile first, optimize hot paths, and always measure the impact!