887 lines
22 KiB
Markdown
887 lines
22 KiB
Markdown
---
|
|
argument-hint: "[target] [--profiler=<tool>] [--benchmark] [--report]"
|
|
description: "Profile plugin performance with Instruments, perf, VTune, Tracy to identify bottlenecks and optimize DSP algorithms"
|
|
allowed-tools: Bash, Read, Write, AskUserQuestion
|
|
model: sonnet
|
|
---
|
|
|
|
# /analyze-performance - Performance Profiling and Optimization
|
|
|
|
Profile your JUCE plugin's performance to identify bottlenecks, optimize DSP algorithms, and ensure efficient CPU usage across different scenarios.
|
|
|
|
## Overview
|
|
|
|
This command guides you through comprehensive performance analysis using profiling tools, benchmarking, and optimization strategies. It helps identify hot paths in DSP code, memory bottlenecks, and inefficient algorithms.
|
|
|
|
## Syntax
|
|
|
|
```bash
|
|
/analyze-performance [target] [--profiler=<tool>] [--benchmark] [--report]
|
|
```
|
|
|
|
### Arguments
|
|
|
|
- `target` (optional): What to profile - `dsp`, `ui`, `full`, or `specific` (default: `dsp`)
|
|
- `--profiler=<tool>`: Profiler to use - `instruments`, `perf`, `vtune`, `tracy`, or `auto` (default: `auto`)
|
|
- `--benchmark`: Run performance benchmarks and compare against baseline
|
|
- `--report`: Generate detailed performance report with recommendations
|
|
|
|
### Examples
|
|
|
|
```bash
|
|
# Profile DSP code with platform default profiler
|
|
/analyze-performance dsp
|
|
|
|
# Profile entire plugin with Instruments (macOS)
|
|
/analyze-performance full --profiler=instruments
|
|
|
|
# Run benchmarks and generate report
|
|
/analyze-performance dsp --benchmark --report
|
|
|
|
# Profile UI rendering performance
|
|
/analyze-performance ui --profiler=instruments
|
|
```
|
|
|
|
## Instructions
|
|
|
|
### Step 1: Pre-Profiling Setup
|
|
|
|
**@build-engineer** - Prepare optimized build with profiling symbols.
|
|
|
|
1. **Build with Release optimization + debug symbols:**
|
|
|
|
macOS (Xcode):
|
|
```cmake
|
|
# CMakeLists.txt
|
|
if(APPLE)
|
|
set(CMAKE_BUILD_TYPE RelWithDebInfo)
|
|
# Disable stripping for profiling
|
|
set(CMAKE_XCODE_ATTRIBUTE_STRIP_INSTALLED_PRODUCT NO)
|
|
set(CMAKE_XCODE_ATTRIBUTE_DEPLOYMENT_POSTPROCESSING NO)
|
|
endif()
|
|
```
|
|
|
|
Build:
|
|
```bash
|
|
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
|
|
cmake --build build --config RelWithDebInfo
|
|
```
|
|
|
|
Windows (Visual Studio):
|
|
```bash
|
|
cmake -B build
|
|
cmake --build build --config RelWithDebInfo
|
|
```
|
|
|
|
Linux:
|
|
```bash
|
|
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-g -O2"
|
|
cmake --build build
|
|
```
|
|
|
|
2. **Verify optimizations are enabled:**
|
|
```bash
|
|
# macOS: Check optimization flags
|
|
xcrun otool -v -s __TEXT __text build/MyPlugin.vst3/Contents/MacOS/MyPlugin | grep -A 5 "optimization"
|
|
|
|
# Linux: Check binary for debug symbols and optimization
|
|
objdump -d build/MyPlugin.vst3 | head -50
|
|
```
|
|
|
|
3. **Install profiling tools:**
|
|
|
|
**macOS:**
|
|
```bash
|
|
# Instruments comes with Xcode
|
|
xcode-select --install
|
|
|
|
# Optional: Tracy profiler
|
|
brew install tracy
|
|
```
|
|
|
|
**Windows:**
|
|
```powershell
|
|
# Intel VTune (recommended)
|
|
# Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
|
|
|
|
# Or Visual Studio Profiler (included with VS)
|
|
```
|
|
|
|
**Linux:**
|
|
```bash
|
|
# perf (Linux perf tool)
|
|
sudo apt install linux-tools-common linux-tools-generic
|
|
|
|
# Tracy profiler
|
|
sudo apt install tracy-profiler
|
|
|
|
# OR build from source
|
|
git clone https://github.com/wolfpld/tracy
|
|
cd tracy/profiler
|
|
make
|
|
```
|
|
|
|
---
|
|
|
|
## Step 2: DSP Performance Profiling
|
|
|
|
**@dsp-engineer** + **@test-automation-engineer** - Identify DSP bottlenecks.
|
|
|
|
### Profile Audio Processing
|
|
|
|
#### macOS - Using Instruments
|
|
|
|
1. **Launch Instruments with plugin:**
|
|
```bash
|
|
# Open standalone plugin in Instruments
|
|
open -a Instruments
|
|
|
|
# Or profile in DAW:
|
|
# 1. Launch Instruments
|
|
# 2. Choose "Time Profiler" template
|
|
# 3. Click Record
|
|
# 4. In dropdown, select DAW process (e.g., "Logic Pro")
|
|
# 5. Load plugin in DAW and play audio
|
|
```
|
|
|
|
2. **Time Profiler Configuration:**
|
|
- Template: Time Profiler
|
|
- Sample Frequency: 1ms (high resolution)
|
|
- Record Waiting Threads: OFF (focus on CPU time)
|
|
- High Frequency: ON
|
|
|
|
3. **Record profiling session:**
|
|
- Click Record in Instruments
|
|
- Load plugin in standalone app or DAW
|
|
- Play test audio for 30-60 seconds
|
|
- Include various parameter automations
|
|
- Stop recording
|
|
|
|
4. **Analyze results:**
|
|
- Switch to "Call Tree" view
|
|
- Enable filters:
|
|
- ✅ Separate by Thread
|
|
- ✅ Invert Call Tree
|
|
- ✅ Hide System Libraries
|
|
- Look for hot functions in your code
|
|
- **Focus on:** `processBlock()`, DSP algorithm functions
|
|
|
|
5. **Identify bottlenecks:**
|
|
- Functions taking > 5% of CPU time are candidates for optimization
|
|
- Look for:
|
|
- Unexpected memory allocations (`malloc`, `new`)
|
|
- Expensive math operations (use vectorization)
|
|
- Inefficient loops
|
|
- Cache misses (scattered memory access)
|
|
|
|
**Example Instruments Output:**
|
|
```
|
|
Symbol Name % Time
|
|
MyPlugin::processBlock() 45.2%
|
|
MyFilter::processSample() 28.3%
|
|
std::pow() 15.1% ⚠️ Expensive!
|
|
MyFilter::updateCoefficients() 13.2%
|
|
MyDistortion::process() 16.9%
|
|
```
|
|
|
|
**Red Flags:**
|
|
- `std::pow()`, `std::sin()`, `std::cos()` in inner loops → Use lookup tables or approximations
|
|
- Memory allocations → Pre-allocate in `prepareToPlay()`
|
|
- Virtual function calls in hot path → Consider static polymorphism
|
|
|
|
---
|
|
|
|
#### Windows - Using Visual Studio Profiler
|
|
|
|
1. **Start profiling:**
|
|
- Open Visual Studio
|
|
- Debug → Performance Profiler
|
|
- Select: CPU Usage
|
|
- Start profiling
|
|
- Launch DAW and load plugin
|
|
- Play audio for 60 seconds
|
|
- Stop profiling
|
|
|
|
2. **Analyze:**
|
|
- View "Hot Path"
|
|
- Check "Functions" view sorted by "Total CPU %"
|
|
- Drill into `processBlock()`
|
|
|
|
3. **Generate report:**
|
|
- File → Export Report
|
|
- Save as `performance-analysis-[date].diagsession`
|
|
|
|
---
|
|
|
|
#### Linux - Using perf
|
|
|
|
1. **Record performance data:**
|
|
```bash
|
|
# Profile specific process (find PID of DAW or standalone)
|
|
perf record -F 999 -g -p <PID>
|
|
|
|
# Or profile command:
|
|
perf record -F 999 -g ./build/MyPlugin_Standalone
|
|
|
|
# Play audio for 60 seconds, then Ctrl+C to stop
|
|
```
|
|
|
|
2. **View results:**
|
|
```bash
|
|
# Interactive TUI
|
|
perf report
|
|
|
|
# Generate flame graph (requires flamegraph tools)
|
|
git clone https://github.com/brendangregg/FlameGraph
|
|
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg
|
|
open flamegraph.svg # or xdg-open on Linux
|
|
```
|
|
|
|
3. **Interpret flame graph:**
|
|
- Width = CPU time
|
|
- Look for wide sections in your code
|
|
- Drill down into `processBlock()` stack frames
|
|
|
|
---
|
|
|
|
### Common Performance Issues and Fixes
|
|
|
|
#### Issue 1: Expensive Transcendental Functions
|
|
|
|
**Symptom:** `std::sin()`, `std::cos()`, `std::pow()` show up in profiler.
|
|
|
|
**Solution:** Use lookup tables or polynomial approximations.
|
|
|
|
**Example:**
|
|
```cpp
|
|
// ❌ Slow: Direct call in processBlock
|
|
float sine = std::sin(phase);
|
|
|
|
// ✅ Fast: Lookup table
|
|
class SineLUT {
|
|
static constexpr int tableSize = 2048;
|
|
std::array<float, tableSize> table;
|
|
|
|
public:
|
|
SineLUT() {
|
|
for (int i = 0; i < tableSize; ++i)
|
|
table[i] = std::sin(2.0 * M_PI * i / tableSize);
|
|
}
|
|
|
|
float lookup(float phase) const {
|
|
float index = phase * tableSize;
|
|
int i0 = static_cast<int>(index) % tableSize;
|
|
int i1 = (i0 + 1) % tableSize;
|
|
float frac = index - std::floor(index);
|
|
return table[i0] + frac * (table[i1] - table[i0]);
|
|
}
|
|
};
|
|
|
|
// Use in processBlock:
|
|
float sine = sineLUT.lookup(phase);
|
|
```
|
|
|
|
**Speedup:** 5-10x faster
|
|
|
|
---
|
|
|
|
#### Issue 2: Inefficient Memory Access Patterns
|
|
|
|
**Symptom:** High cache miss rate, poor vectorization.
|
|
|
|
**Solution:** Structure-of-arrays instead of array-of-structures.
|
|
|
|
**Example:**
|
|
```cpp
|
|
// ❌ Poor cache locality (AoS)
|
|
struct Voice {
|
|
float frequency, amplitude, phase;
|
|
};
|
|
std::vector<Voice> voices;
|
|
|
|
for (auto& voice : voices) {
|
|
voice.phase += voice.frequency;
|
|
output += voice.amplitude * std::sin(voice.phase);
|
|
}
|
|
|
|
// ✅ Better cache locality (SoA)
|
|
struct VoiceBank {
|
|
std::vector<float> frequencies;
|
|
std::vector<float> amplitudes;
|
|
std::vector<float> phases;
|
|
};
|
|
|
|
for (int i = 0; i < voices.phases.size(); ++i) {
|
|
voices.phases[i] += voices.frequencies[i];
|
|
output += voices.amplitudes[i] * std::sin(voices.phases[i]);
|
|
}
|
|
```
|
|
|
|
**Benefit:** Better SIMD vectorization, fewer cache misses.
|
|
|
|
---
|
|
|
|
#### Issue 3: Unnecessary Branching
|
|
|
|
**Symptom:** Unpredictable branches in inner loops.
|
|
|
|
**Solution:** Branchless code or precompute decisions.
|
|
|
|
**Example:**
|
|
```cpp
|
|
// ❌ Branch in inner loop
|
|
for (int i = 0; i < numSamples; ++i) {
|
|
if (bypassEnabled)
|
|
output[i] = input[i];
|
|
else
|
|
output[i] = process(input[i]);
|
|
}
|
|
|
|
// ✅ Branchless or separate loops
|
|
if (bypassEnabled) {
|
|
std::copy(input, input + numSamples, output);
|
|
} else {
|
|
for (int i = 0; i < numSamples; ++i)
|
|
output[i] = process(input[i]);
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
#### Issue 4: Virtual Function Calls
|
|
|
|
**Symptom:** Virtual dispatch overhead in hot path.
|
|
|
|
**Solution:** Static polymorphism (templates) or function pointers.
|
|
|
|
**Example:**
|
|
```cpp
|
|
// ❌ Virtual function call per sample
|
|
class Filter {
|
|
public:
|
|
virtual float process(float input) = 0;
|
|
};
|
|
|
|
// ✅ Template-based static polymorphism
|
|
template<typename FilterType>
|
|
class Processor {
|
|
FilterType filter;
|
|
public:
|
|
void processBlock(float* buffer, int numSamples) {
|
|
for (int i = 0; i < numSamples; ++i)
|
|
buffer[i] = filter.process(buffer[i]); // Inlined!
|
|
}
|
|
};
|
|
```
|
|
|
|
---
|
|
|
|
## Step 3: SIMD Optimization
|
|
|
|
**@dsp-engineer** - Leverage SIMD instructions for maximum performance.
|
|
|
|
### Identify Vectorization Opportunities
|
|
|
|
1. **Check if compiler vectorized loops:**
|
|
|
|
**macOS/Linux:**
|
|
```bash
|
|
# GCC/Clang vectorization report
|
|
cmake -B build -DCMAKE_CXX_FLAGS="-O3 -fopt-info-vec"
|
|
cmake --build build 2>&1 | grep vectorized
|
|
```
|
|
|
|
**Windows:**
|
|
```powershell
|
|
# MSVC vectorization report
|
|
cmake -B build
|
|
cmake --build build -- /p:CL="/Qvec-report:2"
|
|
```
|
|
|
|
2. **Manual SIMD with JUCE:**
|
|
|
|
JUCE provides cross-platform SIMD abstractions:
|
|
|
|
```cpp
|
|
#include <juce_dsp/juce_dsp.h>
|
|
|
|
// Example: Process 4 samples at once with SIMD
|
|
void processBlock(juce::AudioBuffer<float>& buffer) {
|
|
auto* channelData = buffer.getWritePointer(0);
|
|
int numSamples = buffer.getNumSamples();
|
|
|
|
// Process in chunks of 4 (SSE/NEON)
|
|
using SIMDFloat = juce::dsp::SIMDRegister<float>;
|
|
constexpr int simdSize = SIMDFloat::size();
|
|
|
|
int i = 0;
|
|
for (; i < numSamples - simdSize; i += simdSize) {
|
|
auto simdInput = SIMDFloat::fromRawArray(channelData + i);
|
|
auto simdOutput = simdInput * SIMDFloat(gain); // SIMD multiply
|
|
simdOutput.copyToRawArray(channelData + i);
|
|
}
|
|
|
|
// Handle remaining samples
|
|
for (; i < numSamples; ++i) {
|
|
channelData[i] *= gain;
|
|
}
|
|
}
|
|
```
|
|
|
|
3. **Benchmark SIMD vs scalar:**
|
|
```bash
|
|
# Build with SIMD enabled
|
|
cmake -B build-simd -DCMAKE_CXX_FLAGS="-O3 -march=native"
|
|
cmake --build build-simd
|
|
|
|
# Compare performance (see Step 4 below)
|
|
```
|
|
|
|
**Expected Speedup:** 2-4x for SIMD-friendly code.
|
|
|
|
---
|
|
|
|
## Step 4: Performance Benchmarking
|
|
|
|
**@test-automation-engineer** - Quantify performance improvements.
|
|
|
|
### Create Benchmark Tests
|
|
|
|
```cpp
|
|
// Tests/PerformanceBenchmark.cpp
|
|
#include <benchmark/benchmark.h> // Google Benchmark
|
|
#include "../Source/PluginProcessor.h"
|
|
|
|
static void BM_ProcessBlock(benchmark::State& state) {
|
|
MyPluginProcessor processor;
|
|
processor.setPlayConfigDetails(2, 2, 44100.0, 512);
|
|
processor.prepareToPlay(44100.0, 512);
|
|
|
|
juce::AudioBuffer<float> buffer(2, 512);
|
|
juce::MidiBuffer midi;
|
|
|
|
// Fill with test signal
|
|
for (int ch = 0; ch < 2; ++ch)
|
|
for (int i = 0; i < 512; ++i)
|
|
buffer.setSample(ch, i, std::sin(2 * M_PI * 440 * i / 44100.0));
|
|
|
|
for (auto _ : state) {
|
|
processor.processBlock(buffer, midi);
|
|
benchmark::DoNotOptimize(buffer.getReadPointer(0));
|
|
}
|
|
|
|
// Report CPU usage metric
|
|
state.SetItemsProcessed(state.iterations() * 512);
|
|
}
|
|
|
|
BENCHMARK(BM_ProcessBlock)->Iterations(10000);
|
|
|
|
BENCHMARK_MAIN();
|
|
```
|
|
|
|
### Run Benchmarks
|
|
|
|
```bash
|
|
# Install Google Benchmark
|
|
git clone https://github.com/google/benchmark.git
|
|
cd benchmark
|
|
cmake -E make_directory "build"
|
|
cmake -E chdir "build" cmake -DCMAKE_BUILD_TYPE=Release ..
|
|
cmake --build "build" --config Release
|
|
sudo cmake --build "build" --config Release --target install
|
|
|
|
# Build and run your benchmarks
|
|
cmake -B build -DCMAKE_BUILD_TYPE=Release
|
|
cmake --build build --target PerformanceBenchmark
|
|
./build/Tests/PerformanceBenchmark --benchmark_out=results.json --benchmark_out_format=json
|
|
```
|
|
|
|
**Example Output:**
|
|
```
|
|
-----------------------------------------------------------------
|
|
Benchmark Time CPU Iterations
|
|
-----------------------------------------------------------------
|
|
BM_ProcessBlock 1.23 ms 1.22 ms 571
|
|
```
|
|
|
|
**Interpretation:**
|
|
- 1.22 ms per block @ 512 samples = ~24% CPU at 44.1kHz (1.22ms / (512/44100 * 1000))
|
|
- Goal: < 5% CPU (< 0.29 ms/block)
|
|
|
|
### Compare Before/After Optimization
|
|
|
|
```bash
|
|
# Save baseline
|
|
./build/Tests/PerformanceBenchmark > baseline.txt
|
|
|
|
# Make optimizations
|
|
# ...
|
|
|
|
# Compare
|
|
./build/Tests/PerformanceBenchmark > optimized.txt
|
|
diff baseline.txt optimized.txt
|
|
```
|
|
|
|
---
|
|
|
|
## Step 5: UI Performance Profiling
|
|
|
|
**@ui-engineer** - Ensure UI doesn't impact audio performance.
|
|
|
|
### Profile UI Rendering
|
|
|
|
1. **Check UI frame rate:**
|
|
```cpp
|
|
// Add to Editor
|
|
class MyEditor : public juce::AudioProcessorEditor, private juce::Timer {
|
|
void timerCallback() override {
|
|
auto now = juce::Time::getMillisecondCounterHiRes();
|
|
double fps = 1000.0 / (now - lastFrameTime);
|
|
DBG("FPS: " << fps); // Should be 60fps
|
|
lastFrameTime = now;
|
|
repaint();
|
|
}
|
|
|
|
double lastFrameTime = 0;
|
|
};
|
|
```
|
|
|
|
2. **Profile with Instruments (macOS):**
|
|
- Use "Core Animation" template
|
|
- Check for:
|
|
- Dropped frames (should be 0)
|
|
- Expensive drawing operations
|
|
- Off-screen rendering
|
|
|
|
3. **Optimize UI:**
|
|
- **Use `repaint()` only when needed** (not on every audio callback!)
|
|
- **Coalesce repaints:**
|
|
```cpp
|
|
// ❌ Repaint on every parameter change (60 times/sec from audio thread!)
|
|
parameterChanged(parameter, newValue) {
|
|
repaint(); // BAD
|
|
}
|
|
|
|
// ✅ Rate-limit repaints
|
|
parameterChanged(parameter, newValue) {
|
|
startTimer(16); // 60fps max
|
|
}
|
|
|
|
timerCallback() {
|
|
stopTimer();
|
|
repaint();
|
|
}
|
|
```
|
|
- **Cache rendered graphics:**
|
|
```cpp
|
|
juce::Image cachedBackground;
|
|
|
|
void paint(Graphics& g) {
|
|
if (cachedBackground.isNull()) {
|
|
cachedBackground = juce::Image(juce::Image::ARGB, getWidth(), getHeight(), true);
|
|
Graphics cg(cachedBackground);
|
|
drawComplexBackground(cg);
|
|
}
|
|
g.drawImageAt(cachedBackground, 0, 0);
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Step 6: Memory Profiling
|
|
|
|
**@test-automation-engineer** - Detect memory leaks and excessive allocations.
|
|
|
|
### macOS - Instruments Leaks
|
|
|
|
1. **Launch Instruments → Leaks template**
|
|
2. **Record session** (load/unload plugin multiple times)
|
|
3. **Check for leaks:**
|
|
- Red flags = memory leaks
|
|
- Click to see stack trace of allocation
|
|
|
|
### Linux - Valgrind
|
|
|
|
```bash
|
|
# Profile standalone plugin
|
|
valgrind --leak-check=full --track-origins=yes ./build/MyPlugin_Standalone
|
|
|
|
# Play audio, then quit
|
|
# Check report for leaks
|
|
```
|
|
|
|
### Windows - Visual Studio Memory Profiler
|
|
|
|
1. Debug → Performance Profiler → Memory Usage
|
|
2. Take snapshots before and after loading plugin
|
|
3. Compare snapshots for memory growth
|
|
|
|
### Check for Allocations in Audio Thread
|
|
|
|
Use `-fsanitize=address` (Clang/GCC):
|
|
```bash
|
|
cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address -g"
|
|
cmake --build build
|
|
./build/Tests/MyPluginTests
|
|
```
|
|
|
|
**Look for:** Allocations called from `processBlock()` - these are FORBIDDEN.
|
|
|
|
---
|
|
|
|
## Step 7: Advanced Profiling - Tracy
|
|
|
|
**@test-automation-engineer** - Use Tracy for frame-perfect profiling.
|
|
|
|
Tracy is a real-time profiler with nanosecond precision.
|
|
|
|
### Integrate Tracy
|
|
|
|
```cpp
|
|
// CMakeLists.txt
|
|
include(FetchContent)
|
|
FetchContent_Declare(
|
|
tracy
|
|
GIT_REPOSITORY https://github.com/wolfpld/tracy.git
|
|
GIT_TAG v0.10
|
|
)
|
|
FetchContent_MakeAvailable(tracy)
|
|
|
|
target_link_libraries(MyPlugin PRIVATE TracyClient)
|
|
target_compile_definitions(MyPlugin PRIVATE TRACY_ENABLE)
|
|
```
|
|
|
|
### Add Tracy Zones
|
|
|
|
```cpp
|
|
#include <tracy/Tracy.hpp>
|
|
|
|
void processBlock(AudioBuffer<float>& buffer, MidiBuffer& midi) {
|
|
ZoneScoped; // Automatic profiling for this function
|
|
|
|
{
|
|
ZoneScopedN("Filter Processing");
|
|
filter.process(buffer);
|
|
}
|
|
|
|
{
|
|
ZoneScopedN("Distortion Processing");
|
|
distortion.process(buffer);
|
|
}
|
|
}
|
|
```
|
|
|
|
### Run Tracy
|
|
|
|
```bash
|
|
# Launch Tracy profiler GUI
|
|
tracy
|
|
|
|
# Run plugin in DAW
|
|
# Tracy will automatically connect and show real-time profiling
|
|
```
|
|
|
|
**Benefits:**
|
|
- Real-time visualization
|
|
- Frame-by-frame analysis
|
|
- Memory allocation tracking
|
|
- Lock contention detection
|
|
|
|
---
|
|
|
|
## Step 8: Generate Performance Report
|
|
|
|
**@support-engineer** - Document findings and recommendations.
|
|
|
|
### Report Template
|
|
|
|
```markdown
|
|
# Performance Analysis Report - MyPlugin v1.2.0
|
|
|
|
**Date:** 2024-05-15
|
|
**Analyst:** @dsp-engineer
|
|
**Platform:** macOS 14.5, Apple M1 Max
|
|
|
|
## Summary
|
|
|
|
CPU usage has been reduced from **8.2%** to **2.1%** (74% improvement) through targeted optimizations.
|
|
|
|
## Profiling Results
|
|
|
|
### Baseline (v1.1.0)
|
|
- Single instance CPU: 8.2% @ 44.1kHz, 512 samples
|
|
- 10 instances: 82% CPU (not sustainable)
|
|
- Hot path: `std::pow()` in saturation curve (45% of CPU time)
|
|
|
|
### Optimized (v1.2.0)
|
|
- Single instance CPU: 2.1%
|
|
- 10 instances: 21% CPU
|
|
- Hot path: Vectorized filter processing (18% of CPU time)
|
|
|
|
## Optimizations Applied
|
|
|
|
### 1. Replaced `std::pow()` with Lookup Table
|
|
**Impact:** 45% CPU reduction
|
|
**Location:** `Source/DSP/Saturation.cpp:42`
|
|
|
|
### 2. SIMD Vectorization of Filter
|
|
**Impact:** 15% CPU reduction
|
|
**Location:** `Source/DSP/SVFilter.cpp:87`
|
|
|
|
### 3. Removed Allocation in processBlock
|
|
**Impact:** Eliminated RT violations
|
|
**Location:** `Source/PluginProcessor.cpp:156`
|
|
|
|
## Benchmark Results
|
|
|
|
| Test | Baseline | Optimized | Improvement |
|
|
|------|----------|-----------|-------------|
|
|
| ProcessBlock (512 samples) | 1.85 ms | 0.48 ms | 74% faster |
|
|
| Single instance CPU | 8.2% | 2.1% | 74% reduction |
|
|
| 50 instances CPU | 410% | 105% | 74% reduction |
|
|
|
|
## Remaining Bottlenecks
|
|
|
|
1. **Reverb Algorithm** - Still using naive implementation (12% CPU)
|
|
- Recommendation: Switch to FDN reverb or partitioned convolution
|
|
2. **UI Repaints** - Currently 120fps (unnecessary)
|
|
- Recommendation: Rate-limit to 60fps
|
|
|
|
## Next Steps
|
|
|
|
- [ ] Optimize reverb algorithm (target: 5% CPU)
|
|
- [ ] Rate-limit UI repaints (target: 60fps)
|
|
- [ ] Profile on Windows (Intel CPU) to verify SIMD portability
|
|
- [ ] Run stress test with 100+ instances
|
|
|
|
## Flame Graphs
|
|
|
|

|
|

|
|
|
|
## Conclusion
|
|
|
|
Plugin now meets performance targets for release:
|
|
- ✅ Single instance < 5% CPU
|
|
- ✅ 20 instances < 50% CPU
|
|
- ✅ No RT violations detected
|
|
- ⚠️ Further optimization possible in reverb module
|
|
```
|
|
|
|
---
|
|
|
|
## Definition of Done
|
|
|
|
Performance analysis is complete when:
|
|
|
|
- ✅ Profiling data collected on target platforms
|
|
- ✅ Hot paths identified and documented
|
|
- ✅ Optimization opportunities prioritized
|
|
- ✅ Key optimizations implemented and benchmarked
|
|
- ✅ Performance regression tests added
|
|
- ✅ Report generated with flame graphs and recommendations
|
|
- ✅ CPU usage meets targets (< 5% single instance)
|
|
- ✅ No allocations or locks in audio thread
|
|
|
|
---
|
|
|
|
## Performance Targets
|
|
|
|
### CPU Usage Goals
|
|
|
|
| Scenario | Target | Acceptable | Poor |
|
|
|----------|--------|------------|------|
|
|
| Single instance @ 44.1kHz, 512 samples | < 2% | < 5% | > 10% |
|
|
| 10 instances | < 20% | < 40% | > 60% |
|
|
| 50 instances | < 50% | < 80% | > 100% |
|
|
|
|
### Latency Goals
|
|
|
|
| Plugin Type | Target Latency |
|
|
|-------------|----------------|
|
|
| Dynamics (compressor, gate) | 0 samples |
|
|
| EQ, filter | 0-64 samples |
|
|
| Modulation effects | 0-128 samples |
|
|
| Reverb, delay | 0-512 samples |
|
|
|
|
### Memory Usage
|
|
|
|
- **RAM:** < 50 MB per instance
|
|
- **Allocations:** 0 in `processBlock()`
|
|
|
|
---
|
|
|
|
## Quick Profiling Checklist
|
|
|
|
For rapid performance validation:
|
|
|
|
```bash
|
|
# 1. Build optimized
|
|
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
|
|
cmake --build build
|
|
|
|
# 2. Profile (macOS)
|
|
instruments -t "Time Profiler" ./build/MyPlugin_Standalone
|
|
|
|
# 3. Check for RT violations
|
|
# (Look for malloc/new in processBlock stack traces)
|
|
|
|
# 4. Benchmark
|
|
./build/Tests/PerformanceBenchmark
|
|
|
|
# 5. Verify targets
|
|
# Single instance should be < 5% CPU
|
|
```
|
|
|
|
**Time Required:** 30 minutes
|
|
|
|
---
|
|
|
|
## Expert Help
|
|
|
|
Delegate performance tasks:
|
|
|
|
- **@dsp-engineer** - Optimize DSP algorithms, implement SIMD
|
|
- **@test-automation-engineer** - Set up benchmarks, run profilers
|
|
- **@ui-engineer** - Optimize UI rendering, fix repainting issues
|
|
- **@technical-lead** - Review architectural performance issues
|
|
- **@plugin-engineer** - Integrate optimizations into build system
|
|
|
|
---
|
|
|
|
## Related Documentation
|
|
|
|
- **TESTING_STRATEGY.md** - Performance testing in CI/CD
|
|
- **juce-best-practices** skill - Realtime safety guidelines
|
|
- **dsp-cookbook** skill - Optimized DSP algorithms
|
|
- `/run-pluginval` command - Validation includes performance tests
|
|
|
|
---
|
|
|
|
## Tools Reference
|
|
|
|
### macOS
|
|
- **Instruments** (Xcode) - Time Profiler, Allocations, Leaks
|
|
- **Activity Monitor** - Real-time CPU monitoring
|
|
- **sample** - Command-line profiler: `sample <PID> 10 -f output.txt`
|
|
|
|
### Windows
|
|
- **Visual Studio Profiler** - CPU Usage, Memory Usage
|
|
- **Intel VTune** - Advanced profiling, hardware counters
|
|
- **Windows Performance Analyzer (WPA)** - System-wide profiling
|
|
|
|
### Linux
|
|
- **perf** - Linux performance profiler
|
|
- **Valgrind** - Memory profiling, cache profiling
|
|
- **gprof** - GNU profiler
|
|
- **Tracy** - Real-time frame profiler
|
|
|
|
### Cross-Platform
|
|
- **Tracy Profiler** - Real-time, frame-perfect profiling
|
|
- **Google Benchmark** - Microbenchmarking library
|
|
- **Superluminal** - Commercial profiler (excellent for audio plugins)
|
|
|
|
---
|
|
|
|
**Remember:** "Premature optimization is the root of all evil" - but audio plugins are performance-critical. Profile first, optimize hot paths, and always measure the impact!
|