---
name: performance-testing-fundamentals
description: Use when starting performance testing, choosing load testing tools, interpreting performance metrics, debugging slow applications, or establishing performance baselines - provides decision frameworks and anti-patterns for load, stress, spike, and soak testing
---

# Performance Testing Fundamentals

## Overview

**Core principle:** Diagnose first, test second. Performance testing without understanding your bottlenecks wastes time.

**Rule:** Define SLAs before testing. You can't judge "good" performance without requirements.

## When NOT to Performance Test

Performance test only AFTER:
- ✅ Defining performance SLAs (latency, throughput, error rate targets)
- ✅ Profiling current bottlenecks (APM, database logs, profiling)
- ✅ Fixing obvious issues (missing indexes, N+1 queries, inefficient algorithms)

**Don't performance test to find problems** - use profiling/APM for that. Performance test to verify fixes and validate capacity.

## Tool Selection Decision Tree

| Your Constraint | Choose | Why |
|----------------|--------|-----|
| CI/CD integration, JavaScript team | **k6** | Modern, code-as-config, easy CI integration |
| Complex scenarios, enterprise, mature ecosystem | **JMeter** | GUI, plugins, every protocol |
| High throughput (10k+ RPS), Scala team | **Gatling** | Built for scale, excellent reports |
| Quick HTTP benchmark, no complex scenarios | **Apache Bench (ab)** or **wrk** | Command-line, no setup |
| Cloud-based, don't want infrastructure | **BlazeMeter**, **Loader.io** | SaaS, pay-per-use |
| Realistic browser testing (JS rendering) | **Playwright** + **k6** | Hybrid: Playwright for UX, k6 for load |

**For most teams:** k6 (modern, scriptable) or JMeter (mature, GUI)

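For orientation, here is a minimal k6 script sketch matching the recommendation above; the endpoint, VU count, and duration are placeholder assumptions, not targets:

```typescript
// Minimal k6 load test sketch. Recent k6 releases run TypeScript directly;
// for older versions, keep it as plain JavaScript.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,          // 50 concurrent virtual users (placeholder)
  duration: '15m',  // sustain load for the test window
};

export default function () {
  // Placeholder endpoint - point at staging, never at production first
  const res = http.get('https://staging.example.com/api/items');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time between requests
}
```

JMeter covers the same ground with GUI-configured thread groups and samplers instead of code.
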
## Test Type Quick Reference

| Test Type | Purpose | Duration | Load Pattern | Use When |
|-----------|---------|----------|--------------|----------|
| **Load Test** | Verify normal operations under expected load | 15-30 min | Steady (ramp to target, sustain) | Baseline validation, regression testing |
| **Stress Test** | Find breaking point | 5-15 min | Increasing (ramp until failure) | Capacity planning, finding limits |
| **Spike Test** | Test sudden traffic surge | 2-5 min | Instant jump (0 → peak) | Black Friday prep, auto-scaling validation |
| **Soak Test** | Find memory leaks, connection pool exhaustion | 2-8 hours | Steady sustained load | Pre-production validation, stability check |

**Start with Load Test** (validates baseline), then Stress/Spike (finds limits), finally Soak (validates stability).

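In k6, these load patterns are expressed with `options.stages`. A sketch with placeholder targets and durations; in a real script only one of these objects would be exported as `options`:

```typescript
// Load test: ramp to expected concurrency, hold, ramp down
export const loadTestOptions = {
  stages: [
    { duration: '5m', target: 100 },   // ramp up
    { duration: '20m', target: 100 },  // sustain expected load
    { duration: '5m', target: 0 },     // ramp down
  ],
};

// Stress test: keep stepping up until the system breaks
export const stressTestOptions = {
  stages: [
    { duration: '5m', target: 100 },
    { duration: '5m', target: 300 },
    { duration: '5m', target: 600 },   // keep adding steps until failure
  ],
};

// Spike test: near-instant jump from idle to peak
export const spikeTestOptions = {
  stages: [
    { duration: '30s', target: 1000 }, // sudden surge
    { duration: '3m', target: 1000 },  // hold at peak
    { duration: '30s', target: 0 },    // drop back
  ],
};

// Soak test: moderate load held for hours (watch memory and connections)
export const soakTestOptions = {
  stages: [
    { duration: '5m', target: 100 },
    { duration: '4h', target: 100 },
    { duration: '5m', target: 0 },
  ],
};
```
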
## Anti-Patterns Catalog

### ❌ Premature Load Testing

**Symptom:** "App is slow, let's load test it"

**Why bad:** Load testing reveals "it's slow under load" but not WHY or WHERE

**Fix:** Profile first (APM, database slow query logs, profiler), fix obvious bottlenecks, THEN load test to validate

---

### ❌ Testing Without SLAs

**Symptom:** "My API handles 100 RPS with 200ms average latency. Is that good?"

**Why bad:** Can't judge "good" without requirements. A gaming API needs <50ms; batch processing tolerates 2s.

**Fix:** Define SLAs first (the sketch below shows them encoded as k6 thresholds):
- Target latency: P95 < 300ms, P99 < 500ms
- Target throughput: 500 RPS at peak
- Max error rate: < 0.1%

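If you use k6, these SLAs can be encoded as thresholds so every run is judged against them automatically; a sketch using the example targets above (the throughput target is driven by the load profile rather than a threshold):

```typescript
export const options = {
  thresholds: {
    // Latency SLA: P95 < 300ms, P99 < 500ms
    http_req_duration: ['p(95)<300', 'p(99)<500'],
    // Error rate SLA: < 0.1% failed requests
    http_req_failed: ['rate<0.001'],
  },
};
```
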
---

### ❌ Unrealistic SLAs

**Symptom:** "Our database-backed CRUD API with complex joins must have P95 < 10ms"

**Why bad:** Sets impossible targets. Database round-trip alone is often 5-20ms. Forces wasted optimization or architectural rewrites.

**Fix:** Compare against Performance Benchmarks table (see below). If target is 10x better than benchmark, profile current performance first, then negotiate realistic SLA based on what's achievable vs cost of optimization.

---

### ❌ Vanity Metrics

**Symptom:** Reporting only average response time

**Why bad:** Average hides tail latency. 99% of requests at 100ms + 1% at 10s = "average 200ms" looks fine, but users experience 10s delays.

**Fix:** Always report percentiles (see the sketch below):
- P50 (median) - typical user experience
- P95 - most users
- P99 - worst case for a significant minority
- Max - outliers

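To make the difference concrete, a small sketch computing these percentiles from raw latency samples (load tools report them for you; the nearest-rank method and the sample data here are illustrative):

```typescript
// Nearest-rank percentile: the value that p% of samples fall at or below.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Hypothetical samples: 98 requests at 100ms, 2 stuck at 10s
const samples = [...Array(98).fill(100), 10_000, 10_000];

const avg = samples.reduce((a, b) => a + b, 0) / samples.length;
console.log(avg);                     // 298   - the "looks fine" average
console.log(percentile(samples, 50)); // 100   - typical user
console.log(percentile(samples, 95)); // 100   - most users
console.log(percentile(samples, 99)); // 10000 - what your slowest users actually hit
```
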
---

### ❌ Load Testing in Production First

**Symptom:** "Let's test capacity by running load tests against production"

**Why bad:** Risks outages, contaminates real metrics, can trigger alerts/costs

**Fix:** Test in staging environment that mirrors production (same DB size, network latency, resource limits)

---

### ❌ Single-User "Load" Tests

**Symptom:** Running one user hitting the API as fast as possible

**Why bad:** Doesn't simulate realistic concurrency, misses resource contention (database connections, thread pools)

**Fix:** Simulate realistic concurrent users with realistic think time between requests

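A sketch of what realistic concurrency with think time looks like in k6 (the endpoints, VU count, and sleep range are illustrative):

```typescript
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 200,        // many concurrent virtual users, not one hot loop
  duration: '30m',
};

export default function () {
  // Placeholder endpoints for a browse-then-add-to-cart flow
  http.get('https://staging.example.com/api/products');
  sleep(Math.random() * 4 + 1); // 1-5s think time, like a real user reading the page

  http.post(
    'https://staging.example.com/api/cart',
    JSON.stringify({ productId: 42 }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  sleep(Math.random() * 4 + 1);
}
```
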
## Metrics Glossary

| Metric | Definition | Good Threshold (typical web API) |
|--------|------------|----------------------------------|
| **RPS** (Requests/Second) | Throughput - how many requests processed | Varies by app; know your peak |
| **Latency** | Time from request to response | P95 < 300ms, P99 < 500ms |
| **P50 (Median)** | 50% of requests faster than this | P50 < 100ms |
| **P95** | 95% of requests faster than this | P95 < 300ms |
| **P99** | 99% of requests faster than this | P99 < 500ms |
| **Error Rate** | % of 4xx/5xx responses | < 0.1% |
| **Throughput** | Data transferred per second (MB/s) | Depends on payload size |
| **Concurrent Users** | Active users at same time | Calculate from traffic patterns |

**Focus on P95/P99, not average.** Tail latency kills user experience.

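For the "calculate from traffic patterns" row, Little's Law gives a quick estimate: concurrent users ≈ request rate × (response time + think time). A sketch with made-up numbers:

```typescript
// Little's Law: concurrency = arrival rate x time each user spends per request cycle.
// All numbers below are illustrative.
const peakRps = 500;         // target peak requests per second
const avgResponseSec = 0.3;  // server time per request
const avgThinkSec = 5;       // pause between a user's requests

// Each simulated user completes one request every (response + think) seconds,
// so the virtual-user count needed to generate peakRps is:
const concurrentUsers = peakRps * (avgResponseSec + avgThinkSec);
console.log(Math.round(concurrentUsers)); // 2650
```
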
## Diagnostic-First Workflow

Before load testing slow applications, follow this workflow:

**Step 1: Measure Current State**
- Install APM (DataDog, New Relic, Grafana) or logging
- Identify slowest 10 endpoints/operations
- Check database slow query logs

**Step 2: Common Quick Wins** (90% of performance issues)
- Missing database indexes
- N+1 query problem (see the sketch below)
- Unoptimized images/assets
- Missing caching (Redis, CDN)
- Synchronous operations that should be async
- Inefficient serialization (JSON parsing bottlenecks)

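The N+1 query problem from the list above is worth a quick illustration; the `query` helper here is a hypothetical stand-in for whatever ORM or driver you use:

```typescript
// Hypothetical async DB helper - replace with your real driver/ORM call.
const query = async (sql: string, params: unknown[] = []): Promise<any[]> => {
  /* execute sql with params against your database */
  return [];
};

// ❌ N+1: one query for the list, then one query per row (101 round trips for 100 orders)
async function ordersWithItemsSlow() {
  const orders = await query('SELECT * FROM orders LIMIT 100');
  for (const order of orders) {
    order.items = await query('SELECT * FROM order_items WHERE order_id = $1', [order.id]);
  }
  return orders;
}

// ✅ Two round trips: fetch all items for the page of orders at once, join in memory
async function ordersWithItemsFast() {
  const orders = await query('SELECT * FROM orders LIMIT 100');
  const ids = orders.map((o) => o.id);
  const items = await query('SELECT * FROM order_items WHERE order_id = ANY($1)', [ids]);
  for (const order of orders) {
    order.items = items.filter((i) => i.order_id === order.id);
  }
  return orders;
}
```
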
**Step 3: Profile Specific Bottleneck**
- Use profiler to see CPU/memory hotspots
- Trace requests to find where time is spent (DB? external API? computation?)
- Check for resource limits (max connections, thread pool exhaustion)

**Step 4: Fix and Measure**
- Apply fix (add index, cache layer, async processing)
- Measure improvement in production
- Document before/after metrics

**Step 5: THEN Load Test** (if needed)
- Validate fixes handle expected load
- Find new capacity limits
- Establish regression baseline

**Anti-pattern to avoid:** Skipping to Step 5 without Steps 1-4.

## Performance Benchmarks (Reference)

What "good" looks like by application type:

| Application Type | Typical P95 Latency | Typical Throughput | Notes |
|------------------|---------------------|-------------------|-------|
| **REST API (CRUD)** | < 200ms | 500-2000 RPS | Database-backed, simple queries |
| **Search API** | < 500ms | 100-500 RPS | Complex queries, ranking algorithms |
| **Payment Gateway** | < 1s | 50-200 RPS | External service calls, strict consistency |
| **Real-time Gaming** | < 50ms | 1000-10000 RPS | Low latency critical |
| **Batch Processing** | 2-10s/job | 10-100 jobs/min | Throughput > latency |
| **Static CDN** | < 100ms | 10000+ RPS | Edge-cached, minimal computation |

**Use as rough guide, not absolute targets.** Your SLAs depend on user needs.

## Results Interpretation Framework

After running a load test:

**Pass Criteria:**
- ✅ All requests meet latency SLA (e.g., P95 < 300ms)
- ✅ Error rate under threshold (< 0.1%)
- ✅ No resource exhaustion (CPU < 80%, memory stable, no connection pool saturation)
- ✅ Sustained load for test duration without degradation

**Fail Criteria:**
- ❌ Latency exceeds SLA
- ❌ Error rate spikes
- ❌ Gradual degradation over time (memory leak, connection leak)
- ❌ Resource exhaustion (CPU pegged, OOM errors)

**Next Steps:**
- **If passing:** Establish this as regression baseline, run periodically in CI (see the threshold-gating sketch below)
- **If failing:** Profile to find bottleneck, optimize, re-test
- **If borderline:** Test at higher load (stress test) to find safety margin

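For the latency and error-rate criteria, one way to make the pass/fail call automatic in CI is k6 thresholds: a failed threshold makes k6 exit non-zero, which fails the pipeline job (resource exhaustion and degradation still need monitoring on the system under test). A sketch reusing the earlier example SLAs:

```typescript
export const options = {
  thresholds: {
    // Abort early once latency clearly violates the SLA - no point finishing the run
    http_req_duration: [{ threshold: 'p(95)<300', abortOnFail: true }],
    // Error-rate gate: the CI job fails when k6 exits non-zero
    http_req_failed: ['rate<0.001'],
  },
};
```
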
## Common Mistakes

### ❌ Not Ramping Load Gradually

**Symptom:** Instant 0 → 1000 users, everything fails

**Fix:** Ramp over 2-5 minutes to let auto-scaling/caching warm up (except spike tests, where instant jump is the point)

---

### ❌ Testing With Empty Database

**Symptom:** Tests pass with 100 records, fail with 1M records in production

**Fix:** Seed staging database with production-scale data

---

### ❌ Ignoring External Dependencies

**Symptom:** Your API is fast, but third-party payment gateway times out under load

**Fix:** Include external service latency in SLAs, or mock them for isolated API testing

## Quick Reference

**Getting Started Checklist:**
1. Define SLAs (latency P95/P99, throughput, error rate)
2. Choose tool (k6 or JMeter for most cases)
3. Start with Load Test (baseline validation)
4. Run Stress Test (find capacity limits)
5. Establish regression baseline
6. Run in CI on major changes

**When Debugging Slow App:**
1. Profile first (APM, database logs)
2. Fix obvious issues (indexes, N+1, caching)
3. Measure improvement
4. THEN load test to validate

**Interpreting Results:**
- Report P95/P99, not just average
- Compare against SLAs
- Check for resource exhaustion
- Look for degradation over time (soak tests)

## Bottom Line

**Performance testing validates capacity and catches regressions.**

**Profiling finds bottlenecks.**

Don't confuse the two - diagnose first, test second.