Files
gh-lyndonkl-claude/skills/brainstorm-diverge-converge/resources/examples/api-performance-optimization.md
2025-11-30 08:38:26 +08:00

14 KiB
Raw Blame History

Brainstorm: API Performance Optimization Strategies

Problem Statement

What we're solving: API response time has degraded from 200ms (p95) to 800ms (p95) over the past 3 months. Users are experiencing slow page loads and some are timing out.

Decision to make: Which optimization approaches should we prioritize for the next quarter to bring p95 response time back to <300ms?

Context:

  • REST API serving 50k requests/day
  • PostgreSQL database, 200GB data
  • Node.js/Express backend
  • Current p95: 800ms, p50: 350ms
  • Team: 3 backend engineers, 1 devops
  • Quarterly engineering budget: 4 engineer-months

Constraints:

  • Cannot break existing API contracts (backwards compatible)
  • Must maintain 99.9% uptime during changes
  • No more than $2k/month additional infrastructure cost
  • Must ship improvements within 3 months

Diverge: Generate Ideas

Target: 40 ideas

Prompt: Generate as many ways as possible to improve API response time. Suspend judgment. All ideas are valid - from quick wins to major architectural changes.

All Ideas

  1. Add Redis caching layer for frequent queries
  2. Database query optimization (add indexes)
  3. Implement database connection pooling
  4. Use GraphQL to reduce over-fetching
  5. Add CDN for static assets
  6. Implement HTTP/2 server push
  7. Compress API responses with gzip
  8. Paginate large result sets
  9. Use database read replicas
  10. Implement response caching headers (ETag, If-None-Match)
  11. Migrate to serverless (AWS Lambda)
  12. Add API gateway for request routing
  13. Implement request batching
  14. Use database query result caching
  15. Optimize N+1 query problems
  16. Implement lazy loading for related data
  17. Switch to gRPC from REST
  18. Add application-level caching (in-memory)
  19. Optimize JSON serialization
  20. Implement database partitioning
  21. Use faster ORM or raw SQL
  22. Add async processing for slow operations
  23. Implement API rate limiting to prevent overload
  24. Optimize Docker container size
  25. Use database materialized views
  26. Implement query result streaming
  27. Add load balancer for horizontal scaling
  28. Optimize database schema (denormalization)
  29. Implement incremental/delta responses
  30. Use WebSockets for real-time data
  31. Migrate to NoSQL (MongoDB, DynamoDB)
  32. Implement API response compression (Brotli)
  33. Add edge caching (Cloudflare Workers)
  34. Use database archival for old data
  35. Implement request queuing/throttling
  36. Optimize API middleware chain
  37. Use faster JSON parser (simdjson)
  38. Implement selective field loading
  39. Add monitoring and alerting for slow queries
  40. Database vacuum/analyze for query planner

Total generated: 40 ideas


Cluster: Organize Themes

Goal: Group similar ideas into 4-8 distinct categories

Cluster 1: Caching Strategies (9 ideas)

  • Add Redis caching layer for frequent queries
  • Implement response caching headers (ETag, If-None-Match)
  • Use database query result caching
  • Add application-level caching (in-memory)
  • Add CDN for static assets
  • Add edge caching (Cloudflare Workers)
  • Implement response caching at API gateway
  • Use database materialized views
  • Cache computed/aggregated results

Cluster 2: Database Query Optimization (11 ideas)

  • Database query optimization (add indexes)
  • Optimize N+1 query problems
  • Use faster ORM or raw SQL
  • Implement selective field loading
  • Optimize database schema (denormalization)
  • Add monitoring and alerting for slow queries
  • Database vacuum/analyze for query planner
  • Implement lazy loading for related data
  • Use database query result caching (also in caching)
  • Database archival for old data
  • Database partitioning

Cluster 3: Data Transfer Optimization (7 ideas)

  • Compress API responses with gzip/Brotli
  • Paginate large result sets
  • Implement request batching
  • Optimize JSON serialization
  • Use faster JSON parser (simdjson)
  • Implement incremental/delta responses
  • Implement query result streaming

Cluster 4: Infrastructure Scaling (7 ideas)

  • Use database read replicas
  • Add load balancer for horizontal scaling
  • Implement database connection pooling
  • Optimize Docker container size
  • Migrate to serverless (AWS Lambda)
  • Add API gateway for request routing
  • Implement request queuing/throttling

Cluster 5: Architectural Changes (4 ideas)

  • Use GraphQL to reduce over-fetching
  • Switch to gRPC from REST
  • Use WebSockets for real-time data
  • Migrate to NoSQL (MongoDB, DynamoDB)

Cluster 6: Async & Offloading (2 ideas)

  • Add async processing for slow operations
  • Implement background job processing for heavy tasks

Total clusters: 6 themes


Converge: Evaluate & Select

Evaluation Criteria:

  1. Impact on p95 latency (weight: 3x) - How much will this reduce response time?
  2. Implementation effort (weight: 2x) - Engineering time required (lower = better)
  3. Infrastructure cost (weight: 1x) - Additional monthly cost (lower = better)

Scoring scale: 1-10 (higher = better)

Scored Ideas

Idea Impact (3x) Effort (2x) Cost (1x) Weighted Total
Add Redis caching 9 7 7 9×3 + 7×2 + 7×1 = 48
Optimize N+1 queries 8 8 10 8×3 + 8×2 + 10×1 = 50
Add database indexes 7 9 10 7×3 + 9×2 + 10×1 = 49
Response compression (gzip) 6 9 10 6×3 + 9×2 + 10×1 = 45
Database connection pooling 6 8 10 6×3 + 8×2 + 10×1 = 44
Paginate large results 7 7 10 7×3 + 7×2 + 10×1 = 45
DB read replicas 8 5 4 8×3 + 5×2 + 4×1 = 38
Async processing 6 6 8 6×3 + 6×2 + 8×1 = 38
GraphQL migration 7 3 9 7×3 + 3×2 + 9×1 = 36
Serverless migration 5 2 5 5×3 + 2×2 + 5×1 = 24

Scoring notes:

  • Impact: Based on estimated latency reduction (9-10 = >400ms, 7-8 = 200-400ms, 5-6 = 100-200ms)
  • Effort: Inverse scale (9-10 = <1 week, 7-8 = 1-2 weeks, 5-6 = 3-4 weeks, 3-4 = 1-2 months, 1-2 = 3+ months)
  • Cost: Inverse scale (10 = $0, 8-9 = <$200/mo, 6-7 = <$500/mo, 4-5 = <$1k/mo, 1-3 = >$1k/mo)

Top 3 Selections

1. Fix N+1 Query Problems (Score: 50)

Why selected: Highest overall score - high impact, reasonable effort, zero cost

Rationale:

  • Impact (8/10): N+1 queries are a common culprit for slow APIs. Profiling shows several endpoints making 50-100 queries per request. Fixing this could reduce p95 by 300-500ms.
  • Effort (8/10): Can identify with APM tools (DataDog), fix iteratively. Estimated 2-3 weeks for main endpoints.
  • Cost (10/10): Zero additional infrastructure cost - purely code optimization.

Next steps:

  • Week 1: Profile top 10 slowest endpoints with APM to identify N+1 patterns
  • Week 2-3: Implement eager loading/joins for identified queries
  • Week 4: Deploy with feature flags, measure impact
  • Expected improvement: Reduce p95 from 800ms to 500-600ms

Measurement:

  • Track p95/p99 latency per endpoint before/after
  • Monitor database query counts (should decrease significantly)
  • Verify no increase in memory usage from eager loading

2. Add Database Indexes (Score: 49)

Why selected: Second highest score - very low effort for solid impact

Rationale:

  • Impact (7/10): Database query analysis shows several full table scans. Adding indexes could reduce individual query time by 50-80%.
  • Effort (9/10): Quick wins - can identify missing indexes via EXPLAIN ANALYZE, add indexes with minimal risk. Estimated 1 week.
  • Cost (10/10): Marginal storage cost for indexes (~5-10GB), no new infrastructure.

Next steps:

  • Day 1-2: Run EXPLAIN ANALYZE on slow queries (from slow query log)
  • Day 3-4: Create indexes on foreign keys, WHERE clause columns, JOIN columns
  • Day 5: Deploy indexes during low-traffic window, monitor impact
  • Expected improvement: Reduce p95 by 100-200ms for index-heavy endpoints

Measurement:

  • Compare query execution plans before/after (table scan → index scan)
  • Track index usage with pg_stat_user_indexes
  • Monitor index size growth

Considerations:

  • Some writes may slow down slightly (index maintenance)
  • Test on staging first to verify no lock contention

3. Implement Redis Caching (Score: 48)

Why selected: Highest impact potential, moderate effort and cost

Rationale:

  • Impact (9/10): Caching frequently-accessed data (user profiles, config, lookup tables) could eliminate 60-70% of database queries. Massive impact for cacheable endpoints.
  • Effort (7/10): Moderate effort - setup Redis, implement caching layer, handle cache invalidation. Estimated 2-3 weeks.
  • Cost (7/10): Redis managed service ~$200-400/month (ElastiCache t3.medium)

Next steps:

  • Week 1: Analyze request patterns - identify most-frequent queries for caching
  • Week 2: Setup Redis (ElastiCache), implement cache-aside pattern for top 3 endpoints
  • Week 3: Implement cache invalidation strategy (TTL + event-based)
  • Week 4: Rollout with monitoring
  • Expected improvement: Reduce p95 from 800ms to 300-400ms for cached endpoints (cache hit rate target: >80%)

Measurement:

  • Track cache hit rate (target >80%)
  • Monitor Redis memory usage and eviction rate
  • Compare endpoint latency with/without cache
  • Track database query reduction

Considerations:

  • Cache invalidation complexity (implement carefully to avoid stale data)
  • Redis failover strategy (what happens if Redis is down?)
  • Cold start performance (first request still slow)

Runner-Ups (For Future Consideration)

Response Compression (gzip) (Score: 45)

  • Very quick win (1-2 days to implement)
  • Modest impact for large payloads (~20-30% response size reduction → ~100ms latency improvement)
  • Recommendation: Implement in parallel with top 3 (low effort, no downside)

Database Connection Pooling (Score: 44)

  • Quick to implement if not already in place
  • Reduces connection overhead
  • Recommendation: Verify current pooling configuration first - may already be optimized

Pagination (Score: 45)

  • Essential for endpoints returning large result sets
  • Quick to implement (2-3 days)
  • Recommendation: Implement in parallel - protect against future growth

Database Read Replicas (Score: 38)

  • Good for read-heavy workload scaling
  • Higher cost (~$500-800/month)
  • Recommendation: Defer to Q2 after quick wins exhausted - consider if traffic grows 2-3x

Next Steps

Immediate Actions (Week 1-2)

Priority 1: N+1 Query Optimization

  • Enable APM detailed query tracing
  • Profile top 10 slowest endpoints
  • Create backlog of N+1 fixes prioritized by impact
  • Assign to Engineer A

Priority 2: Database Index Analysis

  • Export slow query log (queries >500ms)
  • Run EXPLAIN ANALYZE on top 20 slow queries
  • Identify missing indexes
  • Assign to Engineer B

Priority 3: Redis Caching Planning

  • Analyze request patterns to identify cacheable data
  • Design cache key strategy
  • Document cache invalidation approach
  • Get budget approval for Redis ($300/month)
  • Assign to Engineer C

Quick Win (parallel):

  • Implement gzip compression (Engineer A, 4 hours)
  • Verify connection pooling config (Engineer B, 2 hours)
  • Add pagination to /users and /orders endpoints (Engineer C, 1 day)

Timeline

Week 1-2: Analysis + quick wins

  • N+1 profiling complete
  • Index analysis complete
  • Redis architecture designed
  • Gzip compression live
  • Pagination live for 2 endpoints

Week 3-4: N+1 fixes + Indexes

  • Top 5 N+1 queries fixed and deployed
  • 10-15 database indexes added
  • Target: p95 drops to 600ms

Week 5-7: Redis caching

  • Redis infrastructure provisioned
  • Top 3 endpoints cached
  • Cache invalidation tested
  • Target: p95 drops to 350ms for cached endpoints

Week 8-9: Measure, iterate, polish

  • Monitor metrics
  • Fix any regressions
  • Extend caching to 5 more endpoints
  • Target: Overall p95 <300ms

Week 10-12: Buffer for unknowns

  • Address unexpected issues
  • Optimize further if needed
  • Document learnings

Success Criteria

Primary:

  • p95 latency <300ms (currently 800ms)
  • p99 latency <600ms (currently 1.5s)
  • No increase in error rate
  • 99.9% uptime maintained

Secondary:

  • Database query count reduced by >40%
  • Cache hit rate >80% for cached endpoints
  • Additional infrastructure cost <$500/month

Monitoring:

  • Daily p95/p99 latency dashboard
  • Weekly review of slow query log
  • Redis cache hit rate tracking
  • Database connection pool utilization

Risks & Mitigation

Risk 1: N+1 fixes increase memory usage

  • Mitigation: Profile memory before/after, implement pagination if needed
  • Rollback: Revert to lazy loading if memory spikes >20%

Risk 2: Cache invalidation bugs cause stale data

  • Mitigation: Start with short TTL (5 min), add event-based invalidation gradually
  • Rollback: Disable caching for affected endpoints immediately

Risk 3: Index additions cause write performance degradation

  • Mitigation: Test on staging with production-like load, monitor write latency
  • Rollback: Drop problematic indexes

Risk 4: Timeline slips due to complexity

  • Mitigation: Front-load quick wins (gzip, indexes) to show early progress
  • Contingency: Descope Redis to Q2 if needed, focus on N+1 and indexes

Rubric Self-Assessment

Using rubric_brainstorm_diverge_converge.json:

Scores:

  1. Divergence Quantity: 5/5 (40 ideas - comprehensive exploration)
  2. Divergence Variety: 4/5 (good variety from quick fixes to major architecture changes)
  3. Divergence Creativity: 4/5 (includes both practical and ambitious ideas)
  4. Cluster Quality: 5/5 (6 distinct, well-labeled themes)
  5. Cluster Coverage: 5/5 (6 clusters covering infrastructure, data, architecture)
  6. Evaluation Criteria Clarity: 5/5 (impact, effort, cost - specific and weighted)
  7. Scoring Rigor: 4/5 (systematic scoring with justification)
  8. Selection Quality: 5/5 (clear top 3 with tradeoff analysis)
  9. Actionability: 5/5 (detailed timeline, owners, success criteria)
  10. Process Integrity: 5/5 (clear phase separation, no premature filtering)

Average: 4.7/5 - Excellent (high-stakes technical decision quality)

Assessment: This brainstorm is ready for use in prioritizing engineering work. Strong divergence phase with 40 varied ideas, clear clustering by mechanism, and rigorous convergence with weighted scoring. Actionable plan with timeline and risk mitigation.