
Backend/API Diagnostics

Purpose: Troubleshoot backend services, APIs, and application-level performance issues.

Common Backend Issues

1. Slow API Response

Symptoms:

  • API response time >1 second
  • Users report slow loading
  • Timeout errors

Diagnosis:

Check Application Logs

# Check for slow requests (assumes duration in ms is the 5th log field; adjust to your format)
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'

# Check error rate
grep "ERROR" /var/log/application.log | wc -l

# Check recent errors
tail -f /var/log/application.log | grep "ERROR"

Red flags:

  • Repeated errors for same endpoint
  • Increasing response times
  • Timeout errors

Check Application Metrics

# CPU usage
top -bn1 | grep "node\|java\|python"

# Memory usage
ps aux | grep "node\|java\|python" | awk '{print $4, $11}'

# Thread count
ps -eLf | grep "node\|java\|python" | wc -l

# Open file descriptors
lsof -p <PID> | wc -l

Red flags:

  • CPU >80%
  • Memory increasing over time
  • Thread count increasing (thread leak)
  • File descriptors increasing (connection leak)

Check Database Query Time

# If slow, likely database issue
# See database-diagnostics.md

# Check if query time matches API response time
# API response time = Query time + Application processing
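
A minimal sketch of separating query time from total handler time, assuming an Express handler and a node-postgres pool (both illustrative here):

const { performance } = require('node:perf_hooks');

app.get('/users', async (req, res) => {
  const start = performance.now();

  const queryStart = performance.now();
  const result = await pool.query('SELECT * FROM users');
  const queryMs = performance.now() - queryStart;

  const totalMs = performance.now() - start;
  // totalMs ≈ queryMs: the database dominates (see database-diagnostics.md)
  // totalMs >> queryMs: look at application processing instead
  console.log(`total=${totalMs.toFixed(1)}ms query=${queryMs.toFixed(1)}ms`);
  res.json(result.rows);
});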

Check External API Calls

# Check if calling external APIs
grep "http.request" /var/log/application.log

# Check external API response time
# Use APM tools or custom instrumentation

Red flags:

  • External API taking >500ms
  • External API rate limiting (429 errors)
  • External API errors (5xx errors)

Mitigation:

  • Cache external API responses
  • Add timeout (don't wait >5s; see the sketch below)
  • Circuit breaker pattern
  • Fallback data
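
A minimal sketch of the timeout-plus-fallback approach, assuming Node 18+ (global fetch) and a hypothetical getCachedPrices() fallback:

async function fetchPrices() {
  try {
    const res = await fetch('https://api.example.com/prices', {
      signal: AbortSignal.timeout(5000), // don't wait >5s
    });
    if (!res.ok) throw new Error(`Upstream returned ${res.status}`);
    return await res.json();
  } catch (err) {
    // Timeout or upstream failure: serve cached data instead of failing the request
    return getCachedPrices();
  }
}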

2. 5xx Errors (500, 502, 503, 504)

Symptoms:

  • Users getting error messages
  • Monitoring alerts for error rate
  • Some/all requests failing

Diagnosis by Error Code:

500 Internal Server Error

Cause: Application code error

Diagnosis:

# Check application logs for exceptions
grep "Exception\|Error" /var/log/application.log | tail -20

# Check stack traces
tail -100 /var/log/application.log

Common causes:

  • NullPointerException / TypeError
  • Unhandled promise rejection
  • Database connection error
  • Missing environment variable

Mitigation:

  • Fix bug in code
  • Add error handling (sketched below)
  • Add input validation
  • Add monitoring for this error
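
One hedged sketch of "add error handling", using Express-style middleware (app and the handler shapes are placeholders):

// Catch-all error middleware: log the stack, return a generic 500
app.use((err, req, res, next) => {
  console.error(err.stack);
  res.status(500).json({ error: 'Internal server error' });
});

// Surface unhandled promise rejections instead of letting them crash silently
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', reason);
});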

502 Bad Gateway

Cause: Reverse proxy can't reach backend

Diagnosis:

# Check if application is running
ps aux | grep "node\|java\|python"

# Check application port
netstat -tlnp | grep <PORT>
# or, on systems without netstat:
ss -tlnp | grep <PORT>

# Check reverse proxy logs (nginx, apache)
tail -f /var/log/nginx/error.log

Common causes:

  • Application crashed
  • Application not listening on expected port
  • Firewall blocking connection
  • Reverse proxy misconfigured

Mitigation:

  • Restart application
  • Check application logs for crash reason
  • Verify port configuration (see the startup sketch below)
  • Check reverse proxy config
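
For port configuration, one approach is to fail fast and loudly at startup; a sketch (PORT and server are illustrative):

const port = Number(process.env.PORT);
if (!port) {
  console.error('PORT is not set; refusing to start');
  process.exit(1); // a clear crash beats silently listening on the wrong port
}
server.listen(port, () => {
  console.log(`Listening on :${port}`); // compare against the proxy's upstream config
});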

503 Service Unavailable

Cause: Application overloaded or unhealthy

Diagnosis:

# Check application health
curl http://localhost:<PORT>/health

# Check connection pool
# Database connections, HTTP connections

# Check queue depth
# Message queues, task queues

Common causes:

  • Too many concurrent requests
  • Database connection pool exhausted
  • Dependency service down
  • Health check failing

Mitigation:

  • Scale horizontally (add more instances)
  • Increase connection pool size
  • Rate limiting
  • Circuit breaker for dependencies (sketched below)
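
A minimal circuit-breaker sketch (the thresholds are illustrative, not tuned values):

class CircuitBreaker {
  constructor(fn, { maxFailures = 5, resetMs = 30000 } = {}) {
    this.fn = fn;
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(...args) {
    // Open: fail fast until the reset window has passed
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error('Circuit open: dependency unavailable');
      }
      this.failures = 0; // half-open: allow a trial request through
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      this.openedAt = Date.now();
      throw err;
    }
  }
}

Wrap each fragile dependency once, e.g. const breaker = new CircuitBreaker(callPaymentApi), then invoke it via breaker.call(...).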

504 Gateway Timeout

Cause: Application took too long to respond

Diagnosis:

# Check what's slow
# Database query? External API? Long computation?

# Check application logs for slow operations
grep "slow\|timeout" /var/log/application.log

Common causes:

  • Slow database query
  • Slow external API call
  • Long-running computation
  • Deadlock

Mitigation:

  • Optimize slow operation
  • Add timeout to prevent indefinite wait
  • Async processing (return 202 Accepted; sketched below)
  • Increase timeout (last resort)
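
A sketch of the 202 Accepted pattern, assuming an Express app; the in-memory job map and generateReport() are placeholders for a real queue and worker:

const crypto = require('node:crypto');
const jobs = new Map();

app.post('/reports', (req, res) => {
  const jobId = crypto.randomUUID();
  jobs.set(jobId, { status: 'pending' });

  // Kick off the slow work without holding the request open
  generateReport(req.body)
    .then((result) => jobs.set(jobId, { status: 'done', result }))
    .catch((err) => jobs.set(jobId, { status: 'failed', error: err.message }));

  // Respond immediately; the client polls the status URL
  res.status(202).json({ jobId, statusUrl: `/reports/${jobId}` });
});

app.get('/reports/:id', (req, res) => {
  res.json(jobs.get(req.params.id) ?? { status: 'unknown' });
});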

3. Memory Leak (Backend)

Symptoms:

  • Memory usage increasing over time
  • Application crashes with OutOfMemoryError
  • Performance degrades over time

Diagnosis:

Monitor Memory Over Time

# Linux: sample %MEM, VSZ, RSS every 5 seconds
watch -n 5 'ps aux | grep <PROCESS> | awk "{print \$4, \$5, \$6}"'

# Get heap dump (Java)
jmap -dump:format=b,file=heap.bin <PID>

# Get heap snapshot (Node.js)
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot
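
Inside a Node.js process, a lightweight complement is to sample process.memoryUsage() on an interval:

// Log heap usage every 60s; heapUsed climbing across GC cycles suggests a leak
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.log(
    `rss=${(rss / 1e6).toFixed(1)}MB heap=${(heapUsed / 1e6).toFixed(1)}/${(heapTotal / 1e6).toFixed(1)}MB`
  );
}, 60000).unref(); // unref() so the timer doesn't keep the process alive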

Red flags:

  • Memory increasing linearly
  • Memory not released after GC
  • Large arrays/objects in heap dump

Common Causes

// 1. Event listeners not removed
emitter.on('event', handler); // Never removed

// 2. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared

// 3. Global variables growing
global.cache = {}; // Grows forever

// 4. Closures holding references
function createHandler() {
  const largeData = new Array(1000000);
  return () => {
    // Closure keeps largeData in memory
  };
}

// 5. Connection leaks
const conn = await db.connect();
// Never closed → connection pool exhausted

Mitigation:

// 1. Remove event listeners
const handler = () => { /* ... */ };
emitter.on('event', handler);
// Later:
emitter.off('event', handler);

// 2. Clear timers
const intervalId = setInterval(() => { /* ... */ }, 1000);
// Later:
clearInterval(intervalId);

// 3. Use a bounded LRU cache
const { LRUCache } = require('lru-cache'); // recent versions use the named export; older versions export the class directly
const cache = new LRUCache({ max: 1000 });

// 4. Be careful with closures
function createHandler() {
  return () => {
    const largeData = loadData(); // Load when needed
  };
}

// 5. Always close connections
const conn = await db.connect();
try {
  await conn.query(/* ... */);
} finally {
  await conn.close();
}

4. High CPU Usage

Symptoms:

  • CPU at 100%
  • Slow response times
  • Server becomes unresponsive

Diagnosis:

Identify CPU-heavy Process

# Top CPU processes
top -bn1 | head -20

# CPU per thread (Java)
top -H -p <PID>

# Profile application (Node.js)
node --prof index.js
node --prof-process isolate-*.log

Common causes:

  • Infinite loop
  • Heavy computation (parsing, encryption)
  • Regular expression catastrophic backtracking
  • Large JSON parsing

Mitigation:

// 1. Break up heavy computation
async function processLargeArray(items) {
  for (let i = 0; i < items.length; i++) {
    await processItem(items[i]);

    // Yield to event loop
    if (i % 100 === 0) {
      await new Promise(resolve => setImmediate(resolve));
    }
  }
}

// 2. Use worker threads (Node.js)
const { Worker } = require('worker_threads');
const worker = new Worker('./heavy-computation.js');

// 3. Cache results
const cache = new Map(); // unbounded here; prefer an LRU cache in production
function expensiveOperation(input) {
  if (cache.has(input)) return cache.get(input);
  const result = compute(input); // compute() stands in for the heavy work
  cache.set(input, result);
  return result;
}

// 4. Fix regex
// Bad: /(.+)*/ (nested quantifiers cause catastrophic backtracking)
// Good: remove the nesting, e.g. /.+/, or bound the repetition with {1,100}

5. Connection Pool Exhausted

Symptoms:

  • "Connection pool exhausted" errors
  • "Too many connections" errors
  • Requests timing out

Diagnosis:

Check Connection Pool

# Database connections
# PostgreSQL:
SELECT count(*) FROM pg_stat_activity;
SELECT pid, state, query_start FROM pg_stat_activity WHERE state = 'idle in transaction';

# MySQL:
SHOW PROCESSLIST;

# Application connection pool
# Check application metrics/logs

Red flags:

  • Connections = max pool size
  • Idle connections in transaction
  • Long-running queries holding connections

Common causes:

  • Connections not released (missing .close())
  • Connection leak in error path
  • Pool size too small
  • Long-running queries

Mitigation:

// 1. Always close connections
async function queryDatabase() {
  const conn = await pool.connect();
  try {
    const result = await conn.query('SELECT * FROM users');
    return result;
  } finally {
    conn.release(); // CRITICAL
  }
}

// 2. Use connection pool wrapper
const pool = new Pool({
  max: 20, // max connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// 3. Monitor pool metrics
pool.on('error', (err) => {
  console.error('Pool error:', err);
});

// 4. Increase pool size (if needed)
// But investigate leaks first!
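
With node-postgres, the pool also exposes counters worth exporting to your metrics system (the 60s interval is arbitrary):

// totalCount / idleCount / waitingCount are documented pg.Pool properties
setInterval(() => {
  console.log(`pool total=${pool.totalCount} idle=${pool.idleCount} waiting=${pool.waitingCount}`);
  // Sustained waiting > 0, or total pinned at max, means the pool is the bottleneck
}, 60000).unref();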

Backend Performance Metrics

Response Time:

  • p50: <100ms
  • p95: <500ms
  • p99: <1s
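
For reference, a minimal sketch of computing these percentiles from a window of recorded durations (nearest-rank method; values in ms):

function percentile(durations, p) {
  const sorted = [...durations].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1; // nearest-rank index
  return sorted[Math.max(0, idx)];
}

const samples = [42, 95, 110, 480, 120, 88, 1500, 60];
console.log(percentile(samples, 50), percentile(samples, 95)); // 95 1500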

Throughput:

  • Requests per second (RPS)
  • Requests per minute (RPM)

Error Rate:

  • Target: <0.1%
  • 4xx errors: Client errors (validation)
  • 5xx errors: Server errors (bugs, downtime)

Resource Usage:

  • CPU: <70% average
  • Memory: <80% of limit
  • Connections: <80% of pool size

Availability:

  • Target: 99.9% (8.76 hours downtime/year)
  • 99.99%: 52.6 minutes downtime/year
  • 99.999%: 5.26 minutes downtime/year

Backend Diagnostic Checklist

When diagnosing slow backend:

  • Check application logs for errors
  • Check CPU usage (target: <70%)
  • Check memory usage (target: <80%)
  • Check database query time (see database-diagnostics.md)
  • Check external API calls (timeout, errors)
  • Check connection pool (target: <80% used)
  • Check error rate (target: <0.1%)
  • Check response time percentiles (p95, p99)
  • Check for thread leaks (increasing thread count)
  • Check for memory leaks (increasing memory over time)

Tools:

  • Application logs
  • APM tools (New Relic, DataDog, AppDynamics)
  • top, htop, ps, lsof
  • curl with timing (-w '%{time_total}')
  • Profilers (node --prof, jstack, py-spy)