# Backend/API Diagnostics
**Purpose**: Troubleshoot backend services, APIs, and application-level performance issues.
## Common Backend Issues
### 1. Slow API Response
**Symptoms**:
- API response time >1 second
- Users report slow loading
- Timeout errors
**Diagnosis**:
#### Check Application Logs
```bash
# Check for slow requests (field index depends on your log format;
# the duration in ms is assumed to be field 5 here)
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'
# Check error rate
grep "ERROR" /var/log/application.log | wc -l
# Check recent errors
tail -f /var/log/application.log | grep "ERROR"
```
**Red flags**:
- Repeated errors for same endpoint
- Increasing response times
- Timeout errors
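If the service does not already log per-request durations, the grep on "duration" above has nothing to match. A minimal Express-style timing middleware looks like this (a sketch; Express and the log format are assumptions, adapt to your framework and logger):
```javascript
// Hypothetical Express middleware: log method, path, status and duration per request.
function requestTiming(req, res, next) {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    // Swap console.log for your structured logger.
    console.log(
      `request method=${req.method} path=${req.originalUrl} ` +
      `status=${res.statusCode} duration=${durationMs.toFixed(1)}ms`
    );
  });
  next();
}
// Usage:
// const express = require('express');
// const app = express();
// app.use(requestTiming);
```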
---
#### Check Application Metrics
```bash
# CPU usage
top -bn1 | grep "node\|java\|python"
# Memory usage (%MEM and command name)
ps aux | grep "node\|java\|python" | grep -v grep | awk '{print $4, $11}'
# Thread count
ps -eLf | grep "node\|java\|python" | grep -v grep | wc -l
# Open file descriptors (replace <PID> with the process ID)
lsof -p <PID> | wc -l
```
**Red flags**:
- CPU >80%
- Memory increasing over time
- Thread count increasing (thread leak)
- File descriptors increasing (connection leak)
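For Node.js services, the same signals can also be sampled in-process using only built-in APIs (a minimal sketch; the 10-second interval and log shape are arbitrary choices):
```javascript
// Periodically sample memory, CPU and event-loop delay with Node built-ins.
const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

setInterval(() => {
  const mem = process.memoryUsage();  // bytes
  const cpu = process.cpuUsage();     // microseconds since process start
  console.log({
    rssMb: (mem.rss / 1024 / 1024).toFixed(1),
    heapUsedMb: (mem.heapUsed / 1024 / 1024).toFixed(1),
    cpuUserMs: Math.round(cpu.user / 1000),
    eventLoopDelayMs: (loopDelay.mean / 1e6).toFixed(1), // histogram is in nanoseconds
  });
  loopDelay.reset();
}, 10_000);
```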
---
#### Check Database Query Time
- If queries are slow, the database is the likely bottleneck: see [database-diagnostics.md](database-diagnostics.md).
- Compare query time to API response time: API response time = query time + application processing time.
---
#### Check External API Calls
```bash
# Check if calling external APIs
grep "http.request" /var/log/application.log
# Check external API response time
# Use APM tools or custom instrumentation
```
**Red flags**:
- External API taking >500ms
- External API rate limiting (429 errors)
- External API errors (5xx errors)
**Mitigation**:
- Cache external API responses
- Add timeout (don't wait >5s)
- Circuit breaker pattern
- Fallback data
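The timeout and fallback ideas above combine in a few lines; a minimal sketch assuming Node 18+ (global fetch and AbortSignal.timeout), with an illustrative URL and fallback value:
```javascript
// Call an external API with a hard timeout and a fallback value.
async function getExchangeRates() {
  try {
    const res = await fetch('https://api.example.com/rates', {
      signal: AbortSignal.timeout(5000), // never wait longer than 5s
    });
    if (!res.ok) throw new Error(`upstream returned ${res.status}`);
    return await res.json();
  } catch (err) {
    console.error('external API failed, serving fallback:', err.message);
    return { rates: {}, stale: true }; // cached or default data would go here
  }
}
```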
---
### 2. 5xx Errors (500, 502, 503, 504)
**Symptoms**:
- Users getting error messages
- Monitoring alerts for error rate
- Some/all requests failing
**Diagnosis by Error Code**:
#### 500 Internal Server Error
**Cause**: Application code error
**Diagnosis**:
```bash
# Check application logs for exceptions
grep "Exception\|Error" /var/log/application.log | tail -20
# Check stack traces
tail -100 /var/log/application.log
```
**Common causes**:
- NullPointerException / TypeError
- Unhandled promise rejection
- Database connection error
- Missing environment variable
**Mitigation**:
- Fix bug in code
- Add error handling
- Add input validation
- Add monitoring for this error
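As a concrete form of "add error handling", an Express-style catch-all error handler plus a process-level hook keeps stack traces in the logs instead of leaking them to users (a sketch; Express is an assumption):
```javascript
const express = require('express'); // assumption: the service uses Express
const app = express();

// ...routes registered here...

// Catch-all error handler: log the stack, return a generic 500 to the client.
app.use((err, req, res, next) => {
  console.error('unhandled error:', err.stack);
  res.status(500).json({ error: 'internal server error' });
});

// Surface otherwise-silent async failures so they appear in the logs.
process.on('unhandledRejection', (reason) => {
  console.error('unhandled promise rejection:', reason);
});

app.listen(3000);
```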
---
#### 502 Bad Gateway
**Cause**: Reverse proxy can't reach backend
**Diagnosis**:
```bash
# Check if the application process is running
ps aux | grep "node\|java\|python" | grep -v grep
# Check that the application is listening on the expected port
netstat -tlnp | grep <PORT>
# Check reverse proxy logs (nginx, apache)
tail -f /var/log/nginx/error.log
```
**Common causes**:
- Application crashed
- Application not listening on expected port
- Firewall blocking connection
- Reverse proxy misconfigured
**Mitigation**:
- Restart application
- Check application logs for crash reason
- Verify port configuration
- Check reverse proxy config
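When the underlying cause is a crash, make sure the crash reason actually reaches the logs before the process exits (a minimal Node.js sketch; a supervisor such as systemd, pm2, or Kubernetes is assumed to restart the process):
```javascript
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.listen(3000);

// Log fatal errors before exiting so the crash behind a 502 leaves a trace.
process.on('uncaughtException', (err) => {
  console.error('fatal: uncaught exception, exiting:', err.stack);
  process.exit(1);
});

// Shut down cleanly on SIGTERM so in-flight requests are not dropped mid-restart.
process.on('SIGTERM', () => {
  console.log('SIGTERM received, closing server');
  server.close(() => process.exit(0));
});
```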
---
#### 503 Service Unavailable
**Cause**: Application overloaded or unhealthy
**Diagnosis**:
```bash
# Check application health
curl http://localhost:<PORT>/health
# Check connection pool
# Database connections, HTTP connections
# Check queue depth
# Message queues, task queues
```
**Common causes**:
- Too many concurrent requests
- Database connection pool exhausted
- Dependency service down
- Health check failing
**Mitigation**:
- Scale horizontally (add more instances)
- Increase connection pool size
- Rate limiting
- Circuit breaker for dependencies
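A circuit breaker for a flaky dependency can be sketched in a few lines (thresholds and the 30s cool-down are illustrative; in production a library such as opossum is a more robust choice):
```javascript
// Minimal circuit breaker: after 5 consecutive failures, fail fast for 30s
// instead of piling more load onto an unhealthy dependency.
function createBreaker(fn, { maxFailures = 5, resetMs = 30_000 } = {}) {
  let failures = 0;
  let openedAt = 0;

  return async (...args) => {
    if (failures >= maxFailures && Date.now() - openedAt < resetMs) {
      throw new Error('circuit open: dependency marked unhealthy');
    }
    try {
      const result = await fn(...args);
      failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= maxFailures) openedAt = Date.now();
      throw err;
    }
  };
}
// Usage (hypothetical dependency call):
// const safeGetUser = createBreaker(fetchUserFromUpstream);
```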
---
#### 504 Gateway Timeout
**Cause**: Application took too long to respond
**Diagnosis**:
```bash
# Check what's slow
# Database query? External API? Long computation?
# Check application logs for slow operations
grep "slow\|timeout" /var/log/application.log
```
**Common causes**:
- Slow database query
- Slow external API call
- Long-running computation
- Deadlock
**Mitigation**:
- Optimize slow operation
- Add timeout to prevent indefinite wait
- Async processing (return 202 Accepted)
- Increase timeout (last resort)
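The "return 202 Accepted" pattern looks roughly like this (a sketch assuming Express; the in-memory job map stands in for a real queue or database):
```javascript
const express = require('express'); // assumption: Express
const crypto = require('crypto');

const app = express();
const jobs = new Map(); // stand-in for a durable queue/DB

// Accept the work immediately, process it in the background.
app.post('/reports', (req, res) => {
  const id = crypto.randomUUID();
  jobs.set(id, { status: 'pending' });
  res.status(202).json({ id, statusUrl: `/reports/${id}` });

  generateReport() // the long-running work that was timing out
    .then((result) => jobs.set(id, { status: 'done', result }))
    .catch((err) => jobs.set(id, { status: 'failed', error: err.message }));
});

// Clients poll here instead of holding one request open past the gateway timeout.
app.get('/reports/:id', (req, res) => {
  res.json(jobs.get(req.params.id) ?? { status: 'unknown' });
});

async function generateReport() {
  // placeholder for the slow operation
  return new Promise((resolve) => setTimeout(() => resolve({ rows: 0 }), 60_000));
}

app.listen(3000);
```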
---
### 3. Memory Leak (Backend)
**Symptoms**:
- Memory usage increasing over time
- Application crashes with OutOfMemoryError
- Performance degrades over time
**Diagnosis**:
#### Monitor Memory Over Time
```bash
# Linux: sample %MEM, VSZ and RSS every 5 seconds
watch -n 5 'ps aux | grep <PROCESS> | grep -v grep | awk "{print \$4, \$5, \$6}"'
# Get heap dump (Java)
jmap -dump:format=b,file=heap.bin <PID>
# Get heap snapshot (Node.js)
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot
```
**Red flags**:
- Memory increasing linearly
- Memory not released after GC
- Large arrays/objects in heap dump
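On Node.js, a heap snapshot can also be written from inside the running process with the built-in v8 module and opened in Chrome DevTools; taking two snapshots a few minutes apart and diffing them usually reveals what is growing (a minimal sketch, the output path is illustrative):
```javascript
// Write a .heapsnapshot file readable by Chrome DevTools (Memory tab -> Load).
// Synchronous and potentially slow on large heaps, so use sparingly in production.
const v8 = require('v8');
const path = require('path');

function captureHeapSnapshot(dir = '/tmp') {
  const file = path.join(dir, `heap-${Date.now()}.heapsnapshot`);
  v8.writeHeapSnapshot(file);
  return file;
}

console.log('Snapshot written to', captureHeapSnapshot());
```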
---
#### Common Causes
```javascript
// 1. Event listeners not removed
emitter.on('event', handler); // Never removed
// 2. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared
// 3. Global variables growing
global.cache = {}; // Grows forever
// 4. Closures holding references
function createHandler() {
  const largeData = new Array(1000000);
  return () => {
    // Closure keeps largeData in memory
  };
}
// 5. Connection leaks
const conn = await db.connect();
// Never closed → connection pool exhausted
```
**Mitigation**:
```javascript
// 1. Remove event listeners
const handler = () => { /* ... */ };
emitter.on('event', handler);
// Later:
emitter.off('event', handler);
// 2. Clear timers
const intervalId = setInterval(() => { /* ... */ }, 1000);
// Later:
clearInterval(intervalId);
// 3. Use a bounded LRU cache instead of an ever-growing object
const LRU = require('lru-cache'); // v6/v7 API; newer releases export { LRUCache }
const cache = new LRU({ max: 1000 });
// 4. Be careful with closures
function createHandler() {
  return () => {
    const largeData = loadData(); // Load when needed, released after the call
  };
}
// 5. Always close connections
const conn = await db.connect();
try {
  await conn.query(/* ... */);
} finally {
  await conn.close();
}
```
---
### 4. High CPU Usage
**Symptoms**:
- CPU at 100%
- Slow response times
- Server becomes unresponsive
**Diagnosis**:
#### Identify CPU-heavy Process
```bash
# Top CPU processes
top -bn1 | head -20
# CPU per thread (Java)
top -H -p <PID>
# Profile application (Node.js)
node --prof index.js
node --prof-process isolate-*.log
```
**Common causes**:
- Infinite loop
- Heavy computation (parsing, encryption)
- Regular expression catastrophic backtracking
- Large JSON parsing
**Mitigation**:
```javascript
// 1. Break up heavy computation
async function processLargeArray(items) {
  for (let i = 0; i < items.length; i++) {
    await processItem(items[i]);
    // Yield to event loop
    if (i % 100 === 0) {
      await new Promise(resolve => setImmediate(resolve));
    }
  }
}
// 2. Use worker threads (Node.js)
const { Worker } = require('worker_threads');
const worker = new Worker('./heavy-computation.js');
// 3. Cache results
const cache = new Map();
function expensiveOperation(input) {
  if (cache.has(input)) return cache.get(input);
  const result = computeExpensive(input); // placeholder for the heavy computation
  cache.set(input, result);
  return result;
}
// 4. Fix regexes prone to catastrophic backtracking
// Bad: /(.+)*/ (nested quantifiers cause exponential backtracking)
// Good: avoid nested quantifiers; prefer non-greedy or more specific patterns
```
---
### 5. Connection Pool Exhausted
**Symptoms**:
- "Connection pool exhausted" errors
- "Too many connections" errors
- Requests timing out
**Diagnosis**:
#### Check Connection Pool
```bash
# Database connections
# PostgreSQL:
psql -c "SELECT count(*) FROM pg_stat_activity;"
# MySQL:
mysql -e "SHOW PROCESSLIST;"
# Application connection pool:
# check the application's own pool metrics/logs (in use, idle, waiting)
```
**Red flags**:
- Connections = max pool size
- Idle connections in transaction
- Long-running queries holding connections
**Common causes**:
- Connections not released (missing .close())
- Connection leak in error path
- Pool size too small
- Long-running queries
**Mitigation**:
```javascript
// 1. Always close connections
async function queryDatabase() {
  const conn = await pool.connect();
  try {
    const result = await conn.query('SELECT * FROM users');
    return result;
  } finally {
    conn.release(); // CRITICAL: return the connection to the pool
  }
}
// 2. Configure the connection pool explicitly (node-postgres shown)
const { Pool } = require('pg');
const pool = new Pool({
  max: 20, // max connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});
// 3. Monitor pool errors
pool.on('error', (err) => {
  console.error('Pool error:', err);
});
// 4. Increase pool size (if needed)
// But investigate leaks first!
```
---
## Backend Performance Metrics
**Response Time**:
- p50: <100ms
- p95: <500ms
- p99: <1s
**Throughput**:
- Requests per second (RPS)
- Requests per minute (RPM)
**Error Rate**:
- Target: <0.1%
- 4xx errors: Client errors (validation)
- 5xx errors: Server errors (bugs, downtime)
**Resource Usage**:
- CPU: <70% average
- Memory: <80% of limit
- Connections: <80% of pool size
**Availability**:
- Target: 99.9% (8.76 hours downtime/year)
- 99.99%: 52.6 minutes downtime/year
- 99.999%: 5.26 minutes downtime/year
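If no APM tool is in place yet, the response-time percentiles above can be computed from raw request durations; a minimal sketch using the nearest-rank method (durations in ms, sample data is illustrative):
```javascript
// Compute latency percentiles from an array of request durations (ms).
function percentile(durations, p) {
  if (durations.length === 0) return NaN;
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

const samples = [42, 87, 95, 120, 310, 480, 950, 1200]; // example data
console.log('p50:', percentile(samples, 50), 'ms');
console.log('p95:', percentile(samples, 95), 'ms');
console.log('p99:', percentile(samples, 99), 'ms');
```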
---
## Backend Diagnostic Checklist
**When diagnosing slow backend**:
- [ ] Check application logs for errors
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check database query time (see database-diagnostics.md)
- [ ] Check external API calls (timeout, errors)
- [ ] Check connection pool (target: <80% used)
- [ ] Check error rate (target: <0.1%)
- [ ] Check response time percentiles (p95, p99)
- [ ] Check for thread leaks (increasing thread count)
- [ ] Check for memory leaks (increasing memory over time)
**Tools**:
- Application logs
- APM tools (New Relic, DataDog, AppDynamics)
- `top`, `htop`, `ps`, `lsof`
- `curl` with timing
- Profilers (node --prof, jstack, py-spy)
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [database-diagnostics.md](database-diagnostics.md) - Database troubleshooting
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting
- [monitoring.md](monitoring.md) - Observability tools