# Backend/API Diagnostics
**Purpose**: Troubleshoot backend services, APIs, and application-level performance issues.
## Common Backend Issues
### 1. Slow API Response
**Symptoms**:
- API response time >1 second
- Users report slow loading
- Timeout errors
**Diagnosis**:
#### Check Application Logs
```bash
# Check for slow requests (field index depends on your log format;
# the duration in ms is assumed to be field 5 here)
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'
# Check error rate
grep "ERROR" /var/log/application.log | wc -l
# Check recent errors
tail -f /var/log/application.log | grep "ERROR"
```
**Red flags**:
- Repeated errors for same endpoint
- Increasing response times
- Timeout errors
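If the service does not already log per-request durations, the grep on "duration" above has nothing to match. A minimal Express-style timing middleware looks like this (a sketch; Express and the log format are assumptions, adapt to your framework and logger):
```javascript
// Hypothetical Express middleware: log method, path, status and duration per request.
function requestTiming(req, res, next) {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    // Swap console.log for your structured logger.
    console.log(
      `request method=${req.method} path=${req.originalUrl} ` +
      `status=${res.statusCode} duration=${durationMs.toFixed(1)}ms`
    );
  });
  next();
}
// Usage:
// const express = require('express');
// const app = express();
// app.use(requestTiming);
```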
---
#### Check Application Metrics
```bash
# CPU usage
top -bn1 | grep "node\|java\|python"
# Memory usage (%MEM and command name)
ps aux | grep "node\|java\|python" | grep -v grep | awk '{print $4, $11}'
# Thread count
ps -eLf | grep "node\|java\|python" | grep -v grep | wc -l
# Open file descriptors (replace <PID> with the process ID)
lsof -p <PID> | wc -l
```
**Red flags**:
- CPU >80%
- Memory increasing over time
- Thread count increasing (thread leak)
- File descriptors increasing (connection leak)
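For Node.js services, the same signals can also be sampled in-process using only built-in APIs (a minimal sketch; the 10-second interval and log shape are arbitrary choices):
```javascript
// Periodically sample memory, CPU and event-loop delay with Node built-ins.
const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

setInterval(() => {
  const mem = process.memoryUsage();  // bytes
  const cpu = process.cpuUsage();     // microseconds since process start
  console.log({
    rssMb: (mem.rss / 1024 / 1024).toFixed(1),
    heapUsedMb: (mem.heapUsed / 1024 / 1024).toFixed(1),
    cpuUserMs: Math.round(cpu.user / 1000),
    eventLoopDelayMs: (loopDelay.mean / 1e6).toFixed(1), // histogram is in nanoseconds
  });
  loopDelay.reset();
}, 10_000);
```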
---
#### Check Database Query Time
- If queries are slow, the database is the likely bottleneck: see [database-diagnostics.md](database-diagnostics.md).
- Compare query time to API response time: API response time = query time + application processing time.
---
#### Check External API Calls
```bash
# Check if calling external APIs
grep "http.request" /var/log/application.log
# Check external API response time
# Use APM tools or custom instrumentation
```
**Red flags**:
- External API taking >500ms
- External API rate limiting (429 errors)
- External API errors (5xx errors)
**Mitigation**:
- Cache external API responses
- Add timeout (don't wait >5s)
- Circuit breaker pattern
- Fallback data
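The timeout and fallback ideas above combine in a few lines; a minimal sketch assuming Node 18+ (global fetch and AbortSignal.timeout), with an illustrative URL and fallback value:
```javascript
// Call an external API with a hard timeout and a fallback value.
async function getExchangeRates() {
  try {
    const res = await fetch('https://api.example.com/rates', {
      signal: AbortSignal.timeout(5000), // never wait longer than 5s
    });
    if (!res.ok) throw new Error(`upstream returned ${res.status}`);
    return await res.json();
  } catch (err) {
    console.error('external API failed, serving fallback:', err.message);
    return { rates: {}, stale: true }; // cached or default data would go here
  }
}
```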
---
### 2. 5xx Errors (500, 502, 503, 504)
**Symptoms**:
- Users getting error messages
- Monitoring alerts for error rate
- Some/all requests failing
**Diagnosis by Error Code**:
#### 500 Internal Server Error
**Cause**: Application code error
**Diagnosis**:
```bash
# Check application logs for exceptions
grep "Exception\|Error" /var/log/application.log | tail -20
# Check stack traces
tail -100 /var/log/application.log
```
**Common causes**:
- NullPointerException / TypeError
- Unhandled promise rejection
- Database connection error
- Missing environment variable
**Mitigation**:
- Fix bug in code
- Add error handling
- Add input validation
- Add monitoring for this error
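As a concrete form of "add error handling", an Express-style catch-all error handler plus a process-level hook keeps stack traces in the logs instead of leaking them to users (a sketch; Express is an assumption):
```javascript
const express = require('express'); // assumption: the service uses Express
const app = express();

// ...routes registered here...

// Catch-all error handler: log the stack, return a generic 500 to the client.
app.use((err, req, res, next) => {
  console.error('unhandled error:', err.stack);
  res.status(500).json({ error: 'internal server error' });
});

// Surface otherwise-silent async failures so they appear in the logs.
process.on('unhandledRejection', (reason) => {
  console.error('unhandled promise rejection:', reason);
});

app.listen(3000);
```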
---
#### 502 Bad Gateway
**Cause**: Reverse proxy can't reach backend
**Diagnosis**:
```bash
# Check if the application process is running
ps aux | grep "node\|java\|python" | grep -v grep
# Check that the application is listening on the expected port
netstat -tlnp | grep <PORT>
# Check reverse proxy logs (nginx, apache)
tail -f /var/log/nginx/error.log
```
**Common causes**:
- Application crashed
- Application not listening on expected port
- Firewall blocking connection
- Reverse proxy misconfigured
**Mitigation**:
- Restart application
- Check application logs for crash reason
- Verify port configuration
- Check reverse proxy config
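When the underlying cause is a crash, make sure the crash reason actually reaches the logs before the process exits (a minimal Node.js sketch; a supervisor such as systemd, pm2, or Kubernetes is assumed to restart the process):
```javascript
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.listen(3000);

// Log fatal errors before exiting so the crash behind a 502 leaves a trace.
process.on('uncaughtException', (err) => {
  console.error('fatal: uncaught exception, exiting:', err.stack);
  process.exit(1);
});

// Shut down cleanly on SIGTERM so in-flight requests are not dropped mid-restart.
process.on('SIGTERM', () => {
  console.log('SIGTERM received, closing server');
  server.close(() => process.exit(0));
});
```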
---
#### 503 Service Unavailable
**Cause**: Application overloaded or unhealthy
**Diagnosis**:
```bash
# Check application health
curl http://localhost:<PORT>/health
# Check connection pool
# Database connections, HTTP connections
# Check queue depth
# Message queues, task queues
```
**Common causes**:
- Too many concurrent requests
- Database connection pool exhausted
- Dependency service down
- Health check failing
**Mitigation**:
- Scale horizontally (add more instances)
- Increase connection pool size
- Rate limiting
- Circuit breaker for dependencies
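A circuit breaker for a flaky dependency can be sketched in a few lines (thresholds and the 30s cool-down are illustrative; in production a library such as opossum is a more robust choice):
```javascript
// Minimal circuit breaker: after 5 consecutive failures, fail fast for 30s
// instead of piling more load onto an unhealthy dependency.
function createBreaker(fn, { maxFailures = 5, resetMs = 30_000 } = {}) {
  let failures = 0;
  let openedAt = 0;

  return async (...args) => {
    if (failures >= maxFailures && Date.now() - openedAt < resetMs) {
      throw new Error('circuit open: dependency marked unhealthy');
    }
    try {
      const result = await fn(...args);
      failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= maxFailures) openedAt = Date.now();
      throw err;
    }
  };
}
// Usage (hypothetical dependency call):
// const safeGetUser = createBreaker(fetchUserFromUpstream);
```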
---
#### 504 Gateway Timeout
**Cause**: Application took too long to respond
**Diagnosis**:
```bash
# Check what's slow
# Database query? External API? Long computation?
# Check application logs for slow operations
grep "slow\|timeout" /var/log/application.log
```
**Common causes**:
- Slow database query
- Slow external API call
- Long-running computation
- Deadlock
**Mitigation**:
- Optimize slow operation
- Add timeout to prevent indefinite wait
- Async processing (return 202 Accepted)
- Increase timeout (last resort)
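The "return 202 Accepted" pattern looks roughly like this (a sketch assuming Express; the in-memory job map stands in for a real queue or database):
```javascript
const express = require('express'); // assumption: Express
const crypto = require('crypto');

const app = express();
const jobs = new Map(); // stand-in for a durable queue/DB

// Accept the work immediately, process it in the background.
app.post('/reports', (req, res) => {
  const id = crypto.randomUUID();
  jobs.set(id, { status: 'pending' });
  res.status(202).json({ id, statusUrl: `/reports/${id}` });

  generateReport() // the long-running work that was timing out
    .then((result) => jobs.set(id, { status: 'done', result }))
    .catch((err) => jobs.set(id, { status: 'failed', error: err.message }));
});

// Clients poll here instead of holding one request open past the gateway timeout.
app.get('/reports/:id', (req, res) => {
  res.json(jobs.get(req.params.id) ?? { status: 'unknown' });
});

async function generateReport() {
  // placeholder for the slow operation
  return new Promise((resolve) => setTimeout(() => resolve({ rows: 0 }), 60_000));
}

app.listen(3000);
```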
---
### 3. Memory Leak (Backend)
**Symptoms**:
- Memory usage increasing over time
- Application crashes with OutOfMemoryError
- Performance degrades over time
**Diagnosis**:
#### Monitor Memory Over Time
```bash
# Linux: sample %MEM, VSZ and RSS every 5 seconds
watch -n 5 'ps aux | grep <PROCESS> | grep -v grep | awk "{print \$4, \$5, \$6}"'
# Get heap dump (Java)
jmap -dump:format=b,file=heap.bin <PID>
# Get heap snapshot (Node.js)
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot
```
**Red flags**:
- Memory increasing linearly
- Memory not released after GC
- Large arrays/objects in heap dump
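On Node.js, a heap snapshot can also be written from inside the running process with the built-in v8 module and opened in Chrome DevTools; taking two snapshots a few minutes apart and diffing them usually reveals what is growing (a minimal sketch, the output path is illustrative):
```javascript
// Write a .heapsnapshot file readable by Chrome DevTools (Memory tab -> Load).
// Synchronous and potentially slow on large heaps, so use sparingly in production.
const v8 = require('v8');
const path = require('path');

function captureHeapSnapshot(dir = '/tmp') {
  const file = path.join(dir, `heap-${Date.now()}.heapsnapshot`);
  v8.writeHeapSnapshot(file);
  return file;
}

console.log('Snapshot written to', captureHeapSnapshot());
```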
---
#### Common Causes
```javascript
// 1. Event listeners not removed
emitter.on('event', handler); // Never removed
// 2. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared
// 3. Global variables growing
global.cache = {}; // Grows forever
// 4. Closures holding references
function createHandler() {
  const largeData = new Array(1000000);
  return () => {
    // Closure keeps largeData in memory
  };
}
// 5. Connection leaks
const conn = await db.connect();
// Never closed → connection pool exhausted
```
**Mitigation**:
```javascript
// 1. Remove event listeners
const handler = () => { /* ... */ };
emitter.on('event', handler);
// Later:
emitter.off('event', handler);
// 2. Clear timers
const intervalId = setInterval(() => { /* ... */ }, 1000);
// Later:
clearInterval(intervalId);
// 3. Use a bounded LRU cache instead of an ever-growing object
const LRU = require('lru-cache'); // v6/v7 API; newer releases export { LRUCache }
const cache = new LRU({ max: 1000 });
// 4. Be careful with closures
function createHandler() {
  return () => {
    const largeData = loadData(); // Load when needed, released after the call
  };
}
// 5. Always close connections
const conn = await db.connect();
try {
  await conn.query(/* ... */);
} finally {
  await conn.close();
}
```
---
### 4. High CPU Usage
**Symptoms**:
- CPU at 100%
- Slow response times
- Server becomes unresponsive
**Diagnosis**:
#### Identify CPU-heavy Process
```bash
# Top CPU processes
top -bn1 | head -20
# CPU per thread (Java)
top -H -p <PID>
# Profile application (Node.js)
node --prof index.js
node --prof-process isolate-*.log
```
**Common causes**:
- Infinite loop
- Heavy computation (parsing, encryption)
- Regular expression catastrophic backtracking
- Large JSON parsing
**Mitigation**:
```javascript
// 1. Break up heavy computation
async function processLargeArray(items) {
  for (let i = 0; i < items.length; i++) {
    await processItem(items[i]);
    // Yield to event loop
    if (i % 100 === 0) {
      await new Promise(resolve => setImmediate(resolve));
    }
  }
}
// 2. Use worker threads (Node.js)
const { Worker } = require('worker_threads');
const worker = new Worker('./heavy-computation.js');
// 3. Cache results
const cache = new Map();
function expensiveOperation(input) {
  if (cache.has(input)) return cache.get(input);
  const result = computeExpensive(input); // placeholder for the heavy computation
  cache.set(input, result);
  return result;
}
// 4. Fix regexes prone to catastrophic backtracking
// Bad: /(.+)*/ (nested quantifiers cause exponential backtracking)
// Good: avoid nested quantifiers; prefer non-greedy or more specific patterns
```
---
### 5. Connection Pool Exhausted
**Symptoms**:
- "Connection pool exhausted" errors
- "Too many connections" errors
- Requests timing out
**Diagnosis**:
#### Check Connection Pool
```bash
# Database connections
# PostgreSQL:
psql -c "SELECT count(*) FROM pg_stat_activity;"
# MySQL:
mysql -e "SHOW PROCESSLIST;"
# Application connection pool:
# check the application's own pool metrics/logs (in use, idle, waiting)
```
**Red flags**:
- Connections = max pool size
- Idle connections in transaction
- Long-running queries holding connections
**Common causes**:
- Connections not released (missing .close())
- Connection leak in error path
- Pool size too small
- Long-running queries
**Mitigation**:
```javascript
// 1. Always close connections
async function queryDatabase() {
  const conn = await pool.connect();
  try {
    const result = await conn.query('SELECT * FROM users');
    return result;
  } finally {
    conn.release(); // CRITICAL: return the connection to the pool
  }
}
// 2. Configure the connection pool explicitly (node-postgres shown)
const { Pool } = require('pg');
const pool = new Pool({
  max: 20, // max connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});
// 3. Monitor pool errors
pool.on('error', (err) => {
  console.error('Pool error:', err);
});
// 4. Increase pool size (if needed)
// But investigate leaks first!
```
---
## Backend Performance Metrics
**Response Time**:
- p50: <100ms
- p95: <500ms
- p99: <1s
**Throughput**:
- Requests per second (RPS)
- Requests per minute (RPM)
**Error Rate**:
- Target: <0.1%
- 4xx errors: Client errors (validation)
- 5xx errors: Server errors (bugs, downtime)
**Resource Usage**:
- CPU: <70% average
- Memory: <80% of limit
- Connections: <80% of pool size
**Availability**:
- Target: 99.9% (8.76 hours downtime/year)
- 99.99%: 52.6 minutes downtime/year
- 99.999%: 5.26 minutes downtime/year
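If no APM tool is in place yet, the response-time percentiles above can be computed from raw request durations; a minimal sketch using the nearest-rank method (durations in ms, sample data is illustrative):
```javascript
// Compute latency percentiles from an array of request durations (ms).
function percentile(durations, p) {
  if (durations.length === 0) return NaN;
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

const samples = [42, 87, 95, 120, 310, 480, 950, 1200]; // example data
console.log('p50:', percentile(samples, 50), 'ms');
console.log('p95:', percentile(samples, 95), 'ms');
console.log('p99:', percentile(samples, 99), 'ms');
```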
---
## Backend Diagnostic Checklist
**When diagnosing slow backend**:
- [ ] Check application logs for errors
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check database query time (see database-diagnostics.md)
- [ ] Check external API calls (timeout, errors)
- [ ] Check connection pool (target: <80% used)
- [ ] Check error rate (target: <0.1%)
- [ ] Check response time percentiles (p95, p99)
- [ ] Check for thread leaks (increasing thread count)
- [ ] Check for memory leaks (increasing memory over time)
**Tools**:
- Application logs
- APM tools (New Relic, DataDog, AppDynamics)
- `top`, `htop`, `ps`, `lsof`
- `curl` with timing
- Profilers (node --prof, jstack, py-spy)
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [database-diagnostics.md](database-diagnostics.md) - Database troubleshooting
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting
- [monitoring.md](monitoring.md) - Observability tools