Initial commit
481
agents/sre/modules/backend-diagnostics.md
Normal file
@@ -0,0 +1,481 @@
|
||||
# Backend/API Diagnostics
|
||||
|
||||
**Purpose**: Troubleshoot backend services, APIs, and application-level performance issues.
|
||||
|
||||
## Common Backend Issues
|
||||
|
||||
### 1. Slow API Response
|
||||
|
||||
**Symptoms**:
|
||||
- API response time >1 second
|
||||
- Users report slow loading
|
||||
- Timeout errors
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Application Logs
|
||||
```bash
|
||||
# Check for slow requests
|
||||
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'
|
||||
|
||||
# Check error rate
|
||||
grep "ERROR" /var/log/application.log | wc -l
|
||||
|
||||
# Check recent errors
|
||||
tail -f /var/log/application.log | grep "ERROR"
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Repeated errors for same endpoint
|
||||
- Increasing response times
|
||||
- Timeout errors
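
The log checks above can also be scripted. A minimal Node.js sketch, assuming a hypothetical log format where each line contains a level and a `duration=<ms>` field; the format and the `summarizeLog` helper are illustrative assumptions, not part of this module:

```javascript
// Assumed (hypothetical) line format:
//   2025-01-01T00:00:00Z INFO GET /api/users duration=1250
function summarizeLog(lines, slowThresholdMs = 1000) {
  const slow = [];
  let errors = 0;
  let total = 0;
  for (const line of lines) {
    if (line.trim() === '') continue; // skip blank lines
    total += 1;
    if (/\bERROR\b/.test(line)) errors += 1;
    const match = line.match(/duration=(\d+)/);
    if (match && Number(match[1]) > slowThresholdMs) slow.push(line);
  }
  return { total, errors, errorRate: total === 0 ? 0 : errors / total, slow };
}

const sample = [
  '2025-01-01T00:00:00Z INFO GET /api/users duration=1250',
  '2025-01-01T00:00:01Z ERROR GET /api/orders duration=30',
  '2025-01-01T00:00:02Z INFO GET /api/health duration=5',
  '2025-01-01T00:00:03Z INFO GET /api/users duration=90',
];
console.log(summarizeLog(sample));
```

Matching on a named field instead of a column position (`$5` in the `awk` command) keeps the check working if the log layout gains or loses a column.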
|
||||
|
||||
---
|
||||
|
||||
#### Check Application Metrics
|
||||
```bash
|
||||
# CPU usage
|
||||
top -bn1 | grep "node\|java\|python"
|
||||
|
||||
# Memory usage
|
||||
ps aux | grep "node\|java\|python" | awk '{print $4, $11}'
|
||||
|
||||
# Thread count
|
||||
ps -eLf | grep "node\|java\|python" | wc -l
|
||||
|
||||
# Open file descriptors
|
||||
lsof -p <PID> | wc -l
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- CPU >80%
|
||||
- Memory increasing over time
|
||||
- Thread count increasing (thread leak)
|
||||
- File descriptors increasing (connection leak)
|
||||
|
||||
---
|
||||
|
||||
#### Check Database Query Time
|
||||
```bash
|
||||
# If slow, likely database issue
|
||||
# See database-diagnostics.md
|
||||
|
||||
# Check if query time matches API response time
|
||||
# API response time = Query time + Application processing
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Check External API Calls
|
||||
```bash
|
||||
# Check if calling external APIs
|
||||
grep "http.request" /var/log/application.log
|
||||
|
||||
# Check external API response time
|
||||
# Use APM tools or custom instrumentation
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- External API taking >500ms
|
||||
- External API rate limiting (429 errors)
|
||||
- External API errors (5xx errors)
|
||||
|
||||
**Mitigation**:
|
||||
- Cache external API responses
|
||||
- Add timeout (don't wait >5s)
|
||||
- Circuit breaker pattern
|
||||
- Fallback data
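
The timeout, circuit breaker, and fallback items can be combined in one wrapper. This is a sketch under stated assumptions: `CircuitBreaker` and `callWithBreaker` are hypothetical names, the thresholds are placeholders, and a production service would more likely use an established library:

```javascript
// Minimal circuit breaker state machine (illustrative, not a library API).
// Time is passed in explicitly so the behaviour is easy to test.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 10000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.failures = 0;
    this.openedAt = null; // timestamp when the circuit opened, or null if closed
  }

  // True if a call to the external API may be attempted at time `now`.
  canRequest(now = Date.now()) {
    if (this.openedAt === null) return true;
    // After the cool-down, let a trial request through (half-open).
    return now - this.openedAt >= this.resetAfterMs;
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}

// Wraps an async call with a timeout, the breaker, and fallback data.
async function callWithBreaker(breaker, fn, fallback, timeoutMs = 5000) {
  if (!breaker.canRequest()) return fallback;
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), timeoutMs);
  });
  try {
    const result = await Promise.race([fn(), timeout]);
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    return fallback;
  } finally {
    clearTimeout(timer);
  }
}
```

Note that `Promise.race` only stops waiting; it does not cancel the underlying request, so the HTTP client's own timeout or abort mechanism should be set as well.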
|
||||
|
||||
---
|
||||
|
||||
### 2. 5xx Errors (500, 502, 503, 504)
|
||||
|
||||
**Symptoms**:
|
||||
- Users getting error messages
|
||||
- Monitoring alerts for error rate
|
||||
- Some/all requests failing
|
||||
|
||||
**Diagnosis by Error Code**:
|
||||
|
||||
#### 500 Internal Server Error
|
||||
**Cause**: Application code error
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check application logs for exceptions
|
||||
grep "Exception\|Error" /var/log/application.log | tail -20
|
||||
|
||||
# Check stack traces
|
||||
tail -100 /var/log/application.log
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- NullPointerException / TypeError
|
||||
- Unhandled promise rejection
|
||||
- Database connection error
|
||||
- Missing environment variable
|
||||
|
||||
**Mitigation**:
|
||||
- Fix bug in code
|
||||
- Add error handling
|
||||
- Add input validation
|
||||
- Add monitoring for this error
|
||||
|
||||
---
|
||||
|
||||
#### 502 Bad Gateway
|
||||
**Cause**: Reverse proxy can't reach backend
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check if application is running
|
||||
ps aux | grep "node\|java\|python"
|
||||
|
||||
# Check application port
|
||||
netstat -tlnp | grep <PORT>
|
||||
|
||||
# Check reverse proxy logs (nginx, apache)
|
||||
tail -f /var/log/nginx/error.log
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Application crashed
|
||||
- Application not listening on expected port
|
||||
- Firewall blocking connection
|
||||
- Reverse proxy misconfigured
|
||||
|
||||
**Mitigation**:
|
||||
- Restart application
|
||||
- Check application logs for crash reason
|
||||
- Verify port configuration
|
||||
- Check reverse proxy config
|
||||
|
||||
---
|
||||
|
||||
#### 503 Service Unavailable
|
||||
**Cause**: Application overloaded or unhealthy
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check application health
|
||||
curl http://localhost:<PORT>/health
|
||||
|
||||
# Check connection pool
|
||||
# Database connections, HTTP connections
|
||||
|
||||
# Check queue depth
|
||||
# Message queues, task queues
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Too many concurrent requests
|
||||
- Database connection pool exhausted
|
||||
- Dependency service down
|
||||
- Health check failing
|
||||
|
||||
**Mitigation**:
|
||||
- Scale horizontally (add more instances)
|
||||
- Increase connection pool size
|
||||
- Rate limiting
|
||||
- Circuit breaker for dependencies
|
||||
|
||||
---
|
||||
|
||||
#### 504 Gateway Timeout
|
||||
**Cause**: Application took too long to respond
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check what's slow
|
||||
# Database query? External API? Long computation?
|
||||
|
||||
# Check application logs for slow operations
|
||||
grep "slow\|timeout" /var/log/application.log
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Slow database query
|
||||
- Slow external API call
|
||||
- Long-running computation
|
||||
- Deadlock
|
||||
|
||||
**Mitigation**:
|
||||
- Optimize slow operation
|
||||
- Add timeout to prevent indefinite wait
|
||||
- Async processing (return 202 Accepted)
|
||||
- Increase timeout (last resort)
|
||||
|
||||
---
|
||||
|
||||
### 3. Memory Leak (Backend)
|
||||
|
||||
**Symptoms**:
|
||||
- Memory usage increasing over time
|
||||
- Application crashes with OutOfMemoryError
|
||||
- Performance degrades over time
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Monitor Memory Over Time
|
||||
```bash
|
||||
# Linux
|
||||
watch -n 5 'ps aux | grep <PROCESS> | awk "{print \$4, \$5, \$6}"'
|
||||
|
||||
# Get heap dump (Java)
|
||||
jmap -dump:format=b,file=heap.bin <PID>
|
||||
|
||||
# Get heap snapshot (Node.js)
|
||||
node --inspect index.js
|
||||
# Chrome DevTools → Memory → Take heap snapshot
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Memory increasing linearly
|
||||
- Memory not released after GC
|
||||
- Large arrays/objects in heap dump
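
"Increasing linearly" can be checked numerically by fitting a line to periodic memory samples. A rough sketch, assuming samples are collected by the application itself; `growthSlope` and `takeSample` are hypothetical helpers:

```javascript
// Least-squares slope of memory samples, in bytes per second.
// samples: [{ t: <seconds>, rss: <bytes> }, ...]
function growthSlope(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, p) => sum + p.t, 0) / n;
  const meanM = samples.reduce((sum, p) => sum + p.rss, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.rss - meanM);
    den += (p.t - meanT) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// In a live process, a sample can be taken with process.memoryUsage().
function takeSample(startMs = 0) {
  return { t: (Date.now() - startMs) / 1000, rss: process.memoryUsage().rss };
}
```

A slope that stays positive across several garbage collection cycles is a stronger leak signal than any single high reading.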
|
||||
|
||||
---
|
||||
|
||||
#### Common Causes
|
||||
```javascript
|
||||
// 1. Event listeners not removed
|
||||
emitter.on('event', handler); // Never removed
|
||||
|
||||
// 2. Timers not cleared
|
||||
setInterval(() => { /* ... */ }, 1000); // Never cleared
|
||||
|
||||
// 3. Global variables growing
|
||||
global.cache = {}; // Grows forever
|
||||
|
||||
// 4. Closures holding references
function createHandler() {
  const largeData = new Array(1000000).fill(0);
  return () => {
    // Referencing largeData keeps the whole array alive
    // for as long as the returned handler is reachable
    return largeData.length;
  };
}
|
||||
|
||||
// 5. Connection leaks
|
||||
const conn = await db.connect();
|
||||
// Never closed → connection pool exhausted
|
||||
```
|
||||
|
||||
**Mitigation**:
|
||||
```javascript
|
||||
// 1. Remove event listeners
|
||||
const handler = () => { /* ... */ };
|
||||
emitter.on('event', handler);
|
||||
// Later:
|
||||
emitter.off('event', handler);
|
||||
|
||||
// 2. Clear timers
|
||||
const intervalId = setInterval(() => { /* ... */ }, 1000);
|
||||
// Later:
|
||||
clearInterval(intervalId);
|
||||
|
||||
// 3. Use LRU cache
// (lru-cache v10+ uses a named export instead: const { LRUCache } = require('lru-cache');)
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });
|
||||
|
||||
// 4. Be careful with closures
|
||||
function createHandler() {
|
||||
return () => {
|
||||
const largeData = loadData(); // Load when needed
|
||||
};
|
||||
}
|
||||
|
||||
// 5. Always close connections
|
||||
const conn = await db.connect();
|
||||
try {
|
||||
await conn.query(/* ... */);
|
||||
} finally {
|
||||
await conn.close();
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 4. High CPU Usage
|
||||
|
||||
**Symptoms**:
|
||||
- CPU at 100%
|
||||
- Slow response times
|
||||
- Server becomes unresponsive
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Identify CPU-heavy Process
|
||||
```bash
|
||||
# Top CPU processes
|
||||
top -bn1 | head -20
|
||||
|
||||
# CPU per thread (Java)
|
||||
top -H -p <PID>
|
||||
|
||||
# Profile application (Node.js)
|
||||
node --prof index.js
|
||||
node --prof-process isolate-*.log
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Infinite loop
|
||||
- Heavy computation (parsing, encryption)
|
||||
- Regular expression catastrophic backtracking
|
||||
- Large JSON parsing
|
||||
|
||||
**Mitigation**:
|
||||
```javascript
|
||||
// 1. Break up heavy computation
|
||||
async function processLargeArray(items) {
|
||||
for (let i = 0; i < items.length; i++) {
|
||||
await processItem(items[i]);
|
||||
|
||||
// Yield to event loop
|
||||
if (i % 100 === 0) {
|
||||
await new Promise(resolve => setImmediate(resolve));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Use worker threads (Node.js)
|
||||
const { Worker } = require('worker_threads');
|
||||
const worker = new Worker('./heavy-computation.js');
|
||||
|
||||
// 3. Cache results
const cache = new Map();
function expensiveOperation(input) {
  if (cache.has(input)) return cache.get(input);
  const result = heavyComputation(input); // placeholder for the expensive work
  cache.set(input, result);
  return result;
}

// 4. Fix regex
// Bad:  /(.+)*/ (nested quantifiers: catastrophic backtracking)
// Good: /.*/    (single quantifier; making a quantifier lazy does not remove the nesting)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5. Connection Pool Exhausted
|
||||
|
||||
**Symptoms**:
|
||||
- "Connection pool exhausted" errors
|
||||
- "Too many connections" errors
|
||||
- Requests timing out
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Connection Pool
|
||||
```bash
# Database connections
# PostgreSQL:
psql -c "SELECT count(*) FROM pg_stat_activity;"

# MySQL:
mysql -e "SHOW PROCESSLIST;"

# Application connection pool
# Check application metrics/logs
```
|
||||
|
||||
**Red flags**:
|
||||
- Connections = max pool size
|
||||
- Idle connections in transaction
|
||||
- Long-running queries holding connections
|
||||
|
||||
**Common causes**:
|
||||
- Connections not released (missing .close())
|
||||
- Connection leak in error path
|
||||
- Pool size too small
|
||||
- Long-running queries
|
||||
|
||||
**Mitigation**:
|
||||
```javascript
|
||||
// 1. Always close connections
|
||||
async function queryDatabase() {
|
||||
const conn = await pool.connect();
|
||||
try {
|
||||
const result = await conn.query('SELECT * FROM users');
|
||||
return result;
|
||||
} finally {
|
||||
conn.release(); // CRITICAL
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Use connection pool wrapper
|
||||
const pool = new Pool({
|
||||
max: 20, // max connections
|
||||
idleTimeoutMillis: 30000,
|
||||
connectionTimeoutMillis: 2000,
|
||||
});
|
||||
|
||||
// 3. Monitor pool metrics
|
||||
pool.on('error', (err) => {
|
||||
console.error('Pool error:', err);
|
||||
});
|
||||
|
||||
// 4. Increase pool size (if needed)
|
||||
// But investigate leaks first!
|
||||
```
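
Pool pressure can be reported from the pool's own counters. A small sketch, assuming a node-postgres style pool that exposes `totalCount`, `idleCount`, and `waitingCount`; the `poolStatus` helper is hypothetical and the 80% threshold mirrors this module's target:

```javascript
// Summarize pool pressure from counters.
// `counters` is expected to look like a node-postgres Pool
// (totalCount, idleCount, waitingCount); `max` is the configured pool size.
function poolStatus({ totalCount, idleCount, waitingCount }, max) {
  const inUse = totalCount - idleCount;
  const utilization = max === 0 ? 0 : inUse / max;
  return {
    inUse,
    utilization,
    saturated: utilization >= 0.8, // 80% threshold from this module's targets
    queued: waitingCount > 0, // callers are waiting for a free connection
  };
}

console.log(poolStatus({ totalCount: 20, idleCount: 2, waitingCount: 5 }, 20));
```

Logging this on an interval shows whether utilization climbs steadily (likely a leak) or only spikes under load (likely an undersized pool).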
|
||||
|
||||
---
|
||||
|
||||
## Backend Performance Metrics
|
||||
|
||||
**Response Time**:
|
||||
- p50: <100ms
|
||||
- p95: <500ms
|
||||
- p99: <1s
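
If no APM tool is available, the percentiles can be computed directly from recorded request durations. A minimal sketch using the nearest-rank method; `percentile` and `checkLatencyTargets` are hypothetical helpers and the limits are the targets listed above:

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Compare measured durations (ms) against the p50/p95/p99 targets.
function checkLatencyTargets(durationsMs) {
  const targets = { p50: 100, p95: 500, p99: 1000 };
  const result = {};
  for (const [name, limit] of Object.entries(targets)) {
    const value = percentile(durationsMs, Number(name.slice(1)));
    result[name] = { value, ok: value < limit };
  }
  return result;
}
```

Averages hide tail latency, which is why the targets are stated as percentiles: one very slow request barely moves the mean but shows up in p99.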
|
||||
|
||||
**Throughput**:
|
||||
- Requests per second (RPS)
|
||||
- Requests per minute (RPM)
|
||||
|
||||
**Error Rate**:
|
||||
- Target: <0.1%
|
||||
- 4xx errors: Client errors (validation)
|
||||
- 5xx errors: Server errors (bugs, downtime)
|
||||
|
||||
**Resource Usage**:
|
||||
- CPU: <70% average
|
||||
- Memory: <80% of limit
|
||||
- Connections: <80% of pool size
|
||||
|
||||
**Availability**:
|
||||
- Target: 99.9% (8.76 hours downtime/year)
|
||||
- 99.99%: 52.6 minutes downtime/year
|
||||
- 99.999%: 5.26 minutes downtime/year
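
The downtime figures follow from (1 - availability) × time in a year. A short sketch of the arithmetic, assuming a 365-day year; `allowedDowntimeMinutes` is a hypothetical helper:

```javascript
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600 minutes in a non-leap year

// Downtime budget in minutes per year for an availability target in percent.
function allowedDowntimeMinutes(availabilityPercent) {
  return (1 - availabilityPercent / 100) * MINUTES_PER_YEAR;
}

for (const target of [99.9, 99.99, 99.999]) {
  const minutes = allowedDowntimeMinutes(target);
  console.log(`${target}%: ${minutes.toFixed(2)} min/year (${(minutes / 60).toFixed(2)} h)`);
}
```

For 99.9% this gives 0.001 × 525,600 = 525.6 minutes, which is the 8.76 hours listed above.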
|
||||
|
||||
---
|
||||
|
||||
## Backend Diagnostic Checklist
|
||||
|
||||
**When diagnosing slow backend**:
|
||||
|
||||
- [ ] Check application logs for errors
|
||||
- [ ] Check CPU usage (target: <70%)
|
||||
- [ ] Check memory usage (target: <80%)
|
||||
- [ ] Check database query time (see database-diagnostics.md)
|
||||
- [ ] Check external API calls (timeout, errors)
|
||||
- [ ] Check connection pool (target: <80% used)
|
||||
- [ ] Check error rate (target: <0.1%)
|
||||
- [ ] Check response time percentiles (p95, p99)
|
||||
- [ ] Check for thread leaks (increasing thread count)
|
||||
- [ ] Check for memory leaks (increasing memory over time)
|
||||
|
||||
**Tools**:
|
||||
- Application logs
|
||||
- APM tools (New Relic, DataDog, AppDynamics)
|
||||
- `top`, `htop`, `ps`, `lsof`
|
||||
- `curl` with timing
|
||||
- Profilers and thread dumps (`node --prof`, `jstack`, `py-spy`)
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [SKILL.md](../SKILL.md) - Main SRE agent
|
||||
- [database-diagnostics.md](database-diagnostics.md) - Database troubleshooting
|
||||
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting
|
||||
- [monitoring.md](monitoring.md) - Observability tools
|
||||
509
agents/sre/modules/database-diagnostics.md
Normal file
@@ -0,0 +1,509 @@
|
||||
# Database Diagnostics
|
||||
|
||||
**Purpose**: Troubleshoot database performance, slow queries, deadlocks, and connection issues.
|
||||
|
||||
## Common Database Issues
|
||||
|
||||
### 1. Slow Query
|
||||
|
||||
**Symptoms**:
|
||||
- API response time high
|
||||
- Specific endpoint slow
|
||||
- Database CPU high
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Enable Slow Query Log (PostgreSQL)
|
||||
```sql
|
||||
-- Set slow query threshold (1 second)
|
||||
ALTER SYSTEM SET log_min_duration_statement = 1000;
|
||||
SELECT pg_reload_conf();
|
||||
|
||||
-- Check slow query log
|
||||
-- /var/log/postgresql/postgresql.log
|
||||
```
|
||||
|
||||
#### Enable Slow Query Log (MySQL)
|
||||
```sql
|
||||
-- Enable slow query log
|
||||
SET GLOBAL slow_query_log = 'ON';
|
||||
SET GLOBAL long_query_time = 1;
|
||||
|
||||
-- Check slow query log
|
||||
-- /var/log/mysql/mysql-slow.log
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Analyze Query with EXPLAIN
|
||||
```sql
|
||||
-- PostgreSQL
|
||||
EXPLAIN ANALYZE
|
||||
SELECT users.*, posts.*
|
||||
FROM users
|
||||
LEFT JOIN posts ON posts.user_id = users.id
|
||||
WHERE users.last_login_at > NOW() - INTERVAL '30 days';
|
||||
|
||||
-- Look for:
|
||||
-- - Seq Scan (sequential scan = BAD for large tables)
|
||||
-- - High cost numbers
|
||||
-- - High actual time
|
||||
```
|
||||
|
||||
**Red flags in EXPLAIN output**:
|
||||
- **Seq Scan** on large table (>10k rows) → Missing index
|
||||
- **Nested Loop** with large outer table → Missing index
|
||||
- **Hash Join** with large tables → Consider index
|
||||
- **Actual rows** far from **estimated rows** → Statistics outdated (run `ANALYZE`)
|
||||
|
||||
**Example Bad Query**:
|
||||
```
|
||||
Seq Scan on users (cost=0.00..100000 rows=10000000)
|
||||
Filter: (last_login_at > '2025-09-26'::date)
|
||||
Rows Removed by Filter: 9900000
|
||||
```
|
||||
→ **Missing index on last_login_at**
|
||||
|
||||
---
|
||||
|
||||
#### Check Missing Indexes
|
||||
```sql
|
||||
-- PostgreSQL: Find tables that may be missing indexes
SELECT
  schemaname,
  relname AS tablename,
  seq_scan,
  seq_tup_read,
  idx_scan,
  seq_tup_read / seq_scan AS avg_seq_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Large tables with high seq_scan and low idx_scan are candidates for an index
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Create Index
|
||||
```sql
|
||||
-- PostgreSQL (CONCURRENTLY avoids blocking writes; it cannot run inside a transaction block)
|
||||
CREATE INDEX CONCURRENTLY idx_users_last_login_at
|
||||
ON users(last_login_at);
|
||||
|
||||
-- Verify index is used
|
||||
EXPLAIN ANALYZE
|
||||
SELECT * FROM users WHERE last_login_at > NOW() - INTERVAL '30 days';
|
||||
-- Should show: Index Scan using idx_users_last_login_at
|
||||
```
|
||||
|
||||
**Impact**:
|
||||
- Before: 7.8 seconds (Seq Scan)
|
||||
- After: 50ms (Index Scan)
|
||||
|
||||
---
|
||||
|
||||
### 2. Database Deadlock
|
||||
|
||||
**Symptoms**:
|
||||
- "Deadlock detected" errors
|
||||
- Transactions timing out
|
||||
- API 500 errors
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check for Deadlocks (PostgreSQL)
|
||||
```sql
|
||||
-- Check currently locked queries
|
||||
SELECT
|
||||
blocked_locks.pid AS blocked_pid,
|
||||
blocked_activity.usename AS blocked_user,
|
||||
blocking_locks.pid AS blocking_pid,
|
||||
blocking_activity.usename AS blocking_user,
|
||||
blocked_activity.query AS blocked_statement,
|
||||
blocking_activity.query AS blocking_statement
|
||||
FROM pg_catalog.pg_locks blocked_locks
|
||||
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
|
||||
JOIN pg_catalog.pg_locks blocking_locks
|
||||
ON blocking_locks.locktype = blocked_locks.locktype
|
||||
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
|
||||
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
|
||||
AND blocking_locks.pid != blocked_locks.pid
|
||||
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
|
||||
WHERE NOT blocked_locks.granted;
|
||||
```
|
||||
|
||||
#### Check for Deadlocks (MySQL)
|
||||
```sql
|
||||
-- Show InnoDB status (includes deadlock info)
|
||||
SHOW ENGINE INNODB STATUS\G
|
||||
|
||||
-- Look for "LATEST DETECTED DEADLOCK" section
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Common Deadlock Patterns
|
||||
```sql
|
||||
-- Pattern 1: Lock order mismatch
|
||||
-- Transaction 1:
|
||||
BEGIN;
|
||||
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
|
||||
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
|
||||
COMMIT;
|
||||
|
||||
-- Transaction 2 (runs concurrently):
|
||||
BEGIN;
|
||||
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- Locks id=2
|
||||
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- Waits for id=1 (deadlock!)
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
**Fix**: Always lock rows in the same order

```sql
-- Every transaction locks the rows in ascending id order first,
-- then applies its own debit and credit
BEGIN;
SELECT id FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```
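
The same ordering rule can be enforced in application code so that no caller has to remember it. A sketch under stated assumptions: the `accounts` table follows the example above, `buildTransfer` is a hypothetical helper, and the `$1` placeholders and `ANY($1)` array parameter assume a node-postgres style driver:

```javascript
// Build the statements for a transfer so that rows are always locked
// in ascending id order, whichever direction the money moves.
function buildTransfer(fromId, toId, amount) {
  if (fromId === toId) throw new Error('fromId and toId must differ');
  const lockOrder = [fromId, toId].sort((a, b) => a - b);
  return [
    { text: 'BEGIN', values: [] },
    {
      // Lock both rows up front, lowest id first
      text: 'SELECT id FROM accounts WHERE id = ANY($1) ORDER BY id FOR UPDATE',
      values: [lockOrder],
    },
    { text: 'UPDATE accounts SET balance = balance - $1 WHERE id = $2', values: [amount, fromId] },
    { text: 'UPDATE accounts SET balance = balance + $1 WHERE id = $2', values: [amount, toId] },
    { text: 'COMMIT', values: [] },
  ];
}

console.log(buildTransfer(2, 1, 50).map((statement) => statement.text));
```

Separating the locking step from the debit and credit keeps the lock order independent of which account is the sender.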
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```sql
|
||||
-- PostgreSQL: Kill blocking query
|
||||
SELECT pg_terminate_backend(<blocking_pid>);
|
||||
|
||||
-- PostgreSQL: Kill idle transactions
|
||||
SELECT pg_terminate_backend(pid)
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'idle in transaction'
|
||||
AND state_change < NOW() - INTERVAL '5 minutes';
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3. Connection Pool Exhausted
|
||||
|
||||
**Symptoms**:
|
||||
- "Too many connections" errors
|
||||
- "Connection pool exhausted" errors
|
||||
- New connections timing out
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Active Connections (PostgreSQL)
|
||||
```sql
|
||||
-- Count connections by state
|
||||
SELECT state, count(*)
|
||||
FROM pg_stat_activity
|
||||
GROUP BY state;
|
||||
|
||||
-- Show all connections
|
||||
SELECT pid, usename, application_name, state, query
|
||||
FROM pg_stat_activity
|
||||
WHERE state != 'idle';
|
||||
|
||||
-- Check max connections
|
||||
SHOW max_connections;
|
||||
```
|
||||
|
||||
#### Check Active Connections (MySQL)
|
||||
```sql
|
||||
-- Show all connections
|
||||
SHOW PROCESSLIST;
|
||||
|
||||
-- Count connections by state
|
||||
SELECT state, COUNT(*)
|
||||
FROM information_schema.processlist
|
||||
GROUP BY state;
|
||||
|
||||
-- Check max connections
|
||||
SHOW VARIABLES LIKE 'max_connections';
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Connections = max_connections
|
||||
- Many "idle in transaction" (connections held but not used)
|
||||
- Long-running queries holding connections
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```sql
|
||||
-- PostgreSQL: Kill idle connections
|
||||
SELECT pg_terminate_backend(pid)
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'idle'
|
||||
AND state_change < NOW() - INTERVAL '10 minutes';
|
||||
|
||||
-- Increase max_connections (temporary; takes effect only after a server restart)
ALTER SYSTEM SET max_connections = 200;
-- pg_reload_conf() does not apply this setting; restart PostgreSQL instead
|
||||
```
|
||||
|
||||
**Long-term Fix**:
|
||||
- Fix connection leaks in application code
|
||||
- Increase connection pool size (if needed)
|
||||
- Add connection timeout
|
||||
- Use connection pooler (PgBouncer, ProxySQL)
|
||||
|
||||
---
|
||||
|
||||
### 4. High Database CPU
|
||||
|
||||
**Symptoms**:
|
||||
- Database CPU >80%
|
||||
- All queries slow
|
||||
- Server overload
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Find CPU-heavy Queries (PostgreSQL)
|
||||
```sql
|
||||
-- Top queries by total time
|
||||
SELECT
|
||||
query,
|
||||
calls,
|
||||
total_exec_time,
|
||||
mean_exec_time,
|
||||
max_exec_time
|
||||
FROM pg_stat_statements
|
||||
ORDER BY total_exec_time DESC
|
||||
LIMIT 10;
|
||||
|
||||
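-- Requires pg_stat_statements in shared_preload_libraries (restart), then: CREATE EXTENSION pg_stat_statements;
-- Before PostgreSQL 13 the columns are total_time, mean_time, max_time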
|
||||
```
|
||||
|
||||
#### Find CPU-heavy Queries (MySQL)
|
||||
```sql
|
||||
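-- Check that performance schema is enabled (read-only at runtime;
-- set performance_schema=ON in my.cnf and restart if it is OFF)
SHOW VARIABLES LIKE 'performance_schema';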
|
||||
|
||||
-- Top queries by execution time
|
||||
SELECT
|
||||
DIGEST_TEXT,
|
||||
COUNT_STAR,
|
||||
SUM_TIMER_WAIT,
|
||||
AVG_TIMER_WAIT
|
||||
FROM performance_schema.events_statements_summary_by_digest
|
||||
ORDER BY SUM_TIMER_WAIT DESC
|
||||
LIMIT 10;
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Missing indexes (Seq Scan)
|
||||
- Complex queries (many JOINs)
|
||||
- Aggregations on large tables
|
||||
- Full table scans
|
||||
|
||||
**Mitigation**:
|
||||
- Add missing indexes
|
||||
- Optimize queries (reduce JOINs)
|
||||
- Add query caching
|
||||
- Scale database (read replicas)
|
||||
|
||||
---
|
||||
|
||||
### 5. Disk Full
|
||||
|
||||
**Symptoms**:
|
||||
- "No space left on device" errors
|
||||
- Database refuses writes
|
||||
- Application crashes
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Disk Usage
|
||||
```bash
|
||||
# Linux
|
||||
df -h
|
||||
|
||||
# Database data directory
|
||||
du -sh /var/lib/postgresql/data/*
|
||||
du -sh /var/lib/mysql/*
|
||||
|
||||
# Find large tables
# PostgreSQL:
psql <<'SQL'
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
SQL
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
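# 1. Clean up rotated logs
rm /var/log/postgresql/postgresql-*.log.1
rm /var/log/mysql/mysql-slow.log.1

# 2. Vacuum database (PostgreSQL)
# Caution: VACUUM FULL rewrites each table, needs free space for the copy,
# and takes an exclusive lock; on a nearly full disk, target one bloated table
psql -c "VACUUM FULL <table_name>;"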
|
||||
|
||||
# 3. Archive old data
|
||||
# Move old records to archive table or backup
|
||||
|
||||
# 4. Expand disk (cloud)
|
||||
# AWS: Modify EBS volume size
|
||||
# Azure: Expand managed disk
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 6. Replication Lag
|
||||
|
||||
**Symptoms**:
|
||||
- Stale data on read replicas
|
||||
- Monitoring alerts for lag
|
||||
- Eventually consistent reads
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Replication Lag (PostgreSQL)
|
||||
```sql
|
||||
-- On primary:
|
||||
SELECT * FROM pg_stat_replication;
|
||||
|
||||
-- On replica:
|
||||
SELECT
|
||||
now() - pg_last_xact_replay_timestamp() AS replication_lag;
|
||||
```
|
||||
|
||||
#### Check Replication Lag (MySQL)
|
||||
```sql
|
||||
-- On replica (MySQL 8.0.22+; older versions use SHOW SLAVE STATUS):
SHOW REPLICA STATUS\G

-- Look for: Seconds_Behind_Source (Seconds_Behind_Master on older versions)
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Lag >1 minute
|
||||
- Lag increasing over time
|
||||
|
||||
**Common causes**:
|
||||
- High write load on primary
|
||||
- Replica under-provisioned
|
||||
- Network latency
|
||||
- Long-running query blocking replay
|
||||
|
||||
**Mitigation**:
|
||||
- Scale up replica (more CPU, memory)
|
||||
- Optimize slow queries on primary
|
||||
- Increase network bandwidth
|
||||
- Add more replicas (distribute read load)
|
||||
|
||||
---
|
||||
|
||||
## Database Performance Metrics
|
||||
|
||||
**Query Performance**:
|
||||
- p50 query time: <10ms
|
||||
- p95 query time: <100ms
|
||||
- p99 query time: <500ms
|
||||
|
||||
**Resource Usage**:
|
||||
- CPU: <70% average
|
||||
- Memory: <80% of available
|
||||
- Disk I/O: <80% of throughput
|
||||
- Connections: <80% of max
|
||||
|
||||
**Availability**:
|
||||
- Uptime: 99.99% (52.6 min downtime/year)
|
||||
- Replication lag: <1 second
|
||||
|
||||
---
|
||||
|
||||
## Database Diagnostic Checklist
|
||||
|
||||
**When diagnosing slow database**:
|
||||
|
||||
- [ ] Check slow query log
|
||||
- [ ] Run EXPLAIN ANALYZE on slow queries
|
||||
- [ ] Check for missing indexes (seq_scan > idx_scan)
|
||||
- [ ] Check for deadlocks
|
||||
- [ ] Check connection count (target: <80% of max)
|
||||
- [ ] Check database CPU (target: <70%)
|
||||
- [ ] Check disk space (target: <80% used)
|
||||
- [ ] Check replication lag (target: <1s)
|
||||
- [ ] Check for long-running queries (>30s)
|
||||
- [ ] Check for idle transactions (>5 min)
|
||||
|
||||
**Tools**:
|
||||
- `EXPLAIN ANALYZE`
|
||||
- `pg_stat_statements` (PostgreSQL)
|
||||
- Performance Schema (MySQL)
|
||||
- `pg_stat_activity` (PostgreSQL)
|
||||
- `SHOW PROCESSLIST` (MySQL)
|
||||
- Database monitoring (CloudWatch, DataDog)
|
||||
|
||||
---
|
||||
|
||||
## Database Anti-Patterns
|
||||
|
||||
### 1. N+1 Query Problem
|
||||
```javascript
|
||||
// BAD: N+1 queries
|
||||
const users = await db.query('SELECT * FROM users');
|
||||
for (const user of users) {
|
||||
const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
|
||||
}
|
||||
// 1 query + N queries = N+1
|
||||
|
||||
// GOOD: Single query with JOIN
|
||||
const usersWithPosts = await db.query(`
|
||||
SELECT users.*, posts.*
|
||||
FROM users
|
||||
LEFT JOIN posts ON posts.user_id = users.id
|
||||
`);
|
||||
```
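
A JOIN returns one flat row per user/post pair, so the application still has to regroup the rows. A minimal sketch, assuming the query aliases its columns as `user_id`, `user_name`, `post_id`, and `post_title` (these aliases and `groupPostsByUser` are illustrative; `SELECT users.*, posts.*` would produce colliding `id` columns):

```javascript
// Regroup flat JOIN rows into one object per user with a posts array.
function groupPostsByUser(rows) {
  const users = new Map();
  for (const row of rows) {
    if (!users.has(row.user_id)) {
      users.set(row.user_id, { id: row.user_id, name: row.user_name, posts: [] });
    }
    // LEFT JOIN yields post_id = null for users without posts
    if (row.post_id !== null) {
      users.get(row.user_id).posts.push({ id: row.post_id, title: row.post_title });
    }
  }
  return [...users.values()];
}

const joinedRows = [
  { user_id: 1, user_name: 'Ada', post_id: 10, post_title: 'First' },
  { user_id: 1, user_name: 'Ada', post_id: 11, post_title: 'Second' },
  { user_id: 2, user_name: 'Lin', post_id: null, post_title: null },
];
console.log(JSON.stringify(groupPostsByUser(joinedRows)));
```

This keeps the database round trips at one regardless of how many users are returned.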
|
||||
|
||||
### 2. SELECT *
|
||||
```sql
|
||||
-- BAD: Fetches all columns (inefficient)
|
||||
SELECT * FROM users WHERE id = 1;
|
||||
|
||||
-- GOOD: Fetch only needed columns
|
||||
SELECT id, name, email FROM users WHERE id = 1;
|
||||
```
|
||||
|
||||
### 3. Missing Indexes
|
||||
```sql
|
||||
-- BAD: No index on frequently queried column
|
||||
SELECT * FROM users WHERE email = 'user@example.com';
|
||||
-- Seq Scan on users
|
||||
|
||||
-- GOOD: Add index
|
||||
CREATE INDEX idx_users_email ON users(email);
|
||||
-- Index Scan using idx_users_email
|
||||
```
|
||||
|
||||
### 4. Long Transactions
|
||||
```javascript
|
||||
// BAD: Long transaction holding locks
await db.query('BEGIN');
const lockedUser = await db.query('SELECT * FROM users WHERE id = 1 FOR UPDATE');
await sendEmail(lockedUser.email); // External API call (slow!) while the row stays locked
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');
await db.query('COMMIT');

// GOOD: Keep transactions short
const user = await db.query('SELECT * FROM users WHERE id = 1');
await sendEmail(user.email); // Outside transaction
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [SKILL.md](../SKILL.md) - Main SRE agent
|
||||
- [backend-diagnostics.md](backend-diagnostics.md) - Backend troubleshooting
|
||||
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting
|
||||
561
agents/sre/modules/infrastructure.md
Normal file
@@ -0,0 +1,561 @@
|
||||
# Infrastructure Diagnostics
|
||||
|
||||
**Purpose**: Troubleshoot server, network, disk, and cloud infrastructure issues.
|
||||
|
||||
## Common Infrastructure Issues
|
||||
|
||||
### 1. High CPU Usage (Server)
|
||||
|
||||
**Symptoms**:
|
||||
- Server CPU at 100%
|
||||
- Applications slow
|
||||
- SSH lag
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check CPU Usage
|
||||
```bash
|
||||
# Overall CPU usage
|
||||
top -bn1 | grep "Cpu(s)"
|
||||
|
||||
# Top CPU processes
|
||||
top -bn1 | head -20
|
||||
|
||||
# CPU usage per core
|
||||
mpstat -P ALL 1 5
|
||||
|
||||
# Historical CPU (if sar installed)
|
||||
sar -u 1 10
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- CPU at 100% for >5 minutes
|
||||
- Single process using >80% CPU
|
||||
- iowait >20% (disk bottleneck)
|
||||
- System CPU >30% (kernel overhead)
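
The iowait and system thresholds can be checked from two readings of the aggregate `cpu` line in `/proc/stat` (Linux). A sketch, assuming the standard field order user, nice, system, idle, iowait, irq, softirq, steal; `parseCpuLine` and `cpuBreakdown` are hypothetical helpers:

```javascript
// Parse the aggregate "cpu" line of /proc/stat into named tick counters.
function parseCpuLine(line) {
  const [label, ...fields] = line.trim().split(/\s+/);
  if (label !== 'cpu') throw new Error('expected the aggregate "cpu" line');
  const [user, nice, system, idle, iowait, irq, softirq, steal] = fields.map(Number);
  return { user, nice, system, idle, iowait, irq, softirq, steal: steal || 0 };
}

// Percentages over the interval between two samples of that line.
function cpuBreakdown(prevLine, currLine) {
  const prev = parseCpuLine(prevLine);
  const curr = parseCpuLine(currLine);
  const delta = {};
  let total = 0;
  for (const key of Object.keys(curr)) {
    delta[key] = curr[key] - prev[key];
    total += delta[key];
  }
  const pct = (ticks) => (total === 0 ? 0 : (100 * ticks) / total);
  const iowait = pct(delta.iowait);
  const system = pct(delta.system);
  return {
    busy: pct(total - delta.idle - delta.iowait),
    iowait,
    system,
    flags: { highIowait: iowait > 20, highSystem: system > 30 },
  };
}
```

The counters are cumulative since boot, so only the difference between two samples reflects current load.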
|
||||
|
||||
---
|
||||
|
||||
#### Identify CPU-heavy Process
|
||||
```bash
|
||||
# Top CPU process
|
||||
ps aux | sort -nrk 3,3 | head -10
|
||||
|
||||
# CPU per thread
|
||||
top -H
|
||||
|
||||
# Process tree
|
||||
pstree -p
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Application bug (infinite loop)
|
||||
- Heavy computation
|
||||
- Crypto mining malware
|
||||
- Backup/compression running
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
# 1. Limit process CPU (nice)
|
||||
renice +10 <PID> # Lower priority
|
||||
|
||||
# 2. Kill process (last resort)
|
||||
kill -TERM <PID> # Graceful
|
||||
kill -KILL <PID> # Force kill
|
||||
|
||||
# 3. Scale horizontally (add servers)
|
||||
# Cloud: Auto-scaling group
|
||||
|
||||
# 4. Scale vertically (bigger instance)
|
||||
# Cloud: Resize instance
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 2. Out of Memory (OOM)
|
||||
|
||||
**Symptoms**:
|
||||
- "Out of memory" errors
|
||||
- OOM Killer triggered
|
||||
- Applications crash
|
||||
- Swap usage high
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Memory Usage
|
||||
```bash
|
||||
# Current memory usage
|
||||
free -h
|
||||
|
||||
# Memory per process
|
||||
ps aux | sort -nrk 4,4 | head -10
|
||||
|
||||
# Check OOM killer logs
|
||||
dmesg | grep -i "out of memory\|oom"
|
||||
grep "Out of memory" /var/log/syslog
|
||||
|
||||
# Check swap usage
|
||||
swapon -s
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Available memory <10%
|
||||
- Swap usage >80%
|
||||
- OOM killer active
|
||||
- Single process using >50% memory
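
The first two thresholds can be evaluated from `/proc/meminfo` (Linux). A minimal sketch, assuming the `MemTotal`, `MemAvailable`, `SwapTotal`, and `SwapFree` fields are present; `parseMeminfo` and `memoryFlags` are hypothetical helpers:

```javascript
// Parse "Key:   12345 kB" lines into { Key: 12345 } (values in kB).
function parseMeminfo(text) {
  const values = {};
  for (const line of text.split('\n')) {
    const match = line.match(/^(\w+):\s+(\d+)/);
    if (match) values[match[1]] = Number(match[2]);
  }
  return values;
}

// Apply the "available <10%" and "swap >80%" red flags.
function memoryFlags(text) {
  const m = parseMeminfo(text);
  const availablePct = (100 * m.MemAvailable) / m.MemTotal;
  const swapUsedPct =
    m.SwapTotal > 0 ? (100 * (m.SwapTotal - m.SwapFree)) / m.SwapTotal : 0;
  return {
    availablePct,
    swapUsedPct,
    lowMemory: availablePct < 10,
    highSwap: swapUsedPct > 80,
  };
}
```

On a live host the input would come from `fs.readFileSync('/proc/meminfo', 'utf8')`. `MemAvailable` is used rather than `MemFree` because it accounts for reclaimable page cache.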
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
# 1. Free page cache (requires root; the kernel reclaims cache on its own, so gains are usually small)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# 2. Kill memory-heavy process
kill -9 <PID>

# 3. Increase swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
|
||||
|
||||
# 4. Scale up (more RAM)
|
||||
# Cloud: Resize instance
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3. Disk Full
|
||||
|
||||
**Symptoms**:
|
||||
- "No space left on device" errors
|
||||
- Applications can't write files
|
||||
- Database refuses writes
|
||||
- Logs not being written
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Disk Usage
|
||||
```bash
|
||||
# Disk usage by partition
|
||||
df -h
|
||||
|
||||
# Disk usage by directory
|
||||
du -sh /*
|
||||
du -sh /var/*
|
||||
|
||||
# Find large files
|
||||
find / -type f -size +100M -exec ls -lh {} \;
|
||||
|
||||
# Find files using deleted space
|
||||
lsof | grep deleted
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Disk usage >90%
|
||||
- /var/log full (runaway logs)
|
||||
- /tmp full (temp files not cleaned)
|
||||
- Deleted files still holding space (process has handle)
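
The >90% check can be automated by parsing `df -P` output, which prints one filesystem per line in a fixed column order. A sketch, assuming that POSIX output format; `fullFilesystems` is a hypothetical helper:

```javascript
// Return filesystems whose usage exceeds the threshold, from `df -P` output.
// Columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
function fullFilesystems(dfOutput, thresholdPct = 90) {
  return dfOutput
    .trim()
    .split('\n')
    .slice(1) // skip the header line
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols.length >= 6)
    .map((cols) => ({
      filesystem: cols[0],
      usedPct: parseInt(cols[4], 10), // "93%" -> 93
      mount: cols.slice(5).join(' '),
    }))
    .filter((fs) => fs.usedPct > thresholdPct);
}

const dfSample = `Filesystem 1024-blocks     Used Available Capacity Mounted on
/dev/xvda1   51474912 47871668   3603244      93% /
tmpfs         4013348        0   4013348       0% /dev/shm
/dev/xvdb1  103081248 51540624  51540624      50% /var/lib/data`;
console.log(fullFilesystems(dfSample));
```

The output could be obtained with `child_process.execSync('df -P')` on the host being checked.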
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
# 1. Clean up logs
|
||||
find /var/log -name "*.log.*" -mtime +7 -delete
|
||||
journalctl --vacuum-time=7d
|
||||
|
||||
# 2. Clean up temp files
|
||||
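find /tmp -xdev -type f -mtime +7 -delete       # older than 7 days; 'rm -rf /tmp/*' can remove files and sockets still in use
find /var/tmp -xdev -type f -mtime +7 -delete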
|
||||
|
||||
# 3. Release space held by deleted-but-open files
lsof +L1                      # list open files with link count 0 (deleted)
systemctl restart <service>   # restart the owning service to release the handle; avoid a blanket kill -9
|
||||
|
||||
# 4. Compress logs
|
||||
gzip /var/log/*.log
|
||||
|
||||
# 5. Expand disk (cloud)
|
||||
# AWS: Modify EBS volume size
|
||||
# Azure: Expand managed disk
|
||||
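# After expanding, grow the partition first, then the filesystem:
growpart /dev/xvda 1 # from cloud-utils; skip if the filesystem sits directly on the device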
|
||||
resize2fs /dev/xvda1 # ext4
|
||||
xfs_growfs / # xfs
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 4. Network Issues
|
||||
|
||||
**Symptoms**:
|
||||
- Slow network performance
|
||||
- Timeouts
|
||||
- Connection refused
|
||||
- High latency
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Network Connectivity
|
||||
```bash
|
||||
# Ping test
|
||||
ping -c 5 google.com
|
||||
|
||||
# DNS resolution
|
||||
nslookup example.com
|
||||
dig example.com
|
||||
|
||||
# Traceroute
|
||||
traceroute example.com
|
||||
|
||||
# Check network interfaces
|
||||
ip addr show
|
||||
ifconfig
|
||||
|
||||
# Check routing table
|
||||
ip route show
|
||||
route -n
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Packet loss >1%
|
||||
- Latency >100ms (same region)
|
||||
- DNS resolution failures
|
||||
- Interface down
|
||||
|
||||
---
|
||||
|
||||
#### Check Network Bandwidth
|
||||
```bash
|
||||
# Current bandwidth usage
|
||||
iftop -i eth0
|
||||
|
||||
# Network stats
|
||||
netstat -i
|
||||
|
||||
# Historical bandwidth (if vnstat installed)
|
||||
vnstat -d   # daily totals; 'vnstat -l' shows live traffic instead
|
||||
|
||||
# Check for bandwidth limits (cloud)
|
||||
# AWS: Check CloudWatch NetworkIn/NetworkOut
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Check Firewall Rules
|
||||
```bash
|
||||
# Check iptables rules
|
||||
iptables -L -n -v
|
||||
|
||||
# Check firewalld (RHEL/CentOS)
|
||||
firewall-cmd --list-all
|
||||
|
||||
# Check UFW (Ubuntu)
|
||||
ufw status verbose
|
||||
|
||||
# Check security groups (cloud)
|
||||
# AWS: EC2 → Security Groups
|
||||
# Azure: Network Security Groups
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Firewall blocking traffic
|
||||
- Security group misconfigured
|
||||
- MTU mismatch
|
||||
- Network congestion
|
||||
- DDoS attack
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
# 1. Allow HTTP/HTTPS through the firewall (-I inserts ahead of existing DROP/REJECT rules; -A would append after them)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT
|
||||
|
||||
# 2. Restart networking
|
||||
systemctl restart networking
|
||||
systemctl restart NetworkManager
|
||||
|
||||
# 3. Flush DNS cache
|
||||
resolvectl flush-caches   # older systemd releases: systemd-resolve --flush-caches
|
||||
|
||||
# 4. Check cloud network ACLs
|
||||
# Ensure subnet has route to internet gateway
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5. High Disk I/O (Slow Disk)
|
||||
|
||||
**Symptoms**:
|
||||
- Applications slow
|
||||
- High iowait CPU
|
||||
- Disk latency high
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Disk I/O
|
||||
```bash
|
||||
# Disk I/O stats
|
||||
iostat -x 1 5
|
||||
|
||||
# Look for:
|
||||
# - %util >80% (disk saturated)
|
||||
# - await >100ms (high latency)
|
||||
|
||||
# Top I/O processes
|
||||
iotop -o
|
||||
|
||||
# Historical I/O (if sar installed)
|
||||
sar -d   # today's recorded data; 'sar -d 1 10' takes ten live samples instead
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- %util at 100%
|
||||
- await >100ms
|
||||
- iowait CPU >20%
|
||||
- Queue size (aqu-sz; avgqu-sz in older sysstat) >10
|
||||
|
||||
---
|
||||
|
||||
#### Common Causes
|
||||
```bash
|
||||
# 1. Database without indexes (Seq Scan)
|
||||
# See database-diagnostics.md
|
||||
|
||||
# 2. Log rotation running
|
||||
# Large logs being compressed
|
||||
|
||||
# 3. Backup running
|
||||
# Database dump, file backup
|
||||
|
||||
# 4. Disk issue (bad sectors)
|
||||
dmesg | grep -i "I/O error"
|
||||
smartctl -a /dev/sda # SMART status
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
# 1. Reduce I/O pressure
|
||||
# Stop non-critical processes (backup, log rotation)
|
||||
|
||||
# 2. Add read cache
|
||||
# Enable query caching (database)
|
||||
# Add Redis for application cache
|
||||
|
||||
# 3. Scale disk IOPS (cloud)
|
||||
# AWS: Change EBS volume type (gp2 → gp3 → io1)
|
||||
# Azure: Change disk tier
|
||||
|
||||
# 4. Move to SSD (if on HDD)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 6. Service Down / Process Crashed
|
||||
|
||||
**Symptoms**:
|
||||
- Service not responding
|
||||
- Health check failures
|
||||
- 502 Bad Gateway
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Service Status
|
||||
```bash
|
||||
# Systemd services
|
||||
systemctl status nginx
|
||||
systemctl status postgresql
|
||||
systemctl status application
|
||||
|
||||
# Check if process running
|
||||
ps aux | grep nginx
|
||||
pidof nginx
|
||||
|
||||
# Check service logs
|
||||
journalctl -u nginx -n 50
|
||||
tail -f /var/log/nginx/error.log
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Service: inactive (dead)
|
||||
- Process not found
|
||||
- Recent crash in logs
|
||||
|
||||
---
|
||||
|
||||
#### Check Why Service Crashed
|
||||
```bash
|
||||
# Check system logs
|
||||
dmesg | tail -50
|
||||
grep "error\|segfault\|killed" /var/log/syslog
|
||||
|
||||
# Check application logs
|
||||
tail -100 /var/log/application.log
|
||||
|
||||
# Check for OOM killer
|
||||
dmesg | grep -i "killed process"
|
||||
|
||||
# Check core dumps
|
||||
ls -l /var/crash/
|
||||
ls -l /tmp/core*
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Out of memory (OOM Killer)
|
||||
- Segmentation fault (code bug)
|
||||
- Unhandled exception
|
||||
- Dependency service down
|
||||
- Configuration error
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
# 1. Restart service
|
||||
systemctl restart nginx
|
||||
|
||||
# 2. Check if started successfully
|
||||
systemctl status nginx
|
||||
curl http://localhost
|
||||
|
||||
# 3. If startup fails, check config
|
||||
nginx -t # Test nginx config
|
||||
postgres -D /var/lib/postgresql/data -C data_directory # Parses postgresql.conf and exits; PostgreSQL has no --config-test flag
|
||||
|
||||
# 4. Enable auto-restart (systemd)
|
||||
# Add to service file:
|
||||
[Service]
|
||||
Restart=always
|
||||
RestartSec=10
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 7. Cloud Infrastructure Issues
|
||||
|
||||
#### AWS-Specific
|
||||
|
||||
**Instance Issues**:
|
||||
```bash
|
||||
# Check instance health
|
||||
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
|
||||
|
||||
# Check system logs
|
||||
aws ec2 get-console-output --instance-id i-1234567890abcdef0
|
||||
|
||||
# Check CloudWatch metrics
|
||||
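# --start-time, --end-time, --period and a statistic are required; the window below is an example
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2025-10-26T11:00:00Z \
  --end-time 2025-10-26T12:00:00Z \
  --period 300 \
  --statistics Average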
|
||||
```
|
||||
|
||||
**EBS Volume Issues**:
|
||||
```bash
|
||||
# Check volume status
|
||||
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0
|
||||
|
||||
# Increase IOPS (gp3)
|
||||
aws ec2 modify-volume \
|
||||
--volume-id vol-1234567890abcdef0 \
|
||||
  --iops 6000   # gp3 baseline is 3000 IOPS, so choose a value above that
|
||||
|
||||
# Check volume metrics
|
||||
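# --dimensions selects the volume; the time window below is an example
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-1234567890abcdef0 \
  --start-time 2025-10-26T11:00:00Z \
  --end-time 2025-10-26T12:00:00Z \
  --period 300 \
  --statistics Sum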
|
||||
```
|
||||
|
||||
**Network Issues**:
|
||||
```bash
|
||||
# Check security groups
|
||||
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0
|
||||
|
||||
# Check network ACLs
|
||||
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0
|
||||
|
||||
# Check route tables
|
||||
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Azure-Specific
|
||||
|
||||
**VM Issues**:
|
||||
```bash
|
||||
# Check VM status
|
||||
az vm get-instance-view --name myVM --resource-group myRG
|
||||
|
||||
# Restart VM
|
||||
az vm restart --name myVM --resource-group myRG
|
||||
|
||||
# Resize VM
|
||||
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
|
||||
```
|
||||
|
||||
**Disk Issues**:
|
||||
```bash
|
||||
# Check disk status
|
||||
az disk show --name myDisk --resource-group myRG
|
||||
|
||||
# Expand disk
|
||||
az disk update --name myDisk --resource-group myRG --size-gb 256
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Infrastructure Performance Metrics
|
||||
|
||||
**Server Health**:
|
||||
- CPU: <70% average, <90% peak
|
||||
- Memory: <80% usage
|
||||
- Disk: <80% usage, <80% IOPS
|
||||
- Network: <70% bandwidth
|
||||
|
||||
**Uptime**:
|
||||
- Target: 99.9% (8.76 hours downtime/year)
|
||||
- Monitoring: Check every 1 minute
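
The downtime figure follows directly from the availability target. A short worked example, where `downtimeBudgetHours` is a hypothetical helper name and a 365-day year is assumed:

```javascript
// Allowed downtime for a given availability target over a period.
function downtimeBudgetHours(availabilityPct, periodHours) {
  return ((100 - availabilityPct) / 100) * periodHours;
}

const HOURS_PER_YEAR = 365 * 24; // 8760

// Hours of downtime allowed per year at 99.9%
console.log(downtimeBudgetHours(99.9, HOURS_PER_YEAR).toFixed(2));

// Minutes of downtime allowed per 30-day month at 99.9%
console.log((downtimeBudgetHours(99.9, 30 * 24) * 60).toFixed(1));
```

At 99.9% this gives 8.76 hours per year, or roughly 43 minutes per 30-day month, which is the budget a one-minute check interval has to protect.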
|
||||
|
||||
**Response Time**:
|
||||
- Ping latency: <50ms (same region)
|
||||
- HTTP response: <200ms
|
||||
|
||||
---
|
||||
|
||||
## Infrastructure Diagnostic Checklist
|
||||
|
||||
**When diagnosing infrastructure issues**:
|
||||
|
||||
- [ ] Check CPU usage (target: <70%)
|
||||
- [ ] Check memory usage (target: <80%)
|
||||
- [ ] Check disk usage (target: <80%)
|
||||
- [ ] Check disk I/O (%util, await)
|
||||
- [ ] Check network connectivity (ping, traceroute)
|
||||
- [ ] Check firewall rules (iptables, security groups)
|
||||
- [ ] Check service status (systemd, ps)
|
||||
- [ ] Check system logs (dmesg, /var/log/syslog)
|
||||
- [ ] Check cloud metrics (CloudWatch, Azure Monitor)
|
||||
- [ ] Check for hardware issues (SMART, dmesg errors)
|
||||
|
||||
**Tools**:
|
||||
- `top`, `htop` - CPU, memory
|
||||
- `df`, `du` - Disk usage
|
||||
- `iostat` - Disk I/O
|
||||
- `iftop`, `netstat` - Network
|
||||
- `dmesg`, `journalctl` - System logs
|
||||
- Cloud dashboards (AWS, Azure, GCP)
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [SKILL.md](../SKILL.md) - Main SRE agent
|
||||
- [backend-diagnostics.md](backend-diagnostics.md) - Application-level troubleshooting
|
||||
- [database-diagnostics.md](database-diagnostics.md) - Database performance
|
||||
- [security-incidents.md](security-incidents.md) - Security response
|
||||
439
agents/sre/modules/monitoring.md
Normal file
@@ -0,0 +1,439 @@
|
||||
# Monitoring & Observability
|
||||
|
||||
**Purpose**: Set up monitoring, alerting, and observability to detect incidents early.
|
||||
|
||||
## Observability Pillars
|
||||
|
||||
### 1. Metrics
|
||||
|
||||
**What to Monitor**:
|
||||
- **Application**: Response time, error rate, throughput
|
||||
- **Infrastructure**: CPU, memory, disk, network
|
||||
- **Database**: Query time, connections, deadlocks
|
||||
- **Business**: User signups, revenue, conversions
|
||||
|
||||
**Tools**:
|
||||
- Prometheus + Grafana
|
||||
- DataDog
|
||||
- New Relic
|
||||
- CloudWatch (AWS)
|
||||
- Azure Monitor
|
||||
|
||||
---
|
||||
|
||||
#### Key Metrics by Layer
|
||||
|
||||
**Application Metrics**:
|
||||
```
|
||||
http_requests_total # Total requests
|
||||
http_request_duration_seconds # Response time (histogram)
|
||||
http_requests_errors_total # Error count
|
||||
http_requests_in_flight # Concurrent requests
|
||||
```
|
||||
|
||||
**Infrastructure Metrics**:
|
||||
```
|
||||
node_cpu_seconds_total # CPU usage
|
||||
node_memory_MemAvailable_bytes # Available memory (node_exporter)
|
||||
node_filesystem_avail_bytes # Free disk space (node_exporter)
|
||||
node_network_receive_bytes_total # Network in
|
||||
```
|
||||
|
||||
**Database Metrics**:
|
||||
```
|
||||
pg_stat_database_tup_returned # Rows returned
|
||||
pg_stat_database_tup_fetched # Rows fetched
|
||||
pg_stat_database_deadlocks # Deadlock count
|
||||
pg_stat_activity_count # Connections by state (postgres_exporter)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 2. Logs
|
||||
|
||||
**What to Log**:
|
||||
- **Application logs**: Errors, warnings, info
|
||||
- **Access logs**: HTTP requests (nginx, apache)
|
||||
- **System logs**: Kernel, systemd, auth
|
||||
- **Audit logs**: Security events, data access
|
||||
|
||||
**Log Levels**:
|
||||
- **ERROR**: Application errors, exceptions
|
||||
- **WARN**: Potential issues (deprecated API, high latency)
|
||||
- **INFO**: Normal operations (user login, job completed)
|
||||
- **DEBUG**: Detailed troubleshooting (only in dev)
|
||||
|
||||
**Tools**:
|
||||
- ELK Stack (Elasticsearch, Logstash, Kibana)
|
||||
- Splunk
|
||||
- CloudWatch Logs
|
||||
- Azure Log Analytics
|
||||
|
||||
---
|
||||
|
||||
#### Structured Logging
|
||||
|
||||
**BAD** (unstructured):
|
||||
```javascript
|
||||
console.log("User logged in: " + userId);
|
||||
```
|
||||
|
||||
**GOOD** (structured JSON):
|
||||
```javascript
|
||||
logger.info("User logged in", {
|
||||
userId: 123,
|
||||
ip: "192.168.1.1",
|
||||
timestamp: "2025-10-26T12:00:00Z",
|
||||
userAgent: "Mozilla/5.0...",
|
||||
});
|
||||
|
||||
// Output:
|
||||
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Queryable (filter by userId)
|
||||
- Machine-readable
|
||||
- Consistent format
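
To make the "queryable" benefit concrete, here is a minimal sketch with no logging library at all. `formatLog` is a hypothetical helper, not the API of winston or any other logger; it only shows why one JSON object per line is easy to filter.

```javascript
// Build one JSON log line; `fields` carries the queryable context.
function formatLog(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({ level, message, timestamp: now.toISOString(), ...fields });
}

const line = formatLog(
  'info',
  'User logged in',
  { userId: 123, ip: '192.168.1.1' },
  new Date('2025-10-26T12:00:00Z'),
);
console.log(line);

// Because each line is JSON, filtering by a field is a parse plus a comparison.
const lines = [line, formatLog('info', 'User logged in', { userId: 456, ip: '10.0.0.7' })];
const forUser123 = lines.map((l) => JSON.parse(l)).filter((entry) => entry.userId === 123);
console.log(forUser123.length);
```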
|
||||
|
||||
---
|
||||
|
||||
### 3. Traces
|
||||
|
||||
**Purpose**: Track request flow through distributed systems
|
||||
|
||||
**Example**:
|
||||
```
|
||||
User Request → API Gateway → Auth Service → Payment Service → Database
    1ms            2ms           50ms           100ms           30ms
                                                  ↑ SLOW SPAN
|
||||
```
|
||||
|
||||
**Tools**:
|
||||
- Jaeger
|
||||
- Zipkin
|
||||
- AWS X-Ray
|
||||
- DataDog APM
|
||||
- New Relic
|
||||
|
||||
**When to Use**:
|
||||
- Microservices architecture
|
||||
- Slow requests (which service is slow?)
|
||||
- Debugging distributed systems
|
||||
|
||||
---
|
||||
|
||||
## Alerting Best Practices
|
||||
|
||||
### Alert on Symptoms, Not Causes
|
||||
|
||||
**BAD** (cause-based):
|
||||
- Alert: "CPU usage >80%"
|
||||
- Problem: CPU can be high without user impact
|
||||
|
||||
**GOOD** (symptom-based):
|
||||
- Alert: "API response time >1s"
|
||||
- Why: Users actually experiencing slowness
|
||||
|
||||
---
|
||||
|
||||
### Alert Severity Levels
|
||||
|
||||
**P1 (SEV1) - Page On-Call**:
|
||||
- Service down (availability <99%)
|
||||
- Data loss
|
||||
- Security breach
|
||||
- Response time >5s (unusable)
|
||||
|
||||
**P2 (SEV2) - Notify During Business Hours**:
|
||||
- Degraded performance (response time >1s)
|
||||
- Error rate >1%
|
||||
- Disk >90% full
|
||||
|
||||
**P3 (SEV3) - Email/Slack**:
|
||||
- Warning signs (disk >80%, memory >80%)
|
||||
- Non-critical errors
|
||||
- Monitoring gaps
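
The numeric thresholds in these levels can be encoded as a small routing function. This is a sketch under the assumption that the thresholds above are applied in priority order; `classifySeverity` and its input field names are hypothetical.

```javascript
// Map current measurements to a severity level using the thresholds listed above.
// Returns null when nothing crosses a threshold.
function classifySeverity({
  availabilityPct = 100,
  responseTimeMs = 0,
  errorRatePct = 0,
  diskPct = 0,
  memoryPct = 0,
}) {
  if (availabilityPct < 99 || responseTimeMs > 5000) return 'P1'; // page on-call
  if (responseTimeMs > 1000 || errorRatePct > 1 || diskPct > 90) return 'P2'; // business hours
  if (diskPct > 80 || memoryPct > 80) return 'P3'; // email/Slack
  return null;
}

console.log(classifySeverity({ responseTimeMs: 6200 }));
console.log(classifySeverity({ errorRatePct: 2.5 }));
console.log(classifySeverity({ diskPct: 85 }));
console.log(classifySeverity({ diskPct: 40, memoryPct: 55 }));
```

Non-numeric triggers such as data loss or a security breach still need a human or a dedicated detector; only the measurable conditions are covered here.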
|
||||
|
||||
---
|
||||
|
||||
### Alert Fatigue Prevention
|
||||
|
||||
**Rules**:
|
||||
1. **Actionable**: Every alert must have clear action
|
||||
2. **Meaningful**: Alert only on real problems
|
||||
3. **Context**: Include relevant info (which server, which metric)
|
||||
4. **Deduplicate**: Don't alert 100 times for same issue
|
||||
5. **Escalate**: Auto-escalate if not acknowledged
|
||||
|
||||
**Example Bad Alert**:
|
||||
```
|
||||
Subject: Alert
|
||||
Body: Server is down
|
||||
```
|
||||
|
||||
**Example Good Alert**:
|
||||
```
|
||||
Subject: [P1] API Server Down - Production
|
||||
Body:
|
||||
- Service: api.example.com
|
||||
- Issue: Health check failing for 5 minutes
|
||||
- Impact: All users affected (100%)
|
||||
- Runbook: https://wiki.example.com/runbook/api-down
|
||||
- Dashboard: https://grafana.example.com/d/api
|
||||
```
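
Rule 4 (deduplicate) is usually handled by the alerting tool itself, but the idea is simple enough to sketch. The class below is a hypothetical illustration, not the behavior of any specific alert manager: it suppresses repeat notifications for the same alert key inside a time window.

```javascript
// Suppress repeat notifications for the same alert key within a time window.
class AlertDeduplicator {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.lastSent = new Map(); // key -> timestamp (ms) of the last notification
  }

  shouldNotify(key, nowMs = Date.now()) {
    const last = this.lastSent.get(key);
    if (last !== undefined && nowMs - last < this.windowMs) return false;
    this.lastSent.set(key, nowMs);
    return true;
  }
}

const dedup = new AlertDeduplicator(5 * 60 * 1000); // 5-minute window
console.log(dedup.shouldNotify('api-down:production', 0));       // first occurrence
console.log(dedup.shouldNotify('api-down:production', 60_000));  // repeat inside the window
console.log(dedup.shouldNotify('api-down:production', 360_000)); // window has elapsed
```

Keying on service plus environment (rather than on the full message text) keeps one incident from producing one notification per failing host.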
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Setup
|
||||
|
||||
### Application Monitoring
|
||||
|
||||
#### Prometheus + Grafana
|
||||
|
||||
**Install Prometheus Client** (Node.js):
|
||||
```javascript
|
||||
const client = require('prom-client');
|
||||
|
||||
// Enable default metrics (CPU, memory, etc.)
|
||||
client.collectDefaultMetrics();
|
||||
|
||||
// Custom metrics
|
||||
const httpRequestDuration = new client.Histogram({
|
||||
name: 'http_request_duration_seconds',
|
||||
help: 'HTTP request duration in seconds',
|
||||
labelNames: ['method', 'route', 'status'],
|
||||
});
|
||||
|
||||
// Instrument code
|
||||
app.use((req, res, next) => {
|
||||
const end = httpRequestDuration.startTimer();
|
||||
res.on('finish', () => {
|
||||
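    // req.route is undefined for unmatched paths (404s), so fall back to a fixed label
    end({ method: req.method, route: req.route?.path ?? 'unmatched', status: res.statusCode });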
|
||||
});
|
||||
next();
|
||||
});
|
||||
|
||||
// Expose metrics endpoint
|
||||
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  // register.metrics() returns a Promise in prom-client v13 and later
  res.end(await client.register.metrics());
|
||||
});
|
||||
```
|
||||
|
||||
**Prometheus Config** (prometheus.yml):
|
||||
```yaml
|
||||
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:3000']
    scrape_interval: 15s
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Log Aggregation
|
||||
|
||||
#### ELK Stack
|
||||
|
||||
**Application** (send logs to Logstash):
|
||||
```javascript
|
||||
const winston = require('winston');
|
||||
const LogstashTransport = require('winston-logstash-transport').LogstashTransport;
|
||||
|
||||
const logger = winston.createLogger({
|
||||
transports: [
|
||||
new LogstashTransport({
|
||||
host: 'logstash.example.com',
|
||||
port: 5000,
|
||||
}),
|
||||
],
|
||||
});
|
||||
|
||||
logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
|
||||
```
|
||||
|
||||
**Logstash Config**:
|
||||
```
|
||||
input {
|
||||
tcp {
|
||||
port => 5000
|
||||
codec => json
|
||||
}
|
||||
}
|
||||
|
||||
output {
|
||||
elasticsearch {
|
||||
hosts => ["elasticsearch:9200"]
|
||||
index => "application-logs-%{+YYYY.MM.dd}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Health Checks
|
||||
|
||||
**Purpose**: Check if service is healthy and ready to serve traffic
|
||||
|
||||
**Types**:
|
||||
1. **Liveness**: Is the service running? (restart if fails)
|
||||
2. **Readiness**: Is the service ready to serve traffic? (remove from load balancer if fails)
|
||||
|
||||
**Example** (Express.js):
|
||||
```javascript
|
||||
// Liveness probe (simple check)
|
||||
app.get('/healthz', (req, res) => {
|
||||
res.status(200).send('OK');
|
||||
});
|
||||
|
||||
// Readiness probe (check dependencies)
|
||||
app.get('/ready', async (req, res) => {
|
||||
try {
|
||||
// Check database
|
||||
await db.query('SELECT 1');
|
||||
|
||||
// Check Redis
|
||||
await redis.ping();
|
||||
|
||||
// Check external API
|
||||
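    const upstream = await fetch('https://api.external.com/health');
    // fetch() resolves on 4xx/5xx responses, so check the status explicitly
    if (!upstream.ok) throw new Error(`External API returned ${upstream.status}`);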
|
||||
|
||||
res.status(200).send('Ready');
|
||||
} catch (error) {
|
||||
res.status(503).send('Not ready');
|
||||
}
|
||||
});
|
||||
```
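
A dependency that hangs will make the readiness endpoint hang as well, and the probe then fails on its own timeout without saying which dependency was at fault. The sketch below bounds each check and reports the failing ones; `withTimeout`, `checkReadiness`, and the stand-in checks are hypothetical names for illustration.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, name) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${name} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run all checks in parallel; `checks` maps a name to a function returning a Promise.
async function checkReadiness(checks, ms = 1000) {
  const names = Object.keys(checks);
  const results = await Promise.allSettled(
    // Promise.resolve().then(fn) also turns a synchronous throw into a rejection
    names.map((name) => withTimeout(Promise.resolve().then(checks[name]), ms, name)),
  );
  const failed = names.filter((_, i) => results[i].status === 'rejected');
  return { ready: failed.length === 0, failed };
}

// Stand-ins for db.query('SELECT 1') and redis.ping()
checkReadiness(
  {
    database: () => Promise.resolve(),
    cache: () => new Promise((resolve) => setTimeout(resolve, 200)), // slower than the 50ms budget
  },
  50,
).then((result) => console.log(result));
```

In the `/ready` handler, respond with 200 when `ready` is true and 503 otherwise, and log `failed` so the cause is visible.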
|
||||
|
||||
**Kubernetes**:
|
||||
```yaml
|
||||
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### SLI, SLO, SLA
|
||||
|
||||
**SLI** (Service Level Indicator):
|
||||
- Metrics that measure service quality
|
||||
- Examples: Response time, error rate, availability
|
||||
|
||||
**SLO** (Service Level Objective):
|
||||
- Target for SLI
|
||||
- Examples: "99.9% availability", "p95 response time <500ms"
|
||||
|
||||
**SLA** (Service Level Agreement):
|
||||
- Contract with users (with penalties)
|
||||
- Examples: "99.9% uptime or refund"
|
||||
|
||||
**Example**:
|
||||
```
|
||||
SLI: Availability = (successful requests / total requests) * 100
|
||||
SLO: Availability must be ≥99.9% per month
|
||||
SLA: If availability <99.9%, users get 10% refund
|
||||
```
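
The SLI and SLO above translate into an error budget: the number of failed requests the service can absorb in the period before the SLO is missed. A minimal sketch, with `errorBudget` and the request counts as hypothetical values:

```javascript
// Compute the availability SLI and the remaining error budget for a period.
function errorBudget(totalRequests, failedRequests, sloPct) {
  const availabilityPct = ((totalRequests - failedRequests) / totalRequests) * 100;
  // Math.round absorbs floating-point error in (100 - sloPct)
  const allowedFailures = Math.round((totalRequests * (100 - sloPct)) / 100);
  return {
    availabilityPct,
    allowedFailures,
    remainingFailures: allowedFailures - failedRequests,
    sloMet: failedRequests <= allowedFailures,
  };
}

// Hypothetical month: 1,000,000 requests, 400 failures, 99.9% SLO
console.log(errorBudget(1_000_000, 400, 99.9));
```

With these numbers the budget is 1000 failed requests, of which 600 remain. A negative `remainingFailures` means the SLO has been missed for the period.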
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Checklist
|
||||
|
||||
**Application**:
|
||||
- [ ] Response time metrics (p50, p95, p99)
|
||||
- [ ] Error rate metrics (4xx, 5xx)
|
||||
- [ ] Throughput metrics (requests per second)
|
||||
- [ ] Health check endpoint (/healthz, /ready)
|
||||
- [ ] Structured logging (JSON format)
|
||||
- [ ] Distributed tracing (if microservices)
|
||||
|
||||
**Infrastructure**:
|
||||
- [ ] CPU, memory, disk, network metrics
|
||||
- [ ] System logs (syslog, journalctl)
|
||||
- [ ] Cloud metrics (CloudWatch, Azure Monitor)
|
||||
- [ ] Disk I/O metrics (iostat)
|
||||
|
||||
**Database**:
|
||||
- [ ] Query performance metrics
|
||||
- [ ] Connection pool metrics
|
||||
- [ ] Slow query log enabled
|
||||
- [ ] Deadlock monitoring
|
||||
|
||||
**Alerts**:
|
||||
- [ ] P1 alerts for critical issues (page on-call)
|
||||
- [ ] P2 alerts for degraded performance
|
||||
- [ ] Runbook linked in alerts
|
||||
- [ ] Dashboard linked in alerts
|
||||
- [ ] Escalation policy configured
|
||||
|
||||
**Dashboards**:
|
||||
- [ ] Overview dashboard (RED metrics: Rate, Errors, Duration)
|
||||
- [ ] Infrastructure dashboard (CPU, memory, disk)
|
||||
- [ ] Database dashboard (queries, connections)
|
||||
- [ ] Business metrics dashboard (signups, revenue)
|
||||
|
||||
---
|
||||
|
||||
## Common Monitoring Patterns
|
||||
|
||||
### RED Method (for services)
|
||||
|
||||
**Rate**: Requests per second
|
||||
**Errors**: Error rate (%)
|
||||
**Duration**: Response time (p50, p95, p99)
|
||||
|
||||
**Dashboard**:
|
||||
```
|
||||
+-----------------+ +-----------------+ +-----------------+
|
||||
| Rate | | Errors | | Duration |
|
||||
| 1000 req/s | | 0.5% | | p95: 250ms |
|
||||
+-----------------+ +-----------------+ +-----------------+
|
||||
```
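
The three RED numbers can be derived from raw request records. The following sketch assumes each record carries an HTTP status and a duration in milliseconds, counts only 5xx responses as errors, and uses a nearest-rank percentile; all names and the sample data are hypothetical.

```javascript
// Nearest-rank percentile over a list of numbers.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Rate, Errors, Duration for the requests seen in a time window.
function redMetrics(requests, windowSeconds) {
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorPct: (errors * 100) / requests.length,
    p95Ms: percentile(requests.map((r) => r.durationMs), 95),
  };
}

// Hypothetical sample: 20 requests over a 10-second window, one 5xx response.
const requests = Array.from({ length: 20 }, (_, i) => ({
  status: i === 7 ? 500 : 200,
  durationMs: (i + 1) * 10, // 10ms, 20ms, ... 200ms
}));

console.log(redMetrics(requests, 10));
```

In production these values normally come from the metrics backend (for example, histogram queries in Prometheus) rather than from in-process arrays; the sketch only shows what each number means.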
|
||||
|
||||
### USE Method (for resources)
|
||||
|
||||
**Utilization**: % of resource used (CPU, memory, disk)
|
||||
**Saturation**: Queue depth, backlog
|
||||
**Errors**: Error count
|
||||
|
||||
**Dashboard**:
|
||||
```
|
||||
CPU: 70% utilization, 0.5 load average, 0 errors
|
||||
Memory: 80% utilization, 0 swap, 0 OOM kills
|
||||
Disk: 60% utilization, 5ms latency, 0 I/O errors
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tools Comparison
|
||||
|
||||
| Tool | Type | Best For | Cost |
|------|------|----------|------|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [SKILL.md](../SKILL.md) - Main SRE agent
|
||||
- [backend-diagnostics.md](backend-diagnostics.md) - Application troubleshooting
|
||||
- [database-diagnostics.md](database-diagnostics.md) - Database monitoring
|
||||
- [infrastructure.md](infrastructure.md) - Infrastructure monitoring
|
||||
421
agents/sre/modules/security-incidents.md
Normal file
@@ -0,0 +1,421 @@
|
||||
# Security Incidents
|
||||
|
||||
**Purpose**: Respond to security breaches, DDoS attacks, and unauthorized access attempts.
|
||||
|
||||
**IMPORTANT**: For security incidents, SRE Agent collaborates with `security-agent` skill.
|
||||
|
||||
## Incident Response Protocol
|
||||
|
||||
### SEV1 Security Incidents (CRITICAL)
|
||||
|
||||
**Immediate Actions** (First 5 minutes):
|
||||
1. **Isolate** affected systems
|
||||
2. **Preserve** evidence (logs, snapshots)
|
||||
3. **Notify** security team and management
|
||||
4. **Assess** scope of breach
|
||||
5. **Document** timeline
|
||||
|
||||
**DO NOT**:
|
||||
- Delete logs (preserve evidence)
|
||||
- Reboot systems (unless absolutely necessary)
|
||||
- Make changes without documenting
|
||||
|
||||
---
|
||||
|
||||
## Common Security Incidents
|
||||
|
||||
### 1. DDoS Attack
|
||||
|
||||
**Symptoms**:
|
||||
- Sudden traffic spike (10x-100x normal)
|
||||
- Legitimate users can't access service
|
||||
- High bandwidth usage
|
||||
- Server overload
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Traffic Patterns
|
||||
```bash
|
||||
# Check connections by IP
|
||||
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
|
||||
|
||||
# Check HTTP requests by IP (nginx)
|
||||
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
|
||||
|
||||
# Check requests per second
|
||||
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Single IP making thousands of requests
|
||||
- Requests from suspicious IPs (botnets)
|
||||
- High rate of 4xx errors (probing)
|
||||
- Unusual traffic patterns
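
The per-IP count from the shell pipeline above can also be computed in code, which makes it easier to feed into an automated block list. A minimal sketch, assuming the client IP is the first space-separated field of each access log line (as in nginx's default format); `topClients` and the sample lines are hypothetical.

```javascript
// Count requests per client IP and return the busiest clients first.
function topClients(logLines, limit = 20) {
  const counts = new Map();
  for (const line of logLines) {
    const ip = line.split(' ', 1)[0];
    if (ip) counts.set(ip, (counts.get(ip) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}

// Hypothetical sample using documentation-range IP addresses.
const accessLog = [
  '203.0.113.9 - - [26/Oct/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 512',
  '203.0.113.9 - - [26/Oct/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 512',
  '198.51.100.4 - - [26/Oct/2025:12:00:01 +0000] "GET / HTTP/1.1" 200 1024',
  '203.0.113.9 - - [26/Oct/2025:12:00:01 +0000] "GET /api/users HTTP/1.1" 429 0',
];

console.log(topClients(accessLog, 2));
```

If the service sits behind a load balancer or CDN, the first field is the proxy's address; in that case the real client IP has to come from a forwarded-for field instead.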
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Mitigation
|
||||
```bash
|
||||
# 1. Rate limiting (nginx)
|
||||
# Add to nginx.conf (limit_req_zone in the http block, limit_req in a server or location block):
|
||||
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
|
||||
limit_req zone=one burst=20 nodelay;
|
||||
|
||||
# 2. Block suspicious IPs (iptables)
|
||||
iptables -A INPUT -s <ATTACKER_IP> -j DROP
|
||||
|
||||
# 3. Enable DDoS protection (CloudFlare, AWS Shield)
|
||||
# CloudFlare: Enable "I'm Under Attack" mode
|
||||
# AWS: Shield Standard is enabled by default; consider Shield Advanced for larger attacks
|
||||
|
||||
# 4. Increase capacity (auto-scaling)
|
||||
# Scale up to handle traffic (if legitimate)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 2. Unauthorized Access / Data Breach
|
||||
|
||||
**Symptoms**:
|
||||
- Alerts for failed login attempts
|
||||
- Successful login from unusual location
|
||||
- Unusual data access patterns
|
||||
- Data exfiltration detected
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Access Logs
|
||||
```bash
|
||||
# Check authentication logs (Linux)
|
||||
grep "Failed password" /var/log/auth.log | tail -50
|
||||
|
||||
# Check successful logins
|
||||
grep "Accepted" /var/log/auth.log | tail -50   # covers both password and publickey logins
|
||||
|
||||
# Check login attempts by IP
|
||||
awk '/Failed password/ {print $(NF-3)}' /var/log/auth.log | sort | uniq -c | sort -nr
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Hundreds of failed login attempts (brute force)
|
||||
- Successful login from suspicious IP/location
|
||||
- Login at unusual time (3am)
|
||||
- Multiple accounts accessed from same IP
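
The brute-force pattern in the first red flag can be detected by counting failed logins per source IP. This sketch assumes OpenSSH's usual `Failed password for ... from <ip> port <n>` message format in `auth.log`; `bruteForceSuspects`, the threshold, and the sample lines are hypothetical.

```javascript
// Matches OpenSSH failure lines, with or without the "invalid user" prefix.
const FAILED_LOGIN = /Failed password for (?:invalid user )?\S+ from (\S+) port \d+/;

// Return source IPs with at least `minFailures` failed logins, busiest first.
function bruteForceSuspects(authLogLines, minFailures = 100) {
  const failures = new Map();
  for (const line of authLogLines) {
    const match = FAILED_LOGIN.exec(line);
    if (match) failures.set(match[1], (failures.get(match[1]) || 0) + 1);
  }
  return [...failures.entries()]
    .filter(([, count]) => count >= minFailures)
    .sort((a, b) => b[1] - a[1])
    .map(([ip, count]) => ({ ip, count }));
}

// Hypothetical sample using documentation-range IP addresses.
const authLog = [
  'Oct 26 03:00:01 web1 sshd[1001]: Failed password for root from 203.0.113.9 port 40022 ssh2',
  'Oct 26 03:00:02 web1 sshd[1002]: Failed password for invalid user admin from 203.0.113.9 port 40023 ssh2',
  'Oct 26 03:00:03 web1 sshd[1003]: Failed password for root from 203.0.113.9 port 40024 ssh2',
  'Oct 26 09:15:00 web1 sshd[2001]: Failed password for alice from 198.51.100.4 port 51000 ssh2',
  'Oct 26 09:15:10 web1 sshd[2002]: Accepted password for alice from 198.51.100.4 port 51001 ssh2',
];

console.log(bruteForceSuspects(authLog, 3));
```

A single failure followed by a success, like the second IP in the sample, stays below the threshold and is not reported.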
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Response (SEV1)
|
||||
```bash
|
||||
# 1. ISOLATE: Disable compromised account
|
||||
# Application-level:
|
||||
UPDATE users SET disabled = true WHERE id = <COMPROMISED_USER_ID>;
|
||||
|
||||
# System-level:
|
||||
passwd -l <username> # Lock account
|
||||
|
||||
# 2. PRESERVE: Copy logs for forensics
|
||||
mkdir -p /forensics
cp -p /var/log/auth.log /forensics/auth.log.$(date +%Y%m%d)           # -p keeps timestamps and ownership
cp -p /var/log/nginx/access.log /forensics/access.log.$(date +%Y%m%d)
|
||||
|
||||
# 3. ASSESS: Check what was accessed
|
||||
# Database audit logs
|
||||
# Application logs
|
||||
# File access logs
|
||||
|
||||
# 4. NOTIFY: Alert security team
|
||||
# Email, Slack, PagerDuty
|
||||
|
||||
# 5. DOCUMENT: Create incident timeline
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Long-term Mitigation
|
||||
- Force password reset for all users
|
||||
- Enable 2FA/MFA
|
||||
- Review access controls
|
||||
- Conduct security audit
|
||||
- Update security policies
|
||||
- Train users on security
|
||||
|
||||
---
|
||||
|
||||
### 3. SQL Injection Attempt
|
||||
|
||||
**Symptoms**:
|
||||
- Unusual SQL queries in logs
|
||||
- 500 errors with SQL syntax messages
|
||||
- Alerts from WAF (Web Application Firewall)
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Application Logs
|
||||
```bash
|
||||
# Look for SQL injection patterns
|
||||
grep -E "(SELECT|INSERT|UPDATE|DELETE).*FROM.*WHERE" /var/log/application.log
|
||||
|
||||
# Look for SQL errors
|
||||
grep "SQLException\|SQL syntax" /var/log/application.log
|
||||
|
||||
# Check for malicious patterns
|
||||
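# In GNU grep, \' is an end-of-buffer anchor rather than a literal quote; access logs also hold URL-encoded payloads
grep -iE "(%27|')(%20|\+|\s)*OR(%20|\+|\s)|UNION(%20|\+|\s)+SELECT|--" /var/log/nginx/access.log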
|
||||
```
|
||||
|
||||
**Example Malicious Request**:
|
||||
```
|
||||
GET /api/users?id=1' OR '1'='1
|
||||
GET /api/users?id=1; DROP TABLE users;--
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Response
|
||||
```bash
|
||||
# 1. Block attacker IP
|
||||
iptables -A INPUT -s <ATTACKER_IP> -j DROP
|
||||
|
||||
# 2. Enable WAF rule (ModSecurity, AWS WAF)
|
||||
# Block requests with SQL keywords
|
||||
|
||||
# 3. Check database for unauthorized changes
|
||||
# Compare current schema with backup
|
||||
# Check audit logs for suspicious queries
|
||||
|
||||
# 4. Review application code
|
||||
# Use parameterized queries, not string concatenation
|
||||
```
|
||||
|
||||
**Long-term Fix**:
|
||||
```javascript
|
||||
// BAD: SQL injection vulnerable
|
||||
const query = `SELECT * FROM users WHERE id = ${req.query.id}`;
|
||||
|
||||
// GOOD: Parameterized query
|
||||
const query = 'SELECT * FROM users WHERE id = ?';
|
||||
db.query(query, [req.query.id]);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 4. Malware / Crypto Mining
|
||||
|
||||
**Symptoms**:
|
||||
- High CPU usage (100%)
|
||||
- Unusual network traffic (to crypto pool)
|
||||
- Unknown processes running
|
||||
- Server slow
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Running Processes
|
||||
```bash
|
||||
# Check CPU usage by process
|
||||
top -bn1 | head -20
|
||||
|
||||
# Check all processes
|
||||
ps aux | sort -nrk 3,3 | head -20
|
||||
|
||||
# Check for suspicious processes
|
||||
ps aux | grep -v -E "^(root|www-data|mysql|postgres)"
|
||||
|
||||
# Check network connections
|
||||
netstat -tunap | grep ESTABLISHED
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- Unknown process using 100% CPU
|
||||
- Connections to crypto mining pools
|
||||
- Processes running as unexpected user
|
||||
- Processes with random names or known miner names (xmrig, minerd)
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Response
|
||||
```bash
|
||||
# 1. Kill malicious process
|
||||
kill -9 <PID>
|
||||
|
||||
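# 2. Find and quarantine malware (keep the binary as evidence instead of deleting it)
find / -xdev -type f -name "<PROCESS_NAME>" -exec mv --backup=numbered -t /forensics/ {} +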
|
||||
|
||||
# 3. Check for persistence mechanisms
|
||||
crontab -l # Cron jobs
|
||||
cat /etc/rc.local # Startup scripts
|
||||
systemctl list-unit-files # Systemd services
|
||||
|
||||
# 4. Change all credentials
|
||||
# Root password
|
||||
# SSH keys
|
||||
# Database passwords
|
||||
# API keys
|
||||
|
||||
# 5. Restore from clean backup (if available)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5. Insider Threat / Data Exfiltration
|
||||
|
||||
**Symptoms**:
|
||||
- Large data downloads
|
||||
- Database dump exports
|
||||
- Unusual file transfers
|
||||
- After-hours access
|
||||
|
||||
**Diagnosis**:
|
||||
|
||||
#### Check Data Access Logs
|
||||
```bash
|
||||
# Check database queries (large exports)
|
||||
grep "SELECT.*FROM" /var/log/postgresql/postgresql.log | grep -E "LIMIT\s+[0-9]{5,}"
|
||||
|
||||
# Check file downloads (nginx)
|
||||
awk '$10 > 10000000 {print $1, $7, $10}' /var/log/nginx/access.log
|
||||
|
||||
# Check SSH file transfers
|
||||
grep "sftp\|scp" /var/log/auth.log
|
||||
```
|
||||
|
||||
**Red flags**:
|
||||
- SELECT with no LIMIT (full table export)
|
||||
- Large file downloads (>10MB)
|
||||
- Multiple consecutive downloads
|
||||
- Access from unusual location
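
The large-download check can be expressed as a filter over access log lines. This is a sketch assuming nginx's combined log format, where the response size in bytes follows the status code; `largeDownloads`, the 10 MB default, and the sample lines are hypothetical.

```javascript
// Combined log format: ip - user [time] "METHOD path proto" status bytes ...
const COMBINED = /^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" \d{3} (\d+)/;

// Return responses larger than `minBytes`, with the client IP and requested path.
function largeDownloads(logLines, minBytes = 10_000_000) {
  const hits = [];
  for (const line of logLines) {
    const m = COMBINED.exec(line);
    if (m && Number(m[3]) > minBytes) {
      hits.push({ ip: m[1], path: m[2], bytes: Number(m[3]) });
    }
  }
  return hits;
}

// Hypothetical sample using documentation-range IP addresses.
const downloadLog = [
  '198.51.100.4 - alice [26/Oct/2025:23:41:07 +0000] "GET /exports/customers.csv HTTP/1.1" 200 52428800 "-" "curl/8.0"',
  '203.0.113.20 - - [26/Oct/2025:23:41:09 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
];

console.log(largeDownloads(downloadLog));
```

Grouping the hits by IP and time of day afterwards helps separate a scheduled export from repeated after-hours downloads.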
|
||||
|
||||
---
|
||||
|
||||
#### Immediate Response
|
||||
```bash
|
||||
# 1. Disable account
|
||||
UPDATE users SET disabled = true WHERE id = <USER_ID>;
|
||||
|
||||
# 2. Preserve evidence
|
||||
cp -a /var/log /forensics/var-log.$(date +%Y%m%d)   # -a copies subdirectories and keeps timestamps
|
||||
|
||||
# 3. Assess damage
|
||||
# What data was accessed?
|
||||
# What data was exported?
|
||||
# What systems were compromised?
|
||||
|
||||
# 4. Legal/compliance notification
|
||||
# GDPR: Notify the supervisory authority within 72 hours of becoming aware of the breach
|
||||
# HIPAA: Notify within 60 days
|
||||
# PCI-DSS: Immediate notification
|
||||
|
||||
# 5. Incident report
|
||||
```

---

## Security Incident Checklist

**When security incident detected**:

### Phase 1: Immediate Response (0-5 min)
- [ ] Classify severity (SEV1/SEV2/SEV3)
- [ ] Isolate affected systems
- [ ] Preserve evidence (logs, snapshots)
- [ ] Notify security team
- [ ] Document timeline (start timestamp)

### Phase 2: Assessment (5-30 min)
- [ ] Identify attack vector
- [ ] Assess scope (what was compromised?)
- [ ] Check for data exfiltration
- [ ] Identify attacker (IP, location, identity)
- [ ] Determine if ongoing or stopped

### Phase 3: Containment (30 min - 2 hours)
- [ ] Block attacker access
- [ ] Close vulnerability
- [ ] Revoke compromised credentials
- [ ] Remove malware/backdoors
- [ ] Restore from clean backup (if needed)

### Phase 4: Recovery (2 hours - days)
- [ ] Restore normal operations
- [ ] Verify no persistence mechanisms
- [ ] Monitor for re-infection
- [ ] Change all credentials
- [ ] Apply security patches

### Phase 5: Post-Incident (1 week)
- [ ] Complete post-mortem
- [ ] Legal/compliance notifications
- [ ] Security audit
- [ ] Update security policies
- [ ] Train team on lessons learned

---

## Collaboration with Security Agent

**SRE Agent Role**:
- Initial detection and triage
- Immediate containment
- Preserve evidence
- Restore service

**Security Agent Role** (handoff):
- Forensic analysis
- Legal compliance
- Security audit
- Policy updates

**Handoff Protocol**:
```
SRE: Detects security incident → Immediate containment
SRE: Preserves evidence → Creates incident report
SRE: Hands off to Security Agent
Security Agent: Forensic analysis → Legal compliance → Long-term fixes
SRE: Implements security fixes → Updates runbook
```

---

## Security Metrics

**Detection Time**:
- SEV1: <5 minutes from first indicator
- SEV2: <30 minutes
- SEV3: <24 hours

**Response Time**:
- SEV1: Containment within 30 minutes
- SEV2: Containment within 2 hours
- SEV3: Containment within 24 hours

**False Positives**:
- Target: <5% of security alerts
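
The containment targets above can be checked mechanically during a post-mortem. This is a minimal sketch, assuming incidents are recorded with ISO 8601 detection and containment timestamps; `containmentStatus` and `CONTAINMENT_TARGET_MINUTES` are hypothetical names, and the targets are copied from the Response Time list.

```javascript
// Containment targets from the Response Time list, in minutes.
const CONTAINMENT_TARGET_MINUTES = { SEV1: 30, SEV2: 120, SEV3: 1440 };

// Compares the time from detection to containment with the target for the severity.
function containmentStatus(severity, detectedAtIso, containedAtIso) {
  const targetMinutes = CONTAINMENT_TARGET_MINUTES[severity];
  if (targetMinutes === undefined) {
    throw new Error(`Unknown severity: ${severity}`);
  }
  const elapsedMinutes = (Date.parse(containedAtIso) - Date.parse(detectedAtIso)) / 60000;
  return { elapsedMinutes, targetMinutes, withinTarget: elapsedMinutes <= targetMinutes };
}

// Example: a SEV1 contained 25 minutes after detection
console.log(containmentStatus('SEV1', '2024-01-01T10:00:00Z', '2024-01-01T10:25:00Z'));
```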

---

## Related Documentation

- [SKILL.md](../SKILL.md) - Main SRE agent
- [infrastructure.md](infrastructure.md) - Server security hardening
- [monitoring.md](monitoring.md) - Security monitoring setup
- `security-agent` skill - Full security expertise (handoff for forensics)

---

## Important Notes

**For SRE Agent**:
- Focus on IMMEDIATE containment and service restoration
- Preserve evidence (don't delete logs!)
- Hand off to `security-agent` for forensic analysis
- Document everything with timestamps
- Blameless post-mortem (focus on systems, not people)

**Legal Compliance**:
- GDPR: Notify the supervisory authority within 72 hours of becoming aware of the breach
- HIPAA: Notify without unreasonable delay, at most 60 days after discovery
- PCI-DSS: Immediate notification to card brands
- SOC 2: Document in audit trail

**Evidence Preservation**:
- Copy logs before any changes
- Take disk/memory snapshots
- Document all actions taken
- Preserve chain of custody

302
agents/sre/modules/ui-diagnostics.md
Normal file
@@ -0,0 +1,302 @@
# UI/Frontend Diagnostics

**Purpose**: Troubleshoot frontend performance, rendering, and user experience issues.

## Common UI Issues

### 1. Slow Page Load

**Symptoms**:
- Users report long loading times
- Lighthouse score <50
- Time to Interactive (TTI) >5 seconds

**Diagnosis**:

#### Check Bundle Size
```bash
# Check JavaScript bundle size
ls -lh dist/*.js

# Analyze bundle composition
# (generate the stats file first, e.g. webpack --profile --json > dist/stats.json)
npx webpack-bundle-analyzer dist/stats.json

# List top-level dependencies to review for heavy or unused packages
npm ls --depth=0
```

**Red flags**:
- Main bundle >500KB
- Unused dependencies in bundle
- Multiple copies of same library

**Mitigation**:
- Code splitting: `import()` for dynamic imports
- Tree shaking: Remove unused code
- Lazy loading: Load components on demand
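
As one way to picture code splitting with `import()`, the sketch below wraps a loader so the chunk is requested only on first use and the same promise is reused afterwards. `lazyModule` is a hypothetical helper and `./chart.js` is a placeholder path; bundlers such as webpack split the target of a dynamic `import()` into its own chunk.

```javascript
// Wraps a loader (for example () => import('./chart.js')) so the module is
// requested on first use only and later calls reuse the same promise.
function lazyModule(loader) {
  let pending = null;
  return function load() {
    if (pending === null) {
      pending = loader().catch((error) => {
        pending = null; // allow a retry after a failed load
        throw error;
      });
    }
    return pending;
  };
}

// Usage (placeholder path, not part of this repository):
// const loadChart = lazyModule(() => import('./chart.js'));
// button.addEventListener('click', async () => {
//   const chart = await loadChart();
//   chart.render();
// });
```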

---

#### Check Network Requests
```bash
# Chrome DevTools → Network tab
# Look for:
# - Number of requests (>100 = too many)
# - Large assets (images >200KB)
# - Slow API calls (>1s)
```

**Red flags**:
- Waterfall pattern (sequential loading)
- Large uncompressed images
- Blocking requests

**Mitigation**:
- Image optimization: WebP, lazy loading
- HTTP/2: Multiplexing
- CDN: Cache static assets
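
The same checks can be scripted against the browser's Resource Timing data instead of reading the Network tab by eye. This is a rough sketch: `summarizeRequests` is a hypothetical helper, the thresholds are the ones listed above (applied to every asset, not only images), and `transferSize` reads as 0 for cached responses and for cross-origin resources that do not send `Timing-Allow-Origin`.

```javascript
const LARGE_ASSET_BYTES = 200 * 1024; // 200KB
const SLOW_REQUEST_MS = 1000;         // 1s
const MAX_REQUESTS = 100;

// Takes Resource Timing entries (objects with name, duration, transferSize)
// and reports which ones cross the thresholds above.
function summarizeRequests(entries) {
  const large = entries.filter((e) => e.transferSize > LARGE_ASSET_BYTES).map((e) => e.name);
  const slow = entries.filter((e) => e.duration > SLOW_REQUEST_MS).map((e) => e.name);
  return { total: entries.length, tooMany: entries.length > MAX_REQUESTS, large, slow };
}

// In the browser console:
// summarizeRequests(performance.getEntriesByType('resource'));
```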

---

#### Check Render Performance
```bash
# Chrome DevTools → Performance tab
# Record page load, check:
# - Long tasks (>50ms)
# - Layout thrashing
# - JavaScript execution time
```

**Red flags**:
- Long tasks blocking main thread
- Multiple layout recalculations
- Heavy JavaScript computation

**Mitigation**:
- Web Workers: Move heavy computation off main thread
- requestIdleCallback: Defer non-critical work
- Virtual scrolling: Render only visible items
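
Virtual scrolling comes down to computing which slice of the list intersects the viewport. The sketch below assumes fixed-height rows; `visibleRange` and its `overscan` parameter (extra rows rendered above and below the viewport) are hypothetical names, not taken from any particular library.

```javascript
// Returns the slice of items to render for a fixed-row-height list.
// `end` is exclusive; `offsetY` is where the first rendered row should be positioned.
function visibleRange(scrollTop, viewportHeight, itemHeight, itemCount, overscan = 3) {
  const first = Math.floor(scrollTop / itemHeight);
  const visibleCount = Math.ceil(viewportHeight / itemHeight);
  const start = Math.max(0, first - overscan);
  const end = Math.min(itemCount, first + visibleCount + overscan);
  return { start, end, offsetY: start * itemHeight };
}

// Example: 10,000 rows of 30px in a 600px viewport, scrolled to 3000px
console.log(visibleRange(3000, 600, 30, 10000));
```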

---

### 2. Memory Leak (UI)

**Symptoms**:
- Browser tab becomes slow over time
- Memory usage increases continuously
- Browser eventually crashes

**Diagnosis**:

#### Chrome DevTools → Memory
```bash
# Take heap snapshot before/after user interaction
# Compare snapshots
# Look for:
# - Detached DOM nodes
# - Event listeners not removed
# - Growing arrays/objects
```

**Red flags**:
- Detached DOM elements increasing
- Event listeners not garbage collected
- Timers/intervals not cleared

**Mitigation**:
```javascript
// Clean up event listeners (inside a React class component)
componentWillUnmount() {
  // Pass the same handler reference that was given to addEventListener
  element.removeEventListener('click', handler);
  clearInterval(this.intervalId);
  clearTimeout(this.timeoutId);
}

// Use WeakMap for DOM references (entries do not keep the node alive)
const cache = new WeakMap();
```
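
Outside a component lifecycle, the same cleanup discipline can be kept by registering every listener and timer with one object and disposing of them together. This is a minimal sketch; `CleanupScope` is a hypothetical helper, not a browser or React API.

```javascript
// Collects undo functions for listeners and timers so they can all be released at once.
class CleanupScope {
  constructor() {
    this.disposers = [];
  }

  // Adds a listener and remembers how to remove it (same handler reference).
  listen(target, type, handler) {
    target.addEventListener(type, handler);
    this.disposers.push(() => target.removeEventListener(type, handler));
  }

  // Starts an interval and remembers how to clear it.
  interval(fn, ms) {
    const id = setInterval(fn, ms);
    this.disposers.push(() => clearInterval(id));
    return id;
  }

  // Runs every registered undo function, most recent first.
  dispose() {
    while (this.disposers.length > 0) {
      this.disposers.pop()();
    }
  }
}

// Usage: create one scope per view, call scope.dispose() when the view is torn down.
```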

---

### 3. Unresponsive UI

**Symptoms**:
- Clicks don't register
- Input lag
- Frozen UI

**Diagnosis**:

#### Check Main Thread
```bash
# Chrome DevTools → Performance
# Look for:
# - Long tasks (>50ms)
# - Blocking JavaScript
# - Forced synchronous layout
```

**Red flags**:
- JavaScript blocking >100ms
- Synchronous XHR requests
- Layout thrashing (read → write → read)

**Mitigation**:
```javascript
// Break up long tasks
async function processLargeArray(items) {
  for (let i = 0; i < items.length; i++) {
    await processItem(items[i]);

    // Yield to main thread after every 100 items
    if ((i + 1) % 100 === 0) {
      await new Promise(resolve => setTimeout(resolve, 0));
    }
  }
}

// Use requestIdleCallback (not available in every browser; fall back to setTimeout)
requestIdleCallback(() => {
  // Non-critical work
});
```

---

### 4. White Screen / Failed Render

**Symptoms**:
- Blank page
- Error boundary triggered
- Console errors

**Diagnosis**:

#### Check Console Errors
```bash
# Chrome DevTools → Console
# Look for:
# - Uncaught exceptions
# - Network errors (failed chunks)
# - CORS errors
```

**Common causes**:
- JavaScript error in render
- Failed to load chunk (code splitting)
- CORS blocking API calls
- Missing dependencies

**Mitigation**:
```javascript
// Error boundary
class ErrorBoundary extends React.Component {
  state = { hasError: false };

  // Switch to the fallback UI on the next render after a child throws
  static getDerivedStateFromError() {
    return { hasError: true };
  }

  componentDidCatch(error, errorInfo) {
    logErrorToService(error, errorInfo);
  }

  render() {
    if (this.state.hasError) {
      return <ErrorFallback />;
    }
    return this.props.children;
  }
}

// Retry failed chunk loads (fn is a function returning a promise, e.g. () => import(...))
const retryImport = (fn, retriesLeft = 3) => {
  return new Promise((resolve, reject) => {
    fn()
      .then(resolve)
      .catch(error => {
        if (retriesLeft === 0) {
          reject(error);
        } else {
          // Wait 1 second, then try again with one fewer retry
          setTimeout(() => {
            retryImport(fn, retriesLeft - 1).then(resolve, reject);
          }, 1000);
        }
      });
  });
};
```

---

## UI Performance Metrics

**Core Web Vitals**:
- **LCP** (Largest Contentful Paint): <2.5s (good), 2.5-4s (needs improvement), >4s (poor)
- **FID** (First Input Delay): <100ms (good), 100-300ms (needs improvement), >300ms (poor). INP (Interaction to Next Paint) replaced FID as a Core Web Vital in March 2024: <200ms (good), 200-500ms (needs improvement), >500ms (poor)
- **CLS** (Cumulative Layout Shift): <0.1 (good), 0.1-0.25 (needs improvement), >0.25 (poor)

**Other Metrics**:
- **TTFB** (Time to First Byte): <200ms
- **FCP** (First Contentful Paint): <1.8s
- **TTI** (Time to Interactive): <3.8s

**Measurement**:
```javascript
// Web Vitals library (v3+ API; v2 and earlier exported getLCP, getFID, getCLS instead)
import {onLCP, onINP, onCLS} from 'web-vitals';

onLCP(console.log);
onINP(console.log); // INP supersedes FID in current releases
onCLS(console.log);
```
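
To turn a reported value into the good / needs improvement / poor buckets listed above, a small lookup is enough. This is a sketch using this document's thresholds; `rateVital` and `THRESHOLDS` are hypothetical names and are not part of the `web-vitals` package.

```javascript
// Thresholds from the Core Web Vitals list above.
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 }, // milliseconds
  FID: { good: 100, poor: 300 },   // milliseconds
  CLS: { good: 0.1, poor: 0.25 },  // unitless score
};

// Classifies a metric value; boundary values count toward the better bucket.
function rateVital(name, value) {
  const limits = THRESHOLDS[name];
  if (!limits) {
    throw new Error(`Unknown metric: ${name}`);
  }
  if (value <= limits.good) return 'good';
  if (value <= limits.poor) return 'needs improvement';
  return 'poor';
}

// Example: a 3-second LCP
console.log(rateVital('LCP', 3000));
```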

---

## Common UI Anti-Patterns

### 1. Render Everything Upfront
**Problem**: Rendering 10,000 items at once
**Solution**: Virtual scrolling, pagination, infinite scroll

### 2. No Code Splitting
**Problem**: 5MB JavaScript bundle loaded upfront
**Solution**: Route-based code splitting, lazy loading

### 3. Large Images
**Problem**: 5MB PNG images
**Solution**: WebP, compression, lazy loading, responsive images

### 4. Blocking JavaScript
**Problem**: Heavy computation on main thread
**Solution**: Web Workers, requestIdleCallback, splitting work into chunks that yield to the main thread (async/await alone does not unblock it)

### 5. Memory Leaks
**Problem**: Event listeners not removed, timers not cleared
**Solution**: Cleanup in componentWillUnmount (or the cleanup function returned from useEffect), WeakMap

---

## UI Diagnostic Checklist

**When diagnosing slow UI**:

- [ ] Check bundle size (target: <500KB gzipped)
- [ ] Check number of network requests (target: <50)
- [ ] Check Core Web Vitals (LCP <2.5s, FID <100ms, CLS <0.1)
- [ ] Check for JavaScript errors in console
- [ ] Check render performance (no long tasks >50ms)
- [ ] Check memory usage (no continuous growth)
- [ ] Check for CORS errors
- [ ] Check for failed chunk loads
- [ ] Check image sizes (target: <200KB per image)
- [ ] Check for blocking resources

**Tools**:
- Chrome DevTools (Network, Performance, Memory, Console)
- Lighthouse
- Web Vitals library
- webpack-bundle-analyzer
- React DevTools Profiler

---

## Related Documentation

- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Backend troubleshooting
- [monitoring.md](monitoring.md) - Observability tools