Initial commit

Zhongwei Li
2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions

# Backend/API Diagnostics
**Purpose**: Troubleshoot backend services, APIs, and application-level performance issues.
## Common Backend Issues
### 1. Slow API Response
**Symptoms**:
- API response time >1 second
- Users report slow loading
- Timeout errors
**Diagnosis**:
#### Check Application Logs
```bash
# Check for slow requests (assumes the duration in ms is field 5; adjust to your log format)
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'
# Check error rate
grep "ERROR" /var/log/application.log | wc -l
# Check recent errors
tail -f /var/log/application.log | grep "ERROR"
```
**Red flags**:
- Repeated errors for same endpoint
- Increasing response times
- Timeout errors
---
#### Check Application Metrics
```bash
# CPU usage
top -bn1 | grep "node\|java\|python"
# Memory usage
ps aux | grep "node\|java\|python" | awk '{print $4, $11}'
# Thread count
ps -eLf | grep "node\|java\|python" | wc -l
# Open file descriptors
lsof -p <PID> | wc -l
```
**Red flags**:
- CPU >80%
- Memory increasing over time
- Thread count increasing (thread leak)
- File descriptors increasing (connection leak)
---
#### Check Database Query Time
```bash
# If slow, likely database issue
# See database-diagnostics.md
# Check if query time matches API response time
# API response time = Query time + Application processing
```
---
#### Check External API Calls
```bash
# Check if calling external APIs
grep "http.request" /var/log/application.log
# Check external API response time
# Use APM tools or custom instrumentation
```
**Red flags**:
- External API taking >500ms
- External API rate limiting (429 errors)
- External API errors (5xx errors)
**Mitigation**:
- Cache external API responses
- Add timeout (don't wait >5s)
- Circuit breaker pattern
- Fallback data (see the sketch after this list)
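A minimal sketch of the timeout-plus-fallback part of that list, assuming a Node.js 18+ runtime with the built-in `fetch`; the endpoint and `FALLBACK_WEATHER` payload are illustrative:
```javascript
// Abort the upstream call after 5s and serve fallback data instead of failing the request
async function fetchWithTimeout(url, { timeoutMs = 5000 } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`Upstream returned ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}

const FALLBACK_WEATHER = { temperature: null, source: 'fallback' }; // illustrative fallback payload

async function getWeather(city) {
  try {
    return await fetchWithTimeout(`https://api.example.com/weather?city=${encodeURIComponent(city)}`);
  } catch (err) {
    console.warn('External API failed, serving fallback:', err.message);
    return FALLBACK_WEATHER; // degrade gracefully instead of propagating the error
  }
}
```
Caching and a circuit breaker (see the 503 section below) layer on top of the same wrapper.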
---
### 2. 5xx Errors (500, 502, 503, 504)
**Symptoms**:
- Users getting error messages
- Monitoring alerts for error rate
- Some/all requests failing
**Diagnosis by Error Code**:
#### 500 Internal Server Error
**Cause**: Application code error
**Diagnosis**:
```bash
# Check application logs for exceptions
grep "Exception\|Error" /var/log/application.log | tail -20
# Check stack traces
tail -100 /var/log/application.log
```
**Common causes**:
- NullPointerException / TypeError
- Unhandled promise rejection
- Database connection error
- Missing environment variable
**Mitigation**:
- Fix bug in code
- Add error handling (see the sketch after this list)
- Add input validation
- Add monitoring for this error
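A minimal sketch of centralized error handling, assuming an Express app; `db.getUser` is a hypothetical data-access call:
```javascript
// Route errors are forwarded to one handler instead of surfacing as bare 500s
app.get('/users/:id', async (req, res, next) => {
  try {
    const user = await db.getUser(req.params.id); // hypothetical data-access call
    if (!user) return res.status(404).json({ error: 'Not found' });
    res.json(user);
  } catch (err) {
    next(err); // forward to the error middleware below
  }
});

// Error middleware: log with request context, return a generic 500 to the client
app.use((err, req, res, next) => {
  console.error({ message: err.message, stack: err.stack, path: req.path });
  res.status(500).json({ error: 'Internal server error' });
});

// Catch unhandled promise rejections so they show up in the logs
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', reason);
});
```
Logging the stack together with the request path is what turns a generic 500 into something diagnosable.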
---
#### 502 Bad Gateway
**Cause**: Reverse proxy can't reach backend
**Diagnosis**:
```bash
# Check if application is running
ps aux | grep "node\|java\|python"
# Check application port
netstat -tlnp | grep <PORT>
# Check reverse proxy logs (nginx, apache)
tail -f /var/log/nginx/error.log
```
**Common causes**:
- Application crashed
- Application not listening on expected port
- Firewall blocking connection
- Reverse proxy misconfigured
**Mitigation**:
- Restart application
- Check application logs for crash reason
- Verify port configuration
- Check reverse proxy config
---
#### 503 Service Unavailable
**Cause**: Application overloaded or unhealthy
**Diagnosis**:
```bash
# Check application health
curl http://localhost:<PORT>/health
# Check connection pool
# Database connections, HTTP connections
# Check queue depth
# Message queues, task queues
```
**Common causes**:
- Too many concurrent requests
- Database connection pool exhausted
- Dependency service down
- Health check failing
**Mitigation**:
- Scale horizontally (add more instances)
- Increase connection pool size
- Rate limiting
- Circuit breaker for dependencies (see the sketch after this list)
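A hand-rolled circuit-breaker sketch for calls to a flaky dependency; the threshold and cool-off values are illustrative, and a library such as `opossum` can replace this in production:
```javascript
// Stop calling a failing dependency for a cool-off period instead of piling up timeouts
class CircuitBreaker {
  constructor(fn, { failureThreshold = 5, resetTimeoutMs = 30000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(...args) {
    const open = this.failures >= this.failureThreshold;
    if (open && Date.now() - this.openedAt < this.resetTimeoutMs) {
      throw new Error('Circuit open: dependency is failing, skipping the call');
    }
    try {
      const result = await this.fn(...args); // after the cool-off, this is a half-open trial
      this.failures = 0;                     // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage (callPaymentService is hypothetical):
// const breaker = new CircuitBreaker(callPaymentService);
// await breaker.call(order);
```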
---
#### 504 Gateway Timeout
**Cause**: Application took too long to respond
**Diagnosis**:
```bash
# Check what's slow
# Database query? External API? Long computation?
# Check application logs for slow operations
grep "slow\|timeout" /var/log/application.log
```
**Common causes**:
- Slow database query
- Slow external API call
- Long-running computation
- Deadlock
**Mitigation**:
- Optimize slow operation
- Add timeout to prevent indefinite wait
- Async processing (return 202 Accepted; see the sketch after this list)
- Increase timeout (last resort)
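A minimal sketch of the async-processing option, assuming Express; `generateReport` is a hypothetical slow operation and the in-memory `jobs` map stands in for a real queue (Redis, SQS, etc.):
```javascript
const express = require('express');
const crypto = require('crypto');

const app = express();
app.use(express.json());
const jobs = new Map(); // jobId -> { status, result }

app.post('/reports', (req, res) => {
  const jobId = crypto.randomUUID();
  jobs.set(jobId, { status: 'pending', result: null });

  // Kick off the slow work without blocking the HTTP response
  setImmediate(async () => {
    try {
      const result = await generateReport(req.body); // hypothetical slow operation
      jobs.set(jobId, { status: 'done', result });
    } catch (err) {
      jobs.set(jobId, { status: 'failed', result: err.message });
    }
  });

  // Respond immediately so the gateway never times out
  res.status(202).json({ jobId, statusUrl: `/reports/${jobId}` });
});

app.get('/reports/:jobId', (req, res) => {
  const job = jobs.get(req.params.jobId);
  if (!job) return res.status(404).end();
  res.json(job);
});
```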
---
### 3. Memory Leak (Backend)
**Symptoms**:
- Memory usage increasing over time
- Application crashes with OutOfMemoryError
- Performance degrades over time
**Diagnosis**:
#### Monitor Memory Over Time
```bash
# Linux
watch -n 5 'ps aux | grep <PROCESS> | awk "{print \$4, \$5, \$6}"'
# Get heap dump (Java)
jmap -dump:format=b,file=heap.bin <PID>
# Get heap snapshot (Node.js)
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot
```
**Red flags**:
- Memory increasing linearly
- Memory not released after GC
- Large arrays/objects in heap dump
---
#### Common Causes
```javascript
// 1. Event listeners not removed
emitter.on('event', handler); // Never removed
// 2. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared
// 3. Global variables growing
global.cache = {}; // Grows forever
// 4. Closures holding references
function createHandler() {
const largeData = new Array(1000000);
return () => {
// Closure keeps largeData in memory
};
}
// 5. Connection leaks
const conn = await db.connect();
// Never closed → connection pool exhausted
```
**Mitigation**:
```javascript
// 1. Remove event listeners
const handler = () => { /* ... */ };
emitter.on('event', handler);
// Later:
emitter.off('event', handler);
// 2. Clear timers
const intervalId = setInterval(() => { /* ... */ }, 1000);
// Later:
clearInterval(intervalId);
// 3. Use LRU cache
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });
// 4. Be careful with closures
function createHandler() {
return () => {
const largeData = loadData(); // Load when needed
};
}
// 5. Always close connections
const conn = await db.connect();
try {
await conn.query(/* ... */);
} finally {
await conn.close();
}
```
---
### 4. High CPU Usage
**Symptoms**:
- CPU at 100%
- Slow response times
- Server becomes unresponsive
**Diagnosis**:
#### Identify CPU-heavy Process
```bash
# Top CPU processes
top -bn1 | head -20
# CPU per thread (Java)
top -H -p <PID>
# Profile application (Node.js)
node --prof index.js
node --prof-process isolate-*.log
```
**Common causes**:
- Infinite loop
- Heavy computation (parsing, encryption)
- Regular expression catastrophic backtracking
- Large JSON parsing
**Mitigation**:
```javascript
// 1. Break up heavy computation
async function processLargeArray(items) {
for (let i = 0; i < items.length; i++) {
await processItem(items[i]);
// Yield to event loop
if (i % 100 === 0) {
await new Promise(resolve => setImmediate(resolve));
}
}
}
// 2. Use worker threads (Node.js)
const { Worker } = require('worker_threads');
const worker = new Worker('./heavy-computation.js');
// 3. Cache results
const cache = new Map();
function expensiveOperation(input) {
  if (cache.has(input)) return cache.get(input);
  const result = heavyComputation(input); // stand-in for the expensive work
  cache.set(input, result);
  return result;
}
// 4. Fix regex backtracking
// Bad:    /^(a+)+$/   (nested quantifiers → catastrophic backtracking on input like "aaaaaaaaX")
// Better: /^a+$/      (matches the same strings without the nested group)
```
---
### 5. Connection Pool Exhausted
**Symptoms**:
- "Connection pool exhausted" errors
- "Too many connections" errors
- Requests timing out
**Diagnosis**:
#### Check Connection Pool
```bash
# Database connections
# PostgreSQL:
SELECT count(*) FROM pg_stat_activity;
# MySQL:
SHOW PROCESSLIST;
# Application connection pool
# Check application metrics/logs
```
**Red flags**:
- Connections = max pool size
- Idle connections in transaction
- Long-running queries holding connections
**Common causes**:
- Connections not released (missing .close())
- Connection leak in error path
- Pool size too small
- Long-running queries
**Mitigation**:
```javascript
// 1. Always close connections
async function queryDatabase() {
const conn = await pool.connect();
try {
const result = await conn.query('SELECT * FROM users');
return result;
} finally {
conn.release(); // CRITICAL
}
}
// 2. Use connection pool wrapper
const pool = new Pool({
max: 20, // max connections
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000,
});
// 3. Monitor pool metrics
pool.on('error', (err) => {
console.error('Pool error:', err);
});
// 4. Increase pool size (if needed)
// But investigate leaks first!
```
---
## Backend Performance Metrics
**Response Time**:
- p50: <100ms
- p95: <500ms
- p99: <1s
**Throughput**:
- Requests per second (RPS)
- Requests per minute (RPM)
**Error Rate**:
- Target: <0.1%
- 4xx errors: Client errors (validation)
- 5xx errors: Server errors (bugs, downtime)
**Resource Usage**:
- CPU: <70% average
- Memory: <80% of limit
- Connections: <80% of pool size
**Availability**:
- Target: 99.9% (8.76 hours downtime/year)
- 99.99%: 52.6 minutes downtime/year
- 99.999%: 5.26 minutes downtime/year
---
## Backend Diagnostic Checklist
**When diagnosing slow backend**:
- [ ] Check application logs for errors
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check database query time (see database-diagnostics.md)
- [ ] Check external API calls (timeout, errors)
- [ ] Check connection pool (target: <80% used)
- [ ] Check error rate (target: <0.1%)
- [ ] Check response time percentiles (p95, p99)
- [ ] Check for thread leaks (increasing thread count)
- [ ] Check for memory leaks (increasing memory over time)
**Tools**:
- Application logs
- APM tools (New Relic, DataDog, AppDynamics)
- `top`, `htop`, `ps`, `lsof`
- `curl` with timing
- Profilers (node --prof, jstack, py-spy)
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [database-diagnostics.md](database-diagnostics.md) - Database troubleshooting
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting
- [monitoring.md](monitoring.md) - Observability tools

# Database Diagnostics
**Purpose**: Troubleshoot database performance, slow queries, deadlocks, and connection issues.
## Common Database Issues
### 1. Slow Query
**Symptoms**:
- API response time high
- Specific endpoint slow
- Database CPU high
**Diagnosis**:
#### Enable Slow Query Log (PostgreSQL)
```sql
-- Set slow query threshold (1 second)
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
-- Check slow query log
-- /var/log/postgresql/postgresql.log
```
#### Enable Slow Query Log (MySQL)
```sql
-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
-- Check slow query log
-- /var/log/mysql/mysql-slow.log
```
---
#### Analyze Query with EXPLAIN
```sql
-- PostgreSQL
EXPLAIN ANALYZE
SELECT users.*, posts.*
FROM users
LEFT JOIN posts ON posts.user_id = users.id
WHERE users.last_login_at > NOW() - INTERVAL '30 days';
-- Look for:
-- - Seq Scan (sequential scan = BAD for large tables)
-- - High cost numbers
-- - High actual time
```
**Red flags in EXPLAIN output**:
- **Seq Scan** on large table (>10k rows) → Missing index
- **Nested Loop** with large outer table → Missing index
- **Hash Join** with large tables → Consider index
- **Actual time** >> **Planned time** → Statistics outdated
**Example Bad Query**:
```
Seq Scan on users (cost=0.00..100000 rows=10000000)
Filter: (last_login_at > '2025-09-26'::date)
Rows Removed by Filter: 9900000
```
**Missing index on last_login_at**
---
#### Check Missing Indexes
```sql
-- PostgreSQL: Find missing indexes
SELECT
schemaname,
tablename,
seq_scan,
seq_tup_read,
idx_scan,
seq_tup_read / seq_scan AS avg_seq_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
-- Tables with high seq_scan and low idx_scan need indexes
```
---
#### Create Index
```sql
-- PostgreSQL (CONCURRENTLY = no table lock)
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);
-- Verify index is used
EXPLAIN ANALYZE
SELECT * FROM users WHERE last_login_at > NOW() - INTERVAL '30 days';
-- Should show: Index Scan using idx_users_last_login_at
```
**Impact**:
- Before: 7.8 seconds (Seq Scan)
- After: 50ms (Index Scan)
---
### 2. Database Deadlock
**Symptoms**:
- "Deadlock detected" errors
- Transactions timing out
- API 500 errors
**Diagnosis**:
#### Check for Deadlocks (PostgreSQL)
```sql
-- Check currently locked queries
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
#### Check for Deadlocks (MySQL)
```sql
-- Show InnoDB status (includes deadlock info)
SHOW ENGINE INNODB STATUS\G
-- Look for "LATEST DETECTED DEADLOCK" section
```
---
#### Common Deadlock Patterns
```sql
-- Pattern 1: Lock order mismatch
-- Transaction 1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- Transaction 2 (runs concurrently):
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- Locks id=2
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- Waits for id=1 (deadlock!)
COMMIT;
```
**Fix**: Always lock in same order
```sql
-- Both transactions acquire row locks in the same order (lowest id first).
-- E.g. Transaction 2 (transferring from account 2 to account 1) still touches id=1 first:
BEGIN;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;  -- lock the lower id first
UPDATE accounts SET balance = balance - 50 WHERE id = 2;  -- then the higher id
COMMIT;
```
---
#### Immediate Mitigation
```sql
-- PostgreSQL: Kill blocking query
SELECT pg_terminate_backend(<blocking_pid>);
-- PostgreSQL: Kill idle transactions
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < NOW() - INTERVAL '5 minutes';
```
---
### 3. Connection Pool Exhausted
**Symptoms**:
- "Too many connections" errors
- "Connection pool exhausted" errors
- New connections timing out
**Diagnosis**:
#### Check Active Connections (PostgreSQL)
```sql
-- Count connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
-- Show all connections
SELECT pid, usename, application_name, state, query
FROM pg_stat_activity
WHERE state != 'idle';
-- Check max connections
SHOW max_connections;
```
#### Check Active Connections (MySQL)
```sql
-- Show all connections
SHOW PROCESSLIST;
-- Count connections by state
SELECT state, COUNT(*)
FROM information_schema.processlist
GROUP BY state;
-- Check max connections
SHOW VARIABLES LIKE 'max_connections';
```
**Red flags**:
- Connections = max_connections
- Many "idle in transaction" (connections held but not used)
- Long-running queries holding connections
---
#### Immediate Mitigation
```sql
-- PostgreSQL: Kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';
-- Increase max_connections (temporary; this requires a server restart, not just a reload)
ALTER SYSTEM SET max_connections = 200;
-- Then restart PostgreSQL, e.g.: systemctl restart postgresql
```
**Long-term Fix**:
- Fix connection leaks in application code
- Increase connection pool size (if needed)
- Add connection timeout
- Use connection pooler (PgBouncer, ProxySQL)
---
### 4. High Database CPU
**Symptoms**:
- Database CPU >80%
- All queries slow
- Server overload
**Diagnosis**:
#### Find CPU-heavy Queries (PostgreSQL)
```sql
-- Top queries by total time
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
max_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Requires: CREATE EXTENSION pg_stat_statements;
-- (and shared_preload_libraries = 'pg_stat_statements' in postgresql.conf, then a restart)
```
#### Find CPU-heavy Queries (MySQL)
```sql
-- performance_schema cannot be toggled at runtime; set performance_schema=ON in my.cnf
-- and restart (it is ON by default since MySQL 5.6.6)
-- Top queries by execution time
SELECT
DIGEST_TEXT,
COUNT_STAR,
SUM_TIMER_WAIT,
AVG_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```
**Common causes**:
- Missing indexes (Seq Scan)
- Complex queries (many JOINs)
- Aggregations on large tables
- Full table scans
**Mitigation**:
- Add missing indexes
- Optimize queries (reduce JOINs)
- Add query caching (see the sketch after this list)
- Scale database (read replicas)
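A minimal sketch of the application-side query caching idea, assuming a node-postgres style `pool`; the TTL and key scheme are illustrative and only suit reads that tolerate some staleness:
```javascript
// Serve repeated read queries from process memory instead of hitting the database
const cache = new Map(); // key -> { value, expiresAt }

async function cachedQuery(pool, sql, params = [], ttlMs = 30000) {
  const key = sql + JSON.stringify(params);
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit

  const { rows } = await pool.query(sql, params);          // cache miss: query the DB
  cache.set(key, { value: rows, expiresAt: Date.now() + ttlMs });
  return rows;
}

// Usage: dashboards, reference data, anything tolerant of ~30s staleness
// const products = await cachedQuery(pool, 'SELECT * FROM products WHERE active = $1', [true]);
```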
---
### 5. Disk Full
**Symptoms**:
- "No space left on device" errors
- Database refuses writes
- Application crashes
**Diagnosis**:
#### Check Disk Usage
```bash
# Linux
df -h
# Database data directory
du -sh /var/lib/postgresql/data/*
du -sh /var/lib/mysql/*
# Find large tables
# PostgreSQL:
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
```
---
#### Immediate Mitigation
```bash
# 1. Clean up logs
rm /var/log/postgresql/postgresql-*.log.1
rm /var/log/mysql/mysql-slow.log.1
# 2. Vacuum database (PostgreSQL)
# Note: VACUUM FULL returns space to the OS but rewrites each table and temporarily needs
# extra free space roughly the size of the table, which is risky on a nearly full disk
psql -c "VACUUM FULL;"
# 3. Archive old data
# Move old records to archive table or backup
# 4. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
```
---
### 6. Replication Lag
**Symptoms**:
- Stale data on read replicas
- Monitoring alerts for lag
- Eventually consistent reads
**Diagnosis**:
#### Check Replication Lag (PostgreSQL)
```sql
-- On primary:
SELECT * FROM pg_stat_replication;
-- On replica:
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_lag;
```
#### Check Replication Lag (MySQL)
```sql
-- On replica:
SHOW SLAVE STATUS\G
-- Look for: Seconds_Behind_Master
-- (MySQL 8.0.22+: SHOW REPLICA STATUS\G, look for Seconds_Behind_Source)
```
**Red flags**:
- Lag >1 minute
- Lag increasing over time
**Common causes**:
- High write load on primary
- Replica under-provisioned
- Network latency
- Long-running query blocking replay
**Mitigation**:
- Scale up replica (more CPU, memory)
- Optimize slow queries on primary
- Increase network bandwidth
- Add more replicas (distribute read load)
---
## Database Performance Metrics
**Query Performance**:
- p50 query time: <10ms
- p95 query time: <100ms
- p99 query time: <500ms
**Resource Usage**:
- CPU: <70% average
- Memory: <80% of available
- Disk I/O: <80% of throughput
- Connections: <80% of max
**Availability**:
- Uptime: 99.99% (52.6 min downtime/year)
- Replication lag: <1 second
---
## Database Diagnostic Checklist
**When diagnosing slow database**:
- [ ] Check slow query log
- [ ] Run EXPLAIN ANALYZE on slow queries
- [ ] Check for missing indexes (seq_scan > idx_scan)
- [ ] Check for deadlocks
- [ ] Check connection count (target: <80% of max)
- [ ] Check database CPU (target: <70%)
- [ ] Check disk space (target: <80% used)
- [ ] Check replication lag (target: <1s)
- [ ] Check for long-running queries (>30s)
- [ ] Check for idle transactions (>5 min)
**Tools**:
- `EXPLAIN ANALYZE`
- `pg_stat_statements` (PostgreSQL)
- Performance Schema (MySQL)
- `pg_stat_activity` (PostgreSQL)
- `SHOW PROCESSLIST` (MySQL)
- Database monitoring (CloudWatch, DataDog)
---
## Database Anti-Patterns
### 1. N+1 Query Problem
```javascript
// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
}
// 1 query + N queries = N+1
// GOOD: Single query with JOIN
const usersWithPosts = await db.query(`
SELECT users.*, posts.*
FROM users
LEFT JOIN posts ON posts.user_id = users.id
`);
```
### 2. SELECT *
```sql
-- BAD: Fetches all columns (inefficient)
SELECT * FROM users WHERE id = 1;
-- GOOD: Fetch only needed columns
SELECT id, name, email FROM users WHERE id = 1;
```
### 3. Missing Indexes
```sql
-- BAD: No index on frequently queried column
SELECT * FROM users WHERE email = 'user@example.com';
-- Seq Scan on users
-- GOOD: Add index
CREATE INDEX idx_users_email ON users(email);
-- Index Scan using idx_users_email
```
### 4. Long Transactions
```javascript
// BAD: Long transaction holding a row lock across a slow external call
await db.query('BEGIN');
let user = await db.query('SELECT * FROM users WHERE id = 1 FOR UPDATE');
await sendEmail(user.email); // External API call (slow!) while the lock is held
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');
await db.query('COMMIT');

// GOOD: Keep transactions short
user = await db.query('SELECT * FROM users WHERE id = 1');
await sendEmail(user.email); // Outside any transaction
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');
```
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Backend troubleshooting
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting

# Infrastructure Diagnostics
**Purpose**: Troubleshoot server, network, disk, and cloud infrastructure issues.
## Common Infrastructure Issues
### 1. High CPU Usage (Server)
**Symptoms**:
- Server CPU at 100%
- Applications slow
- SSH lag
**Diagnosis**:
#### Check CPU Usage
```bash
# Overall CPU usage
top -bn1 | grep "Cpu(s)"
# Top CPU processes
top -bn1 | head -20
# CPU usage per core
mpstat -P ALL 1 5
# Historical CPU (if sar installed)
sar -u 1 10
```
**Red flags**:
- CPU at 100% for >5 minutes
- Single process using >80% CPU
- iowait >20% (disk bottleneck)
- System CPU >30% (kernel overhead)
---
#### Identify CPU-heavy Process
```bash
# Top CPU process
ps aux | sort -nrk 3,3 | head -10
# CPU per thread
top -H
# Process tree
pstree -p
```
**Common causes**:
- Application bug (infinite loop)
- Heavy computation
- Crypto mining malware
- Backup/compression running
---
#### Immediate Mitigation
```bash
# 1. Limit process CPU (nice)
renice +10 <PID> # Lower priority
# 2. Kill process (last resort)
kill -TERM <PID> # Graceful
kill -KILL <PID> # Force kill
# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group
# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
```
---
### 2. Out of Memory (OOM)
**Symptoms**:
- "Out of memory" errors
- OOM Killer triggered
- Applications crash
- Swap usage high
**Diagnosis**:
#### Check Memory Usage
```bash
# Current memory usage
free -h
# Memory per process
ps aux | sort -nrk 4,4 | head -10
# Check OOM killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog
# Check swap usage
swapon -s
```
**Red flags**:
- Available memory <10%
- Swap usage >80%
- OOM killer active
- Single process using >50% memory
---
#### Immediate Mitigation
```bash
# 1. Free page cache (safe)
sync && echo 3 > /proc/sys/vm/drop_caches
# 2. Kill memory-heavy process
kill -9 <PID>
# 3. Increase swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile   # swapon warns about insecure permissions otherwise
mkswap /swapfile
swapon /swapfile
# 4. Scale up (more RAM)
# Cloud: Resize instance
```
---
### 3. Disk Full
**Symptoms**:
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
**Diagnosis**:
#### Check Disk Usage
```bash
# Disk usage by partition
df -h
# Disk usage by directory
du -sh /*
du -sh /var/*
# Find large files
find / -type f -size +100M -exec ls -lh {} \;
# Find files using deleted space
lsof | grep deleted
```
**Red flags**:
- Disk usage >90%
- /var/log full (runaway logs)
- /tmp full (temp files not cleaned)
- Deleted files still holding space (process has handle)
---
#### Immediate Mitigation
```bash
# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d
# 2. Clean up temp files (careful: running services may hold sockets/files in /tmp)
rm -rf /tmp/*
rm -rf /var/tmp/*
# 3. Find processes holding deleted files open (don't blindly kill them all)
lsof +L1                      # or: lsof | grep deleted
# Restart the offending service to release the space, or truncate the open handle:
# > /proc/<PID>/fd/<FD>
# 4. Compress rotated logs (don't gzip a log a service is still writing to)
gzip /var/log/*.log.1
# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding, grow the partition (if partitioned), then the filesystem:
growpart /dev/xvda 1   # grow partition 1 (cloud-utils package)
resize2fs /dev/xvda1   # ext4
xfs_growfs /           # xfs
```
---
### 4. Network Issues
**Symptoms**:
- Slow network performance
- Timeouts
- Connection refused
- High latency
**Diagnosis**:
#### Check Network Connectivity
```bash
# Ping test
ping -c 5 google.com
# DNS resolution
nslookup example.com
dig example.com
# Traceroute
traceroute example.com
# Check network interfaces
ip addr show
ifconfig
# Check routing table
ip route show
route -n
```
**Red flags**:
- Packet loss >1%
- Latency >100ms (same region)
- DNS resolution failures
- Interface down
---
#### Check Network Bandwidth
```bash
# Current bandwidth usage
iftop -i eth0
# Network stats
netstat -i
# Historical bandwidth (if vnstat installed)
vnstat -l
# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
```
---
#### Check Firewall Rules
```bash
# Check iptables rules
iptables -L -n -v
# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all
# Check UFW (Ubuntu)
ufw status verbose
# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups
```
**Common causes**:
- Firewall blocking traffic
- Security group misconfigured
- MTU mismatch
- Network congestion
- DDoS attack
---
#### Immediate Mitigation
```bash
# 1. Allow traffic through the firewall (-I inserts at the top; rule order matters)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT
# 2. Restart networking
systemctl restart networking        # Debian-style ifupdown
systemctl restart NetworkManager    # NetworkManager-based systems
# 3. Flush DNS cache
systemd-resolve --flush-caches      # newer systemd: resolvectl flush-caches
# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
```
---
### 5. High Disk I/O (Slow Disk)
**Symptoms**:
- Applications slow
- High iowait CPU
- Disk latency high
**Diagnosis**:
#### Check Disk I/O
```bash
# Disk I/O stats
iostat -x 1 5
# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency)
# Top I/O processes
iotop -o
# Historical I/O (if sar installed)
sar -d 1 10
```
**Red flags**:
- %util at 100%
- await >100ms
- iowait CPU >20%
- Queue size (avgqu-sz) >10
---
#### Common Causes
```bash
# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md
# 2. Log rotation running
# Large logs being compressed
# 3. Backup running
# Database dump, file backup
# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda # SMART status
```
---
#### Immediate Mitigation
```bash
# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)
# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache
# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3 → io1)
# Azure: Change disk tier
# 4. Move to SSD (if on HDD)
```
---
### 6. Service Down / Process Crashed
**Symptoms**:
- Service not responding
- Health check failures
- 502 Bad Gateway
**Diagnosis**:
#### Check Service Status
```bash
# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application
# Check if process running
ps aux | grep nginx
pidof nginx
# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log
```
**Red flags**:
- Service: inactive (dead)
- Process not found
- Recent crash in logs
---
#### Check Why Service Crashed
```bash
# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog
# Check application logs
tail -100 /var/log/application.log
# Check for OOM killer
dmesg | grep -i "killed process"
# Check core dumps
ls -l /var/crash/
ls -l /tmp/core*
```
**Common causes**:
- Out of memory (OOM Killer)
- Segmentation fault (code bug)
- Unhandled exception
- Dependency service down
- Configuration error
---
#### Immediate Mitigation
```bash
# 1. Restart service
systemctl restart nginx
# 2. Check if started successfully
systemctl status nginx
curl http://localhost
# 3. If startup fails, check config
nginx -t   # Test nginx config
sudo -u postgres postgres -C data_directory -D /var/lib/postgresql/data   # Parses postgresql.conf; syntax errors are reported
# 4. Enable auto-restart (systemd)
# Add to service file:
[Service]
Restart=always
RestartSec=10
```
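Restarts are least disruptive when the process shuts down gracefully on SIGTERM (which both `systemctl restart` and `kill -TERM` send). A minimal Node.js sketch, assuming an Express-style `app`:
```javascript
// Drain in-flight requests on SIGTERM instead of dropping them
const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections...');
  server.close(() => {
    // All in-flight requests finished; exit cleanly so systemd can restart us
    process.exit(0);
  });
  // Safety net: force-exit if draining takes too long
  setTimeout(() => process.exit(1), 10000).unref();
});
```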
---
### 7. Cloud Infrastructure Issues
#### AWS-Specific
**Instance Issues**:
```bash
# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0
# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0
```
**EBS Volume Issues**:
```bash
# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0
# Increase IOPS (gp3)
aws ec2 modify-volume \
--volume-id vol-1234567890abcdef0 \
--iops 3000
# Check volume metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeReadOps
```
**Network Issues**:
```bash
# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0
# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0
# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
```
---
#### Azure-Specific
**VM Issues**:
```bash
# Check VM status
az vm get-instance-view --name myVM --resource-group myRG
# Restart VM
az vm restart --name myVM --resource-group myRG
# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
```
**Disk Issues**:
```bash
# Check disk status
az disk show --name myDisk --resource-group myRG
# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256
```
---
## Infrastructure Performance Metrics
**Server Health**:
- CPU: <70% average, <90% peak
- Memory: <80% usage
- Disk: <80% usage, <80% IOPS
- Network: <70% bandwidth
**Uptime**:
- Target: 99.9% (8.76 hours downtime/year)
- Monitoring: Check every 1 minute
**Response Time**:
- Ping latency: <50ms (same region)
- HTTP response: <200ms
---
## Infrastructure Diagnostic Checklist
**When diagnosing infrastructure issues**:
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check disk usage (target: <80%)
- [ ] Check disk I/O (%util, await)
- [ ] Check network connectivity (ping, traceroute)
- [ ] Check firewall rules (iptables, security groups)
- [ ] Check service status (systemd, ps)
- [ ] Check system logs (dmesg, /var/log/syslog)
- [ ] Check cloud metrics (CloudWatch, Azure Monitor)
- [ ] Check for hardware issues (SMART, dmesg errors)
**Tools**:
- `top`, `htop` - CPU, memory
- `df`, `du` - Disk usage
- `iostat` - Disk I/O
- `iftop`, `netstat` - Network
- `dmesg`, `journalctl` - System logs
- Cloud dashboards (AWS, Azure, GCP)
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application-level troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database performance
- [security-incidents.md](security-incidents.md) - Security response

# Monitoring & Observability
**Purpose**: Set up monitoring, alerting, and observability to detect incidents early.
## Observability Pillars
### 1. Metrics
**What to Monitor**:
- **Application**: Response time, error rate, throughput
- **Infrastructure**: CPU, memory, disk, network
- **Database**: Query time, connections, deadlocks
- **Business**: User signups, revenue, conversions
**Tools**:
- Prometheus + Grafana
- DataDog
- New Relic
- CloudWatch (AWS)
- Azure Monitor
---
#### Key Metrics by Layer
**Application Metrics**:
```
http_requests_total # Total requests
http_request_duration_seconds # Response time (histogram)
http_requests_errors_total # Error count
http_requests_in_flight # Concurrent requests
```
**Infrastructure Metrics**:
```
node_cpu_seconds_total             # CPU time per mode (derive usage with rate())
node_memory_MemAvailable_bytes     # Available memory
node_filesystem_avail_bytes        # Free disk space per filesystem
node_network_receive_bytes_total   # Network in
```
**Database Metrics**:
```
pg_stat_database_tup_returned # Rows returned
pg_stat_database_tup_fetched # Rows fetched
pg_stat_database_deadlocks # Deadlock count
pg_stat_activity_count             # Connections by state (postgres_exporter)
```
---
### 2. Logs
**What to Log**:
- **Application logs**: Errors, warnings, info
- **Access logs**: HTTP requests (nginx, apache)
- **System logs**: Kernel, systemd, auth
- **Audit logs**: Security events, data access
**Log Levels**:
- **ERROR**: Application errors, exceptions
- **WARN**: Potential issues (deprecated API, high latency)
- **INFO**: Normal operations (user login, job completed)
- **DEBUG**: Detailed troubleshooting (only in dev)
**Tools**:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Splunk
- CloudWatch Logs
- Azure Log Analytics
---
#### Structured Logging
**BAD** (unstructured):
```javascript
console.log("User logged in: " + userId);
```
**GOOD** (structured JSON):
```javascript
logger.info("User logged in", {
userId: 123,
ip: "192.168.1.1",
timestamp: "2025-10-26T12:00:00Z",
userAgent: "Mozilla/5.0...",
});
// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
```
**Benefits**:
- Queryable (filter by userId)
- Machine-readable
- Consistent format
---
### 3. Traces
**Purpose**: Track request flow through distributed systems
**Example**:
```
User Request → API Gateway → Auth Service → Payment Service → Database
     1ms           2ms            50ms            100ms          30ms
                                                 ↑ SLOW SPAN
```
**Tools**:
- Jaeger
- Zipkin
- AWS X-Ray
- DataDog APM
- New Relic
**When to Use**:
- Microservices architecture
- Slow requests (which service is slow?)
- Debugging distributed systems
---
## Alerting Best Practices
### Alert on Symptoms, Not Causes
**BAD** (cause-based):
- Alert: "CPU usage >80%"
- Problem: CPU can be high without user impact
**GOOD** (symptom-based):
- Alert: "API response time >1s"
- Why: Users actually experiencing slowness
---
### Alert Severity Levels
**P1 (SEV1) - Page On-Call**:
- Service down (availability <99%)
- Data loss
- Security breach
- Response time >5s (unusable)
**P2 (SEV2) - Notify During Business Hours**:
- Degraded performance (response time >1s)
- Error rate >1%
- Disk >90% full
**P3 (SEV3) - Email/Slack**:
- Warning signs (disk >80%, memory >80%)
- Non-critical errors
- Monitoring gaps
---
### Alert Fatigue Prevention
**Rules**:
1. **Actionable**: Every alert must have clear action
2. **Meaningful**: Alert only on real problems
3. **Context**: Include relevant info (which server, which metric)
4. **Deduplicate**: Don't alert 100 times for same issue
5. **Escalate**: Auto-escalate if not acknowledged
**Example Bad Alert**:
```
Subject: Alert
Body: Server is down
```
**Example Good Alert**:
```
Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api
```
---
## Monitoring Setup
### Application Monitoring
#### Prometheus + Grafana
**Install Prometheus Client** (Node.js):
```javascript
const client = require('prom-client');
// Enable default metrics (CPU, memory, etc.)
client.collectDefaultMetrics();
// Custom metrics
const httpRequestDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status'],
});
// Instrument code
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is undefined for unmatched routes, so fall back to the raw path
    end({ method: req.method, route: req.route ? req.route.path : req.path, status: res.statusCode });
  });
  next();
});
// Expose metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```
**Prometheus Config** (prometheus.yml):
```yaml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
```
---
### Log Aggregation
#### ELK Stack
**Application** (send logs to Logstash):
```javascript
const winston = require('winston');
const LogstashTransport = require('winston-logstash-transport').LogstashTransport;
const logger = winston.createLogger({
transports: [
new LogstashTransport({
host: 'logstash.example.com',
port: 5000,
}),
],
});
logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```
**Logstash Config**:
```
input {
  tcp {
    port  => 5000
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}
```
---
### Health Checks
**Purpose**: Check if service is healthy and ready to serve traffic
**Types**:
1. **Liveness**: Is the service running? (restart if fails)
2. **Readiness**: Is the service ready to serve traffic? (remove from load balancer if fails)
**Example** (Express.js):
```javascript
// Liveness probe (simple check)
app.get('/healthz', (req, res) => {
res.status(200).send('OK');
});
// Readiness probe (check dependencies)
app.get('/ready', async (req, res) => {
try {
// Check database
await db.query('SELECT 1');
// Check Redis
await redis.ping();
// Check external API
await fetch('https://api.external.com/health');
res.status(200).send('Ready');
} catch (error) {
res.status(503).send('Not ready');
}
});
```
**Kubernetes**:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
```
---
### SLI, SLO, SLA
**SLI** (Service Level Indicator):
- Metrics that measure service quality
- Examples: Response time, error rate, availability
**SLO** (Service Level Objective):
- Target for SLI
- Examples: "99.9% availability", "p95 response time <500ms"
**SLA** (Service Level Agreement):
- Contract with users (with penalties)
- Examples: "99.9% uptime or refund"
**Example**:
```
SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get 10% refund
```
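To make the SLO concrete, a small sketch that computes the availability SLI and the remaining monthly error budget from request counters (the counter values are illustrative):
```javascript
// Availability SLI and error budget from success/total request counters
function availabilitySli(successfulRequests, totalRequests) {
  return totalRequests === 0 ? 1 : successfulRequests / totalRequests;
}

function remainingErrorBudget(successfulRequests, totalRequests, slo = 0.999) {
  const allowedFailures = totalRequests * (1 - slo);        // 0.1% of all requests may fail
  const actualFailures = totalRequests - successfulRequests;
  return allowedFailures - actualFailures;                  // negative = SLO breached
}

// Illustrative month: 10,000,000 requests, 9,995,000 succeeded
console.log(availabilitySli(9_995_000, 10_000_000));        // 0.9995 → 99.95% availability
console.log(remainingErrorBudget(9_995_000, 10_000_000));   // ≈ 5000 failures of budget left
```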
---
## Monitoring Checklist
**Application**:
- [ ] Response time metrics (p50, p95, p99)
- [ ] Error rate metrics (4xx, 5xx)
- [ ] Throughput metrics (requests per second)
- [ ] Health check endpoint (/healthz, /ready)
- [ ] Structured logging (JSON format)
- [ ] Distributed tracing (if microservices)
**Infrastructure**:
- [ ] CPU, memory, disk, network metrics
- [ ] System logs (syslog, journalctl)
- [ ] Cloud metrics (CloudWatch, Azure Monitor)
- [ ] Disk I/O metrics (iostat)
**Database**:
- [ ] Query performance metrics
- [ ] Connection pool metrics
- [ ] Slow query log enabled
- [ ] Deadlock monitoring
**Alerts**:
- [ ] P1 alerts for critical issues (page on-call)
- [ ] P2 alerts for degraded performance
- [ ] Runbook linked in alerts
- [ ] Dashboard linked in alerts
- [ ] Escalation policy configured
**Dashboards**:
- [ ] Overview dashboard (RED metrics: Rate, Errors, Duration)
- [ ] Infrastructure dashboard (CPU, memory, disk)
- [ ] Database dashboard (queries, connections)
- [ ] Business metrics dashboard (signups, revenue)
---
## Common Monitoring Patterns
### RED Method (for services)
**Rate**: Requests per second
**Errors**: Error rate (%)
**Duration**: Response time (p50, p95, p99)
**Dashboard**:
```
+--------------+  +--------------+  +--------------+
|     Rate     |  |    Errors    |  |   Duration   |
|  1000 req/s  |  |     0.5%     |  |  p95: 250ms  |
+--------------+  +--------------+  +--------------+
```
### USE Method (for resources)
**Utilization**: % of resource used (CPU, memory, disk)
**Saturation**: Queue depth, backlog
**Errors**: Error count
**Dashboard**:
```
CPU: 70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk: 60% utilization, 5ms latency, 0 I/O errors
```
---
## Tools Comparison
| Tool | Type | Best For | Cost |
|------|------|----------|------|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database monitoring
- [infrastructure.md](infrastructure.md) - Infrastructure monitoring

# Security Incidents
**Purpose**: Respond to security breaches, DDoS attacks, and unauthorized access attempts.
**IMPORTANT**: For security incidents, SRE Agent collaborates with `security-agent` skill.
## Incident Response Protocol
### SEV1 Security Incidents (CRITICAL)
**Immediate Actions** (First 5 minutes):
1. **Isolate** affected systems
2. **Preserve** evidence (logs, snapshots)
3. **Notify** security team and management
4. **Assess** scope of breach
5. **Document** timeline
**DO NOT**:
- Delete logs (preserve evidence)
- Reboot systems (unless absolutely necessary)
- Make changes without documenting
---
## Common Security Incidents
### 1. DDoS Attack
**Symptoms**:
- Sudden traffic spike (10x-100x normal)
- Legitimate users can't access service
- High bandwidth usage
- Server overload
**Diagnosis**:
#### Check Traffic Patterns
```bash
# Check connections by IP
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
# Check HTTP requests by IP (nginx)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
# Check requests per second
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
```
**Red flags**:
- Single IP making thousands of requests
- Requests from suspicious IPs (botnets)
- High rate of 4xx errors (probing)
- Unusual traffic patterns
---
#### Immediate Mitigation
```bash
# 1. Rate limiting (nginx)
# In the http{} block of nginx.conf:
#   limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
# In the relevant server{} or location{} block:
#   limit_req zone=one burst=20 nodelay;
# 2. Block suspicious IPs (iptables)
iptables -A INPUT -s <ATTACKER_IP> -j DROP
# 3. Enable DDoS protection (CloudFlare, AWS Shield)
# CloudFlare: Enable "I'm Under Attack" mode
# AWS: Enable AWS Shield Standard/Advanced
# 4. Increase capacity (auto-scaling)
# Scale up to handle traffic (if legitimate)
```
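In addition to the nginx rules, a minimal in-memory per-IP rate limiter sketch at the application layer, assuming Express; in production a shared store (Redis) or a package such as `express-rate-limit` is preferable because process memory is not shared across instances:
```javascript
// Fixed-window, per-IP rate limiting in process memory (window and limit are illustrative)
const hits = new Map(); // ip -> { count, windowStart }
const WINDOW_MS = 60 * 1000;
const MAX_PER_WINDOW = 100;

app.use((req, res, next) => {
  const now = Date.now();
  const entry = hits.get(req.ip) || { count: 0, windowStart: now };
  if (now - entry.windowStart > WINDOW_MS) {
    entry.count = 0;          // start a new window
    entry.windowStart = now;
  }
  entry.count += 1;
  hits.set(req.ip, entry);
  if (entry.count > MAX_PER_WINDOW) {
    return res.status(429).send('Too Many Requests');
  }
  next();
});
```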
---
### 2. Unauthorized Access / Data Breach
**Symptoms**:
- Alerts for failed login attempts
- Successful login from unusual location
- Unusual data access patterns
- Data exfiltration detected
**Diagnosis**:
#### Check Access Logs
```bash
# Check authentication logs (Linux)
grep "Failed password" /var/log/auth.log | tail -50
# Check successful logins
grep "Accepted password" /var/log/auth.log | tail -50
# Check login attempts by IP
awk '/Failed password/ {print $(NF-3)}' /var/log/auth.log | sort | uniq -c | sort -nr
```
**Red flags**:
- Hundreds of failed login attempts (brute force)
- Successful login from suspicious IP/location
- Login at unusual time (3am)
- Multiple accounts accessed from same IP
---
#### Immediate Response (SEV1)
```bash
# 1. ISOLATE: Disable compromised account
# Application-level:
UPDATE users SET disabled = true WHERE id = <COMPROMISED_USER_ID>;
# System-level:
passwd -l <username> # Lock account
# 2. PRESERVE: Copy logs for forensics
cp /var/log/auth.log /forensics/auth.log.$(date +%Y%m%d)
cp /var/log/nginx/access.log /forensics/access.log.$(date +%Y%m%d)
# 3. ASSESS: Check what was accessed
# Database audit logs
# Application logs
# File access logs
# 4. NOTIFY: Alert security team
# Email, Slack, PagerDuty
# 5. DOCUMENT: Create incident timeline
```
---
#### Long-term Mitigation
- Force password reset for all users
- Enable 2FA/MFA
- Review access controls
- Conduct security audit
- Update security policies
- Train users on security
---
### 3. SQL Injection Attempt
**Symptoms**:
- Unusual SQL queries in logs
- 500 errors with SQL syntax messages
- Alerts from WAF (Web Application Firewall)
**Diagnosis**:
#### Check Application Logs
```bash
# Look for SQL injection patterns
grep -E "(SELECT|INSERT|UPDATE|DELETE).*FROM.*WHERE" /var/log/application.log
# Look for SQL errors
grep "SQLException\|SQL syntax" /var/log/application.log
# Check for malicious patterns
grep -E "(\'\s*OR\s*\'|\-\-|UNION\s+SELECT)" /var/log/nginx/access.log
```
**Example Malicious Request**:
```
GET /api/users?id=1' OR '1'='1
GET /api/users?id=1; DROP TABLE users;--
```
---
#### Immediate Response
```bash
# 1. Block attacker IP
iptables -A INPUT -s <ATTACKER_IP> -j DROP
# 2. Enable WAF rule (ModSecurity, AWS WAF)
# Block requests with SQL keywords
# 3. Check database for unauthorized changes
# Compare current schema with backup
# Check audit logs for suspicious queries
# 4. Review application code
# Use parameterized queries, not string concatenation
```
**Long-term Fix**:
```javascript
// BAD: SQL injection vulnerable
const query = `SELECT * FROM users WHERE id = ${req.query.id}`;
// GOOD: Parameterized query
const query = 'SELECT * FROM users WHERE id = ?';
db.query(query, [req.query.id]);
```
---
### 4. Malware / Crypto Mining
**Symptoms**:
- High CPU usage (100%)
- Unusual network traffic (to crypto pool)
- Unknown processes running
- Server slow
**Diagnosis**:
#### Check Running Processes
```bash
# Check CPU usage by process
top -bn1 | head -20
# Check all processes
ps aux | sort -nrk 3,3 | head -20
# Check for suspicious processes
ps aux | grep -v -E "^(root|www-data|mysql|postgres)"
# Check network connections
netstat -tunap | grep ESTABLISHED
```
**Red flags**:
- Unknown process using 100% CPU
- Connections to crypto mining pools
- Processes running as unexpected user
- Processes with random names (xmrig, minerd)
---
#### Immediate Response
```bash
# 1. Kill malicious process
kill -9 <PID>
# 2. Find and remove malware
find / -name "<PROCESS_NAME>" -delete
# 3. Check for persistence mechanisms
crontab -l # Cron jobs
cat /etc/rc.local # Startup scripts
systemctl list-unit-files # Systemd services
# 4. Change all credentials
# Root password
# SSH keys
# Database passwords
# API keys
# 5. Restore from clean backup (if available)
```
---
### 5. Insider Threat / Data Exfiltration
**Symptoms**:
- Large data downloads
- Database dump exports
- Unusual file transfers
- After-hours access
**Diagnosis**:
#### Check Data Access Logs
```bash
# Check database queries (large exports)
grep "SELECT.*FROM" /var/log/postgresql/postgresql.log | grep -E "LIMIT\s+[0-9]{5,}"
# Check file downloads (nginx)
awk '$10 > 10000000 {print $1, $7, $10}' /var/log/nginx/access.log
# Check SSH file transfers
grep "sftp\|scp" /var/log/auth.log
```
**Red flags**:
- SELECT with no LIMIT (full table export)
- Large file downloads (>10MB)
- Multiple consecutive downloads
- Access from unusual location
---
#### Immediate Response
```bash
# 1. Disable account
UPDATE users SET disabled = true WHERE id = <USER_ID>;
# 2. Preserve evidence
cp /var/log/* /forensics/
# 3. Assess damage
# What data was accessed?
# What data was exported?
# What systems were compromised?
# 4. Legal/compliance notification
# GDPR: Notify within 72 hours
# HIPAA: Notify within 60 days
# PCI-DSS: Immediate notification
# 5. Incident report
```
---
## Security Incident Checklist
**When security incident detected**:
### Phase 1: Immediate Response (0-5 min)
- [ ] Classify severity (SEV1/SEV2/SEV3)
- [ ] Isolate affected systems
- [ ] Preserve evidence (logs, snapshots)
- [ ] Notify security team
- [ ] Document timeline (start timestamp)
### Phase 2: Assessment (5-30 min)
- [ ] Identify attack vector
- [ ] Assess scope (what was compromised?)
- [ ] Check for data exfiltration
- [ ] Identify attacker (IP, location, identity)
- [ ] Determine if ongoing or stopped
### Phase 3: Containment (30 min - 2 hours)
- [ ] Block attacker access
- [ ] Close vulnerability
- [ ] Revoke compromised credentials
- [ ] Remove malware/backdoors
- [ ] Restore from clean backup (if needed)
### Phase 4: Recovery (2 hours - days)
- [ ] Restore normal operations
- [ ] Verify no persistence mechanisms
- [ ] Monitor for re-infection
- [ ] Change all credentials
- [ ] Apply security patches
### Phase 5: Post-Incident (1 week)
- [ ] Complete post-mortem
- [ ] Legal/compliance notifications
- [ ] Security audit
- [ ] Update security policies
- [ ] Train team on lessons learned
---
## Collaboration with Security Agent
**SRE Agent Role**:
- Initial detection and triage
- Immediate containment
- Preserve evidence
- Restore service
**Security Agent Role** (handoff):
- Forensic analysis
- Legal compliance
- Security audit
- Policy updates
**Handoff Protocol**:
```
SRE: Detects security incident → Immediate containment
SRE: Preserves evidence → Creates incident report
SRE: Hands off to Security Agent
Security Agent: Forensic analysis → Legal compliance → Long-term fixes
SRE: Implements security fixes → Updates runbook
```
---
## Security Metrics
**Detection Time**:
- SEV1: <5 minutes from first indicator
- SEV2: <30 minutes
- SEV3: <24 hours
**Response Time**:
- SEV1: Containment within 30 minutes
- SEV2: Containment within 2 hours
- SEV3: Containment within 24 hours
**False Positives**:
- Target: <5% of security alerts
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [infrastructure.md](infrastructure.md) - Server security hardening
- [monitoring.md](monitoring.md) - Security monitoring setup
- `security-agent` skill - Full security expertise (handoff for forensics)
---
## Important Notes
**For SRE Agent**:
- Focus on IMMEDIATE containment and service restoration
- Preserve evidence (don't delete logs!)
- Hand off to `security-agent` for forensic analysis
- Document everything with timestamps
- Blameless post-mortem (focus on systems, not people)
**Legal Compliance**:
- GDPR: Notify within 72 hours of breach
- HIPAA: Notify within 60 days
- PCI-DSS: Immediate notification to card brands
- SOC 2: Document in audit trail
**Evidence Preservation**:
- Copy logs before any changes
- Take disk/memory snapshots
- Document all actions taken
- Preserve chain of custody

# UI/Frontend Diagnostics
**Purpose**: Troubleshoot frontend performance, rendering, and user experience issues.
## Common UI Issues
### 1. Slow Page Load
**Symptoms**:
- Users report long loading times
- Lighthouse score <50
- Time to Interactive (TTI) >5 seconds
**Diagnosis**:
#### Check Bundle Size
```bash
# Check JavaScript bundle size
ls -lh dist/*.js
# Analyze bundle composition
npx webpack-bundle-analyzer dist/stats.json
# Check for large dependencies
npm ls --depth=0
```
**Red flags**:
- Main bundle >500KB
- Unused dependencies in bundle
- Multiple copies of same library
**Mitigation**:
- Code splitting: `import()` for dynamic imports (see the sketch after this list)
- Tree shaking: Remove unused code
- Lazy loading: Load components on demand
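A minimal route-level code-splitting sketch with `React.lazy` and a dynamic `import()`; the `ReportsPage` component and path are illustrative:
```javascript
import React, { Suspense, lazy } from 'react';

// The heavy page is downloaded only when this component is actually rendered
const ReportsPage = lazy(() => import('./pages/ReportsPage'));

export default function App() {
  return (
    <Suspense fallback={<div>Loading…</div>}>
      <ReportsPage />
    </Suspense>
  );
}
```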
---
#### Check Network Requests
```bash
# Chrome DevTools → Network tab
# Look for:
# - Number of requests (>100 = too many)
# - Large assets (images >200KB)
# - Slow API calls (>1s)
```
**Red flags**:
- Waterfall pattern (sequential loading)
- Large uncompressed images
- Blocking requests
**Mitigation**:
- Image optimization: WebP, lazy loading
- HTTP/2: Multiplexing
- CDN: Cache static assets
---
#### Check Render Performance
```bash
# Chrome DevTools → Performance tab
# Record page load, check:
# - Long tasks (>50ms)
# - Layout thrashing
# - JavaScript execution time
```
**Red flags**:
- Long tasks blocking main thread
- Multiple layout recalculations
- Heavy JavaScript computation
**Mitigation**:
- Web Workers: Move heavy computation off main thread (see the sketch after this list)
- requestIdleCallback: Defer non-critical work
- Virtual scrolling: Render only visible items
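A minimal Web Worker sketch for moving heavy computation off the main thread; the file name, `heavyComputation`, `renderResults`, and `largeDataset` are illustrative:
```javascript
// worker.js (runs off the main thread):
//   self.onmessage = (event) => {
//     const result = heavyComputation(event.data); // hypothetical CPU-bound function
//     self.postMessage(result);
//   };

// Main thread: hand the work to the worker and update the UI when it finishes
const worker = new Worker('./worker.js');

worker.onmessage = (event) => {
  renderResults(event.data); // hypothetical UI update
};

worker.postMessage({ items: largeDataset }); // hypothetical payload
```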
---
### 2. Memory Leak (UI)
**Symptoms**:
- Browser tab becomes slow over time
- Memory usage increases continuously
- Browser eventually crashes
**Diagnosis**:
#### Chrome DevTools → Memory
```bash
# Take heap snapshot before/after user interaction
# Compare snapshots
# Look for:
# - Detached DOM nodes
# - Event listeners not removed
# - Growing arrays/objects
```
**Red flags**:
- Detached DOM elements increasing
- Event listeners not garbage collected
- Timers/intervals not cleared
**Mitigation**:
```javascript
// Clean up event listeners
componentWillUnmount() {
element.removeEventListener('click', handler);
clearInterval(this.intervalId);
clearTimeout(this.timeoutId);
}
// Use WeakMap for DOM references
const cache = new WeakMap();
```
---
### 3. Unresponsive UI
**Symptoms**:
- Clicks don't register
- Input lag
- Frozen UI
**Diagnosis**:
#### Check Main Thread
```bash
# Chrome DevTools → Performance
# Look for:
# - Long tasks (>50ms)
# - Blocking JavaScript
# - Forced synchronous layout
```
**Red flags**:
- JavaScript blocking >100ms
- Synchronous XHR requests
- Layout thrashing (read → write → read)
**Mitigation**:
```javascript
// Break up long tasks
async function processLargeArray(items) {
for (let i = 0; i < items.length; i++) {
await processItem(items[i]);
// Yield to main thread every 100 items
if (i % 100 === 0) {
await new Promise(resolve => setTimeout(resolve, 0));
}
}
}
// Use requestIdleCallback
requestIdleCallback(() => {
// Non-critical work
});
```
---
### 4. White Screen / Failed Render
**Symptoms**:
- Blank page
- Error boundary triggered
- Console errors
**Diagnosis**:
#### Check Console Errors
```bash
# Chrome DevTools → Console
# Look for:
# - Uncaught exceptions
# - Network errors (failed chunks)
# - CORS errors
```
**Common causes**:
- JavaScript error in render
- Failed to load chunk (code splitting)
- CORS blocking API calls
- Missing dependencies
**Mitigation**:
```javascript
// Error boundary
class ErrorBoundary extends React.Component {
componentDidCatch(error, errorInfo) {
logErrorToService(error, errorInfo);
}
render() {
if (this.state.hasError) {
return <ErrorFallback />;
}
return this.props.children;
}
}
// Retry failed chunk loads
const retryImport = (fn, retriesLeft = 3) => {
return new Promise((resolve, reject) => {
fn()
.then(resolve)
.catch(error => {
if (retriesLeft === 0) {
reject(error);
} else {
setTimeout(() => {
retryImport(fn, retriesLeft - 1).then(resolve, reject);
}, 1000);
}
});
});
};
```
---
## UI Performance Metrics
**Core Web Vitals**:
- **LCP** (Largest Contentful Paint): <2.5s (good), <4s (needs improvement), >4s (poor)
- **FID** (First Input Delay): <100ms (good), <300ms (needs improvement), >300ms (poor)
- **CLS** (Cumulative Layout Shift): <0.1 (good), <0.25 (needs improvement), >0.25 (poor)
**Other Metrics**:
- **TTFB** (Time to First Byte): <200ms
- **FCP** (First Contentful Paint): <1.8s
- **TTI** (Time to Interactive): <3.8s
**Measurement**:
```javascript
// Web Vitals library (v2 API shown; v3+ renames these to onLCP, onFID, onCLS)
import {getLCP, getFID, getCLS} from 'web-vitals';
getLCP(console.log);
getFID(console.log);
getCLS(console.log);
```
---
## Common UI Anti-Patterns
### 1. Render Everything Upfront
**Problem**: Rendering 10,000 items at once
**Solution**: Virtual scrolling, pagination, infinite scroll
### 2. No Code Splitting
**Problem**: 5MB JavaScript bundle loaded upfront
**Solution**: Route-based code splitting, lazy loading
### 3. Large Images
**Problem**: 5MB PNG images
**Solution**: WebP, compression, lazy loading, responsive images
### 4. Blocking JavaScript
**Problem**: Heavy computation on main thread
**Solution**: Web Workers, requestIdleCallback, async/await
### 5. Memory Leaks
**Problem**: Event listeners not removed, timers not cleared
**Solution**: Cleanup in componentWillUnmount, WeakMap
---
## UI Diagnostic Checklist
**When diagnosing slow UI**:
- [ ] Check bundle size (target: <500KB gzipped)
- [ ] Check number of network requests (target: <50)
- [ ] Check Core Web Vitals (LCP <2.5s, FID <100ms, CLS <0.1)
- [ ] Check for JavaScript errors in console
- [ ] Check render performance (no long tasks >50ms)
- [ ] Check memory usage (no continuous growth)
- [ ] Check for CORS errors
- [ ] Check for failed chunk loads
- [ ] Check image sizes (target: <200KB per image)
- [ ] Check for blocking resources
**Tools**:
- Chrome DevTools (Network, Performance, Memory, Console)
- Lighthouse
- Web Vitals library
- webpack-bundle-analyzer
- React DevTools Profiler
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Backend troubleshooting
- [monitoring.md](monitoring.md) - Observability tools