Initial commit

Zhongwei Li
2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions

# Backend/API Diagnostics
**Purpose**: Troubleshoot backend services, APIs, and application-level performance issues.
## Common Backend Issues
### 1. Slow API Response
**Symptoms**:
- API response time >1 second
- Users report slow loading
- Timeout errors
**Diagnosis**:
#### Check Application Logs
```bash
# Check for slow requests (assumes the duration in ms is field 5; adjust to your log format)
grep "duration" /var/log/application.log | awk '{if ($5 > 1000) print}'
# Check error rate
grep "ERROR" /var/log/application.log | wc -l
# Check recent errors
tail -f /var/log/application.log | grep "ERROR"
```
**Red flags**:
- Repeated errors for same endpoint
- Increasing response times
- Timeout errors
---
#### Check Application Metrics
```bash
# CPU usage
top -bn1 | grep "node\|java\|python"
# Memory usage
ps aux | grep "node\|java\|python" | awk '{print $4, $11}'
# Thread count
ps -eLf | grep "node\|java\|python" | wc -l
# Open file descriptors
lsof -p <PID> | wc -l
```
**Red flags**:
- CPU >80%
- Memory increasing over time
- Thread count increasing (thread leak)
- File descriptors increasing (connection leak)
---
#### Check Database Query Time
```bash
# If slow, likely database issue
# See database-diagnostics.md
# Check if query time matches API response time
# API response time = Query time + Application processing
```
---
#### Check External API Calls
```bash
# Check if calling external APIs
grep "http.request" /var/log/application.log
# Check external API response time
# Use APM tools or custom instrumentation
```
**Red flags**:
- External API taking >500ms
- External API rate limiting (429 errors)
- External API errors (5xx errors)
**Mitigation**:
- Cache external API responses
- Add timeout (don't wait >5s)
- Circuit breaker pattern
- Fallback data (see the sketch after this list)
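A minimal sketch of the timeout-plus-fallback part of that list, assuming a Node.js 18+ runtime with the built-in `fetch`; the endpoint and `FALLBACK_WEATHER` payload are illustrative:
```javascript
// Abort the upstream call after 5s and serve fallback data instead of failing the request
async function fetchWithTimeout(url, { timeoutMs = 5000 } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`Upstream returned ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}

const FALLBACK_WEATHER = { temperature: null, source: 'fallback' }; // illustrative fallback payload

async function getWeather(city) {
  try {
    return await fetchWithTimeout(`https://api.example.com/weather?city=${encodeURIComponent(city)}`);
  } catch (err) {
    console.warn('External API failed, serving fallback:', err.message);
    return FALLBACK_WEATHER; // degrade gracefully instead of propagating the error
  }
}
```
Caching and a circuit breaker (see the 503 section below) layer on top of the same wrapper.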
---
### 2. 5xx Errors (500, 502, 503, 504)
**Symptoms**:
- Users getting error messages
- Monitoring alerts for error rate
- Some/all requests failing
**Diagnosis by Error Code**:
#### 500 Internal Server Error
**Cause**: Application code error
**Diagnosis**:
```bash
# Check application logs for exceptions
grep "Exception\|Error" /var/log/application.log | tail -20
# Check stack traces
tail -100 /var/log/application.log
```
**Common causes**:
- NullPointerException / TypeError
- Unhandled promise rejection
- Database connection error
- Missing environment variable
**Mitigation**:
- Fix bug in code
- Add error handling (see the sketch after this list)
- Add input validation
- Add monitoring for this error
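A minimal sketch of centralized error handling, assuming an Express app; `db.getUser` is a hypothetical data-access call:
```javascript
// Route errors are forwarded to one handler instead of surfacing as bare 500s
app.get('/users/:id', async (req, res, next) => {
  try {
    const user = await db.getUser(req.params.id); // hypothetical data-access call
    if (!user) return res.status(404).json({ error: 'Not found' });
    res.json(user);
  } catch (err) {
    next(err); // forward to the error middleware below
  }
});

// Error middleware: log with request context, return a generic 500 to the client
app.use((err, req, res, next) => {
  console.error({ message: err.message, stack: err.stack, path: req.path });
  res.status(500).json({ error: 'Internal server error' });
});

// Catch unhandled promise rejections so they show up in the logs
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', reason);
});
```
Logging the stack together with the request path is what turns a generic 500 into something diagnosable.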
---
#### 502 Bad Gateway
**Cause**: Reverse proxy can't reach backend
**Diagnosis**:
```bash
# Check if application is running
ps aux | grep "node\|java\|python"
# Check application port
netstat -tlnp | grep <PORT>
# Check reverse proxy logs (nginx, apache)
tail -f /var/log/nginx/error.log
```
**Common causes**:
- Application crashed
- Application not listening on expected port
- Firewall blocking connection
- Reverse proxy misconfigured
**Mitigation**:
- Restart application
- Check application logs for crash reason
- Verify port configuration
- Check reverse proxy config
---
#### 503 Service Unavailable
**Cause**: Application overloaded or unhealthy
**Diagnosis**:
```bash
# Check application health
curl http://localhost:<PORT>/health
# Check connection pool
# Database connections, HTTP connections
# Check queue depth
# Message queues, task queues
```
**Common causes**:
- Too many concurrent requests
- Database connection pool exhausted
- Dependency service down
- Health check failing
**Mitigation**:
- Scale horizontally (add more instances)
- Increase connection pool size
- Rate limiting
- Circuit breaker for dependencies (see the sketch after this list)
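A hand-rolled circuit-breaker sketch for calls to a flaky dependency; the threshold and cool-off values are illustrative, and a library such as `opossum` can replace this in production:
```javascript
// Stop calling a failing dependency for a cool-off period instead of piling up timeouts
class CircuitBreaker {
  constructor(fn, { failureThreshold = 5, resetTimeoutMs = 30000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(...args) {
    const open = this.failures >= this.failureThreshold;
    if (open && Date.now() - this.openedAt < this.resetTimeoutMs) {
      throw new Error('Circuit open: dependency is failing, skipping the call');
    }
    try {
      const result = await this.fn(...args); // after the cool-off, this is a half-open trial
      this.failures = 0;                     // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage (callPaymentService is hypothetical):
// const breaker = new CircuitBreaker(callPaymentService);
// await breaker.call(order);
```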
---
#### 504 Gateway Timeout
**Cause**: Application took too long to respond
**Diagnosis**:
```bash
# Check what's slow
# Database query? External API? Long computation?
# Check application logs for slow operations
grep "slow\|timeout" /var/log/application.log
```
**Common causes**:
- Slow database query
- Slow external API call
- Long-running computation
- Deadlock
**Mitigation**:
- Optimize slow operation
- Add timeout to prevent indefinite wait
- Async processing (return 202 Accepted; see the sketch after this list)
- Increase timeout (last resort)
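A minimal sketch of the async-processing option, assuming Express; `generateReport` is a hypothetical slow operation and the in-memory `jobs` map stands in for a real queue (Redis, SQS, etc.):
```javascript
const express = require('express');
const crypto = require('crypto');

const app = express();
app.use(express.json());
const jobs = new Map(); // jobId -> { status, result }

app.post('/reports', (req, res) => {
  const jobId = crypto.randomUUID();
  jobs.set(jobId, { status: 'pending', result: null });

  // Kick off the slow work without blocking the HTTP response
  setImmediate(async () => {
    try {
      const result = await generateReport(req.body); // hypothetical slow operation
      jobs.set(jobId, { status: 'done', result });
    } catch (err) {
      jobs.set(jobId, { status: 'failed', result: err.message });
    }
  });

  // Respond immediately so the gateway never times out
  res.status(202).json({ jobId, statusUrl: `/reports/${jobId}` });
});

app.get('/reports/:jobId', (req, res) => {
  const job = jobs.get(req.params.jobId);
  if (!job) return res.status(404).end();
  res.json(job);
});
```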
---
### 3. Memory Leak (Backend)
**Symptoms**:
- Memory usage increasing over time
- Application crashes with OutOfMemoryError
- Performance degrades over time
**Diagnosis**:
#### Monitor Memory Over Time
```bash
# Linux
watch -n 5 'ps aux | grep <PROCESS> | awk "{print \$4, \$5, \$6}"'
# Get heap dump (Java)
jmap -dump:format=b,file=heap.bin <PID>
# Get heap snapshot (Node.js)
node --inspect index.js
# Chrome DevTools → Memory → Take heap snapshot
```
**Red flags**:
- Memory increasing linearly
- Memory not released after GC
- Large arrays/objects in heap dump
---
#### Common Causes
```javascript
// 1. Event listeners not removed
emitter.on('event', handler); // Never removed
// 2. Timers not cleared
setInterval(() => { /* ... */ }, 1000); // Never cleared
// 3. Global variables growing
global.cache = {}; // Grows forever
// 4. Closures holding references
function createHandler() {
const largeData = new Array(1000000);
return () => {
// Closure keeps largeData in memory
};
}
// 5. Connection leaks
const conn = await db.connect();
// Never closed → connection pool exhausted
```
**Mitigation**:
```javascript
// 1. Remove event listeners
const handler = () => { /* ... */ };
emitter.on('event', handler);
// Later:
emitter.off('event', handler);
// 2. Clear timers
const intervalId = setInterval(() => { /* ... */ }, 1000);
// Later:
clearInterval(intervalId);
// 3. Use LRU cache
const LRU = require('lru-cache');
const cache = new LRU({ max: 1000 });
// 4. Be careful with closures
function createHandler() {
return () => {
const largeData = loadData(); // Load when needed
};
}
// 5. Always close connections
const conn = await db.connect();
try {
await conn.query(/* ... */);
} finally {
await conn.close();
}
```
---
### 4. High CPU Usage
**Symptoms**:
- CPU at 100%
- Slow response times
- Server becomes unresponsive
**Diagnosis**:
#### Identify CPU-heavy Process
```bash
# Top CPU processes
top -bn1 | head -20
# CPU per thread (Java)
top -H -p <PID>
# Profile application (Node.js)
node --prof index.js
node --prof-process isolate-*.log
```
**Common causes**:
- Infinite loop
- Heavy computation (parsing, encryption)
- Regular expression catastrophic backtracking
- Large JSON parsing
**Mitigation**:
```javascript
// 1. Break up heavy computation
async function processLargeArray(items) {
for (let i = 0; i < items.length; i++) {
await processItem(items[i]);
// Yield to event loop
if (i % 100 === 0) {
await new Promise(resolve => setImmediate(resolve));
}
}
}
// 2. Use worker threads (Node.js)
const { Worker } = require('worker_threads');
const worker = new Worker('./heavy-computation.js');
// 3. Cache results
const cache = new Map();
function expensiveOperation(input) {
  if (cache.has(input)) return cache.get(input);
  const result = heavyComputation(input); // stand-in for the expensive work
  cache.set(input, result);
  return result;
}
// 4. Fix regex backtracking
// Bad:    /^(a+)+$/   (nested quantifiers → catastrophic backtracking on input like "aaaaaaaaX")
// Better: /^a+$/      (matches the same strings without the nested group)
```
---
### 5. Connection Pool Exhausted
**Symptoms**:
- "Connection pool exhausted" errors
- "Too many connections" errors
- Requests timing out
**Diagnosis**:
#### Check Connection Pool
```bash
# Database connections
# PostgreSQL:
SELECT count(*) FROM pg_stat_activity;
# MySQL:
SHOW PROCESSLIST;
# Application connection pool
# Check application metrics/logs
```
**Red flags**:
- Connections = max pool size
- Idle connections in transaction
- Long-running queries holding connections
**Common causes**:
- Connections not released (missing .close())
- Connection leak in error path
- Pool size too small
- Long-running queries
**Mitigation**:
```javascript
// 1. Always close connections
async function queryDatabase() {
const conn = await pool.connect();
try {
const result = await conn.query('SELECT * FROM users');
return result;
} finally {
conn.release(); // CRITICAL
}
}
// 2. Use connection pool wrapper
const pool = new Pool({
max: 20, // max connections
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000,
});
// 3. Monitor pool metrics
pool.on('error', (err) => {
console.error('Pool error:', err);
});
// 4. Increase pool size (if needed)
// But investigate leaks first!
```
---
## Backend Performance Metrics
**Response Time**:
- p50: <100ms
- p95: <500ms
- p99: <1s
**Throughput**:
- Requests per second (RPS)
- Requests per minute (RPM)
**Error Rate**:
- Target: <0.1%
- 4xx errors: Client errors (validation)
- 5xx errors: Server errors (bugs, downtime)
**Resource Usage**:
- CPU: <70% average
- Memory: <80% of limit
- Connections: <80% of pool size
**Availability**:
- Target: 99.9% (8.76 hours downtime/year)
- 99.99%: 52.6 minutes downtime/year
- 99.999%: 5.26 minutes downtime/year
---
## Backend Diagnostic Checklist
**When diagnosing slow backend**:
- [ ] Check application logs for errors
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check database query time (see database-diagnostics.md)
- [ ] Check external API calls (timeout, errors)
- [ ] Check connection pool (target: <80% used)
- [ ] Check error rate (target: <0.1%)
- [ ] Check response time percentiles (p95, p99)
- [ ] Check for thread leaks (increasing thread count)
- [ ] Check for memory leaks (increasing memory over time)
**Tools**:
- Application logs
- APM tools (New Relic, DataDog, AppDynamics)
- `top`, `htop`, `ps`, `lsof`
- `curl` with timing
- Profilers (node --prof, jstack, py-spy)
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [database-diagnostics.md](database-diagnostics.md) - Database troubleshooting
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting
- [monitoring.md](monitoring.md) - Observability tools

# Database Diagnostics
**Purpose**: Troubleshoot database performance, slow queries, deadlocks, and connection issues.
## Common Database Issues
### 1. Slow Query
**Symptoms**:
- API response time high
- Specific endpoint slow
- Database CPU high
**Diagnosis**:
#### Enable Slow Query Log (PostgreSQL)
```sql
-- Set slow query threshold (1 second)
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
-- Check slow query log
-- /var/log/postgresql/postgresql.log
```
#### Enable Slow Query Log (MySQL)
```sql
-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
-- Check slow query log
-- /var/log/mysql/mysql-slow.log
```
---
#### Analyze Query with EXPLAIN
```sql
-- PostgreSQL
EXPLAIN ANALYZE
SELECT users.*, posts.*
FROM users
LEFT JOIN posts ON posts.user_id = users.id
WHERE users.last_login_at > NOW() - INTERVAL '30 days';
-- Look for:
-- - Seq Scan (sequential scan = BAD for large tables)
-- - High cost numbers
-- - High actual time
```
**Red flags in EXPLAIN output**:
- **Seq Scan** on large table (>10k rows) → Missing index
- **Nested Loop** with large outer table → Missing index
- **Hash Join** with large tables → Consider index
- **Actual time** >> **Planned time** → Statistics outdated
**Example Bad Query**:
```
Seq Scan on users (cost=0.00..100000 rows=10000000)
Filter: (last_login_at > '2025-09-26'::date)
Rows Removed by Filter: 9900000
```
**Missing index on last_login_at**
---
#### Check Missing Indexes
```sql
-- PostgreSQL: Find missing indexes
SELECT
schemaname,
tablename,
seq_scan,
seq_tup_read,
idx_scan,
seq_tup_read / seq_scan AS avg_seq_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
-- Tables with high seq_scan and low idx_scan need indexes
```
---
#### Create Index
```sql
-- PostgreSQL (CONCURRENTLY = no table lock)
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);
-- Verify index is used
EXPLAIN ANALYZE
SELECT * FROM users WHERE last_login_at > NOW() - INTERVAL '30 days';
-- Should show: Index Scan using idx_users_last_login_at
```
**Impact**:
- Before: 7.8 seconds (Seq Scan)
- After: 50ms (Index Scan)
---
### 2. Database Deadlock
**Symptoms**:
- "Deadlock detected" errors
- Transactions timing out
- API 500 errors
**Diagnosis**:
#### Check for Deadlocks (PostgreSQL)
```sql
-- Check currently locked queries
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
#### Check for Deadlocks (MySQL)
```sql
-- Show InnoDB status (includes deadlock info)
SHOW ENGINE INNODB STATUS\G
-- Look for "LATEST DETECTED DEADLOCK" section
```
---
#### Common Deadlock Patterns
```sql
-- Pattern 1: Lock order mismatch
-- Transaction 1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- Transaction 2 (runs concurrently):
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- Locks id=2
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- Waits for id=1 (deadlock!)
COMMIT;
```
**Fix**: Always lock in same order
```sql
-- Both transactions acquire row locks in the same order (lowest id first).
-- E.g. Transaction 2 (transferring from account 2 to account 1) still touches id=1 first:
BEGIN;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;  -- lock the lower id first
UPDATE accounts SET balance = balance - 50 WHERE id = 2;  -- then the higher id
COMMIT;
```
---
#### Immediate Mitigation
```sql
-- PostgreSQL: Kill blocking query
SELECT pg_terminate_backend(<blocking_pid>);
-- PostgreSQL: Kill idle transactions
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < NOW() - INTERVAL '5 minutes';
```
---
### 3. Connection Pool Exhausted
**Symptoms**:
- "Too many connections" errors
- "Connection pool exhausted" errors
- New connections timing out
**Diagnosis**:
#### Check Active Connections (PostgreSQL)
```sql
-- Count connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
-- Show all connections
SELECT pid, usename, application_name, state, query
FROM pg_stat_activity
WHERE state != 'idle';
-- Check max connections
SHOW max_connections;
```
#### Check Active Connections (MySQL)
```sql
-- Show all connections
SHOW PROCESSLIST;
-- Count connections by state
SELECT state, COUNT(*)
FROM information_schema.processlist
GROUP BY state;
-- Check max connections
SHOW VARIABLES LIKE 'max_connections';
```
**Red flags**:
- Connections = max_connections
- Many "idle in transaction" (connections held but not used)
- Long-running queries holding connections
---
#### Immediate Mitigation
```sql
-- PostgreSQL: Kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';
-- Increase max_connections (temporary; this requires a server restart, not just a reload)
ALTER SYSTEM SET max_connections = 200;
-- Then restart PostgreSQL, e.g.: systemctl restart postgresql
```
**Long-term Fix**:
- Fix connection leaks in application code
- Increase connection pool size (if needed)
- Add connection timeout
- Use connection pooler (PgBouncer, ProxySQL)
---
### 4. High Database CPU
**Symptoms**:
- Database CPU >80%
- All queries slow
- Server overload
**Diagnosis**:
#### Find CPU-heavy Queries (PostgreSQL)
```sql
-- Top queries by total time
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
max_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Requires: CREATE EXTENSION pg_stat_statements;
-- (and shared_preload_libraries = 'pg_stat_statements' in postgresql.conf, then a restart)
```
#### Find CPU-heavy Queries (MySQL)
```sql
-- performance_schema cannot be toggled at runtime; set performance_schema=ON in my.cnf
-- and restart (it is ON by default since MySQL 5.6.6)
-- Top queries by execution time
SELECT
DIGEST_TEXT,
COUNT_STAR,
SUM_TIMER_WAIT,
AVG_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```
**Common causes**:
- Missing indexes (Seq Scan)
- Complex queries (many JOINs)
- Aggregations on large tables
- Full table scans
**Mitigation**:
- Add missing indexes
- Optimize queries (reduce JOINs)
- Add query caching (see the sketch after this list)
- Scale database (read replicas)
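A minimal sketch of the application-side query caching idea, assuming a node-postgres style `pool`; the TTL and key scheme are illustrative and only suit reads that tolerate some staleness:
```javascript
// Serve repeated read queries from process memory instead of hitting the database
const cache = new Map(); // key -> { value, expiresAt }

async function cachedQuery(pool, sql, params = [], ttlMs = 30000) {
  const key = sql + JSON.stringify(params);
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit

  const { rows } = await pool.query(sql, params);          // cache miss: query the DB
  cache.set(key, { value: rows, expiresAt: Date.now() + ttlMs });
  return rows;
}

// Usage: dashboards, reference data, anything tolerant of ~30s staleness
// const products = await cachedQuery(pool, 'SELECT * FROM products WHERE active = $1', [true]);
```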
---
### 5. Disk Full
**Symptoms**:
- "No space left on device" errors
- Database refuses writes
- Application crashes
**Diagnosis**:
#### Check Disk Usage
```bash
# Linux
df -h
# Database data directory
du -sh /var/lib/postgresql/data/*
du -sh /var/lib/mysql/*
# Find large tables
# PostgreSQL:
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
```
---
#### Immediate Mitigation
```bash
# 1. Clean up logs
rm /var/log/postgresql/postgresql-*.log.1
rm /var/log/mysql/mysql-slow.log.1
# 2. Vacuum database (PostgreSQL)
# Note: VACUUM FULL returns space to the OS but rewrites each table and temporarily needs
# extra free space roughly the size of the table, which is risky on a nearly full disk
psql -c "VACUUM FULL;"
# 3. Archive old data
# Move old records to archive table or backup
# 4. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
```
---
### 6. Replication Lag
**Symptoms**:
- Stale data on read replicas
- Monitoring alerts for lag
- Eventually consistent reads
**Diagnosis**:
#### Check Replication Lag (PostgreSQL)
```sql
-- On primary:
SELECT * FROM pg_stat_replication;
-- On replica:
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_lag;
```
#### Check Replication Lag (MySQL)
```sql
-- On replica:
SHOW SLAVE STATUS\G
-- Look for: Seconds_Behind_Master
-- (MySQL 8.0.22+: SHOW REPLICA STATUS\G, look for Seconds_Behind_Source)
```
**Red flags**:
- Lag >1 minute
- Lag increasing over time
**Common causes**:
- High write load on primary
- Replica under-provisioned
- Network latency
- Long-running query blocking replay
**Mitigation**:
- Scale up replica (more CPU, memory)
- Optimize slow queries on primary
- Increase network bandwidth
- Add more replicas (distribute read load)
---
## Database Performance Metrics
**Query Performance**:
- p50 query time: <10ms
- p95 query time: <100ms
- p99 query time: <500ms
**Resource Usage**:
- CPU: <70% average
- Memory: <80% of available
- Disk I/O: <80% of throughput
- Connections: <80% of max
**Availability**:
- Uptime: 99.99% (52.6 min downtime/year)
- Replication lag: <1 second
---
## Database Diagnostic Checklist
**When diagnosing slow database**:
- [ ] Check slow query log
- [ ] Run EXPLAIN ANALYZE on slow queries
- [ ] Check for missing indexes (seq_scan > idx_scan)
- [ ] Check for deadlocks
- [ ] Check connection count (target: <80% of max)
- [ ] Check database CPU (target: <70%)
- [ ] Check disk space (target: <80% used)
- [ ] Check replication lag (target: <1s)
- [ ] Check for long-running queries (>30s)
- [ ] Check for idle transactions (>5 min)
**Tools**:
- `EXPLAIN ANALYZE`
- `pg_stat_statements` (PostgreSQL)
- Performance Schema (MySQL)
- `pg_stat_activity` (PostgreSQL)
- `SHOW PROCESSLIST` (MySQL)
- Database monitoring (CloudWatch, DataDog)
---
## Database Anti-Patterns
### 1. N+1 Query Problem
```javascript
// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
}
// 1 query + N queries = N+1
// GOOD: Single query with JOIN
const usersWithPosts = await db.query(`
SELECT users.*, posts.*
FROM users
LEFT JOIN posts ON posts.user_id = users.id
`);
```
### 2. SELECT *
```sql
-- BAD: Fetches all columns (inefficient)
SELECT * FROM users WHERE id = 1;
-- GOOD: Fetch only needed columns
SELECT id, name, email FROM users WHERE id = 1;
```
### 3. Missing Indexes
```sql
-- BAD: No index on frequently queried column
SELECT * FROM users WHERE email = 'user@example.com';
-- Seq Scan on users
-- GOOD: Add index
CREATE INDEX idx_users_email ON users(email);
-- Index Scan using idx_users_email
```
### 4. Long Transactions
```javascript
// BAD: Long transaction holding a row lock across a slow external call
await db.query('BEGIN');
let user = await db.query('SELECT * FROM users WHERE id = 1 FOR UPDATE');
await sendEmail(user.email); // External API call (slow!) while the lock is held
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');
await db.query('COMMIT');

// GOOD: Keep transactions short
user = await db.query('SELECT * FROM users WHERE id = 1');
await sendEmail(user.email); // Outside any transaction
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');
```
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Backend troubleshooting
- [infrastructure.md](infrastructure.md) - Server/network troubleshooting

# Infrastructure Diagnostics
**Purpose**: Troubleshoot server, network, disk, and cloud infrastructure issues.
## Common Infrastructure Issues
### 1. High CPU Usage (Server)
**Symptoms**:
- Server CPU at 100%
- Applications slow
- SSH lag
**Diagnosis**:
#### Check CPU Usage
```bash
# Overall CPU usage
top -bn1 | grep "Cpu(s)"
# Top CPU processes
top -bn1 | head -20
# CPU usage per core
mpstat -P ALL 1 5
# Historical CPU (if sar installed)
sar -u 1 10
```
**Red flags**:
- CPU at 100% for >5 minutes
- Single process using >80% CPU
- iowait >20% (disk bottleneck)
- System CPU >30% (kernel overhead)
---
#### Identify CPU-heavy Process
```bash
# Top CPU process
ps aux | sort -nrk 3,3 | head -10
# CPU per thread
top -H
# Process tree
pstree -p
```
**Common causes**:
- Application bug (infinite loop)
- Heavy computation
- Crypto mining malware
- Backup/compression running
---
#### Immediate Mitigation
```bash
# 1. Limit process CPU (nice)
renice +10 <PID> # Lower priority
# 2. Kill process (last resort)
kill -TERM <PID> # Graceful
kill -KILL <PID> # Force kill
# 3. Scale horizontally (add servers)
# Cloud: Auto-scaling group
# 4. Scale vertically (bigger instance)
# Cloud: Resize instance
```
---
### 2. Out of Memory (OOM)
**Symptoms**:
- "Out of memory" errors
- OOM Killer triggered
- Applications crash
- Swap usage high
**Diagnosis**:
#### Check Memory Usage
```bash
# Current memory usage
free -h
# Memory per process
ps aux | sort -nrk 4,4 | head -10
# Check OOM killer logs
dmesg | grep -i "out of memory\|oom"
grep "Out of memory" /var/log/syslog
# Check swap usage
swapon -s
```
**Red flags**:
- Available memory <10%
- Swap usage >80%
- OOM killer active
- Single process using >50% memory
---
#### Immediate Mitigation
```bash
# 1. Free page cache (safe)
sync && echo 3 > /proc/sys/vm/drop_caches
# 2. Kill memory-heavy process
kill -9 <PID>
# 3. Increase swap (temporary)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile   # swapon warns about insecure permissions otherwise
mkswap /swapfile
swapon /swapfile
# 4. Scale up (more RAM)
# Cloud: Resize instance
```
---
### 3. Disk Full
**Symptoms**:
- "No space left on device" errors
- Applications can't write files
- Database refuses writes
- Logs not being written
**Diagnosis**:
#### Check Disk Usage
```bash
# Disk usage by partition
df -h
# Disk usage by directory
du -sh /*
du -sh /var/*
# Find large files
find / -type f -size +100M -exec ls -lh {} \;
# Find files using deleted space
lsof | grep deleted
```
**Red flags**:
- Disk usage >90%
- /var/log full (runaway logs)
- /tmp full (temp files not cleaned)
- Deleted files still holding space (process has handle)
---
#### Immediate Mitigation
```bash
# 1. Clean up logs
find /var/log -name "*.log.*" -mtime +7 -delete
journalctl --vacuum-time=7d
# 2. Clean up temp files (careful: running services may hold sockets/files in /tmp)
rm -rf /tmp/*
rm -rf /var/tmp/*
# 3. Find processes holding deleted files open (don't blindly kill them all)
lsof +L1                      # or: lsof | grep deleted
# Restart the offending service to release the space, or truncate the open handle:
# > /proc/<PID>/fd/<FD>
# 4. Compress rotated logs (don't gzip a log a service is still writing to)
gzip /var/log/*.log.1
# 5. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
# After expanding, grow the partition (if partitioned), then the filesystem:
growpart /dev/xvda 1   # grow partition 1 (cloud-utils package)
resize2fs /dev/xvda1   # ext4
xfs_growfs /           # xfs
```
---
### 4. Network Issues
**Symptoms**:
- Slow network performance
- Timeouts
- Connection refused
- High latency
**Diagnosis**:
#### Check Network Connectivity
```bash
# Ping test
ping -c 5 google.com
# DNS resolution
nslookup example.com
dig example.com
# Traceroute
traceroute example.com
# Check network interfaces
ip addr show
ifconfig
# Check routing table
ip route show
route -n
```
**Red flags**:
- Packet loss >1%
- Latency >100ms (same region)
- DNS resolution failures
- Interface down
---
#### Check Network Bandwidth
```bash
# Current bandwidth usage
iftop -i eth0
# Network stats
netstat -i
# Historical bandwidth (if vnstat installed)
vnstat -l
# Check for bandwidth limits (cloud)
# AWS: Check CloudWatch NetworkIn/NetworkOut
```
---
#### Check Firewall Rules
```bash
# Check iptables rules
iptables -L -n -v
# Check firewalld (RHEL/CentOS)
firewall-cmd --list-all
# Check UFW (Ubuntu)
ufw status verbose
# Check security groups (cloud)
# AWS: EC2 → Security Groups
# Azure: Network Security Groups
```
**Common causes**:
- Firewall blocking traffic
- Security group misconfigured
- MTU mismatch
- Network congestion
- DDoS attack
---
#### Immediate Mitigation
```bash
# 1. Allow traffic through the firewall (-I inserts at the top; rule order matters)
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp --dport 443 -j ACCEPT
# 2. Restart networking
systemctl restart networking        # Debian-style ifupdown
systemctl restart NetworkManager    # NetworkManager-based systems
# 3. Flush DNS cache
systemd-resolve --flush-caches      # newer systemd: resolvectl flush-caches
# 4. Check cloud network ACLs
# Ensure subnet has route to internet gateway
```
---
### 5. High Disk I/O (Slow Disk)
**Symptoms**:
- Applications slow
- High iowait CPU
- Disk latency high
**Diagnosis**:
#### Check Disk I/O
```bash
# Disk I/O stats
iostat -x 1 5
# Look for:
# - %util >80% (disk saturated)
# - await >100ms (high latency)
# Top I/O processes
iotop -o
# Historical I/O (if sar installed)
sar -d 1 10
```
**Red flags**:
- %util at 100%
- await >100ms
- iowait CPU >20%
- Queue size (avgqu-sz) >10
---
#### Common Causes
```bash
# 1. Database without indexes (Seq Scan)
# See database-diagnostics.md
# 2. Log rotation running
# Large logs being compressed
# 3. Backup running
# Database dump, file backup
# 4. Disk issue (bad sectors)
dmesg | grep -i "I/O error"
smartctl -a /dev/sda # SMART status
```
---
#### Immediate Mitigation
```bash
# 1. Reduce I/O pressure
# Stop non-critical processes (backup, log rotation)
# 2. Add read cache
# Enable query caching (database)
# Add Redis for application cache
# 3. Scale disk IOPS (cloud)
# AWS: Change EBS volume type (gp2 → gp3 → io1)
# Azure: Change disk tier
# 4. Move to SSD (if on HDD)
```
---
### 6. Service Down / Process Crashed
**Symptoms**:
- Service not responding
- Health check failures
- 502 Bad Gateway
**Diagnosis**:
#### Check Service Status
```bash
# Systemd services
systemctl status nginx
systemctl status postgresql
systemctl status application
# Check if process running
ps aux | grep nginx
pidof nginx
# Check service logs
journalctl -u nginx -n 50
tail -f /var/log/nginx/error.log
```
**Red flags**:
- Service: inactive (dead)
- Process not found
- Recent crash in logs
---
#### Check Why Service Crashed
```bash
# Check system logs
dmesg | tail -50
grep "error\|segfault\|killed" /var/log/syslog
# Check application logs
tail -100 /var/log/application.log
# Check for OOM killer
dmesg | grep -i "killed process"
# Check core dumps
ls -l /var/crash/
ls -l /tmp/core*
```
**Common causes**:
- Out of memory (OOM Killer)
- Segmentation fault (code bug)
- Unhandled exception
- Dependency service down
- Configuration error
---
#### Immediate Mitigation
```bash
# 1. Restart service
systemctl restart nginx
# 2. Check if started successfully
systemctl status nginx
curl http://localhost
# 3. If startup fails, check config
nginx -t   # Test nginx config
sudo -u postgres postgres -C data_directory -D /var/lib/postgresql/data   # Parses postgresql.conf; syntax errors are reported
# 4. Enable auto-restart (systemd)
# Add to service file:
[Service]
Restart=always
RestartSec=10
```
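Restarts are least disruptive when the process shuts down gracefully on SIGTERM (which both `systemctl restart` and `kill -TERM` send). A minimal Node.js sketch, assuming an Express-style `app`:
```javascript
// Drain in-flight requests on SIGTERM instead of dropping them
const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections...');
  server.close(() => {
    // All in-flight requests finished; exit cleanly so systemd can restart us
    process.exit(0);
  });
  // Safety net: force-exit if draining takes too long
  setTimeout(() => process.exit(1), 10000).unref();
});
```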
---
### 7. Cloud Infrastructure Issues
#### AWS-Specific
**Instance Issues**:
```bash
# Check instance health
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
# Check system logs
aws ec2 get-console-output --instance-id i-1234567890abcdef0
# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0
```
**EBS Volume Issues**:
```bash
# Check volume status
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0
# Increase IOPS (gp3)
aws ec2 modify-volume \
--volume-id vol-1234567890abcdef0 \
--iops 3000
# Check volume metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeReadOps
```
**Network Issues**:
```bash
# Check security groups
aws ec2 describe-security-groups --group-ids sg-1234567890abcdef0
# Check network ACLs
aws ec2 describe-network-acls --network-acl-ids acl-1234567890abcdef0
# Check route tables
aws ec2 describe-route-tables --route-table-ids rtb-1234567890abcdef0
```
---
#### Azure-Specific
**VM Issues**:
```bash
# Check VM status
az vm get-instance-view --name myVM --resource-group myRG
# Restart VM
az vm restart --name myVM --resource-group myRG
# Resize VM
az vm resize --name myVM --resource-group myRG --size Standard_D4s_v3
```
**Disk Issues**:
```bash
# Check disk status
az disk show --name myDisk --resource-group myRG
# Expand disk
az disk update --name myDisk --resource-group myRG --size-gb 256
```
---
## Infrastructure Performance Metrics
**Server Health**:
- CPU: <70% average, <90% peak
- Memory: <80% usage
- Disk: <80% usage, <80% IOPS
- Network: <70% bandwidth
**Uptime**:
- Target: 99.9% (8.76 hours downtime/year)
- Monitoring: Check every 1 minute
**Response Time**:
- Ping latency: <50ms (same region)
- HTTP response: <200ms
---
## Infrastructure Diagnostic Checklist
**When diagnosing infrastructure issues**:
- [ ] Check CPU usage (target: <70%)
- [ ] Check memory usage (target: <80%)
- [ ] Check disk usage (target: <80%)
- [ ] Check disk I/O (%util, await)
- [ ] Check network connectivity (ping, traceroute)
- [ ] Check firewall rules (iptables, security groups)
- [ ] Check service status (systemd, ps)
- [ ] Check system logs (dmesg, /var/log/syslog)
- [ ] Check cloud metrics (CloudWatch, Azure Monitor)
- [ ] Check for hardware issues (SMART, dmesg errors)
**Tools**:
- `top`, `htop` - CPU, memory
- `df`, `du` - Disk usage
- `iostat` - Disk I/O
- `iftop`, `netstat` - Network
- `dmesg`, `journalctl` - System logs
- Cloud dashboards (AWS, Azure, GCP)
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application-level troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database performance
- [security-incidents.md](security-incidents.md) - Security response

# Monitoring & Observability
**Purpose**: Set up monitoring, alerting, and observability to detect incidents early.
## Observability Pillars
### 1. Metrics
**What to Monitor**:
- **Application**: Response time, error rate, throughput
- **Infrastructure**: CPU, memory, disk, network
- **Database**: Query time, connections, deadlocks
- **Business**: User signups, revenue, conversions
**Tools**:
- Prometheus + Grafana
- DataDog
- New Relic
- CloudWatch (AWS)
- Azure Monitor
---
#### Key Metrics by Layer
**Application Metrics**:
```
http_requests_total # Total requests
http_request_duration_seconds # Response time (histogram)
http_requests_errors_total # Error count
http_requests_in_flight # Concurrent requests
```
**Infrastructure Metrics**:
```
node_cpu_seconds_total             # CPU time per mode (derive usage with rate())
node_memory_MemAvailable_bytes     # Available memory
node_filesystem_avail_bytes        # Free disk space per filesystem
node_network_receive_bytes_total   # Network in
```
**Database Metrics**:
```
pg_stat_database_tup_returned # Rows returned
pg_stat_database_tup_fetched # Rows fetched
pg_stat_database_deadlocks # Deadlock count
pg_stat_activity_count             # Connections by state (postgres_exporter)
```
---
### 2. Logs
**What to Log**:
- **Application logs**: Errors, warnings, info
- **Access logs**: HTTP requests (nginx, apache)
- **System logs**: Kernel, systemd, auth
- **Audit logs**: Security events, data access
**Log Levels**:
- **ERROR**: Application errors, exceptions
- **WARN**: Potential issues (deprecated API, high latency)
- **INFO**: Normal operations (user login, job completed)
- **DEBUG**: Detailed troubleshooting (only in dev)
**Tools**:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Splunk
- CloudWatch Logs
- Azure Log Analytics
---
#### Structured Logging
**BAD** (unstructured):
```javascript
console.log("User logged in: " + userId);
```
**GOOD** (structured JSON):
```javascript
logger.info("User logged in", {
userId: 123,
ip: "192.168.1.1",
timestamp: "2025-10-26T12:00:00Z",
userAgent: "Mozilla/5.0...",
});
// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
```
**Benefits**:
- Queryable (filter by userId)
- Machine-readable
- Consistent format
---
### 3. Traces
**Purpose**: Track request flow through distributed systems
**Example**:
```
User Request → API Gateway → Auth Service → Payment Service → Database
     1ms           2ms            50ms            100ms          30ms
                                                 ↑ SLOW SPAN
```
**Tools**:
- Jaeger
- Zipkin
- AWS X-Ray
- DataDog APM
- New Relic
**When to Use**:
- Microservices architecture
- Slow requests (which service is slow?)
- Debugging distributed systems
---
## Alerting Best Practices
### Alert on Symptoms, Not Causes
**BAD** (cause-based):
- Alert: "CPU usage >80%"
- Problem: CPU can be high without user impact
**GOOD** (symptom-based):
- Alert: "API response time >1s"
- Why: Users actually experiencing slowness
---
### Alert Severity Levels
**P1 (SEV1) - Page On-Call**:
- Service down (availability <99%)
- Data loss
- Security breach
- Response time >5s (unusable)
**P2 (SEV2) - Notify During Business Hours**:
- Degraded performance (response time >1s)
- Error rate >1%
- Disk >90% full
**P3 (SEV3) - Email/Slack**:
- Warning signs (disk >80%, memory >80%)
- Non-critical errors
- Monitoring gaps
---
### Alert Fatigue Prevention
**Rules**:
1. **Actionable**: Every alert must have clear action
2. **Meaningful**: Alert only on real problems
3. **Context**: Include relevant info (which server, which metric)
4. **Deduplicate**: Don't alert 100 times for same issue
5. **Escalate**: Auto-escalate if not acknowledged
**Example Bad Alert**:
```
Subject: Alert
Body: Server is down
```
**Example Good Alert**:
```
Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api
```
---
## Monitoring Setup
### Application Monitoring
#### Prometheus + Grafana
**Install Prometheus Client** (Node.js):
```javascript
const client = require('prom-client');
// Enable default metrics (CPU, memory, etc.)
client.collectDefaultMetrics();
// Custom metrics
const httpRequestDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status'],
});
// Instrument code
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is undefined for unmatched routes, so fall back to the raw path
    end({ method: req.method, route: req.route ? req.route.path : req.path, status: res.statusCode });
  });
  next();
});
// Expose metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```
**Prometheus Config** (prometheus.yml):
```yaml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
```
---
### Log Aggregation
#### ELK Stack
**Application** (send logs to Logstash):
```javascript
const winston = require('winston');
const LogstashTransport = require('winston-logstash-transport').LogstashTransport;
const logger = winston.createLogger({
transports: [
new LogstashTransport({
host: 'logstash.example.com',
port: 5000,
}),
],
});
logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```
**Logstash Config**:
```
input {
  tcp {
    port  => 5000
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}
```
---
### Health Checks
**Purpose**: Check if service is healthy and ready to serve traffic
**Types**:
1. **Liveness**: Is the service running? (restart if fails)
2. **Readiness**: Is the service ready to serve traffic? (remove from load balancer if fails)
**Example** (Express.js):
```javascript
// Liveness probe (simple check)
app.get('/healthz', (req, res) => {
res.status(200).send('OK');
});
// Readiness probe (check dependencies)
app.get('/ready', async (req, res) => {
try {
// Check database
await db.query('SELECT 1');
// Check Redis
await redis.ping();
// Check external API
await fetch('https://api.external.com/health');
res.status(200).send('Ready');
} catch (error) {
res.status(503).send('Not ready');
}
});
```
**Kubernetes**:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
```
---
### SLI, SLO, SLA
**SLI** (Service Level Indicator):
- Metrics that measure service quality
- Examples: Response time, error rate, availability
**SLO** (Service Level Objective):
- Target for SLI
- Examples: "99.9% availability", "p95 response time <500ms"
**SLA** (Service Level Agreement):
- Contract with users (with penalties)
- Examples: "99.9% uptime or refund"
**Example**:
```
SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get 10% refund
```
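To make the SLO concrete, a small sketch that computes the availability SLI and the remaining monthly error budget from request counters (the counter values are illustrative):
```javascript
// Availability SLI and error budget from success/total request counters
function availabilitySli(successfulRequests, totalRequests) {
  return totalRequests === 0 ? 1 : successfulRequests / totalRequests;
}

function remainingErrorBudget(successfulRequests, totalRequests, slo = 0.999) {
  const allowedFailures = totalRequests * (1 - slo);        // 0.1% of all requests may fail
  const actualFailures = totalRequests - successfulRequests;
  return allowedFailures - actualFailures;                  // negative = SLO breached
}

// Illustrative month: 10,000,000 requests, 9,995,000 succeeded
console.log(availabilitySli(9_995_000, 10_000_000));        // 0.9995 → 99.95% availability
console.log(remainingErrorBudget(9_995_000, 10_000_000));   // ≈ 5000 failures of budget left
```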
---
## Monitoring Checklist
**Application**:
- [ ] Response time metrics (p50, p95, p99)
- [ ] Error rate metrics (4xx, 5xx)
- [ ] Throughput metrics (requests per second)
- [ ] Health check endpoint (/healthz, /ready)
- [ ] Structured logging (JSON format)
- [ ] Distributed tracing (if microservices)
**Infrastructure**:
- [ ] CPU, memory, disk, network metrics
- [ ] System logs (syslog, journalctl)
- [ ] Cloud metrics (CloudWatch, Azure Monitor)
- [ ] Disk I/O metrics (iostat)
**Database**:
- [ ] Query performance metrics
- [ ] Connection pool metrics
- [ ] Slow query log enabled
- [ ] Deadlock monitoring
**Alerts**:
- [ ] P1 alerts for critical issues (page on-call)
- [ ] P2 alerts for degraded performance
- [ ] Runbook linked in alerts
- [ ] Dashboard linked in alerts
- [ ] Escalation policy configured
**Dashboards**:
- [ ] Overview dashboard (RED metrics: Rate, Errors, Duration)
- [ ] Infrastructure dashboard (CPU, memory, disk)
- [ ] Database dashboard (queries, connections)
- [ ] Business metrics dashboard (signups, revenue)
---
## Common Monitoring Patterns
### RED Method (for services)
**Rate**: Requests per second
**Errors**: Error rate (%)
**Duration**: Response time (p50, p95, p99)
**Dashboard**:
```
+--------------+  +--------------+  +--------------+
|     Rate     |  |    Errors    |  |   Duration   |
|  1000 req/s  |  |     0.5%     |  |  p95: 250ms  |
+--------------+  +--------------+  +--------------+
```
### USE Method (for resources)
**Utilization**: % of resource used (CPU, memory, disk)
**Saturation**: Queue depth, backlog
**Errors**: Error count
**Dashboard**:
```
CPU: 70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk: 60% utilization, 5ms latency, 0 I/O errors
```
---
## Tools Comparison
| Tool | Type | Best For | Cost |
|------|------|----------|------|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database monitoring
- [infrastructure.md](infrastructure.md) - Infrastructure monitoring

# Security Incidents
**Purpose**: Respond to security breaches, DDoS attacks, and unauthorized access attempts.
**IMPORTANT**: For security incidents, SRE Agent collaborates with `security-agent` skill.
## Incident Response Protocol
### SEV1 Security Incidents (CRITICAL)
**Immediate Actions** (First 5 minutes):
1. **Isolate** affected systems
2. **Preserve** evidence (logs, snapshots)
3. **Notify** security team and management
4. **Assess** scope of breach
5. **Document** timeline
**DO NOT**:
- Delete logs (preserve evidence)
- Reboot systems (unless absolutely necessary)
- Make changes without documenting
---
## Common Security Incidents
### 1. DDoS Attack
**Symptoms**:
- Sudden traffic spike (10x-100x normal)
- Legitimate users can't access service
- High bandwidth usage
- Server overload
**Diagnosis**:
#### Check Traffic Patterns
```bash
# Check connections by IP
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
# Check HTTP requests by IP (nginx)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
# Check requests per second
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
```
**Red flags**:
- Single IP making thousands of requests
- Requests from suspicious IPs (botnets)
- High rate of 4xx errors (probing)
- Unusual traffic patterns
---
#### Immediate Mitigation
```bash
# 1. Rate limiting (nginx)
# In the http{} block of nginx.conf:
#   limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
# In the relevant server{} or location{} block:
#   limit_req zone=one burst=20 nodelay;
# 2. Block suspicious IPs (iptables)
iptables -A INPUT -s <ATTACKER_IP> -j DROP
# 3. Enable DDoS protection (CloudFlare, AWS Shield)
# CloudFlare: Enable "I'm Under Attack" mode
# AWS: Enable AWS Shield Standard/Advanced
# 4. Increase capacity (auto-scaling)
# Scale up to handle traffic (if legitimate)
```
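In addition to the nginx rules, a minimal in-memory per-IP rate limiter sketch at the application layer, assuming Express; in production a shared store (Redis) or a package such as `express-rate-limit` is preferable because process memory is not shared across instances:
```javascript
// Fixed-window, per-IP rate limiting in process memory (window and limit are illustrative)
const hits = new Map(); // ip -> { count, windowStart }
const WINDOW_MS = 60 * 1000;
const MAX_PER_WINDOW = 100;

app.use((req, res, next) => {
  const now = Date.now();
  const entry = hits.get(req.ip) || { count: 0, windowStart: now };
  if (now - entry.windowStart > WINDOW_MS) {
    entry.count = 0;          // start a new window
    entry.windowStart = now;
  }
  entry.count += 1;
  hits.set(req.ip, entry);
  if (entry.count > MAX_PER_WINDOW) {
    return res.status(429).send('Too Many Requests');
  }
  next();
});
```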
---
### 2. Unauthorized Access / Data Breach
**Symptoms**:
- Alerts for failed login attempts
- Successful login from unusual location
- Unusual data access patterns
- Data exfiltration detected
**Diagnosis**:
#### Check Access Logs
```bash
# Check authentication logs (Linux)
grep "Failed password" /var/log/auth.log | tail -50
# Check successful logins
grep "Accepted password" /var/log/auth.log | tail -50
# Check login attempts by IP
awk '/Failed password/ {print $(NF-3)}' /var/log/auth.log | sort | uniq -c | sort -nr
```
**Red flags**:
- Hundreds of failed login attempts (brute force)
- Successful login from suspicious IP/location
- Login at unusual time (3am)
- Multiple accounts accessed from same IP
---
#### Immediate Response (SEV1)
```bash
# 1. ISOLATE: Disable compromised account
# Application-level:
UPDATE users SET disabled = true WHERE id = <COMPROMISED_USER_ID>;
# System-level:
passwd -l <username> # Lock account
# 2. PRESERVE: Copy logs for forensics
cp /var/log/auth.log /forensics/auth.log.$(date +%Y%m%d)
cp /var/log/nginx/access.log /forensics/access.log.$(date +%Y%m%d)
# 3. ASSESS: Check what was accessed
# Database audit logs
# Application logs
# File access logs
# 4. NOTIFY: Alert security team
# Email, Slack, PagerDuty
# 5. DOCUMENT: Create incident timeline
```
---
#### Long-term Mitigation
- Force password reset for all users
- Enable 2FA/MFA
- Review access controls
- Conduct security audit
- Update security policies
- Train users on security
---
### 3. SQL Injection Attempt
**Symptoms**:
- Unusual SQL queries in logs
- 500 errors with SQL syntax messages
- Alerts from WAF (Web Application Firewall)
**Diagnosis**:
#### Check Application Logs
```bash
# Look for SQL injection patterns
grep -E "(SELECT|INSERT|UPDATE|DELETE).*FROM.*WHERE" /var/log/application.log
# Look for SQL errors
grep "SQLException\|SQL syntax" /var/log/application.log
# Check for malicious patterns
grep -E "(\'\s*OR\s*\'|\-\-|UNION\s+SELECT)" /var/log/nginx/access.log
```
**Example Malicious Request**:
```
GET /api/users?id=1' OR '1'='1
GET /api/users?id=1; DROP TABLE users;--
```
---
#### Immediate Response
```bash
# 1. Block attacker IP
iptables -A INPUT -s <ATTACKER_IP> -j DROP
# 2. Enable WAF rule (ModSecurity, AWS WAF)
# Block requests with SQL keywords
# 3. Check database for unauthorized changes
# Compare current schema with backup
# Check audit logs for suspicious queries
# 4. Review application code
# Use parameterized queries, not string concatenation
```
**Long-term Fix**:
```javascript
// BAD: SQL injection vulnerable
const query = `SELECT * FROM users WHERE id = ${req.query.id}`;
// GOOD: Parameterized query
const query = 'SELECT * FROM users WHERE id = ?';
db.query(query, [req.query.id]);
```
---
### 4. Malware / Crypto Mining
**Symptoms**:
- High CPU usage (100%)
- Unusual network traffic (to crypto pool)
- Unknown processes running
- Server slow
**Diagnosis**:
#### Check Running Processes
```bash
# Check CPU usage by process
top -bn1 | head -20
# Check all processes
ps aux | sort -nrk 3,3 | head -20
# Check for suspicious processes
ps aux | grep -v -E "^(root|www-data|mysql|postgres)"
# Check network connections
netstat -tunap | grep ESTABLISHED
```
**Red flags**:
- Unknown process using 100% CPU
- Connections to crypto mining pools
- Processes running as unexpected user
- Processes with random names (xmrig, minerd)
---
#### Immediate Response
```bash
# 1. Kill malicious process
kill -9 <PID>
# 2. Find and remove malware
find / -name "<PROCESS_NAME>" -delete
# 3. Check for persistence mechanisms
crontab -l # Cron jobs
cat /etc/rc.local # Startup scripts
systemctl list-unit-files # Systemd services
# 4. Change all credentials
# Root password
# SSH keys
# Database passwords
# API keys
# 5. Restore from clean backup (if available)
```
---
### 5. Insider Threat / Data Exfiltration
**Symptoms**:
- Large data downloads
- Database dump exports
- Unusual file transfers
- After-hours access
**Diagnosis**:
#### Check Data Access Logs
```bash
# Check database queries (large exports)
grep "SELECT.*FROM" /var/log/postgresql/postgresql.log | grep -E "LIMIT\s+[0-9]{5,}"
# Check file downloads (nginx)
awk '$10 > 10000000 {print $1, $7, $10}' /var/log/nginx/access.log
# Check SSH file transfers
grep "sftp\|scp" /var/log/auth.log
```
**Red flags**:
- SELECT with no LIMIT (full table export)
- Large file downloads (>10MB)
- Multiple consecutive downloads
- Access from unusual location
---
#### Immediate Response
```bash
# 1. Disable account
UPDATE users SET disabled = true WHERE id = <USER_ID>;
# 2. Preserve evidence
cp /var/log/* /forensics/
# 3. Assess damage
# What data was accessed?
# What data was exported?
# What systems were compromised?
# 4. Legal/compliance notification
# GDPR: Notify within 72 hours
# HIPAA: Notify within 60 days
# PCI-DSS: Immediate notification
# 5. Incident report
```
---
## Security Incident Checklist
**When security incident detected**:
### Phase 1: Immediate Response (0-5 min)
- [ ] Classify severity (SEV1/SEV2/SEV3)
- [ ] Isolate affected systems
- [ ] Preserve evidence (logs, snapshots)
- [ ] Notify security team
- [ ] Document timeline (start timestamp)
### Phase 2: Assessment (5-30 min)
- [ ] Identify attack vector
- [ ] Assess scope (what was compromised?)
- [ ] Check for data exfiltration
- [ ] Identify attacker (IP, location, identity)
- [ ] Determine if ongoing or stopped
### Phase 3: Containment (30 min - 2 hours)
- [ ] Block attacker access
- [ ] Close vulnerability
- [ ] Revoke compromised credentials
- [ ] Remove malware/backdoors
- [ ] Restore from clean backup (if needed)
### Phase 4: Recovery (2 hours - days)
- [ ] Restore normal operations
- [ ] Verify no persistence mechanisms
- [ ] Monitor for re-infection
- [ ] Change all credentials
- [ ] Apply security patches
### Phase 5: Post-Incident (1 week)
- [ ] Complete post-mortem
- [ ] Legal/compliance notifications
- [ ] Security audit
- [ ] Update security policies
- [ ] Train team on lessons learned
---
## Collaboration with Security Agent
**SRE Agent Role**:
- Initial detection and triage
- Immediate containment
- Preserve evidence
- Restore service
**Security Agent Role** (handoff):
- Forensic analysis
- Legal compliance
- Security audit
- Policy updates
**Handoff Protocol**:
```
SRE: Detects security incident → Immediate containment
SRE: Preserves evidence → Creates incident report
SRE: Hands off to Security Agent
Security Agent: Forensic analysis → Legal compliance → Long-term fixes
SRE: Implements security fixes → Updates runbook
```
---
## Security Metrics
**Detection Time**:
- SEV1: <5 minutes from first indicator
- SEV2: <30 minutes
- SEV3: <24 hours
**Response Time**:
- SEV1: Containment within 30 minutes
- SEV2: Containment within 2 hours
- SEV3: Containment within 24 hours
**False Positives**:
- Target: <5% of security alerts
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [infrastructure.md](infrastructure.md) - Server security hardening
- [monitoring.md](monitoring.md) - Security monitoring setup
- `security-agent` skill - Full security expertise (handoff for forensics)
---
## Important Notes
**For SRE Agent**:
- Focus on IMMEDIATE containment and service restoration
- Preserve evidence (don't delete logs!)
- Hand off to `security-agent` for forensic analysis
- Document everything with timestamps
- Blameless post-mortem (focus on systems, not people)
**Legal Compliance**:
- GDPR: Notify within 72 hours of breach
- HIPAA: Notify within 60 days
- PCI-DSS: Immediate notification to card brands
- SOC 2: Document in audit trail
**Evidence Preservation**:
- Copy logs before any changes
- Take disk/memory snapshots
- Document all actions taken
- Preserve chain of custody

# UI/Frontend Diagnostics
**Purpose**: Troubleshoot frontend performance, rendering, and user experience issues.
## Common UI Issues
### 1. Slow Page Load
**Symptoms**:
- Users report long loading times
- Lighthouse score <50
- Time to Interactive (TTI) >5 seconds
**Diagnosis**:
#### Check Bundle Size
```bash
# Check JavaScript bundle size
ls -lh dist/*.js
# Analyze bundle composition
npx webpack-bundle-analyzer dist/stats.json
# Check for large dependencies
npm ls --depth=0
```
**Red flags**:
- Main bundle >500KB
- Unused dependencies in bundle
- Multiple copies of same library
**Mitigation**:
- Code splitting: `import()` for dynamic imports (see the sketch after this list)
- Tree shaking: Remove unused code
- Lazy loading: Load components on demand
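A minimal route-level code-splitting sketch with `React.lazy` and a dynamic `import()`; the `ReportsPage` component and path are illustrative:
```javascript
import React, { Suspense, lazy } from 'react';

// The heavy page is downloaded only when this component is actually rendered
const ReportsPage = lazy(() => import('./pages/ReportsPage'));

export default function App() {
  return (
    <Suspense fallback={<div>Loading…</div>}>
      <ReportsPage />
    </Suspense>
  );
}
```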
---
#### Check Network Requests
```bash
# Chrome DevTools → Network tab
# Look for:
# - Number of requests (>100 = too many)
# - Large assets (images >200KB)
# - Slow API calls (>1s)
```
**Red flags**:
- Waterfall pattern (sequential loading)
- Large uncompressed images
- Blocking requests
**Mitigation**:
- Image optimization: WebP, lazy loading
- HTTP/2: Multiplexing
- CDN: Cache static assets
---
#### Check Render Performance
```bash
# Chrome DevTools → Performance tab
# Record page load, check:
# - Long tasks (>50ms)
# - Layout thrashing
# - JavaScript execution time
```
**Red flags**:
- Long tasks blocking main thread
- Multiple layout recalculations
- Heavy JavaScript computation
**Mitigation**:
- Web Workers: Move heavy computation off main thread (see the sketch after this list)
- requestIdleCallback: Defer non-critical work
- Virtual scrolling: Render only visible items
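A minimal Web Worker sketch for moving heavy computation off the main thread; the file name, `heavyComputation`, `renderResults`, and `largeDataset` are illustrative:
```javascript
// worker.js (runs off the main thread):
//   self.onmessage = (event) => {
//     const result = heavyComputation(event.data); // hypothetical CPU-bound function
//     self.postMessage(result);
//   };

// Main thread: hand the work to the worker and update the UI when it finishes
const worker = new Worker('./worker.js');

worker.onmessage = (event) => {
  renderResults(event.data); // hypothetical UI update
};

worker.postMessage({ items: largeDataset }); // hypothetical payload
```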
---
### 2. Memory Leak (UI)
**Symptoms**:
- Browser tab becomes slow over time
- Memory usage increases continuously
- Browser eventually crashes
**Diagnosis**:
#### Chrome DevTools → Memory
```bash
# Take heap snapshot before/after user interaction
# Compare snapshots
# Look for:
# - Detached DOM nodes
# - Event listeners not removed
# - Growing arrays/objects
```
**Red flags**:
- Detached DOM elements increasing
- Event listeners not garbage collected
- Timers/intervals not cleared
**Mitigation**:
```javascript
// Clean up event listeners
componentWillUnmount() {
element.removeEventListener('click', handler);
clearInterval(this.intervalId);
clearTimeout(this.timeoutId);
}
// Use WeakMap for DOM references
const cache = new WeakMap();
```
---
### 3. Unresponsive UI
**Symptoms**:
- Clicks don't register
- Input lag
- Frozen UI
**Diagnosis**:
#### Check Main Thread
```bash
# Chrome DevTools → Performance
# Look for:
# - Long tasks (>50ms)
# - Blocking JavaScript
# - Forced synchronous layout
```
**Red flags**:
- JavaScript blocking >100ms
- Synchronous XHR requests
- Layout thrashing (read → write → read)
**Mitigation**:
```javascript
// Break up long tasks
async function processLargeArray(items) {
for (let i = 0; i < items.length; i++) {
await processItem(items[i]);
// Yield to main thread every 100 items
if (i % 100 === 0) {
await new Promise(resolve => setTimeout(resolve, 0));
}
}
}
// Use requestIdleCallback
requestIdleCallback(() => {
// Non-critical work
});
```
---
### 4. White Screen / Failed Render
**Symptoms**:
- Blank page
- Error boundary triggered
- Console errors
**Diagnosis**:
#### Check Console Errors
```bash
# Chrome DevTools → Console
# Look for:
# - Uncaught exceptions
# - Network errors (failed chunks)
# - CORS errors
```
**Common causes**:
- JavaScript error in render
- Failed to load chunk (code splitting)
- CORS blocking API calls
- Missing dependencies
**Mitigation**:
```javascript
// Error boundary
class ErrorBoundary extends React.Component {
componentDidCatch(error, errorInfo) {
logErrorToService(error, errorInfo);
}
render() {
if (this.state.hasError) {
return <ErrorFallback />;
}
return this.props.children;
}
}
// Retry failed chunk loads
const retryImport = (fn, retriesLeft = 3) => {
return new Promise((resolve, reject) => {
fn()
.then(resolve)
.catch(error => {
if (retriesLeft === 0) {
reject(error);
} else {
setTimeout(() => {
retryImport(fn, retriesLeft - 1).then(resolve, reject);
}, 1000);
}
});
});
};
```
---
## UI Performance Metrics
**Core Web Vitals**:
- **LCP** (Largest Contentful Paint): <2.5s (good), <4s (needs improvement), >4s (poor)
- **FID** (First Input Delay): <100ms (good), <300ms (needs improvement), >300ms (poor)
- **CLS** (Cumulative Layout Shift): <0.1 (good), <0.25 (needs improvement), >0.25 (poor)
**Other Metrics**:
- **TTFB** (Time to First Byte): <200ms
- **FCP** (First Contentful Paint): <1.8s
- **TTI** (Time to Interactive): <3.8s
**Measurement**:
```javascript
// Web Vitals library (v2 API shown; v3+ renames these to onLCP, onFID, onCLS)
import {getLCP, getFID, getCLS} from 'web-vitals';
getLCP(console.log);
getFID(console.log);
getCLS(console.log);
```
---
## Common UI Anti-Patterns
### 1. Render Everything Upfront
**Problem**: Rendering 10,000 items at once
**Solution**: Virtual scrolling, pagination, infinite scroll
### 2. No Code Splitting
**Problem**: 5MB JavaScript bundle loaded upfront
**Solution**: Route-based code splitting, lazy loading
### 3. Large Images
**Problem**: 5MB PNG images
**Solution**: WebP, compression, lazy loading, responsive images
### 4. Blocking JavaScript
**Problem**: Heavy computation on main thread
**Solution**: Web Workers, requestIdleCallback, async/await
### 5. Memory Leaks
**Problem**: Event listeners not removed, timers not cleared
**Solution**: Cleanup in componentWillUnmount, WeakMap
---
## UI Diagnostic Checklist
**When diagnosing slow UI**:
- [ ] Check bundle size (target: <500KB gzipped)
- [ ] Check number of network requests (target: <50)
- [ ] Check Core Web Vitals (LCP <2.5s, FID <100ms, CLS <0.1)
- [ ] Check for JavaScript errors in console
- [ ] Check render performance (no long tasks >50ms)
- [ ] Check memory usage (no continuous growth)
- [ ] Check for CORS errors
- [ ] Check for failed chunk loads
- [ ] Check image sizes (target: <200KB per image)
- [ ] Check for blocking resources
**Tools**:
- Chrome DevTools (Network, Performance, Memory, Console)
- Lighthouse
- Web Vitals library
- webpack-bundle-analyzer
- React DevTools Profiler
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Backend troubleshooting
- [monitoring.md](monitoring.md) - Observability tools