Database Diagnostics

Purpose: Troubleshoot database performance, slow queries, deadlocks, and connection issues.

Common Database Issues

1. Slow Query

Symptoms:

  • API response time high
  • Specific endpoint slow
  • Database CPU high

Diagnosis:

Enable Slow Query Log (PostgreSQL)

-- Set slow query threshold (1 second)
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();

-- Check slow query log
-- /var/log/postgresql/postgresql.log

Enable Slow Query Log (MySQL)

-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- Check slow query log
-- /var/log/mysql/mysql-slow.log

Analyze Query with EXPLAIN

-- PostgreSQL
EXPLAIN ANALYZE
SELECT users.*, posts.*
FROM users
LEFT JOIN posts ON posts.user_id = users.id
WHERE users.last_login_at > NOW() - INTERVAL '30 days';

-- Look for:
-- - Seq Scan (sequential scan = BAD for large tables)
-- - High cost numbers
-- - High actual time

Red flags in EXPLAIN output:

  • Seq Scan on large table (>10k rows) → Missing index
  • Nested Loop with large outer table → Missing index
  • Hash Join with large tables → Consider index
  • Actual rows far from estimated rows → Statistics outdated (run ANALYZE on the table)

Example Bad Query:

Seq Scan on users  (cost=0.00..100000 rows=10000000)
  Filter: (last_login_at > '2025-09-26'::date)
  Rows Removed by Filter: 9900000

Missing index on last_login_at


Check Missing Indexes

-- PostgreSQL: Find missing indexes
SELECT
  schemaname,
  tablename,
  seq_scan,
  seq_tup_read,
  idx_scan,
  seq_tup_read / seq_scan AS avg_seq_read
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Tables with high seq_scan and low idx_scan need indexes

Create Index

-- PostgreSQL (CONCURRENTLY = no table lock)
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users(last_login_at);

-- Verify index is used
EXPLAIN ANALYZE
SELECT * FROM users WHERE last_login_at > NOW() - INTERVAL '30 days';
-- Should show: Index Scan using idx_users_last_login_at

Impact:

  • Before: 7.8 seconds (Seq Scan)
  • After: 50ms (Index Scan)

2. Database Deadlock

Symptoms:

  • "Deadlock detected" errors
  • Transactions timing out
  • API 500 errors

Diagnosis:

Check for Deadlocks (PostgreSQL)

-- Check currently locked queries
SELECT
  blocked_locks.pid AS blocked_pid,
  blocked_activity.usename AS blocked_user,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.usename AS blocking_user,
  blocked_activity.query AS blocked_statement,
  blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Check for Deadlocks (MySQL)

-- Show InnoDB status (includes deadlock info)
SHOW ENGINE INNODB STATUS\G

-- Look for "LATEST DETECTED DEADLOCK" section

Common Deadlock Patterns

-- Pattern 1: Lock order mismatch
-- Transaction 1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

-- Transaction 2 (runs concurrently):
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- Locks id=2
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- Waits for id=1 (deadlock!)
COMMIT;

Fix: Always lock rows in the same order

-- Rewrite Transaction 2 so both transactions touch id=1 first, then id=2.
-- With a consistent lock order, neither transaction can end up waiting on the other:
BEGIN;
UPDATE accounts SET balance = balance + 50 WHERE id = 1; -- lock id=1 first
UPDATE accounts SET balance = balance - 50 WHERE id = 2; -- then id=2
COMMIT;
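
At the application level the same rule can be enforced by sorting the ids before taking row locks. A minimal sketch, assuming node-postgres; the transferFunds helper and table layout are illustrative:

// Hypothetical transfer helper: always acquire row locks in ascending id order,
// regardless of which account is debited. Assumes a node-postgres Pool named `pool`.
async function transferFunds(pool, fromId, toId, amount) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const [first, second] = [fromId, toId].sort((a, b) => a - b);
    await client.query('SELECT id FROM accounts WHERE id = $1 FOR UPDATE', [first]);
    await client.query('SELECT id FROM accounts WHERE id = $1 FOR UPDATE', [second]);
    await client.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [amount, fromId]);
    await client.query('UPDATE accounts SET balance = balance + $1 WHERE id = $2', [amount, toId]);
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}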

Immediate Mitigation

-- PostgreSQL: Kill blocking query
SELECT pg_terminate_backend(<blocking_pid>);

-- PostgreSQL: Kill idle transactions
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < NOW() - INTERVAL '5 minutes';
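
Deadlocks that cannot be designed away are normally handled by retrying the whole transaction. A minimal sketch, assuming node-postgres; PostgreSQL reports deadlocks with SQLSTATE 40P01, and the retry limit is an assumption:

// Run fn inside a transaction, retrying only when the transaction was chosen as a deadlock victim.
async function withDeadlockRetry(pool, fn, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    const client = await pool.connect();
    try {
      await client.query('BEGIN');
      const result = await fn(client);
      await client.query('COMMIT');
      return result;
    } catch (err) {
      await client.query('ROLLBACK');
      if (err.code === '40P01' && attempt < maxAttempts) continue; // deadlock detected: retry
      throw err;
    } finally {
      client.release();
    }
  }
}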

3. Connection Pool Exhausted

Symptoms:

  • "Too many connections" errors
  • "Connection pool exhausted" errors
  • New connections timing out

Diagnosis:

Check Active Connections (PostgreSQL)

-- Count connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- Show all connections
SELECT pid, usename, application_name, state, query
FROM pg_stat_activity
WHERE state != 'idle';

-- Check max connections
SHOW max_connections;

Check Active Connections (MySQL)

-- Show all connections
SHOW PROCESSLIST;

-- Count connections by state
SELECT state, COUNT(*)
FROM information_schema.processlist
GROUP BY state;

-- Check max connections
SHOW VARIABLES LIKE 'max_connections';

Red flags:

  • Connections = max_connections
  • Many "idle in transaction" (connections held but not used)
  • Long-running queries holding connections

Immediate Mitigation

-- PostgreSQL: Kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';

-- Increase max_connections (note: this parameter requires a server restart;
-- ALTER SYSTEM only persists the setting, pg_reload_conf() is not enough here)
ALTER SYSTEM SET max_connections = 200;
-- Then restart PostgreSQL, e.g. sudo systemctl restart postgresql

Long-term Fix:

  • Fix connection leaks in application code
  • Increase connection pool size (if needed)
  • Add connection timeout (see the pool sketch after this list)
  • Use connection pooler (PgBouncer, ProxySQL)
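
The pool-related items above can be sketched at the application level. A minimal sketch, assuming node-postgres; the specific limits and timeouts are assumptions to tune per workload:

import { Pool } from 'pg';

// Bounded pool with timeouts so an exhausted pool fails fast instead of hanging.
const pool = new Pool({
  max: 20,                        // hard cap on connections from this process
  idleTimeoutMillis: 10000,       // close clients idle for more than 10s
  connectionTimeoutMillis: 5000,  // error out instead of queueing forever when the pool is exhausted
});

// Always release clients, even on error, to avoid connection leaks.
async function withClient(fn) {
  const client = await pool.connect();
  try {
    return await fn(client);
  } finally {
    client.release();
  }
}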

4. High Database CPU

Symptoms:

  • Database CPU >80%
  • All queries slow
  • Server overload

Diagnosis:

Find CPU-heavy Queries (PostgreSQL)

-- Top queries by total time
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Requires: CREATE EXTENSION pg_stat_statements;

Find CPU-heavy Queries (MySQL)

-- performance_schema must be enabled at server startup (my.cnf: performance_schema=ON);
-- it is read-only at runtime and cannot be changed with SET GLOBAL

-- Top queries by execution time
SELECT
  DIGEST_TEXT,
  COUNT_STAR,
  SUM_TIMER_WAIT,
  AVG_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;

Common causes:

  • Missing indexes (Seq Scan)
  • Complex queries (many JOINs)
  • Aggregations on large tables
  • Full table scans

Mitigation:

  • Add missing indexes
  • Optimize queries (reduce JOINs)
  • Add query caching (see the sketch after this list)
  • Scale database (read replicas)
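
The query-caching item above can be as simple as an in-process TTL cache in front of a hot, read-mostly query. A minimal sketch; the cache key, TTL, and db.query interface are assumptions:

const cache = new Map(); // cacheKey -> { rows, expiresAt }

async function cachedQuery(db, cacheKey, sql, params = [], ttlMs = 30000) {
  const hit = cache.get(cacheKey);
  if (hit && hit.expiresAt > Date.now()) return hit.rows; // serve from cache
  const { rows } = await db.query(sql, params);           // cache miss: hit the database
  cache.set(cacheKey, { rows, expiresAt: Date.now() + ttlMs });
  return rows;
}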

5. Disk Full

Symptoms:

  • "No space left on device" errors
  • Database refuses writes
  • Application crashes

Diagnosis:

Check Disk Usage

# Linux
df -h

# Database data directory
du -sh /var/lib/postgresql/data/*
du -sh /var/lib/mysql/*

# Find large tables
# PostgreSQL:
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

Immediate Mitigation

# 1. Clean up logs
rm /var/log/postgresql/postgresql-*.log.1
rm /var/log/mysql/mysql-slow.log.1

# 2. Reclaim space (PostgreSQL) -- run in psql
# Note: VACUUM FULL rewrites tables and temporarily needs extra free space,
# so on a nearly full disk prefer plain VACUUM or target individual tables
VACUUM FULL;

# 3. Archive old data
# Move old records to an archive table or backup (see the sketch below)

# 4. Expand disk (cloud)
# AWS: Modify EBS volume size
# Azure: Expand managed disk
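
For step 3, archiving usually means moving rows in bounded batches so no single statement holds locks or bloats WAL for long. A minimal sketch, assuming node-postgres; the events/events_archive tables and created_at column are hypothetical:

// Move rows older than 90 days into an archive table, a few thousand at a time.
async function archiveOldRows(db, batchSize = 5000) {
  for (;;) {
    const { rowCount } = await db.query(
      `WITH batch AS (
         SELECT id FROM events
         WHERE created_at < NOW() - INTERVAL '90 days'
         LIMIT $1
       ),
       moved AS (
         DELETE FROM events
         WHERE id IN (SELECT id FROM batch)
         RETURNING *
       )
       INSERT INTO events_archive SELECT * FROM moved`,
      [batchSize]
    );
    if (rowCount === 0) break; // nothing left to archive
  }
}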

6. Replication Lag

Symptoms:

  • Stale data on read replicas
  • Monitoring alerts for lag
  • Eventually consistent reads

Diagnosis:

Check Replication Lag (PostgreSQL)

-- On primary:
SELECT * FROM pg_stat_replication;

-- On replica:
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag;
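
The replica-side query can also be wrapped in a small periodic check for alerting. A minimal sketch, assuming node-postgres; the 1-second threshold mirrors the target later in this document:

async function checkReplicationLag(replicaDb, maxLagSeconds = 1) {
  const { rows } = await replicaDb.query(
    'SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds'
  );
  const raw = rows[0].lag_seconds; // NULL on a primary or before the first replayed transaction
  if (raw === null) return null;
  const lag = Number(raw);
  if (lag > maxLagSeconds) {
    console.warn(`Replication lag ${lag.toFixed(1)}s exceeds the ${maxLagSeconds}s target`);
  }
  return lag;
}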

Check Replication Lag (MySQL)

-- On replica:
SHOW SLAVE STATUS\G
-- (MySQL 8.0.22+: SHOW REPLICA STATUS\G)

-- Look for: Seconds_Behind_Master (Seconds_Behind_Source in newer versions)

Red flags:

  • Lag >1 minute
  • Lag increasing over time

Common causes:

  • High write load on primary
  • Replica under-provisioned
  • Network latency
  • Long-running query blocking replay

Mitigation:

  • Scale up replica (more CPU, memory)
  • Optimize slow queries on primary
  • Increase network bandwidth
  • Add more replicas (distribute read load)

Database Performance Metrics

Query Performance:

  • p50 query time: <10ms
  • p95 query time: <100ms
  • p99 query time: <500ms

Resource Usage:

  • CPU: <70% average
  • Memory: <80% of available
  • Disk I/O: <80% of throughput
  • Connections: <80% of max

Availability:

  • Uptime: 99.99% (52.6 min downtime/year)
  • Replication lag: <1 second

Database Diagnostic Checklist

When diagnosing slow database:

  • Check slow query log
  • Run EXPLAIN ANALYZE on slow queries
  • Check for missing indexes (seq_scan > idx_scan)
  • Check for deadlocks
  • Check connection count (target: <80% of max)
  • Check database CPU (target: <70%)
  • Check disk space (target: <80% used)
  • Check replication lag (target: <1s)
  • Check for long-running queries (>30s)
  • Check for idle transactions (>5 min)

Tools:

  • EXPLAIN ANALYZE
  • pg_stat_statements (PostgreSQL)
  • Performance Schema (MySQL)
  • pg_stat_activity (PostgreSQL)
  • SHOW PROCESSLIST (MySQL)
  • Database monitoring (CloudWatch, DataDog)

Database Anti-Patterns

1. N+1 Query Problem

// BAD: N+1 queries
const users = await db.query('SELECT * FROM users');
for (const user of users) {
  const posts = await db.query('SELECT * FROM posts WHERE user_id = ?', [user.id]);
}
// 1 query + N queries = N+1

// GOOD: Single query with JOIN
const usersWithPosts = await db.query(`
  SELECT users.*, posts.*
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
`);

2. SELECT *

-- BAD: Fetches all columns (inefficient)
SELECT * FROM users WHERE id = 1;

-- GOOD: Fetch only needed columns
SELECT id, name, email FROM users WHERE id = 1;

3. Missing Indexes

-- BAD: No index on frequently queried column
SELECT * FROM users WHERE email = 'user@example.com';
-- Seq Scan on users

-- GOOD: Add index
CREATE INDEX idx_users_email ON users(email);
-- Index Scan using idx_users_email

4. Long Transactions

// BAD: Long transaction holding locks
await db.query('BEGIN');
const user = await db.query('SELECT * FROM users WHERE id = 1 FOR UPDATE');
await sendEmail(user.email); // External API call (slow!) while the row lock is held
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');
await db.query('COMMIT');

// GOOD: Keep transactions short (do slow external work outside the transaction)
const user = await db.query('SELECT * FROM users WHERE id = 1');
await sendEmail(user.email); // Outside any transaction
await db.query('UPDATE users SET last_email_sent = NOW() WHERE id = 1');