---
description: Detect and resolve database deadlocks with automated monitoring
shortcut: deadlock
---
# Database Deadlock Detector
Detect, analyze, and prevent database deadlocks with automated monitoring, alerting, and resolution strategies for production database systems.
## When to Use This Command
Use `/deadlock` when you need to:
- Investigate recurring deadlock issues in production
- Implement proactive deadlock detection and alerting
- Analyze transaction patterns causing deadlocks
- Optimize lock acquisition order in applications
- Monitor database lock contention in real-time
- Generate deadlock reports for performance tuning
DON'T use this when:
- Database doesn't support deadlock detection (use lock monitoring instead)
- Dealing with application-level race conditions (not database deadlocks)
- Looking for slow queries (use query analyzer instead)
- Investigating connection pool exhaustion (use connection pooler)
## Design Decisions
This command implements **comprehensive deadlock detection and prevention** because:
- Proactive monitoring prevents production incidents
- Automated analysis identifies root causes faster
- Prevention strategies (consistent lock ordering, shorter transactions) eliminate most recurring deadlocks
- Real-time alerting enables rapid incident response
- Historical analysis reveals patterns and trends
**Alternative considered: Reactive deadlock handling**
- Only responds after deadlocks occur
- Relies on application retry logic
- No visibility into deadlock patterns
- Recommended only for low-traffic systems
**Alternative considered: Database-native logging only**
- Limited to log file analysis
- No automated alerting or resolution
- Requires manual correlation of events
- Recommended only for development environments
## Prerequisites
Before running this command:
1. Database user with monitoring permissions (e.g., `pg_monitor` role)
2. Access to database logs or system views
3. Understanding of your application's transaction patterns
4. Monitoring infrastructure (Prometheus/Grafana recommended)
5. Python 3.8+ or Node.js 16+ for monitoring scripts
## Implementation Process
### Step 1: Configure Database Deadlock Logging
Enable comprehensive deadlock detection and logging in your database.
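For PostgreSQL, the settings shown in Example 1 can also be applied without hand-editing `postgresql.conf`. The sketch below is one way to do it from Python via `ALTER SYSTEM`; it assumes a superuser connection, the `psycopg2` driver, and a placeholder DSN. Both settings are reloadable, so no restart is needed.
```python
# apply_deadlock_logging.py -- illustrative helper, not one of the generated output files
import psycopg2

SETTINGS = {
    "log_lock_waits": "on",   # log any lock wait that exceeds deadlock_timeout
    "deadlock_timeout": "1s", # how long to wait on a lock before running deadlock detection
}

def apply_settings(dsn: str) -> None:
    """Apply deadlock-logging settings via ALTER SYSTEM, then reload the configuration."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            for name, value in SETTINGS.items():
                cur.execute(f"ALTER SYSTEM SET {name} = %s", (value,))
            cur.execute("SELECT pg_reload_conf()")  # pick up the new values without a restart
    finally:
        conn.close()

if __name__ == "__main__":
    apply_settings("postgresql://postgres:password@localhost:5432/mydb")  # placeholder DSN
```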
### Step 2: Implement Deadlock Monitoring
Set up automated monitoring to detect and alert on deadlocks in real-time.
### Step 3: Analyze Deadlock Patterns
Build analysis tools to identify common deadlock scenarios and root causes.
### Step 4: Implement Prevention Strategies
Apply code changes and database tuning to prevent deadlocks proactively.
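The single most effective application-side change is acquiring row locks in a globally consistent order. A minimal sketch, assuming a hypothetical `accounts` table and a psycopg2 connection:
```python
# ordered_locking.py -- illustrative sketch; table and column names are hypothetical
import psycopg2

def transfer(conn, from_id: int, to_id: int, amount: int) -> None:
    """Move funds between two accounts, locking rows in a consistent (ascending id) order.

    Because both concurrent transfers between the same pair of accounts lock the
    lower id first, they queue behind each other instead of deadlocking.
    """
    first, second = sorted((from_id, to_id))
    with conn:  # commit on success, rollback on exception
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id FROM accounts WHERE id IN (%s, %s) ORDER BY id FOR UPDATE",
                (first, second),
            )
            cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, from_id))
            cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, to_id))
```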
### Step 5: Set Up Continuous Monitoring
Deploy dashboards and alerting for ongoing deadlock visibility.
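One way to feed the Grafana dashboard and Prometheus rules listed under Output Format is to have the monitoring script export metrics with `prometheus_client`. This is a sketch only; it assumes the `PostgreSQLDeadlockDetector` class from Example 1 below, and the metric names and port are illustrative rather than part of the generated configuration.
```python
# deadlock_metrics.py -- illustrative Prometheus exporter for the detector in Example 1
import time
from prometheus_client import Counter, Gauge, start_http_server

DEADLOCKS_TOTAL = Counter("db_deadlocks_detected_total", "Deadlock/lock-wait chains detected")
BLOCKED_SESSIONS = Gauge("db_blocked_sessions", "Sessions currently waiting on a lock")

def run(detector, port: int = 9187, interval: int = 5) -> None:
    """Expose deadlock metrics on /metrics for Prometheus to scrape."""
    start_http_server(port)
    while True:
        events = detector.detect_deadlocks()
        DEADLOCKS_TOTAL.inc(len(events))
        BLOCKED_SESSIONS.set(len(events))
        time.sleep(interval)
```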
## Output Format
The command generates:
- `monitoring/deadlock-detector.py` - Real-time deadlock monitoring script
- `analysis/deadlock-analyzer.sql` - SQL queries for pattern analysis
- `config/deadlock-prevention.md` - Prevention strategies documentation
- `dashboards/deadlock-dashboard.json` - Grafana dashboard configuration
- `alerts/deadlock-rules.yml` - Prometheus alerting rules
## Code Examples
### Example 1: PostgreSQL Deadlock Detection and Monitoring
```sql
-- Enable comprehensive deadlock logging.
-- Add these settings to postgresql.conf (or apply via ALTER SYSTEM), then reload:
--   log_lock_waits = on
--   deadlock_timeout = '1s'
--   log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '

-- Create a monitoring view over sessions currently waiting on locks
CREATE OR REPLACE VIEW deadlock_monitor AS
SELECT
    l.locktype,
    l.relation::regclass AS table_name,
    l.mode,
    l.granted,
    l.pid AS blocked_pid,
    l.page,
    l.tuple,
    a.usename,
    a.application_name,
    a.client_addr,
    a.query AS blocked_query,
    a.state,
    a.wait_event_type,
    a.wait_event,
    NOW() - a.query_start AS query_duration,
    NOW() - a.state_change AS state_duration
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted
ORDER BY a.query_start;

-- Query to identify blocking vs blocked processes
CREATE OR REPLACE FUNCTION show_deadlock_chains()
RETURNS TABLE (
    blocked_pid integer,
    blocked_query text,
    blocking_pid integer,
    blocking_query text,
    duration interval
) AS $$
    SELECT
        blocked.pid AS blocked_pid,
        blocked.query AS blocked_query,
        blocking.pid AS blocking_pid,
        blocking.query AS blocking_query,
        NOW() - blocked.query_start AS duration
    FROM pg_stat_activity blocked
    JOIN pg_locks blocked_locks ON blocked.pid = blocked_locks.pid
    JOIN pg_locks blocking_locks ON
        blocked_locks.locktype = blocking_locks.locktype
        AND blocked_locks.relation IS NOT DISTINCT FROM blocking_locks.relation
        AND blocked_locks.page IS NOT DISTINCT FROM blocking_locks.page
        AND blocked_locks.tuple IS NOT DISTINCT FROM blocking_locks.tuple
        AND blocked_locks.pid != blocking_locks.pid
    JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
    WHERE NOT blocked_locks.granted
      AND blocking_locks.granted
      AND blocked.pid != blocking.pid;
$$ LANGUAGE SQL;

-- Historical deadlock analysis
CREATE TABLE deadlock_history (
    id SERIAL PRIMARY KEY,
    detected_at TIMESTAMP DEFAULT NOW(),
    victim_pid INTEGER,
    victim_query TEXT,
    blocker_pid INTEGER,
    blocker_query TEXT,
    lock_type TEXT,
    table_name TEXT,
    resolution_time_ms INTEGER,
    metadata JSONB
);

-- Snapshot the current blocking chain into deadlock_history.
-- PostgreSQL has no trigger event for deadlocks, so call this function periodically
-- (e.g., from the monitoring script or a pg_cron job) rather than attaching it to a trigger.
CREATE OR REPLACE FUNCTION log_deadlock_event()
RETURNS void AS $$
BEGIN
    INSERT INTO deadlock_history (
        victim_pid, victim_query, blocker_pid, blocker_query,
        lock_type, table_name, metadata
    )
    SELECT
        blocked_pid,
        blocked_query,
        blocking_pid,
        blocking_query,
        'deadlock',
        'detected_from_logs',
        jsonb_build_object(
            'detection_method', 'periodic_snapshot',
            'timestamp', NOW()
        )
    FROM show_deadlock_chains()
    LIMIT 1;
END;
$$ LANGUAGE plpgsql;
```
```python
# monitoring/deadlock-detector.py
import psycopg2
import time
import logging
import json
from datetime import datetime
from typing import Any, List, Dict, Optional
from dataclasses import dataclass, asdict
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class DeadlockEvent:
    """Represents a detected deadlock event."""
    detected_at: datetime
    blocked_pid: int
    blocked_query: str
    blocking_pid: int
    blocking_query: str
    lock_type: str
    table_name: Optional[str]
    duration_seconds: float

    def to_dict(self) -> dict:
        return {
            **asdict(self),
            'detected_at': self.detected_at.isoformat()
        }


class PostgreSQLDeadlockDetector:
    """Real-time PostgreSQL deadlock detection and alerting."""

    def __init__(
        self,
        connection_string: str,
        check_interval: int = 5,
        alert_threshold: int = 3,
        alert_webhook: Optional[str] = None
    ):
        self.connection_string = connection_string
        self.check_interval = check_interval
        self.alert_threshold = alert_threshold
        self.alert_webhook = alert_webhook
        self.deadlock_count = defaultdict(int)
        self.last_alert_time = {}

    def connect(self) -> psycopg2.extensions.connection:
        """Establish database connection with monitoring role."""
        return psycopg2.connect(self.connection_string)

    def detect_deadlocks(self) -> List[DeadlockEvent]:
        """Detect blocking chains (potential deadlocks) using pg_locks and pg_stat_activity."""
        query = """
            SELECT
                blocked.pid AS blocked_pid,
                blocked.query AS blocked_query,
                blocking.pid AS blocking_pid,
                blocking.query AS blocking_query,
                blocked_locks.locktype AS lock_type,
                blocked_locks.relation::regclass::text AS table_name,
                EXTRACT(EPOCH FROM (NOW() - blocked.query_start)) AS duration_seconds
            FROM pg_stat_activity blocked
            JOIN pg_locks blocked_locks ON blocked.pid = blocked_locks.pid
            JOIN pg_locks blocking_locks ON
                blocked_locks.locktype = blocking_locks.locktype
                AND blocked_locks.relation IS NOT DISTINCT FROM blocking_locks.relation
                AND blocked_locks.page IS NOT DISTINCT FROM blocking_locks.page
                AND blocked_locks.tuple IS NOT DISTINCT FROM blocking_locks.tuple
                AND blocked_locks.pid != blocking_locks.pid
            JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
            WHERE NOT blocked_locks.granted
              AND blocking_locks.granted
              AND blocked.pid != blocking.pid
              AND blocked.state = 'active'
            ORDER BY duration_seconds DESC;
        """
        conn = self.connect()
        try:
            with conn.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()

            events = []
            for row in rows:
                event = DeadlockEvent(
                    detected_at=datetime.now(),
                    blocked_pid=row[0],
                    blocked_query=row[1][:500],  # Truncate long queries
                    blocking_pid=row[2],
                    blocking_query=row[3][:500],
                    lock_type=row[4],
                    table_name=row[5],
                    duration_seconds=float(row[6])
                )
                events.append(event)
            return events
        finally:
            conn.close()

    def analyze_deadlock_pattern(self, events: List[DeadlockEvent]) -> Dict[str, Any]:
        """Analyze deadlock patterns to identify root causes."""
        if not events:
            return {}

        # Group by table name
        tables = defaultdict(int)
        lock_types = defaultdict(int)
        query_patterns = defaultdict(int)

        for event in events:
            if event.table_name:
                tables[event.table_name] += 1
            lock_types[event.lock_type] += 1

            # Extract query type (SELECT, UPDATE, DELETE, INSERT)
            query_type = event.blocked_query.strip().split()[0].upper()
            query_patterns[query_type] += 1

        return {
            'total_deadlocks': len(events),
            'most_common_table': max(tables.items(), key=lambda x: x[1])[0] if tables else None,
            'most_common_lock_type': max(lock_types.items(), key=lambda x: x[1])[0] if lock_types else None,
            'query_type_distribution': dict(query_patterns),
            'average_duration': sum(e.duration_seconds for e in events) / len(events),
            'max_duration': max(e.duration_seconds for e in events)
        }

    def suggest_prevention_strategy(self, analysis: Dict[str, Any]) -> List[str]:
        """Generate prevention recommendations based on analysis."""
        suggestions = []

        if analysis.get('most_common_table'):
            table = analysis['most_common_table']
            suggestions.append(
                f"Consider reviewing lock acquisition order for table '{table}'. "
                f"Ensure all transactions lock this table in consistent order."
            )

        if analysis.get('query_type_distribution', {}).get('UPDATE', 0) > 0:
            suggestions.append(
                "UPDATE queries detected in deadlocks. Use SELECT ... FOR UPDATE "
                "with consistent ordering to prevent UPDATE deadlocks."
            )

        if analysis.get('average_duration', 0) > 10:
            suggestions.append(
                f"Average deadlock duration is {analysis['average_duration']:.2f}s. "
                "Consider reducing transaction scope or implementing application-level "
                "retry logic with exponential backoff."
            )

        lock_type = analysis.get('most_common_lock_type')
        if lock_type == 'relation':
            suggestions.append(
                "Table-level locks detected. Consider using row-level locking "
                "or implementing optimistic locking patterns."
            )

        return suggestions

    def alert_on_deadlock(self, events: List[DeadlockEvent], analysis: Dict[str, Any]):
        """Send alerts when deadlock threshold is exceeded."""
        if len(events) >= self.alert_threshold:
            logger.warning(
                f"DEADLOCK ALERT: {len(events)} deadlocks detected. "
                f"Analysis: {json.dumps(analysis, indent=2)}"
            )

            # Send webhook alert if configured
            if self.alert_webhook:
                import requests  # Imported lazily; only needed when a webhook is configured

                payload = {
                    'text': f'🚨 Deadlock Alert: {len(events)} deadlocks detected',
                    'events': [e.to_dict() for e in events],
                    'analysis': analysis,
                    'suggestions': self.suggest_prevention_strategy(analysis)
                }
                try:
                    requests.post(self.alert_webhook, json=payload, timeout=5)
                except Exception as e:
                    logger.error(f"Failed to send webhook alert: {e}")

    def run_continuous_monitoring(self):
        """Run continuous deadlock monitoring loop."""
        logger.info(f"Starting deadlock monitoring (check interval: {self.check_interval}s)")

        while True:
            try:
                events = self.detect_deadlocks()

                if events:
                    logger.info(f"Detected {len(events)} potential deadlocks")
                    analysis = self.analyze_deadlock_pattern(events)

                    # Log detailed information
                    for event in events:
                        logger.warning(
                            f"Deadlock: PID {event.blocked_pid} blocked by {event.blocking_pid} "
                            f"on {event.table_name} ({event.lock_type}) for {event.duration_seconds:.2f}s"
                        )

                    # Print suggestions
                    suggestions = self.suggest_prevention_strategy(analysis)
                    if suggestions:
                        logger.info("Prevention strategies:")
                        for suggestion in suggestions:
                            logger.info(f"  - {suggestion}")

                    self.alert_on_deadlock(events, analysis)

                time.sleep(self.check_interval)

            except KeyboardInterrupt:
                logger.info("Monitoring stopped by user")
                break
            except Exception as e:
                logger.error(f"Error in monitoring loop: {e}")
                time.sleep(self.check_interval)


# Usage example
if __name__ == "__main__":
    detector = PostgreSQLDeadlockDetector(
        connection_string="postgresql://monitor_user:password@localhost:5432/mydb",
        check_interval=5,
        alert_threshold=3,
        alert_webhook="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    )
    detector.run_continuous_monitoring()
```
### Example 2: MySQL Deadlock Detection and InnoDB Monitoring
```sql
-- Enable InnoDB deadlock logging.
-- Add to my.cnf under [mysqld]:
--   innodb_print_all_deadlocks = 1
--   innodb_deadlock_detect = ON
--   innodb_lock_wait_timeout = 50

-- Create deadlock monitoring table
CREATE TABLE deadlock_log (
    id INT AUTO_INCREMENT PRIMARY KEY,
    detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    victim_thread_id BIGINT,
    victim_query TEXT,
    waiting_query TEXT,
    lock_mode VARCHAR(50),
    table_name VARCHAR(255),
    index_name VARCHAR(255),
    deadlock_info TEXT,
    INDEX idx_detected_at (detected_at)
) ENGINE=InnoDB;

-- View current lock waits and blocking sessions.
-- Note: the information_schema lock tables below exist in MySQL 5.7;
-- on MySQL 8.0 use performance_schema.data_lock_waits and data_locks instead.
SELECT
    r.trx_id AS waiting_trx_id,
    r.trx_mysql_thread_id AS waiting_thread,
    r.trx_query AS waiting_query,
    b.trx_id AS blocking_trx_id,
    b.trx_mysql_thread_id AS blocking_thread,
    b.trx_query AS blocking_query,
    l.lock_mode,
    l.lock_type,
    l.lock_table,
    l.lock_index,
    TIMESTAMPDIFF(SECOND, r.trx_started, NOW()) AS wait_time_seconds
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx r ON w.requesting_trx_id = r.trx_id
JOIN information_schema.innodb_trx b ON w.blocking_trx_id = b.trx_id
JOIN information_schema.innodb_locks l ON w.requesting_lock_id = l.lock_id
ORDER BY wait_time_seconds DESC;

-- Analyze deadlock frequency by table over the last 24 hours
SELECT
    table_name,
    COUNT(*) AS deadlock_count,
    MAX(detected_at) AS last_deadlock,
    AVG(TIMESTAMPDIFF(SECOND, detected_at, NOW())) AS avg_age_seconds
FROM deadlock_log
WHERE detected_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY table_name
ORDER BY deadlock_count DESC;
```
```javascript
// monitoring/mysql-deadlock-detector.js
const mysql = require('mysql2/promise');
const fs = require('fs').promises;

class MySQLDeadlockDetector {
  constructor(config) {
    this.config = config;
    this.pool = mysql.createPool({
      host: config.host,
      user: config.user,
      password: config.password,
      database: config.database,
      waitForConnections: true,
      connectionLimit: 10,
      queueLimit: 0
    });
    this.checkInterval = config.checkInterval || 10000;
    this.deadlockStats = {
      total: 0,
      byTable: {},
      byHour: {}
    };
  }

  async detectCurrentLockWaits() {
    // Note: these information_schema tables exist in MySQL 5.7; on MySQL 8.0
    // use performance_schema.data_lock_waits and data_locks instead.
    const query = `
      SELECT
        r.trx_id AS waiting_trx_id,
        r.trx_mysql_thread_id AS waiting_thread,
        r.trx_query AS waiting_query,
        b.trx_id AS blocking_trx_id,
        b.trx_mysql_thread_id AS blocking_thread,
        b.trx_query AS blocking_query,
        l.lock_mode,
        l.lock_type,
        l.lock_table,
        l.lock_index,
        TIMESTAMPDIFF(SECOND, r.trx_started, NOW()) AS wait_time_seconds
      FROM information_schema.innodb_lock_waits w
      JOIN information_schema.innodb_trx r ON w.requesting_trx_id = r.trx_id
      JOIN information_schema.innodb_trx b ON w.blocking_trx_id = b.trx_id
      JOIN information_schema.innodb_locks l ON w.requesting_lock_id = l.lock_id
      WHERE TIMESTAMPDIFF(SECOND, r.trx_started, NOW()) > 5
      ORDER BY wait_time_seconds DESC
    `;
    const [rows] = await this.pool.query(query);
    return rows;
  }

  async parseInnoDBStatus() {
    const [rows] = await this.pool.query('SHOW ENGINE INNODB STATUS');
    const status = rows[0].Status;

    // Extract deadlock information.
    // Note: InnoDB status always reports the most recent deadlock, so the same
    // event may appear on successive checks; dedupe downstream if needed.
    const deadlockRegex = /LATEST DETECTED DEADLOCK[\s\S]*?(?=TRANSACTIONS|$)/;
    const match = status.match(deadlockRegex);

    if (match) {
      const deadlockInfo = match[0];
      const timestamp = new Date();

      // Parse transaction details
      const transactions = this.extractTransactionDetails(deadlockInfo);

      return {
        timestamp,
        deadlockInfo,
        transactions
      };
    }

    return null;
  }

  extractTransactionDetails(deadlockInfo) {
    // Extract table names involved
    const tableRegex = /table `([^`]+)`\.`([^`]+)`/g;
    const tables = [];
    let match;

    while ((match = tableRegex.exec(deadlockInfo)) !== null) {
      tables.push(`${match[1]}.${match[2]}`);
    }

    // Extract lock modes
    const lockRegex = /lock mode (\w+)/g;
    const lockModes = [];

    while ((match = lockRegex.exec(deadlockInfo)) !== null) {
      lockModes.push(match[1]);
    }

    return {
      tables: [...new Set(tables)],
      lockModes: [...new Set(lockModes)]
    };
  }

  async logDeadlock(deadlockEvent) {
    const query = `
      INSERT INTO deadlock_log (
        victim_thread_id,
        victim_query,
        waiting_query,
        lock_mode,
        table_name,
        deadlock_info
      ) VALUES (?, ?, ?, ?, ?, ?)
    `;

    const tables = deadlockEvent.transactions.tables.join(', ');
    const lockModes = deadlockEvent.transactions.lockModes.join(', ');

    await this.pool.query(query, [
      null,
      'extracted_from_innodb_status',
      'extracted_from_innodb_status',
      lockModes,
      tables,
      deadlockEvent.deadlockInfo
    ]);

    this.deadlockStats.total++;

    // Update per-table stats
    deadlockEvent.transactions.tables.forEach(table => {
      this.deadlockStats.byTable[table] =
        (this.deadlockStats.byTable[table] || 0) + 1;
    });
  }

  generatePreventionAdvice(lockWaits) {
    const advice = [];

    // Analyze lock wait patterns
    const tableFrequency = {};
    lockWaits.forEach(wait => {
      const table = wait.lock_table;
      tableFrequency[table] = (tableFrequency[table] || 0) + 1;
    });

    // Find most problematic table
    const sortedTables = Object.entries(tableFrequency)
      .sort((a, b) => b[1] - a[1]);

    if (sortedTables.length > 0) {
      const [mostProblematicTable, count] = sortedTables[0];
      advice.push({
        severity: 'high',
        table: mostProblematicTable,
        suggestion: `Table ${mostProblematicTable} has ${count} lock waits. ` +
          `Consider: 1) Reducing transaction scope, 2) Adding appropriate indexes, ` +
          `3) Implementing consistent lock ordering.`
      });
    }

    // Check for long-running transactions
    const longRunning = lockWaits.filter(w => w.wait_time_seconds > 30);
    if (longRunning.length > 0) {
      advice.push({
        severity: 'medium',
        suggestion: `${longRunning.length} transactions waiting > 30s. ` +
          `Review transaction isolation levels and consider READ COMMITTED ` +
          `instead of REPEATABLE READ for reduced lock contention.`
      });
    }

    return advice;
  }

  async startMonitoring() {
    console.log('Starting MySQL deadlock monitoring...');

    setInterval(async () => {
      try {
        // Check for current lock waits
        const lockWaits = await this.detectCurrentLockWaits();

        if (lockWaits.length > 0) {
          console.warn(`⚠️ ${lockWaits.length} lock waits detected:`);
          lockWaits.forEach(wait => {
            console.warn(
              `  Thread ${wait.waiting_thread} waiting on thread ${wait.blocking_thread} ` +
              `for ${wait.wait_time_seconds}s on ${wait.lock_table}`
            );
          });

          const advice = this.generatePreventionAdvice(lockWaits);
          if (advice.length > 0) {
            console.log('\n💡 Prevention advice:');
            advice.forEach(item => {
              console.log(`  [${item.severity}] ${item.suggestion}`);
            });
          }
        }

        // Check InnoDB status for recent deadlocks
        const deadlock = await this.parseInnoDBStatus();
        if (deadlock) {
          console.error('🚨 DEADLOCK DETECTED:');
          console.error(`  Tables: ${deadlock.transactions.tables.join(', ')}`);
          console.error(`  Lock modes: ${deadlock.transactions.lockModes.join(', ')}`);
          await this.logDeadlock(deadlock);
        }
      } catch (error) {
        console.error('Monitoring error:', error);
      }
    }, this.checkInterval);
  }

  async getStatistics() {
    const query = `
      SELECT
        DATE(detected_at) AS date,
        COUNT(*) AS deadlock_count,
        table_name,
        lock_mode
      FROM deadlock_log
      WHERE detected_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
      GROUP BY DATE(detected_at), table_name, lock_mode
      ORDER BY date DESC, deadlock_count DESC
    `;

    const [rows] = await this.pool.query(query);
    return {
      historical: rows,
      current: this.deadlockStats
    };
  }
}

// Usage
const detector = new MySQLDeadlockDetector({
  host: 'localhost',
  user: 'monitor_user',
  password: 'password',
  database: 'mydb',
  checkInterval: 10000
});

detector.startMonitoring();

// Export statistics every hour
setInterval(async () => {
  const stats = await detector.getStatistics();
  await fs.writeFile(
    'deadlock-stats.json',
    JSON.stringify(stats, null, 2)
  );
}, 3600000);
```
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| "Permission denied" | Insufficient database privileges | Grant `pg_monitor` role (PostgreSQL) or `PROCESS` privilege (MySQL) |
| "Connection timeout" | Network or authentication issues | Verify connection string and firewall rules |
| "No deadlocks detected" | Deadlocks resolved before detection | Reduce `deadlock_timeout` to 500ms for faster detection |
| "Table not found" | Missing monitoring tables | Run setup scripts to create required tables |
| "Log file not accessible" | Filesystem permissions | Ensure logging user has write access to log directory |
## Configuration Options
**Deadlock Detection**
- `deadlock_timeout`: Time a session waits on a lock before PostgreSQL runs deadlock detection; also the logging threshold used by `log_lock_waits` (default: 1s)
- `innodb_deadlock_detect`: Enable/disable InnoDB deadlock detection (MySQL)
- `innodb_print_all_deadlocks`: Log all deadlocks to error log (MySQL)
- `log_lock_waits`: Log queries waiting for locks (PostgreSQL)
**Monitoring Parameters**
- `check_interval`: Frequency of deadlock checks (5-10 seconds recommended)
- `alert_threshold`: Number of deadlocks before alerting (3-5 recommended)
- `retention_period`: How long to keep deadlock history (7-30 days)
## Best Practices
DO:
- Always acquire locks in consistent order across transactions
- Keep transactions as short as possible
- Use row-level locking instead of table-level when possible
- Implement retry logic with exponential backoff (see the retry sketch below)
- Monitor deadlock trends over time
- Set appropriate lock timeouts (`innodb_lock_wait_timeout` = 50s)
DON'T:
- Hold locks during expensive operations (network calls, file I/O)
- Mix DDL and DML in the same transaction
- Use SELECT ... FOR UPDATE without ORDER BY
- Ignore deadlock patterns (they indicate design issues)
- Set deadlock_timeout too high (delays detection)
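A minimal sketch of the recommended retry pattern, assuming PostgreSQL and psycopg2 (the driver raises `psycopg2.errors.DeadlockDetected` for SQLSTATE 40P01 when a transaction is chosen as the deadlock victim):
```python
# retry_on_deadlock.py -- illustrative sketch of retry with exponential backoff
import random
import time
from psycopg2 import errors

def run_with_retry(conn, work, max_attempts: int = 5):
    """Run `work(cursor)` in a transaction, retrying when it is killed as a deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            with conn:  # commit on success, rollback on exception
                with conn.cursor() as cur:
                    return work(cur)
        except errors.DeadlockDetected:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so retrying victims do not collide again
            time.sleep((2 ** attempt) * 0.05 + random.uniform(0, 0.05))
```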
## Performance Considerations
- Monitoring queries add minimal overhead (<0.1% CPU typically)
- Use connection pooling to reduce monitoring overhead
- Index `deadlock_history` table on `detected_at` for fast queries
- Archive old deadlock logs to a separate table monthly (see the archival sketch below)
- Consider read replicas for monitoring queries in high-traffic systems
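A sketch of the monthly archival job mentioned above, assuming PostgreSQL and a hypothetical `deadlock_history_archive` table created with the same columns as `deadlock_history`:
```python
# archive_deadlock_history.py -- illustrative; deadlock_history_archive is an assumed table
import psycopg2

def archive_old_deadlocks(dsn: str, keep_days: int = 30) -> int:
    """Move deadlock_history rows older than keep_days into the archive table, atomically."""
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # single transaction: delete and insert together
            with conn.cursor() as cur:
                cur.execute(
                    """
                    WITH moved AS (
                        DELETE FROM deadlock_history
                        WHERE detected_at < NOW() - make_interval(days => %s)
                        RETURNING *
                    )
                    INSERT INTO deadlock_history_archive SELECT * FROM moved
                    """,
                    (keep_days,),
                )
                return cur.rowcount  # number of rows archived
    finally:
        conn.close()
```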
## Related Commands
- `/sql-query-optimizer` - Optimize queries to reduce lock duration
- `/database-index-advisor` - Add indexes to minimize table scans
- `/database-transaction-monitor` - Monitor transaction patterns
- `/database-connection-pooler` - Optimize connection management
- `/database-health-monitor` - Overall database health monitoring
## Version History
- v1.0.0 (2024-10): Initial implementation with PostgreSQL and MySQL support
- Planned v1.1.0: Add Microsoft SQL Server and Oracle deadlock detection