---
description: Monitor database transactions with real-time alerting for performance and lock issues
shortcut: txn-monitor
---

# Database Transaction Monitor

Monitor database transaction performance, detect long-running transactions, identify lock contention, track rollback rates, and automatically alert on transaction anomalies for production database health.

## When to Use This Command

Use `/txn-monitor` when you need to:
- Detect and kill long-running transactions blocking other queries
- Monitor lock wait times and identify deadlock patterns
- Track transaction rollback rates for error analysis
- Alert on isolation level anomalies (phantom reads, dirty reads)
- Analyze transaction throughput and latency trends
- Investigate application connection leak issues

DON'T use this when:
- Database has minimal transaction load (<100 TPS)
- All transactions complete within milliseconds
- Looking for query optimization (use the query optimizer instead)
- Investigating data corruption (use the audit logger instead)

## Design Decisions

This command implements **real-time transaction monitoring with automated alerting** because:
- Long-running transactions (>30s) block other queries and cause performance degradation
- Lock contention detection prevents cascade failures
- Rollback rate monitoring identifies application bugs early
- Automatic alerts reduce MTTR (Mean Time To Resolution)
- Historical trend analysis enables capacity planning

**Alternative considered: Periodic manual checks**
- No automated alerting on issues
- Relies on humans checking dashboards
- Slower incident response
- Recommended only for development environments

**Alternative considered: Database log parsing**
- Post-mortem analysis only
- No real-time alerts
- Requires custom log parsing logic
- Recommended for compliance/audit purposes

## Prerequisites

Before running this command:
1. Database monitoring permissions (pg_monitor role or PROCESS privilege); see the verification sketch after this list
2. Access to pg_stat_activity (PostgreSQL) or performance_schema (MySQL)
3. Alerting infrastructure (Slack, PagerDuty, email)
4. Monitoring data retention strategy (metrics database or time-series DB)
5. Runbook for common transaction issues
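
The monitoring-role requirement in item 1 is easy to get wrong. Below is a minimal sketch, assuming psycopg2 and a hypothetical `monitor_user` DSN, that checks whether pg_stat_activity is fully visible: without pg_monitor (or explicit grants), other sessions' query text shows up as `<insufficient privilege>`.

```python
# check_monitoring_access.py -- hypothetical helper, not part of the generated files
import psycopg2

def verify_monitoring_access(conn_string: str) -> None:
    """Fail fast if the monitoring role cannot see other sessions' activity."""
    conn = psycopg2.connect(conn_string)
    try:
        with conn.cursor() as cur:
            # Without pg_monitor, query text for other users' backends is hidden
            # and appears as the literal string '<insufficient privilege>'.
            cur.execute("""
                SELECT count(*) FILTER (WHERE query = '<insufficient privilege>')
                FROM pg_stat_activity
            """)
            hidden = cur.fetchone()[0]
            if hidden > 0:
                raise PermissionError(
                    f"{hidden} sessions are hidden; grant the pg_monitor role to this user"
                )
            print("Monitoring access OK")
    finally:
        conn.close()

if __name__ == "__main__":
    verify_monitoring_access("postgresql://monitor_user:password@localhost:5432/mydb")
```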

## Implementation Process

### Step 1: Enable Transaction Monitoring
Configure the database to track transaction statistics.
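
As a quick sanity check for this step, the sketch below (an assumption, not part of the generated output) confirms that PostgreSQL's statistics settings are enabled; `track_activities` and `track_counts` are on by default but are sometimes disabled on heavily tuned systems.

```python
# verify_stats_settings.py -- hypothetical check, assumes psycopg2 is installed
import psycopg2

REQUIRED_SETTINGS = {"track_activities": "on", "track_counts": "on"}

def verify_stats_settings(conn_string: str) -> None:
    """Warn if statistics collection needed by the monitor is disabled."""
    conn = psycopg2.connect(conn_string)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT name, setting FROM pg_settings WHERE name = ANY(%s)",
                (list(REQUIRED_SETTINGS),),
            )
            for name, setting in cur.fetchall():
                expected = REQUIRED_SETTINGS[name]
                status = "OK" if setting == expected else f"expected '{expected}'"
                print(f"{name} = {setting} ({status})")
    finally:
        conn.close()

if __name__ == "__main__":
    verify_stats_settings("postgresql://monitor_user:password@localhost:5432/mydb")
```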

### Step 2: Build Real-Time Monitor
Create a monitoring script that polls transaction statistics every 5-10 seconds (see Example 1 below).

### Step 3: Define Alert Thresholds
Set thresholds for long-running transactions, lock waits, and rollback rates.
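
One way to keep these thresholds in a single place is a small configuration object. The sketch below is illustrative rather than part of the generated files; the values are the defaults suggested later in the Configuration Options section.

```python
# config/thresholds.py -- illustrative sketch; module name and structure are assumptions
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertThresholds:
    """Thresholds consumed by the monitoring loop and alerting rules."""
    long_transaction_seconds: int = 30       # warn above this duration
    auto_kill_seconds: int = 300             # terminate OLTP transactions above this
    idle_in_transaction_seconds: int = 600   # terminate idle-in-transaction sessions
    rollback_rate_warning_percent: float = 5.0
    rollback_rate_critical_percent: float = 10.0
    blocked_transactions_warning: int = 10
    blocked_transactions_critical: int = 50

DEFAULT_THRESHOLDS = AlertThresholds()
```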

### Step 4: Implement Automated Actions
Auto-kill transactions exceeding thresholds or alert operators.
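
A common pattern is to separate the decision from the action so the policy can be unit-tested without a live database. The helper below is a hedged sketch that leans on the hypothetical `AlertThresholds` from Step 3; it is not one of the generated files.

```python
# monitoring/actions.py -- illustrative policy sketch, not generated output
def decide_action(duration_seconds: float, state: str, thresholds) -> str:
    """Return 'ignore', 'alert', or 'kill' for one transaction.

    `thresholds` is assumed to expose the fields shown in the Step 3 sketch.
    """
    if state == 'idle in transaction':
        if duration_seconds > thresholds.idle_in_transaction_seconds:
            return 'kill'      # idle sessions holding locks are safe to terminate
        return 'alert' if duration_seconds > 60 else 'ignore'

    if duration_seconds > thresholds.auto_kill_seconds:
        return 'kill'          # long-running OLTP work past the hard limit
    if duration_seconds > thresholds.long_transaction_seconds:
        return 'alert'         # warn first, give operators a chance to act
    return 'ignore'
```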

### Step 5: Create Dashboards
Build Grafana dashboards for transaction metrics visualization.
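
Grafana needs a data source to chart. If you scrape with Prometheus (as the alerting rules listed under Output Format suggest), a small exporter like the sketch below can publish the monitor's statistics. It assumes the `prometheus_client` package and the `get_transaction_stats()` method from Example 1; the port and metric names are placeholders.

```python
# monitoring/exporter.py -- hedged sketch; metric names and port are assumptions
import time
from prometheus_client import Gauge, start_http_server

ACTIVE_CONNECTIONS = Gauge('pg_txn_active_connections', 'Active (non-idle) connections')
IDLE_IN_TXN = Gauge('pg_txn_idle_in_transaction', 'Sessions idle in transaction')
ROLLBACK_RATE = Gauge('pg_txn_rollback_rate_percent', 'Rollback rate across databases')

def export_forever(monitor, port: int = 9188, interval: int = 15) -> None:
    """Expose transaction stats on /metrics for Prometheus to scrape."""
    start_http_server(port)
    while True:
        stats = monitor.get_transaction_stats()  # from Example 1's monitor class
        ACTIVE_CONNECTIONS.set(stats['active_connections'])
        IDLE_IN_TXN.set(stats['idle_in_transaction'])
        ROLLBACK_RATE.set(stats['rollback_rate_percent'])
        time.sleep(interval)
```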

## Output Format

The command generates:
- `monitoring/transaction_monitor.py` - Real-time transaction monitoring daemon
- `queries/transaction_analysis.sql` - Transaction health diagnostic queries
- `alerts/transaction_alerts.yml` - Prometheus alerting rules
- `dashboards/transaction_dashboard.json` - Grafana dashboard configuration
- `docs/transaction_runbook.md` - Incident response procedures

## Code Examples

### Example 1: PostgreSQL Real-Time Transaction Monitor

```python
# monitoring/postgres_transaction_monitor.py
import psycopg2
from psycopg2.extras import DictCursor
import time
import logging
from typing import Any, Dict, List, Optional
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class TransactionInfo:
    """Represents an active transaction."""
    pid: int
    username: str
    database: str
    application_name: str
    client_addr: str
    state: str
    query: str
    transaction_start: datetime
    query_start: datetime
    wait_event: Optional[str]
    blocking_pids: List[int]

    def duration_seconds(self) -> float:
        # pg_stat_activity timestamps are timezone-aware, so compare against an aware "now"
        return (datetime.now(timezone.utc) - self.transaction_start).total_seconds()

    def to_dict(self) -> dict:
        result = asdict(self)
        result['transaction_start'] = self.transaction_start.isoformat()
        result['query_start'] = self.query_start.isoformat()
        result['duration_seconds'] = self.duration_seconds()
        return result


class PostgreSQLTransactionMonitor:
    """Monitor PostgreSQL transactions in real-time."""

    def __init__(
        self,
        connection_string: str,
        long_transaction_threshold: int = 30,
        check_interval: int = 10
    ):
        self.conn_string = connection_string
        self.long_transaction_threshold = long_transaction_threshold
        self.check_interval = check_interval
        self.stats = {
            'total_transactions': 0,
            'long_running_count': 0,
            'blocked_count': 0,
            'idle_in_transaction_count': 0
        }

    def connect(self):
        return psycopg2.connect(self.conn_string, cursor_factory=DictCursor)

    def get_active_transactions(self) -> List[TransactionInfo]:
        """Fetch all active transactions with blocking information."""
        query = """
            SELECT
                a.pid,
                a.usename,
                a.datname AS database,
                a.application_name,
                a.client_addr::text,
                a.state,
                a.query,
                a.xact_start AS transaction_start,
                a.query_start,
                a.wait_event,
                array_agg(b.pid) FILTER (WHERE b.pid IS NOT NULL) AS blocking_pids
            FROM pg_stat_activity a
            LEFT JOIN pg_stat_activity b ON b.pid = ANY(pg_blocking_pids(a.pid))
            WHERE a.pid != pg_backend_pid()
              AND a.state != 'idle'
              AND a.xact_start IS NOT NULL
            GROUP BY a.pid, a.usename, a.datname, a.application_name,
                     a.client_addr, a.state, a.query, a.xact_start,
                     a.query_start, a.wait_event
            ORDER BY a.xact_start;
        """

        conn = self.connect()
        try:
            with conn.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()

                transactions = []
                for row in rows:
                    txn = TransactionInfo(
                        pid=row['pid'],
                        username=row['usename'],
                        database=row['database'],
                        application_name=row['application_name'] or 'unknown',
                        client_addr=row['client_addr'] or 'local',
                        state=row['state'],
                        query=row['query'][:200],  # Truncate long queries
                        transaction_start=row['transaction_start'],
                        query_start=row['query_start'],
                        wait_event=row['wait_event'],
                        blocking_pids=row['blocking_pids'] or []
                    )
                    transactions.append(txn)

                return transactions

        finally:
            conn.close()

    def find_long_running_transactions(
        self,
        transactions: List[TransactionInfo]
    ) -> List[TransactionInfo]:
        """Identify transactions exceeding the duration threshold."""
        return [
            txn for txn in transactions
            if txn.duration_seconds() > self.long_transaction_threshold
        ]

    def find_blocked_transactions(
        self,
        transactions: List[TransactionInfo]
    ) -> List[TransactionInfo]:
        """Identify transactions waiting on locks."""
        return [
            txn for txn in transactions
            if txn.blocking_pids
        ]

    def find_idle_in_transaction(
        self,
        transactions: List[TransactionInfo]
    ) -> List[TransactionInfo]:
        """Find idle transactions holding locks."""
        return [
            txn for txn in transactions
            if txn.state == 'idle in transaction'
            and txn.duration_seconds() > 60
        ]

    def kill_transaction(self, pid: int, reason: str) -> bool:
        """Terminate a transaction by PID."""
        conn = self.connect()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
                success = cur.fetchone()[0]

                if success:
                    logger.warning(f"Killed transaction PID {pid}: {reason}")
                else:
                    logger.error(f"Failed to kill transaction PID {pid}")

                return success

        finally:
            conn.close()

    def get_transaction_stats(self) -> Dict[str, Any]:
        """Get overall transaction statistics."""
        conn = self.connect()
        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT
                        (SELECT count(*) FROM pg_stat_activity WHERE state != 'idle') AS active_connections,
                        (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction') AS idle_in_txn,
                        (SELECT sum(xact_commit) FROM pg_stat_database) AS total_commits,
                        (SELECT sum(xact_rollback) FROM pg_stat_database) AS total_rollbacks,
                        (SELECT sum(conflicts) FROM pg_stat_database) AS conflicts
                """)

                row = cur.fetchone()

                total_txns = row['total_commits'] + row['total_rollbacks']
                rollback_rate = (row['total_rollbacks'] / total_txns * 100) if total_txns > 0 else 0

                return {
                    'active_connections': row['active_connections'],
                    'idle_in_transaction': row['idle_in_txn'],
                    'total_commits': row['total_commits'],
                    'total_rollbacks': row['total_rollbacks'],
                    'rollback_rate_percent': round(rollback_rate, 2),
                    'conflicts': row['conflicts']
                }

        finally:
            conn.close()

    def alert(self, severity: str, message: str, details: Optional[dict] = None):
        """Send alert to monitoring system."""
        log_func = {
            'critical': logger.critical,
            'warning': logger.warning,
            'info': logger.info
        }.get(severity, logger.info)

        log_func(f"[{severity.upper()}] {message}")

        if details:
            logger.info(f"Details: {details}")

        # Implement webhook/email/PagerDuty integration here
        # Example: requests.post(webhook_url, json={'message': message, 'details': details})

    def run_monitoring_loop(self):
        """Main monitoring loop."""
        logger.info(f"Starting transaction monitoring (interval: {self.check_interval}s)")

        while True:
            try:
                # Fetch active transactions
                transactions = self.get_active_transactions()
                self.stats['total_transactions'] = len(transactions)

                # Check for long-running transactions
                long_running = self.find_long_running_transactions(transactions)
                if long_running:
                    self.stats['long_running_count'] = len(long_running)

                    for txn in long_running:
                        self.alert(
                            'warning',
                            f"Long-running transaction detected: PID {txn.pid}",
                            {
                                'duration': txn.duration_seconds(),
                                'database': txn.database,
                                'username': txn.username,
                                'query': txn.query
                            }
                        )

                        # Auto-kill if it exceeds 5 minutes
                        if txn.duration_seconds() > 300:
                            self.kill_transaction(
                                txn.pid,
                                f"Exceeded 5 minute threshold ({txn.duration_seconds():.0f}s)"
                            )

                # Check for blocked transactions
                blocked = self.find_blocked_transactions(transactions)
                if blocked:
                    self.stats['blocked_count'] = len(blocked)

                    for txn in blocked:
                        self.alert(
                            'warning',
                            f"Blocked transaction: PID {txn.pid}",
                            {
                                'blocking_pids': txn.blocking_pids,
                                'wait_event': txn.wait_event,
                                'duration': txn.duration_seconds()
                            }
                        )

                # Check for idle in transaction
                idle_txns = self.find_idle_in_transaction(transactions)
                if idle_txns:
                    self.stats['idle_in_transaction_count'] = len(idle_txns)

                    for txn in idle_txns:
                        self.alert(
                            'warning',
                            f"Idle in transaction: PID {txn.pid}",
                            {
                                'duration': txn.duration_seconds(),
                                'application': txn.application_name
                            }
                        )

                        # Kill idle transactions after 10 minutes
                        if txn.duration_seconds() > 600:
                            self.kill_transaction(txn.pid, "Idle in transaction >10 minutes")

                # Get overall stats
                stats = self.get_transaction_stats()

                # Alert on high rollback rate
                if stats['rollback_rate_percent'] > 10:
                    self.alert(
                        'warning',
                        f"High transaction rollback rate: {stats['rollback_rate_percent']}%",
                        stats
                    )

                # Log periodic status
                logger.info(
                    f"Monitoring: {stats['active_connections']} active, "
                    f"{len(long_running)} long-running, "
                    f"{len(blocked)} blocked, "
                    f"{stats['rollback_rate_percent']}% rollback rate"
                )

                time.sleep(self.check_interval)

            except KeyboardInterrupt:
                logger.info("Monitoring stopped by user")
                break
            except Exception as e:
                logger.error(f"Monitoring error: {e}")
                time.sleep(self.check_interval)


# Usage
if __name__ == "__main__":
    monitor = PostgreSQLTransactionMonitor(
        connection_string="postgresql://monitor_user:password@localhost:5432/mydb",
        long_transaction_threshold=30,  # 30 seconds
        check_interval=10  # Check every 10 seconds
    )

    monitor.run_monitoring_loop()
```

### Example 2: Transaction Analysis Queries

```sql
-- PostgreSQL transaction health diagnostic queries

-- 1. Long-running transactions
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    NOW() - xact_start AS transaction_duration,
    NOW() - query_start AS query_duration,
    state,
    LEFT(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND state != 'idle'
  AND pid != pg_backend_pid()
ORDER BY xact_start;

-- 2. Blocking tree (which transactions are blocking others)
WITH RECURSIVE blocking_tree AS (
    -- Root blockers: sessions that block others but are not blocked themselves
    SELECT
        a.pid,
        a.usename,
        a.query AS blocked_query,
        NULL::integer AS blocking_pid,
        NULL::text AS blocking_query,
        1 AS level
    FROM pg_stat_activity a
    WHERE cardinality(pg_blocking_pids(a.pid)) = 0
      AND EXISTS (
          SELECT 1
          FROM pg_stat_activity c
          WHERE a.pid = ANY(pg_blocking_pids(c.pid))
      )

    UNION ALL

    -- Sessions blocked by a member of the tree
    SELECT
        a.pid,
        a.usename,
        a.query,
        bt.pid,
        bt.blocked_query,
        bt.level + 1
    FROM pg_stat_activity a
    JOIN blocking_tree bt ON bt.pid = ANY(pg_blocking_pids(a.pid))
)
SELECT
    level,
    pid,
    usename,
    blocking_pid,
    LEFT(blocked_query, 50) AS blocked_query,
    LEFT(blocking_query, 50) AS blocking_query
FROM blocking_tree
ORDER BY level, pid;

-- 3. Transaction rollback rate by database
SELECT
    datname,
    xact_commit AS commits,
    xact_rollback AS rollbacks,
    ROUND(100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0), 2) AS rollback_rate_percent
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY rollback_rate_percent DESC;

-- 4. Idle in transaction connections
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    NOW() - state_change AS idle_duration,
    state,
    query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND pid != pg_backend_pid()
ORDER BY state_change;

-- 5. Sessions currently waiting, grouped by wait event
SELECT
    wait_event_type,
    wait_event,
    COUNT(*) AS waiting_count,
    array_agg(DISTINCT pid) AS waiting_pids
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY waiting_count DESC;
```

## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| "Permission denied for pg_stat_activity" | Insufficient monitoring privileges | Grant the pg_monitor role or SELECT on pg_stat_activity |
| "Cannot terminate backend" | Trying to kill a superuser connection | Use pg_cancel_backend or kill the process at the OS level |
| "Connection pool exhausted" | Too many idle connections | Kill idle-in-transaction connections, increase pool size |
| "High rollback rate" | Application errors or constraint violations | Review application logs and fix bugs |
| "Lock wait timeout exceeded" | Deadlock or very long lock hold | Analyze blocking queries, implement timeouts |

## Configuration Options

**Monitoring Intervals**
- `check_interval`: 5-10 seconds for real-time alerting
- `long_transaction_threshold`: 30-60 seconds (production), 300 seconds (analytics)
- `idle_in_transaction_timeout`: 600 seconds (10 minutes)

**Auto-Kill Thresholds**
- Long-running OLTP: 60-300 seconds
- Long-running analytics: 3600 seconds (1 hour)
- Idle in transaction: 600 seconds (10 minutes)

**Alert Thresholds**
- Rollback rate: >5% warning, >10% critical
- Blocked transactions: >10 warning, >50 critical
- Active connections: >80% of max_connections

## Best Practices

DO:
- Set statement_timeout in application connection strings (see the sketch after this list)
- Use connection pooling to limit total connections
- Implement transaction timeouts in application code
- Monitor transaction throughput trends over time
- Kill idle in transaction connections automatically
- Track rollback reasons in application logs
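
A minimal sketch of the first DO item, assuming psycopg2: the libpq `options` parameter passes server-side settings so PostgreSQL itself enforces the limits even if the application misbehaves. The DSN and timeout values are illustrative.

```python
# app/db.py -- illustrative connection setup; DSN and values are assumptions
import psycopg2

def connect_with_timeouts(dsn: str = "postgresql://app_user:password@localhost:5432/mydb"):
    """Open a connection whose transactions are bounded by server-enforced timeouts."""
    return psycopg2.connect(
        dsn,
        # Server-side settings applied to this session:
        #   statement_timeout                   - abort any statement running longer than 30s
        #   idle_in_transaction_session_timeout - drop sessions idle in a transaction for 60s
        options="-c statement_timeout=30000 -c idle_in_transaction_session_timeout=60000",
    )
```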

DON'T:
- Leave transactions open while waiting for user input
- Hold locks during expensive operations (file I/O, network calls)
- Use long-running transactions in OLTP workloads
- Ignore idle in transaction connections (they hold locks)
- Set transaction timeouts too low (causes false positives)

## Performance Considerations

- Monitoring adds <0.1% CPU overhead with 10-second intervals
- pg_stat_activity queries are lightweight (<1ms)
- Auto-killing transactions requires careful threshold tuning
- Historical metrics retention: 30 days (aggregated), 7 days (detailed)
- Consider read replicas for monitoring queries in high-load systems

## Related Commands

- `/database-deadlock-detector` - Detailed deadlock analysis
- `/database-health-monitor` - Overall database health metrics
- `/sql-query-optimizer` - Optimize slow queries causing lock contention
- `/database-connection-pooler` - Manage connection pool sizing

## Version History

- v1.0.0 (2024-10): Initial implementation with PostgreSQL real-time monitoring
- Planned v1.1.0: Add MySQL transaction monitoring and distributed transaction support