---
description: Monitor database transactions with real-time alerting for performance and lock issues
shortcut: txn-monitor
---

# Database Transaction Monitor

Monitor database transaction performance, detect long-running transactions, identify lock contention, track rollback rates, and automatically alert on transaction anomalies to keep production databases healthy.

## When to Use This Command

Use `/txn-monitor` when you need to:

- Detect and kill long-running transactions blocking other queries
- Monitor lock wait times and identify deadlock patterns
- Track transaction rollback rates for error analysis
- Alert on isolation level anomalies (phantom reads, dirty reads)
- Analyze transaction throughput and latency trends
- Investigate application connection leaks

DON'T use this when:

- The database has minimal transaction load (<100 TPS)
- All transactions complete within milliseconds
- You are looking for query optimization (use the query optimizer instead)
- You are investigating data corruption (use the audit logger instead)

## Design Decisions

This command implements **real-time transaction monitoring with automated alerting** because:

- Long-running transactions (>30s) block other queries and degrade performance
- Lock contention detection prevents cascade failures
- Rollback rate monitoring surfaces application bugs early
- Automatic alerts reduce MTTR (Mean Time To Resolution)
- Historical trend analysis enables capacity planning

**Alternative considered: Periodic manual checks**

- No automated alerting on issues
- Relies on humans checking dashboards
- Slower incident response
- Recommended only for development environments

**Alternative considered: Database log parsing**

- Post-mortem analysis only
- No real-time alerts
- Requires custom log parsing logic
- Recommended for compliance/audit purposes

## Prerequisites

Before running this command, you need:

1. Database monitoring permissions (the `pg_monitor` role or the `PROCESS` privilege)
2. Access to `pg_stat_activity` (PostgreSQL) or `performance_schema` (MySQL)
3. Alerting infrastructure (Slack, PagerDuty, email)
4. A monitoring data retention strategy (metrics database or time-series DB)
5. A runbook for common transaction issues

## Implementation Process

### Step 1: Enable Transaction Monitoring

Configure the database to track transaction statistics (a configuration sketch follows Step 5 below).

### Step 2: Build Real-Time Monitor

Create a monitoring script that polls transaction statistics every 5-10 seconds.

### Step 3: Define Alert Thresholds

Set thresholds for long-running transactions, lock waits, and rollback rates.

### Step 4: Implement Automated Actions

Auto-kill transactions exceeding thresholds or alert operators.

### Step 5: Create Dashboards

Build Grafana dashboards to visualize transaction metrics.
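The following is a minimal sketch of Step 1 for PostgreSQL, assuming a role allowed to run `ALTER SYSTEM`; the file path, connection string, and timeout values are illustrative, not recommendations for every workload:

```python
# setup/enable_transaction_tracking.py (illustrative path)
# Minimal sketch: apply the server settings the monitor relies on.
import psycopg2

SETTINGS = {
    "track_activities": "on",                       # populate pg_stat_activity
    "track_counts": "on",                           # populate pg_stat_database counters
    "log_lock_waits": "on",                         # log waits longer than deadlock_timeout
    "idle_in_transaction_session_timeout": "600s",  # server-side idle-in-transaction kill switch
}


def enable_tracking(conn_string: str) -> None:
    conn = psycopg2.connect(conn_string)
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            for name, value in SETTINGS.items():
                cur.execute(f"ALTER SYSTEM SET {name} = %s", (value,))
            cur.execute("SELECT pg_reload_conf()")  # pick up the new values without a restart
    finally:
        conn.close()


if __name__ == "__main__":
    enable_tracking("postgresql://admin_user:password@localhost:5432/mydb")
```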
## Output Format

The command generates:

- `monitoring/transaction_monitor.py` - Real-time transaction monitoring daemon
- `queries/transaction_analysis.sql` - Transaction health diagnostic queries
- `alerts/transaction_alerts.yml` - Prometheus alerting rules
- `dashboards/transaction_dashboard.json` - Grafana dashboard configuration
- `docs/transaction_runbook.md` - Incident response procedures

## Code Examples

### Example 1: PostgreSQL Real-Time Transaction Monitor

```python
# monitoring/postgres_transaction_monitor.py
import psycopg2
from psycopg2.extras import DictCursor
import time
import logging
from typing import List, Dict, Optional, Any
from dataclasses import dataclass, asdict
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class TransactionInfo:
    """Represents an active transaction."""
    pid: int
    username: str
    database: str
    application_name: str
    client_addr: str
    state: str
    query: str
    transaction_start: datetime
    query_start: datetime
    wait_event: Optional[str]
    blocking_pids: List[int]

    def duration_seconds(self) -> float:
        return (datetime.now() - self.transaction_start).total_seconds()

    def to_dict(self) -> dict:
        result = asdict(self)
        result['transaction_start'] = self.transaction_start.isoformat()
        result['query_start'] = self.query_start.isoformat()
        result['duration_seconds'] = self.duration_seconds()
        return result


class PostgreSQLTransactionMonitor:
    """Monitor PostgreSQL transactions in real-time."""

    def __init__(
        self,
        connection_string: str,
        long_transaction_threshold: int = 30,
        check_interval: int = 10
    ):
        self.conn_string = connection_string
        self.long_transaction_threshold = long_transaction_threshold
        self.check_interval = check_interval
        self.stats = {
            'total_transactions': 0,
            'long_running_count': 0,
            'blocked_count': 0,
            'idle_in_transaction_count': 0
        }

    def connect(self):
        return psycopg2.connect(self.conn_string, cursor_factory=DictCursor)

    def get_active_transactions(self) -> List[TransactionInfo]:
        """Fetch all active transactions with blocking information."""
        query = """
            SELECT
                a.pid,
                a.usename,
                a.datname AS database,
                a.application_name,
                a.client_addr::text,
                a.state,
                a.query,
                a.xact_start AS transaction_start,
                a.query_start,
                a.wait_event,
                array_agg(b.pid) FILTER (WHERE b.pid IS NOT NULL) AS blocking_pids
            FROM pg_stat_activity a
            LEFT JOIN pg_stat_activity b
                ON b.pid = ANY(pg_blocking_pids(a.pid))
            WHERE a.pid != pg_backend_pid()
              AND a.state != 'idle'
              AND a.xact_start IS NOT NULL
            GROUP BY a.pid, a.usename, a.datname, a.application_name,
                     a.client_addr, a.state, a.query, a.xact_start,
                     a.query_start, a.wait_event
            ORDER BY a.xact_start;
        """

        conn = self.connect()
        try:
            with conn.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()

                transactions = []
                for row in rows:
                    txn = TransactionInfo(
                        pid=row['pid'],
                        username=row['usename'],
                        database=row['database'],
                        application_name=row['application_name'] or 'unknown',
                        client_addr=row['client_addr'] or 'local',
                        state=row['state'],
                        query=row['query'][:200],  # Truncate long queries
                        transaction_start=row['transaction_start'],
                        query_start=row['query_start'],
                        wait_event=row['wait_event'],
                        blocking_pids=row['blocking_pids'] or []
                    )
                    transactions.append(txn)

                return transactions
        finally:
            conn.close()

    def find_long_running_transactions(
        self, transactions: List[TransactionInfo]
    ) -> List[TransactionInfo]:
        """Identify transactions exceeding threshold."""
        return [
            txn for txn in transactions
            if txn.duration_seconds() > self.long_transaction_threshold
        ]
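    # --- Added sketch (not part of the original example) --------------------
    # The Output Format section lists Prometheus alerting rules; one way to
    # feed them is to expose self.stats from this daemon with the
    # prometheus_client library. Method names, metric prefix, and port are
    # illustrative assumptions.
    def start_metrics_exporter(self, port: int = 9187) -> None:
        """Expose monitor counters as Prometheus gauges (call once at startup)."""
        from prometheus_client import Gauge, start_http_server
        start_http_server(port)
        self._gauges = {
            name: Gauge(f"txn_monitor_{name}", f"Transaction monitor counter: {name}")
            for name in self.stats
        }

    def update_metrics(self) -> None:
        """Push the latest self.stats values into the gauges, if enabled."""
        for name, gauge in getattr(self, "_gauges", {}).items():
            gauge.set(self.stats.get(name, 0))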
    def find_blocked_transactions(
        self, transactions: List[TransactionInfo]
    ) -> List[TransactionInfo]:
        """Identify transactions waiting on locks."""
        return [
            txn for txn in transactions
            if txn.blocking_pids and len(txn.blocking_pids) > 0
        ]

    def find_idle_in_transaction(
        self, transactions: List[TransactionInfo]
    ) -> List[TransactionInfo]:
        """Find idle transactions holding locks."""
        return [
            txn for txn in transactions
            if txn.state == 'idle in transaction'
            and txn.duration_seconds() > 60
        ]

    def kill_transaction(self, pid: int, reason: str) -> bool:
        """Terminate a transaction by PID."""
        conn = self.connect()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
                success = cur.fetchone()[0]

                if success:
                    logger.warning(f"Killed transaction PID {pid}: {reason}")
                else:
                    logger.error(f"Failed to kill transaction PID {pid}")

                return success
        finally:
            conn.close()

    def get_transaction_stats(self) -> Dict[str, Any]:
        """Get overall transaction statistics."""
        conn = self.connect()
        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT
                        (SELECT count(*) FROM pg_stat_activity WHERE state != 'idle') AS active_connections,
                        (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction') AS idle_in_txn,
                        (SELECT sum(xact_commit) FROM pg_stat_database) AS total_commits,
                        (SELECT sum(xact_rollback) FROM pg_stat_database) AS total_rollbacks,
                        (SELECT sum(conflicts) FROM pg_stat_database_conflicts) AS conflicts
                """)
                row = cur.fetchone()

                total_txns = row['total_commits'] + row['total_rollbacks']
                rollback_rate = (row['total_rollbacks'] / total_txns * 100) if total_txns > 0 else 0

                return {
                    'active_connections': row['active_connections'],
                    'idle_in_transaction': row['idle_in_txn'],
                    'total_commits': row['total_commits'],
                    'total_rollbacks': row['total_rollbacks'],
                    'rollback_rate_percent': round(rollback_rate, 2),
                    'conflicts': row['conflicts']
                }
        finally:
            conn.close()

    def alert(self, severity: str, message: str, details: Optional[dict] = None):
        """Send alert to monitoring system."""
        log_func = {
            'critical': logger.critical,
            'warning': logger.warning,
            'info': logger.info
        }.get(severity, logger.info)

        log_func(f"[{severity.upper()}] {message}")
        if details:
            logger.info(f"Details: {details}")

        # Implement webhook/email/PagerDuty integration here
        # Example: requests.post(webhook_url, json={'message': message, 'details': details})
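    # --- Added sketch (not part of the original example) --------------------
    # One possible implementation of the webhook placeholder in alert() above,
    # posting to a Slack-style incoming webhook with the requests library.
    # The helper name, webhook_url attribute, and payload shape are assumptions.
    def _post_webhook(self, message: str, details: Optional[dict] = None) -> None:
        """Best-effort POST of an alert payload to a configured webhook URL."""
        import requests
        webhook_url = getattr(self, "webhook_url", None)
        if not webhook_url:
            return  # webhook integration not configured
        try:
            requests.post(
                webhook_url,
                json={"text": message, "details": details or {}},
                timeout=5,
            )
        except requests.RequestException as exc:
            logger.error(f"Failed to deliver alert webhook: {exc}")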
    def run_monitoring_loop(self):
        """Main monitoring loop."""
        logger.info(f"Starting transaction monitoring (interval: {self.check_interval}s)")

        while True:
            try:
                # Fetch active transactions
                transactions = self.get_active_transactions()
                self.stats['total_transactions'] = len(transactions)

                # Check for long-running transactions
                long_running = self.find_long_running_transactions(transactions)
                if long_running:
                    self.stats['long_running_count'] = len(long_running)
                    for txn in long_running:
                        self.alert(
                            'warning',
                            f"Long-running transaction detected: PID {txn.pid}",
                            {
                                'duration': txn.duration_seconds(),
                                'database': txn.database,
                                'username': txn.username,
                                'query': txn.query
                            }
                        )

                        # Auto-kill if exceeds 5 minutes
                        if txn.duration_seconds() > 300:
                            self.kill_transaction(
                                txn.pid,
                                f"Exceeded 5 minute threshold ({txn.duration_seconds():.0f}s)"
                            )

                # Check for blocked transactions
                blocked = self.find_blocked_transactions(transactions)
                if blocked:
                    self.stats['blocked_count'] = len(blocked)
                    for txn in blocked:
                        self.alert(
                            'warning',
                            f"Blocked transaction: PID {txn.pid}",
                            {
                                'blocking_pids': txn.blocking_pids,
                                'wait_event': txn.wait_event,
                                'duration': txn.duration_seconds()
                            }
                        )

                # Check for idle in transaction
                idle_txns = self.find_idle_in_transaction(transactions)
                if idle_txns:
                    self.stats['idle_in_transaction_count'] = len(idle_txns)
                    for txn in idle_txns:
                        self.alert(
                            'warning',
                            f"Idle in transaction: PID {txn.pid}",
                            {
                                'duration': txn.duration_seconds(),
                                'application': txn.application_name
                            }
                        )

                        # Kill idle transactions after 10 minutes
                        if txn.duration_seconds() > 600:
                            self.kill_transaction(txn.pid, "Idle in transaction >10 minutes")

                # Get overall stats
                stats = self.get_transaction_stats()

                # Alert on high rollback rate
                if stats['rollback_rate_percent'] > 10:
                    self.alert(
                        'warning',
                        f"High transaction rollback rate: {stats['rollback_rate_percent']}%",
                        stats
                    )

                # Log periodic status
                logger.info(
                    f"Monitoring: {stats['active_connections']} active, "
                    f"{len(long_running)} long-running, "
                    f"{len(blocked)} blocked, "
                    f"{stats['rollback_rate_percent']}% rollback rate"
                )

                time.sleep(self.check_interval)

            except KeyboardInterrupt:
                logger.info("Monitoring stopped by user")
                break
            except Exception as e:
                logger.error(f"Monitoring error: {e}")
                time.sleep(self.check_interval)


# Usage
if __name__ == "__main__":
    monitor = PostgreSQLTransactionMonitor(
        connection_string="postgresql://monitor_user:password@localhost:5432/mydb",
        long_transaction_threshold=30,  # 30 seconds
        check_interval=10  # Check every 10 seconds
    )
    monitor.run_monitoring_loop()
```

### Example 2: Transaction Analysis Queries

```sql
-- PostgreSQL transaction health diagnostic queries

-- 1. Long-running transactions
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    NOW() - xact_start AS transaction_duration,
    NOW() - query_start AS query_duration,
    state,
    LEFT(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND state != 'idle'
  AND pid != pg_backend_pid()
ORDER BY xact_start;

-- 2. Blocking tree (which transactions are blocking others)
WITH RECURSIVE blocking_tree AS (
    SELECT
        a.pid,
        a.usename,
        a.query AS blocked_query,
        NULL::integer AS blocking_pid,
        NULL::text AS blocking_query,
        1 AS level
    FROM pg_stat_activity a
    WHERE NOT EXISTS (
        SELECT 1 FROM pg_stat_activity b
        WHERE b.pid = ANY(pg_blocking_pids(a.pid))
    )
    AND a.pid IN (
        SELECT unnest(pg_blocking_pids(c.pid)) FROM pg_stat_activity c
    )

    UNION ALL

    SELECT
        a.pid,
        a.usename,
        a.query,
        b.pid,
        b.query,
        bt.level + 1
    FROM blocking_tree bt
    JOIN pg_stat_activity a ON a.pid = ANY(
        SELECT unnest(pg_blocking_pids(x.pid))
        FROM pg_stat_activity x
        WHERE x.pid = bt.pid
    )
    JOIN pg_stat_activity b ON b.pid = ANY(pg_blocking_pids(a.pid))
)
SELECT
    level,
    pid,
    usename,
    blocking_pid,
    LEFT(blocked_query, 50) AS blocked_query,
    LEFT(blocking_query, 50) AS blocking_query
FROM blocking_tree
ORDER BY level, pid;

-- 3. Transaction rollback rate by database
SELECT
    datname,
    xact_commit AS commits,
    xact_rollback AS rollbacks,
    ROUND(100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0), 2) AS rollback_rate_percent
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY rollback_rate_percent DESC;

-- 4. Idle in transaction connections
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    NOW() - state_change AS idle_duration,
    state,
    query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND pid != pg_backend_pid()
ORDER BY state_change;
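-- (Added sketch) Pairs of blocked and blocking sessions via pg_locks.
-- Complements query 2: pg_blocking_pids() summarizes who blocks whom, while
-- this view also shows the contended lock type and requested mode. Uses only
-- standard catalogs; simplified join, so rare lock types may be missed.
SELECT
    blocked.pid AS blocked_pid,
    LEFT(blocked_sa.query, 50) AS blocked_query,
    blocking.pid AS blocking_pid,
    LEFT(blocking_sa.query, 50) AS blocking_query,
    blocked.locktype,
    blocked.mode AS requested_mode
FROM pg_locks blocked
JOIN pg_stat_activity blocked_sa ON blocked_sa.pid = blocked.pid
JOIN pg_locks blocking
    ON blocking.locktype = blocked.locktype
   AND blocking.database IS NOT DISTINCT FROM blocked.database
   AND blocking.relation IS NOT DISTINCT FROM blocked.relation
   AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
   AND blocking.pid != blocked.pid
   AND blocking.granted
JOIN pg_stat_activity blocking_sa ON blocking_sa.pid = blocking.pid
WHERE NOT blocked.granted;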
-- 5. Current lock waits grouped by wait event
SELECT
    wait_event_type,
    wait_event,
    COUNT(*) AS waiting_count,
    array_agg(DISTINCT pid) AS waiting_pids
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY waiting_count DESC;
```

## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| "Permission denied for pg_stat_activity" | Insufficient monitoring privileges | Grant the pg_monitor role or SELECT on pg_stat_activity |
| "Cannot terminate backend" | Trying to kill a superuser connection | Use pg_cancel_backend or kill from the OS level |
| "Connection pool exhausted" | Too many idle connections | Kill idle-in-transaction connections, increase pool size |
| "High rollback rate" | Application errors or constraint violations | Review application logs and fix bugs |
| "Lock wait timeout exceeded" | Deadlock or very long lock hold | Analyze blocking queries, implement timeouts |

## Configuration Options

**Monitoring Intervals**

- `check_interval`: 5-10 seconds for real-time alerting
- `long_transaction_threshold`: 30-60 seconds (production OLTP), 300 seconds (analytics)
- `idle_in_transaction_timeout`: 600 seconds (10 minutes)

**Auto-Kill Thresholds**

- Long-running OLTP: 60-300 seconds
- Long-running analytics: 3600 seconds (1 hour)
- Idle in transaction: 600 seconds (10 minutes)

**Alert Thresholds**

- Rollback rate: >5% warning, >10% critical
- Blocked transactions: >10 warning, >50 critical
- Active connections: >80% of max_connections

## Best Practices

DO:

- Set `statement_timeout` in application connection strings (see the sketch at the end of this document)
- Use connection pooling to limit total connections
- Implement transaction timeouts in application code
- Monitor transaction throughput trends over time
- Kill idle-in-transaction connections automatically
- Track rollback reasons in application logs

DON'T:

- Leave transactions open while waiting for user input
- Hold locks during expensive operations (file I/O, network calls)
- Use long-running transactions in OLTP workloads
- Ignore idle-in-transaction connections (they hold locks)
- Set transaction timeouts too low (this causes false positives)

## Performance Considerations

- Monitoring adds <0.1% CPU overhead with 10-second intervals
- pg_stat_activity queries are lightweight (<1 ms)
- Auto-killing transactions requires careful threshold tuning
- Historical metrics retention: 30 days (aggregated), 7 days (detailed)
- Consider read replicas for monitoring queries in high-load systems

## Related Commands

- `/database-deadlock-detector` - Detailed deadlock analysis
- `/database-health-monitor` - Overall database health metrics
- `/sql-query-optimizer` - Optimize slow queries causing lock contention
- `/database-connection-pooler` - Manage connection pool sizing

## Version History

- v1.0.0 (2024-10): Initial implementation with PostgreSQL real-time monitoring
- Planned v1.1.0: Add MySQL transaction monitoring and distributed transaction support
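As a companion to the best practice of setting `statement_timeout` in application connection strings, here is a minimal sketch using psycopg2's libpq `options` parameter; the file path, connection string, and timeout values are illustrative assumptions:

```python
# app/db_connection.py (illustrative path)
# Minimal sketch: enforce per-session timeouts from the application side so
# runaway transactions are cut off before the monitor has to intervene.
import psycopg2


def connect_with_timeouts(conn_string: str):
    """Open a connection with statement and idle-in-transaction timeouts set."""
    return psycopg2.connect(
        conn_string,
        # libpq "options" passes server settings at connection time (values in ms)
        options="-c statement_timeout=30000 "
                "-c idle_in_transaction_session_timeout=60000",
    )


if __name__ == "__main__":
    conn = connect_with_timeouts("postgresql://app_user:password@localhost:5432/mydb")
    with conn, conn.cursor() as cur:
        cur.execute("SHOW statement_timeout")
        print(cur.fetchone()[0])  # expect "30s"
    conn.close()
```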