description: Analyze query patterns and recommend optimal database indexes
shortcut: index-advisor

Database Index Advisor

Analyze query workloads, identify missing indexes, detect unused indexes, and recommend optimal indexing strategies with automated index impact analysis and maintenance scheduling for production databases.

When to Use This Command

Use /index-advisor when you need to:

  • Optimize slow queries with proper indexing strategies
  • Analyze database workload for missing index opportunities
  • Identify and remove unused indexes consuming storage and write performance
  • Design composite indexes for multi-column query patterns
  • Implement covering indexes to eliminate table lookups
  • Monitor index bloat and schedule maintenance (REINDEX, VACUUM)

DON'T use this when:

  • Database is small (<1GB) with minimal query load
  • All queries are simple primary key lookups
  • You're looking for application-level query issues (use query optimizer instead)
  • Database doesn't support custom indexes (some managed databases)

Design Decisions

This command implements workload-based index analysis because:

  • Real query patterns reveal actual index opportunities
  • EXPLAIN ANALYZE provides accurate index impact estimates
  • Unused index detection prevents unnecessary write overhead
  • Composite index recommendations reduce total index count
  • Covering indexes eliminate expensive table lookups (3-10x speedup)
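
For example, a covering index lets the planner answer a query from the index alone (an index-only scan). A minimal PostgreSQL sketch, assuming a hypothetical orders table with customer_id, status, and total columns:

-- Covering index: status and total are stored in the index itself,
-- so the query below can avoid heap lookups entirely
CREATE INDEX idx_orders_customer_covering
    ON orders (customer_id)
    INCLUDE (status, total);

-- Candidate for an index-only scan (keep the visibility map fresh with VACUUM)
EXPLAIN (ANALYZE, BUFFERS)
SELECT status, total
FROM orders
WHERE customer_id = 42;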

Alternative considered: Static schema analysis

  • Only analyzes table structure, not query patterns
  • Can't estimate real-world performance impact
  • May recommend indexes that won't be used
  • Recommended only for initial schema design

Alternative considered: Manual EXPLAIN analysis

  • Requires deep SQL expertise for every query
  • Time-consuming and error-prone
  • No systematic unused index detection
  • Recommended only for ad-hoc optimization

Prerequisites

Before running this command:

  1. Access to database query logs or slow query log
  2. Permission to run EXPLAIN ANALYZE on queries
  3. Monitoring of database storage and I/O metrics
  4. Understanding of application query patterns
  5. Maintenance window for index creation (for large tables)

Implementation Process

Step 1: Collect Query Workload Data

Capture real production queries from logs or pg_stat_statements.
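
A minimal PostgreSQL sketch of both capture paths (the 250ms threshold is an assumption; tune it per workload):

-- Log individual statements slower than 250 ms to the server log
ALTER SYSTEM SET log_min_duration_statement = '250ms';
SELECT pg_reload_conf();

-- Aggregate view of the workload (requires pg_stat_statements, configured in Example 1 below)
SELECT LEFT(query, 100) AS query, calls, ROUND(mean_exec_time::numeric, 2) AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;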

Step 2: Analyze Query Execution Plans

Run EXPLAIN ANALYZE to identify sequential scans and suboptimal query plans.
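
A sketch of what to look for, using a hypothetical orders table:

-- "Seq Scan" nodes with a Filter and a high "Rows Removed by Filter" count
-- are the classic signature of a predicate that would benefit from an index
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42;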

Step 3: Generate Index Recommendations

Identify missing indexes, composite index opportunities, and covering indexes.
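
For example (hypothetical orders table; the chosen columns and predicates are assumptions):

-- Composite index for queries that filter on both columns
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);

-- Partial index for queries that always filter on the same constant
CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending';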

Step 4: Simulate Index Impact

Estimate query performance improvements with hypothetical indexes.
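
One way to do this in PostgreSQL is the hypopg extension, which creates planner-only hypothetical indexes. A sketch, assuming hypopg is installed:

CREATE EXTENSION IF NOT EXISTS hypopg;

-- Create a hypothetical index (no disk writes; visible only to the planner)
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

-- Plain EXPLAIN (without ANALYZE) shows whether the planner would use it
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Remove all hypothetical indexes when done
SELECT hypopg_reset();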

Step 5: Implement and Monitor Indexes

Create recommended indexes and track query performance improvements.
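
A typical PostgreSQL rollout, sketched with a hypothetical index name:

-- Build without blocking concurrent writes, then refresh planner statistics
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
ANALYZE orders;

-- Confirm the new index is actually being used over time
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders';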

Output Format

The command generates:

  • analysis/missing_indexes.sql - CREATE INDEX statements for missing indexes
  • analysis/unused_indexes.sql - DROP INDEX statements for unused indexes
  • reports/index_impact_report.html - Visual impact analysis with before/after metrics
  • monitoring/index_health.sql - Queries to monitor index bloat and usage
  • maintenance/reindex_schedule.sh - Automated index maintenance script
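
The maintenance script typically wraps SQL like the following (index and table names are placeholders):

-- Rebuild a bloated index without blocking reads or writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

-- Reclaim dead tuples and refresh statistics on the parent table
VACUUM (ANALYZE) orders;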

Code Examples

Example 1: PostgreSQL Index Advisor with pg_stat_statements

-- Enable pg_stat_statements extension for query tracking
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Configure query tracking
-- Note: pg_stat_statements must be listed in shared_preload_libraries,
-- and pg_stat_statements.max only takes effect after a server restart
ALTER SYSTEM SET pg_stat_statements.track = 'all';
ALTER SYSTEM SET pg_stat_statements.max = 10000;
SELECT pg_reload_conf();

-- View most expensive queries without proper indexes
CREATE OR REPLACE VIEW slow_queries_needing_indexes AS
SELECT
    queryid,
    LEFT(query, 100) AS query_snippet,
    calls,
    total_exec_time,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    ROUND((100.0 * total_exec_time / SUM(total_exec_time) OVER ())::numeric, 2) AS pct_total_time
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat%'
  AND mean_exec_time > 100  -- Queries averaging >100ms
ORDER BY total_exec_time DESC
LIMIT 50;

-- Identify missing indexes by analyzing sequential scans
CREATE OR REPLACE VIEW tables_needing_indexes AS
SELECT
    schemaname,
    relname AS tablename,
    seq_scan AS sequential_scans,
    seq_tup_read AS rows_read_sequentially,
    idx_scan AS index_scans,
    idx_tup_fetch AS rows_fetched_via_index,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    CASE
        WHEN seq_scan > 0 THEN ROUND(100.0 * seq_scan / (seq_scan + COALESCE(idx_scan, 0)), 2)
        ELSE 0
    END AS pct_sequential_scans
FROM pg_stat_user_tables
WHERE seq_scan > 1000  -- Tables with >1000 sequential scans
  AND seq_tup_read > 10000  -- Reading >10k rows sequentially
ORDER BY seq_tup_read DESC;

-- Detect unused indexes consuming storage
CREATE OR REPLACE VIEW unused_indexes AS
SELECT
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    idx_scan AS index_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    pg_get_indexdef(indexrelid) AS index_definition
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- Never used
  AND indexrelname NOT LIKE '%_pkey'  -- Exclude primary keys
  AND indexrelname NOT LIKE '%_unique%'  -- Exclude unique constraints
ORDER BY pg_relation_size(indexrelid) DESC;

-- Analyze index bloat and recommend REINDEX
CREATE OR REPLACE VIEW index_bloat_analysis AS
SELECT
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    ROUND(100.0 * pg_relation_size(indexrelid) / NULLIF(pg_relation_size(relid), 0), 2) AS index_to_table_ratio,
    CASE
        WHEN pg_relation_size(indexrelid) > pg_relation_size(relid) * 0.3
        THEN 'High bloat - consider REINDEX'
        WHEN pg_relation_size(indexrelid) > pg_relation_size(relid) * 0.15
        THEN 'Moderate bloat - monitor'
        ELSE 'Healthy'
    END AS bloat_status
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 100 * 1024 * 1024  -- >100MB
ORDER BY pg_relation_size(indexrelid) DESC;

# scripts/index_advisor.py - Comprehensive Index Analysis Tool
import psycopg2
from psycopg2.extras import DictCursor
import re
import logging
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, asdict
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class IndexRecommendation:
    """Represents an index recommendation with impact analysis."""
    table_name: str
    recommended_index: str
    reason: str
    affected_queries: List[str]
    estimated_speedup: str
    storage_cost_mb: float
    priority: str  # 'high', 'medium', 'low'

    def to_dict(self) -> dict:
        return asdict(self)

class PostgreSQLIndexAdvisor:
    """Analyze queries and recommend optimal indexes."""

    def __init__(self, connection_string: str):
        self.conn_string = connection_string

    def connect(self):
        return psycopg2.connect(self.conn_string, cursor_factory=DictCursor)

    def analyze_slow_queries(self, min_duration_ms: int = 100) -> List[Dict]:
        """Identify slow queries from pg_stat_statements."""
        conn = self.connect()

        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT
                        queryid,
                        LEFT(query, 200) AS query,
                        calls,
                        ROUND(total_exec_time::numeric, 2) AS total_time_ms,
                        ROUND(mean_exec_time::numeric, 2) AS mean_time_ms,
                        ROUND(max_exec_time::numeric, 2) AS max_time_ms
                    FROM pg_stat_statements
                    WHERE mean_exec_time > %s
                      AND query NOT LIKE '%%pg_stat%%'
                      AND query NOT LIKE '%%information_schema%%'
                    ORDER BY total_exec_time DESC
                    LIMIT 100;
                """, (min_duration_ms,))

                return [dict(row) for row in cur.fetchall()]

        finally:
            conn.close()

    def extract_where_columns(self, query: str) -> List[Tuple[str, str]]:
        """Extract table and column names from WHERE clauses."""
        columns = []

        # Pattern: WHERE table.column = value
        where_pattern = r'WHERE\s+(\w+)\.(\w+)\s*[=<>]'
        matches = re.finditer(where_pattern, query, re.IGNORECASE)

        for match in matches:
            table = match.group(1)
            column = match.group(2)
            columns.append((table, column))

        # Pattern: JOIN table ON t1.col = t2.col
        join_pattern = r'JOIN\s+(\w+)\s+\w+\s+ON\s+\w+\.(\w+)\s*=\s*\w+\.\w+'
        matches = re.finditer(join_pattern, query, re.IGNORECASE)

        for match in matches:
            table = match.group(1)
            column = match.group(2)
            columns.append((table, column))

        return columns

    def check_existing_indexes(self, table: str, column: str) -> bool:
        """Check if index exists for table.column."""
        conn = self.connect()

        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT COUNT(*) > 0 AS has_index
                    FROM pg_indexes
                    WHERE tablename = %s
                      AND indexdef LIKE %s;
                """, (table, f'%{column}%'))

                result = cur.fetchone()
                return result['has_index'] if result else False

        finally:
            conn.close()

    def generate_recommendations(self) -> List[IndexRecommendation]:
        """Generate comprehensive index recommendations."""
        recommendations = []

        # Get slow queries
        slow_queries = self.analyze_slow_queries()
        logger.info(f"Analyzing {len(slow_queries)} slow queries...")

        # Track columns needing indexes
        column_usage = defaultdict(lambda: {'count': 0, 'queries': [], 'total_time': 0})

        for query_info in slow_queries:
            query = query_info['query']
            total_time = float(query_info['total_time_ms'])

            # Extract WHERE/JOIN columns
            columns = self.extract_where_columns(query)

            for table, column in columns:
                # Check if index exists
                if not self.check_existing_indexes(table, column):
                    key = f"{table}.{column}"
                    column_usage[key]['count'] += 1
                    column_usage[key]['queries'].append(query[:100])
                    column_usage[key]['total_time'] += total_time

        # Generate recommendations
        for key, usage in column_usage.items():
            table, column = key.split('.')

            # Estimate speedup based on query count and time
            if usage['total_time'] > 10000:  # >10 seconds total
                priority = 'high'
                speedup = '5-10x faster'
            elif usage['total_time'] > 1000:  # >1 second total
                priority = 'medium'
                speedup = '3-5x faster'
            else:
                priority = 'low'
                speedup = '2-3x faster'

            # Estimate storage cost (rough approximation)
            storage_cost = self.estimate_index_size(table)

            recommendation = IndexRecommendation(
                table_name=table,
                recommended_index=f"CREATE INDEX idx_{table}_{column} ON {table}({column});",
                reason=f"Used in {usage['count']} slow queries totaling {usage['total_time']:.0f}ms",
                affected_queries=usage['queries'][:5],  # Top 5 queries
                estimated_speedup=speedup,
                storage_cost_mb=storage_cost,
                priority=priority
            )

            recommendations.append(recommendation)

        # Sort by priority, then by number of affected queries (descending)
        recommendations.sort(
            key=lambda r: (
                {'high': 0, 'medium': 1, 'low': 2}[r.priority],
                -len(r.affected_queries)
            )
        )

        return recommendations

    def estimate_index_size(self, table: str) -> float:
        """Estimate index size in MB based on table size."""
        conn = self.connect()

        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT pg_relation_size(%s) / (1024.0 * 1024.0) AS size_mb
                    FROM pg_class
                    WHERE relname = %s;
                """, (table, table))

                result = cur.fetchone()
                if result:
                    # Index is typically 20-30% of table size; size_mb comes back as Decimal
                    return round(float(result['size_mb']) * 0.25, 2)
                return 10.0  # Default estimate

        except Exception as e:
            logger.warning(f"Could not estimate size for {table}: {e}")
            return 10.0
        finally:
            conn.close()

    def find_unused_indexes(self) -> List[Dict]:
        """Identify indexes that are never used."""
        conn = self.connect()

        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT
                        schemaname,
                        relname AS tablename,
                        indexrelname AS indexname,
                        pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
                        pg_get_indexdef(indexrelid) AS definition
                    FROM pg_stat_user_indexes
                    WHERE idx_scan = 0
                      AND indexrelname NOT LIKE '%_pkey'
                      AND indexrelname NOT LIKE '%_unique%'
                      AND pg_relation_size(indexrelid) > 1024 * 1024  -- >1MB
                    ORDER BY pg_relation_size(indexrelid) DESC;
                """)

                return [dict(row) for row in cur.fetchall()]

        finally:
            conn.close()

    def generate_report(self):
        """Generate comprehensive index analysis report."""
        logger.info("=== Index Analysis Report ===\n")

        # Recommendations
        recommendations = self.generate_recommendations()

        if recommendations:
            logger.info(f"Found {len(recommendations)} index recommendations:\n")

            for i, rec in enumerate(recommendations, 1):
                logger.info(f"{i}. [{rec.priority.upper()}] {rec.table_name}")
                logger.info(f"   Recommendation: {rec.recommended_index}")
                logger.info(f"   Reason: {rec.reason}")
                logger.info(f"   Expected speedup: {rec.estimated_speedup}")
                logger.info(f"   Storage cost: ~{rec.storage_cost_mb}MB")
                logger.info(f"   Affected queries: {len(rec.affected_queries)}")
                logger.info("")

        # Unused indexes
        unused = self.find_unused_indexes()

        if unused:
            logger.info(f"\n=== Unused Indexes ({len(unused)}) ===\n")

            for idx in unused:
                logger.info(f"DROP INDEX {idx['schemaname']}.{idx['indexname']};")
                logger.info(f"  -- Table: {idx['tablename']}, Size: {idx['index_size']}")

        logger.info("\n=== Summary ===")
        logger.info(f"Missing indexes: {len(recommendations)}")
        logger.info(f"Unused indexes: {len(unused)}")
        logger.info(f"Potential storage savings: {sum(self._parse_size(idx['index_size']) for idx in unused):.2f}MB")

    def _parse_size(self, size_str: str) -> float:
        """Parse PostgreSQL pg_size_pretty output to MB."""
        if 'GB' in size_str:
            return float(size_str.replace(' GB', '').replace('GB', '')) * 1024
        elif 'MB' in size_str:
            return float(size_str.replace(' MB', '').replace('MB', ''))
        elif 'KB' in size_str:
            return float(size_str.replace(' KB', '').replace('KB', '')) / 1024
        return 0.0

# Usage
if __name__ == "__main__":
    advisor = PostgreSQLIndexAdvisor(
        "postgresql://user:password@localhost:5432/mydb"
    )

    advisor.generate_report()

Example 2: MySQL Index Advisor with Performance Schema

-- Enable performance schema for query analysis
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%';

UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE '%statements%';

-- Identify slow queries needing indexes
CREATE OR REPLACE VIEW slow_queries_analysis AS
SELECT
    DIGEST_TEXT AS query,
    COUNT_STAR AS executions,
    ROUND(AVG_TIMER_WAIT / 1000000000, 2) AS avg_time_ms,
    ROUND(MAX_TIMER_WAIT / 1000000000, 2) AS max_time_ms,
    ROUND(SUM_TIMER_WAIT / 1000000000, 2) AS total_time_ms,
    SUM_ROWS_EXAMINED AS total_rows_examined,
    SUM_ROWS_SENT AS total_rows_sent,
    ROUND(SUM_ROWS_EXAMINED / COUNT_STAR, 0) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT IS NOT NULL
  AND SCHEMA_NAME NOT IN ('mysql', 'performance_schema', 'information_schema')
  AND AVG_TIMER_WAIT > 100000000000  -- >100ms average (timer values are picoseconds)
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 50;

-- Find tables with rows read without using any index (full table scans)
SELECT
    object_schema AS database_name,
    object_name AS table_name,
    count_read AS rows_full_scanned,
    ROUND(sum_timer_wait / 1000000000, 2) AS total_wait_ms
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NULL
  AND object_schema NOT IN ('mysql', 'performance_schema', 'information_schema')
  AND count_read > 1000
ORDER BY count_read DESC;

-- Identify duplicate indexes
SELECT
    t1.TABLE_SCHEMA AS database_name,
    t1.TABLE_NAME AS table_name,
    t1.INDEX_NAME AS index1,
    t2.INDEX_NAME AS index2,
    GROUP_CONCAT(t1.COLUMN_NAME ORDER BY t1.SEQ_IN_INDEX) AS columns
FROM information_schema.STATISTICS t1
JOIN information_schema.STATISTICS t2
    ON t1.TABLE_SCHEMA = t2.TABLE_SCHEMA
    AND t1.TABLE_NAME = t2.TABLE_NAME
    AND t1.INDEX_NAME < t2.INDEX_NAME
    AND t1.COLUMN_NAME = t2.COLUMN_NAME
    AND t1.SEQ_IN_INDEX = t2.SEQ_IN_INDEX
WHERE t1.TABLE_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema')
GROUP BY t1.TABLE_SCHEMA, t1.TABLE_NAME, t1.INDEX_NAME, t2.INDEX_NAME
HAVING COUNT(*) = (
    SELECT COUNT(*)
    FROM information_schema.STATISTICS
    WHERE TABLE_SCHEMA = t1.TABLE_SCHEMA
      AND TABLE_NAME = t1.TABLE_NAME
      AND INDEX_NAME = t1.INDEX_NAME
);

// scripts/mysql-index-advisor.js
const mysql = require('mysql2/promise');

class MySQLIndexAdvisor {
    constructor(config) {
        this.pool = mysql.createPool({
            host: config.host,
            user: config.user,
            password: config.password,
            database: config.database,
            waitForConnections: true,
            connectionLimit: 10
        });
    }

    async analyzeTableIndexes(tableName) {
        const [rows] = await this.pool.query(`
            SELECT
                COLUMN_NAME,
                CARDINALITY,
                INDEX_NAME,
                SEQ_IN_INDEX,
                NON_UNIQUE
            FROM information_schema.STATISTICS
            WHERE TABLE_SCHEMA = DATABASE()
              AND TABLE_NAME = ?
            ORDER BY INDEX_NAME, SEQ_IN_INDEX
        `, [tableName]);

        return rows;
    }

    async findMissingIndexes() {
        // Analyze slow queries from performance schema
        const [slowQueries] = await this.pool.query(`
            SELECT
                DIGEST_TEXT AS query,
                COUNT_STAR AS executions,
                ROUND(AVG_TIMER_WAIT / 1000000000, 2) AS avg_time_ms,
                SUM_ROWS_EXAMINED AS total_rows_examined
            FROM performance_schema.events_statements_summary_by_digest
            WHERE AVG_TIMER_WAIT > 100000000000  -- >100ms average (picoseconds)
              AND SCHEMA_NAME = DATABASE()
            ORDER BY AVG_TIMER_WAIT DESC
            LIMIT 20
        `);

        const recommendations = [];

        for (const query of slowQueries) {
            // Extract the first WHERE column and the table from the FROM clause
            const whereMatch = query.query.match(/WHERE\s+(?:`?\w+`?\.)?`?(\w+)`?\s*[=<>]/i);
            const fromMatch = query.query.match(/FROM\s+`?(\w+)`?/i);

            if (whereMatch && fromMatch) {
                const column = whereMatch[1];
                const table = fromMatch[1];

                recommendations.push({
                    table: table,
                    column: column,
                    query: query.query.substring(0, 100),
                    avgTime: query.avg_time_ms,
                    recommendation: `CREATE INDEX idx_${table}_${column} ON ${table}(${column});`
                });
            }
        }

        return recommendations;
    }

    async generateReport() {
        console.log('=== MySQL Index Analysis Report ===\n');

        const recommendations = await this.findMissingIndexes();

        console.log(`Found ${recommendations.length} potential index improvements:\n`);

        recommendations.forEach((rec, i) => {
            console.log(`${i + 1}. ${rec.recommendation}`);
            console.log(`   Average query time: ${rec.avgTime}ms`);
            console.log(`   Query: ${rec.query}...`);
            console.log('');
        });
    }
}

// Usage
(async () => {
    const advisor = new MySQLIndexAdvisor({
        host: 'localhost',
        user: 'root',
        password: 'password',
        database: 'mydb'
    });

    await advisor.generateReport();
    await advisor.pool.end();
})();

Error Handling

Common errors, their causes, and fixes:

  • "Index too large": the index key exceeds the maximum key length. Solution: use a hash index, or index an expression (such as a prefix or hash) of the long column.
  • "Duplicate key violation": creating a unique index on non-unique data. Solution: find and resolve duplicates before creating the unique index.
  • "Out of disk space": index creation needs temporary sort space. Solution: free up disk space or point temp_tablespaces at a volume with more room (PostgreSQL).
  • "Lock timeout": index creation is blocking queries. Solution: use CREATE INDEX CONCURRENTLY (PostgreSQL) or ALGORITHM=INPLACE (MySQL).
  • "Statistics out of date": stale cardinality estimates. Solution: run ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL).

Configuration Options

Index Types

  • B-Tree: Default, good for equality and range queries
  • Hash: Fast equality lookups (crash-safe since PostgreSQL 10)
  • GIN/GiST: Full-text search and JSON queries
  • BRIN: Block range indexes for very large sequential tables
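
A couple of sketches for the non-default types above (table and column names are hypothetical):

-- GIN index for jsonb containment queries such as payload @> '{"type": "click"}'
CREATE INDEX idx_events_payload ON events USING gin (payload);

-- BRIN index for a large, append-only table queried by time range
CREATE INDEX idx_events_created_brin ON events USING brin (created_at);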

Index Options

  • CONCURRENTLY: Create without blocking writes (PostgreSQL)
  • ALGORITHM=INPLACE: Online index creation (MySQL)
  • INCLUDE columns: Covering index (PostgreSQL 11+)
  • WHERE clause: Partial index for filtered queries

Best Practices

DO:

  • Create indexes on foreign key columns
  • Use composite indexes for multi-column WHERE clauses
  • Order composite index columns to match query predicates: equality columns first (most selective leading), range columns last
  • Use covering indexes to avoid table lookups
  • Create indexes CONCURRENTLY in production
  • Monitor index usage with pg_stat_user_indexes

DON'T:

  • Create indexes on every column "just in case"
  • Index low-cardinality columns (boolean, enum with few values)
  • Wrap indexed columns in functions in WHERE clauses (unless a matching expression index exists)
  • Forget to ANALYZE after index creation
  • Create redundant indexes (e.g., (a,b) and (a) both exist)

Performance Considerations

  • Each index adds 10-30% write overhead (INSERT/UPDATE/DELETE)
  • Indexes consume storage (typically 20-30% of table size)
  • Too many indexes slow writes more than they speed reads
  • Index-only scans are 3-10x faster than table lookups
  • Covering indexes eliminate random I/O entirely
  • Partial indexes reduce storage and maintenance overhead

Related Commands

  • /sql-query-optimizer - Rewrite queries for better performance
  • /database-partition-manager - Partition large tables for faster queries
  • /database-health-monitor - Monitor index bloat and maintenance needs
  • /database-backup-automator - Schedule REINDEX during maintenance windows

Version History

  • v1.0.0 (2024-10): Initial implementation with PostgreSQL and MySQL support
  • Planned v1.1.0: Add hypothetical index simulation and automated A/B testing