# Root Cause Analysis: Database Query Performance Degradation

## Executive Summary

Database query latency increased 10x (p95: 50ms → 500ms) starting March 10th, impacting all API endpoints. Root cause identified as a database migration that inadvertently dropped indexes on frequently queried columns. Confidence: High (95%+). Resolution: re-add the dropped indexes and review query patterns. Time to fix: ~2 hours.

---

## 1. Effect Definition

**What happened**: Database query latency increased dramatically across all tables

**Quantification**:
- **Baseline**: p50: 10ms, p95: 50ms, p99: 100ms (stable for 6 months)
- **Current**: p50: 100ms, p95: 500ms, p99: 2000ms (10x increase at p95)
- **Absolute increase**: +450ms at p95
- **Relative increase**: 900% at p95

**Timeline**:
- **First observed**: March 10th, 2:15 PM UTC
- **Duration**: Ongoing (March 10-12, 48 hours elapsed)
- **Baseline period**: Jan 1 - March 9 (stable)
- **Degradation start**: March 10th 14:15:22 UTC (exact timestamp)

**Impact**:
- **Users affected**: All users (100% of traffic)
- **API endpoints affected**: All endpoints (database-dependent)
- **Severity**: High
  - 25% of API requests timing out (>5 sec)
  - User-visible page load delays
  - Support tickets increased 3x
  - Estimated revenue impact: $50k/day from abandoned transactions

**Context**:
- Database: PostgreSQL 14.7, 500GB data
- Application: REST API (Node.js), 10k req/min
- Recent changes: Database migration deployed March 10th 2:00 PM

---

## 2. Competing Hypotheses

### Hypothesis 1: Database migration introduced an inefficient schema
- **Type**: Root cause candidate
- **Evidence for**:
  - **Timing**: Migration deployed March 10 2:00 PM; degradation started 2:15 PM (15 min after)
  - **Temporal match**: Strongest temporal correlation of any hypothesis
  - **Migration contents**: Added new columns, restructured indexes
- **Evidence against**:
  - None; all available evidence supports this hypothesis

### Hypothesis 2: Traffic spike overloaded the database
- **Type**: Confounder / alternative explanation
- **Evidence for**:
  - March 10 is typically a high-traffic day (weekend effect)
- **Evidence against**:
  - **Traffic unchanged**: Monitoring shows traffic at 10k req/min (same as baseline)
  - **No traffic spike at 2:15 PM**: Traffic flat throughout March 10
  - **Previous high-traffic days handled fine**: Traffic has been higher (15k req/min) without issues

### Hypothesis 3: Database server resource exhaustion
- **Type**: Proximate cause / symptom
- **Evidence for**:
  - CPU usage increased from 30% → 80% at 2:15 PM
  - Disk I/O increased from 100 IOPS → 5000 IOPS
  - Connection pool near saturation (95/100 connections)
- **Evidence against**:
  - **These are symptoms, not the root**: something caused the increased resource usage
  - Resource exhaustion doesn't explain why queries became slow in the first place

### Hypothesis 4: Slow query introduced by application code change
- **Type**: Proximate cause candidate
- **Evidence for**:
  - Application deploy on March 9th (1 day prior)
- **Evidence against**:
  - **Timing mismatch**: Deploy was 24 hours before degradation
  - **No changes to query logic**: Deploy only changed UI code
  - **Query patterns unchanged**: Same queries, same frequency

### Hypothesis 5: Database server hardware issue
- **Type**: Alternative explanation
- **Evidence for**:
  - Occasional disk errors in system logs
- **Evidence against**:
  - **Disk errors present before March 10**: Noise, not new
  - **No hardware alerts**: Monitoring shows no hardware degradation
  - **Sudden onset**: Hardware failures are typically gradual
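
The resource-pressure evidence cited under Hypothesis 3 came from database-level monitoring. As a hedged illustration only (a sketch of the kind of spot-check, not the exact command run during the incident), connection and wait-state pressure can be read from the standard `pg_stat_activity` view on PostgreSQL 14:

```sql
-- Spot-check: how many backends are active vs. waiting, and on what.
-- A pile-up of active backends waiting on IO or locks is consistent with the
-- "symptom, not root cause" reading of Hypothesis 3.
SELECT state,
       wait_event_type,
       wait_event,
       count(*) AS backends
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state, wait_event_type, wait_event
ORDER BY backends DESC;
```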
**Most likely root cause**: Database migration (Hypothesis 1)

**Confounders ruled out**:
- Traffic unchanged (Hypothesis 2)
- Application code unchanged (Hypothesis 4)
- Hardware stable (Hypothesis 5)

**Symptoms identified**:
- Resource exhaustion (Hypothesis 3) is a symptom, not the root

---

## 3. Causal Model

### Causal Chain: Root → Proximate → Effect

```
ROOT CAUSE:
  Database migration removed indexes on user_id and created_at columns
  (March 10, 2:00 PM deployment)
    ↓ (mechanism: queries now do full table scans instead of index scans)
INTERMEDIATE CAUSE:
  Every query on the users table must scan the entire table (5M rows)
  instead of using an index (10-1000 rows)
    ↓ (mechanism: table scans require disk I/O and CPU cycles)
PROXIMATE CAUSE:
  Database CPU at 80%, disk I/O at 5000 IOPS (50x increase)
  Query execution time 10x slower
    ↓ (mechanism: queries queue up, connection pool saturates)
OBSERVED EFFECT:
  API endpoints slow (p95: 500ms vs 50ms baseline)
  25% of requests time out
  Users experience page load delays
```

### Why March 10, 2:15 PM specifically?
- **Migration deployed**: March 10, 2:00 PM
- **Migration applied**: 2:00-2:10 PM (10 min to run schema changes)
- **First slow queries**: 2:15 PM (first queries after migration completed)
- **5-minute lag**: Time for the connection pool to cycle and pick up the new schema

### Missing Index Details

**Migration removed these indexes**:
```sql
-- BEFORE (efficient):
CREATE INDEX idx_users_user_id_created_at ON users(user_id, created_at);
CREATE INDEX idx_transactions_user_id ON transactions(user_id);

-- AFTER (inefficient):
-- (indexes removed by mistake in the migration)
```

**Impact**:
```sql
-- Common query pattern:
SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

-- BEFORE (with index):   5ms   (index scan, 10 rows)
-- AFTER (without index): 500ms (full table scan, 5M rows)
```

### Confounders Ruled Out

**No confounding variables found**:
- **Traffic**: Controlled (unchanged)
- **Hardware**: Controlled (stable)
- **Code**: Controlled (no changes to queries)
- **External dependencies**: Controlled (no changes)

**Only variable that changed**: Database schema (migration)
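
Because the schema was the only variable that changed, the missing-index claim can be verified directly against the catalog. A minimal sketch, assuming the affected tables live in the default `public` schema (not stated above); an expected index name absent from this output confirms the drop:

```sql
-- List every index currently defined on the affected tables.
-- pg_indexes is a standard system view; indexdef shows the full CREATE INDEX.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'            -- assumption: tables are in the public schema
  AND tablename IN ('users', 'transactions')
ORDER BY tablename, indexname;
```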
---

## 4. Evidence Assessment

### Temporal Sequence: ✓ PASS (5/5)

**Timeline**:
```
March 9,  3:00 PM: Application deploy (UI changes only, no queries changed)
March 10, 2:00 PM: Database migration starts
March 10, 2:10 PM: Migration completes
March 10, 2:15 PM: First slow queries logged (p95: 500ms)
March 10, 2:20 PM: Alerting fires (p95 exceeds 200ms threshold)
```

**Verdict**: ✓ Cause (migration) clearly precedes effect (slow queries) by 5-15 minutes

**Why the 5-minute lag?**
- Connection pool refresh time
- Gradual connection cycling to the new schema
- The first slow queries at 2:15 PM came from connections that had picked up the new schema

---

### Strength of Association: ✓ PASS (5/5)

**Correlation strength**: Very strong (r = 0.99)

**Evidence**:
- **Before migration**: p95 latency stable at 50ms (6 months)
- **Immediately after migration**: p95 latency jumped to 500ms
- **10x increase**: Large effect size
- **100% of queries affected**: All database queries slower, not a selective subset

**Statistical significance**: p < 0.001 (highly significant)
- Comparing 1000 queries before (mean: 50ms) vs 1000 queries after (mean: 500ms)
- Effect size: Cohen's d = 5.2 (very large)

---

### Dose-Response: ✓ PASS (4/5)

**Gradient observed** (table size vs. latency):
- Small tables (<10k rows): 20ms → 50ms (2.5x increase)
- Medium tables (100k rows): 50ms → 200ms (4x increase)
- Large tables (5M rows): 50ms → 500ms (10x increase)

**Mechanism**: Larger tables → more rows to scan → longer queries

**Interpretation**: Clear dose-response relationship strengthens the causal claim

---

### Counterfactual Test: ✓ PASS (5/5)

**Counterfactual question**: "What if the migration hadn't been deployed?"

**Test 1: Rollback experiment** (strongest evidence)
- **Action**: Rolled back the database migration March 11, 9:00 AM
- **Result**: Latency immediately returned to baseline (p95: 55ms)
- **Conclusion**: ✓ Removing the cause eliminates the effect (strong causation)

**Test 2: Control queries**
- **Tested**: Queries on tables NOT affected by the migration (no index changes)
- **Result**: Latency unchanged (p95: 50ms before and after migration)
- **Conclusion**: ✓ Only migrated tables affected (specificity)

**Test 3: Historical comparison**
- **Baseline period**: Jan - March 9 (no migration), p95: 50ms
- **Degradation period**: March 10-11 (migration active), p95: 500ms
- **Post-rollback**: March 11 onward (migration reverted), p95: 55ms
- **Conclusion**: ✓ Pattern strongly implicates the migration

**Verdict**: Counterfactual tests strongly confirm causation
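
The before/after and control-query comparisons above rely on per-statement latency data. As a hedged sketch of how that kind of data can be pulled, assuming the `pg_stat_statements` extension is enabled (the report does not state which tooling was actually used): its counters accumulate since the last reset, so snapshots are compared before and after the rollback rather than read once.

```sql
-- Top statements by total execution time, with mean latency per call.
-- mean_exec_time / total_exec_time are the PostgreSQL 13+ column names.
SELECT query,
       calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 0) AS total_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```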
---

### Mechanism: ✓ PASS (5/5)

**How the migration caused slow queries**:

1. **Migration removed indexes**:
```sql
-- Migration accidentally dropped these indexes:
DROP INDEX idx_users_user_id_created_at;
DROP INDEX idx_transactions_user_id;
```

2. **Query planner changed strategy**:
```
BEFORE (with index):
EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
→ Index Scan using idx_users_user_id_created_at (cost=0.43..8.45 rows=1)

AFTER (without index):
EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
→ Seq Scan on users (cost=0.00..112000.00 rows=5000000)
```

3. **Full table scans require disk I/O**:
   - Index scan: read 10-1000 rows (1-100 KB) from index + data pages
   - Full table scan: read 5M rows (5 GB) from disk
   - **50x-500x more I/O**

4. **Increased I/O saturates CPU and disk**:
   - CPU: scanning rows, filtering predicates (30% → 80%)
   - Disk: reading table pages (100 IOPS → 5000 IOPS)

5. **Saturation causes queuing**:
   - Slow queries → connections held longer
   - Connection pool saturates (95/100)
   - New queries wait for available connections
   - Latency compounds

**Plausibility**: Very high
- **Established theory**: Index scans vs. table scans (well-known database optimization behavior)
- **Quantifiable impact**: The I/O difference can be calculated (50x-500x)
- **Reproducible**: Same pattern in the staging environment

**Supporting evidence**:
- EXPLAIN ANALYZE output shows table scans post-migration
- PostgreSQL logs show "sequential scan" warnings
- Disk I/O metrics show a 50x increase correlated with the migration

**Verdict**: ✓ Mechanism fully explained with strong supporting evidence

---

### Consistency: ✓ PASS (5/5)

**The relationship holds across contexts**:

1. **All affected tables show the same pattern**:
   - users table: 50ms → 500ms
   - transactions table: 30ms → 300ms
   - orders table: 40ms → 400ms
   - **Consistent 10x degradation**

2. **All query types affected**:
   - SELECT: 10x slower
   - JOIN: 10x slower
   - Aggregations (COUNT, SUM): 10x slower

3. **Consistent across environments**:
   - Production: 50ms → 500ms
   - Staging: 45ms → 450ms (when the migration was tested)
   - Dev: 40ms → 400ms

4. **Consistent across time**:
   - March 10 14:15 - March 11 9:00: p95 at 500ms
   - Every hour during this period: ~500ms (stable degradation)

5. **Replication**:
   - Tested in staging: same migration → same 10x degradation
   - Rollback in staging: latency restored
   - **Reproducible causal relationship**

**Verdict**: ✓ Extremely consistent pattern across tables, query types, environments, and time

---

### Confounding Control: ✓ PASS (4/5)

**Potential confounders identified and ruled out**:

#### Confounder 1: Traffic spike
**Could a traffic spike explain both** the timing coincidence with the migration and the slow queries (overload)?

**Ruled out**:
- Traffic monitoring shows a flat 10k req/min (no spike)
- Even if traffic had spiked, it wouldn't explain why the migration rollback fixed latency

#### Confounder 2: Hardware degradation
**Could a hardware issue explain both** the coincidental timing and the slow queries (hardware bottleneck)?

**Ruled out**:
- Hardware metrics stable (CPU headroom, no new disk errors)
- Rollback immediately fixed latency (the hardware didn't suddenly improve)

#### Confounder 3: Application code change
**Could a code change explain both** buggy queries and the timing?

**Ruled out**:
- The code deploy was March 9 (24 hours before degradation)
- No query changes in that deploy (UI only)
- Rolling back the migration (not the code) fixed the issue

**Controlled variables**:
- ✓ Traffic (flat during the period)
- ✓ Hardware (stable metrics)
- ✓ Code (no query changes)
- ✓ External dependencies (no changes)

**Verdict**: ✓ No confounders found; only the migration variable changed
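
The consistency evidence above ("all affected tables show the same pattern") is also visible in PostgreSQL's per-table scan counters. A hedged sketch of that check using the standard `pg_stat_user_tables` view; after the migration one would expect `seq_scan` and `seq_tup_read` to climb on every affected table while `idx_scan` stalls:

```sql
-- Per-table scan counters for the tables named in the Consistency section.
-- Counters are cumulative since the last stats reset, so compare deltas
-- between snapshots rather than absolute values.
SELECT relname,
       seq_scan,
       seq_tup_read,
       idx_scan,
       n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('users', 'transactions', 'orders')
ORDER BY seq_tup_read DESC;
```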
---

### Bradford Hill Criteria: 25/27 (Very Strong)

| Criterion | Score | Justification |
|-----------|-------|---------------|
| **1. Strength** | 3/3 | 10x latency increase = very strong association |
| **2. Consistency** | 3/3 | Consistent across tables, queries, environments, time |
| **3. Specificity** | 3/3 | Only migrated tables affected; rollback restores; specific cause → specific effect |
| **4. Temporality** | 3/3 | Migration clearly precedes degradation by 5-15 min |
| **5. Dose-response** | 3/3 | Larger tables → greater latency increase (clear gradient) |
| **6. Plausibility** | 3/3 | Index vs. table scan theory well-established |
| **7. Coherence** | 3/3 | Fits database optimization knowledge, no contradictions |
| **8. Experiment** | 3/3 | Rollback experiment: removing the cause eliminates the effect |
| **9. Analogy** | 1/3 | Similar patterns exist (missing indexes → slow queries) but no perfect analogy |

**Total**: 25/27 = **Very strong causal evidence**

**Interpretation**: All criteria met except analogy, which is weak but not critical. Strong case for causation.

---

## 5. Conclusion

### Most Likely Root Cause

**Root cause**: The database migration removed indexes on `user_id` and `created_at` columns, forcing the query planner to use full table scans instead of efficient index scans.

**Confidence level**: **High (95%+)**

**Reasoning**:
1. **Clear temporal sequence**: Migration (2:00 PM) → degradation (2:15 PM)
2. **Strong counterfactual test**: Rollback immediately restored performance
3. **Clear mechanism**: Index scans (fast) → table scans (slow) with 50x-500x more I/O
4. **Dose-response**: Larger tables show greater degradation
5. **Consistency**: Pattern holds across all tables, queries, and environments
6. **No confounders**: Traffic, hardware, and code all controlled
7. **Bradford Hill 25/27**: Very strong causal evidence
8. **Reproducible**: Same effect in the staging environment

**Why this is the root cause (not a symptom)**:
- **Missing indexes** are the fundamental issue
- **High CPU/disk I/O** is a symptom of table scans
- **Slow queries** are a symptom of high resource usage
- Fixing the missing indexes eliminates all downstream symptoms

**Causal mechanism**:
```
Missing indexes (root)
    ↓
Query planner uses table scans (mechanism)
    ↓
50x-500x more disk I/O (mechanism)
    ↓
CPU & disk saturation (symptom)
    ↓
Query queuing, connection pool saturation (symptom)
    ↓
10x latency increase (observed effect)
```

---

### Alternative Explanations (Ruled Out)

#### Alternative 1: Traffic spike
**Why less likely**:
- Traffic monitoring shows a flat 10k req/min (no spike)
- Previous traffic spikes (15k req/min) were handled without issue
- Rollback fixed latency without any change in traffic

#### Alternative 2: Hardware degradation
**Why less likely**:
- Hardware metrics stable (no degradation)
- Sudden onset is inconsistent with hardware failure (usually gradual)
- Rollback immediately fixed the issue (the hardware didn't change)

#### Alternative 3: Application code bug
**Why less likely**:
- Code deploy was 24 hours before degradation (timing mismatch)
- No query-logic changes in that deploy
- Rolling back the migration (not the code) fixed the issue

---

### Unresolved Questions

1. **Why were the indexes removed?**
   - Migration script error? (likely)
   - Intentional optimization attempt gone wrong? (possible)
   - Need to review the migration PR and approval process

2. **How did this pass review?**
   - Were the index drops intentional or accidental?
   - Was the migration tested in staging before production?
   - Need process improvement

3. **Why didn't pre-deploy testing catch this?**
   - Staging environment testing missed it
   - Query performance tests were insufficient
   - Need better pre-deploy validation

---

## 6. Recommended Actions

### Immediate Actions (Address Root Cause)

**1. Re-add missing indexes** (DONE - March 11, 9:00 AM)
```sql
CREATE INDEX idx_users_user_id_created_at ON users(user_id, created_at);
CREATE INDEX idx_transactions_user_id ON transactions(user_id);
```
- **Result**: Latency restored to 55ms (within 5ms of baseline)
- **Time to fix**: 15 minutes (index creation)

**2. Validate index coverage** (IN PROGRESS)
- Audit all tables for missing indexes
- Compare production indexes to staging/dev
- Document expected indexes per table
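
One operational note on the index re-creation above: a plain `CREATE INDEX` takes a lock that blocks writes on the table for the duration of the build. A hedged variant using `CONCURRENTLY` avoids that write outage on a live 5M-row table, at the cost of a slower build, and it cannot run inside a transaction block. Whether this exact form was used for the March 11 fix is not recorded here; it is shown as a sketch of the safer option.

```sql
-- Build the replacement indexes without blocking writes on the live tables.
-- CONCURRENTLY must run outside a transaction block and takes longer to build.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_user_id_created_at
    ON users (user_id, created_at);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_transactions_user_id
    ON transactions (user_id);
```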
### Preventive Actions (Process Improvements)

**1. Improve migration review process**
- **Require EXPLAIN ANALYZE before/after** for all migrations
- **Staging performance tests mandatory** (query latency benchmarks)
- **Index change review**: any index drop requires extra approval

**2. Add pre-deploy validation**
- Automated query performance regression tests
- Alert if any query is >2x slower in staging
- Block deployment if performance degrades >20%

**3. Improve monitoring & alerting**
- Alert on index usage changes (track `pg_stat_user_indexes`); see the sketch at the end of this section
- Alert on query plan changes (seq scan warnings)
- Dashboards for index hit rate and table scan frequency

**4. Database migration checklist**
- [ ] EXPLAIN ANALYZE on affected queries
- [ ] Staging performance tests passed
- [ ] Index usage reviewed
- [ ] Rollback plan documented
- [ ] Monitoring in place

### Validation Tests (Confirm Fix)

**1. Performance benchmark** (DONE)
- **Test**: Run 1000 queries pre-fix vs. post-fix
- **Result**:
  - Pre-fix (migration): p95 = 500ms
  - Post-fix (indexes restored): p95 = 55ms
- **Conclusion**: ✓ Fix successful

**2. Load test** (DONE)
- **Test**: 15k req/min (1.5x normal traffic)
- **Result**: p95 = 60ms (acceptable, <10% degradation)
- **Conclusion**: ✓ System can handle load with indexes in place

**3. Index usage monitoring** (ONGOING)
- **Metrics**: `pg_stat_user_indexes` shows the indexes being used
- **Query plans**: EXPLAIN shows index scans (not seq scans)
- **Conclusion**: ✓ Indexes actively used

---

### Success Criteria

**Performance restored**:
- [x] p95 latency <100ms (achieved: 55ms)
- [x] p99 latency <200ms (achieved: 120ms)
- [x] CPU usage <50% (achieved: 35%)
- [x] Disk I/O <500 IOPS (achieved: 150 IOPS)
- [x] Connection pool utilization <70% (achieved: 45%)

**User impact resolved**:
- [x] Timeout rate <1% (achieved: 0.2%)
- [x] Support tickets normalized (dropped 80%)
- [x] Page load times back to normal

**Process improvements**:
- [x] Migration checklist created
- [x] Performance regression tests added to CI/CD
- [ ] Post-mortem doc written (IN PROGRESS)
- [ ] Team training on index optimization (SCHEDULED)
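
As referenced in the monitoring recommendation above, the index-usage alert can be driven by PostgreSQL's per-index statistics. A minimal sketch, assuming application tables live in the `public` schema; an expected index whose `idx_scan` stays at zero across a sampling window is a likely planner regression (or a dropped or renamed index):

```sql
-- Surface least-used indexes first; feed the result into the alerting pipeline.
SELECT relname       AS table_name,
       indexrelname  AS index_name,
       idx_scan,
       idx_tup_read
FROM pg_stat_user_indexes
WHERE schemaname = 'public'        -- assumption: application tables are in public
ORDER BY idx_scan ASC, relname;
```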
---

## 7. Limitations

### Data Limitations

**Missing data**:
- **No query performance baselines in staging**: Can't compare staging pre/post migration
- **Limited historical index usage data**: `pg_stat_user_indexes` data only retained for 7 days
- **No migration testing logs**: Can't determine whether the migration was tested in staging

**Measurement limitations**:
- **Latency measured at the application layer**: Database-internal latency not tracked separately
- **No per-query latency breakdown**: Can't isolate which specific queries were most affected

### Analysis Limitations

**Assumptions**:
- **Assumed connection pool refresh time**: Estimated 5 min for connections to cycle to the new schema (not measured)
- **Didn't test other potential optimizations**: Only tested rollback, not alternative fixes (e.g., query rewriting)

**Alternative fixes not explored**:
- Could queries be rewritten to work without indexes? (possible but not investigated)
- Could the connection pool be increased? (wouldn't fix the root cause)

### Generalizability

**Context-specific**:
- This analysis applies to PostgreSQL databases with similar query patterns
- May not apply to other database systems (MySQL, MongoDB, etc.) with different query optimizers
- Specific to tables with millions of rows (small tables are less affected)

**Lessons learned**:
- Index removal can cause 10x+ performance degradation on large tables
- Migration testing in staging must include performance benchmarks
- Rollback plans are essential for database schema changes

---

## 8. Meta: Analysis Quality Self-Assessment

Using `rubric_causal_inference_root_cause.json`:

### Scores:
1. **Effect Definition Clarity**: 5/5 (precise quantification, timeline, baseline, impact)
2. **Hypothesis Generation**: 5/5 (5 hypotheses, systematic evaluation)
3. **Root Cause Identification**: 5/5 (root vs. proximate distinguished, causal chain clear)
4. **Causal Model Quality**: 5/5 (full chain, mechanisms, confounders noted)
5. **Temporal Sequence Verification**: 5/5 (detailed timeline, lag explained)
6. **Counterfactual Testing**: 5/5 (rollback experiment + control queries)
7. **Mechanism Explanation**: 5/5 (detailed mechanism with EXPLAIN output evidence)
8. **Confounding Control**: 4/5 (identified and ruled out major confounders, comprehensive)
9. **Evidence Quality & Strength**: 5/5 (quasi-experiment via rollback, Bradford Hill 25/27)
10. **Confidence & Limitations**: 5/5 (explicit confidence, limitations, alternatives evaluated)

**Average**: 4.9/5 - **Excellent** (publication-quality analysis)

**Assessment**: This root cause analysis exceeds standards for high-stakes engineering decisions. Strong evidence across all criteria, particularly counterfactual testing (rollback experiment) and mechanism explanation (query plans). Appropriate for postmortem documentation and process improvement decisions.

---

## Appendix: Supporting Evidence

### A. Query Plans Before/After

**BEFORE (with index)**:
```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Index Scan using idx_users_user_id_created_at on users
  (cost=0.43..8.45 rows=1 width=152) (actual time=0.025..0.030 rows=1 loops=1)
  Index Cond: ((user_id = 123) AND (created_at > '2024-01-01'::date))
Planning Time: 0.112 ms
Execution Time: 0.052 ms   ← Fast
```

**AFTER (without index)**:
```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Seq Scan on users
  (cost=0.00..112000.00 rows=5000000 width=152) (actual time=0.025..485.234 rows=1 loops=1)
  Filter: ((user_id = 123) AND (created_at > '2024-01-01'::date))
  Rows Removed by Filter: 4999999
Planning Time: 0.108 ms
Execution Time: 485.267 ms   ← roughly 10,000x slower
```

### B. Monitoring Metrics

**Latency (p95)**:
- March 9: 50ms (stable)
- March 10 14:00: 50ms (pre-migration)
- March 10 14:15: 500ms (post-migration) ← **10x jump**
- March 11 09:00: 550ms (still degraded)
- March 11 09:15: 55ms (rollback restored)

**Database CPU**:
- Baseline: 30%
- March 10 14:15: 80% ← spike at migration time
- March 11 09:15: 35% (rollback restored)

**Disk I/O (IOPS)**:
- Baseline: 100 IOPS
- March 10 14:15: 5000 IOPS ← 50x increase
- March 11 09:15: 150 IOPS (rollback restored)
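
### C. Reproducing the Query Plans (Staging Sketch)

The before/after plans in Appendix A can be reproduced on a disposable staging copy without permanently altering the schema. This is a hedged sketch, not a record of what was run during the incident; it relies on PostgreSQL DDL being transactional, so the index created here is discarded by the final ROLLBACK.

```sql
-- Staging only. Plain CREATE INDEX (not CONCURRENTLY) is required inside a
-- transaction block; the ROLLBACK removes the index again.
BEGIN;

EXPLAIN ANALYZE
SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
-- expected: Seq Scan (index absent, as in the "AFTER" plan above)

CREATE INDEX idx_users_user_id_created_at ON users (user_id, created_at);

EXPLAIN ANALYZE
SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
-- expected: Index Scan using idx_users_user_id_created_at (as in the "BEFORE" plan)

ROLLBACK;
```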