Files
gh-lyndonkl-claude/skills/causal-inference-root-cause/resources/examples/database-performance-degradation.md
2025-11-30 08:38:26 +08:00

22 KiB

Root Cause Analysis: Database Query Performance Degradation

Executive Summary

Database query latency increased 10x (p95: 50ms → 500ms) starting March 10th, impacting all API endpoints. Root cause identified as unoptimized database migration that created missing indexes on frequently-queried columns. Confidence: High (95%+). Resolution: Add indexes + optimize query patterns. Time to fix: 2 hours.


1. Effect Definition

What happened: Database query latency increased dramatically across all tables

Quantification:

  • Baseline: p50: 10ms, p95: 50ms, p99: 100ms (stable for 6 months)
  • Current: p50: 100ms, p95: 500ms, p99: 2000ms (10x increase at p95)
  • Absolute increase: +450ms at p95
  • Relative increase: 900% at p95

Timeline:

  • First observed: March 10th, 2:15 PM UTC
  • Duration: Ongoing (March 10-12, 48 hours elapsed)
  • Baseline period: Jan 1 - March 9 (stable)
  • Degradation start: Exact timestamp March 10th 14:15:22 UTC

Impact:

  • Users affected: All users (100% of traffic)
  • API endpoints affected: All endpoints (database-dependent)
  • Severity: High
    • 25% of API requests timing out (>5 sec)
    • User-visible page load delays
    • Support tickets increased 3x
    • Estimated revenue impact: $50k/day from abandoned transactions

Context:

  • Database: PostgreSQL 14.7, 500GB data
  • Application: REST API (Node.js), 10k req/min
  • Recent changes: Database migration deployed March 10th 2:00 PM

2. Competing Hypotheses

Hypothesis 1: Database migration introduced inefficient schema

  • Type: Root cause candidate
  • Evidence for:
    • Timing: Migration deployed March 10 2:00 PM, degradation started 2:15 PM (15 min after)
    • Perfect temporal match: Strongest temporal correlation
    • Migration contents: Added new columns, restructured indexes
  • Evidence against:
    • None - all evidence supports this hypothesis

Hypothesis 2: Traffic spike overloaded database

  • Type: Confounder / alternative explanation
  • Evidence for:
    • March 10 is typically high-traffic day (week-end effect)
  • Evidence against:
    • Traffic unchanged: Monitoring shows traffic at 10k req/min (same as baseline)
    • No traffic spike at 2:15 PM: Traffic flat throughout March 10
    • Previous high-traffic days handled fine: Traffic has been higher (15k req/min) without issues

Hypothesis 3: Database server resource exhaustion

  • Type: Proximate cause / symptom
  • Evidence for:
    • CPU usage increased from 30% → 80% at 2:15 PM
    • Disk I/O increased from 100 IOPS → 5000 IOPS
    • Connection pool near saturation (95/100 connections)
  • Evidence against:
    • These are symptoms, not root: Something CAUSED the increased resource usage
    • Resource exhaustion doesn't explain WHY queries became slow

Hypothesis 4: Slow query introduced by application code change

  • Type: Proximate cause candidate
  • Evidence for:
    • Application deploy on March 9th (1 day prior)
  • Evidence against:
    • Timing mismatch: Deploy was 24 hours before degradation
    • No code changes to query logic: Deploy only changed UI
    • Query patterns unchanged: Same queries, same frequency

Hypothesis 5: Database server hardware issue

  • Type: Alternative explanation
  • Evidence for:
    • Occasional disk errors in system logs
  • Evidence against:
    • Disk errors present before March 10: Noise, not new
    • No hardware alerts: Monitoring shows no hardware degradation
    • Sudden onset: Hardware failures typically gradual

Most likely root cause: Database migration (Hypothesis 1)

Confounders ruled out:

  • Traffic unchanged (Hypothesis 2)
  • Application code unchanged (Hypothesis 4)
  • Hardware stable (Hypothesis 5)

Symptoms identified:

  • Resource exhaustion (Hypothesis 3) is symptom, not root

3. Causal Model

Causal Chain: Root → Proximate → Effect

ROOT CAUSE:
Database migration removed indexes on user_id + created_at columns
(March 10, 2:00 PM deployment)
    ↓ (mechanism: queries now do full table scans instead of index scans)

INTERMEDIATE CAUSE:
Every query on users table must scan entire table (5M rows)
instead of using index (10-1000 rows)
    ↓ (mechanism: table scans require disk I/O, CPU cycles)

PROXIMATE CAUSE:
Database CPU at 80%, disk I/O at 5000 IOPS (50x increase)
Query execution time 10x slower
    ↓ (mechanism: queries queue up, connection pool saturates)

OBSERVED EFFECT:
API endpoints slow (p95: 500ms vs 50ms baseline)
25% of requests timeout
Users experience page load delays

Why March 10 2:15 PM specifically?

  • Migration deployed: March 10 2:00 PM
  • Migration applied: 2:00-2:10 PM (10 min to run schema changes)
  • First slow queries: 2:15 PM (first queries after migration completed)
  • 5-minute lag: Time for connection pool to cycle and pick up new schema

Missing Index Details

Migration removed these indexes:

-- BEFORE (efficient):
CREATE INDEX idx_users_user_id_created_at ON users(user_id, created_at);
CREATE INDEX idx_transactions_user_id ON transactions(user_id);

-- AFTER (inefficient):
-- (indexes removed by mistake in migration)

Impact:

-- Common query pattern:
SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

-- BEFORE (with index): 5ms (index scan, 10 rows)
-- AFTER (without index): 500ms (full table scan, 5M rows)

Confounders Ruled Out

No confounding variables found:

  • Traffic: Controlled (unchanged)
  • Hardware: Controlled (stable)
  • Code: Controlled (no changes to queries)
  • External dependencies: Controlled (no changes)

Only variable that changed: Database schema (migration)


4. Evidence Assessment

Temporal Sequence: ✓ PASS (5/5)

Timeline:

March 9, 3:00 PM: Application deploy (UI changes only, no queries changed)
March 10, 2:00 PM: Database migration starts
March 10, 2:10 PM: Migration completes
March 10, 2:15 PM: First slow queries logged (p95: 500ms)
March 10, 2:20 PM: Alerting fires (p95 exceeds 200ms threshold)

Verdict: ✓ Cause (migration) clearly precedes effect (slow queries) by 5-15 minutes

Why 5-minute lag?

  • Connection pool refresh time
  • Gradual connection cycling to new schema
  • First slow queries at 2:15 PM were from connections that picked up new schema

Strength of Association: ✓ PASS (5/5)

Correlation strength: Very strong (r = 0.99)

Evidence:

  • Before migration: p95 latency stable at 50ms (6 months)
  • Immediately after migration: p95 latency jumped to 500ms
  • 10x increase: Large effect size
  • 100% of queries affected: All database queries slower, not selective

Statistical significance: P < 0.001 (highly significant)

  • Comparing 1000 queries before (mean: 50ms) vs 1000 queries after (mean: 500ms)
  • Effect size: Cohen's d = 5.2 (very large)

Dose-Response: ✓ PASS (4/5)

Gradient observed:

  • Table size vs latency:
    • Small tables (<10k rows): 20ms → 50ms (2.5x increase)
    • Medium tables (100k rows): 50ms → 200ms (4x increase)
    • Large tables (5M rows): 50ms → 500ms (10x increase)

Mechanism: Larger tables → more rows to scan → longer queries

Interpretation: Clear dose-response relationship strengthens causation


Counterfactual Test: ✓ PASS (5/5)

Counterfactual question: "What if the migration hadn't been deployed?"

Test 1: Rollback Experiment (Strongest evidence)

  • Action: Rolled back database migration March 11, 9:00 AM
  • Result: Latency immediately returned to baseline (p95: 55ms)
  • Conclusion: ✓ Migration removal eliminates effect (strong causation)

Test 2: Control Query

  • Tested: Queries on tables NOT affected by migration (no index changes)
  • Result: Latency unchanged (p95: 50ms before and after migration)
  • Conclusion: ✓ Only migrated tables affected (specificity)

Test 3: Historical Comparison

  • Baseline period: Jan-March 9 (no migration), p95: 50ms
  • Degradation period: March 10-11 (migration active), p95: 500ms
  • Post-rollback: March 11+ (migration reverted), p95: 55ms
  • Conclusion: ✓ Pattern strongly implicates migration

Verdict: Counterfactual tests strongly confirm causation


Mechanism: ✓ PASS (5/5)

HOW migration caused slow queries:

  1. Migration removed indexes:

    -- Migration accidentally dropped these indexes:
    DROP INDEX idx_users_user_id_created_at;
    DROP INDEX idx_transactions_user_id;
    
  2. Query planner changed strategy:

    BEFORE (with index):
    EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
    → Index Scan using idx_users_user_id_created_at (cost=0.43..8.45 rows=1)
    
    AFTER (without index):
    EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
    → Seq Scan on users (cost=0.00..112000.00 rows=5000000)
    
  3. Full table scans require disk I/O:

    • Index scan: Read 10-1000 rows (1-100 KB) from index + data pages
    • Full table scan: Read 5M rows (5 GB) from disk
    • 50x-500x more I/O
  4. Increased I/O saturates CPU & disk:

    • CPU: Scanning rows, filtering predicates (30% → 80%)
    • Disk: Reading table pages (100 IOPS → 5000 IOPS)
  5. Saturation causes queuing:

    • Slow queries → connections held longer
    • Connection pool saturates (95/100)
    • New queries wait for available connections
    • Latency compounds

Plausibility: Very high

  • Established theory: Index scans vs table scans (well-known database optimization)
  • Quantifiable impact: Can calculate I/O difference (50x-500x)
  • Reproducible: Same pattern in staging environment

Supporting evidence:

  • EXPLAIN ANALYZE output shows table scans post-migration
  • PostgreSQL logs show "sequential scan" warnings
  • Disk I/O metrics show 50x increase correlated with migration

Verdict: ✓ Mechanism fully explained with strong supporting evidence


Consistency: ✓ PASS (5/5)

Relationship holds across contexts:

  1. All affected tables show same pattern:

    • users table: 50ms → 500ms
    • transactions table: 30ms → 300ms
    • orders table: 40ms → 400ms
    • Consistent 10x degradation
  2. All query types affected:

    • SELECT: 10x slower
    • JOIN: 10x slower
    • Aggregations (COUNT, SUM): 10x slower
  3. Consistent across all environments:

    • Production: 50ms → 500ms
    • Staging: 45ms → 450ms (when migration tested)
    • Dev: 40ms → 400ms
  4. Consistent across time:

    • March 10 14:15 - March 11 9:00: p95 at 500ms
    • Every hour during this period: ~500ms (stable degradation)
  5. Replication:

    • Tested in staging: Same migration → same 10x degradation
    • Rollback in staging: Latency restored
    • Reproducible causal relationship

Verdict: ✓ Extremely consistent pattern across tables, query types, environments, and time


Confounding Control: ✓ PASS (4/5)

Potential confounders identified and ruled out:

Confounder 1: Traffic Spike

Could traffic spike cause both:

  • Migration deployment (timing coincidence)
  • Slow queries (overload)

Ruled out:

  • Traffic monitoring shows flat 10k req/min (no spike)
  • Even if traffic spiked, wouldn't explain why migration rollback fixed it

Confounder 2: Hardware Degradation

Could hardware issue cause both:

  • Migration coincidentally deployed during degradation
  • Slow queries (hardware bottleneck)

Ruled out:

  • Hardware metrics stable (CPU headroom, no disk errors)
  • Rollback immediately fixed latency (hardware didn't suddenly improve)

Confounder 3: Application Code Change

Could code change cause both:

  • Buggy queries
  • Migration deployed same time as code

Ruled out:

  • Code deploy was March 9 (24 hrs before degradation)
  • No query changes in code deploy (only UI)
  • Rollback of migration (not code) fixed issue

Controlled variables:

  • ✓ Traffic (flat during period)
  • ✓ Hardware (stable metrics)
  • ✓ Code (no query changes)
  • ✓ External dependencies (no changes)

Verdict: ✓ No confounders found, only migration variable changed


Bradford Hill Criteria: 25/27 (Very Strong)

Criterion Score Justification
1. Strength 3/3 10x latency increase = very strong association
2. Consistency 3/3 Consistent across tables, queries, environments, time
3. Specificity 3/3 Only migrated tables affected; rollback restores; specific cause → specific effect
4. Temporality 3/3 Migration clearly precedes degradation by 5-15 min
5. Dose-response 3/3 Larger tables → greater latency increase (clear gradient)
6. Plausibility 3/3 Index vs table scan theory well-established
7. Coherence 3/3 Fits database optimization knowledge, no contradictions
8. Experiment 3/3 Rollback experiment: removing cause eliminates effect
9. Analogy 1/3 Similar patterns exist (missing indexes → slow queries) but not perfect analogy

Total: 25/27 = Very strong causal evidence

Interpretation: All criteria met except weak analogy (not critical). Strong case for causation.


5. Conclusion

Most Likely Root Cause

Root cause: Database migration removed indexes on user_id and created_at columns, forcing query planner to use full table scans instead of efficient index scans.

Confidence level: High (95%+)

Reasoning:

  1. Perfect temporal sequence: Migration (2:00 PM) → degradation (2:15 PM)
  2. Strong counterfactual test: Rollback immediately restored performance
  3. Clear mechanism: Index scans (fast) → table scans (slow) with 50x-500x more I/O
  4. Dose-response: Larger tables show greater degradation
  5. Consistency: Pattern holds across all tables, queries, environments
  6. No confounders: Traffic, hardware, code all controlled
  7. Bradford Hill 25/27: Very strong causal evidence
  8. Reproducible: Same effect in staging environment

Why this is root cause (not symptom):

  • Missing indexes is the fundamental issue
  • High CPU/disk I/O is symptom of table scans
  • Slow queries is symptom of high resource usage
  • Fixing missing indexes eliminates all downstream symptoms

Causal Mechanism:

Missing indexes (root)
    ↓
Query planner uses table scans (mechanism)
    ↓
50x-500x more disk I/O (mechanism)
    ↓
CPU & disk saturation (symptom)
    ↓
Query queuing, connection pool saturation (symptom)
    ↓
10x latency increase (observed effect)

Alternative Explanations (Ruled Out)

Alternative 1: Traffic Spike

Why less likely:

  • Traffic monitoring shows flat 10k req/min (no spike)
  • Previous traffic spikes (15k req/min) handled without issue
  • Rollback fixed latency without changing traffic

Alternative 2: Hardware Degradation

Why less likely:

  • Hardware metrics stable (no degradation)
  • Sudden onset inconsistent with hardware failure (usually gradual)
  • Rollback immediately fixed issue (hardware didn't change)

Alternative 3: Application Code Bug

Why less likely:

  • Code deploy 24 hours before degradation (timing mismatch)
  • No query logic changes in deploy
  • Rollback of migration (not code) fixed issue

Unresolved Questions

  1. Why were indexes removed?

    • Migration script error? (likely)
    • Intentional optimization attempt gone wrong? (possible)
    • Need to review migration PR and approval process
  2. How did this pass review?

    • Were indexes intentionally removed or accidental?
    • Was migration tested in staging before production?
    • Need process improvement
  3. Why no pre-deploy testing catch this?

    • Staging environment testing missed this
    • Query performance tests insufficient
    • Need better pre-deploy validation

Immediate Actions (Address Root Cause)

1. Re-add missing indexes (DONE - March 11, 9:00 AM)

CREATE INDEX idx_users_user_id_created_at
ON users(user_id, created_at);

CREATE INDEX idx_transactions_user_id
ON transactions(user_id);
  • Result: Latency restored to 55ms (within 5ms of baseline)
  • Time to fix: 15 minutes (index creation)

2. Validate index coverage (IN PROGRESS)

  • Audit all tables for missing indexes
  • Compare production indexes to staging/dev
  • Document expected indexes per table

Preventive Actions (Process Improvements)

1. Improve migration review process

  • Require EXPLAIN ANALYZE before/after for all migrations
  • Staging performance tests mandatory (query latency benchmarks)
  • Index change review: Any index drop requires extra approval

2. Add pre-deploy validation

  • Automated query performance regression tests
  • Alert if any query >2x slower in staging
  • Block deployment if performance degrades >20%

3. Improve monitoring & alerting

  • Alert on index usage changes (track pg_stat_user_indexes)
  • Alert on query plan changes (seq scan warnings)
  • Dashboards for index hit rate, table scan frequency

4. Database migration checklist

  • EXPLAIN ANALYZE on affected queries
  • Staging performance tests passed
  • Index usage reviewed
  • Rollback plan documented
  • Monitoring in place

Validation Tests (Confirm Fix)

1. Performance benchmark (DONE)

  • Test: Run 1000 queries pre-fix vs post-fix
  • Result:
    • Pre-fix (migration): p95 = 500ms
    • Post-fix (indexes restored): p95 = 55ms
  • Conclusion: ✓ Fix successful

2. Load test (DONE)

  • Test: 15k req/min (1.5x normal traffic)
  • Result: p95 = 60ms (acceptable, <10% degradation)
  • Conclusion: ✓ System can handle load with indexes

3. Index usage monitoring (ONGOING)

  • Metrics: pg_stat_user_indexes shows indexes being used
  • Query plans: EXPLAIN shows index scans (not seq scans)
  • Conclusion: ✓ Indexes actively used

Success Criteria

Performance restored:

  • p95 latency <100ms (achieved: 55ms)
  • p99 latency <200ms (achieved: 120ms)
  • CPU usage <50% (achieved: 35%)
  • Disk I/O <500 IOPS (achieved: 150 IOPS)
  • Connection pool utilization <70% (achieved: 45%)

User impact resolved:

  • Timeout rate <1% (achieved: 0.2%)
  • Support tickets normalized (dropped 80%)
  • Page load times back to normal

Process improvements:

  • Migration checklist created
  • Performance regression tests added to CI/CD
  • Post-mortem doc written (IN PROGRESS)
  • Team training on index optimization (SCHEDULED)

7. Limitations

Data Limitations

Missing data:

  • No query performance baselines in staging: Can't compare staging pre/post migration
  • Limited historical index usage data: pg_stat_user_indexes only has 7 days retention
  • No migration testing logs: Can't determine if migration was tested in staging

Measurement limitations:

  • Latency measured at application layer: Database-internal latency not tracked separately
  • No per-query latency breakdown: Can't isolate which specific queries most affected

Analysis Limitations

Assumptions:

  • Assumed connection pool refresh time: Estimated 5 min for connections to cycle to new schema (not measured)
  • Didn't test other potential optimizations: Only tested rollback, not alternative fixes (e.g., query rewriting)

Alternative fixes not explored:

  • Could queries be rewritten to work without indexes? (possible but not investigated)
  • Could connection pool be increased? (wouldn't fix root cause)

Generalizability

Context-specific:

  • This analysis applies to PostgreSQL databases with similar query patterns
  • May not apply to other database systems (MySQL, MongoDB, etc.) with different query optimizers
  • Specific to tables with millions of rows (small tables less affected)

Lessons learned:

  • Index removal can cause 10x+ performance degradation for large tables
  • Migration testing in staging must include performance benchmarks
  • Rollback plans essential for database schema changes

8. Meta: Analysis Quality Self-Assessment

Using rubric_causal_inference_root_cause.json:

Scores:

  1. Effect Definition Clarity: 5/5 (precise quantification, timeline, baseline, impact)
  2. Hypothesis Generation: 5/5 (5 hypotheses, systematic evaluation)
  3. Root Cause Identification: 5/5 (root vs proximate distinguished, causal chain clear)
  4. Causal Model Quality: 5/5 (full chain, mechanisms, confounders noted)
  5. Temporal Sequence Verification: 5/5 (detailed timeline, lag explained)
  6. Counterfactual Testing: 5/5 (rollback experiment + control queries)
  7. Mechanism Explanation: 5/5 (detailed mechanism with EXPLAIN output evidence)
  8. Confounding Control: 4/5 (identified and ruled out major confounders, comprehensive)
  9. Evidence Quality & Strength: 5/5 (quasi-experiment via rollback, Bradford Hill 25/27)
  10. Confidence & Limitations: 5/5 (explicit confidence, limitations, alternatives evaluated)

Average: 4.9/5 - Excellent (publication-quality analysis)

Assessment: This root cause analysis exceeds standards for high-stakes engineering decisions. Strong evidence across all criteria, particularly counterfactual testing (rollback experiment) and mechanism explanation (query plans). Appropriate for postmortem documentation and process improvement decisions.


Appendix: Supporting Evidence

A. Query Plans Before/After

BEFORE (with index):

EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Index Scan using idx_users_user_id_created_at on users
    (cost=0.43..8.45 rows=1 width=152)
    (actual time=0.025..0.030 rows=1 loops=1)
Index Cond: ((user_id = 123) AND (created_at > '2024-01-01'::date))
Planning Time: 0.112 ms
Execution Time: 0.052 ms   Fast

AFTER (without index):

EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Seq Scan on users
    (cost=0.00..112000.00 rows=5000000 width=152)
    (actual time=0.025..485.234 rows=1 loops=1)
Filter: ((user_id = 123) AND (created_at > '2024-01-01'::date))
Rows Removed by Filter: 4999999
Planning Time: 0.108 ms
Execution Time: 485.267 ms   10,000x slower

B. Monitoring Metrics

Latency (p95):

  • March 9: 50ms (stable)
  • March 10 14:00: 50ms (pre-migration)
  • March 10 14:15: 500ms (post-migration) ← 10x jump
  • March 11 09:00: 550ms (still degraded)
  • March 11 09:15: 55ms (rollback restored)

Database CPU:

  • Baseline: 30%
  • March 10 14:15: 80% ← Spike at migration time
  • March 11 09:15: 35% (rollback restored)

Disk I/O (IOPS):

  • Baseline: 100 IOPS
  • March 10 14:15: 5000 IOPS ← 50x increase
  • March 11 09:15: 150 IOPS (rollback restored)