Files

Zhongwei Li 41d9f6b189 Initial commit

2025-11-30 08:38:26 +08:00

22 KiB

Raw Blame History

Root Cause Analysis: Database Query Performance Degradation

Executive Summary

Database query latency increased 10x (p95: 50ms → 500ms) starting March 10th, impacting all API endpoints. Root cause identified as unoptimized database migration that created missing indexes on frequently-queried columns. Confidence: High (95%+). Resolution: Add indexes + optimize query patterns. Time to fix: 2 hours.

1. Effect Definition

What happened: Database query latency increased dramatically across all tables

Quantification:

Baseline: p50: 10ms, p95: 50ms, p99: 100ms (stable for 6 months)
Current: p50: 100ms, p95: 500ms, p99: 2000ms (10x increase at p95)
Absolute increase: +450ms at p95
Relative increase: 900% at p95

Timeline:

First observed: March 10th, 2:15 PM UTC
Duration: Ongoing (March 10-12, 48 hours elapsed)
Baseline period: Jan 1 - March 9 (stable)
Degradation start: Exact timestamp March 10th 14:15:22 UTC

Impact:

Users affected: All users (100% of traffic)
API endpoints affected: All endpoints (database-dependent)
Severity: High
- 25% of API requests timing out (>5 sec)
- User-visible page load delays
- Support tickets increased 3x
- Estimated revenue impact: $50k/day from abandoned transactions

Context:

Database: PostgreSQL 14.7, 500GB data
Application: REST API (Node.js), 10k req/min
Recent changes: Database migration deployed March 10th 2:00 PM

2. Competing Hypotheses

Hypothesis 1: Database migration introduced inefficient schema

Type: Root cause candidate
Evidence for:
- Timing: Migration deployed March 10 2:00 PM, degradation started 2:15 PM (15 min after)
- Perfect temporal match: Strongest temporal correlation
- Migration contents: Added new columns, restructured indexes
Evidence against:
- None - all evidence supports this hypothesis

Hypothesis 2: Traffic spike overloaded database

Type: Confounder / alternative explanation
Evidence for:
- March 10 is typically high-traffic day (week-end effect)
Evidence against:
- Traffic unchanged: Monitoring shows traffic at 10k req/min (same as baseline)
- No traffic spike at 2:15 PM: Traffic flat throughout March 10
- Previous high-traffic days handled fine: Traffic has been higher (15k req/min) without issues

Hypothesis 3: Database server resource exhaustion

Type: Proximate cause / symptom
Evidence for:
- CPU usage increased from 30% → 80% at 2:15 PM
- Disk I/O increased from 100 IOPS → 5000 IOPS
- Connection pool near saturation (95/100 connections)
Evidence against:
- These are symptoms, not root: Something CAUSED the increased resource usage
- Resource exhaustion doesn't explain WHY queries became slow

Hypothesis 4: Slow query introduced by application code change

Type: Proximate cause candidate
Evidence for:
- Application deploy on March 9th (1 day prior)
Evidence against:
- Timing mismatch: Deploy was 24 hours before degradation
- No code changes to query logic: Deploy only changed UI
- Query patterns unchanged: Same queries, same frequency

Hypothesis 5: Database server hardware issue

Type: Alternative explanation
Evidence for:
- Occasional disk errors in system logs
Evidence against:
- Disk errors present before March 10: Noise, not new
- No hardware alerts: Monitoring shows no hardware degradation
- Sudden onset: Hardware failures typically gradual

Most likely root cause: Database migration (Hypothesis 1)

Confounders ruled out:

Traffic unchanged (Hypothesis 2)
Application code unchanged (Hypothesis 4)
Hardware stable (Hypothesis 5)

Symptoms identified:

Resource exhaustion (Hypothesis 3) is symptom, not root

3. Causal Model

Causal Chain: Root → Proximate → Effect

ROOT CAUSE:
Database migration removed indexes on user_id + created_at columns
(March 10, 2:00 PM deployment)
    ↓ (mechanism: queries now do full table scans instead of index scans)

INTERMEDIATE CAUSE:
Every query on users table must scan entire table (5M rows)
instead of using index (10-1000 rows)
    ↓ (mechanism: table scans require disk I/O, CPU cycles)

PROXIMATE CAUSE:
Database CPU at 80%, disk I/O at 5000 IOPS (50x increase)
Query execution time 10x slower
    ↓ (mechanism: queries queue up, connection pool saturates)

OBSERVED EFFECT:
API endpoints slow (p95: 500ms vs 50ms baseline)
25% of requests timeout
Users experience page load delays

Why March 10 2:15 PM specifically?

Migration deployed: March 10 2:00 PM
Migration applied: 2:00-2:10 PM (10 min to run schema changes)
First slow queries: 2:15 PM (first queries after migration completed)
5-minute lag: Time for connection pool to cycle and pick up new schema

Missing Index Details

Migration removed these indexes:

-- BEFORE (efficient):
CREATE INDEX idx_users_user_id_created_at ON users(user_id, created_at);
CREATE INDEX idx_transactions_user_id ON transactions(user_id);

-- AFTER (inefficient):
-- (indexes removed by mistake in migration)

Impact:

-- Common query pattern:
SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

-- BEFORE (with index): 5ms (index scan, 10 rows)
-- AFTER (without index): 500ms (full table scan, 5M rows)

Confounders Ruled Out

No confounding variables found:

Traffic: Controlled (unchanged)
Hardware: Controlled (stable)
Code: Controlled (no changes to queries)
External dependencies: Controlled (no changes)

Only variable that changed: Database schema (migration)

4. Evidence Assessment

Temporal Sequence: ✓ PASS (5/5)

Timeline:

March 9, 3:00 PM: Application deploy (UI changes only, no queries changed)
March 10, 2:00 PM: Database migration starts
March 10, 2:10 PM: Migration completes
March 10, 2:15 PM: First slow queries logged (p95: 500ms)
March 10, 2:20 PM: Alerting fires (p95 exceeds 200ms threshold)

Verdict: ✓ Cause (migration) clearly precedes effect (slow queries) by 5-15 minutes

Why 5-minute lag?

Connection pool refresh time
Gradual connection cycling to new schema
First slow queries at 2:15 PM were from connections that picked up new schema

Strength of Association: ✓ PASS (5/5)

Correlation strength: Very strong (r = 0.99)

Evidence:

Before migration: p95 latency stable at 50ms (6 months)
Immediately after migration: p95 latency jumped to 500ms
10x increase: Large effect size
100% of queries affected: All database queries slower, not selective

Statistical significance: P < 0.001 (highly significant)

Comparing 1000 queries before (mean: 50ms) vs 1000 queries after (mean: 500ms)
Effect size: Cohen's d = 5.2 (very large)

Dose-Response: ✓ PASS (4/5)

Gradient observed:

Table size vs latency:
- Small tables (<10k rows): 20ms → 50ms (2.5x increase)
- Medium tables (100k rows): 50ms → 200ms (4x increase)
- Large tables (5M rows): 50ms → 500ms (10x increase)

Mechanism: Larger tables → more rows to scan → longer queries

Interpretation: Clear dose-response relationship strengthens causation

Counterfactual Test: ✓ PASS (5/5)

Counterfactual question: "What if the migration hadn't been deployed?"

Test 1: Rollback Experiment (Strongest evidence)

Action: Rolled back database migration March 11, 9:00 AM
Result: Latency immediately returned to baseline (p95: 55ms)
Conclusion: ✓ Migration removal eliminates effect (strong causation)

Test 2: Control Query

Tested: Queries on tables NOT affected by migration (no index changes)
Result: Latency unchanged (p95: 50ms before and after migration)
Conclusion: ✓ Only migrated tables affected (specificity)

Test 3: Historical Comparison

Baseline period: Jan-March 9 (no migration), p95: 50ms
Degradation period: March 10-11 (migration active), p95: 500ms
Post-rollback: March 11+ (migration reverted), p95: 55ms
Conclusion: ✓ Pattern strongly implicates migration

Verdict: Counterfactual tests strongly confirm causation

Mechanism: ✓ PASS (5/5)

HOW migration caused slow queries:

Migration removed indexes:

-- Migration accidentally dropped these indexes:
DROP INDEX idx_users_user_id_created_at;
DROP INDEX idx_transactions_user_id;

Query planner changed strategy:

BEFORE (with index):
EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
→ Index Scan using idx_users_user_id_created_at (cost=0.43..8.45 rows=1)

AFTER (without index):
EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
→ Seq Scan on users (cost=0.00..112000.00 rows=5000000)

Full table scans require disk I/O:
- Index scan: Read 10-1000 rows (1-100 KB) from index + data pages
- Full table scan: Read 5M rows (5 GB) from disk
- 50x-500x more I/O
Increased I/O saturates CPU & disk:
- CPU: Scanning rows, filtering predicates (30% → 80%)
- Disk: Reading table pages (100 IOPS → 5000 IOPS)
Saturation causes queuing:
- Slow queries → connections held longer
- Connection pool saturates (95/100)
- New queries wait for available connections
- Latency compounds

Plausibility: Very high

Established theory: Index scans vs table scans (well-known database optimization)
Quantifiable impact: Can calculate I/O difference (50x-500x)
Reproducible: Same pattern in staging environment

Supporting evidence:

EXPLAIN ANALYZE output shows table scans post-migration
PostgreSQL logs show "sequential scan" warnings
Disk I/O metrics show 50x increase correlated with migration

Verdict: ✓ Mechanism fully explained with strong supporting evidence

Consistency: ✓ PASS (5/5)

Relationship holds across contexts:

All affected tables show same pattern:
- users table: 50ms → 500ms
- transactions table: 30ms → 300ms
- orders table: 40ms → 400ms
- Consistent 10x degradation
All query types affected:
- SELECT: 10x slower
- JOIN: 10x slower
- Aggregations (COUNT, SUM): 10x slower
Consistent across all environments:
- Production: 50ms → 500ms
- Staging: 45ms → 450ms (when migration tested)
- Dev: 40ms → 400ms
Consistent across time:
- March 10 14:15 - March 11 9:00: p95 at 500ms
- Every hour during this period: ~500ms (stable degradation)
Replication:
- Tested in staging: Same migration → same 10x degradation
- Rollback in staging: Latency restored
- Reproducible causal relationship

Verdict: ✓ Extremely consistent pattern across tables, query types, environments, and time

Confounding Control: ✓ PASS (4/5)

Potential confounders identified and ruled out:

Confounder 1: Traffic Spike

Could traffic spike cause both:

Migration deployment (timing coincidence)
Slow queries (overload)

Ruled out:

Traffic monitoring shows flat 10k req/min (no spike)
Even if traffic spiked, wouldn't explain why migration rollback fixed it

Confounder 2: Hardware Degradation

Could hardware issue cause both:

Migration coincidentally deployed during degradation
Slow queries (hardware bottleneck)

Ruled out:

Hardware metrics stable (CPU headroom, no disk errors)
Rollback immediately fixed latency (hardware didn't suddenly improve)

Confounder 3: Application Code Change

Could code change cause both:

Buggy queries
Migration deployed same time as code

Ruled out:

Code deploy was March 9 (24 hrs before degradation)
No query changes in code deploy (only UI)
Rollback of migration (not code) fixed issue

Controlled variables:

✓ Traffic (flat during period)
✓ Hardware (stable metrics)
✓ Code (no query changes)
✓ External dependencies (no changes)

Verdict: ✓ No confounders found, only migration variable changed

Bradford Hill Criteria: 25/27 (Very Strong)

Criterion	Score	Justification
1. Strength	3/3	10x latency increase = very strong association
2. Consistency	3/3	Consistent across tables, queries, environments, time
3. Specificity	3/3	Only migrated tables affected; rollback restores; specific cause → specific effect
4. Temporality	3/3	Migration clearly precedes degradation by 5-15 min
5. Dose-response	3/3	Larger tables → greater latency increase (clear gradient)
6. Plausibility	3/3	Index vs table scan theory well-established
7. Coherence	3/3	Fits database optimization knowledge, no contradictions
8. Experiment	3/3	Rollback experiment: removing cause eliminates effect
9. Analogy	1/3	Similar patterns exist (missing indexes → slow queries) but not perfect analogy

Total: 25/27 = Very strong causal evidence

Interpretation: All criteria met except weak analogy (not critical). Strong case for causation.

5. Conclusion

Most Likely Root Cause

Root cause: Database migration removed indexes on user_id and created_at columns, forcing query planner to use full table scans instead of efficient index scans.

Confidence level: High (95%+)

Reasoning:

Perfect temporal sequence: Migration (2:00 PM) → degradation (2:15 PM)
Strong counterfactual test: Rollback immediately restored performance
Clear mechanism: Index scans (fast) → table scans (slow) with 50x-500x more I/O
Dose-response: Larger tables show greater degradation
Consistency: Pattern holds across all tables, queries, environments
No confounders: Traffic, hardware, code all controlled
Bradford Hill 25/27: Very strong causal evidence
Reproducible: Same effect in staging environment

Why this is root cause (not symptom):

Missing indexes is the fundamental issue
High CPU/disk I/O is symptom of table scans
Slow queries is symptom of high resource usage
Fixing missing indexes eliminates all downstream symptoms

Causal Mechanism:

Missing indexes (root)
    ↓
Query planner uses table scans (mechanism)
    ↓
50x-500x more disk I/O (mechanism)
    ↓
CPU & disk saturation (symptom)
    ↓
Query queuing, connection pool saturation (symptom)
    ↓
10x latency increase (observed effect)

Alternative Explanations (Ruled Out)

Alternative 1: Traffic Spike

Why less likely:

Traffic monitoring shows flat 10k req/min (no spike)
Previous traffic spikes (15k req/min) handled without issue
Rollback fixed latency without changing traffic

Alternative 2: Hardware Degradation

Why less likely:

Hardware metrics stable (no degradation)
Sudden onset inconsistent with hardware failure (usually gradual)
Rollback immediately fixed issue (hardware didn't change)

Alternative 3: Application Code Bug

Why less likely:

Code deploy 24 hours before degradation (timing mismatch)
No query logic changes in deploy
Rollback of migration (not code) fixed issue

Unresolved Questions

Why were indexes removed?
- Migration script error? (likely)
- Intentional optimization attempt gone wrong? (possible)
- Need to review migration PR and approval process
How did this pass review?
- Were indexes intentionally removed or accidental?
- Was migration tested in staging before production?
- Need process improvement
Why no pre-deploy testing catch this?
- Staging environment testing missed this
- Query performance tests insufficient
- Need better pre-deploy validation

6. Recommended Actions

Immediate Actions (Address Root Cause)

1. Re-add missing indexes (DONE - March 11, 9:00 AM)

CREATE INDEX idx_users_user_id_created_at
ON users(user_id, created_at);

CREATE INDEX idx_transactions_user_id
ON transactions(user_id);

Result: Latency restored to 55ms (within 5ms of baseline)
Time to fix: 15 minutes (index creation)

2. Validate index coverage (IN PROGRESS)

Audit all tables for missing indexes
Compare production indexes to staging/dev
Document expected indexes per table

Preventive Actions (Process Improvements)

1. Improve migration review process

Require EXPLAIN ANALYZE before/after for all migrations
Staging performance tests mandatory (query latency benchmarks)
Index change review: Any index drop requires extra approval

2. Add pre-deploy validation

Automated query performance regression tests
Alert if any query >2x slower in staging
Block deployment if performance degrades >20%

3. Improve monitoring & alerting

Alert on index usage changes (track pg_stat_user_indexes)
Alert on query plan changes (seq scan warnings)
Dashboards for index hit rate, table scan frequency

4. Database migration checklist

EXPLAIN ANALYZE on affected queries
Staging performance tests passed
Index usage reviewed
Rollback plan documented
Monitoring in place

Validation Tests (Confirm Fix)

1. Performance benchmark (DONE)

Test: Run 1000 queries pre-fix vs post-fix
Result:
- Pre-fix (migration): p95 = 500ms
- Post-fix (indexes restored): p95 = 55ms
Conclusion: ✓ Fix successful

2. Load test (DONE)

Test: 15k req/min (1.5x normal traffic)
Result: p95 = 60ms (acceptable, <10% degradation)
Conclusion: ✓ System can handle load with indexes

3. Index usage monitoring (ONGOING)

Metrics: pg_stat_user_indexes shows indexes being used
Query plans: EXPLAIN shows index scans (not seq scans)
Conclusion: ✓ Indexes actively used

Success Criteria

Performance restored:

p95 latency <100ms (achieved: 55ms)
p99 latency <200ms (achieved: 120ms)
CPU usage <50% (achieved: 35%)
Disk I/O <500 IOPS (achieved: 150 IOPS)
Connection pool utilization <70% (achieved: 45%)

User impact resolved:

Timeout rate <1% (achieved: 0.2%)
Support tickets normalized (dropped 80%)
Page load times back to normal

Process improvements:

Migration checklist created
Performance regression tests added to CI/CD
Post-mortem doc written (IN PROGRESS)
Team training on index optimization (SCHEDULED)

7. Limitations

Data Limitations

Missing data:

No query performance baselines in staging: Can't compare staging pre/post migration
Limited historical index usage data: pg_stat_user_indexes only has 7 days retention
No migration testing logs: Can't determine if migration was tested in staging

Measurement limitations:

Latency measured at application layer: Database-internal latency not tracked separately
No per-query latency breakdown: Can't isolate which specific queries most affected

Analysis Limitations

Assumptions:

Assumed connection pool refresh time: Estimated 5 min for connections to cycle to new schema (not measured)
Didn't test other potential optimizations: Only tested rollback, not alternative fixes (e.g., query rewriting)

Alternative fixes not explored:

Could queries be rewritten to work without indexes? (possible but not investigated)
Could connection pool be increased? (wouldn't fix root cause)

Generalizability

Context-specific:

This analysis applies to PostgreSQL databases with similar query patterns
May not apply to other database systems (MySQL, MongoDB, etc.) with different query optimizers
Specific to tables with millions of rows (small tables less affected)

Lessons learned:

Index removal can cause 10x+ performance degradation for large tables
Migration testing in staging must include performance benchmarks
Rollback plans essential for database schema changes

8. Meta: Analysis Quality Self-Assessment

Using rubric_causal_inference_root_cause.json:

Scores:

Effect Definition Clarity: 5/5 (precise quantification, timeline, baseline, impact)
Hypothesis Generation: 5/5 (5 hypotheses, systematic evaluation)
Root Cause Identification: 5/5 (root vs proximate distinguished, causal chain clear)
Causal Model Quality: 5/5 (full chain, mechanisms, confounders noted)
Temporal Sequence Verification: 5/5 (detailed timeline, lag explained)
Counterfactual Testing: 5/5 (rollback experiment + control queries)
Mechanism Explanation: 5/5 (detailed mechanism with EXPLAIN output evidence)
Confounding Control: 4/5 (identified and ruled out major confounders, comprehensive)
Evidence Quality & Strength: 5/5 (quasi-experiment via rollback, Bradford Hill 25/27)
Confidence & Limitations: 5/5 (explicit confidence, limitations, alternatives evaluated)

Average: 4.9/5 - Excellent (publication-quality analysis)

Assessment: This root cause analysis exceeds standards for high-stakes engineering decisions. Strong evidence across all criteria, particularly counterfactual testing (rollback experiment) and mechanism explanation (query plans). Appropriate for postmortem documentation and process improvement decisions.

Appendix: Supporting Evidence

A. Query Plans Before/After

BEFORE (with index):

EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Index Scan using idx_users_user_id_created_at on users
    (cost=0.43..8.45 rows=1 width=152)
    (actual time=0.025..0.030 rows=1 loops=1)
Index Cond: ((user_id = 123) AND (created_at > '2024-01-01'::date))
Planning Time: 0.112 ms
Execution Time: 0.052 ms  ← Fast

AFTER (without index):

EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Seq Scan on users
    (cost=0.00..112000.00 rows=5000000 width=152)
    (actual time=0.025..485.234 rows=1 loops=1)
Filter: ((user_id = 123) AND (created_at > '2024-01-01'::date))
Rows Removed by Filter: 4999999
Planning Time: 0.108 ms
Execution Time: 485.267 ms  ← 10,000x slower

B. Monitoring Metrics

Latency (p95):

March 9: 50ms (stable)
March 10 14:00: 50ms (pre-migration)
March 10 14:15: 500ms (post-migration) ← 10x jump
March 11 09:00: 550ms (still degraded)
March 11 09:15: 55ms (rollback restored)

Database CPU:

Baseline: 30%
March 10 14:15: 80% ← Spike at migration time
March 11 09:15: 35% (rollback restored)

Disk I/O (IOPS):

Baseline: 100 IOPS
March 10 14:15: 5000 IOPS ← 50x increase
March 11 09:15: 150 IOPS (rollback restored)

22 KiB Raw Blame History

Root Cause Analysis: Database Query Performance Degradation

Executive Summary

1. Effect Definition

2. Competing Hypotheses

Hypothesis 1: Database migration introduced inefficient schema

Hypothesis 2: Traffic spike overloaded database

Hypothesis 3: Database server resource exhaustion

Hypothesis 4: Slow query introduced by application code change

Hypothesis 5: Database server hardware issue

3. Causal Model

Causal Chain: Root → Proximate → Effect

Why March 10 2:15 PM specifically?

Missing Index Details

Confounders Ruled Out

4. Evidence Assessment

Temporal Sequence: ✓ PASS (5/5)

Strength of Association: ✓ PASS (5/5)

Dose-Response: ✓ PASS (4/5)

Counterfactual Test: ✓ PASS (5/5)

Mechanism: ✓ PASS (5/5)

Consistency: ✓ PASS (5/5)

Confounding Control: ✓ PASS (4/5)

Confounder 1: Traffic Spike

Confounder 2: Hardware Degradation

Confounder 3: Application Code Change

Bradford Hill Criteria: 25/27 (Very Strong)

5. Conclusion

Most Likely Root Cause

Alternative Explanations (Ruled Out)

Alternative 1: Traffic Spike

Alternative 2: Hardware Degradation

Alternative 3: Application Code Bug

Unresolved Questions

6. Recommended Actions

Immediate Actions (Address Root Cause)

Preventive Actions (Process Improvements)

Validation Tests (Confirm Fix)

Success Criteria

7. Limitations

Data Limitations

Analysis Limitations

Generalizability

8. Meta: Analysis Quality Self-Assessment

Scores:

Appendix: Supporting Evidence

A. Query Plans Before/After

B. Monitoring Metrics

22 KiB

Raw Blame History