Post-Mortem: [Incident Title]

Date of Incident: YYYY-MM-DD
Date of Post-Mortem: YYYY-MM-DD
Author: [Name]
Reviewers: [Names]
Severity: SEV1 / SEV2 / SEV3


Executive Summary

What Happened: [One-paragraph summary of incident]

Impact: [Brief impact summary - users, duration, business]

Root Cause: [Root cause in one sentence]

Resolution: [How it was fixed]

Example:

What Happened: On October 26, 2025, the application became unavailable for 30 minutes due to database connection pool exhaustion.

Impact: All users were unable to access the application from 14:00-14:30 UTC. Approximately 10,000 users affected.

Root Cause: Payment service had a connection leak (connections not properly closed in error handling path), which exhausted the database connection pool during high traffic.

Resolution: Application was restarted to release connections (immediate fix), and the connection leak was fixed in code (permanent fix).

Incident Details

Timeline

| Time (UTC) | Event | Actor |
|------------|-------|-------|
| 14:00 | Alert: "Database Connection Pool Exhausted" | Monitoring |
| 14:02 | On-call engineer paged | PagerDuty |
| 14:02 | Jane acknowledged alert | SRE (Jane) |
| 14:05 | Confirmed database connections at max (100/100) | SRE (Jane) |
| 14:08 | Checked application logs for connection usage | SRE (Jane) |
| 14:10 | Identified connection leak in payment service | SRE (Jane) |
| 14:12 | Decision: restart payment service to free connections | SRE (Jane) |
| 14:15 | Payment service restarted | SRE (Jane) |
| 14:17 | Database connections dropped to 20/100 | SRE (Jane) |
| 14:20 | Health checks passing, traffic restored | SRE (Jane) |
| 14:25 | Monitoring for stability | SRE (Jane) |
| 14:30 | Incident declared resolved | SRE (Jane) |
| 15:00 | Developer identified code fix | Dev (Mike) |
| 16:00 | Code fix deployed to production | Dev (Mike) |
| 16:30 | Verified no recurrence 30 minutes after fix deployed | SRE (Jane) |

Total Duration: 30 minutes (user-facing outage); 2.5 hours from first alert to verified code fix


Impact

Users Affected:

  • Scope: All users (100%)
  • Count: ~10,000 active users
  • Duration: 30 minutes complete outage

Services Affected:

  • Frontend (down - unable to reach backend)
  • Backend API (degraded - connection pool exhausted)
  • Database (saturated - all connections in use)
  • Authentication (not affected - separate service)
  • Payment processing (minimal impact - in-flight transactions were queued during the restart)

Business Impact:

  • Revenue Lost: $5,000 (estimated, based on 30 min downtime)
  • SLA Breach: No (30 min < 43.2 min monthly budget for 99.9%)
  • Customer Complaints: 47 support tickets, 12 social media mentions
  • Reputation: Minor (quickly resolved, transparent communication)

Root Cause Analysis

The Five Whys

1. Why did the application become unavailable? → Database connection pool was exhausted (100/100 connections in use)

2. Why was the connection pool exhausted? → Payment service had a connection leak (connections not being released)

3. Why were connections not being released? → Error handling path in payment service missing conn.close() in finally block

4. Why was the error path missing conn.close()? → Developer oversight during code review

5. Why didn't code review catch this? → No automated test or linter to check connection cleanup

Root Cause: Connection leak in payment service error handling path, compounded by lack of automated testing for connection cleanup.
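
For illustration, a minimal sketch of the failure pattern and the fix, in Python with a hypothetical pool object exposing getconn()/putconn() (the actual payment-service code is not shown in this report). The buggy version returns from the error path without releasing the connection; the fix moves cleanup into a finally block so it runs on every path.

```python
# Buggy pattern (illustrative): the error path never releases the connection.
def charge_buggy(pool, order_id):
    conn = pool.getconn()  # e.g. psycopg2.pool.SimpleConnectionPool
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT amount FROM orders WHERE id = %s", (order_id,))
            result = cur.fetchone()
    except Exception:
        return None  # leak: conn is never returned to the pool
    pool.putconn(conn)
    return result

# Fixed pattern: finally runs on success and error paths alike.
def charge_fixed(pool, order_id):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT amount FROM orders WHERE id = %s", (order_id,))
            return cur.fetchone()
    finally:
        pool.putconn(conn)  # always releases the connection
```

In the buggy version, every request that hit the error path pinned one connection permanently, which is how a 100-connection pool can be exhausted during a traffic spike.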


Contributing Factors

Technical Factors:

  1. Connection pool size too small (100 connections) for Black Friday traffic
  2. No connection timeout configured (connections held indefinitely)
  3. No monitoring alert for connection pool usage (only alerted at 100%)
  4. No circuit breaker to prevent cascade failures

Process Factors:

  1. Code review missed connection leak
  2. No automated test for connection cleanup
  3. No load testing before high-traffic event (Black Friday)
  4. No runbook for connection pool exhaustion

Human Factors:

  1. Developer unfamiliar with connection pool best practices
  2. Time pressure during feature development (rushed code review)

Detection and Response

Detection

How Detected: Automated monitoring alert

Alert: "Database Connection Pool Exhausted"

  • Trigger: SELECT count(*) FROM pg_stat_activity returns a value >= 100
  • Alert latency: <1 minute (excellent)
  • False positive rate: 0% (first time this alert fired)

Detection Quality:

  • Good: Alert fired quickly (<1 min after issue started)
  • To Improve: No early warning (should alert at 80%, not 100%)
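
A minimal sketch of that early warning, assuming direct Postgres access and a placeholder paging hook (names here are illustrative, not the team's actual tooling):

```python
import psycopg2  # assumed driver; any Postgres client works

EARLY_WARNING_RATIO = 0.8  # fire at 80% usage instead of waiting for 100%

def check_connection_usage(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            in_use = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
        if in_use >= limit * EARLY_WARNING_RATIO:
            send_alert(f"Connection usage high: {in_use}/{limit}")
    finally:
        conn.close()  # release the monitoring connection itself

def send_alert(message: str) -> None:
    """Placeholder for the real paging integration (e.g. PagerDuty)."""
    print("ALERT:", message)
```

Run on a schedule (cron or the existing monitoring agent); in practice this check belongs in the monitoring system rather than a standalone script.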

Response

Response Timeline:

  • Time to acknowledge: 2 minutes (target: <5 min)
  • Time to triage: 5 minutes (target: <10 min)
  • Time to identify root cause: 10 minutes (target: <30 min)
  • Time to mitigate: 15 minutes (target: <30 min)
  • Time to resolve: 30 minutes (target: <60 min)

What Worked Well:

  • Monitoring detected issue immediately
  • Clear escalation path (on-call responded in 2 min)
  • Good communication (updates every 10 min)
  • Quick diagnosis (root cause found in 10 min)

What Could Be Improved:

  • No runbook for this scenario (had to figure out on the spot)
  • No early warning alert (only alerted when 100% full)
  • Connection pool too small (should have been sized for traffic)

Resolution

Short-term Fix

Immediate (Restore service):

  1. Restarted payment service to release connections

    • systemctl restart payment-service
    • Impact: Service restored in 2 minutes
  2. Monitored connection pool for 30 minutes

    • Verified connections stayed <50%
    • No recurrence

Short-term (Prevent immediate recurrence):

  1. Fixed connection leak in payment service code

    • Added finally block with conn.close()
    • Deployed hotfix at 16:00 UTC
    • Verified no leak with load test
  2. Increased connection pool size

    • Changed max_connections from 100 to 200
    • Provides headroom for traffic spikes (a configuration sketch follows this list)
  3. Added connection pool monitoring alert

    • Alert at 80% usage (early warning)
    • Prevents exhaustion
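
A configuration sketch for the pool-size increase in item 2 (plus the 30-second timeout tracked as action item 3 below), assuming SQLAlchemy manages the application-side pool; the report does not name the actual stack, and the server-side max_connections in postgresql.conf must be raised to match:

```python
from sqlalchemy import create_engine  # assumed pooling layer

engine = create_engine(
    "postgresql://payment:***@db:5432/payments",  # placeholder DSN
    pool_size=200,      # raised from 100 for traffic-spike headroom
    max_overflow=0,     # hard cap so the app cannot exceed the DB limit
    pool_timeout=30,    # wait at most 30 s for a connection, then fail fast
    pool_recycle=1800,  # recycle idle connections so stale ones age out
)
```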

Long-term Prevention

Action Items (with owners and deadlines):

| # | Action | Priority | Owner | Due Date | Status |
|---|--------|----------|-------|----------|--------|
| 1 | Add automated test for connection cleanup | P1 | Lisa (QA) | 2025-10-27 | Done |
| 2 | Add linter rule to check connection cleanup | P1 | Mike (Dev) | 2025-10-27 | Done |
| 3 | Add connection timeout (30s) | P2 | Tom (DBA) | 2025-10-28 | In Progress |
| 4 | Review all DB queries for connection leaks | P2 | Mike (Dev) | 2025-11-02 | Planned |
| 5 | Load test before high-traffic events | P3 | John (DevOps) | 2025-11-10 | Planned |
| 6 | Create runbook: Connection Pool Issues | P3 | Jane (SRE) | 2025-10-28 | Done |
| 7 | Add circuit breaker to prevent cascades | P3 | Mike (Dev) | 2025-11-15 | Planned |

Lessons Learned

What Went Well

  1. Monitoring was effective

    • Alert fired within 1 minute of issue
    • Clear symptoms (connection pool full)
  2. Response was fast

    • On-call responded in 2 minutes
    • Root cause identified in 10 minutes
    • Service restored in 15 minutes
  3. Communication was clear

    • Updates every 10 minutes
    • Status page updated promptly
    • Customer support informed
  4. Team collaboration

    • SRE diagnosed, Developer fixed, DBA scaled
    • Clear roles and responsibilities

What Went Wrong

  1. Connection leak in production

    • Code review missed the leak
    • No automated test or linter
    • Developer unfamiliar with best practices
  2. No early warning

    • Alert only fired at 100% (too late)
    • Should alert at 80% for early action
  3. Capacity planning gap

    • Connection pool too small for Black Friday
    • No load testing before high-traffic event
  4. No runbook

    • Had to figure out diagnosis on the fly
    • Runbook would have saved 5-10 minutes
  5. No circuit breaker

    • Could have prevented full outage
    • Should fail gracefully, not cascade

Preventable?

YES - This incident was preventable.

How it could have been prevented:

  1. Automated test for connection cleanup → Would have caught the leak (see the test sketch after this list)
  2. Linter rule for connection cleanup → Would have caught in CI
  3. Load testing before Black Friday → Would have found pool too small
  4. Connection pool monitoring at 80% → Would have given early warning
  5. Code review focus on error paths → Would have caught missing finally
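
A sketch of what such a test could look like, assuming pytest, a dedicated test database, and hypothetical payment_service / PaymentError fixtures (none of these names come from the actual codebase):

```python
import psycopg2
import pytest

class PaymentError(Exception):
    """Stand-in for the service's real error type."""

def count_connections(dsn: str) -> int:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            return cur.fetchone()[0]
    finally:
        conn.close()

def test_error_path_releases_connection(dsn, payment_service):
    """Exercise the error path, then assert usage returns to baseline."""
    baseline = count_connections(dsn)
    with pytest.raises(PaymentError):
        payment_service.charge(order_id="does-not-exist")  # force the error path
    assert count_connections(dsn) == baseline  # a leaked connection fails here
```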

Prevention Strategies

Technical Improvements

  1. Automated Testing

    • Add integration test for connection cleanup
    • Add linter rule: require-connection-cleanup
    • Test error paths (not just happy path)
  2. Monitoring & Alerting

    • Alert at 80% connection pool usage (early warning)
    • Alert on increasing connection count (detect leaks early)
    • Dashboard for connection pool metrics
  3. Capacity Planning

    • Load test before high-traffic events
    • Review connection pool size quarterly
    • Auto-scaling for application (not just database)
  4. Resilience Patterns

    • Circuit breaker (prevent cascade failures; a minimal sketch follows this list)
    • Connection timeout (30s)
    • Graceful degradation (fallback data)
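
For the circuit breaker in item 4, a minimal hand-rolled sketch (a maintained library would be the production choice): after a threshold of consecutive failures the breaker opens, and callers fail fast instead of piling onto an exhausted pool.

```python
import time

class CircuitBreaker:
    """Illustrative only: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds before a half-open trial
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call; a failure re-opens immediately.
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Wrapping database calls in breaker.call(...) could have let the frontend degrade gracefully instead of queueing on a saturated pool.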

Process Improvements

  1. Code Review

    • Checklist: Connection cleanup in error paths
    • Required reviewer: Someone familiar with DB best practices
    • Automated checks (linter, tests)
  2. Runbooks

    • Create runbook: Connection Pool Exhaustion
    • Create runbook: Database Performance Issues
    • Quarterly runbook review/update
  3. Training

    • Database best practices training for developers
    • Connection pool management workshop
    • Incident response training
  4. Capacity Planning

    • Load test before high-traffic events (Black Friday, launch days)
    • Quarterly capacity review
    • Traffic forecasting for events

Cultural Improvements

  1. Blameless Culture

    • This post-mortem focuses on systems, not individuals
    • Goal: Learn and improve, not blame
  2. Psychological Safety

    • Encourage raising concerns (e.g., "I'm not sure about error handling")
    • No punishment for mistakes
  3. Continuous Learning

    • Share post-mortems org-wide
    • Regular incident review meetings
    • Learn from other teams' incidents

Recommendations

Immediate (This Week)

  • Fix connection leak in code (DONE)
  • Add connection pool monitoring at 80% (DONE)
  • Create runbook for connection pool issues (DONE)
  • Add automated test for connection cleanup
  • Add linter rule for connection cleanup

Short-term (This Month)

  • Add connection timeout configuration
  • Review all database queries for leaks
  • Load test with 10x traffic (see the sketch after this list)
  • Database best practices training
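
A sketch of such a load test, assuming Locust as the tool and a hypothetical /api/payments endpoint (adjust both for the real service):

```python
from locust import HttpUser, task, between  # assumed tool; any load tester works

class CheckoutUser(HttpUser):
    wait_time = between(1, 2)  # simulated think time between requests

    @task
    def pay(self):
        # Hypothetical endpoint; point at staging, never production.
        self.client.post("/api/payments", json={"order_id": 1, "amount": 100})

# Example run at roughly 10x normal peak:
#   locust -f loadtest.py --host https://staging.example.com -u 1000 -r 50
```

Watching pg_stat_activity during the run would reveal both an undersized pool and any leak that only shows under concurrency.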

Long-term (This Quarter)

  • Implement circuit breakers
  • Quarterly capacity planning process
  • Add auto-scaling for application tier
  • Regular runbook review/update process

Supporting Information

Related Incidents:

  • 2025-09-15: Database connection pool exhausted (similar issue)

    • Same root cause (connection leak)
    • Action items from that incident should have prevented this one
  • 2025-08-10: Payment service OOM crash

    • Memory leak, different symptom

Metrics

Availability:

  • Monthly uptime target: 99.9% (43.2 min downtime allowed)
  • This month actual: 99.93% (30 min downtime)
  • Status: Within SLA
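
For reference, the arithmetic behind those figures, assuming a 30-day month:

```python
minutes_per_month = 30 * 24 * 60          # 43,200 minutes in a 30-day month
budget = minutes_per_month * (1 - 0.999)  # 99.9% target -> 43.2 min allowed
actual = 1 - 30 / minutes_per_month       # 30 min down -> ~99.93% uptime
print(f"budget: {budget:.1f} min, actual: {actual:.4%}")
```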

MTTR (Mean Time To Resolution):

  • This incident: 30 minutes
  • Team average: 45 minutes
  • Status: Better than average

Acknowledgments

Thanks to:

  • Jane (SRE) - Quick diagnosis and mitigation
  • Mike (Developer) - Fast code fix
  • Tom (DBA) - Connection pool scaling
  • Customer Support team - Handling user complaints

Sign-off

This post-mortem has been reviewed and approved:

  • Author: Jane (SRE) - YYYY-MM-DD
  • Engineering Lead: Mike - YYYY-MM-DD
  • Manager: Sarah - YYYY-MM-DD
  • Action items tracked in: JIRA-1234

Next Review: [Date] - Check action item progress


Remember: Incidents are learning opportunities. The goal is not to find fault, but to improve our systems and processes.