Incident Report: [Incident Title]
Date: YYYY-MM-DD
Time Started: HH:MM UTC
Time Resolved: HH:MM UTC (or "Ongoing")
Duration: X hours Y minutes
Severity: SEV1 / SEV2 / SEV3
Status: Investigating / Mitigating / Resolved
Summary
Brief one-paragraph description of what happened, impact, and current status.
Example:
On 2025-10-26 at 14:00 UTC, the API service became unavailable due to database connection pool exhaustion. All users were unable to access the application. The issue was resolved at 14:30 UTC by restarting the database and fixing a connection leak in the payment service. Total downtime: 30 minutes.
Impact
Users Affected
- Scope: All users / Partial / Specific region / Specific feature
- Count: X,XXX users (or percentage)
- Duration: HH:MM (how long they were affected)
Services Affected
- Frontend/UI
- Backend API
- Database
- Payment processing
- Authentication
- [Other service]
Business Impact
- Revenue Lost: $X,XXX (if calculable)
- SLA Breach: Yes / No (if applicable)
- Customer Complaints: X tickets/emails
- Reputation: Social media mentions, press coverage
Timeline
Detailed chronological timeline of events with timestamps.
| Time (UTC) | Event | Action Taken | By Whom |
|---|---|---|---|
| 14:00 | First alert: "Database connection pool exhausted" | Alert triggered | Monitoring |
| 14:02 | On-call engineer paged | Acknowledged alert | SRE (Jane) |
| 14:05 | Confirmed database connections at max (100/100) | Checked pg_stat_activity | SRE (Jane) |
| 14:10 | Identified connection leak in payment service | Reviewed application logs | SRE (Jane) |
| 14:15 | Restarted payment service | systemctl restart payment-service | SRE (Jane) |
| 14:20 | Database connections normalized (20/100) | Monitored connections | SRE (Jane) |
| 14:25 | Health checks passing | Verified /health endpoint | SRE (Jane) |
| 14:30 | Incident resolved | Declared incident resolved | SRE (Jane) |
Root Cause
What broke: The payment service had a connection leak (database connections were not released after queries)
Why it broke: A missing conn.close() call in the error-handling path (see the sketch below)
What triggered it: High payment volume during the Black Friday sale
Contributing factors:
- Database connection pool size too small (100 connections)
- No connection timeout configured
- No monitoring alert for connection pool usage
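The report does not state the payment service's implementation language, so the following is a minimal Python-flavored sketch of the failure pattern and the fix, assuming a DB-API-style driver such as psycopg2. The connection string, table, and function names are illustrative, not taken from the actual codebase.

```python
import psycopg2

# Buggy pattern (as described in the root cause): the connection is only closed
# on the happy path, so any exception leaks one connection per failed request.
def charge_buggy(order_id):
    conn = psycopg2.connect("dbname=payments")  # connection string is illustrative
    cur = conn.cursor()
    cur.execute("SELECT amount FROM orders WHERE id = %s", (order_id,))
    amount = cur.fetchone()  # if execute/fetch raises, conn.close() below never runs
    conn.close()
    return amount

# Fixed pattern: try/finally (or a context manager) guarantees conn.close()
# runs on every code path, including the error path.
def charge_fixed(order_id):
    conn = psycopg2.connect("dbname=payments")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT amount FROM orders WHERE id = %s", (order_id,))
            return cur.fetchone()
    finally:
        conn.close()  # always runs, even when the query raises
```

Under high error rates (as during the Black Friday spike), the buggy variant leaks one connection per failure until the pool is exhausted; the fixed variant degrades gracefully.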
Detection
How We Detected
- Automated monitoring alert
- User report
- Internal team noticed
- External vendor notification
Alert Details:
- Alert name: "Database Connection Pool Exhausted"
- Alert triggered at: 14:00 UTC
- Time to detection: <1 minute (automated)
- Time to acknowledgment: 2 minutes
Detection Quality
- Good: Alert fired quickly (<1 minute)
- To Improve: Alert before the pool is exhausted (e.g. at 80% usage; see the example check below)
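One way to express that early-warning check, assuming PostgreSQL and that the 100-connection limit observed in pg_stat_activity corresponds to the server's max_connections:

```sql
-- Percentage of the server's connection slots currently in use.
-- Wire this into the monitoring system and alert when the value exceeds 80.
SELECT round(100.0 * count(*) / current_setting('max_connections')::int, 1)
       AS connection_pool_pct_used
FROM pg_stat_activity;
```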
Response
Immediate Actions Taken
- ✅ Acknowledged alert (14:02)
- ✅ Checked database connection pool (14:05)
- ✅ Identified connection leak (14:10)
- ✅ Restarted payment service (14:15)
- ✅ Verified resolution (14:30)
What Worked Well
- Monitoring detected issue quickly
- Clear runbook for connection pool issues
- SRE responded within 2 minutes
- Root cause identified in 10 minutes
What Could Be Improved
- Connection leak should have been caught in code review
- No automated tests for connection cleanup
- Connection pool too small for Black Friday traffic
- No early warning alert (only alerted when 100% full)
Resolution
Short-term Fix (Immediate)
- Restarted payment service to release connections
- Manually monitored connection pool for 30 minutes
Long-term Fix (To Prevent Recurrence)
- Fix connection leak in payment service code (PRIORITY 1)
- Add automated test for connection cleanup (PRIORITY 1)
- Increase connection pool size (100 → 200) (PRIORITY 2; see the settings sketch after this list)
- Add connection pool monitoring alert (>80%) (PRIORITY 2)
- Add connection timeout (30 seconds) (PRIORITY 3)
- Review all database queries for connection leaks (PRIORITY 3)
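A sketch of the pool-size and timeout changes at the database level, assuming PostgreSQL. The report does not specify whether the 30-second timeout is meant server-side or as a client pool acquisition timeout; the idle-in-transaction variant below is an assumption.

```sql
-- Raise the server-side connection limit (takes effect only after a PostgreSQL restart).
ALTER SYSTEM SET max_connections = 200;

-- Reclaim sessions left idle inside a transaction for more than 30 seconds,
-- so a future leak degrades slowly instead of exhausting the pool.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s';
SELECT pg_reload_conf();
```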
Communication
Internal Communication
- Incident channel: #incident-20251026-db-pool
- Participants: SRE (Jane), DevOps (John), Manager (Sarah)
- Updates posted: Every 10 minutes
External Communication
- Status page: Updated at 14:05, 14:20, 14:30
- Customer email: Sent at 15:00 (post-incident)
- Social media: Tweet at 14:10 acknowledging issue
Sample Status Page Update:
[14:05] Investigating: We are currently investigating an issue affecting API availability. Our team is actively working on a resolution.
[14:20] Monitoring: We have identified the issue and implemented a fix. We are monitoring the situation to ensure stability.
[14:30] Resolved: The issue has been resolved. All services are now operating normally. We apologize for the inconvenience.
Metrics
Response Time
- Time to detect: <1 minute (excellent)
- Time to acknowledge: 2 minutes (good)
- Time to triage: 5 minutes (good)
- Time to identify root cause: 10 minutes (good)
- Time to resolution: 30 minutes (acceptable)
Availability
- Uptime target: 99.9% (43.2 minutes downtime/month)
- Actual downtime: 30 minutes
- SLA breach: No (within monthly budget)
Error Rate
- Normal error rate: 0.1%
- During incident: 100% (complete outage)
- Peak error count: 10,000 errors
Action Items
| # | Action | Owner | Priority | Due Date | Status |
|---|---|---|---|---|---|
| 1 | Fix connection leak in payment service | Dev (Mike) | P1 | 2025-10-27 | Pending |
| 2 | Add automated test for connection cleanup | QA (Lisa) | P1 | 2025-10-27 | Pending |
| 3 | Increase connection pool size (100 → 200) | DBA (Tom) | P2 | 2025-10-28 | Pending |
| 4 | Add connection pool monitoring (>80%) | SRE (Jane) | P2 | 2025-10-28 | Pending |
| 5 | Add connection timeout (30s) | DBA (Tom) | P3 | 2025-10-30 | Pending |
| 6 | Review all queries for connection leaks | Dev (Mike) | P3 | 2025-11-02 | Pending |
| 7 | Load test for Black Friday traffic | DevOps (John) | P3 | 2025-11-10 | Pending |
Lessons Learned
What Went Well
- ✅ Monitoring detected issue immediately
- ✅ Clear escalation path (on-call responded quickly)
- ✅ Runbook helped identify issue faster
- ✅ Communication was clear and timely
What Went Wrong
- ❌ Connection leak made it to production (code review miss)
- ❌ No automated test for connection cleanup
- ❌ Connection pool too small for high-traffic event
- ❌ No early warning alert (only alerted at 100%)
Action Items to Prevent Recurrence
- Code Quality: Add linter rule to check connection cleanup
- Testing: Add integration test for connection cleanup under load (see the sketch below)
- Monitoring: Add alert at 80% connection pool usage
- Capacity Planning: Review capacity before high-traffic events
- Runbook Update: Document connection leak troubleshooting
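A minimal shape for that connection-cleanup test, assuming Python with pytest, psycopg2, and the requests library; the endpoint, database user, call count, and tolerance are placeholders to adapt to the real payment service, and the test should also cover requests that deliberately trigger the error path, since that is where the leak lived.

```python
import psycopg2
import requests

def count_connections():
    # Count open connections held by the payment service's database user (name assumed).
    conn = psycopg2.connect("dbname=payments")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT count(*) FROM pg_stat_activity WHERE usename = %s",
                ("payment_service",),
            )
            return cur.fetchone()[0]
    finally:
        conn.close()

def test_connections_released_under_load():
    baseline = count_connections()
    # Exercise the endpoint enough times that a per-request leak would be obvious,
    # including inputs that force the error-handling path.
    for _ in range(200):
        requests.post("http://localhost:8080/charge", json={"order_id": 1})
        requests.post("http://localhost:8080/charge", json={"order_id": None})
    # Allow a small tolerance for in-flight requests; a leak of hundreds fails clearly.
    assert count_connections() <= baseline + 5
```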
Appendices
Related Incidents
- [2025-09-15] Database connection pool exhausted (similar issue)
- [2025-08-10] Payment service OOM crash
Related Documentation
- Runbook: Connection Pool Issues
- Post-mortem: 2025-09-15 Database Incident
- Code: Payment Service
Commands Run
```bash
# Check current connection count against the pool limit
psql -c "SELECT count(*) FROM pg_stat_activity;"

# List active (non-idle) connections
psql -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

# Restart the leaking service
systemctl restart payment-service

# Monitor connections while they recover
watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'
```
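If the offending service had not been obvious from the logs, grouping connections by client is a quick way to find it (this assumes each service sets application_name in its connection string; otherwise group by usename or client_addr):

```sql
-- Which client/application holds the most connections?
SELECT application_name, usename, state, count(*)
FROM pg_stat_activity
GROUP BY application_name, usename, state
ORDER BY count(*) DESC;
```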
Report Created By: Jane (SRE)
Report Date: 2025-10-26
Review Status: Pending / Reviewed / Approved
Reviewed By: [Name, Date]