
Incident Report: [Incident Title]

Date: YYYY-MM-DD
Time Started: HH:MM UTC
Time Resolved: HH:MM UTC (or "Ongoing")
Duration: X hours Y minutes
Severity: SEV1 / SEV2 / SEV3
Status: Investigating / Mitigating / Resolved


Summary

Brief one-paragraph description of what happened, impact, and current status.

Example:

On 2025-10-26 at 14:00 UTC, the API service became unavailable due to database connection pool exhaustion. All users were unable to access the application. The incident was mitigated at 14:30 UTC by restarting the payment service, which released the leaked connections; the code fix for the underlying connection leak is tracked under Action Items. Total downtime: 30 minutes.

Impact

Users Affected

  • Scope: All users / Partial / Specific region / Specific feature
  • Count: X,XXX users (or percentage)
  • Duration: HH:MM (how long were they affected)

Services Affected

  • Frontend/UI
  • Backend API
  • Database
  • Payment processing
  • Authentication
  • [Other service]

Business Impact

  • Revenue Lost: $X,XXX (if calculable)
  • SLA Breach: Yes / No (if applicable)
  • Customer Complaints: X tickets/emails
  • Reputation: Social media mentions, press coverage

Timeline

Detailed chronological timeline of events with timestamps.

Time (UTC) | Event                                             | Action Taken                      | By Whom
14:00      | First alert: "Database connection pool exhausted" | Alert triggered                   | Monitoring
14:02      | On-call engineer paged                            | Acknowledged alert                | SRE (Jane)
14:05      | Confirmed database connections at max (100/100)   | Checked pg_stat_activity          | SRE (Jane)
14:10      | Identified connection leak in payment service     | Reviewed application logs         | SRE (Jane)
14:15      | Restarted payment service                         | systemctl restart payment-service | SRE (Jane)
14:20      | Database connections normalized (20/100)          | Monitored connections             | SRE (Jane)
14:25      | Health checks passing                             | Verified /health endpoint         | SRE (Jane)
14:30      | Incident resolved                                 | Declared incident resolved        | SRE (Jane)

Root Cause

What broke: The payment service leaked database connections: connections opened for payment queries were never released.

Why it broke: A missing conn.close() in the error-handling path, so every failed query leaked its connection (see the sketch below the contributing factors).

What triggered it: High payment volume during the Black Friday sale.

Contributing factors:

  • Database connection pool size too small (100 connections)
  • No connection timeout configured
  • No monitoring alert for connection pool usage
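
The service's implementation is not included in this report; the sketch below is a hypothetical reconstruction of the failure pattern, assuming Python with psycopg2. The leak comes from closing the connection only on the happy path, so any exception skips the cleanup:

from contextlib import closing
import psycopg2

DSN = "dbname=payments"  # hypothetical connection string

# Buggy pattern: close() is only reached when nothing raises.
def record_payment_leaky(amount):
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("INSERT INTO payments (amount) VALUES (%s)", (amount,))
    conn.commit()
    conn.close()  # skipped entirely if execute() or commit() raises

# Fixed pattern: closing() guarantees conn.close() on every exit path;
# the inner "with conn" commits on success and rolls back on error.
def record_payment_fixed(amount):
    with closing(psycopg2.connect(DSN)) as conn:
        with conn:
            with conn.cursor() as cur:
                cur.execute("INSERT INTO payments (amount) VALUES (%s)", (amount,))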

Detection

How We Detected

  • Automated monitoring alert
  • User report
  • Internal team noticed
  • External vendor notification

Alert Details:

  • Alert name: "Database Connection Pool Exhausted"
  • Alert triggered at: 14:00 UTC
  • Time to detection: <1 minute (automated)
  • Time to acknowledgment: 2 minutes

Detection Quality

  • Good: Alert fired quickly (<1 minute)
  • To Improve: The alert only fired once the pool was fully exhausted; we need an earlier warning at 80% usage (see the sketch below)
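
One way to get that early warning is a periodic check that compares pg_stat_activity against the server's max_connections and pages past a threshold. A minimal sketch, assuming PostgreSQL and direct SQL access (the DSN and the alert hook are placeholders, not our real integration):

from contextlib import closing
import psycopg2

WARN_THRESHOLD = 0.80  # page at 80% of capacity, per the action items

def check_pool_usage(dsn="dbname=postgres"):  # hypothetical DSN
    with closing(psycopg2.connect(dsn)) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            used = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
    usage = used / limit
    if usage >= WARN_THRESHOLD:
        # Stand-in for a real pager/alerting integration.
        print(f"WARN: connection pool at {usage:.0%} ({used}/{limit})")
    return usage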

Response

Immediate Actions Taken

  1. Acknowledged alert (14:02)
  2. Checked database connection pool (14:05)
  3. Identified connection leak (14:10)
  4. Restarted payment service (14:15)
  5. Verified resolution (14:30)

What Worked Well

  • Monitoring detected issue quickly
  • Clear runbook for connection pool issues
  • SRE responded within 2 minutes
  • Root cause identified in 10 minutes

What Could Be Improved

  • Connection leak should have been caught in code review
  • No automated tests for connection cleanup
  • Connection pool too small for Black Friday traffic
  • No early warning alert (only alerted when 100% full)

Resolution

Short-term Fix (Immediate)

  • Restarted payment service to release connections
  • Manually monitored connection pool for 30 minutes

Long-term Fix (To Prevent Recurrence)

  • Fix connection leak in payment service code (PRIORITY 1)
  • Add automated test for connection cleanup (PRIORITY 1)
  • Increase connection pool size (100 → 200) (PRIORITY 2)
  • Add connection pool monitoring alert (>80%) (PRIORITY 2)
  • Add connection timeout (30 seconds; see the configuration sketch below) (PRIORITY 3)
  • Review all database queries for connection leaks (PRIORITY 3)
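
The report does not record how the payment service configures its client-side pool. If it uses SQLAlchemy (an assumption; the 100 → 200 change itself is the server-side max_connections and belongs to the DBA), the per-service sizing and 30-second timeout items might look like this:

from sqlalchemy import create_engine

# Hypothetical URL; the numbers mirror the action items above.
engine = create_engine(
    "postgresql://payments@db/payments",
    pool_size=20,        # steady-state connections this service may hold
    max_overflow=10,     # temporary burst headroom above pool_size
    pool_timeout=30,     # raise after 30 s waiting for a free connection
    pool_pre_ping=True,  # discard dead connections before reuse
)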

Communication

Internal Communication

  • Incident channel: #incident-20251026-db-pool
  • Participants: SRE (Jane), DevOps (John), Manager (Sarah)
  • Updates posted: Every 10 minutes

External Communication

  • Status page: Updated at 14:05, 14:20, 14:30
  • Customer email: Sent at 15:00 (post-incident)
  • Social media: Tweet at 14:10 acknowledging issue

Sample Status Page Update:

[14:05] Investigating: We are currently investigating an issue affecting API availability. Our team is actively working on a resolution.

[14:20] Monitoring: We have identified the issue and implemented a fix. We are monitoring the situation to ensure stability.

[14:30] Resolved: The issue has been resolved. All services are now operating normally. We apologize for the inconvenience.

Metrics

Response Time

  • Time to detect: <1 minute (excellent)
  • Time to acknowledge: 2 minutes (good)
  • Time to triage: 5 minutes (good)
  • Time to identify root cause: 10 minutes (good)
  • Time to resolution: 30 minutes (acceptable)

Availability

  • Uptime target: 99.9% (a budget of 43.2 minutes of downtime per month)
  • Actual downtime: 30 minutes
  • SLA breach: No (30 of the 43.2-minute monthly budget consumed)

Error Rate

  • Normal error rate: 0.1%
  • During incident: 100% (complete outage)
  • Peak error count: 10,000 errors

Action Items

# | Action                                     | Owner         | Priority | Due Date   | Status
1 | Fix connection leak in payment service    | Dev (Mike)    | P1       | 2025-10-27 | Pending
2 | Add automated test for connection cleanup | QA (Lisa)     | P1       | 2025-10-27 | Pending
3 | Increase connection pool size (100 → 200) | DBA (Tom)     | P2       | 2025-10-28 | Pending
4 | Add connection pool monitoring (>80%)     | SRE (Jane)    | P2       | 2025-10-28 | Pending
5 | Add connection timeout (30s)               | DBA (Tom)     | P3       | 2025-10-30 | Pending
6 | Review all queries for connection leaks   | Dev (Mike)    | P3       | 2025-11-02 | Pending
7 | Load test for Black Friday traffic        | DevOps (John) | P3       | 2025-11-10 | Pending

Lessons Learned

What Went Well

  • Monitoring detected issue immediately
  • Clear escalation path (on-call responded quickly)
  • Runbook helped identify issue faster
  • Communication was clear and timely

What Went Wrong

  • Connection leak made it to production (code review miss)
  • No automated test for connection cleanup
  • Connection pool too small for high-traffic event
  • No early warning alert (only alerted at 100%)

Action Items to Prevent Recurrence

  1. Code Quality: Add linter rule to check connection cleanup
  2. Testing: Add integration test for connection pool under load (sketched after this list)
  3. Monitoring: Add alert at 80% connection pool usage
  4. Capacity Planning: Review capacity before high-traffic events
  5. Runbook Update: Document connection leak troubleshooting
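
For item 2, the cleanup test could assert that the connection count returns to baseline even when a payment fails. A hedged pytest sketch (record_payment_fixed is the hypothetical function from the Root Cause sketch; it assumes a quiet test database so counts are stable):

from contextlib import closing
import psycopg2
import pytest

from payments.db import record_payment_fixed  # hypothetical module path

DSN = "dbname=payments_test"  # hypothetical test database

def count_connections():
    with closing(psycopg2.connect(DSN)) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            return cur.fetchone()[0]

def test_failed_payment_releases_connection():
    baseline = count_connections()
    with pytest.raises(psycopg2.DataError):
        record_payment_fixed("not-a-number")  # force the error path
    assert count_connections() <= baseline    # nothing leaked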

Appendices

Related Incidents

  • [2025-09-15] Database connection pool exhausted (similar issue)
  • [2025-08-10] Payment service OOM crash

Commands Run

# Check connection pool usage
SELECT count(*) FROM pg_stat_activity;

# List active (non-idle) connections to see who is holding them
SELECT * FROM pg_stat_activity WHERE state != 'idle';

# Restart the leaking service
systemctl restart payment-service

# Watch the connection count recover
watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'

Report Created By: Jane (SRE)
Report Date: 2025-10-26
Review Status: Pending / Reviewed / Approved
Reviewed By: [Name, Date]