Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/templates/incident-report.md
+++ b/agents/sre/templates/incident-report.md
@@ -0,0 +1,249 @@
+# Incident Report: [Incident Title]
+
+**Date**: YYYY-MM-DD
+**Time Started**: HH:MM UTC
+**Time Resolved**: HH:MM UTC (or "Ongoing")
+**Duration**: X hours Y minutes
+**Severity**: SEV1 / SEV2 / SEV3
+**Status**: Investigating / Mitigating / Resolved
+
+---
+
+## Summary
+
+Brief one-paragraph description of what happened, impact, and current status.
+
+**Example**:
+```
+On 2025-10-26 at 14:00 UTC, the API service became unavailable due to database connection pool exhaustion. All users were unable to access the application. The issue was resolved at 14:30 UTC by restarting the payment service, which released the connections leaked by its error-handling path. Total downtime: 30 minutes.
+```
+
+---
+
+## Impact
+
+### Users Affected
+- **Scope**: All users / Partial / Specific region / Specific feature
+- **Count**: X,XXX users (or percentage)
+- **Duration**: HH:MM (how long were they affected)
+
+### Services Affected
+- [ ] Frontend/UI
+- [ ] Backend API
+- [ ] Database
+- [ ] Payment processing
+- [ ] Authentication
+- [ ] [Other service]
+
+### Business Impact
+- **Revenue Lost**: $X,XXX (if calculable)
+- **SLA Breach**: Yes / No (if applicable)
+- **Customer Complaints**: X tickets/emails
+- **Reputation**: Social media mentions, press coverage
+
+---
+
+## Timeline
+
+Detailed chronological timeline of events with timestamps.
+
+| Time (UTC) | Event | Action Taken | By Whom |
+|------------|-------|--------------|---------|
+| 14:00 | First alert: "Database connection pool exhausted" | Alert triggered | Monitoring |
+| 14:02 | On-call engineer paged | Acknowledged alert | SRE (Jane) |
+| 14:05 | Confirmed database connections at max (100/100) | Checked pg_stat_activity | SRE (Jane) |
+| 14:10 | Identified connection leak in payment service | Reviewed application logs | SRE (Jane) |
+| 14:15 | Restarted payment service | systemctl restart payment-service | SRE (Jane) |
+| 14:20 | Database connections normalized (20/100) | Monitored connections | SRE (Jane) |
+| 14:25 | Health checks passing | Verified /health endpoint | SRE (Jane) |
+| 14:30 | Incident resolved | Declared incident resolved | SRE (Jane) |
+
+---
+
+## Root Cause
+
+**What broke**: The payment service had a connection leak (connections were not released after a query)
+
+**Why it broke**: Missing `conn.close()` in the error-handling path
+
+**What triggered it**: High payment volume (Black Friday sale)
+
+**Contributing factors**:
+- Database connection pool size too small (100 connections)
+- No connection timeout configured
+- No monitoring alert for connection pool usage
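
The leak described above (a connection that is never released when the query path raises) can be illustrated with a minimal sketch. The payment service's real code is not shown in this template, so `FakePool`, `pooled_connection`, and `charge` are hypothetical names used only for illustration; the point is that a `try`/`finally` (here wrapped in a context manager) returns the connection on both the success and the error path.

```python
from contextlib import contextmanager


class FakePool:
    """Hypothetical stand-in for a database connection pool."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # placeholder for a real connection object

    def release(self, conn):
        self.in_use -= 1


@contextmanager
def pooled_connection(pool):
    """Yield a connection and always return it to the pool, even on error."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)


def charge(pool, fail=False):
    """Hypothetical payment query; fail=True simulates the error path."""
    with pooled_connection(pool) as conn:
        if fail:
            raise ValueError("payment declined")
        return "ok"
```

With this shape, repeated failures in `charge` leave `pool.in_use` at zero, whereas a version that only releases the connection after a successful query would exhaust the pool.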
+
+---
+
+## Detection
+
+### How We Detected
+- [X] Automated monitoring alert
+- [ ] User report
+- [ ] Internal team noticed
+- [ ] External vendor notification
+
+**Alert Details**:
+- Alert name: "Database Connection Pool Exhausted"
+- Alert triggered at: 14:00 UTC
+- Time to detection: <1 minute (automated)
+- Time to acknowledgment: 2 minutes
+
+### Detection Quality
+- **Good**: Alert fired quickly (<1 min)
+- **To Improve**: The alert should fire before the pool is exhausted (at 80% usage)
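
The early-warning idea can be sketched as a small threshold check. This is an illustrative assumption, not the monitoring system's actual rule: the function name `pool_alert_level` and the `warn_percent` parameter are hypothetical, and the 80% and 100/100 figures come from the example in this template.

```python
def pool_alert_level(in_use, pool_size, warn_percent=80):
    """Classify connection pool usage as "ok", "warning", or "critical".

    Integer arithmetic is used so the comparison at exactly 80% is not
    affected by floating-point rounding.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    if in_use >= pool_size:
        return "critical"  # pool exhausted, as in the 14:00 alert
    if in_use * 100 >= pool_size * warn_percent:
        return "warning"  # early warning before exhaustion
    return "ok"
```

Under these assumptions, 20 of 100 connections is "ok", 80 of 100 raises a "warning", and 100 of 100 is "critical".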
+
+---
+
+## Response
+
+### Immediate Actions Taken
+1. ✅ Acknowledged alert (14:02)
+2. ✅ Checked database connection pool (14:05)
+3. ✅ Identified connection leak (14:10)
+4. ✅ Restarted payment service (14:15)
+5. ✅ Verified resolution (14:30)
+
+### What Worked Well
+- Monitoring detected issue quickly
+- Clear runbook for connection pool issues
+- SRE responded within 2 minutes
+- Root cause identified in 10 minutes
+
+### What Could Be Improved
+- Connection leak should have been caught in code review
+- No automated tests for connection cleanup
+- Connection pool too small for Black Friday traffic
+- No early warning alert (only alerted when 100% full)
+
+---
+
+## Resolution
+
+### Short-term Fix (Immediate)
+- Restarted payment service to release connections
+- Manually monitored connection pool for 30 minutes
+
+### Long-term Fix (To Prevent Recurrence)
+- [ ] Fix connection leak in payment service code (PRIORITY 1)
+- [ ] Add automated test for connection cleanup (PRIORITY 1)
+- [ ] Increase connection pool size (100 → 200) (PRIORITY 2)
+- [ ] Add connection pool monitoring alert (>80%) (PRIORITY 2)
+- [ ] Add connection timeout (30 seconds) (PRIORITY 3)
+- [ ] Review all database queries for connection leaks (PRIORITY 3)
+
+---
+
+## Communication
+
+### Internal Communication
+- **Incident channel**: #incident-20251026-db-pool
+- **Participants**: SRE (Jane), DevOps (John), Manager (Sarah)
+- **Updates posted**: Every 10 minutes
+
+### External Communication
+- **Status page**: Updated at 14:05, 14:20, 14:30
+- **Customer email**: Sent at 15:00 (post-incident)
+- **Social media**: Tweet at 14:10 acknowledging issue
+
+**Sample Status Page Update**:
+```
+[14:05] Investigating: We are currently investigating an issue affecting API availability. Our team is actively working on a resolution.
+
+[14:20] Monitoring: We have identified the issue and implemented a fix. We are monitoring the situation to ensure stability.
+
+[14:30] Resolved: The issue has been resolved. All services are now operating normally. We apologize for the inconvenience.
+```
+
+---
+
+## Metrics
+
+### Response Time
+- **Time to detect**: <1 minute (excellent)
+- **Time to acknowledge**: 2 minutes (good)
+- **Time to triage**: 5 minutes (good)
+- **Time to identify root cause**: 10 minutes (good)
+- **Time to resolution**: 30 minutes (acceptable)
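
The response-time figures above follow directly from the timeline timestamps. A minimal sketch of that arithmetic, assuming all events fall on the same UTC day; the `minutes_between` helper and the `timeline` dictionary keys are hypothetical names, and the times are the ones from the example timeline.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the example timeline in this template.
timeline = {
    "alert": "14:00",
    "acknowledged": "14:02",
    "triaged": "14:05",
    "root_cause": "14:10",
    "resolved": "14:30",
}

# Minutes elapsed from the first alert to each later milestone.
metrics = {
    name: minutes_between(timeline["alert"], stamp)
    for name, stamp in timeline.items()
    if name != "alert"
}
```

Time to detect is not derived here, since it depends on when the fault actually began rather than on when the alert fired.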
+
+### Availability
+- **Uptime target**: 99.9% (43.2 minutes downtime/month)
+- **Actual downtime**: 30 minutes
+- **SLA breach**: No (within monthly budget)
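
The 43.2-minute figure is the downtime allowed by a 99.9% target over a 30-day month (30 × 24 × 60 = 43,200 minutes, of which 0.1% may be downtime). A short sketch of that calculation, assuming a 30-day month; the function names are hypothetical.

```python
def monthly_downtime_budget_minutes(uptime_target_pct, days_in_month=30):
    """Minutes of downtime permitted per month at the given uptime target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - uptime_target_pct) / 100


def remaining_budget_minutes(uptime_target_pct, downtime_minutes, days_in_month=30):
    """Budget left after subtracting observed downtime (negative means a breach)."""
    budget = monthly_downtime_budget_minutes(uptime_target_pct, days_in_month)
    return budget - downtime_minutes
```

For the example incident, a 30-minute outage against a 99.9% target leaves roughly 13 minutes of budget for the rest of the month, which is why the SLA is recorded as not breached.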
+
+### Error Rate
+- **Normal error rate**: 0.1%
+- **During incident**: 100% (complete outage)
+- **Peak error count**: 10,000 errors
+
+---
+
+## Action Items
+
+| # | Action | Owner | Priority | Due Date | Status |
+|---|--------|-------|----------|----------|--------|
+| 1 | Fix connection leak in payment service | Dev (Mike) | P1 | 2025-10-27 | Pending |
+| 2 | Add automated test for connection cleanup | QA (Lisa) | P1 | 2025-10-27 | Pending |
+| 3 | Increase connection pool size (100 → 200) | DBA (Tom) | P2 | 2025-10-28 | Pending |
+| 4 | Add connection pool monitoring (>80%) | SRE (Jane) | P2 | 2025-10-28 | Pending |
+| 5 | Add connection timeout (30s) | DBA (Tom) | P3 | 2025-10-30 | Pending |
+| 6 | Review all queries for connection leaks | Dev (Mike) | P3 | 2025-11-02 | Pending |
+| 7 | Load test for Black Friday traffic | DevOps (John) | P3 | 2025-11-10 | Pending |
+
+---
+
+## Lessons Learned
+
+### What Went Well
+- ✅ Monitoring detected issue immediately
+- ✅ Clear escalation path (on-call responded quickly)
+- ✅ Runbook helped identify issue faster
+- ✅ Communication was clear and timely
+
+### What Went Wrong
+- ❌ Connection leak made it to production (code review miss)
+- ❌ No automated test for connection cleanup
+- ❌ Connection pool too small for high-traffic event
+- ❌ No early warning alert (only alerted at 100%)
+
+### Action Items to Prevent Recurrence
+1. **Code Quality**: Add linter rule to check connection cleanup
+2. **Testing**: Add integration test for connection pool under load
+3. **Monitoring**: Add alert at 80% connection pool usage
+4. **Capacity Planning**: Review capacity before high-traffic events
+5. **Runbook Update**: Document connection leak troubleshooting
+
+---
+
+## Appendices
+
+### Related Incidents
+- [2025-09-15] Database connection pool exhausted (similar issue)
+- [2025-08-10] Payment service OOM crash
+
+### Related Documentation
+- Runbook: [Connection Pool Issues](../playbooks/connection-pool-exhausted.md)
+- Post-mortem: [2025-09-15 Database Incident](../post-mortems/2025-09-15-db-pool.md)
+- Code: [Payment Service](https://github.com/example/payment-service)
+
+### Commands Run
+```bash
+# Count open connections (compare against the pool limit)
+psql -c "SELECT count(*) FROM pg_stat_activity;"
+
+# List non-idle sessions to spot connections that are being held open
+psql -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"
+
+# Restart service
+systemctl restart payment-service
+
+# Monitor connections
+watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'
+```
+
+---
+
+**Report Created By**: Jane (SRE)
+**Report Date**: 2025-10-26
+**Review Status**: Pending / Reviewed / Approved
+**Reviewed By**: [Name, Date]