Initial commit

Zhongwei Li
2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions


@@ -0,0 +1,249 @@
# Incident Report: [Incident Title]
**Date**: YYYY-MM-DD
**Time Started**: HH:MM UTC
**Time Resolved**: HH:MM UTC (or "Ongoing")
**Duration**: X hours Y minutes
**Severity**: SEV1 / SEV2 / SEV3
**Status**: Investigating / Mitigating / Resolved
---
## Summary
Brief one-paragraph description of what happened, impact, and current status.
**Example**:
```
On 2025-10-26 at 14:00 UTC, the API service became unavailable due to database connection pool exhaustion. All users were unable to access the application. The issue was resolved at 14:30 UTC by restarting the database and fixing a connection leak in the payment service. Total downtime: 30 minutes.
```
---
## Impact
### Users Affected
- **Scope**: All users / Partial / Specific region / Specific feature
- **Count**: X,XXX users (or percentage)
- **Duration**: HH:MM (how long users were affected)
### Services Affected
- [ ] Frontend/UI
- [ ] Backend API
- [ ] Database
- [ ] Payment processing
- [ ] Authentication
- [ ] [Other service]
### Business Impact
- **Revenue Lost**: $X,XXX (if calculable)
- **SLA Breach**: Yes / No (if applicable)
- **Customer Complaints**: X tickets/emails
- **Reputation**: Social media mentions, press coverage
---
## Timeline
Detailed chronological timeline of events with timestamps.
| Time (UTC) | Event | Action Taken | By Whom |
|------------|-------|--------------|---------|
| 14:00 | First alert: "Database connection pool exhausted" | Alert triggered | Monitoring |
| 14:02 | On-call engineer paged | Acknowledged alert | SRE (Jane) |
| 14:05 | Confirmed database connections at max (100/100) | Checked pg_stat_activity | SRE (Jane) |
| 14:10 | Identified connection leak in payment service | Reviewed application logs | SRE (Jane) |
| 14:15 | Restarted payment service | systemctl restart payment-service | SRE (Jane) |
| 14:20 | Database connections normalized (20/100) | Monitored connections | SRE (Jane) |
| 14:25 | Health checks passing | Verified /health endpoint | SRE (Jane) |
| 14:30 | Incident resolved | Declared incident resolved | SRE (Jane) |
---
## Root Cause
**What broke**: The payment service had a connection leak (connections were not released after queries)
**Why it broke**: A missing `conn.close()` in the error handling path (see the sketch below)
**What triggered it**: High payment volume (Black Friday sale)
**Contributing factors**:
- Database connection pool size too small (100 connections)
- No connection timeout configured
- No monitoring alert for connection pool usage
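**Illustrative sketch** (hypothetical reconstruction; a Python service using psycopg2 is assumed, and the DSN and table names are placeholders, not the actual payment service code):
```python
import psycopg2

DSN = "dbname=payments user=payment_svc"  # placeholder connection string

# Buggy pattern: the connection is only closed on the happy path.
# If execute() raises, the except block re-raises and the connection leaks.
def charge_buggy(order_id, amount):
    conn = psycopg2.connect(DSN)
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO charges (order_id, amount) VALUES (%s, %s)",
            (order_id, amount),
        )
        conn.commit()
    except Exception:
        conn.rollback()
        raise              # connection is never closed on this path
    conn.close()           # only reached when no exception occurred

# Fixed pattern: close the connection in a finally block so it is
# released on both the success path and the error path.
def charge_fixed(order_id, amount):
    conn = psycopg2.connect(DSN)
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO charges (order_id, amount) VALUES (%s, %s)",
            (order_id, amount),
        )
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()       # always releases the connection, even on errors
```
A `finally` block (or `contextlib.closing`) guarantees the connection is returned even when the query fails, which is exactly what the buggy error path missed.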
---
## Detection
### How We Detected
- [X] Automated monitoring alert
- [ ] User report
- [ ] Internal team noticed
- [ ] External vendor notification
**Alert Details**:
- Alert name: "Database Connection Pool Exhausted"
- Alert triggered at: 14:00 UTC
- Time to detection: <1 minute (automated)
- Time to acknowledgment: 2 minutes
### Detection Quality
- **Good**: Alert fired quickly (<1 min)
- **To Improve**: Need an alert BEFORE the pool is exhausted (at 80% usage)
---
## Response
### Immediate Actions Taken
1. ✅ Acknowledged alert (14:02)
2. ✅ Checked database connection pool (14:05)
3. ✅ Identified connection leak (14:10)
4. ✅ Restarted payment service (14:15)
5. ✅ Verified resolution (14:30)
### What Worked Well
- Monitoring detected issue quickly
- Clear runbook for connection pool issues
- SRE responded within 2 minutes
- Root cause identified in 10 minutes
### What Could Be Improved
- Connection leak should have been caught in code review
- No automated tests for connection cleanup
- Connection pool too small for Black Friday traffic
- No early warning alert (only alerted when 100% full)
---
## Resolution
### Short-term Fix (Immediate)
- Restarted payment service to release connections
- Manually monitored connection pool for 30 minutes
### Long-term Fix (To Prevent Recurrence)
- [ ] Fix connection leak in payment service code (PRIORITY 1)
- [ ] Add automated test for connection cleanup (PRIORITY 1)
- [ ] Increase connection pool size (100 → 200) (PRIORITY 2)
- [ ] Add connection pool monitoring alert (>80%) (PRIORITY 2)
- [ ] Add connection timeout (30 seconds) (PRIORITY 3; see the sketch after this list)
- [ ] Review all database queries for connection leaks (PRIORITY 3)
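**Illustrative sketch** for the connection-timeout item above: the limit can be enforced server-side (e.g. `statement_timeout` or `idle_in_transaction_session_timeout` in postgresql.conf) or client-side. The client-side variant below assumes a Python service using psycopg2; connection details are placeholders.
```python
import psycopg2

# Hypothetical client-side timeouts so a stuck query or an unreachable server
# cannot hold a pool slot indefinitely.
conn = psycopg2.connect(
    dbname="payments",
    user="payment_svc",
    host="db.internal",                    # placeholder host
    connect_timeout=30,                    # give up on connecting after 30 s
    options="-c statement_timeout=30000",  # cancel queries running longer than 30 s (ms)
)
```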
---
## Communication
### Internal Communication
- **Incident channel**: #incident-20251026-db-pool
- **Participants**: SRE (Jane), DevOps (John), Manager (Sarah)
- **Updates posted**: Every 10 minutes
### External Communication
- **Status page**: Updated at 14:05, 14:20, 14:30
- **Customer email**: Sent at 15:00 (post-incident)
- **Social media**: Tweet at 14:10 acknowledging issue
**Sample Status Page Update**:
```
[14:05] Investigating: We are currently investigating an issue affecting API availability. Our team is actively working on a resolution.
[14:20] Monitoring: We have identified the issue and implemented a fix. We are monitoring the situation to ensure stability.
[14:30] Resolved: The issue has been resolved. All services are now operating normally. We apologize for the inconvenience.
```
---
## Metrics
### Response Time
- **Time to detect**: <1 minute (excellent)
- **Time to acknowledge**: 2 minutes (good)
- **Time to triage**: 5 minutes (good)
- **Time to identify root cause**: 10 minutes (good)
- **Time to resolution**: 30 minutes (acceptable)
### Availability
- **Uptime target**: 99.9% (43.2 minutes downtime/month)
- **Actual downtime**: 30 minutes
- **SLA breach**: No (within monthly budget)
### Error Rate
- **Normal error rate**: 0.1%
- **During incident**: 100% (complete outage)
- **Peak error count**: 10,000 errors
---
## Action Items
| # | Action | Owner | Priority | Due Date | Status |
|---|--------|-------|----------|----------|--------|
| 1 | Fix connection leak in payment service | Dev (Mike) | P1 | 2025-10-27 | Pending |
| 2 | Add automated test for connection cleanup | QA (Lisa) | P1 | 2025-10-27 | Pending |
| 3 | Increase connection pool size (100 → 200) | DBA (Tom) | P2 | 2025-10-28 | Pending |
| 4 | Add connection pool monitoring (>80%) | SRE (Jane) | P2 | 2025-10-28 | Pending |
| 5 | Add connection timeout (30s) | DBA (Tom) | P3 | 2025-10-30 | Pending |
| 6 | Review all queries for connection leaks | Dev (Mike) | P3 | 2025-11-02 | Pending |
| 7 | Load test for Black Friday traffic | DevOps (John) | P3 | 2025-11-10 | Pending |
---
## Lessons Learned
### What Went Well
- ✅ Monitoring detected issue immediately
- ✅ Clear escalation path (on-call responded quickly)
- ✅ Runbook helped identify issue faster
- ✅ Communication was clear and timely
### What Went Wrong
- ❌ Connection leak made it to production (code review miss)
- ❌ No automated test for connection cleanup
- ❌ Connection pool too small for high-traffic event
- ❌ No early warning alert (only alerted at 100%)
### Action Items to Prevent Recurrence
1. **Code Quality**: Add linter rule to check connection cleanup (see the sketch after this list)
2. **Testing**: Add integration test for connection pool under load
3. **Monitoring**: Add alert at 80% connection pool usage
4. **Capacity Planning**: Review capacity before high-traffic events
5. **Runbook Update**: Document connection leak troubleshooting
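**Illustrative sketch** for action item 1: a deliberately naive connection-cleanup check built on Python's `ast` module. It flags functions that call `*.connect(...)` without a `finally` block calling `.close()`; a real rule would need tuning (for example, it ignores `with`-based cleanup and connection pools).
```python
import ast
import sys


class ConnectionCleanupChecker(ast.NodeVisitor):
    """Flag functions that open a connection but never close it in a finally block."""

    def __init__(self, filename):
        self.filename = filename
        self.findings = []

    def visit_FunctionDef(self, node):
        opens_connection = any(
            isinstance(n, ast.Call)
            and isinstance(n.func, ast.Attribute)
            and n.func.attr == "connect"
            for n in ast.walk(node)
        )
        closes_in_finally = any(
            isinstance(n, ast.Try)
            and any(
                isinstance(m, ast.Call)
                and isinstance(m.func, ast.Attribute)
                and m.func.attr == "close"
                for stmt in n.finalbody
                for m in ast.walk(stmt)
            )
            for n in ast.walk(node)
        )
        if opens_connection and not closes_in_finally:
            self.findings.append((node.lineno, node.name))
        self.generic_visit(node)

    visit_AsyncFunctionDef = visit_FunctionDef


def main(paths):
    exit_code = 0
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            tree = ast.parse(fh.read(), filename=path)
        checker = ConnectionCleanupChecker(path)
        checker.visit(tree)
        for lineno, name in checker.findings:
            print(f"{path}:{lineno}: '{name}' opens a connection without close() in a finally block")
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```
Run it in CI against the service source tree (for example `python check_connection_cleanup.py payment_service/*.py`) and fail the build on a non-zero exit code.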
---
## Appendices
### Related Incidents
- [2025-09-15] Database connection pool exhausted (similar issue)
- [2025-08-10] Payment service OOM crash
### Related Documentation
- Runbook: [Connection Pool Issues](../playbooks/connection-pool-exhausted.md)
- Post-mortem: [2025-09-15 Database Incident](../post-mortems/2025-09-15-db-pool.md)
- Code: [Payment Service](https://github.com/example/payment-service)
### Commands Run
```bash
# Check connection pool usage
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Identify active (non-idle) connections
psql -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"
# Restart service
systemctl restart payment-service
# Monitor connections
watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'
```
---
**Report Created By**: Jane (SRE)
**Report Date**: 2025-10-26
**Review Status**: Pending / Reviewed / Approved
**Reviewed By**: [Name, Date]


@@ -0,0 +1,375 @@
# Mitigation Plan: [Incident Title]
**Date**: YYYY-MM-DD HH:MM UTC
**Incident**: [Brief description]
**Root Cause**: [Root cause if known, or "Under investigation"]
**Severity**: SEV1 / SEV2 / SEV3
**Created By**: [Name]
---
## Executive Summary
**Problem**: [What's broken in one sentence]
**Impact**: [Who's affected and how]
**Solution**: [High-level approach]
**ETA**: [Estimated time to resolution]
**Example**:
```
Problem: Database connection pool exhausted due to connection leak
Impact: All users unable to access application (100% downtime)
Solution: Restart application + fix connection leak in code
ETA: 30 minutes (service restored in 5 min, permanent fix in 30 min)
```
---
## Three-Horizon Mitigation
### Immediate (Now - 5 minutes)
**Goal**: Stop the bleeding, restore service immediately
**Actions**:
- [ ] [Action 1]
- **What**: [Detailed description]
- **How**: [Commands/steps]
- **Impact**: [Expected improvement]
- **Risk**: [Low/Medium/High + explanation]
- **Rollback**: [How to undo if it fails]
- **ETA**: [Time to execute]
- **Owner**: [Who will do this]
**Example**:
```
- [ ] Restart payment service to release connections
- What: Restart payment service to release database connections
- How: `systemctl restart payment-service`
- Impact: All 100 connections released, service restored
- Risk: Low (stateless service, graceful restart)
- Rollback: N/A (restart is safe)
- ETA: 2 minutes
- Owner: Jane (SRE)
- [ ] Monitor connection pool for 5 minutes
- What: Verify connections stay below 80%
- How: `watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'`
- Impact: Early detection if issue recurs
- Risk: None (monitoring only)
- Rollback: N/A
- ETA: 5 minutes
- Owner: Jane (SRE)
```
**Success Criteria**:
- [ ] Service health check passing
- [ ] Users able to access application
- [ ] Connection pool <80% of max
- [ ] No active alerts
---
### Short-term (5 minutes - 1 hour)
**Goal**: Tactical fix to prevent immediate recurrence
**Actions**:
- [ ] [Action 1]
- **What**: [Detailed description]
- **How**: [Commands/steps]
- **Impact**: [Expected improvement]
- **Risk**: [Low/Medium/High + explanation]
- **Rollback**: [How to undo if it fails]
- **ETA**: [Time to execute]
- **Owner**: [Who will do this]
**Example**:
```
- [ ] Fix connection leak in payment service code
- What: Add `finally` block to close connection in error path
- How: Deploy hotfix branch `fix/connection-leak`
- Impact: Connections properly closed, no leak
- Risk: Medium (code change requires testing)
- Rollback: `git revert <commit>` + redeploy
- ETA: 30 minutes (test + deploy)
- Owner: Mike (Developer)
- [ ] Increase connection pool size
- What: Increase max_connections from 100 to 200
- How: psql -c "ALTER SYSTEM SET max_connections = 200" then restart PostgreSQL (max_connections is not applied by pg_reload_conf())
- Impact: More headroom for traffic spikes
- Risk: Medium (requires a brief database restart; more connections use more memory, but the server has capacity)
- Rollback: psql -c "ALTER SYSTEM SET max_connections = 100" then restart PostgreSQL
- ETA: 5 minutes
- Owner: Tom (DBA)
- [ ] Add connection pool monitoring alert
- What: Alert when connections >80% of max
- How: Create CloudWatch/Grafana alert
- Impact: Early warning before exhaustion
- Risk: None (monitoring only)
- Rollback: Disable alert
- ETA: 15 minutes
- Owner: Jane (SRE)
```
**Success Criteria**:
- [ ] Code fix deployed and verified
- [ ] Connection pool increased
- [ ] Monitoring alert configured
- [ ] No recurrence in 1 hour
- [ ] Load test passed (if applicable)
---
### Long-term (1 hour - days/weeks)
**Goal**: Permanent fix and prevention
**Actions**:
- [ ] [Action 1]
- **What**: [Detailed description]
- **Priority**: P1 / P2 / P3
- **Due Date**: [YYYY-MM-DD]
- **Owner**: [Who will do this]
**Example**:
```
- [ ] Add automated test for connection cleanup
- What: Integration test that verifies connections are closed in error paths
- Priority: P1
- Due Date: 2025-10-27
- Owner: Lisa (QA)
- [ ] Add connection timeout configuration
- What: Set connection_timeout = 30s in database config
- Priority: P2
- Due Date: 2025-10-28
- Owner: Tom (DBA)
- [ ] Review all database queries for connection leaks
- What: Audit all DB queries to ensure proper cleanup
- Priority: P3
- Due Date: 2025-11-02
- Owner: Mike (Developer)
- [ ] Load test for high-traffic events
- What: Load test with 10x normal traffic to find bottlenecks
- Priority: P3
- Due Date: 2025-11-10
- Owner: John (DevOps)
- [ ] Update runbook with new findings
- What: Document connection leak troubleshooting steps
- Priority: P3
- Due Date: 2025-10-28
- Owner: Jane (SRE)
```
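**Illustrative test sketch** for the automated-test item in the example above (assumptions: the service is Python, exposes a `charge()` function backed by psycopg2, and a test database is reachable; module, function, and DSN names are placeholders):
```python
import psycopg2
import pytest

from payment_service import charge  # hypothetical module under test

TEST_DSN = "dbname=payments_test user=test"  # placeholder test database


def count_connections():
    """Count connections currently open against the test database."""
    conn = psycopg2.connect(TEST_DSN)
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()"
        )
        return cur.fetchone()[0]
    finally:
        conn.close()


def test_connection_released_on_error_path():
    before = count_connections()
    # Force the error path; a negative amount is assumed to be rejected.
    with pytest.raises(Exception):
        charge(order_id=1, amount=-1)
    # The error path must not leave an extra connection behind.
    assert count_connections() <= before
```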
**Success Criteria**:
- [ ] All P1 actions completed
- [ ] Regression test added (prevents future occurrences)
- [ ] Monitoring improved (detect earlier)
- [ ] Runbook updated
- [ ] Post-mortem published
---
## Risk Assessment
### Risks of Mitigation Actions
| Action | Risk Level | Risk Description | Mitigation |
|--------|------------|------------------|------------|
| [Action 1] | Low/Med/High | [What could go wrong] | [How to reduce risk] |
**Example**:
```
| Restart service | Low | Brief downtime (5s) | Use graceful restart, off-peak time |
| Deploy code fix | Medium | Bug in fix could worsen issue | Test in staging first, have rollback ready |
| Increase connection pool | Low | More memory usage | Server has capacity, monitor memory |
```
### Risks of NOT Mitigating
| Risk | Impact | Probability |
|------|--------|-------------|
| [Risk 1] | [Impact if we do nothing] | High/Med/Low |
**Example**:
```
| Service remains down | All users affected, revenue loss | High (will recur) |
| Connection leak worsens | Database crashes | High |
| SLA breach | Customer refunds, reputation damage | Medium |
```
---
## Communication Plan
### Internal Communication
**Incident Channel**: #incident-YYYYMMDD-title
**Update Frequency**: Every [X] minutes
**Stakeholders to Notify**:
- [ ] Engineering team (#engineering)
- [ ] Customer support (#support)
- [ ] Management (#management)
- [ ] [Other teams]
**Update Template**:
```markdown
[HH:MM] Update:
- Status: [Investigating / Mitigating / Resolved]
- Root Cause: [Known / Under investigation]
- Current Action: [What we're doing now]
- Next Steps: [What's next]
- ETA: [Estimated resolution time]
```
---
### External Communication
**Status Page**: [URL]
**Update Frequency**: Every [X] minutes or when status changes
**Status Page Template**:
```markdown
[HH:MM] Investigating: We are currently investigating [issue description]. Our team is actively working on a resolution.
[HH:MM] Identified: We have identified the issue as [root cause]. We are implementing a fix. ETA: [time].
[HH:MM] Monitoring: The fix has been deployed. We are monitoring to ensure stability.
[HH:MM] Resolved: The issue has been fully resolved. All services are operating normally. We apologize for the inconvenience.
```
**Customer Email** (if needed):
- [ ] Draft email
- [ ] Approve with management
- [ ] Send to affected customers
---
## Validation
### Before Declaring Resolved
Verify all of the following:
- [ ] Root cause identified
- [ ] Immediate fix deployed and verified
- [ ] Service health check passing for >30 minutes
- [ ] Users able to access application
- [ ] Metrics returned to normal (response time, error rate, etc.)
- [ ] No active alerts
- [ ] Load test passed (if applicable)
- [ ] Customer support confirms no ongoing issues
### Monitoring After Resolution
Monitor for [X] hours after declaring resolved:
- [ ] [Metric 1] within normal range
- [ ] [Metric 2] within normal range
- [ ] [Metric 3] within normal range
- [ ] No error spikes
- [ ] No user complaints
**Example**:
```
- [ ] Connection pool <50% of max
- [ ] API response time <200ms (p95)
- [ ] Error rate <0.1%
- [ ] Database CPU <70%
```
---
## Rollback Plan
If mitigation actions fail or make things worse:
### Immediate Rollback
```bash
# Rollback code deployment
git revert <commit>
npm run deploy
# Rollback database config (max_connections only takes effect after a restart)
psql -c "ALTER SYSTEM SET max_connections = 100;"
systemctl restart postgresql
# Verify rollback
curl http://localhost/health
```
### When to Rollback
Rollback if:
- [ ] Issue worsens after mitigation
- [ ] New errors appear
- [ ] Service remains down >X minutes after mitigation
- [ ] Metrics worsen (response time, error rate)
---
## Next Steps
After incident is resolved:
1. [ ] Create post-mortem (within 24 hours)
- Owner: [Name]
- Due: [Date]
2. [ ] Schedule post-mortem review meeting
- Date: [Date]
- Attendees: [List]
3. [ ] Track action items to completion
- Use: [JIRA/GitHub/etc.]
- Review: Weekly in team meeting
4. [ ] Update runbooks based on learnings
- Owner: [Name]
- Due: [Date]
5. [ ] Share learnings with organization
- Format: All-hands presentation / Email / Wiki
- Owner: [Name]
- Due: [Date]
---
## Appendix
### Commands Reference
```bash
# Useful commands for this incident
<command1>
<command2>
<command3>
```
### Links
- **Monitoring Dashboard**: [URL]
- **Runbook**: [URL]
- **Related Incidents**: [URL]
- **Incident Channel**: [Slack/Teams URL]
---
**Plan Created**: YYYY-MM-DD HH:MM UTC
**Plan Updated**: YYYY-MM-DD HH:MM UTC
**Status**: Active / Executed / Superseded


@@ -0,0 +1,418 @@
# Post-Mortem: [Incident Title]
**Date of Incident**: YYYY-MM-DD
**Date of Post-Mortem**: YYYY-MM-DD
**Author**: [Name]
**Reviewers**: [Names]
**Severity**: SEV1 / SEV2 / SEV3
---
## Executive Summary
**What Happened**: [One-paragraph summary of incident]
**Impact**: [Brief impact summary - users, duration, business]
**Root Cause**: [Root cause in one sentence]
**Resolution**: [How it was fixed]
**Example**:
```
What Happened: On October 26, 2025, the application became unavailable for 30 minutes due to database connection pool exhaustion.
Impact: All users were unable to access the application from 14:00-14:30 UTC. Approximately 10,000 users affected.
Root Cause: Payment service had a connection leak (connections not properly closed in error handling path), which exhausted the database connection pool during high traffic.
Resolution: Application was restarted to release connections (immediate fix), and the connection leak was fixed in code (permanent fix).
```
---
## Incident Details
### Timeline
| Time (UTC) | Event | Actor |
|------------|-------|-------|
| 14:00 | Alert: "Database Connection Pool Exhausted" | Monitoring |
| 14:02 | On-call engineer paged | PagerDuty |
| 14:02 | Jane acknowledged alert | SRE (Jane) |
| 14:05 | Confirmed database connections at max (100/100) | SRE (Jane) |
| 14:08 | Checked application logs for connection usage | SRE (Jane) |
| 14:10 | Identified connection leak in payment service | SRE (Jane) |
| 14:12 | Decision: Restart payment service to free connections | SRE (Jane) |
| 14:15 | Payment service restarted | SRE (Jane) |
| 14:17 | Database connections dropped to 20/100 | SRE (Jane) |
| 14:20 | Health checks passing, traffic restored | SRE (Jane) |
| 14:25 | Monitoring for stability | SRE (Jane) |
| 14:30 | Incident declared resolved | SRE (Jane) |
| 15:00 | Developer identified code fix | Dev (Mike) |
| 16:00 | Code fix deployed to production | Dev (Mike) |
| 16:30 | Verified no recurrence after 1 hour | SRE (Jane) |
**Total Duration**: 30 minutes of outage; 2 hours 30 minutes from detection to full resolution (code fix deployed and verified)
---
### Impact
**Users Affected**:
- **Scope**: All users (100%)
- **Count**: ~10,000 active users
- **Duration**: 30 minutes complete outage
**Services Affected**:
- [x] Frontend (down - unable to reach backend)
- [x] Backend API (degraded - connection pool exhausted)
- [x] Database (saturated - all connections in use)
- [ ] Authentication (not affected - separate service)
- [ ] Payment processing (not affected - transactions queued)
**Business Impact**:
- **Revenue Lost**: $5,000 (estimated, based on 30 min downtime)
- **SLA Breach**: No (30 min < 43.2 min monthly budget for 99.9%)
- **Customer Complaints**: 47 support tickets, 12 social media mentions
- **Reputation**: Minor (quickly resolved, transparent communication)
---
## Root Cause Analysis
### The Five Whys
**1. Why did the application become unavailable?**
→ Database connection pool was exhausted (100/100 connections in use)
**2. Why was the connection pool exhausted?**
→ Payment service had a connection leak (connections not being released)
**3. Why were connections not being released?**
→ Error handling path in payment service missing `conn.close()` in `finally` block
**4. Why was the error path missing `conn.close()`?**
→ Developer oversight during code review
**5. Why didn't code review catch this?**
→ No automated test or linter to check connection cleanup
**Root Cause**: Connection leak in payment service error handling path, compounded by lack of automated testing for connection cleanup.
---
### Contributing Factors
**Technical Factors**:
1. Connection pool size too small (100 connections) for Black Friday traffic
2. No connection timeout configured (connections held indefinitely)
3. No monitoring alert for connection pool usage (only alerted at 100%)
4. No circuit breaker to prevent cascade failures
**Process Factors**:
1. Code review missed connection leak
2. No automated test for connection cleanup
3. No load testing before high-traffic event (Black Friday)
4. No runbook for connection pool exhaustion
**Human Factors**:
1. Developer unfamiliar with connection pool best practices
2. Time pressure during feature development (rushed code review)
---
## Detection and Response
### Detection
**How Detected**: Automated monitoring alert
**Alert**: "Database Connection Pool Exhausted"
- **Trigger**: connection count from `pg_stat_activity` >= 100 (pool at max_connections)
- **Alert latency**: <1 minute (excellent)
- **False positive rate**: 0% (first time this alert fired)
**Detection Quality**:
- **Good**: Alert fired quickly (<1 min after issue started)
- **To Improve**: No early warning (should alert at 80%, not 100%)
---
### Response
**Response Timeline**:
- **Time to acknowledge**: 2 minutes (target: <5 min) ✅
- **Time to triage**: 5 minutes (target: <10 min) ✅
- **Time to identify root cause**: 10 minutes (target: <30 min) ✅
- **Time to mitigate**: 15 minutes (target: <30 min) ✅
- **Time to resolve**: 30 minutes (target: <60 min) ✅
**What Worked Well**:
- ✅ Monitoring detected issue immediately
- ✅ Clear escalation path (on-call responded in 2 min)
- ✅ Good communication (updates every 10 min)
- ✅ Quick diagnosis (root cause found in 10 min)
**What Could Be Improved**:
- ❌ No runbook for this scenario (had to figure out on the spot)
- ❌ No early warning alert (only alerted when 100% full)
- ❌ Connection pool too small (should have been sized for traffic)
---
## Resolution
### Short-term Fix
**Immediate** (Restore service):
1. Restarted payment service to release connections
- `systemctl restart payment-service`
- Impact: Service restored in 2 minutes
2. Monitored connection pool for 30 minutes
- Verified connections stayed <50%
- No recurrence
**Short-term** (Prevent immediate recurrence):
1. Fixed connection leak in payment service code
- Added `finally` block with `conn.close()`
- Deployed hotfix at 16:00 UTC
- Verified no leak with load test
2. Increased connection pool size
- Changed `max_connections` from 100 to 200
- Provides headroom for traffic spikes
3. Added connection pool monitoring alert
- Alert at 80% usage (early warning)
- Prevents exhaustion
---
### Long-term Prevention
**Action Items** (with owners and deadlines):
| # | Action | Priority | Owner | Due Date | Status |
|---|--------|----------|-------|----------|--------|
| 1 | Add automated test for connection cleanup | P1 | Lisa (QA) | 2025-10-27 | ✅ Done |
| 2 | Add linter rule to check connection cleanup | P1 | Mike (Dev) | 2025-10-27 | ✅ Done |
| 3 | Add connection timeout (30s) | P2 | Tom (DBA) | 2025-10-28 | ⏳ In Progress |
| 4 | Review all DB queries for connection leaks | P2 | Mike (Dev) | 2025-11-02 | 📅 Planned |
| 5 | Load test before high-traffic events | P3 | John (DevOps) | 2025-11-10 | 📅 Planned |
| 6 | Create runbook: Connection Pool Issues | P3 | Jane (SRE) | 2025-10-28 | ✅ Done |
| 7 | Add circuit breaker to prevent cascades | P3 | Mike (Dev) | 2025-11-15 | 📅 Planned |
---
## Lessons Learned
### What Went Well
1. **Monitoring was effective**
- Alert fired within 1 minute of issue
- Clear symptoms (connection pool full)
2. **Response was fast**
- On-call responded in 2 minutes
- Root cause identified in 10 minutes
- Service restored in 15 minutes
3. **Communication was clear**
- Updates every 10 minutes
- Status page updated promptly
- Customer support informed
4. **Team collaboration**
- SRE diagnosed, Developer fixed, DBA scaled
- Clear roles and responsibilities
---
### What Went Wrong
1. **Connection leak in production**
- Code review missed the leak
- No automated test or linter
- Developer unfamiliar with best practices
2. **No early warning**
- Alert only fired at 100% (too late)
- Should alert at 80% for early action
3. **Capacity planning gap**
- Connection pool too small for Black Friday
- No load testing before high-traffic event
4. **No runbook**
- Had to figure out diagnosis on the fly
- Runbook would have saved 5-10 minutes
5. **No circuit breaker**
- Could have prevented full outage
- Should fail gracefully, not cascade
---
### Preventable?
**YES** - This incident was preventable.
**How it could have been prevented**:
1. ✅ Automated test for connection cleanup → Would have caught leak
2. ✅ Linter rule for connection cleanup → Would have caught in CI
3. ✅ Load testing before Black Friday → Would have found pool too small
4. ✅ Connection pool monitoring at 80% → Would have given early warning
5. ✅ Code review focus on error paths → Would have caught missing `finally`
---
## Prevention Strategies
### Technical Improvements
1. **Automated Testing**
- ✅ Add integration test for connection cleanup
- ✅ Add linter rule: `require-connection-cleanup`
- ✅ Test error paths (not just happy path)
2. **Monitoring & Alerting**
- ✅ Alert at 80% connection pool usage (early warning)
- ✅ Alert on increasing connection count (detect leaks early)
- ✅ Dashboard for connection pool metrics
3. **Capacity Planning**
- ✅ Load test before high-traffic events
- ✅ Review connection pool size quarterly
- ✅ Auto-scaling for application (not just database)
4. **Resilience Patterns**
- ⏳ Circuit breaker (prevent cascade failures; see the sketch after this list)
- ⏳ Connection timeout (30s)
- ⏳ Graceful degradation (fallback data)
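**Illustrative sketch** for the circuit-breaker item above (plain Python, no specific resilience library assumed): wrap the payment service's database calls so repeated failures fail fast instead of piling up connections.
```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    reject calls while open, allow a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


# Usage sketch (names are hypothetical):
# breaker = CircuitBreaker()
# breaker.call(charge, order_id=123, amount=49.99)
```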
---
### Process Improvements
1. **Code Review**
- ✅ Checklist: Connection cleanup in error paths
- ✅ Required reviewer: Someone familiar with DB best practices
- ✅ Automated checks (linter, tests)
2. **Runbooks**
- ✅ Create runbook: Connection Pool Exhaustion
- ⏳ Create runbook: Database Performance Issues
- ⏳ Quarterly runbook review/update
3. **Training**
- ⏳ Database best practices training for developers
- ⏳ Connection pool management workshop
- ⏳ Incident response training
4. **Capacity Planning**
- ✅ Load test before high-traffic events (Black Friday, launch days)
- ⏳ Quarterly capacity review
- ⏳ Traffic forecasting for events
---
### Cultural Improvements
1. **Blameless Culture**
- This post-mortem focuses on systems, not individuals
- Goal: Learn and improve, not blame
2. **Psychological Safety**
- Encourage raising concerns (e.g., "I'm not sure about error handling")
- No punishment for mistakes
3. **Continuous Learning**
- Share post-mortems org-wide
- Regular incident review meetings
- Learn from other teams' incidents
---
## Recommendations
### Immediate (This Week)
- [x] Fix connection leak in code (DONE)
- [x] Add connection pool monitoring at 80% (DONE)
- [x] Create runbook for connection pool issues (DONE)
- [ ] Add automated test for connection cleanup
- [ ] Add linter rule for connection cleanup
### Short-term (This Month)
- [ ] Add connection timeout configuration
- [ ] Review all database queries for leaks
- [ ] Load test with 10x traffic
- [ ] Database best practices training
### Long-term (This Quarter)
- [ ] Implement circuit breakers
- [ ] Quarterly capacity planning process
- [ ] Add auto-scaling for application tier
- [ ] Regular runbook review/update process
---
## Supporting Information
### Related Incidents
- **2025-09-15**: Database connection pool exhausted (similar issue)
- Same root cause (connection leak)
- Should have prevented this incident!
- **2025-08-10**: Payment service OOM crash
- Memory leak, different symptom
### Related Documentation
- [Database Architecture](https://wiki.example.com/db-arch)
- [Connection Pool Best Practices](https://wiki.example.com/db-pool)
- [Incident Response Process](https://wiki.example.com/incident-response)
### Metrics
**Availability**:
- Monthly uptime target: 99.9% (43.2 min downtime allowed)
- This month actual: 99.93% (30 min downtime)
- Status: ✅ Within SLA
**MTTR** (Mean Time To Resolution):
- This incident: 30 minutes
- Team average: 45 minutes
- Status: ✅ Better than average
---
## Acknowledgments
**Thanks to**:
- Jane (SRE) - Quick diagnosis and mitigation
- Mike (Developer) - Fast code fix
- Tom (DBA) - Connection pool scaling
- Customer Support team - Handling user complaints
---
## Sign-off
This post-mortem has been reviewed and approved:
- [x] Author: Jane (SRE) - YYYY-MM-DD
- [x] Engineering Lead: Mike - YYYY-MM-DD
- [x] Manager: Sarah - YYYY-MM-DD
- [x] Action items tracked in: [JIRA-1234](link)
**Next Review**: [Date] - Check action item progress
---
**Remember**: Incidents are learning opportunities. The goal is not to find fault, but to improve our systems and processes.


@@ -0,0 +1,412 @@
# Runbook: [Incident Type Title]
**Last Updated**: YYYY-MM-DD
**Owner**: Team/Person Name
**Severity**: SEV1 / SEV2 / SEV3
**Expected Time to Resolve**: X minutes
---
## Purpose
Brief description of what this runbook covers and when to use it.
**Example**:
```
This runbook provides step-by-step instructions for diagnosing and resolving database connection pool exhaustion issues. Use this runbook when you receive alerts about database connections reaching the maximum limit or when applications are unable to connect to the database.
```
---
## Symptoms
List of symptoms that indicate this issue.
- [ ] Alert: "[Alert Name]" triggered
- [ ] Error message: "[Specific error message]"
- [ ] Users report: "[User-facing symptom]"
- [ ] Monitoring shows: "[Metric/graph pattern]"
**Example**:
```
- [ ] Alert: "Database Connection Pool Exhausted" triggered
- [ ] Error message: "FATAL: remaining connection slots are reserved"
- [ ] Users report: Unable to log in or load pages
- [ ] Monitoring shows: Connection count = max_connections
```
---
## Prerequisites
What you need before starting:
- [ ] Access to: [Systems/tools required]
- [ ] Permissions: [Required permissions]
- [ ] Tools installed: [Required tools]
- [ ] Contact info: [Who to escalate to]
**Example**:
```
- [ ] SSH access to database server
- [ ] sudo privileges
- [ ] Database admin credentials
- [ ] Access to monitoring dashboard
- [ ] Escalation: DBA team (#database-team)
```
---
## Quick Reference
**TL;DR** for experienced responders:
```bash
# 1. Check connection count
psql -c "SELECT count(*) FROM pg_stat_activity"
# 2. Identify active (non-idle) connections
psql -c "SELECT * FROM pg_stat_activity WHERE state != 'idle'"
# 3. Kill idle connections
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction'"
# 4. Restart application
systemctl restart application
# 5. Monitor
watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'
```
---
## Detailed Diagnosis
Step-by-step diagnostic process.
### Step 1: [First Diagnostic Step]
**What to do**:
```bash
# Commands to run
<command>
```
**What to look for**:
- [ ] Expected output: `<expected>`
- [ ] Problem indicator: `<problem>`
**Example**:
```bash
# Check current connection count
psql -c "SELECT count(*) FROM pg_stat_activity"
```
**What to look for**:
- [ ] Normal: count < 80 (if max = 100)
- [ ] Warning: count 80-95
- [ ] Critical: count >= 100
---
### Step 2: [Second Diagnostic Step]
**What to do**:
```bash
# Commands to run
<command>
```
**What to look for**:
- [ ] Expected output: `<expected>`
- [ ] Problem indicator: `<problem>`
**Example**:
```bash
# Identify idle connections
psql -c "SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction'"
```
**What to look for**:
- [ ] No results: No idle transactions (good)
- [ ] Many results: Connection leak (problem)
---
### Step 3: [Identify Root Cause]
Based on symptoms, identify likely root cause:
| Symptom | Root Cause |
|---------|------------|
| [Symptom 1] | [Likely cause 1] |
| [Symptom 2] | [Likely cause 2] |
| [Symptom 3] | [Likely cause 3] |
**Example**:
```
| Many idle transactions | Connection leak (connections not closed) |
| All connections active | High load (scale up) |
| Specific app connections | Application issue |
```
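**Illustrative diagnostic sketch** to tell these causes apart quickly (assumes PostgreSQL and a Python environment with psycopg2; the DSN is a placeholder). It groups connections by application and state, so "many idle in transaction" versus "all active" is visible at a glance.
```python
import psycopg2

DSN = "host=db.internal dbname=app user=readonly"  # placeholder DSN

QUERY = """
SELECT application_name,
       state,
       count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, state
ORDER BY connections DESC
"""

conn = psycopg2.connect(DSN)
try:
    cur = conn.cursor()
    cur.execute(QUERY)
    for app, state, n in cur.fetchall():
        print(f"{app or '<none>':30} {state or '<none>':22} {n}")
finally:
    conn.close()
```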
---
## Mitigation
### Immediate (Now - 5 min)
**Goal**: Stop the bleeding, restore service
**Option A: [Immediate Fix Option 1]**
```bash
# Commands
<command>
```
**Impact**: [What this does]
**Risk**: [Potential risks]
**When to use**: [When this option is appropriate]
---
**Option B: [Immediate Fix Option 2]**
```bash
# Commands
<command>
```
**Impact**: [What this does]
**Risk**: [Potential risks]
**When to use**: [When this option is appropriate]
---
### Short-term (5 min - 1 hour)
**Goal**: Tactical fix to prevent immediate recurrence
**Steps**:
1. [ ] [Action 1]
2. [ ] [Action 2]
3. [ ] [Action 3]
**Commands**:
```bash
# Step 1
<command>
# Step 2
<command>
```
---
### Long-term (1 hour+)
**Goal**: Permanent fix to prevent future occurrences
**Action Items**:
- [ ] [Long-term fix 1]
- Owner: [Name/Team]
- Due: [Date]
- [ ] [Long-term fix 2]
- Owner: [Name/Team]
- Due: [Date]
- [ ] [Long-term fix 3]
- Owner: [Name/Team]
- Due: [Date]
---
## Verification
How to verify the issue is resolved:
- [ ] [Verification step 1]
- [ ] [Verification step 2]
- [ ] [Verification step 3]
- [ ] [Verification step 4]
**Example**:
```
- [ ] Connection count < 80% of max
- [ ] No active alerts
- [ ] Application health check passing
- [ ] Users able to access application
- [ ] Monitor for 30 minutes (no recurrence)
```
**Commands**:
```bash
# Verify connection count
psql -c "SELECT count(*) FROM pg_stat_activity"
# Verify health check
curl http://localhost/health
```
---
## Communication
### Status Page Update Template
```markdown
[HH:MM] Investigating: We are currently investigating [issue description]. Our team is actively working on a resolution.
[HH:MM] Identified: We have identified the issue as [root cause]. We are implementing a fix.
[HH:MM] Monitoring: The fix has been deployed. We are monitoring to ensure stability.
[HH:MM] Resolved: The issue has been fully resolved. All services are operating normally.
```
### Internal Communication
**Slack Template**:
```
:rotating_light: Incident: [Incident Title]
Severity: SEV1/SEV2/SEV3
Impact: [Brief impact description]
Status: Investigating / Mitigating / Resolved
ETA: [Estimated resolution time]
Incident Channel: #incident-YYYYMMDD-name
```
---
## Escalation
### When to Escalate
Escalate if:
- [ ] Issue not resolved in [X] minutes
- [ ] Root cause unclear after [Y] attempts
- [ ] Impact spreading to other services
- [ ] Require permissions you don't have
- [ ] Need additional expertise
### Escalation Contacts
| Role | Contact | When to Escalate |
|------|---------|------------------|
| [Role 1] | [Name/Slack/Phone] | [Escalation criteria] |
| [Role 2] | [Name/Slack/Phone] | [Escalation criteria] |
| [Manager] | [Name/Slack/Phone] | [Escalation criteria] |
**Example**:
```
| DBA | @tom-dba / +1-555-0100 | Database configuration issue |
| Dev Lead | @mike-dev / +1-555-0200 | Application code issue |
| On-call Manager | @sarah-manager / +1-555-0300 | Cannot resolve in 30 minutes |
```
---
## Prevention
### Monitoring
Alerts to have in place:
- [ ] Alert: [Alert name] when [condition]
- Threshold: [Value]
- Action: [What to do]
**Example**:
```
- [ ] Alert: "Connection Pool Warning" when connections >80%
- Threshold: 80 connections (max 100)
- Action: Investigate connection usage
```
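**Illustrative check sketch**: if no metrics platform is wired up yet, the example alert above can be approximated with a small script run from cron or a monitoring agent (assumes PostgreSQL and psycopg2; the DSN and threshold are placeholders).
```python
import sys

import psycopg2

DSN = "host=db.internal dbname=app user=monitor"  # placeholder DSN
WARN_RATIO = 0.80  # warn at 80% of max_connections

conn = psycopg2.connect(DSN)
try:
    cur = conn.cursor()
    cur.execute("SHOW max_connections")
    max_connections = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity")
    in_use = cur.fetchone()[0]
finally:
    conn.close()

ratio = in_use / max_connections
print(f"connections: {in_use}/{max_connections} ({ratio:.0%})")
# A non-zero exit signals the alerting wrapper (cron mail, Nagios-style check, etc.).
sys.exit(1 if ratio >= WARN_RATIO else 0)
```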
### Best Practices
To prevent this issue:
- [ ] [Best practice 1]
- [ ] [Best practice 2]
- [ ] [Best practice 3]
**Example**:
```
- [ ] Always close database connections in finally block
- [ ] Use connection pooling with timeout
- [ ] Monitor connection pool usage
- [ ] Load test before high-traffic events
```
---
## Related Incidents
Links to past incidents of this type:
- [YYYY-MM-DD] [Incident title] - [Brief description] - [Link to post-mortem]
**Example**:
```
- [2025-09-15] Database Connection Pool Exhausted - Payment service connection leak - [Post-mortem](../post-mortems/2025-09-15.md)
```
---
## Related Documentation
Links to related runbooks, documentation, architecture diagrams:
- [Link 1] - [Description]
- [Link 2] - [Description]
- [Link 3] - [Description]
**Example**:
```
- [Database Architecture](https://wiki.example.com/db-architecture) - Database setup and configuration
- [Application Deployment](https://wiki.example.com/deploy) - How to deploy application
- [Monitoring Dashboard](https://grafana.example.com/d/database) - Database metrics
```
---
## Appendix
### Useful Commands
```bash
# Command 1: [Description]
<command>
# Command 2: [Description]
<command>
# Command 3: [Description]
<command>
```
### Logs to Check
- **Application logs**: `/var/log/application/error.log`
- **System logs**: `/var/log/syslog`
- **Database logs**: `/var/log/postgresql/postgresql.log`
### Configuration Files
- **Application config**: `/etc/application/config.yaml`
- **Database config**: `/etc/postgresql/postgresql.conf`
- **Nginx config**: `/etc/nginx/nginx.conf`
---
## Changelog
| Date | Change | By Whom |
|------|--------|---------|
| YYYY-MM-DD | Initial creation | [Name] |
| YYYY-MM-DD | Added Step X based on incident | [Name] |
| YYYY-MM-DD | Updated escalation contacts | [Name] |
---
**Questions or updates?** Contact [Owner] or update this runbook directly.