9.1 KiB
Mitigation Plan: [Incident Title]
Date: YYYY-MM-DD HH:MM UTC Incident: [Brief description] Root Cause: [Root cause if known, or "Under investigation"] Severity: SEV1 / SEV2 / SEV3 Created By: [Name]
Executive Summary
Problem: [What's broken in one sentence]
Impact: [Who's affected and how]
Solution: [High-level approach]
ETA: [Estimated time to resolution]
Example:
Problem: Database connection pool exhausted due to connection leak
Impact: All users unable to access application (100% downtime)
Solution: Restart application + fix connection leak in code
ETA: 30 minutes (service restored in 5 min, permanent fix in 30 min)
Three-Horizon Mitigation
Immediate (Now - 5 minutes)
Goal: Stop the bleeding, restore service immediately
Actions:
- [Action 1]
- What: [Detailed description]
- How: [Commands/steps]
- Impact: [Expected improvement]
- Risk: [Low/Medium/High + explanation]
- Rollback: [How to undo if it fails]
- ETA: [Time to execute]
- Owner: [Who will do this]
Example:
- [ ] Restart payment service to release connections
- What: Restart payment service to release database connections
- How: `systemctl restart payment-service`
- Impact: All 100 connections released, service restored
- Risk: Low (stateless service, graceful restart)
- Rollback: N/A (restart is safe)
- ETA: 2 minutes
- Owner: Jane (SRE)
- [ ] Monitor connection pool for 5 minutes
- What: Verify connections stay below 80%
- How: `watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'`
- Impact: Early detection if issue recurs
- Risk: None (monitoring only)
- Rollback: N/A
- ETA: 5 minutes
- Owner: Jane (SRE)
Success Criteria:
- Service health check passing
- Users able to access application
- Connection pool <80% of max
- No active alerts
Short-term (5 minutes - 1 hour)
Goal: Tactical fix to prevent immediate recurrence
Actions:
- [Action 1]
- What: [Detailed description]
- How: [Commands/steps]
- Impact: [Expected improvement]
- Risk: [Low/Medium/High + explanation]
- Rollback: [How to undo if it fails]
- ETA: [Time to execute]
- Owner: [Who will do this]
Example:
- [ ] Fix connection leak in payment service code
- What: Add `finally` block to close connection in error path
- How: Deploy hotfix branch `fix/connection-leak`
- Impact: Connections properly closed, no leak
- Risk: Medium (code change requires testing)
- Rollback: `git revert <commit>` + redeploy
- ETA: 30 minutes (test + deploy)
- Owner: Mike (Developer)
- [ ] Increase connection pool size
- What: Increase max_connections from 100 to 200
- How: ALTER SYSTEM SET max_connections = 200; SELECT pg_reload_conf();
- Impact: More headroom for traffic spikes
- Risk: Low (more connections = more memory, but server has capacity)
- Rollback: ALTER SYSTEM SET max_connections = 100; SELECT pg_reload_conf();
- ETA: 5 minutes
- Owner: Tom (DBA)
- [ ] Add connection pool monitoring alert
- What: Alert when connections >80% of max
- How: Create CloudWatch/Grafana alert
- Impact: Early warning before exhaustion
- Risk: None (monitoring only)
- Rollback: Disable alert
- ETA: 15 minutes
- Owner: Jane (SRE)
Success Criteria:
- Code fix deployed and verified
- Connection pool increased
- Monitoring alert configured
- No recurrence in 1 hour
- Load test passed (if applicable)
Long-term (1 hour - days/weeks)
Goal: Permanent fix and prevention
Actions:
- [Action 1]
- What: [Detailed description]
- Priority: P1 / P2 / P3
- Due Date: [YYYY-MM-DD]
- Owner: [Who will do this]
Example:
- [ ] Add automated test for connection cleanup
- What: Integration test that verifies connections are closed in error paths
- Priority: P1
- Due Date: 2025-10-27
- Owner: Lisa (QA)
- [ ] Add connection timeout configuration
- What: Set connection_timeout = 30s in database config
- Priority: P2
- Due Date: 2025-10-28
- Owner: Tom (DBA)
- [ ] Review all database queries for connection leaks
- What: Audit all DB queries to ensure proper cleanup
- Priority: P3
- Due Date: 2025-11-02
- Owner: Mike (Developer)
- [ ] Load test for high-traffic events
- What: Load test with 10x normal traffic to find bottlenecks
- Priority: P3
- Due Date: 2025-11-10
- Owner: John (DevOps)
- [ ] Update runbook with new findings
- What: Document connection leak troubleshooting steps
- Priority: P3
- Due Date: 2025-10-28
- Owner: Jane (SRE)
Success Criteria:
- All P1 actions completed
- Regression test added (prevents future occurrences)
- Monitoring improved (detect earlier)
- Runbook updated
- Post-mortem published
Risk Assessment
Risks of Mitigation Actions
| Action | Risk Level | Risk Description | Mitigation |
|---|---|---|---|
| [Action 1] | Low/Med/High | [What could go wrong] | [How to reduce risk] |
Example:
| Restart service | Low | Brief downtime (5s) | Use graceful restart, off-peak time |
| Deploy code fix | Medium | Bug in fix could worsen issue | Test in staging first, have rollback ready |
| Increase connection pool | Low | More memory usage | Server has capacity, monitor memory |
Risks of NOT Mitigating
| Risk | Impact | Probability |
|---|---|---|
| [Risk 1] | [Impact if we do nothing] | High/Med/Low |
Example:
| Service remains down | All users affected, revenue loss | High (will recur) |
| Connection leak worsens | Database crashes | High |
| SLA breach | Customer refunds, reputation damage | Medium |
Communication Plan
Internal Communication
Incident Channel: #incident-YYYYMMDD-title
Update Frequency: Every [X] minutes
Stakeholders to Notify:
- Engineering team (#engineering)
- Customer support (#support)
- Management (#management)
- [Other teams]
Update Template:
[HH:MM] Update:
- Status: [Investigating / Mitigating / Resolved]
- Root Cause: [Known / Under investigation]
- Current Action: [What we're doing now]
- Next Steps: [What's next]
- ETA: [Estimated resolution time]
External Communication
Status Page: [URL]
Update Frequency: Every [X] minutes or when status changes
Status Page Template:
[HH:MM] Investigating: We are currently investigating [issue description]. Our team is actively working on a resolution.
[HH:MM] Identified: We have identified the issue as [root cause]. We are implementing a fix. ETA: [time].
[HH:MM] Monitoring: The fix has been deployed. We are monitoring to ensure stability.
[HH:MM] Resolved: The issue has been fully resolved. All services are operating normally. We apologize for the inconvenience.
Customer Email (if needed):
- Draft email
- Approve with management
- Send to affected customers
Validation
Before Declaring Resolved
Verify all of the following:
- Root cause identified
- Immediate fix deployed and verified
- Service health check passing for >30 minutes
- Users able to access application
- Metrics returned to normal (response time, error rate, etc.)
- No active alerts
- Load test passed (if applicable)
- Customer support confirms no ongoing issues
Monitoring After Resolution
Monitor for [X] hours after declaring resolved:
- [Metric 1] within normal range
- [Metric 2] within normal range
- [Metric 3] within normal range
- No error spikes
- No user complaints
Example:
- [ ] Connection pool <50% of max
- [ ] API response time <200ms (p95)
- [ ] Error rate <0.1%
- [ ] Database CPU <70%
Rollback Plan
If mitigation actions fail or make things worse:
Immediate Rollback
# Rollback code deployment
git revert <commit>
npm run deploy
# Rollback database config
ALTER SYSTEM SET max_connections = 100;
SELECT pg_reload_conf();
# Verify rollback
curl http://localhost/health
When to Rollback
Rollback if:
- Issue worsens after mitigation
- New errors appear
- Service remains down >X minutes after mitigation
- Metrics worsen (response time, error rate)
Next Steps
After incident is resolved:
-
Create post-mortem (within 24 hours)
- Owner: [Name]
- Due: [Date]
-
Schedule post-mortem review meeting
- Date: [Date]
- Attendees: [List]
-
Track action items to completion
- Use: [JIRA/GitHub/etc.]
- Review: Weekly in team meeting
-
Update runbooks based on learnings
- Owner: [Name]
- Due: [Date]
-
Share learnings with organization
- Format: All-hands presentation / Email / Wiki
- Owner: [Name]
- Due: [Date]
Appendix
Commands Reference
# Useful commands for this incident
<command1>
<command2>
<command3>
Links
- Monitoring Dashboard: [URL]
- Runbook: [URL]
- Related Incidents: [URL]
- Incident Channel: [Slack/Teams URL]
Plan Created: YYYY-MM-DD HH:MM UTC Plan Updated: YYYY-MM-DD HH:MM UTC Status: Active / Executed / Superseded