gh-anton-abyzov-specweave-p…/agents/sre/templates/mitigation-plan.md
2025-11-29 17:56:41 +08:00

Mitigation Plan: [Incident Title]

Date: YYYY-MM-DD HH:MM UTC

Incident: [Brief description]

Root Cause: [Root cause if known, or "Under investigation"]

Severity: SEV1 / SEV2 / SEV3

Created By: [Name]


Executive Summary

Problem: [What's broken in one sentence]

Impact: [Who's affected and how]

Solution: [High-level approach]

ETA: [Estimated time to resolution]

Example:

Problem: Database connection pool exhausted due to connection leak
Impact: All users unable to access application (100% downtime)
Solution: Restart application + fix connection leak in code
ETA: 30 minutes (service restored in 5 min, permanent fix in 30 min)

Three-Horizon Mitigation

Immediate (Now - 5 minutes)

Goal: Stop the bleeding, restore service immediately

Actions:

  • [Action 1]
    • What: [Detailed description]
    • How: [Commands/steps]
    • Impact: [Expected improvement]
    • Risk: [Low/Medium/High + explanation]
    • Rollback: [How to undo if it fails]
    • ETA: [Time to execute]
    • Owner: [Who will do this]

Example:

- [ ] Restart payment service to release connections
  - What: Restart payment service to release database connections
  - How: `systemctl restart payment-service`
  - Impact: All 100 connections released, service restored
  - Risk: Low (stateless service, graceful restart)
  - Rollback: N/A (restart is safe)
  - ETA: 2 minutes
  - Owner: Jane (SRE)

- [ ] Monitor connection pool for 5 minutes
  - What: Verify connections stay below 80%
  - How: `watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'`
  - Impact: Early detection if issue recurs
  - Risk: None (monitoring only)
  - Rollback: N/A
  - ETA: 5 minutes
  - Owner: Jane (SRE)

Success Criteria:

  • Service health check passing
  • Users able to access application
  • Connection pool <80% of max
  • No active alerts

Short-term (5 minutes - 1 hour)

Goal: Tactical fix to prevent immediate recurrence

Actions:

  • [Action 1]
    • What: [Detailed description]
    • How: [Commands/steps]
    • Impact: [Expected improvement]
    • Risk: [Low/Medium/High + explanation]
    • Rollback: [How to undo if it fails]
    • ETA: [Time to execute]
    • Owner: [Who will do this]

Example:

- [ ] Fix connection leak in payment service code
  - What: Add `finally` block to close connection in error path
  - How: Deploy hotfix branch `fix/connection-leak`
  - Impact: Connections properly closed, no leak
  - Risk: Medium (code change requires testing)
  - Rollback: `git revert <commit>` + redeploy
  - ETA: 30 minutes (test + deploy)
  - Owner: Mike (Developer)

- [ ] Increase connection pool size
  - What: Increase PostgreSQL max_connections from 100 to 200
  - How: ALTER SYSTEM SET max_connections = 200; then restart PostgreSQL (max_connections is only read at server start, so pg_reload_conf() alone does not apply it)
  - Impact: More headroom for traffic spikes
  - Risk: Low (more connections = more memory, but server has capacity; the restart briefly drops existing connections)
  - Rollback: ALTER SYSTEM SET max_connections = 100; then restart PostgreSQL
  - ETA: 5 minutes
  - Owner: Tom (DBA)

- [ ] Add connection pool monitoring alert
  - What: Alert when connections >80% of max
  - How: Create CloudWatch/Grafana alert
  - Impact: Early warning before exhaustion
  - Risk: None (monitoring only)
  - Rollback: Disable alert
  - ETA: 15 minutes
  - Owner: Jane (SRE)

Success Criteria:

  • Code fix deployed and verified
  • Connection pool increased
  • Monitoring alert configured
  • No recurrence in 1 hour
  • Load test passed (if applicable)

Long-term (1 hour - days/weeks)

Goal: Permanent fix and prevention

Actions:

  • [Action 1]
    • What: [Detailed description]
    • Priority: P1 / P2 / P3
    • Due Date: [YYYY-MM-DD]
    • Owner: [Who will do this]

Example:

- [ ] Add automated test for connection cleanup
  - What: Integration test that verifies connections are closed in error paths
  - Priority: P1
  - Due Date: 2025-10-27
  - Owner: Lisa (QA)

- [ ] Add connection timeout configuration
  - What: Set connection_timeout = 30s in database config
  - Priority: P2
  - Due Date: 2025-10-28
  - Owner: Tom (DBA)

- [ ] Review all database queries for connection leaks
  - What: Audit all DB queries to ensure proper cleanup
  - Priority: P3
  - Due Date: 2025-11-02
  - Owner: Mike (Developer)

- [ ] Load test for high-traffic events
  - What: Load test with 10x normal traffic to find bottlenecks
  - Priority: P3
  - Due Date: 2025-11-10
  - Owner: John (DevOps)

- [ ] Update runbook with new findings
  - What: Document connection leak troubleshooting steps
  - Priority: P3
  - Due Date: 2025-10-28
  - Owner: Jane (SRE)

Success Criteria:

  • All P1 actions completed
  • Regression test added (prevents future occurrences)
  • Monitoring improved (detect earlier)
  • Runbook updated
  • Post-mortem published

Risk Assessment

Risks of Mitigation Actions

| Action | Risk Level | Risk Description | Mitigation |
|--------|------------|------------------|------------|
| [Action 1] | Low/Med/High | [What could go wrong] | [How to reduce risk] |

Example:

| Restart service | Low | Brief downtime (5s) | Use graceful restart, off-peak time |
| Deploy code fix | Medium | Bug in fix could worsen issue | Test in staging first, have rollback ready |
| Increase connection pool | Low | More memory usage | Server has capacity, monitor memory |

Risks of NOT Mitigating

| Risk | Impact | Probability |
|------|--------|-------------|
| [Risk 1] | [Impact if we do nothing] | High/Med/Low |

Example:

| Service remains down | All users affected, revenue loss | High (will recur) |
| Connection leak worsens | Database crashes | High |
| SLA breach | Customer refunds, reputation damage | Medium |

Communication Plan

Internal Communication

Incident Channel: #incident-YYYYMMDD-title

Update Frequency: Every [X] minutes

Stakeholders to Notify:

  • Engineering team (#engineering)
  • Customer support (#support)
  • Management (#management)
  • [Other teams]
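
The incident channel pattern `#incident-YYYYMMDD-title` above can be generated consistently instead of typed by hand. This is a small sketch under assumptions: the function name `incident_channel`, the sample date, and the sample title are illustrative and do not come from the template.

```shell
# incident_channel DATE TITLE
# DATE is YYYY-MM-DD; TITLE is free text. Prints "#incident-YYYYMMDD-title".
incident_channel() {
  # Drop the dashes from the date: 2025-10-26 -> 20251026.
  date_part=$(printf '%s\n' "$1" | sed 's/-//g')
  # Lowercase the title and collapse each run of other characters into one dash.
  slug=$(printf '%s\n' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
  printf '#incident-%s-%s\n' "$date_part" "$slug"
}

incident_channel 2025-10-26 "DB Pool Exhausted"
# prints: #incident-20251026-db-pool-exhausted
```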

Update Template:

[HH:MM] Update:
- Status: [Investigating / Mitigating / Resolved]
- Root Cause: [Known / Under investigation]
- Current Action: [What we're doing now]
- Next Steps: [What's next]
- ETA: [Estimated resolution time]
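
To keep updates uniform under pressure, the template above can be filled in by a small helper. The sketch below is an illustration only: `format_update` is a hypothetical name, the argument order is an assumption, and the timestamp defaults to the current UTC time unless one is passed explicitly.

```shell
# format_update STATUS ROOT_CAUSE CURRENT_ACTION NEXT_STEPS ETA [HH:MM]
# Prints one update in the layout of the update template.
format_update() {
  # Use the sixth argument as the timestamp, or the current UTC time.
  time_utc=${6:-$(date -u +%H:%M)}
  printf '[%s] Update:\n' "$time_utc"
  printf '%s\n' "- Status: $1"
  printf '%s\n' "- Root Cause: $2"
  printf '%s\n' "- Current Action: $3"
  printf '%s\n' "- Next Steps: $4"
  printf '%s\n' "- ETA: $5"
}

format_update "Mitigating" "Known" "Restarting payment service" \
  "Deploy hotfix fix/connection-leak" "30 minutes"
```

The output can be pasted into the incident channel as-is or piped to whatever chat tooling the team uses.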

External Communication

Status Page: [URL]

Update Frequency: Every [X] minutes or when status changes

Status Page Template:

[HH:MM] Investigating: We are currently investigating [issue description]. Our team is actively working on a resolution.

[HH:MM] Identified: We have identified the issue as [root cause]. We are implementing a fix. ETA: [time].

[HH:MM] Monitoring: The fix has been deployed. We are monitoring to ensure stability.

[HH:MM] Resolved: The issue has been fully resolved. All services are operating normally. We apologize for the inconvenience.

Customer Email (if needed):

  • Draft email
  • Approve with management
  • Send to affected customers

Validation

Before Declaring Resolved

Verify all of the following:

  • Root cause identified
  • Immediate fix deployed and verified
  • Service health check passing for >30 minutes
  • Users able to access application
  • Metrics returned to normal (response time, error rate, etc.)
  • No active alerts
  • Load test passed (if applicable)
  • Customer support confirms no ongoing issues
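
The "health check passing for >30 minutes" item means a sustained run of successes, not a single passing probe. One way to sketch that, assuming a hypothetical helper named `health_stable` and the health endpoint used elsewhere in this plan:

```shell
# health_stable CHECKS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND CHECKS times, INTERVAL_SECONDS apart.
# Succeeds only if every run succeeds; stops at the first failure.
health_stable() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@" >/dev/null 2>&1; then
      echo "health check failed on attempt $i of $checks" >&2
      return 1
    fi
    # Sleep between checks, but not after the last one.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$(( i + 1 ))
  done
  return 0
}

# 61 checks, 30 seconds apart, cover a 30-minute window:
# health_stable 61 30 curl -fsS http://localhost/health
```

The probe command is passed in as arguments, so the same loop works for `curl`, a `psql` query, or any other check that signals failure through its exit status.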

Monitoring After Resolution

Monitor for [X] hours after declaring resolved:

  • [Metric 1] within normal range
  • [Metric 2] within normal range
  • [Metric 3] within normal range
  • No error spikes
  • No user complaints

Example:

- [ ] Connection pool <50% of max
- [ ] API response time <200ms (p95)
- [ ] Error rate <0.1%
- [ ] Database CPU <70%
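
Thresholds like these can be evaluated in one pass once the current values are known. The sketch below is illustrative: `metric_below` is a hypothetical helper, and the sample values (45, 180, 0.05, 62) are made up for the demonstration, not measurements. `awk` does the comparison so that decimal limits such as 0.1 work.

```shell
# metric_below NAME VALUE LIMIT
# Succeeds when VALUE is strictly below LIMIT and prints a one-line result.
metric_below() {
  if awk -v v="$2" -v l="$3" 'BEGIN { exit !(v < l) }'; then
    echo "OK   $1: $2 is below $3"
  else
    echo "FAIL $1: $2 is not below $3"
    return 1
  fi
}

# Sample values only; real values would come from the monitoring system.
failures=0
metric_below "connection pool (% of max)" 45 50 || failures=$(( failures + 1 ))
metric_below "API p95 response time (ms)" 180 200 || failures=$(( failures + 1 ))
metric_below "error rate (%)" 0.05 0.1 || failures=$(( failures + 1 ))
metric_below "database CPU (%)" 62 70 || failures=$(( failures + 1 ))
echo "$failures check(s) failed"
```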

Rollback Plan

If mitigation actions fail or make things worse:

Immediate Rollback

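# Rollback code deployment
git revert <commit>
npm run deploy

# Rollback database config (run the SQL through psql, not the shell)
psql -c "ALTER SYSTEM SET max_connections = 100;"
# max_connections is only applied at server start, so restart PostgreSQL
# (the service name varies by platform)
sudo systemctl restart postgresql

# Verify rollback (-f makes curl exit non-zero on an HTTP error status)
curl -f http://localhost/health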

When to Rollback

Rollback if:

  • Issue worsens after mitigation
  • New errors appear
  • Service remains down >X minutes after mitigation
  • Metrics worsen (response time, error rate)
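
Agreeing on the rollback triggers before acting makes the decision faster. As a rough sketch, two of the conditions above (metrics worsen, service remains down >X minutes) can be encoded as follows. The function name `should_rollback` and its argument order are hypothetical, and the error-rate comparison stands in for whichever metric the team tracks.

```shell
# should_rollback ERR_BEFORE ERR_AFTER MINUTES_DOWN MAX_MINUTES_DOWN
# Prints the reason and succeeds (exit 0) when a rollback condition is met;
# prints nothing and fails (exit 1) otherwise.
should_rollback() {
  # Metrics worsen: the error rate is higher after mitigation than before.
  if awk -v b="$1" -v a="$2" 'BEGIN { exit !(a > b) }'; then
    echo "error rate rose from $1 to $2 after mitigation"
    return 0
  fi
  # Service remains down longer than the agreed limit (whole minutes).
  if [ "$3" -gt "$4" ]; then
    echo "service still down after $3 minutes (limit $4)"
    return 0
  fi
  return 1
}

# Example: error rate went from 2.5% to 4.0%, so a rollback is indicated.
if reason=$(should_rollback 2.5 4.0 0 10); then
  echo "ROLLBACK: $reason"
fi
```

The remaining conditions (issue worsens, new errors appear) usually need human judgment and are left out of the sketch.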

Next Steps

After incident is resolved:

  1. Create post-mortem (within 24 hours)

    • Owner: [Name]
    • Due: [Date]
  2. Schedule post-mortem review meeting

    • Date: [Date]
    • Attendees: [List]
  3. Track action items to completion

    • Use: [JIRA/GitHub/etc.]
    • Review: Weekly in team meeting
  4. Update runbooks based on learnings

    • Owner: [Name]
    • Due: [Date]
  5. Share learnings with organization

    • Format: All-hands presentation / Email / Wiki
    • Owner: [Name]
    • Due: [Date]

Appendix

Commands Reference

# Useful commands for this incident
<command1>
<command2>
<command3>

Links:

  • Monitoring Dashboard: [URL]
  • Runbook: [URL]
  • Related Incidents: [URL]
  • Incident Channel: [Slack/Teams URL]

Plan Created: YYYY-MM-DD HH:MM UTC

Plan Updated: YYYY-MM-DD HH:MM UTC

Status: Active / Executed / Superseded