zhongwei/gh-anton-abyzov-specweave-plugins-specweave-infrastructure

Zhongwei Li 9427ed1eea Initial commit

2025-11-29 17:56:41 +08:00


Mitigation Plan: [Incident Title]

Date: YYYY-MM-DD HH:MM UTC

Incident: [Brief description]

Root Cause: [Root cause if known, or "Under investigation"]

Severity: SEV1 / SEV2 / SEV3

Created By: [Name]

Executive Summary

Problem: [What's broken in one sentence]

Impact: [Who's affected and how]

Solution: [High-level approach]

ETA: [Estimated time to resolution]

Example:

Problem: Database connection pool exhausted due to connection leak
Impact: All users unable to access application (100% downtime)
Solution: Restart application + fix connection leak in code
ETA: 30 minutes (service restored in 5 min, permanent fix in 30 min)

Three-Horizon Mitigation

Immediate (Now - 5 minutes)

Goal: Stop the bleeding, restore service immediately

Actions:

[Action 1]
- What: [Detailed description]
- How: [Commands/steps]
- Impact: [Expected improvement]
- Risk: [Low/Medium/High + explanation]
- Rollback: [How to undo if it fails]
- ETA: [Time to execute]
- Owner: [Who will do this]

Example:

- [ ] Restart payment service to release connections
  - What: Restart payment service to release database connections
  - How: `systemctl restart payment-service`
  - Impact: All 100 connections released, service restored
  - Risk: Low (stateless service, graceful restart)
  - Rollback: N/A (restart is safe)
  - ETA: 2 minutes
  - Owner: Jane (SRE)

- [ ] Monitor connection pool for 5 minutes
  - What: Verify active connections stay below 80% of max
  - How: `watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity"'`
  - Impact: Early detection if issue recurs
  - Risk: None (monitoring only)
  - Rollback: N/A
  - ETA: 5 minutes
  - Owner: Jane (SRE)

Success Criteria:

Service health check passing
Users able to access application
Connection pool <80% of max
No active alerts
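
The "Connection pool <80% of max" criterion can be checked in code once the connection counts are available. The following is a minimal sketch, assuming the active count comes from a query such as the `pg_stat_activity` one above; the function names and the way counts are obtained are hypothetical, not part of the template.

```python
def pool_usage_pct(active: int, max_connections: int) -> float:
    """Return active connections as a percentage of the configured maximum."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * active / max_connections


def pool_is_healthy(active: int, max_connections: int,
                    threshold_pct: float = 80.0) -> bool:
    """True while usage stays strictly below the threshold (80% in the example)."""
    return pool_usage_pct(active, max_connections) < threshold_pct
```

With a maximum of 100 connections, 79 active connections pass the check and 80 do not, which matches the strict "<80%" wording of the criterion.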

Short-term (5 minutes - 1 hour)

Goal: Tactical fix to prevent immediate recurrence

Actions:

[Action 1]
- What: [Detailed description]
- How: [Commands/steps]
- Impact: [Expected improvement]
- Risk: [Low/Medium/High + explanation]
- Rollback: [How to undo if it fails]
- ETA: [Time to execute]
- Owner: [Who will do this]

Example:

- [ ] Fix connection leak in payment service code
  - What: Add `finally` block to close connection in error path
  - How: Deploy hotfix branch `fix/connection-leak`
  - Impact: Connections properly closed, no leak
  - Risk: Medium (code change requires testing)
  - Rollback: `git revert <commit>` + redeploy
  - ETA: 30 minutes (test + deploy)
  - Owner: Mike (Developer)

- [ ] Increase connection pool size
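  - What: Increase PostgreSQL `max_connections` from 100 to 200
  - How: `ALTER SYSTEM SET max_connections = 200;` then restart PostgreSQL (`max_connections` only takes effect after a server restart; `pg_reload_conf()` does not apply it)
  - Impact: More headroom for traffic spikes
  - Risk: Low (more connections = more memory, but server has capacity); note that the restart briefly interrupts database connections
  - Rollback: `ALTER SYSTEM SET max_connections = 100;` then restart PostgreSQL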
  - ETA: 5 minutes
  - Owner: Tom (DBA)

- [ ] Add connection pool monitoring alert
  - What: Alert when connections >80% of max
  - How: Create CloudWatch/Grafana alert
  - Impact: Early warning before exhaustion
  - Risk: None (monitoring only)
  - Rollback: Disable alert
  - ETA: 15 minutes
  - Owner: Jane (SRE)

Success Criteria:

Code fix deployed and verified
Connection pool increased
Monitoring alert configured
No recurrence in 1 hour
Load test passed (if applicable)
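
The hotfix in the example above ("add a `finally` block to close the connection in the error path") might look like the sketch below. It assumes nothing about the real payment service: `TrackedConnection` is a hypothetical stand-in for a database connection, used only so the cleanup behaviour can be observed without a database.

```python
class TrackedConnection:
    """Hypothetical stand-in for a DB connection; records whether close() ran."""

    def __init__(self) -> None:
        self.closed = False

    def execute(self, query: str) -> str:
        # Simulate a failing query so the error path can be exercised.
        if "FAIL" in query:
            raise RuntimeError("simulated query failure")
        return "ok"

    def close(self) -> None:
        self.closed = True


def run_query(conn: TrackedConnection, query: str) -> str:
    """Run one query and always release the connection, even on error."""
    try:
        return conn.execute(query)
    finally:
        # Runs on both success and exception, so the connection cannot leak.
        conn.close()
```

The same stand-in doubles as a regression check: call `run_query` with a failing query, catch the exception, and confirm `closed` is `True`.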

Long-term (1 hour - days/weeks)

Goal: Permanent fix and prevention

Actions:

[Action 1]
- What: [Detailed description]
- Priority: P1 / P2 / P3
- Due Date: [YYYY-MM-DD]
- Owner: [Who will do this]

Example:

- [ ] Add automated test for connection cleanup
  - What: Integration test that verifies connections are closed in error paths
  - Priority: P1
  - Due Date: 2025-10-27
  - Owner: Lisa (QA)

- [ ] Add connection timeout configuration
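  - What: Set a 30s idle-connection timeout in database config (PostgreSQL has no `connection_timeout` setting; `idle_in_transaction_session_timeout = '30s'` is one option)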
  - Priority: P2
  - Due Date: 2025-10-28
  - Owner: Tom (DBA)

- [ ] Review all database queries for connection leaks
  - What: Audit all DB queries to ensure proper cleanup
  - Priority: P3
  - Due Date: 2025-11-02
  - Owner: Mike (Developer)

- [ ] Load test for high-traffic events
  - What: Load test with 10x normal traffic to find bottlenecks
  - Priority: P3
  - Due Date: 2025-11-10
  - Owner: John (DevOps)

- [ ] Update runbook with new findings
  - What: Document connection leak troubleshooting steps
  - Priority: P3
  - Due Date: 2025-10-28
  - Owner: Jane (SRE)

Success Criteria:

All P1 actions completed
Regression test added (prevents future occurrences)
Monitoring improved (detect earlier)
Runbook updated
Post-mortem published

Risk Assessment

Risks of Mitigation Actions

| Action | Risk Level | Risk Description | Mitigation |
|--------|------------|------------------|------------|
| [Action 1] | Low/Med/High | [What could go wrong] | [How to reduce risk] |

Example:

| Restart service | Low | Brief downtime (5s) | Use graceful restart, off-peak time |
| Deploy code fix | Medium | Bug in fix could worsen issue | Test in staging first, have rollback ready |
| Increase connection pool | Low | More memory usage | Server has capacity, monitor memory |

Risks of NOT Mitigating

| Risk | Impact | Probability |
|------|--------|-------------|
| [Risk 1] | [Impact if we do nothing] | High/Med/Low |

Example:

| Service remains down | All users affected, revenue loss | High (will recur) |
| Connection leak worsens | Database crashes | High |
| SLA breach | Customer refunds, reputation damage | Medium |

Communication Plan

Internal Communication

Incident Channel: #incident-YYYYMMDD-title

Update Frequency: Every [X] minutes

Stakeholders to Notify:

Engineering team (#engineering)
Customer support (#support)
Management (#management)
[Other teams]

Update Template:

[HH:MM] Update:
- Status: [Investigating / Mitigating / Resolved]
- Root Cause: [Known / Under investigation]
- Current Action: [What we're doing now]
- Next Steps: [What's next]
- ETA: [Estimated resolution time]
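
If updates are posted by a bot or script, the template above can be filled in programmatically. This is a rough sketch only: `format_update` and its parameters are hypothetical names, and the timestamp is assumed to be UTC to match the plan header.

```python
from datetime import datetime, timezone
from typing import Optional

VALID_STATUSES = ("Investigating", "Mitigating", "Resolved")


def format_update(status: str, root_cause: str, current_action: str,
                  next_steps: str, eta: str,
                  now: Optional[datetime] = None) -> str:
    """Render one internal status update in the template's layout."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}")
    now = now or datetime.now(timezone.utc)
    return "\n".join([
        f"[{now:%H:%M}] Update:",
        f"- Status: {status}",
        f"- Root Cause: {root_cause}",
        f"- Current Action: {current_action}",
        f"- Next Steps: {next_steps}",
        f"- ETA: {eta}",
    ])
```

Rejecting unknown statuses keeps updates limited to the three states the template lists.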

External Communication

Status Page: [URL]

Update Frequency: Every [X] minutes or when status changes

Status Page Template:

[HH:MM] Investigating: We are currently investigating [issue description]. Our team is actively working on a resolution.

[HH:MM] Identified: We have identified the issue as [root cause]. We are implementing a fix. ETA: [time].

[HH:MM] Monitoring: The fix has been deployed. We are monitoring to ensure stability.

[HH:MM] Resolved: The issue has been fully resolved. All services are operating normally. We apologize for the inconvenience.

Customer Email (if needed):

Draft email
Approve with management
Send to affected customers

Validation

Before Declaring Resolved

Verify all of the following:

Root cause identified
Immediate fix deployed and verified
Service health check passing for >30 minutes
Users able to access application
Metrics returned to normal (response time, error rate, etc.)
No active alerts
Load test passed (if applicable)
Customer support confirms no ongoing issues

Monitoring After Resolution

Monitor for [X] hours after declaring resolved:

[Metric 1] within normal range
[Metric 2] within normal range
[Metric 3] within normal range
No error spikes
No user complaints

Example:

- [ ] Connection pool <50% of max
- [ ] API response time <200ms (p95)
- [ ] Error rate <0.1%
- [ ] Database CPU <70%
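
The example thresholds above can be evaluated together during the post-resolution watch. The sketch below assumes the metric values are collected elsewhere (for instance from a monitoring API); the metric key names are hypothetical, and only the limits come from the example.

```python
from typing import Dict, List

# Upper limits taken from the example checklist; values must stay below them.
THRESHOLDS: Dict[str, float] = {
    "connection_pool_pct": 50.0,  # Connection pool <50% of max
    "api_p95_ms": 200.0,          # API response time <200ms (p95)
    "error_rate_pct": 0.1,        # Error rate <0.1%
    "db_cpu_pct": 70.0,           # Database CPU <70%
}


def failing_metrics(observed: Dict[str, float]) -> List[str]:
    """Return names of metrics that are missing or at/above their limit."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if name not in observed or observed[name] >= limit
    ]
```

An empty result means every listed metric is within range; a missing metric is treated as a failure so that a broken collector is not mistaken for a healthy system.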

Rollback Plan

If mitigation actions fail or make things worse:

Immediate Rollback

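# Rollback code deployment
git revert <commit>
npm run deploy

# Rollback database config (max_connections only takes effect after a
# server restart; pg_reload_conf() alone does not apply it)
psql -c "ALTER SYSTEM SET max_connections = 100;"
sudo systemctl restart postgresql   # service name varies by installation

# Verify rollback (-f makes curl exit non-zero on an HTTP error status)
curl -f http://localhost/health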

When to Rollback

Rollback if:

Issue worsens after mitigation
New errors appear
Service remains down >X minutes after mitigation
Metrics worsen (response time, error rate)
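
These criteria can be turned into an explicit decision so the call is not made ad hoc under pressure. The following is a sketch under stated assumptions: `Snapshot` and `should_rollback` are hypothetical names, the "X minutes" limit is left as a parameter because the template does not fix it, and "new errors appear" is not modelled since it needs log inspection.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics captured at one point in time (hypothetical shape)."""
    error_rate_pct: float
    p95_ms: float
    service_up: bool


def should_rollback(before: Snapshot, after: Snapshot,
                    minutes_since_mitigation: float,
                    max_down_minutes: float) -> bool:
    """Apply the rollback criteria to pre- and post-mitigation snapshots."""
    # Service remains down longer than the agreed limit after mitigation.
    if not after.service_up and minutes_since_mitigation > max_down_minutes:
        return True
    # Metrics worsen: a higher error rate or slower p95 than before.
    return (after.error_rate_pct > before.error_rate_pct
            or after.p95_ms > before.p95_ms)
```

Comparing against a snapshot taken just before the mitigation avoids rolling back a change that merely failed to improve an already-bad metric.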

Next Steps

After incident is resolved:

- [ ] Create post-mortem (within 24 hours)
  - Owner: [Name]
  - Due: [Date]

- [ ] Schedule post-mortem review meeting
  - Date: [Date]
  - Attendees: [List]

- [ ] Track action items to completion
  - Use: [JIRA/GitHub/etc.]
  - Review: Weekly in team meeting

- [ ] Update runbooks based on learnings
  - Owner: [Name]
  - Due: [Date]

- [ ] Share learnings with organization
  - Format: All-hands presentation / Email / Wiki
  - Owner: [Name]
  - Due: [Date]

Appendix

Commands Reference

# Useful commands for this incident
<command1>
<command2>
<command3>
