Initial commit
26
skills/incident-response/SKILL.md
Normal file
@@ -0,0 +1,26 @@
|
||||
# Incident Response Skill
|
||||
|
||||
Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems.
|
||||
|
||||
## Description
|
||||
|
||||
Production incident response following SRE methodologies with incident timeline tracking, RCA documentation, and runbook updates.
|
||||
|
||||
## What's Included
|
||||
|
||||
- **Examples**: SEV1 incident handling, postmortem templates
|
||||
- **Reference**: SRE best practices, incident severity levels
|
||||
- **Templates**: Incident reports, RCA documents, runbook updates
|
||||
|
||||
## Use When
|
||||
|
||||
- Production outages
|
||||
- SEV1/SEV2 incidents
|
||||
- Postmortem creation
|
||||
- Runbook updates
|
||||
|
||||
## Related Agents
|
||||
|
||||
- `incident-responder`
|
||||
|
||||
**Skill Version**: 1.0
|
||||
122
skills/incident-response/examples/INDEX.md
Normal file
@@ -0,0 +1,122 @@
|
||||
# Incident Response Examples
|
||||
|
||||
Real-world production incident examples demonstrating systematic incident response, root cause analysis, mitigation strategies, and blameless postmortems.
|
||||
|
||||
## Available Examples
|
||||
|
||||
### SEV1: Critical Database Outage
|
||||
|
||||
**File**: [sev1-critical-database-outage.md](sev1-critical-database-outage.md)
|
||||
|
||||
Complete database failure causing total service outage:
|
||||
- **Incident**: PostgreSQL primary failure, 100% error rate
|
||||
- **Impact**: All services down, $50K revenue loss/hour
|
||||
- **Root Cause**: Disk full on primary, replication lag spike
|
||||
- **Resolution**: Promoted replica, cleared disk space, restored service
|
||||
- **MTTR**: 45 minutes (detection → full recovery)
|
||||
- **Prevention**: Disk monitoring alerts, automatic disk cleanup, replica promotion automation
|
||||
|
||||
**Key Learnings**:
|
||||
- Importance of replica promotion runbooks
|
||||
- Disk space monitoring thresholds
|
||||
- Automated failover procedures
|
||||
|
||||
---
|
||||
|
||||
### SEV2: API Performance Degradation
|
||||
|
||||
**File**: [sev2-api-performance-degradation.md](sev2-api-performance-degradation.md)
|
||||
|
||||
Gradual performance degradation caused by a memory leak:
|
||||
- **Incident**: API p95 latency 200ms → 5000ms over 2 hours
|
||||
- **Impact**: 30% of users affected, slow page loads
|
||||
- **Root Cause**: Memory leak in worker process, OOM killing workers
|
||||
- **Resolution**: Identified leak with heap snapshot, deployed fix, restarted workers
|
||||
- **MTTR**: 3 hours (detection → permanent fix)
|
||||
- **Prevention**: Memory profiling in CI/CD, heap snapshot automation, worker restart automation
|
||||
|
||||
**Key Learnings**:
|
||||
- Early detection through gradual alerts
|
||||
- Heap snapshot analysis for memory leaks
|
||||
- Temporary mitigation (worker restarts) vs permanent fix
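
A gradual alert of the kind listed in the learnings can be approximated by checking for sustained growth between consecutive samples instead of waiting for an absolute limit. The sketch below is illustrative only: the function names, the 25% growth threshold, the three-interval window, and the sample values are assumptions, not details taken from the incident. It also assumes every sample is positive.

```python
from typing import Sequence


def growth_percent(previous: float, current: float) -> float:
    """Percentage change from one sample to the next."""
    return 100.0 * (current - previous) / previous


def sustained_growth(samples: Sequence[float],
                     min_growth_percent: float = 25.0,
                     consecutive: int = 3) -> bool:
    """True if each of the last `consecutive` intervals grew by at least the threshold."""
    if len(samples) < consecutive + 1:
        return False
    recent = samples[-(consecutive + 1):]
    return all(
        growth_percent(a, b) >= min_growth_percent
        for a, b in zip(recent, recent[1:])
    )


# Hypothetical worker memory samples in MB, one every 15 minutes.
leaking = [400, 720, 1200, 1800]
steady = [400, 410, 405, 420]
print(sustained_growth(leaking))  # True
print(sustained_growth(steady))   # False
```

Alerting on the growth rate can fire well before the memory limit is reached, which is the window in which a worker restart is still a cheap mitigation.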
|
||||
|
||||
---
|
||||
|
||||
### SEV3: Feature Flag Misconfiguration
|
||||
|
||||
**File**: [sev3-feature-flag-misconfiguration.md](sev3-feature-flag-misconfiguration.md)
|
||||
|
||||
Feature flag enabled for the wrong audience, causing user confusion:
|
||||
- **Incident**: Experimental feature shown to 20% of production users
|
||||
- **Impact**: 200 support tickets, user confusion, no revenue impact
|
||||
- **Root Cause**: Feature flag percentage set to 20% instead of 0%
|
||||
- **Resolution**: Disabled flag, sent customer communication, updated flag process
|
||||
- **MTTR**: 30 minutes (detection → resolution)
|
||||
- **Prevention**: Feature flag code review, staging validation, gradual rollout process
|
||||
|
||||
**Key Learnings**:
|
||||
- Feature flag validation before production
|
||||
- Importance of clear documentation
|
||||
- Quick rollback procedures
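
One common way to implement a percentage rollout is to hash the flag name and user ID into a stable bucket. The sketch below shows that idea so the effect of a 20% versus 0% setting is concrete. It is an assumption about how such a flag might work, not the mechanism used in the incident; `in_rollout`, the flag name, and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Deterministically place a user inside or outside a percentage rollout."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"percent must be between 0 and 100, got {percent}")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


users = [f"user-{i}" for i in range(10_000)]
exposed = sum(in_rollout("new-checkout", u, 20.0) for u in users)
print(f"{exposed / len(users):.1%} of users see the flag at 20%")
print(any(in_rollout("new-checkout", u, 0.0) for u in users))  # False
```

Because the bucket depends only on the flag and the user, the same users stay exposed as the percentage grows, which keeps a gradual rollout predictable and a rollback to 0% immediate.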
|
||||
|
||||
---
|
||||
|
||||
### Distributed Tracing Investigation
|
||||
|
||||
**File**: [distributed-tracing-investigation.md](distributed-tracing-investigation.md)
|
||||
|
||||
Using Jaeger distributed tracing to find a microservice bottleneck:
|
||||
- **Incident**: Checkout API slow (3s p95), unclear which service
|
||||
- **Investigation**: Used Jaeger to trace request flow across 7 microservices
|
||||
- **Root Cause**: Payment service calling external API synchronously (2.8s)
|
||||
- **Resolution**: Moved external API call to async background job
|
||||
- **Impact**: p95 latency 3000ms → 150ms (a 95% reduction)
|
||||
|
||||
**Key Learnings**:
|
||||
- Power of distributed tracing for microservices
|
||||
- Synchronous external dependencies are dangerous
|
||||
- Background jobs for non-critical operations
|
||||
|
||||
---
|
||||
|
||||
### Cascade Failure Prevention
|
||||
|
||||
**File**: [cascade-failure-prevention.md](cascade-failure-prevention.md)
|
||||
|
||||
Preventing cascade failure through circuit breakers and bulkheads:
|
||||
- **Incident**: Auth service down, caused all dependent services to fail
|
||||
- **Impact**: Complete outage instead of graceful degradation
|
||||
- **Root Cause**: No circuit breakers, all services retrying auth indefinitely
|
||||
- **Resolution**: Implemented circuit breakers, bulkhead isolation, fallback logic
|
||||
- **Prevention**: Circuit breaker pattern, timeout configuration, graceful degradation
|
||||
|
||||
**Key Learnings**:
|
||||
- Circuit breakers prevent cascade failures
|
||||
- Bulkhead isolation limits blast radius
|
||||
- Fallback logic enables graceful degradation
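
The circuit breaker and fallback ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the incident: the class name, the thresholds, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Closed until N consecutive failures, then open; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cooldown has elapsed, let one trial call through (half-open).
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast instead of retrying a dead dependency
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each auth request as `breaker.call(check_token, use_cached_session)`, where both function names are placeholders. While the breaker is open the dependency is not called at all, which is what stops the indefinite retries described in the root cause.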
|
||||
|
||||
---
|
||||
|
||||
## Learning Outcomes
|
||||
|
||||
After studying these examples, you will understand:
|
||||
|
||||
1. **Incident Classification**: How to assess severity (SEV1-SEV4) based on impact
|
||||
2. **Incident Command**: Role of IC, communication protocols, timeline management
|
||||
3. **Root Cause Analysis**: 5 Whys, timeline reconstruction, data-driven investigation
|
||||
4. **Mitigation Strategies**: Immediate actions, temporary fixes, permanent solutions
|
||||
5. **Blameless Postmortems**: Focus on systems not people, actionable items, continuous improvement
|
||||
6. **Communication**: Internal updates, external communications, executive briefings
|
||||
7. **Prevention**: Monitoring improvements, runbook automation, architectural changes
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Reference**: [Reference Index](../reference/INDEX.md) - Severity matrix, communication templates, RCA techniques
|
||||
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident timeline, postmortem, runbook templates
|
||||
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
|
||||
|
||||
---
|
||||
|
||||
Return to [main agent](../incident-responder.md)
|
||||
82
skills/incident-response/reference/INDEX.md
Normal file
@@ -0,0 +1,82 @@
|
||||
# Incident Response Reference
|
||||
|
||||
Quick reference guides for incident severity classification, communication templates, root cause analysis techniques, and runbook structure.
|
||||
|
||||
## Available References
|
||||
|
||||
### Incident Severity Matrix
|
||||
|
||||
**File**: [incident-severity-matrix.md](incident-severity-matrix.md)
|
||||
|
||||
Complete severity classification guide with examples:
|
||||
- **SEV1 (Critical)**: Complete outage, all customers affected, revenue stopped
|
||||
- **SEV2 (Major)**: Partial degradation, significant customer impact
|
||||
- **SEV3 (Minor)**: Isolated issues, workarounds available
|
||||
- **SEV4 (Cosmetic)**: UI issues, no functional impact
|
||||
|
||||
**Use when**: Classifying incident severity, determining escalation path
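
The four levels above can be read as an ordered series of checks, most severe first. The sketch below encodes that reading. The `Impact` fields are a simplification invented for illustration, and the full matrix in the linked file may use different or additional criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"
    SEV4 = "Cosmetic"


@dataclass(frozen=True)
class Impact:
    complete_outage: bool          # all customers affected, revenue stopped
    significant_degradation: bool  # partial degradation with significant customer impact
    functional_impact: bool        # something is broken, but isolated or with a workaround


def classify(impact: Impact) -> Severity:
    """Return the highest severity whose condition matches."""
    if impact.complete_outage:
        return Severity.SEV1
    if impact.significant_degradation:
        return Severity.SEV2
    if impact.functional_impact:
        return Severity.SEV3
    return Severity.SEV4


print(classify(Impact(False, True, True)).name)  # SEV2
```

Checking from most to least severe means an incident that matches several descriptions is always assigned the higher level.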
|
||||
|
||||
---
|
||||
|
||||
### Communication Templates
|
||||
|
||||
**File**: [communication-templates.md](communication-templates.md)
|
||||
|
||||
Ready-to-use templates for all incident communications:
|
||||
- Internal updates (Slack, email)
|
||||
- External communications (status page, customer emails)
|
||||
- Executive briefings
|
||||
- Post-incident summaries
|
||||
- Postmortem distribution
|
||||
|
||||
**Use when**: Communicating during or after incidents
|
||||
|
||||
---
|
||||
|
||||
### Root Cause Analysis Techniques
|
||||
|
||||
**File**: [rca-techniques.md](rca-techniques.md)
|
||||
|
||||
Comprehensive RCA methodology guide:
|
||||
- **5 Whys**: Iterative questioning to find root cause
|
||||
- **Fishbone Diagrams**: Category-based analysis
|
||||
- **Timeline Reconstruction**: Event sequencing and correlation
|
||||
- **Contributing Factors Analysis**: Immediate vs underlying vs latent causes
|
||||
- **Hypothesis Testing**: Data-driven validation
|
||||
|
||||
**Use when**: Conducting root cause analysis, writing postmortems
|
||||
|
||||
---
|
||||
|
||||
### Runbook Structure Guide
|
||||
|
||||
**File**: [runbook-structure-guide.md](runbook-structure-guide.md)
|
||||
|
||||
Best practices for writing effective runbooks:
|
||||
- Standard runbook template
|
||||
- Diagnostic procedures
|
||||
- Remediation steps
|
||||
- Escalation paths
|
||||
- Success criteria
|
||||
- Runbook maintenance
|
||||
|
||||
**Use when**: Creating or updating runbooks, automating diagnostics
|
||||
|
||||
---
|
||||
|
||||
## Quick Links
|
||||
|
||||
**By Use Case**:
|
||||
- Need to classify incident severity → [Severity Matrix](incident-severity-matrix.md)
|
||||
- Need to communicate during incident → [Communication Templates](communication-templates.md)
|
||||
- Need to find root cause → [RCA Techniques](rca-techniques.md)
|
||||
- Need to write a runbook → [Runbook Structure Guide](runbook-structure-guide.md)
|
||||
|
||||
**Related Documentation**:
|
||||
- **Examples**: [Examples Index](../examples/INDEX.md) - Real-world incident examples
|
||||
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident timeline, postmortem templates
|
||||
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
|
||||
|
||||
---
|
||||
|
||||
Return to [main agent](../incident-responder.md)
|
||||
213
skills/incident-response/reference/communication-templates.md
Normal file
@@ -0,0 +1,213 @@
|
||||
# Communication Templates
|
||||
|
||||
Copy-paste templates for incident communications across all channels and severity levels.
|
||||
|
||||
## Internal Communications
|
||||
|
||||
### SEV1 Incident Start
|
||||
|
||||
```
|
||||
🚨 SEV1 INCIDENT DECLARED 🚨
|
||||
Incident ID: INC-YYYY-MM-DD-XXX
|
||||
Impact: [Brief description - e.g., "100% outage, database down"]
|
||||
Affected Services: [List services]
|
||||
Customer Impact: [All users / X% of users / Specific feature unavailable]
|
||||
|
||||
Roles:
|
||||
- Incident Commander: @[name]
|
||||
- Technical Lead: @[name]
|
||||
- Communications Lead: @[name]
|
||||
|
||||
War Room:
|
||||
- Slack: #incident-XXX
|
||||
- Zoom: [link]
|
||||
|
||||
Status: Investigating
|
||||
Next Update: [Time] (15 minutes or on status change)
|
||||
Runbook: [Link if available]
|
||||
```
|
||||
|
||||
### SEV2 Incident Start
|
||||
|
||||
```
|
||||
⚠️ SEV2 INCIDENT
|
||||
Incident ID: INC-YYYY-MM-DD-XXX
|
||||
Impact: [Brief description - e.g., "API degraded, 30% users affected"]
|
||||
Symptoms: [What users are experiencing]
|
||||
|
||||
IC: @[name]
|
||||
Status: Investigating [suspected cause]
|
||||
Next Update: 30 minutes
|
||||
```
|
||||
|
||||
### Incident Update
|
||||
|
||||
```
|
||||
📊 UPDATE #[N] (T+[X] minutes)
|
||||
Root Cause: [What we found OR "Still investigating"]
|
||||
Mitigation: [What we're doing]
|
||||
Impact: [Current status - improving/stable/worsening]
|
||||
ETA: [Expected resolution time OR "Unknown"]
|
||||
Next Update: [Time]
|
||||
```
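
If updates are posted by a bot or script, the elapsed time and next-update time in the template above can be computed rather than typed by hand. The following is a sketch under assumptions: `render_update` and its parameters are hypothetical, the timestamps are invented, and the 15 and 30 minute intervals are taken from the SEV1 and SEV2 start templates.

```python
from datetime import datetime, timedelta, timezone

# Update cadence taken from the SEV1 and SEV2 incident start templates.
UPDATE_INTERVAL_MIN = {"SEV1": 15, "SEV2": 30}


def render_update(n: int, severity: str, started: datetime, now: datetime,
                  root_cause: str, mitigation: str, impact: str,
                  eta: str = "Unknown") -> str:
    """Fill in the Incident Update template with computed times."""
    elapsed_min = int((now - started).total_seconds() // 60)
    next_update = now + timedelta(minutes=UPDATE_INTERVAL_MIN[severity])
    return "\n".join([
        f"📊 UPDATE #{n} (T+{elapsed_min} minutes)",
        f"Root Cause: {root_cause}",
        f"Mitigation: {mitigation}",
        f"Impact: {impact}",
        f"ETA: {eta}",
        f"Next Update: {next_update:%H:%M} UTC",
    ])


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical start time
now = datetime(2024, 1, 1, 12, 45, tzinfo=timezone.utc)
print(render_update(3, "SEV1", started, now,
                    "Still investigating", "Failing over to replica", "stable"))
```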
|
||||
|
||||
### Incident Resolved
|
||||
|
||||
```
|
||||
🎉 INCIDENT RESOLVED (T+[X] minutes/hours)
|
||||
Final Status: All services operational
|
||||
Root Cause: [Brief summary]
|
||||
Fix Applied: [What was done]
|
||||
Monitoring: Ongoing for [duration]
|
||||
|
||||
Postmortem: Scheduled for [date/time]
|
||||
Timeline: [Link to detailed timeline]
|
||||
```
|
||||
|
||||
## External Communications
|
||||
|
||||
### Status Page - Investigating
|
||||
|
||||
```
|
||||
🔴 INVESTIGATING - [Brief Title]
|
||||
|
||||
We are investigating reports of [issue description].
|
||||
Our team is actively working to identify the cause.
|
||||
|
||||
Affected: [Service names]
|
||||
Started: [HH:MM UTC]
|
||||
Next Update: [HH:MM UTC]
|
||||
```
|
||||
|
||||
### Status Page - Identified
|
||||
|
||||
```
|
||||
🟡 IDENTIFIED - [Issue Title]
|
||||
|
||||
We have identified the issue as [brief cause].
|
||||
Our team is implementing a fix.
|
||||
|
||||
Affected: [Service names]
|
||||
Started: [HH:MM UTC]
|
||||
Identified: [HH:MM UTC]
|
||||
Est. Resolution: [HH:MM UTC]
|
||||
```
|
||||
|
||||
### Status Page - Monitoring
|
||||
|
||||
```
|
||||
🟢 MONITORING - [Issue Title] Resolved
|
||||
|
||||
The issue has been resolved and services are operating normally.
|
||||
We are monitoring to ensure stability.
|
||||
|
||||
Started: [HH:MM UTC]
|
||||
Resolved: [HH:MM UTC]
|
||||
Duration: [X] minutes
|
||||
```
|
||||
|
||||
### Customer Email - SEV1 Postmortem
|
||||
|
||||
```
|
||||
Subject: Service Disruption - [Date] Postmortem
|
||||
|
||||
Dear [Product Name] Customers,
|
||||
|
||||
On [Date] at [Time UTC], we experienced a service disruption that affected [all users / X% of users] for approximately [duration].
|
||||
|
||||
What Happened:
|
||||
[2-3 sentence summary of the incident]
|
||||
|
||||
Impact:
|
||||
- Duration: [X] minutes
|
||||
- Affected Users: [percentage or description]
|
||||
- Services Impacted: [list]
|
||||
|
||||
Root Cause:
|
||||
[1-2 sentence explanation of root cause]
|
||||
|
||||
Resolution:
|
||||
[1-2 sentences on how we fixed it]
|
||||
|
||||
Prevention:
|
||||
We have implemented the following measures to prevent recurrence:
|
||||
1. [Measure 1]
|
||||
2. [Measure 2]
|
||||
3. [Measure 3]
|
||||
|
||||
We sincerely apologize for the inconvenience and appreciate your patience.
|
||||
|
||||
[Team Name]
|
||||
[Company Name]
|
||||
```
|
||||
|
||||
## Executive Briefings
|
||||
|
||||
### Initial Notification (SEV1 only)
|
||||
|
||||
```
|
||||
Subject: SEV1 Incident - [Brief Title]
|
||||
|
||||
Summary:
|
||||
- Incident ID: INC-YYYY-MM-DD-XXX
|
||||
- Started: [HH:MM UTC]
|
||||
- Impact: [All users affected / X% affected / revenue stopped]
|
||||
- Status: [Investigating / Mitigation in progress]
|
||||
|
||||
Current Situation:
|
||||
[2-3 sentences explaining what's happening]
|
||||
|
||||
Response:
|
||||
- IC: [Name]
|
||||
- Team: [X] engineers actively working
|
||||
- ETA: [Time if known, "Unknown" if not]
|
||||
|
||||
Business Impact:
|
||||
- Revenue: [Estimated $ per hour OR "Minimal"]
|
||||
- Customers: [Number affected]
|
||||
- SLA: [Yes/No breach, details]
|
||||
|
||||
Next Update: [Time]
|
||||
```
|
||||
|
||||
### Resolution Summary (Executive)
|
||||
|
||||
```
|
||||
Subject: SEV1 Resolved - [Brief Title]
|
||||
|
||||
Incident INC-YYYY-MM-DD-XXX has been resolved after [duration].
|
||||
|
||||
Timeline:
|
||||
- Started: [HH:MM UTC]
|
||||
- Identified: [HH:MM UTC]
|
||||
- Resolved: [HH:MM UTC]
|
||||
- Total Duration: [X] minutes
|
||||
|
||||
Impact:
|
||||
- Customers Affected: [Number / percentage]
|
||||
- Revenue Loss: [$X estimated]
|
||||
- SLA Breach: [Yes/No]
|
||||
|
||||
Root Cause:
|
||||
[1-2 sentences]
|
||||
|
||||
Resolution:
|
||||
[1-2 sentences on fix]
|
||||
|
||||
Prevention:
|
||||
[2-3 key action items with owners and dates]
|
||||
|
||||
Postmortem: Scheduled for [date/time]
|
||||
|
||||
[IC Name]
|
||||
```
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to use each template
|
||||
- **Examples**: [Examples Index](../examples/INDEX.md) - Real communications from incidents
|
||||
- **Templates**: [Templates Index](../templates/INDEX.md) - Full incident timeline and postmortem templates
|
||||
|
||||
---
|
||||
|
||||
Return to [reference index](INDEX.md)
|
||||
260
skills/incident-response/reference/rca-techniques.md
Normal file
@@ -0,0 +1,260 @@
|
||||
# Root Cause Analysis Techniques
|
||||
|
||||
Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.
|
||||
|
||||
## 5 Whys Technique
|
||||
|
||||
### Method
|
||||
|
||||
Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).
|
||||
|
||||
**Rules**:
|
||||
1. Start with the problem statement
|
||||
2. Ask "Why did this happen?"
|
||||
3. Answer based on facts/data (not assumptions)
|
||||
4. Repeat for each answer until root cause found
|
||||
5. Root cause = **systemic issue** (not human error)
|
||||
|
||||
### Example
|
||||
|
||||
**Problem**: Database went down
|
||||
|
||||
```
|
||||
Why 1: Why did the database go down?
|
||||
→ Because the primary database ran out of disk space
|
||||
|
||||
Why 2: Why did it run out of disk space?
|
||||
→ Because PostgreSQL logs filled the entire disk (450GB)
|
||||
|
||||
Why 3: Why did logs grow to 450GB?
→ Because rotated log files were appended to instead of being overwritten

Why 4: Why were rotated log files not overwritten?
→ Because the `log_truncate_on_rotation` config was set to `off` during a migration
|
||||
|
||||
Why 5: Why was this config change not caught?
|
||||
→ Because configuration changes are not code-reviewed and there was no disk monitoring alert
|
||||
|
||||
ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review
|
||||
```
|
||||
|
||||
**Action Items**:
|
||||
- Add disk usage monitoring (>90% alert)
|
||||
- Require code review for all config changes
|
||||
- Enable log rotation on all databases
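
The first action item can be prototyped with the standard library before it is wired into a real monitoring system. This is a minimal sketch that uses the 90% threshold from the action item; the function names and the path being checked are assumptions.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_alert(used_percent: float, threshold_percent: float = 90.0) -> bool:
    """Alert only when usage is strictly above the threshold."""
    return used_percent > threshold_percent


# Hypothetical check; point `path` at the database volume in practice.
used = disk_usage_percent("/")
print(f"disk usage: {used:.1f}%, alert: {should_alert(used)}")
```

In production this check would normally live in the monitoring stack rather than in a script, but the threshold logic is the same.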
|
||||
|
||||
---
|
||||
|
||||
## Fishbone Diagram (Ishikawa)
|
||||
|
||||
### Method
|
||||
|
||||
Categorize contributing factors into major categories to identify root cause systematically.
|
||||
|
||||
**Categories** (6M's):
|
||||
1. **Method** (Process)
|
||||
2. **Machine** (Technology)
|
||||
3. **Material** (Inputs/Data)
|
||||
4. **Measurement** (Monitoring)
|
||||
5. **Mother Nature** (Environment)
|
||||
6. **Manpower** (People/Skills)
|
||||
|
||||
### Example
|
||||
|
||||
**Problem**: API performance degraded (p95: 200ms → 2000ms)
|
||||
|
||||
```
|
||||
                            API Performance Degraded
                                        │
        ┌───────────────────┬───────────┴───────────┬───────────────────┐
        │                   │                       │                   │
     METHOD              MACHINE                MATERIAL           MEASUREMENT
    (Process)         (Technology)               (Data)           (Monitoring)
        │                   │                       │                   │
    No memory         EventEmitter            Large dataset          No heap
  profiling in       listeners leak            processing           snapshots
   code review        (not removed)           (100K orders)         in CI/CD
        │                   │                       │                   │
 No long-running       Node.js v14            High traffic         No gradual
   load tests           (old GC)               spike (2x)            alerts
   (only 5min)                                                      (1h → 2h)
|
||||
```
|
||||
|
||||
**Root Causes Identified**:
|
||||
- **Machine**: EventEmitter leak (technical)
|
||||
- **Measurement**: No heap monitoring (monitoring gap)
|
||||
- **Method**: No memory profiling in code review (process gap)
|
||||
|
||||
---
|
||||
|
||||
## Timeline Reconstruction
|
||||
|
||||
### Method
|
||||
|
||||
Build a chronological timeline of events to separate causation from mere correlation.
|
||||
|
||||
**Steps**:
|
||||
1. Gather logs from all systems (with timestamps)
|
||||
2. Normalize to UTC
|
||||
3. Plot events chronologically
|
||||
4. Identify cause-and-effect relationships
|
||||
5. Find the triggering event
|
||||
|
||||
### Example
|
||||
|
||||
```
|
||||
12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
12:15:00 - Code deployment (v2.15.4)
12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
13:00:00 - p95 latency: 800ms (4x slower)
13:15:00 - Memory: 2.3GB (limit reached)
13:15:00 - Workers start OOMing
13:20:00 - p95 latency: 2000ms (10x slower)
13:30:00 - Alert fired: High latency
14:00:00 - Alert fired: High memory

CORRELATION:
- Deployment at 12:15 → Memory growth starts at 12:30
- Memory growth → Latency increase (correlated)
- TRIGGER: Code deployment v2.15.4

ACTION: Review code changes in v2.15.4
|
||||
```
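
Steps 1 to 3 of the method (gather, normalize to UTC, plot chronologically) are mechanical and easy to script. The sketch below assumes each system exports events as ISO 8601 timestamps with an explicit UTC offset. The function names, the sample events, and the calendar date are invented for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple

Event = Tuple[datetime, str, str]  # (UTC timestamp, source system, message)


def to_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return parsed.astimezone(timezone.utc)


def build_timeline(*sources: Iterable[Tuple[str, str, str]]) -> List[Event]:
    """Merge events from several systems into one list sorted by UTC time."""
    events = [(to_utc(ts), system, msg)
              for source in sources
              for ts, system, msg in source]
    return sorted(events, key=lambda event: event[0])


# Hypothetical inputs: the deploy tool logs in local time, metrics in UTC.
deploys = [("2024-01-01T13:15:00+01:00", "deploy", "v2.15.4 deployed")]
metrics = [
    ("2024-01-01T12:30:00+00:00", "metrics", "memory 720MB"),
    ("2024-01-01T12:00:00+00:00", "metrics", "memory 400MB"),
]
for when, system, message in build_timeline(deploys, metrics):
    print(f"{when:%H:%M:%S} [{system}] {message}")
```

Rejecting timestamps without an offset is a deliberate choice here: silently assuming a time zone is a common way to end up with a timeline whose cause and effect are in the wrong order.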
|
||||
|
||||
---
|
||||
|
||||
## Contributing Factors Analysis
|
||||
|
||||
### Levels of Causation
|
||||
|
||||
**Immediate Cause** (What happened):
|
||||
- Direct technical failure
|
||||
- Example: EventEmitter listeners not removed
|
||||
|
||||
**Underlying Conditions** (Why it was possible):
|
||||
- Missing safeguards
|
||||
- Example: No memory profiling in code review
|
||||
|
||||
**Latent Failures** (Systemic weaknesses):
|
||||
- Organizational/process gaps
|
||||
- Example: No developer training on memory management
|
||||
|
||||
### Example
|
||||
|
||||
**Incident**: Memory leak in production
|
||||
|
||||
```
|
||||
Immediate Cause:
|
||||
└─ Code: EventEmitter .on() used without .removeListener()
|
||||
|
||||
Underlying Conditions:
|
||||
├─ No code review caught the issue
|
||||
├─ No memory profiling in CI/CD
|
||||
└─ Short load tests (5min) didn't reveal gradual leak
|
||||
|
||||
Latent Failures:
|
||||
├─ Team lacks memory management training
|
||||
├─ No documentation on EventEmitter best practices
|
||||
└─ Culture of "ship fast, fix later"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Hypothesis Testing
|
||||
|
||||
### Method
|
||||
|
||||
Generate hypotheses, test with data, validate or reject.
|
||||
|
||||
**Process**:
|
||||
1. Observe symptoms
|
||||
2. Generate hypotheses (educated guesses)
|
||||
3. Design experiments to test each hypothesis
|
||||
4. Collect data
|
||||
5. Accept or reject hypothesis
|
||||
6. Repeat until root cause found
|
||||
|
||||
### Example
|
||||
|
||||
**Symptom**: Checkout API slow (p95: 3000ms)
|
||||
|
||||
**Hypothesis 1**: Database slow queries
|
||||
```
|
||||
Test: Check slow query log
|
||||
Data: All queries < 50ms ✅
|
||||
Result: REJECTED - database is fast
|
||||
```
|
||||
|
||||
**Hypothesis 2**: External API slow
|
||||
```
|
||||
Test: Distributed tracing (Jaeger)
|
||||
Data: Fraud check API: 2750ms (~92% of total time) 🚨
|
||||
Result: ACCEPTED - external API is bottleneck
|
||||
```
|
||||
|
||||
**Hypothesis 3**: Network latency
|
||||
```
|
||||
Test: curl timing breakdown
|
||||
Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
|
||||
Result: PARTIAL - transfer is slow (not DNS/connect)
|
||||
```
|
||||
|
||||
**Root Cause**: External fraud check API slow (blocking checkout)
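
The check behind Hypothesis 2 amounts to finding the span that accounts for the largest share of the request. A minimal sketch follows, assuming span durations have already been exported from the tracing tool. `slowest_span` is a hypothetical helper, and only the fraud-check figure and the 3000ms total come from the example; the other spans are made up.

```python
from typing import Dict, Tuple


def slowest_span(span_ms: Dict[str, float], total_ms: float) -> Tuple[str, float]:
    """Return the slowest span and its share of the total request time."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    name = max(span_ms, key=lambda key: span_ms[key])
    return name, span_ms[name] / total_ms


# Only "fraud-check" and the 3000ms total are from the example above.
spans = {"fraud-check": 2750.0, "inventory": 120.0, "database": 50.0}
name, share = slowest_span(spans, total_ms=3000.0)
print(f"{name}: {share:.1%} of request time")
```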
|
||||
|
||||
---
|
||||
|
||||
## Blameless RCA Principles
|
||||
|
||||
### Core Tenets
|
||||
|
||||
1. **Focus on Systems, Not People**
|
||||
- ❌ "Engineer made a mistake"
|
||||
- ✅ "Process didn't catch config error"
|
||||
|
||||
2. **Assume Good Intent**
|
||||
- Everyone did the best they could with information available
|
||||
- Blame discourages honesty and learning
|
||||
|
||||
3. **Multiple Contributing Factors**
|
||||
- Never a single cause
|
||||
- Usually 3-5 factors contribute
|
||||
|
||||
4. **Actionable Improvements**
|
||||
- Fix the system, not the person
|
||||
- Concrete action items with owners
|
||||
|
||||
### Example (Blameless vs Blame)
|
||||
|
||||
**Blameful (BAD)**:
|
||||
```
|
||||
Root Cause: Engineer Jane deployed code without testing
|
||||
Action Item: Remind Jane to test before deploying
|
||||
```
|
||||
|
||||
**Blameless (GOOD)**:
|
||||
```
|
||||
Root Cause: Deployment process allowed untested code to reach production
|
||||
Contributing Factors:
|
||||
1. No automated tests in CI/CD
|
||||
2. Manual deployment process (prone to human error)
|
||||
3. No staging environment validation
|
||||
|
||||
Action Items:
|
||||
1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
|
||||
2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
|
||||
3. Implement deployment checklist (Owner: Alex, Due: Dec 18)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Examples**: [Examples Index](../examples/INDEX.md) - RCA examples from real incidents
|
||||
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to perform RCA
|
||||
- **Templates**: [Postmortem Template](../templates/postmortem-template.md) - Structured RCA format
|
||||
|
||||
---
|
||||
|
||||
Return to [reference index](INDEX.md)
|
||||
76
skills/incident-response/templates/INDEX.md
Normal file
@@ -0,0 +1,76 @@
|
||||
# Incident Response Templates
|
||||
|
||||
Ready-to-use templates for incident timelines, blameless postmortems, and runbooks. Copy and fill in for your incidents.
|
||||
|
||||
## Available Templates
|
||||
|
||||
### Incident Timeline Template
|
||||
|
||||
**File**: [incident-timeline-template.md](incident-timeline-template.md)
|
||||
|
||||
Real-time incident tracking template:
|
||||
- Incident overview (ID, severity, impact)
|
||||
- Chronological timeline (minute-by-minute)
|
||||
- Role assignments (IC, Tech Lead, Comms)
|
||||
- Status updates
|
||||
- Resolution summary
|
||||
|
||||
**Use when**: Tracking an ongoing incident in real time
|
||||
|
||||
---
|
||||
|
||||
### Postmortem Template
|
||||
|
||||
**File**: [postmortem-template.md](postmortem-template.md)
|
||||
|
||||
Blameless postmortem template:
|
||||
- Executive summary
|
||||
- Timeline reconstruction
|
||||
- Root cause analysis (5 Whys)
|
||||
- Contributing factors
|
||||
- Action items with owners
|
||||
- Lessons learned
|
||||
|
||||
**Use when**: Documenting incident after resolution (within 24-48 hours)
|
||||
|
||||
---
|
||||
|
||||
### Runbook Template
|
||||
|
||||
**File**: [runbook-template.md](runbook-template.md)
|
||||
|
||||
Standard runbook structure:
|
||||
- Problem description
|
||||
- Diagnostic steps with commands
|
||||
- Mitigation procedures
|
||||
- Escalation paths
|
||||
- Success criteria
|
||||
|
||||
**Use when**: Creating a new runbook or updating an existing one
|
||||
|
||||
---
|
||||
|
||||
## Template Usage
|
||||
|
||||
**How to use**:
|
||||
1. Copy template to your documentation system
|
||||
2. Fill in all bracketed placeholders (for example `[INCIDENT TITLE]` or `[HH:MM]`)
|
||||
3. Remove optional sections if not applicable
|
||||
4. Share with team for review
|
||||
|
||||
**When to create**:
|
||||
- **Incident Timeline**: As soon as a SEV1/SEV2 is declared (real-time)
|
||||
- **Postmortem**: Within 24-48 hours of incident resolution
|
||||
- **Runbook**: After any new incident type or process improvement
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Examples**: [Examples Index](../examples/INDEX.md) - See completed examples
|
||||
- **Reference**: [Reference Index](../reference/INDEX.md) - RCA techniques, communication templates
|
||||
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
|
||||
|
||||
---
|
||||
|
||||
Return to [main agent](../incident-responder.md)
|
||||
147
skills/incident-response/templates/incident-timeline-template.md
Normal file
@@ -0,0 +1,147 @@
|
||||
# Incident Timeline: [INCIDENT TITLE]
|
||||
|
||||
**Incident ID**: INC-YYYY-MM-DD-XXX
|
||||
**Severity**: [SEV1 / SEV2 / SEV3]
|
||||
**Status**: [Investigating / Mitigating / Resolved / Monitoring]
|
||||
**Started**: [YYYY-MM-DD HH:MM UTC]
|
||||
|
||||
---
|
||||
|
||||
## Incident Overview
|
||||
|
||||
**Impact**:
|
||||
- Customer Impact: [All users / X% of users / Specific feature]
|
||||
- Services Affected: [List affected services]
|
||||
- Error Rate: [X%]
|
||||
- Revenue Impact: [$X estimated]
|
||||
|
||||
**Symptoms**:
|
||||
- [User-facing symptom 1]
|
||||
- [User-facing symptom 2]
|
||||
- [Metric: baseline → current]
|
||||
|
||||
---
|
||||
|
||||
## Team
|
||||
|
||||
**Incident Commander**: @[name]
|
||||
**Technical Lead**: @[name]
|
||||
**Communications Lead**: @[name]
|
||||
**Scribe**: @[name]
|
||||
**SMEs**: @[name1], @[name2]
|
||||
|
||||
**Channels**:
|
||||
- Slack: #incident-XXX
|
||||
- Zoom: [link]
|
||||
- Status Page: [link]
|
||||
|
||||
---
|
||||
|
||||
## Timeline
|
||||
|
||||
| Time (UTC) | Event | Action Taken | Owner | Status |
|------------|-------|--------------|-------|--------|
| [HH:MM] | [Alert fired / Issue detected] | [What was done] | @[name] | 🔴 Started |
| [HH:MM] | [IC joined] | [Declared severity, assigned roles] | @[IC] | 🔴 Investigating |
| [HH:MM] | [Discovery] | [What was found] | @[name] | 🔴 Investigating |
| [HH:MM] | [Root cause identified] | [What the root cause is] | @[name] | 🟡 Identified |
| [HH:MM] | [Mitigation started] | [What fix is being applied] | @[name] | 🟡 Mitigating |
| [HH:MM] | [Mitigation complete] | [Verification of fix] | @[name] | 🟢 Mitigated |
| [HH:MM] | [Incident resolved] | [All checks passing] | @[IC] | 🟢 Resolved |
|
||||
|
||||
**Total Duration**: [X] minutes/hours
|
||||
|
||||
---
|
||||
|
||||
## Status Updates
|
||||
|
||||
### Update #1 ([HH:MM UTC] - T+[X] min)
|
||||
|
||||
**Status**: [Investigating / Mitigating]
|
||||
**Root Cause**: [Known / Unknown - investigating X]
|
||||
**Current Actions**: [What team is doing]
|
||||
**Impact**: [Current impact status]
|
||||
**ETA**: [Estimated resolution time OR "Unknown"]
|
||||
**Next Update**: [Time]
|
||||
|
||||
### Update #2 ([HH:MM UTC] - T+[X] min)
|
||||
|
||||
[Same format as Update #1]
|
||||
|
||||
### Final Update ([HH:MM UTC] - T+[X] min)
|
||||
|
||||
**Status**: Resolved
|
||||
**Root Cause**: [Brief summary]
|
||||
**Fix Applied**: [What was done]
|
||||
**Impact**: Resolved
|
||||
**Monitoring**: [Ongoing monitoring period]
|
||||
|
||||
---
|
||||
|
||||
## Root Cause (Brief)
|
||||
|
||||
**Immediate Cause**: [What directly caused the issue]
|
||||
|
||||
**Contributing Factors**:
|
||||
1. [Factor 1]
|
||||
2. [Factor 2]
|
||||
3. [Factor 3]
|
||||
|
||||
---
|
||||
|
||||
## Resolution Summary
|
||||
|
||||
**Temporary Fix** (if applicable):
|
||||
- [What was done to quickly mitigate]
|
||||
- [When it was applied]
|
||||
|
||||
**Permanent Fix**:
|
||||
- [What was done for long-term solution]
|
||||
- [When it was applied]
|
||||
|
||||
**Verification**:
|
||||
- [How we confirmed the fix worked]
|
||||
- [Metrics that returned to normal]
|
||||
|
||||
---
|
||||
|
||||
## Communications
|
||||
|
||||
### Internal
|
||||
|
||||
- [HH:MM] - SEV1 declared in #incidents
|
||||
- [HH:MM] - Update #1 posted
|
||||
- [HH:MM] - Update #2 posted
|
||||
- [HH:MM] - Resolution announced
|
||||
|
||||
### External
|
||||
|
||||
- [HH:MM] - Status page: "Investigating"
|
||||
- [HH:MM] - Status page: "Identified"
|
||||
- [HH:MM] - Status page: "Monitoring"
|
||||
- [HH:MM] - Status page: "Resolved"
|
||||
- [HH:MM] - Customer email sent (if applicable)
|
||||
|
||||
### Executive
|
||||
|
||||
- [HH:MM] - Initial notification to CTO/CEO (SEV1 only)
|
||||
- [HH:MM] - Resolution summary sent
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Full postmortem scheduled: [Date/Time]
|
||||
- [ ] Action items created in Linear
|
||||
- [ ] Runbook updated with new learnings
|
||||
- [ ] Monitoring improvements identified
|
||||
|
||||
---
|
||||
|
||||
## Notes
|
||||
|
||||
[Any additional context, observations, or learnings captured during the incident]
|
||||
|
||||
---
|
||||
|
||||
Return to [templates index](INDEX.md)
|
||||
187
skills/incident-response/templates/postmortem-template.md
Normal file
@@ -0,0 +1,187 @@
|
||||
# Postmortem: [INCIDENT TITLE]
|
||||
|
||||
**Date**: [YYYY-MM-DD]
|
||||
**Incident ID**: INC-YYYY-MM-DD-XXX
|
||||
**Severity**: [SEV1 / SEV2 / SEV3]
|
||||
**Author**: [Name]
|
||||
**Reviewers**: [Names]
|
||||
**Status**: [Draft / Final]
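The `INC-YYYY-MM-DD-XXX` identifier above can be checked mechanically before a postmortem is filed. The sketch below is illustrative only: the template does not say what `XXX` stands for, so it is assumed here to be a three-digit sequence number, and the function name `valid_incident_id` is hypothetical.

```bash
#!/usr/bin/env bash

# Return success if the argument looks like INC-YYYY-MM-DD-XXX.
# Assumption: XXX is a three-digit sequence number (the template does not define it).
valid_incident_id() {
  local re='^INC-[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])-[0-9]{3}$'
  [[ $1 =~ $re ]]
}

# Hypothetical ID, used only to show the call.
if valid_incident_id "INC-2024-03-15-001"; then
  echo "incident ID format OK"
else
  echo "incident ID format invalid"
fi
```

The pattern only checks the shape of the date (month 01-12, day 01-31); it does not reject impossible dates such as February 31.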
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
**What Happened**: [2-3 sentence summary of the incident]
|
||||
|
||||
**Impact**:
|
||||
- **Duration**: [X] minutes/hours
|
||||
- **Users Affected**: [All / X% / specific group]
|
||||
- **Revenue Impact**: [$X estimated loss]
|
||||
- **SLA Breach**: [Yes/No - details]
|
||||
|
||||
**Root Cause**: [1 sentence root cause]
|
||||
|
||||
**Resolution**: [1 sentence how it was fixed]
|
||||
|
||||
**Key Actions**: [3 most important action items]
|
||||
|
||||
---
|
||||
|
||||
## Timeline
|
||||
|
||||
| Time (UTC) | Event | Notes |
|
||||
|------------|-------|-------|
|
||||
| [HH:MM] | [Event] | [Context] |
|
||||
| [HH:MM] | [Event] | [Context] |
|
||||
| [HH:MM] | [Event] | [Context] |
|
||||
|
||||
**Duration Breakdown**:
|
||||
- Detection → Identification: [X] minutes
|
||||
- Identification → Mitigation: [X] minutes
|
||||
- Mitigation → Full Resolution: [X] minutes
|
||||
- **Total MTTR**: [X] minutes
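The duration breakdown can be derived directly from the `HH:MM` timestamps in the timeline table. The following is a minimal sketch, assuming all timestamps are in UTC and the incident spans less than 24 hours; the helper names `to_minutes` and `elapsed_minutes` and the sample timestamps are hypothetical.

```bash
#!/usr/bin/env bash

# Convert an HH:MM timestamp to minutes since midnight.
# The 10# prefix forces base 10 so "08" and "09" are not parsed as octal.
to_minutes() {
  local hh=${1%%:*} mm=${1##*:}
  echo $(( 10#$hh * 60 + 10#$mm ))
}

# Minutes elapsed between two HH:MM timestamps.
# Assumption: if the end is earlier than the start, the incident crossed midnight once.
elapsed_minutes() {
  local start end diff
  start=$(to_minutes "$1")
  end=$(to_minutes "$2")
  diff=$(( end - start ))
  if (( diff < 0 )); then
    diff=$(( diff + 1440 ))
  fi
  echo "$diff"
}

# Hypothetical timestamps (UTC), as they might appear in a filled-in timeline.
detected="14:05"; identified="14:20"; mitigated="14:35"; resolved="14:50"

echo "Detection -> Identification: $(elapsed_minutes "$detected" "$identified") minutes"
echo "Identification -> Mitigation: $(elapsed_minutes "$identified" "$mitigated") minutes"
echo "Mitigation -> Full Resolution: $(elapsed_minutes "$mitigated" "$resolved") minutes"
echo "Total MTTR: $(elapsed_minutes "$detected" "$resolved") minutes"
```

With these sample values each phase takes 15 minutes and the total MTTR is 45 minutes.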
|
||||
|
||||
---
|
||||
|
||||
## Root Cause Analysis (5 Whys)
|
||||
|
||||
**Why 1**: Why did [problem] happen?
|
||||
→ [Answer based on facts]
|
||||
|
||||
**Why 2**: Why did [previous answer] happen?
|
||||
→ [Answer based on facts]
|
||||
|
||||
**Why 3**: Why did [previous answer] happen?
|
||||
→ [Answer based on facts]
|
||||
|
||||
**Why 4**: Why did [previous answer] happen?
|
||||
→ [Answer based on facts]
|
||||
|
||||
**Why 5**: Why did [previous answer] happen?
|
||||
→ [Answer based on facts]
|
||||
|
||||
**ROOT CAUSE**: [Final systemic issue identified]
|
||||
|
||||
---
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
### Immediate Cause
|
||||
[Direct technical cause of the incident]
|
||||
|
||||
### Underlying Conditions
|
||||
1. [Condition that enabled the immediate cause]
|
||||
2. [Condition that enabled the immediate cause]
|
||||
|
||||
### Latent Failures
|
||||
1. [Organizational/process weakness]
|
||||
2. [Organizational/process weakness]
|
||||
|
||||
---
|
||||
|
||||
## What Went Well ✅
|
||||
|
||||
1. [Something that worked well during response]
|
||||
2. [Something that worked well during response]
|
||||
3. [Something that worked well during response]
|
||||
|
||||
---
|
||||
|
||||
## What Went Wrong ❌
|
||||
|
||||
1. [Something that didn't work or was missing]
|
||||
2. [Something that didn't work or was missing]
|
||||
3. [Something that didn't work or was missing]
|
||||
|
||||
---
|
||||
|
||||
## Action Items
|
||||
|
||||
| Priority | Action | Owner | Due Date | Status | Link |
|
||||
|----------|--------|-------|----------|--------|------|
|
||||
| P0 | [Critical - do immediately] | @[name] | [Date] | [ ] | [Link] |
|
||||
| P1 | [Important - do within 1 week] | @[name] | [Date] | [ ] | [Link] |
|
||||
| P2 | [Nice to have - do within 1 month] | @[name] | [Date] | [ ] | [Link] |
|
||||
|
||||
### P0 Actions (Immediate)
|
||||
- [ ] [Action 1] - @[owner] - [due date]
|
||||
- [ ] [Action 2] - @[owner] - [due date]
|
||||
|
||||
### P1 Actions (Short-Term)
|
||||
- [ ] [Action 1] - @[owner] - [due date]
|
||||
- [ ] [Action 2] - @[owner] - [due date]
|
||||
|
||||
### P2 Actions (Long-Term)
|
||||
- [ ] [Action 1] - @[owner] - [due date]
|
||||
- [ ] [Action 2] - @[owner] - [due date]
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### Technical Learnings
|
||||
1. [Technical insight gained]
|
||||
2. [Technical insight gained]
|
||||
|
||||
### Process Learnings
|
||||
1. [Process improvement identified]
|
||||
2. [Process improvement identified]
|
||||
|
||||
### Communication Learnings
|
||||
1. [Communication improvement identified]
|
||||
2. [Communication improvement identified]
|
||||
|
||||
---
|
||||
|
||||
## Prevention Measures
|
||||
|
||||
### Immediate (Completed)
|
||||
- [x] [What was done same day]
|
||||
- [x] [What was done same day]
|
||||
|
||||
### Short-Term (1-2 weeks)
|
||||
- [ ] [What will be done soon]
|
||||
- [ ] [What will be done soon]
|
||||
|
||||
### Long-Term (1-3 months)
|
||||
- [ ] [What will be done eventually]
|
||||
- [ ] [What will be done eventually]
|
||||
|
||||
---
|
||||
|
||||
## Related Incidents
|
||||
|
||||
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
|
||||
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
|
||||
|
||||
---
|
||||
|
||||
## Appendix
|
||||
|
||||
### Relevant Logs
|
||||
```
|
||||
[Paste key log entries]
|
||||
```
|
||||
|
||||
### Metrics/Graphs
|
||||
[Links to Grafana dashboards, screenshots]
|
||||
|
||||
### Commands Run
|
||||
```bash
|
||||
[Commands that were used during investigation/mitigation]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Sign-Off
|
||||
|
||||
**Incident Commander**: [Name] - [Date]
|
||||
**Technical Lead**: [Name] - [Date]
|
||||
**Engineering Manager**: [Name] - [Date]
|
||||
|
||||
**Postmortem Review**: [Date/Time]
|
||||
**Attendees**: [List of people who reviewed]
|
||||
|
||||
---
|
||||
|
||||
Return to [templates index](INDEX.md)
|
||||
255
skills/incident-response/templates/runbook-template.md
Normal file
@@ -0,0 +1,255 @@
|
||||
# Runbook: [PROBLEM TITLE]
|
||||
|
||||
**Alert**: [Alert name that triggers this runbook]
|
||||
**Severity**: [SEV1 / SEV2 / SEV3]
|
||||
**Owner**: [Team name]
|
||||
**Last Updated**: [YYYY-MM-DD]
|
||||
**Last Tested**: [YYYY-MM-DD]
|
||||
|
||||
---
|
||||
|
||||
## Problem Description
|
||||
|
||||
[2-3 sentence description of what this problem is]
|
||||
|
||||
**Symptoms**:
|
||||
- [Observable symptom 1 - what users/operators see]
|
||||
- [Observable symptom 2]
|
||||
- [Observable symptom 3]
|
||||
|
||||
**Impact**:
|
||||
- **Customer Impact**: [What users experience]
|
||||
- **Business Impact**: [Revenue, SLA, compliance]
|
||||
- **Affected Services**: [List of services]
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
**Required Access**:
|
||||
- [ ] Kubernetes cluster access (`kubectl` configured)
|
||||
- [ ] Database access (PlanetScale/PostgreSQL)
|
||||
- [ ] Cloudflare Workers access (`wrangler` configured)
|
||||
- [ ] Monitoring access (Grafana, Datadog)
|
||||
|
||||
**Required Tools**:
|
||||
- [ ] `kubectl` v1.28+
|
||||
- [ ] `wrangler` v3+
|
||||
- [ ] `pscale` CLI
|
||||
- [ ] `curl`, `jq`
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Step 1: [Check Initial Symptom]
|
||||
|
||||
**What to check**: [Describe what this step verifies]
|
||||
|
||||
```bash
|
||||
# Command to run
|
||||
[command]
|
||||
|
||||
# Expected output (healthy):
|
||||
[what you should see if everything is fine]
|
||||
|
||||
# Problem indicator:
|
||||
[what you see if there's an issue]
|
||||
```
|
||||
|
||||
**Interpretation**:
|
||||
- If [condition], then [conclusion]
|
||||
- If [condition], then go to Step 2
|
||||
|
||||
---
|
||||
|
||||
### Step 2: [Verify Root Cause]
|
||||
|
||||
```bash
|
||||
# Command to run
|
||||
[command]
|
||||
|
||||
# Look for:
|
||||
[what to look for in the output]
|
||||
```
|
||||
|
||||
**Possible Causes**:
|
||||
1. **[Cause 1]**: [How to identify] → Go to [Mitigation Option A](#option-a-cause-1)
|
||||
2. **[Cause 2]**: [How to identify] → Go to [Mitigation Option B](#option-b-cause-2)
|
||||
3. **[Cause 3]**: [How to identify] → Escalate to [team]
|
||||
|
||||
---
|
||||
|
||||
### Step 3: [Additional Verification]
|
||||
|
||||
[Only if needed for complex scenarios]
|
||||
|
||||
```bash
|
||||
# Commands
|
||||
[commands]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Mitigation
|
||||
|
||||
### Option A: [Cause 1]
|
||||
|
||||
**When to use**: [Conditions when this mitigation applies]
|
||||
|
||||
**Steps**:
|
||||
1. [Action 1]
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
|
||||
2. [Action 2]
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
|
||||
3. [Action 3]
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
|
||||
**Verification**:
|
||||
```bash
|
||||
# Check that mitigation worked
|
||||
[verification command]
|
||||
|
||||
# Expected result:
|
||||
[what you should see]
|
||||
```
|
||||
|
||||
**If mitigation fails**: [What to do next - usually escalate]
|
||||
|
||||
---
|
||||
|
||||
### Option B: [Cause 2]
|
||||
|
||||
[Same format as Option A]
|
||||
|
||||
---
|
||||
|
||||
## Rollback
|
||||
|
||||
**If mitigation makes things worse**:
|
||||
|
||||
```bash
|
||||
# Rollback command 1
|
||||
[command to undo action 1]
|
||||
|
||||
# Rollback command 2
|
||||
[command to undo action 2]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Verification & Monitoring
|
||||
|
||||
### Health Checks
|
||||
|
||||
After mitigation, verify these metrics return to normal:
|
||||
|
||||
```bash
|
||||
# Check 1: Service health
|
||||
curl -sS -i https://api.greyhaven.io/health
|
||||
# Expected: HTTP 200, {"status": "healthy"}
|
||||
|
||||
# Check 2: Error rate
|
||||
# Grafana: Error Rate dashboard
|
||||
# Expected: <0.1%
|
||||
|
||||
# Check 3: Latency
|
||||
# Grafana: API Latency dashboard
|
||||
# Expected: p95 <500ms
|
||||
```
|
||||
|
||||
### Monitoring Period
|
||||
|
||||
Monitor for **[time period]** after mitigation:
|
||||
- [ ] Error rate stable (<0.1%)
|
||||
- [ ] Latency normal (p95 <500ms)
|
||||
- [ ] No new alerts
|
||||
- [ ] User reports resolved
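The two numeric criteria in this checklist (error rate below 0.1%, p95 latency below 500 ms) can be checked in a script once the values have been read off the dashboards. This is a sketch under stated assumptions: the function name `within_thresholds` is hypothetical, and the inputs are supplied by hand rather than fetched from Grafana, since the template does not specify a metrics API.

```bash
#!/usr/bin/env bash

# Return success when both metrics are inside the runbook's thresholds.
#   $1: error rate as a percentage (e.g. 0.05 means 0.05%)
#   $2: p95 latency in milliseconds
# awk is used because bash arithmetic does not handle decimals.
within_thresholds() {
  local error_rate_pct=$1 p95_ms=$2
  awk -v e="$error_rate_pct" -v p="$p95_ms" 'BEGIN { exit !(e < 0.1 && p < 500) }'
}

# Hypothetical readings taken from the dashboards during the monitoring period.
if within_thresholds 0.05 320; then
  echo "metrics within thresholds"
else
  echo "metrics outside thresholds: keep monitoring or escalate"
fi
```

Both comparisons are strict, matching the `<0.1%` and `<500ms` wording above, so a reading of exactly 0.1% or 500 ms counts as a failure.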
|
||||
|
||||
---
|
||||
|
||||
## Escalation
|
||||
|
||||
**Escalate if**:
|
||||
- Mitigation doesn't work after [X] minutes
|
||||
- Root cause unclear after diagnosis
|
||||
- Issue is [severity] and unresolved after [X] minutes
|
||||
- Multiple services affected
|
||||
|
||||
**Escalation Path**:
|
||||
```
|
||||
0-15 min: @oncall-engineer
|
||||
15-30 min: @team-lead
|
||||
30-60 min: @engineering-manager
|
||||
60+ min: @vp-engineering (SEV1 only)
|
||||
```
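The escalation path can also be expressed as a small lookup, which is handy in paging scripts or chat-bot reminders. The sketch below makes two assumptions the template leaves open: each window includes its lower bound (so minute 15 goes to `@team-lead`), and non-SEV1 incidents stay with `@engineering-manager` past 60 minutes. The function name `escalation_contact` is hypothetical.

```bash
#!/usr/bin/env bash

# Print who should be engaged, given minutes since the incident started.
#   $1: elapsed minutes (integer)
#   $2: severity, defaults to SEV1
# Assumption: windows are lower-bound inclusive; the 60+ tier applies to SEV1 only.
escalation_contact() {
  local minutes=$1 severity=${2:-SEV1}
  if (( minutes < 15 )); then
    echo "@oncall-engineer"
  elif (( minutes < 30 )); then
    echo "@team-lead"
  elif (( minutes < 60 )) || [[ $severity != "SEV1" ]]; then
    echo "@engineering-manager"
  else
    echo "@vp-engineering"
  fi
}

# Example calls with hypothetical elapsed times.
escalation_contact 10          # @oncall-engineer
escalation_contact 45 SEV2     # @engineering-manager
escalation_contact 90 SEV1     # @vp-engineering
```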
|
||||
|
||||
**Escalation Contact**:
|
||||
- Team Slack: #[team-channel]
|
||||
- PagerDuty: [escalation policy]
|
||||
- Oncall: @[oncall-alias]
|
||||
|
||||
---
|
||||
|
||||
## Common Mistakes
|
||||
|
||||
### Mistake 1: [Common Error]
|
||||
|
||||
**Wrong**:
|
||||
```bash
|
||||
[incorrect command or approach]
|
||||
```
|
||||
|
||||
**Correct**:
|
||||
```bash
|
||||
[correct command or approach]
|
||||
```
|
||||
|
||||
### Mistake 2: [Common Error]
|
||||
|
||||
[Description and correction]
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- **Alert Definition**: [Link to alert config]
|
||||
- **Monitoring Dashboard**: [Link to Grafana]
|
||||
- **Architecture Doc**: [Link to system architecture]
|
||||
- **Past Incidents**: [Links to similar incidents]
|
||||
- **Postmortems**: [Links to related postmortems]
|
||||
|
||||
---
|
||||
|
||||
## Changelog
|
||||
|
||||
| Date | Author | Changes |
|
||||
|------|--------|---------|
|
||||
| [YYYY-MM-DD] | @[name] | Initial creation |
|
||||
| [YYYY-MM-DD] | @[name] | Updated [what changed] |
|
||||
|
||||
---
|
||||
|
||||
## Testing Notes
|
||||
|
||||
**Last Test Date**: [YYYY-MM-DD]
|
||||
**Test Result**: [Pass / Fail]
|
||||
**Notes**: [What was learned from testing]
|
||||
|
||||
**How to Test**:
|
||||
1. [Step to simulate failure in staging]
|
||||
2. [Follow runbook]
|
||||
3. [Verify recovery]
|
||||
4. [Document time taken and any issues]
|
||||
|
||||
---
|
||||
|
||||
Return to [templates index](INDEX.md)
|
||||