Initial commit

2025-11-29 18:23:58 +08:00
commit 60a3fe4d9d
31 changed files with 4110 additions and 0 deletions
--- a/commands/workflows/incident-response.md
+++ b/commands/workflows/incident-response.md
@@ -0,0 +1,85 @@
+---
+model: claude-opus-4-1
+---
+
+Respond to production incidents with coordinated agent expertise for rapid resolution:
+
+[Extended thinking: This workflow handles production incidents with urgency and precision. Multiple specialized agents work together to identify root causes, implement fixes, and prevent recurrence.]
+
+## Phase 1: Immediate Response
+
+### 1. Incident Assessment
+- Use Task tool with subagent_type="incident-responder"
+- Prompt: "URGENT: Assess production incident: $ARGUMENTS. Determine severity, impact, and immediate mitigation steps. Time is critical."
+- Output: Incident severity, impact assessment, immediate actions
+
+### 2. Initial Troubleshooting
+- Use Task tool with subagent_type="devops-troubleshooter"
+- Prompt: "Investigate production issue: $ARGUMENTS. Check logs, metrics, recent deployments, and system health. Identify potential root causes."
+- Output: Initial findings, suspicious patterns, potential causes
+
+## Phase 2: Root Cause Analysis
+
+### 3. Deep Debugging
+- Use Task tool with subagent_type="debugger"
+- Prompt: "Debug production issue: $ARGUMENTS using findings from initial investigation. Analyze stack traces, reproduce issue if possible, identify exact root cause."
+- Output: Root cause identification, reproduction steps, debug analysis
+
+### 4. Performance Analysis (if applicable)
+- Use Task tool with subagent_type="performance-engineer"
+- Prompt: "Analyze performance aspects of incident: $ARGUMENTS. Check for resource exhaustion, bottlenecks, or performance degradation."
+- Output: Performance metrics, resource analysis, bottleneck identification
+
+### 5. Database Investigation (if applicable)
+- Use Task tool with subagent_type="database-optimizer"
+- Prompt: "Investigate database-related aspects of incident: $ARGUMENTS. Check for locks, slow queries, connection issues, or data corruption."
+- Output: Database health report, query analysis, data integrity check
+
+## Phase 3: Resolution Implementation
+
+### 6. Fix Development
+- Use Task tool with subagent_type="backend-architect"
+- Prompt: "Design and implement fix for incident: $ARGUMENTS based on root cause analysis. Ensure fix is safe for immediate production deployment."
+- Output: Fix implementation, safety analysis, rollout strategy
+
+### 7. Emergency Deployment
+- Use Task tool with subagent_type="deployment-engineer"
+- Prompt: "Deploy emergency fix for incident: $ARGUMENTS. Implement with minimal risk, include rollback plan, and monitor deployment closely."
+- Output: Deployment execution, rollback procedures, monitoring setup
+
+## Phase 4: Stabilization and Prevention
+
+### 8. System Stabilization
+- Use Task tool with subagent_type="devops-troubleshooter"
+- Prompt: "Stabilize system after incident fix: $ARGUMENTS. Monitor system health, clear any backlogs, and ensure full recovery."
+- Output: System health report, recovery metrics, stability confirmation
+
+### 9. Security Review (if applicable)
+- Use Task tool with subagent_type="security-auditor"
+- Prompt: "Review security implications of incident: $ARGUMENTS. Check for any security breaches, data exposure, or vulnerabilities exploited."
+- Output: Security assessment, breach analysis, hardening recommendations
+
+## Phase 5: Post-Incident Activities
+
+### 10. Monitoring Enhancement
+- Use Task tool with subagent_type="devops-troubleshooter"
+- Prompt: "Enhance monitoring to prevent recurrence of: $ARGUMENTS. Add alerts, improve observability, and set up early warning systems."
+- Output: New monitoring rules, alert configurations, observability improvements
+
+### 11. Test Coverage
+- Use Task tool with subagent_type="test-automator"
+- Prompt: "Create tests to prevent regression of incident: $ARGUMENTS. Include unit tests, integration tests, and chaos engineering scenarios."
+- Output: Test implementations, regression prevention, chaos tests
+
+### 12. Documentation
+- Use Task tool with subagent_type="incident-responder"
+- Prompt: "Document incident postmortem for: $ARGUMENTS. Include timeline, root cause, impact, resolution, and lessons learned. No blame, focus on improvement."
+- Output: Postmortem document, action items, process improvements
+
+## Coordination Notes
+- Speed is critical in early phases - parallel agent execution where possible
+- Communication between agents must be clear and rapid
+- All changes must be safe and reversible
+- Document everything for postmortem analysis
+
+Production incident: $ARGUMENTS