---
name: chaos-engineering-principles
description: Use when starting chaos engineering, designing fault injection experiments, choosing chaos tools, testing system resilience, or recovering from chaos incidents - provides hypothesis-driven testing, blast radius control, and anti-patterns for safe chaos
---

# Chaos Engineering Principles

## Overview

**Core principle:** Chaos engineering validates resilience through controlled experiments, not random destruction.

**Rule:** Start in staging, with monitoring, with rollback, and with a small blast radius. No exceptions.

## When NOT to Do Chaos

Don't run chaos experiments if ANY of the following applies:
- ❌ No comprehensive monitoring (APM, metrics, logs, alerts)
- ❌ No automated rollback capability
- ❌ No baseline metrics documented
- ❌ No incident response team available
- ❌ System already unstable (fix stability first)
- ❌ No staging environment to practice in

**Fix these prerequisites BEFORE chaos testing.**

## Tool Selection Decision Tree

| Your Constraint | Choose | Why |
|----------------|--------|-----|
| Kubernetes-native, CNCF preference | **LitmusChaos** | Cloud-native, operator-based, excellent K8s integration |
| Kubernetes-focused, visualization needs | **Chaos Mesh** | Fine-grained control, dashboards, low overhead |
| Want managed service, quick start | **Gremlin** | Commercial, guided experiments, built-in best practices |
| Vendor-neutral, maximum flexibility | **Chaos Toolkit** | Open source, plugin ecosystem, any infrastructure |
| AWS-specific, cost-sensitive | **AWS FIS** | Native AWS integration, pay-per-experiment |

**For most teams:** Chaos Toolkit (flexible, free) or Gremlin (fast, managed)

## Prerequisites Checklist

Before your FIRST experiment:

**Monitoring (Required):**
- [ ] Real-time dashboards for key metrics (latency, error rate, throughput)
- [ ] Distributed tracing for request flows
- [ ] Log aggregation with timeline correlation
- [ ] Alerts configured with thresholds

**Rollback (Required):**
- [ ] Automated rollback based on metrics (e.g., error rate > 5% → abort; see the watchdog sketch after this checklist)
- [ ] Manual kill switch everyone can activate
- [ ] Rollback tested and documented (< 30 sec recovery)

**Baseline (Required):**
- [ ] Documented normal metrics (P50/P95/P99 latency, error rate %)
- [ ] Known dependencies and critical paths
- [ ] System architecture diagram

**Team (Required):**
- [ ] Designated observer monitoring the experiment
- [ ] On-call engineer available
- [ ] Communication channel established (war room, Slack)
- [ ] Post-experiment debrief scheduled

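A minimal sketch of the automated-rollback item above: a watchdog that polls an error-rate metric and aborts when the 5% threshold is crossed or when a manual kill-switch file appears. The metrics URL, its JSON shape, and the `abort_experiment` body are assumptions; wire them to your own APM and chaos tool.

```python
import json
import pathlib
import subprocess
import time
import urllib.request

METRICS_URL = "http://metrics.internal/api/error-rate"  # assumption: your APM endpoint
KILL_SWITCH = pathlib.Path("/tmp/chaos-kill-switch")    # manual abort: anyone can `touch` this file
ERROR_RATE_ABORT = 0.05                                 # abort when error rate > 5%

def current_error_rate() -> float:
    """Fetch the current error rate (0.0-1.0); the JSON shape is an assumption."""
    with urllib.request.urlopen(METRICS_URL, timeout=2) as resp:
        return float(json.load(resp)["error_rate"])

def abort_experiment() -> None:
    """Assumption: replace with your chaos tool's stop/rollback command."""
    subprocess.run(["./stop-chaos.sh"], check=True)

def watchdog(poll_seconds: float = 5.0) -> None:
    """Run for the whole experiment window; returns only after triggering an abort."""
    while True:
        try:
            rate = current_error_rate()
        except OSError:
            rate = 1.0  # can't see metrics: treat as worst case rather than flying blind
        if KILL_SWITCH.exists() or rate > ERROR_RATE_ABORT:
            abort_experiment()
            return
        time.sleep(poll_seconds)
```

Start the watchdog before injecting anything; an unreachable metrics endpoint counts as an abort condition, not a reason to continue.
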
## Anti-Patterns Catalog

### ❌ Production First Chaos
**Symptom:** "Let's start chaos testing in production to see what breaks"

**Why bad:** No practice, no muscle memory, production incidents guaranteed

**Fix:** Run 5-10 experiments in staging FIRST. Graduate to production only after proving that experiments run as designed, rollback works, and the team can execute its response

---

### ❌ Chaos Without Monitoring
**Symptom:** "We injected latency but we're not sure what happened"

**Why bad:** Blind chaos = no learning. You can't validate resilience without seeing system behavior

**Fix:** Set up comprehensive monitoring BEFORE the first experiment. You must be able to answer "What changed?" within 30 seconds

---

### ❌ Unlimited Blast Radius
**Symptom:** Affecting 100% of traffic/all services on the first run

**Why bad:** Cascading failures, actual outages, customer impact

**Fix:** Start at 0.1-1% of traffic. Progression: 0.1% → 1% → 5% → 10% → (stop or 50%). Validate each step before expanding

---

### ❌ Chaos Without Rollback
**Symptom:** "The experiment broke everything and we can't stop it"

**Why bad:** Chaos becomes a real incident, 2+ hour recovery, lost trust

**Fix:** Define automated abort criteria (error rate threshold, latency threshold, manual kill switch). Test rollback before injecting failures

---

### ❌ Random Chaos (No Hypothesis)
**Symptom:** "Let's inject some failures and see what happens"

**Why bad:** No learning objective, can't validate resilience, wasted time

**Fix:** Every experiment needs a hypothesis: "System will [expected behavior] when [failure injected]"

## Failure Types Catalog

Priority order for microservices:

| Failure Type | Priority | Why Test This | Example |
|--------------|----------|---------------|---------|
| **Network Latency** | HIGH | Most common production issue | 500ms delay service A → B |
| **Service Timeout** | HIGH | Tests circuit breakers, retry logic | Service B unresponsive |
| **Connection Loss** | HIGH | Tests failover, graceful degradation | TCP connection drops |
| **Resource Exhaustion** | MEDIUM | Tests resource limits, scaling | Memory limit, connection pool full |
| **Packet Loss** | MEDIUM | Tests retry strategies | 1-10% packet loss |
| **DNS Failure** | MEDIUM | Tests service discovery resilience | DNS resolution delays |
| **Cache Failure** | MEDIUM | Tests fallback behavior | Redis down |
| **Database Errors** | LOW (initially) | High risk - test after the basics work | Connection refused, query timeout |

**Start with network latency** - it is the safest, most informative, and easiest to roll back.

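For a sense of what latency injection looks like at the lowest level (most chaos tools wrap something like this), Linux `tc netem` adds egress delay on a network interface. A hedged sketch, assuming a Linux host, root privileges, and an `eth0` interface name; the cleanup runs in a `finally` block so the delay cannot outlive the experiment.

```python
import subprocess
import time

INTERFACE = "eth0"  # assumption: adjust to your NIC

def inject_latency(delay_ms: int, duration_s: int) -> None:
    """Add fixed egress delay on INTERFACE for duration_s seconds, then always clean up."""
    add = ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
           "delay", f"{delay_ms}ms"]
    remove = ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"]
    subprocess.run(add, check=True)   # requires root
    try:
        time.sleep(duration_s)        # observe dashboards during this window
    finally:
        subprocess.run(remove, check=True)

if __name__ == "__main__":
    inject_latency(delay_ms=500, duration_s=120)  # 500ms for 2 minutes
```
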
## Experiment Template

Use this for every chaos experiment:

**1. Hypothesis**
"If [failure injected], system will [expected behavior], and [metric] will remain [threshold]"

Example: "If service-payment experiences 2s latency, circuit breaker will open within 10s, and P99 latency will stay < 500ms"

**2. Baseline Metrics**
- Current P50/P95/P99 latency:
- Current error rate:
- Current throughput:

**3. Experiment Config**
- Failure type: [latency / packet loss / service down / etc.]
- Target: [specific service / % of traffic]
- Blast radius: [0.1% traffic, single region, canary pods]
- Duration: [2-5 minutes initial]
- Abort criteria: [error rate > 5% OR P99 > 1s OR manual stop]

**4. Execution**
- Observer: [name] monitoring dashboards
- Runner: [name] executing experiment
- Kill switch: [procedure]
- Start time: [timestamp]

**5. Observation**
- What happened vs hypothesis:
- Actual metrics during chaos:
- System behavior notes:

**6. Validation**
- ✓ Hypothesis validated / ✗ Hypothesis failed
- Unexpected findings:
- Action items:

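One way to keep the template honest is to store each run as structured data instead of free-form notes. A minimal sketch using a dataclass whose fields mirror the six sections above; every value in the example is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    # 1. Hypothesis
    hypothesis: str
    # 2. Baseline metrics (illustrative units: ms and fraction)
    baseline_p99_ms: float
    baseline_error_rate: float
    # 3. Experiment config
    failure_type: str
    target: str
    blast_radius_pct: float
    duration_s: int
    abort_error_rate: float
    abort_p99_ms: float
    # 4. Execution
    observer: str = ""
    runner: str = ""
    # 5./6. Observation and validation, filled in after the run
    observations: list[str] = field(default_factory=list)
    hypothesis_validated: bool | None = None

payment_latency = ChaosExperiment(
    hypothesis=("If service-payment experiences 2s latency, the circuit breaker "
                "opens within 10s and P99 stays < 500ms"),
    baseline_p99_ms=180.0,
    baseline_error_rate=0.002,
    failure_type="latency",
    target="service-payment",
    blast_radius_pct=0.1,
    duration_s=120,
    abort_error_rate=0.05,
    abort_p99_ms=1000.0,
)
```
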
## Blast Radius Progression

Safe scaling path:

| Step | Traffic Affected | Duration | Abort If |
|------|------------------|----------|----------|
| **1. Staging** | 100% staging | 5 min | Any production impact |
| **2. Canary** | 0.1% production | 2 min | Error rate > 1% |
| **3. Small** | 1% production | 5 min | Error rate > 2% |
| **4. Medium** | 5% production | 10 min | Error rate > 5% |
| **5. Large** | 10% production | 15 min | Error rate > 5% |

**Never skip steps.** Validate each step before expanding.

**Stop at 10-20% for most experiments** - there is no need to chaos-test 100% of production traffic.

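A sketch of the progression enforced in code, so steps cannot be skipped. `set_traffic_percent`, `error_rate`, and `clear_injection` are placeholder callables for your chaos tool and monitoring; thresholds mirror the production rows of the table above.

```python
import time

# (traffic %, duration in seconds, abort error-rate threshold) per table above
STEPS = [
    (0.1, 120, 0.01),   # canary
    (1.0, 300, 0.02),   # small
    (5.0, 600, 0.05),   # medium
    (10.0, 900, 0.05),  # large
]

def run_progression(set_traffic_percent, error_rate, clear_injection) -> bool:
    """Walk the blast-radius steps in order; abort on the first threshold breach."""
    for pct, duration_s, abort_rate in STEPS:
        set_traffic_percent(pct)
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if error_rate() > abort_rate:
                clear_injection()
                return False      # aborted: fix findings before retrying
            time.sleep(5)
    clear_injection()
    return True                   # all steps validated
```
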
**Low-traffic services (< 1000 req/day):** Use absolute request counts instead of percentages. Minimum 5-10 affected requests per step. Example: a 100 req/day service should still start with 5-10 requests (roughly 1-2.5 hours of traffic), not 0.1% (1 request every 10 days).

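A small helper for the low-traffic rule: convert a percentage step into an absolute request budget, floored at a minimum sample size. The floor of 5 is the low end of the guideline above.

```python
import math

def requests_for_step(daily_requests: int, step_pct: float,
                      min_requests: int = 5) -> int:
    """Requests to affect at this step: a percentage of daily traffic,
    floored at min_requests so low-traffic services still get a signal."""
    by_pct = math.ceil(daily_requests * step_pct / 100)
    return max(by_pct, min_requests)

# 100 req/day at the 0.1% step: the percentage gives 1 request (after ceil),
# so the floor of 5 requests applies instead.
print(requests_for_step(100, 0.1))  # -> 5
```
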
## Your First Experiment (Staging)

**Goal:** Build confidence, validate monitoring, test rollback

**Experiment:** Network latency on a non-critical service

Run sheet (tool-agnostic; a Chaos Toolkit version is sketched below):
1. Pick the least critical service (e.g., recommendation engine, not payment)
2. Inject 500ms latency into 100% of staging traffic
3. Duration: 5 minutes
4. Expected: Timeouts handled gracefully, fallback behavior activates
5. Monitor: Error rate, latency, downstream services
6. Abort if: Error rate > 10% or cascading failures
7. Debrief: What did we learn? Did monitoring catch it? Did rollback work?

**Success criteria:** You can answer "Did our hypothesis hold?" within 5 minutes of experiment completion.

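To make the run sheet concrete with the recommended free tool: a Chaos Toolkit experiment is a JSON document, which can be generated from Python and executed with `chaos run experiment.json`. The URL, names, and tc-based injection below are assumptions following the Chaos Toolkit experiment format; treat this as a skeleton, not a drop-in file.

```python
import json

# Minimal experiment skeleton in the Chaos Toolkit format. The endpoint,
# service names, and injection commands are illustrative placeholders.
experiment = {
    "title": "500ms latency on recommendations is handled gracefully",
    "description": "Fallback behavior should keep error rate near baseline.",
    "steady-state-hypothesis": {
        "title": "Service responds with 200",
        "probes": [{
            "type": "probe",
            "name": "recommendations-endpoint-healthy",
            "tolerance": 200,  # expected HTTP status
            "provider": {"type": "http",
                         "url": "http://staging.internal/recommendations"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "inject-500ms-latency",
        "provider": {"type": "process", "path": "tc",
                     "arguments": "qdisc add dev eth0 root netem delay 500ms"},
    }],
    "rollbacks": [{
        "type": "action",
        "name": "remove-latency",
        "provider": {"type": "process", "path": "tc",
                     "arguments": "qdisc del dev eth0 root netem"},
    }],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
# Then: chaos run experiment.json
```
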
## Common Mistakes

### ❌ Testing During Incidents
**Fix:** Only chaos test during stable periods, during business hours, and with extra staffing

---

### ❌ Network Latency Underestimation
**Fix:** Latency cascades - 500ms per hop can become 5s across a ten-call chain. Start with 100-200ms, observe, then increase

---

### ❌ No Post-Experiment Review
**Fix:** Every experiment gets a 15-min debrief: What worked? What broke? What did we learn?

## Quick Reference

**Prerequisites Before First Chaos:**
1. Monitoring + alerts
2. Automated rollback
3. Baseline metrics documented
4. Team coordinated

**Experiment Steps:**
1. Write hypothesis
2. Document baseline
3. Define blast radius (start 0.1%)
4. Set abort criteria
5. Execute with observer
6. Validate hypothesis
7. Debrief team

**Blast Radius Progression:**
Staging → 0.1% → 1% → 5% → 10% (stop for most experiments)

**First Experiment:**
Network latency (500ms) on non-critical service in staging for 5 minutes

## Bottom Line

**Chaos engineering is hypothesis-driven science, not random destruction.**

Start small (staging, 0.1% traffic), with monitoring and rollback in place. Graduate slowly.