commit 167c936f6362ac4ca260158b64887cd1d0b4fd68
Author: Zhongwei Li
Date:   Sat Nov 29 18:28:20 2025 +0800

    Initial commit

diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
new file mode 100644
index 0000000..eefa138
--- /dev/null
+++ b/.claude-plugin/plugin.json
@@ -0,0 +1,11 @@
+{
+  "name": "debugging-kubernetes-incidents",
+  "description": "Comprehensive skill for systematic investigation and root cause analysis of Kubernetes pod failures, service degradation, and resource issues. Provides structured 5-phase methodology for incident triage, multi-source correlation of logs/events/metrics, and actionable remediation recommendations. Covers common patterns: CrashLoopBackOff, OOMKilled, ImagePullBackOff, resource constraints, TLS errors, and network issues. Emphasizes read-only investigation and distinguishing root causes from symptoms.",
+  "version": "1.0.0",
+  "author": {
+    "name": "Gjorgji Georgievski"
+  },
+  "skills": [
+    "./"
+  ]
+}
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..40d0e10
--- /dev/null
+++ b/README.md
@@ -0,0 +1,3 @@
+# debugging-kubernetes-incidents
+
+Comprehensive skill for systematic investigation and root cause analysis of Kubernetes pod failures, service degradation, and resource issues. Provides structured 5-phase methodology for incident triage, multi-source correlation of logs/events/metrics, and actionable remediation recommendations. Covers common patterns: CrashLoopBackOff, OOMKilled, ImagePullBackOff, resource constraints, TLS errors, and network issues. Emphasizes read-only investigation and distinguishing root causes from symptoms.
diff --git a/SKILL.md b/SKILL.md
new file mode 100644
index 0000000..faa7154
--- /dev/null
+++ b/SKILL.md
@@ -0,0 +1,386 @@
+---
+name: debugging-kubernetes-incidents
+description: Use when investigating Kubernetes pod failures, crashes, resource issues, or service degradation. Provides systematic investigation methodology for incident triage, root cause analysis, and remediation planning in any Kubernetes environment.
+---
+
+# Debugging Kubernetes Incidents
+
+## Overview
+
+**Core Principles:**
+- **Read-Only Investigation** - Observe and analyze, never modify resources
+- **Systematic Methodology** - Follow structured phases for thorough analysis
+- **Multi-Source Correlation** - Combine logs, events, metrics for complete picture
+- **Root Cause Focus** - Identify underlying cause, not just symptoms
+
+**Key Abbreviations:**
+- **RCA** - Root Cause Analysis
+- **MTTR** - Mean Time To Resolution
+- **P1/P2/P3/P4** - Severity levels (Critical/High/Medium/Low)
+
+## When to Use
+
+Invoke this skill when:
+- ✅ Pods are crashing, restarting, or stuck in `CrashLoopBackOff`
+- ✅ Services are returning errors or experiencing high latency
+- ✅ Resources are exhausted (CPU, memory, storage)
+- ✅ Certificate errors or TLS handshake failures occur
+- ✅ Deployments are failing or stuck in rollout
+- ✅ Need to perform incident triage or post-mortem analysis
+
+## Quick Reference: Investigation Phases
+
+| Phase | Focus | Primary Tools |
+|-------|-------|--------------|
+| **1. Triage** | Severity & scope | Pod status, events |
+| **2. Data Collection** | Logs, events, metrics | kubectl logs, events, top |
+| **3. Correlation** | Pattern detection | Timeline analysis |
+| **4. Root Cause** | Underlying issue | Multi-source synthesis |
+| **5. Remediation** | Fix & prevention | Recommendations only |
+
+## Common Confusions
+
+### Wrong vs. Correct Approaches
+
+❌ **WRONG:** Jump directly to pod logs without checking events
+✅ **CORRECT:** Check events first to understand context, then investigate logs
+
+❌ **WRONG:** Only look at the current pod state
+✅ **CORRECT:** Check previous container logs if pod restarted (use `--previous` flag)
+
+❌ **WRONG:** Investigate single data source in isolation
+✅ **CORRECT:** Correlate logs + events + metrics for complete picture
+
+❌ **WRONG:** Assume first error found is root cause
+✅ **CORRECT:** Identify temporal sequence - what happened FIRST?
+
+❌ **WRONG:** Recommend changes without understanding full impact
+✅ **CORRECT:** Provide read-only recommendations with manual review required
+
+## Decision Tree: Incident Type
+
+**Q: Is the pod running?**
+- **No** → Check pod status and events
+  - ImagePullBackOff → [Image Pull Issues](#image-pull-issues)
+  - Pending → [Resource Constraints](#resource-constraints)
+  - CrashLoopBackOff → [Application Crashes](#application-crashes)
+  - OOMKilled → [Memory Issues](#memory-issues)
+
+**Q: Is the pod running but unhealthy?**
+- **Yes** → Check logs and resource usage
+  - High CPU/Memory → [Resource Saturation](#resource-saturation)
+  - Application errors → [Log Analysis](#log-analysis)
+  - Failed health checks → [Liveness/Readiness Probes](#probe-failures)
+
+**Q: Is this a service-level issue?**
+- **Yes** → Check service endpoints and network
+  - No endpoints → [Pod Selector Issues](#selector-issues)
+  - Connection timeouts → [Network Issues](#network-issues)
+  - Certificate errors → [TLS Issues](#tls-issues)
+
+## Rules: Investigation Methodology
+
+### Phase 1: Triage
+
+**CRITICAL:** Assess severity before diving deep
+
+1. **Determine incident type**:
+   - Pod-level: Single pod failure
+   - Deployment-level: Multiple pods affected
+   - Service-level: API/service unavailable
+   - Cluster-level: Node or system-wide issue
+
+2. **Assess severity**:
+   - **P1 (Critical)**: Production service down, data loss risk
+   - **P2 (High)**: Major feature broken, significant user impact
+   - **P3 (Medium)**: Minor feature degraded, limited impact
+   - **P4 (Low)**: No immediate user impact
+
+3. **Identify scope**:
+   - Single pod, deployment, namespace, or cluster-wide?
+   - How many users affected?
+   - What services depend on this?
+
+### Phase 2: Data Collection
+
+**Gather comprehensive data from multiple sources:**
+
+**Pod Information:**
+```bash
+kubectl get pods -n <namespace>
+kubectl describe pod <pod-name> -n <namespace>
+kubectl get pod <pod-name> -n <namespace> -o yaml
+```
+
+**Logs:**
+```bash
+# Current logs
+kubectl logs <pod-name> -n <namespace> --tail=100
+
+# Previous container (if restarted)
+kubectl logs <pod-name> -n <namespace> --previous
+
+# All containers in pod
+kubectl logs <pod-name> -n <namespace> --all-containers=true
+```
+
+**Events:**
+```bash
+# Namespace events
+kubectl get events -n <namespace> --sort-by='.lastTimestamp'
+
+# Pod-specific events
+kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
+```
+
+**Resource Usage:**
+```bash
+kubectl top pods -n <namespace>
+kubectl top nodes
+```
+
+### Phase 3: Correlation & Analysis
+
+**Create unified timeline:**
+
+1. **Extract timestamps** from logs, events, metrics
+2. **Align data sources** on common timeline
+3. **Identify temporal patterns**:
+   - What happened first? (root cause)
+   - What happened simultaneously? (correlation)
+   - What happened after? (cascading effects)
+
+**Look for common patterns:**
+- Memory spike → OOMKilled → Pod restart
+- Image pull failure → Pending → Timeout
+- Probe failure → Unhealthy → Traffic removed
+- Certificate expiry → TLS errors → Connection failures
+
+### Phase 4: Root Cause Determination
+
+**CRITICAL:** Distinguish correlation from causation
+
+**Validate root cause:**
+1. **Temporal precedence**: Did it happen BEFORE the symptom?
+2. **Causal mechanism**: Does it logically explain the symptom?
+3. **Evidence**: Is there supporting data from multiple sources?
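The unified timeline from Phase 3 and the temporal-precedence check from Phase 4 can be sketched in a few lines of shell. This is a minimal illustration, not part of the skill: the function names `build_timeline` and `first_event` and the file names `logs.txt` and `events.txt` are hypothetical, and it assumes every input line starts with an RFC 3339 UTC timestamp, which sorts chronologically as plain text.

```bash
#!/usr/bin/env bash
# Sketch: merge timestamped lines from several sources into one timeline.
# Assumption: each line begins with an RFC 3339 UTC timestamp such as
# 2025-11-28T09:15:00Z, so a plain byte-wise sort is a chronological sort.

# build_timeline FILE... : print all lines ordered by their leading timestamp.
build_timeline() {
  LC_ALL=C sort -k1,1 "$@"
}

# first_event FILE... : print the earliest entry, a candidate for
# "what happened first" (still to be validated against the other criteria).
first_event() {
  build_timeline "$@" | head -n 1
}

# Hypothetical read-only collection step (file names are placeholders):
#   kubectl logs <pod-name> -n <namespace> --previous --timestamps > logs.txt
#   kubectl get events -n <namespace> --no-headers \
#     -o custom-columns=TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message > events.txt
#   build_timeline logs.txt events.txt
```

Log timestamps carry fractional seconds while event timestamps usually do not, so ordering within the same second should be treated as approximate. The earliest entry is only a candidate: it still needs a causal mechanism and supporting evidence from more than one source.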
+
+**Common root causes:**
+- Application bugs (crashes, exceptions)
+- Resource exhaustion (CPU, memory, disk)
+- Configuration errors (wrong env vars, missing secrets)
+- Network issues (DNS failures, timeouts)
+- Infrastructure problems (node failures, storage issues)
+
+### Phase 5: Remediation Planning
+
+**Provide structured recommendations:**
+
+**Immediate mitigation:**
+- Rollback deployment
+- Scale resources
+- Restart pods
+- Apply emergency config
+
+**Permanent fix:**
+- Code changes
+- Resource limit adjustments
+- Configuration updates
+- Infrastructure improvements
+
+**Prevention:**
+- Monitoring alerts
+- Resource quotas
+- Automated testing
+- Runbook updates
+
+## Common Issue Types
+
+### Image Pull Issues
+
+**Symptoms:**
+- Pod status: `ImagePullBackOff` or `ErrImagePull`
+- Events: "Failed to pull image"
+
+**Investigation:**
+```
+├── Check image name and tag in pod spec
+├── Verify image exists in registry
+├── Check image pull secrets
+└── Validate network connectivity to registry
+```
+
+**Common causes:**
+- Wrong image name/tag
+- Missing/expired image pull secret
+- Private registry authentication failure
+- Network connectivity to registry
+
+### Resource Constraints
+
+**Symptoms:**
+- Pod status: `Pending`
+- Events: "Insufficient cpu/memory"
+
+**Investigation:**
+```
+├── Check namespace resource quotas
+├── Check pod resource requests/limits
+├── Review node capacity
+└── Check for pod priority and preemption
+```
+
+**Common causes:**
+- Namespace quota exceeded
+- Insufficient cluster capacity
+- Resource requests too high
+- No nodes matching pod requirements
+
+### Application Crashes
+
+**Symptoms:**
+- Pod status: `CrashLoopBackOff`
+- Container exit code: non-zero
+- Frequent restarts
+
+**Investigation:**
+```
+├── Get logs from current container
+├── Get logs from previous container (--previous)
+├── Check exit code in pod status
+├── Review application startup logs
+└── Check for environment variable issues
+```
+
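The "check exit code in pod status" step above can be made concrete with a small helper. This is a sketch under stated assumptions: `explain_exit_code` is a hypothetical function, and its messages are interpretive hints rather than authoritative diagnoses. It relies only on the common convention that an exit code of 128 + N means the process was ended by signal N.

```bash
# explain_exit_code CODE : print a hedged interpretation of a container exit code.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: container finished without reporting an error"
  elif [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
    # Convention: 128 + N means the process was ended by signal N.
    local sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: SIGKILL (signal 9); check whether the reason is OOMKilled" ;;
      15) echo "exit $code: SIGTERM (signal 15); the container was asked to stop" ;;
      *)  echo "exit $code: terminated by signal $sig" ;;
    esac
  else
    echo "exit $code: application-defined error; read the previous container's logs"
  fi
}

# Read-only way to fetch the last exit code of the first container:
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

For example, `explain_exit_code 137` points toward SIGKILL, which is worth cross-checking against the termination reason and the pod's memory limit before concluding that the container was OOM-killed.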
+**Common causes:**
+- Application exceptions/errors
+- Missing required environment variables
+- Database connection failures
+- Invalid configuration
+
+### Memory Issues
+
+**Symptoms:**
+- Pod status shows: `OOMKilled`
+- Events: "Container exceeded memory limit"
+- High memory usage before restart
+
+**Investigation:**
+```
+├── Check memory limits vs actual usage
+├── Review memory usage trends
+├── Analyze for memory leaks
+└── Check for spike patterns
+```
+
+**Common causes:**
+- Memory leak in application
+- Insufficient memory limits
+- Unexpected traffic spike
+- Large dataset processing
+
+### TLS Issues
+
+**Symptoms:**
+- Logs show: "tls: bad certificate" or "x509: certificate has expired"
+- Connection failures between services
+- HTTP 503 errors
+
+**Investigation:**
+```
+├── Check certificate expiration dates
+├── Verify certificate CN/SAN matches hostname
+├── Check CA bundle configuration
+└── Review certificate secret in namespace
+```
+
+**Common causes:**
+- Expired certificates
+- Certificate CN mismatch
+- Missing CA certificate
+- Incorrect certificate chain
+
+## Real-World Example: Pod Crash Loop Investigation
+
+**Scenario:** API gateway pods crashing repeatedly
+
+**Step 1: Triage**
+```bash
+$ kubectl get pods -n production | grep api-gateway
+api-gateway-7d8f9b-xyz   0/1   CrashLoopBackOff   5   10m
+```
+Severity: P2 (High) - Production API affected
+
+**Step 2: Check Events**
+```bash
+$ kubectl describe pod api-gateway-7d8f9b-xyz -n production | grep -A 10 Events
+Events:
+  10m  Warning  BackOff  Pod  Back-off restarting failed container
+  9m   Warning  Failed   Pod  Error: container exceeded memory limit
+```
+Key finding: OOMKilled event
+
+**Step 3: Check Logs (Previous Container)**
+```bash
+$ kubectl logs api-gateway-7d8f9b-xyz -n production --previous --tail=50
+...
+[ERROR] Database connection pool exhausted: 50/50 connections in use
+[WARN] High memory pressure detected
+[CRITICAL] Memory usage at 98%, OOM imminent
+```
+Pattern: Connection pool → Memory pressure → OOM
+
+**Step 4: Check Resource Limits**
+```bash
+$ kubectl get pod api-gateway-7d8f9b-xyz -n production -o yaml | grep -A 5 resources
+resources:
+  limits:
+    memory: "2Gi"
+  requests:
+    memory: "1Gi"
+```
+
+**Step 5: Root Cause Analysis**
+```
+Timeline:
+09:00 - Database query slowdown (from logs)
+09:05 - Connection pool exhausted
+09:10 - Memory usage spike (connections held)
+09:15 - OOM killed
+09:16 - Pod restart
+```
+
+**Root Cause:** Slow database queries → connection pool exhaustion → memory pressure → OOM
+
+**Recommendations:**
+```markdown
+Immediate:
+1. Increase memory limit to 4Gi (temporary)
+2. Add connection timeout (10s)
+
+Permanent:
+1. Optimize slow database query
+2. Increase connection pool size
+3. Implement connection timeout
+4. Add memory alerts at 80%
+
+Prevention:
+1. Monitor database query performance
+2. Add Prometheus alert for connection pool usage
+3. Regular load testing
+```
+
+## Troubleshooting Decision Matrix
+
+| Symptom | First Check | Common Cause | Quick Fix |
+|---------|-------------|--------------|-----------|
+| ImagePullBackOff | `describe pod` events | Wrong image/registry | Fix image name |
+| Pending | Resource quotas | Insufficient resources | Increase quota |
+| CrashLoopBackOff | Logs (--previous) | App error | Fix application |
+| OOMKilled | Memory limits | Memory leak | Increase limits |
+| Unhealthy | Readiness probe | Slow startup | Increase probe delay |
+| No endpoints | Service selector | Label mismatch | Fix selector |
+
+## Keywords for Search
+
+kubernetes, pod failure, crashloopbackoff, oomkilled, pending pod, debugging, incident investigation, root cause analysis, pod logs, events, kubectl, container restart, image pull error, resource constraints, memory leak, application crash, service degradation, tls error, certificate expiry, readiness probe, liveness probe, troubleshooting, production incident, cluster debugging, namespace investigation, deployment failure, rollout stuck, error analysis, log correlation
diff --git a/plugin.lock.json b/plugin.lock.json
new file mode 100644
index 0000000..d0dee8e
--- /dev/null
+++ b/plugin.lock.json
@@ -0,0 +1,53 @@
+{
+  "$schema": "internal://schemas/plugin.lock.v1.json",
+  "pluginId": "gh:geored/sre-skill:debugging-kubernetes-incidents",
+  "normalized": {
+    "repo": null,
+    "ref": "refs/tags/v20251128.0",
+    "commit": "b1fc825a4467d23b26f3d47c9c5cb767f35d9c61",
+    "treeHash": "ef59d5907bb570ec270990bac06a07f919b7643eeeb81640339ec42d3089b431",
+    "generatedAt": "2025-11-28T10:16:59.381080Z",
+    "toolVersion": "publish_plugins.py@0.2.0"
+  },
+  "origin": {
+    "remote": "git@github.com:zhongweili/42plugin-data.git",
+    "branch": "master",
+    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
+    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
+  },
+  "manifest": {
+    "name": "debugging-kubernetes-incidents",
+    "description": "Comprehensive skill for systematic investigation and root cause analysis of Kubernetes pod failures, service degradation, and resource issues. Provides structured 5-phase methodology for incident triage, multi-source correlation of logs/events/metrics, and actionable remediation recommendations. Covers common patterns: CrashLoopBackOff, OOMKilled, ImagePullBackOff, resource constraints, TLS errors, and network issues. Emphasizes read-only investigation and distinguishing root causes from symptoms.",
+    "version": "1.0.0"
+  },
+  "content": {
+    "files": [
+      {
+        "path": "README.md",
+        "sha256": "0d417351e335edb63b8771a0ab391eac47a2052cac5b38dc2eefb121a28b7087"
+      },
+      {
+        "path": "SKILL.md",
+        "sha256": "d242e76fa6006dcd6f4fdd3617965a6e1a1f525b684f3026b406b5c2991133bb"
+      },
+      {
+        "path": "tests/README.md",
+        "sha256": "80d38e62fb573e804808a970d7c6ca70be4a01c8ff378cb6ad44503d83fb2d07"
+      },
+      {
+        "path": "tests/scenarios.yaml",
+        "sha256": "3beac757d9a595dd2ee61e976ab7fa0b188b44e0a1338cac856cbf6debad5168"
+      },
+      {
+        "path": ".claude-plugin/plugin.json",
+        "sha256": "0aa508cd1cc30941c59bf5ef3b7541dad04daf91e12490d2dc2459f217e1eb49"
+      }
+    ],
+    "dirSha256": "ef59d5907bb570ec270990bac06a07f919b7643eeeb81640339ec42d3089b431"
+  },
+  "security": {
+    "scannedAt": null,
+    "scannerVersion": null,
+    "flags": []
+  }
+}
\ No newline at end of file
diff --git a/tests/README.md b/tests/README.md
new file mode 100644
index 0000000..f553838
--- /dev/null
+++ b/tests/README.md
@@ -0,0 +1,182 @@
+# Test Suite for Debugging Kubernetes Incidents Skill
+
+## Overview
+
+This test suite validates that the Kubernetes incident debugging skill properly teaches Claude Code to:
+1. Recognize common Kubernetes failure patterns
+2. Follow systematic investigation methodology
+3. Correlate multiple data sources (logs, events, metrics)
+4. Distinguish root causes from symptoms
+5. Maintain read-only investigation approach
+
+## Test Scenarios
+
+### 1. CrashLoopBackOff Recognition
+**Purpose**: Validates Claude recognizes crash loop pattern and suggests proper investigation
+**Expected**: Should mention checking logs (--previous), events, describe, and exit codes
+**Baseline Failure**: Without skill, may suggest fixes without investigation
+
+### 2. OOMKilled Investigation
+**Purpose**: Ensures Claude identifies memory exhaustion and correlates with resource limits
+**Expected**: Should investigate memory usage patterns, limits, and potential leaks
+**Baseline Failure**: Without skill, may just suggest increasing memory
+
+### 3. Multi-Source Correlation
+**Purpose**: Tests ability to gather and correlate logs, events, and metrics
+**Expected**: Should mention all three data sources and timeline creation
+**Baseline Failure**: Without skill, may focus on single data source
+
+### 4. Root Cause vs Symptom
+**Purpose**: Validates temporal analysis to distinguish cause from effect
+**Expected**: Should use timeline and "what happened first" approach
+**Baseline Failure**: Without skill, may confuse correlation with causation
+
+### 5. Image Pull Failure
+**Purpose**: Tests systematic approach to ImagePullBackOff debugging
+**Expected**: Should check image name, registry, and pull secrets systematically
+**Baseline Failure**: Without skill, may suggest random fixes
+
+### 6. Read-Only Investigation
+**Purpose**: Ensures skill maintains advisory-only approach
+**Expected**: Should recommend steps, not execute changes
+**Baseline Failure**: Without skill, might suggest direct modifications
+
+## Running Tests
+
+### Prerequisites
+
+- Python 3.8+
+- `claudelint` installed for validation
+- Claude Code CLI access
+- Claude Sonnet 4 or Claude Opus 4 (tests use `sonnet` model)
+
+### Run All Tests
+
+```bash
+# From repository root
+make test
+
+# Or specifically for this skill
+make test-only SKILL=debugging-kubernetes-incidents
+```
+
+### Validate Skill Schema
+
+```bash
+claudelint debugging-kubernetes-incidents/SKILL.md
+```
+
+### Generate Test Results
+
+```bash
+make generate SKILL=debugging-kubernetes-incidents
+```
+
+## Test-Driven Development Process
+
+This skill followed TDD for Documentation:
+
+### RED Phase (Initial Failures)
+1. Created 6 test scenarios representing real investigation needs
+2. Ran tests WITHOUT the skill
+3. Documented baseline failures:
+   - Suggested direct fixes without investigation
+   - Missed multi-source correlation
+   - Confused symptoms with root causes
+   - Lacked systematic methodology
+
+### GREEN Phase (Minimal Skill)
+1. Created SKILL.md addressing test failures
+2. Added investigation phases and decision trees
+3. Included multi-source correlation guidance
+4. Emphasized read-only approach
+5. All tests passed
+
+### REFACTOR Phase (Improvement)
+1. Added real-world examples
+2. Enhanced decision trees
+3. Improved troubleshooting matrix
+4. Refined investigation methodology
+5. Added keyword search terms
+
+## Success Criteria
+
+All tests must:
+- ✅ Pass with 100% success rate (3/3 samples)
+- ✅ Contain expected keywords
+- ✅ NOT contain prohibited terms
+- ✅ Demonstrate systematic approach
+- ✅ Maintain read-only advisory model
+
+## Continuous Validation
+
+Tests run automatically on:
+- Every pull request (GitHub Actions)
+- Skill file modifications
+- Schema changes
+- Version updates
+
+## Adding New Tests
+
+To add test scenarios:
+
+1. **Identify gap**: What investigation scenario is missing?
+2. **Create scenario**: Add to `scenarios.yaml`
+3. **Run without skill**: Document baseline failure
+4. **Update SKILL.md**: Address the gap
+5. **Validate**: Ensure test passes
+
+Example:
+```yaml
+- name: your-test-name
+  description: What you're testing
+  prompt: "User query to test"
+  model: haiku
+  samples: 3
+  expected:
+    contains_keywords:
+      - keyword1
+      - keyword2
+  baseline_failure: What happens without the skill
+```
+
+## Known Limitations
+
+- Tests use synthetic scenarios (not real cluster data)
+- Keyword matching is basic (could use semantic analysis)
+- No integration testing with actual Kubernetes clusters
+- Sample size (3) may not catch all edge cases
+
+## Future Improvements
+
+- Add tests for more complex multi-cluster scenarios
+- Include performance regression testing
+- Add semantic similarity scoring
+- Test with real cluster incident data
+- Add negative test cases (what should NOT do)
+
+## Troubleshooting
+
+### Test Failures
+
+**Symptom**: Test fails intermittently
+**Fix**: Increase samples or refine expected keywords
+
+**Symptom**: All tests fail
+**Fix**: Check SKILL.md frontmatter and schema validation
+
+**Symptom**: Baseline failure unclear
+**Fix**: Run test manually without skill, document actual output
+
+## Contributing
+
+When contributing test improvements:
+1. Ensure tests are deterministic
+2. Use realistic user prompts
+3. Document baseline failures clearly
+4. Keep samples count reasonable (3-5)
+5. Update this README with new scenarios
+
+## Questions?
+
+See main repository documentation or file an issue.
diff --git a/tests/scenarios.yaml b/tests/scenarios.yaml
new file mode 100644
index 0000000..120f44b
--- /dev/null
+++ b/tests/scenarios.yaml
@@ -0,0 +1,116 @@
+skill_name: debugging-kubernetes-incidents
+description: Validates systematic Kubernetes incident investigation methodology and root cause analysis capabilities
+test_scenarios:
+  - name: crashloopbackoff-recognition
+    description: Recognizes CrashLoopBackOff pattern and suggests appropriate investigation steps
+    prompt: "I have a pod in CrashLoopBackOff state. What should I do?"
+    model: sonnet
+    samples: 3
+    expected:
+      recognizes: CrashLoopBackOff
+      contains_keywords:
+        - logs
+        - previous
+        - events
+        - describe
+        - exit code
+      does_not_contain:
+        - modify
+        - delete
+        - force
+    baseline_failure: Without the skill, may suggest direct fixes without investigation or miss the importance of checking previous container logs
+
+  - name: oomkilled-investigation
+    description: Identifies memory exhaustion pattern and correlates events with resource limits
+    prompt: "My pods keep getting OOMKilled. How do I find out why?"
+    model: sonnet
+    samples: 3
+    expected:
+      recognizes: OOMKilled
+      contains_keywords:
+        - memory
+        - limits
+        - resources
+        - usage
+        - leak
+        - events
+      does_not_contain:
+        - ignore
+        - just restart
+    baseline_failure: Without the skill, may only suggest increasing memory without investigating the underlying cause of memory exhaustion
+
+  - name: multi-source-correlation
+    description: Demonstrates correlation of logs, events, and metrics for complete incident picture
+    prompt: "I'm investigating a service degradation issue. What data should I collect and how do I correlate it?"
+    model: sonnet
+    samples: 3
+    expected:
+      contains_keywords:
+        - logs
+        - events
+        - metrics
+        - correlate
+        - timeline
+        - multiple sources
+      mentions_tools:
+        - kubectl logs
+        - kubectl get events
+        - kubectl top
+    baseline_failure: Without the skill, may focus on only one data source without correlating across logs, events, and metrics
+
+  - name: root-cause-vs-symptom
+    description: Distinguishes between root cause and symptoms using temporal analysis
+    prompt: "I see high CPU usage and connection errors. Which is the root cause?"
+    model: sonnet
+    samples: 3
+    expected:
+      contains_keywords:
+        - temporal
+        - first
+        - timeline
+        - causation
+        - correlation
+      provides_approach:
+        - Check what happened first
+        - Create timeline
+        - Validate causal mechanism
+    baseline_failure: Without the skill, may incorrectly identify symptoms as root causes or confuse correlation with causation
+
+  - name: image-pull-failure
+    description: Provides systematic approach to investigating ImagePullBackOff errors
+    prompt: "Pod status shows ImagePullBackOff. How do I debug this?"
+    model: sonnet
+    samples: 3
+    expected:
+      recognizes: ImagePullBackOff
+      contains_keywords:
+        - image
+        - registry
+        - secret
+        - pull
+        - authentication
+        - describe
+      investigation_steps:
+        - Check image name
+        - Verify registry
+        - Check pull secrets
+    baseline_failure: Without the skill, may suggest random fixes without systematic investigation of image name, registry access, or credentials
+
+  - name: read-only-investigation
+    description: Ensures skill maintains read-only approach and never suggests direct modifications
+    prompt: "I found the issue. How do I fix the pod?"
+    model: sonnet
+    samples: 3
+    expected:
+      contains_keywords:
+        - recommend
+        - manual
+        - review
+        - steps
+      does_not_contain:
+        - automatically
+        - I will delete
+        - I will modify
+        - I will restart
+      approach: Advisory recommendations only
+    baseline_failure: Without the skill, might suggest direct modifications or automated fixes without proper human review
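
The test runner that evaluates `contains_keywords` and `does_not_contain` is not part of this commit, so its exact matching rules are unknown. As a rough illustration only, the check could be approximated with a case-insensitive literal substring match. The function `check_response` and its three file arguments are hypothetical and not taken from the repository.

```bash
# check_response RESPONSE_FILE REQUIRED_FILE FORBIDDEN_FILE
# Hypothetical approximation of the keyword expectations: succeeds when every
# line of REQUIRED_FILE occurs in the response and no line of FORBIDDEN_FILE
# does. Matching is case-insensitive and literal (grep -iF); this is an
# assumption about the semantics, not the real runner's behavior.
check_response() {
  local response="$1" required="$2" forbidden="$3" term
  while IFS= read -r term; do
    [ -z "$term" ] && continue
    grep -qiF -- "$term" "$response" || { echo "missing: $term"; return 1; }
  done < "$required"
  while IFS= read -r term; do
    [ -z "$term" ] && continue
    if grep -qiF -- "$term" "$response"; then
      echo "forbidden: $term"
      return 1
    fi
  done < "$forbidden"
  echo "ok"
}
```

A literal substring match treats a word like `modify` as forbidden even inside a sentence such as "do not modify the pod", which is one reason the tests README lists basic keyword matching as a known limitation.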