# Test Suite for Debugging Kubernetes Incidents Skill

## Overview

This test suite validates that the Kubernetes incident debugging skill properly teaches Claude Code to:

1. Recognize common Kubernetes failure patterns
2. Follow a systematic investigation methodology
3. Correlate multiple data sources (logs, events, metrics)
4. Distinguish root causes from symptoms
5. Maintain a read-only investigation approach

## Test Scenarios

### 1. CrashLoopBackOff Recognition

- **Purpose**: Validates that Claude recognizes the crash loop pattern and suggests a proper investigation
- **Expected**: Should mention checking logs (`--previous`), events, describe output, and exit codes
- **Baseline Failure**: Without the skill, may suggest fixes without investigation

### 2. OOMKilled Investigation

- **Purpose**: Ensures Claude identifies memory exhaustion and correlates it with resource limits
- **Expected**: Should investigate memory usage patterns, limits, and potential leaks
- **Baseline Failure**: Without the skill, may just suggest increasing memory

### 3. Multi-Source Correlation

- **Purpose**: Tests the ability to gather and correlate logs, events, and metrics
- **Expected**: Should mention all three data sources and timeline creation
- **Baseline Failure**: Without the skill, may focus on a single data source

### 4. Root Cause vs Symptom

- **Purpose**: Validates temporal analysis to distinguish cause from effect
- **Expected**: Should use a timeline and a "what happened first" approach
- **Baseline Failure**: Without the skill, may confuse correlation with causation

### 5. Image Pull Failure

- **Purpose**: Tests a systematic approach to ImagePullBackOff debugging
- **Expected**: Should check image name, registry, and pull secrets systematically
- **Baseline Failure**: Without the skill, may suggest random fixes

### 6. Read-Only Investigation

- **Purpose**: Ensures the skill maintains an advisory-only approach
- **Expected**: Should recommend steps, not execute changes
- **Baseline Failure**: Without the skill, might suggest direct modifications

## Running Tests

### Prerequisites

- Python 3.8+
- `claudelint` installed for validation
- Claude Code CLI access
- Claude Sonnet 4 or Claude Opus 4 (tests use `sonnet` model)

### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-kubernetes-incidents
```

### Validate Skill Schema

```bash
claudelint debugging-kubernetes-incidents/SKILL.md
```

### Generate Test Results

```bash
make generate SKILL=debugging-kubernetes-incidents
```

## Test-Driven Development Process

This skill followed TDD for Documentation:

### RED Phase (Initial Failures)

1. Created 6 test scenarios representing real investigation needs
2. Ran tests WITHOUT the skill
3. Documented baseline failures:
   - Suggested direct fixes without investigation
   - Missed multi-source correlation
   - Confused symptoms with root causes
   - Lacked systematic methodology

### GREEN Phase (Minimal Skill)

1. Created SKILL.md addressing test failures
2. Added investigation phases and decision trees
3. Included multi-source correlation guidance
4. Emphasized read-only approach
5. All tests passed

### REFACTOR Phase (Improvement)

1. Added real-world examples
2. Enhanced decision trees
3. Improved troubleshooting matrix
4. Refined investigation methodology
5. Added keyword search terms

## Success Criteria

All tests must:

- ✅ Pass with 100% success rate (3/3 samples)
- ✅ Contain expected keywords
- ✅ NOT contain prohibited terms
- ✅ Demonstrate systematic approach
- ✅ Maintain read-only advisory model

## Continuous Validation

Tests run automatically on:

- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates

## Adding New Tests

To add test scenarios:

1. **Identify gap**: What investigation scenario is missing?
2. **Create scenario**: Add to `scenarios.yaml`
3. **Run without skill**: Document baseline failure
4. **Update SKILL.md**: Address the gap
5. **Validate**: Ensure test passes

Example:

```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: haiku
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```

## Known Limitations

- Tests use synthetic scenarios (not real cluster data)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual Kubernetes clusters
- Sample size (3) may not catch all edge cases

## Future Improvements

- Add tests for more complex multi-cluster scenarios
- Include performance regression testing
- Add semantic similarity scoring
- Test with real cluster incident data
- Add negative test cases (what the skill should NOT do)

## Troubleshooting

### Test Failures

**Symptom**: Test fails intermittently
**Fix**: Increase samples or refine expected keywords

**Symptom**: All tests fail
**Fix**: Check SKILL.md frontmatter and schema validation

**Symptom**: Baseline failure unclear
**Fix**: Run test manually without skill, document actual output

## Contributing

When contributing test improvements:

1. Ensure tests are deterministic
2. Use realistic user prompts
3. Document baseline failures clearly
4. Keep the sample count reasonable (3-5)
5. Update this README with new scenarios

## Questions?

See main repository documentation or file an issue.
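
## Appendix: Keyword Check Sketch

The success criteria in this README (expected keywords present, prohibited terms absent, every sample passing) can be illustrated with a small Python sketch. This is not the repository's actual test harness: the function names `sample_passes` and `scenario_passes`, the case-insensitive substring matching, and the sample responses are all assumptions made for illustration.

```python
from typing import Iterable, List


def sample_passes(
    response: str,
    expected: Iterable[str],
    prohibited: Iterable[str] = (),
) -> bool:
    """Return True if one response has every expected keyword and no prohibited term.

    Hypothetical helper: matching is a case-insensitive substring check,
    which is an assumption about how "basic keyword matching" might work.
    """
    text = response.lower()
    has_all_expected = all(kw.lower() in text for kw in expected)
    has_prohibited = any(term.lower() in text for term in prohibited)
    return has_all_expected and not has_prohibited


def scenario_passes(
    responses: List[str],
    expected: Iterable[str],
    prohibited: Iterable[str] = (),
) -> bool:
    """A scenario passes only if every sample passes (for example 3/3)."""
    expected = list(expected)
    prohibited = list(prohibited)
    # An empty sample list is treated as a failure rather than a vacuous pass.
    return bool(responses) and all(
        sample_passes(r, expected, prohibited) for r in responses
    )


if __name__ == "__main__":
    # Made-up responses for a CrashLoopBackOff-style scenario.
    samples = [
        "Check kubectl logs --previous, then review events and the exit code.",
        "Start with logs --previous and events; note the exit code.",
        "Review events, logs --previous, and the container exit code.",
    ]
    print(scenario_passes(
        samples,
        expected=["--previous", "events", "exit code"],
        prohibited=["kubectl delete"],
    ))
```

Requiring every sample to pass mirrors the 3/3 criterion. Because the check is a plain substring match, it shares the limitation noted under Known Limitations: a response can be correct while phrasing a keyword differently.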