# Test Suite for Debugging Kubernetes Incidents Skill

## Overview
This test suite validates that the Kubernetes incident debugging skill properly teaches Claude Code to:
- Recognize common Kubernetes failure patterns
- Follow systematic investigation methodology
- Correlate multiple data sources (logs, events, metrics)
- Distinguish root causes from symptoms
- Maintain read-only investigation approach
## Test Scenarios
### 1. CrashLoopBackOff Recognition

**Purpose:** Validates that Claude recognizes the crash-loop pattern and suggests a proper investigation.
**Expected:** Should mention checking logs (`--previous`), events, `describe` output, and exit codes.
**Baseline Failure:** Without the skill, Claude may suggest fixes without investigating.
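For reference, a read-only triage along these lines is what the test looks for (pod and namespace names are placeholders):

```bash
# Logs from the previous (crashed) container instance
kubectl logs my-pod -n my-namespace --previous

# Events and last-state details in the pod description
kubectl describe pod my-pod -n my-namespace

# Exit code of the last terminated container
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```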
### 2. OOMKilled Investigation

**Purpose:** Ensures Claude identifies memory exhaustion and correlates it with resource limits.
**Expected:** Should investigate memory usage patterns, limits, and potential leaks.
**Baseline Failure:** Without the skill, Claude may simply suggest increasing memory.
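A minimal sketch of that correlation, assuming placeholder names and a metrics-server installation:

```bash
# Confirm the kill reason recorded on the container ("OOMKilled")
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Compare against configured requests and limits
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.spec.containers[0].resources}'

# Current memory usage (requires metrics-server)
kubectl top pod my-pod -n my-namespace
```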
### 3. Multi-Source Correlation

**Purpose:** Tests the ability to gather and correlate logs, events, and metrics.
**Expected:** Should mention all three data sources and constructing a timeline.
**Baseline Failure:** Without the skill, Claude may focus on a single data source.
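One way to pull all three sources for the same incident window (names are placeholders):

```bash
# 1. Logs, timestamped for cross-referencing
kubectl logs my-pod -n my-namespace --timestamps

# 2. Events in chronological order
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp

# 3. Metrics snapshot (requires metrics-server)
kubectl top pod -n my-namespace
```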
### 4. Root Cause vs. Symptom

**Purpose:** Validates temporal analysis to distinguish cause from effect.
**Expected:** Should use a timeline and a "what happened first" approach.
**Baseline Failure:** Without the skill, Claude may confuse correlation with causation.
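A timeline can be built directly from events; the namespace and column choices here are illustrative:

```bash
# Sort events by time: what fired first usually points at the cause;
# later events are often downstream symptoms
kubectl get events -n my-namespace --sort-by=.lastTimestamp \
  -o custom-columns=TIME:.lastTimestamp,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message
```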
### 5. Image Pull Failure

**Purpose:** Tests a systematic approach to ImagePullBackOff debugging.
**Expected:** Should check the image name, registry, and pull secrets systematically.
**Baseline Failure:** Without the skill, Claude may suggest random fixes.
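The systematic sequence might look like this (all names are placeholders):

```bash
# 1. Exact image reference the pod is trying to pull
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.spec.containers[0].image}'

# 2. The registry's error message (auth failure vs. image not found)
kubectl describe pod my-pod -n my-namespace

# 3. Pull secrets referenced by the pod, and whether they exist
kubectl get pod my-pod -n my-namespace -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret my-pull-secret -n my-namespace
```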
### 6. Read-Only Investigation

**Purpose:** Ensures the skill maintains an advisory-only approach.
**Expected:** Should recommend steps, not execute changes.
**Baseline Failure:** Without the skill, Claude might suggest direct modifications.
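In practice this means the investigation stays within read-only verbs, for example:

```bash
# Allowed: read-only inspection (placeholder names)
kubectl get pods -n my-namespace
kubectl describe deployment my-app -n my-namespace
kubectl logs my-pod -n my-namespace

# Not executed: mutating verbs (apply, edit, delete, scale, rollout restart)
# are only ever recommended to the user as next steps
```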
## Running Tests

### Prerequisites

- Python 3.8+
- `claudelint` installed for validation
- Claude Code CLI access
- Claude Sonnet 4 or Claude Opus 4 (tests use the `sonnet` model)
### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-kubernetes-incidents
```
### Validate Skill Schema

```bash
claudelint debugging-kubernetes-incidents/SKILL.md
```
### Generate Test Results

```bash
make generate SKILL=debugging-kubernetes-incidents
```
## Test-Driven Development Process
This skill followed TDD for Documentation:
### RED Phase (Initial Failures)

- Created 6 test scenarios representing real investigation needs
- Ran tests WITHOUT the skill
- Documented baseline failures:
  - Suggested direct fixes without investigation
  - Missed multi-source correlation
  - Confused symptoms with root causes
  - Lacked systematic methodology
### GREEN Phase (Minimal Skill)
- Created SKILL.md addressing test failures
- Added investigation phases and decision trees
- Included multi-source correlation guidance
- Emphasized read-only approach
- All tests passed
### REFACTOR Phase (Improvement)
- Added real-world examples
- Enhanced decision trees
- Improved troubleshooting matrix
- Refined investigation methodology
- Added keyword search terms
## Success Criteria
All tests must:
- ✅ Pass with 100% success rate (3/3 samples)
- ✅ Contain expected keywords
- ✅ NOT contain prohibited terms
- ✅ Demonstrate systematic approach
- ✅ Maintain read-only advisory model
## Continuous Validation
Tests run automatically on:
- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates
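As an illustration only, a trigger configuration along these lines would achieve that; the workflow name, paths, and job here are hypothetical, not this repository's actual CI configuration:

```yaml
# Hypothetical sketch, not the repository's real workflow file
name: skill-tests
on:
  pull_request:
    paths:
      - "debugging-kubernetes-incidents/**"
      - "**/scenarios.yaml"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test-only SKILL=debugging-kubernetes-incidents
```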
## Adding New Tests

To add test scenarios:

1. **Identify gap:** What investigation scenario is missing?
2. **Create scenario:** Add it to `scenarios.yaml`
3. **Run without skill:** Document the baseline failure
4. **Update SKILL.md:** Address the gap
5. **Validate:** Ensure the test passes
Example:

```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: haiku
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```
## Known Limitations
- Tests use synthetic scenarios (not real cluster data)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual Kubernetes clusters
- Sample size (3) may not catch all edge cases
## Future Improvements
- Add tests for more complex multi-cluster scenarios
- Include performance regression testing
- Add semantic similarity scoring
- Test with real cluster incident data
- Add negative test cases (what the skill should NOT do)
## Troubleshooting

### Test Failures
**Symptom:** Test fails intermittently.
**Fix:** Increase samples or refine the expected keywords.

**Symptom:** All tests fail.
**Fix:** Check the SKILL.md frontmatter and schema validation.

**Symptom:** Baseline failure unclear.
**Fix:** Run the test manually without the skill and document the actual output.
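Both reproduction steps use commands already documented above:

```bash
# Re-run only this skill's tests to reproduce a failure
make test-only SKILL=debugging-kubernetes-incidents

# Rule out schema problems first
claudelint debugging-kubernetes-incidents/SKILL.md
```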
## Contributing
When contributing test improvements:
- Ensure tests are deterministic
- Use realistic user prompts
- Document baseline failures clearly
- Keep the sample count reasonable (3-5)
- Update this README with new scenarios
## Questions?
See main repository documentation or file an issue.