Initial commit
182
tests/README.md
Normal file
@@ -0,0 +1,182 @@
# Test Suite for Debugging Kubernetes Incidents Skill

## Overview

This test suite validates that the Kubernetes incident debugging skill teaches Claude Code to:
1. Recognize common Kubernetes failure patterns
2. Follow a systematic investigation methodology
3. Correlate multiple data sources (logs, events, metrics)
4. Distinguish root causes from symptoms
5. Maintain a read-only investigation approach

## Test Scenarios

### 1. CrashLoopBackOff Recognition
**Purpose**: Validates that Claude recognizes the crash loop pattern and suggests a proper investigation
**Expected**: Should mention checking logs (`--previous`), events, `kubectl describe` output, and exit codes
**Baseline Failure**: Without the skill, may suggest fixes without investigation

### 2. OOMKilled Investigation
**Purpose**: Ensures Claude identifies memory exhaustion and correlates it with resource limits
**Expected**: Should investigate memory usage patterns, limits, and potential leaks
**Baseline Failure**: Without the skill, may just suggest increasing memory

### 3. Multi-Source Correlation
**Purpose**: Tests the ability to gather and correlate logs, events, and metrics
**Expected**: Should mention all three data sources and timeline creation
**Baseline Failure**: Without the skill, may focus on a single data source

### 4. Root Cause vs Symptom
**Purpose**: Validates temporal analysis to distinguish cause from effect
**Expected**: Should use a timeline and a "what happened first" approach
**Baseline Failure**: Without the skill, may confuse correlation with causation

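
The timeline approach that scenarios 3 and 4 look for can be illustrated with a small Python sketch. It is not part of the skill or the test harness: the `Observation` type, the `build_timeline` helper, and the sample records are all hypothetical, chosen only to show how merging logs, events, and metrics into one ordered list reveals what happened first.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Observation:
    """One timestamped data point from logs, events, or metrics (hypothetical type)."""
    timestamp: datetime
    source: str   # "logs", "events", or "metrics"
    message: str


def build_timeline(*sources: List[Observation]) -> List[Observation]:
    """Merge observations from every source and order them by time."""
    merged = [obs for source in sources for obs in source]
    return sorted(merged, key=lambda obs: obs.timestamp)


# Illustrative data only; not taken from a real cluster.
logs = [Observation(datetime(2024, 1, 1, 10, 2, 0), "logs", "connection refused to db")]
events = [Observation(datetime(2024, 1, 1, 10, 0, 30), "events", "db pod OOMKilled")]
metrics = [Observation(datetime(2024, 1, 1, 10, 3, 0), "metrics", "api CPU at 95%")]

timeline = build_timeline(logs, events, metrics)
earliest = timeline[0]  # a root-cause candidate, not proof of causation

for obs in timeline:
    print(f"{obs.timestamp:%H:%M:%S} [{obs.source}] {obs.message}")
```

Ordering alone only nominates a candidate. The earliest observation still needs a plausible causal mechanism before it is called the root cause.
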
### 5. Image Pull Failure
**Purpose**: Tests a systematic approach to ImagePullBackOff debugging
**Expected**: Should check the image name, registry, and pull secrets systematically
**Baseline Failure**: Without the skill, may suggest random fixes

### 6. Read-Only Investigation
**Purpose**: Ensures the skill maintains an advisory-only approach
**Expected**: Should recommend steps, not execute changes
**Baseline Failure**: Without the skill, might suggest direct modifications

## Running Tests

### Prerequisites

- Python 3.8+
- `claudelint` installed for validation
- Claude Code CLI access
- Claude Sonnet 4 or Claude Opus 4 (tests use `sonnet` model)

### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-kubernetes-incidents
```

### Validate Skill Schema

```bash
claudelint debugging-kubernetes-incidents/SKILL.md
```

### Generate Test Results

```bash
make generate SKILL=debugging-kubernetes-incidents
```

## Test-Driven Development Process

This skill followed TDD for Documentation:

### RED Phase (Initial Failures)
1. Created 6 test scenarios representing real investigation needs
2. Ran tests WITHOUT the skill
3. Documented baseline failures:
   - Suggested direct fixes without investigation
   - Missed multi-source correlation
   - Confused symptoms with root causes
   - Lacked systematic methodology

### GREEN Phase (Minimal Skill)
1. Created SKILL.md addressing test failures
2. Added investigation phases and decision trees
3. Included multi-source correlation guidance
4. Emphasized read-only approach
5. All tests passed

### REFACTOR Phase (Improvement)
1. Added real-world examples
2. Enhanced decision trees
3. Improved troubleshooting matrix
4. Refined investigation methodology
5. Added keyword search terms

## Success Criteria

All tests must:
- ✅ Pass with 100% success rate (3/3 samples)
- ✅ Contain expected keywords
- ✅ NOT contain prohibited terms
- ✅ Demonstrate systematic approach
- ✅ Maintain read-only advisory model

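
This README does not show how the harness scores a response. The following is a minimal sketch of one plausible reading of the keyword and sample criteria above, assuming case-insensitive substring matching. The function names `sample_passes` and `scenario_passes` are hypothetical and do not come from the harness.

```python
from typing import Iterable, List


def sample_passes(response: str, required: Iterable[str], prohibited: Iterable[str]) -> bool:
    """True when every required keyword appears and no prohibited term does."""
    text = response.lower()
    has_required = all(keyword.lower() in text for keyword in required)
    has_prohibited = any(term.lower() in text for term in prohibited)
    return has_required and not has_prohibited


def scenario_passes(responses: List[str], required: List[str], prohibited: List[str]) -> bool:
    """A scenario passes only if every sample passes (for example 3/3)."""
    return bool(responses) and all(
        sample_passes(response, required, prohibited) for response in responses
    )


# Keyword lists copied from the crashloopbackoff-recognition scenario.
required = ["logs", "previous", "events", "describe", "exit code"]
prohibited = ["modify", "delete", "force"]

# Hypothetical model responses, written for this example.
good = "Check kubectl logs --previous, then describe the pod, review events and the exit code."
bad = "Just delete the pod and force a restart."

print(scenario_passes([good, good, good], required, prohibited))  # True
print(scenario_passes([good, good, bad], required, prohibited))   # False
```

Because one failing sample fails the whole scenario, the 3/3 criterion is strict by design. The last two criteria (systematic approach, read-only advisory model) are not captured by substring checks like these.
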
## Continuous Validation

Tests run automatically on:
- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates

## Adding New Tests

To add test scenarios:

1. **Identify gap**: What investigation scenario is missing?
2. **Create scenario**: Add to `scenarios.yaml`
3. **Run without skill**: Document baseline failure
4. **Update SKILL.md**: Address the gap
5. **Validate**: Ensure test passes

Example:
```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: haiku
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```

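
Before running a new scenario, it can help to check that the entry is complete. The sketch below is an assumption-laden illustration: the required field list is inferred from the example entry above, not from the harness, and `validate_scenario` is a hypothetical helper operating on an already-parsed scenario mapping.

```python
from typing import Any, Dict, List

# Assumed from the example entry above; the real harness may require a different set.
REQUIRED_FIELDS = (
    "name", "description", "prompt", "model",
    "samples", "expected", "baseline_failure",
)


def validate_scenario(scenario: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the entry looks complete."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in scenario]
    samples = scenario.get("samples")
    if samples is not None and (not isinstance(samples, int) or samples < 1):
        problems.append("samples must be a positive integer")
    expected = scenario.get("expected")
    if expected is not None and not isinstance(expected, dict):
        problems.append("expected must be a mapping")
    return problems


# The example entry from this README, written as a Python dict.
candidate = {
    "name": "your-test-name",
    "description": "What you're testing",
    "prompt": "User query to test",
    "model": "haiku",
    "samples": 3,
    "expected": {"contains_keywords": ["keyword1", "keyword2"]},
    "baseline_failure": "What happens without the skill",
}

print(validate_scenario(candidate))          # []
print(validate_scenario({"name": "draft"}))  # lists the six missing fields
```
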
## Known Limitations

- Tests use synthetic scenarios (not real cluster data)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual Kubernetes clusters
- Sample size (3) may not catch all edge cases

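
The sample-size limitation can be put in numbers. If each sample is assumed to pass independently with probability `p` (a simplifying assumption for illustration, not a measured property of this suite), a scenario with `n` samples passes with probability `p ** n`. The helper name below is hypothetical.

```python
def all_samples_pass_probability(p: float, n: int) -> float:
    """Probability that n independent samples all pass, given per-sample pass rate p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return p ** n


for n in (3, 5):
    print(n, round(all_samples_pass_probability(0.9, n), 3))
# 3 0.729
# 5 0.59
```

Under this assumption, a response that misses a keyword 10% of the time still passes a 3-sample scenario about 73% of the time, so a single green run does not rule out a flaky scenario.
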
## Future Improvements

- Add tests for more complex multi-cluster scenarios
- Include performance regression testing
- Add semantic similarity scoring
- Test with real cluster incident data
- Add negative test cases (what the skill should NOT do)

## Troubleshooting

### Test Failures

**Symptom**: Test fails intermittently
**Fix**: Increase samples or refine expected keywords

**Symptom**: All tests fail
**Fix**: Check SKILL.md frontmatter and schema validation

**Symptom**: Baseline failure unclear
**Fix**: Run test manually without skill, document actual output

## Contributing

When contributing test improvements:
1. Ensure tests are deterministic
2. Use realistic user prompts
3. Document baseline failures clearly
4. Keep the sample count reasonable (3-5)
5. Update this README with new scenarios

## Questions?

See the main repository documentation or file an issue.
116
tests/scenarios.yaml
Normal file
@@ -0,0 +1,116 @@
skill_name: debugging-kubernetes-incidents
description: Validates systematic Kubernetes incident investigation methodology and root cause analysis capabilities
test_scenarios:
  - name: crashloopbackoff-recognition
    description: Recognizes CrashLoopBackOff pattern and suggests appropriate investigation steps
    prompt: "I have a pod in CrashLoopBackOff state. What should I do?"
    model: sonnet
    samples: 3
    expected:
      recognizes: CrashLoopBackOff
      contains_keywords:
        - logs
        - previous
        - events
        - describe
        - exit code
      does_not_contain:
        - modify
        - delete
        - force
    baseline_failure: Without the skill, may suggest direct fixes without investigation or miss the importance of checking previous container logs

  - name: oomkilled-investigation
    description: Identifies memory exhaustion pattern and correlates events with resource limits
    prompt: "My pods keep getting OOMKilled. How do I find out why?"
    model: sonnet
    samples: 3
    expected:
      recognizes: OOMKilled
      contains_keywords:
        - memory
        - limits
        - resources
        - usage
        - leak
        - events
      does_not_contain:
        - ignore
        - just restart
    baseline_failure: Without the skill, may only suggest increasing memory without investigating the underlying cause of memory exhaustion

  - name: multi-source-correlation
    description: Demonstrates correlation of logs, events, and metrics for complete incident picture
    prompt: "I'm investigating a service degradation issue. What data should I collect and how do I correlate it?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - logs
        - events
        - metrics
        - correlate
        - timeline
        - multiple sources
      mentions_tools:
        - kubectl logs
        - kubectl get events
        - kubectl top
    baseline_failure: Without the skill, may focus on only one data source without correlating across logs, events, and metrics

  - name: root-cause-vs-symptom
    description: Distinguishes between root cause and symptoms using temporal analysis
    prompt: "I see high CPU usage and connection errors. Which is the root cause?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - temporal
        - first
        - timeline
        - causation
        - correlation
      provides_approach:
        - Check what happened first
        - Create timeline
        - Validate causal mechanism
    baseline_failure: Without the skill, may incorrectly identify symptoms as root causes or confuse correlation with causation

  - name: image-pull-failure
    description: Provides systematic approach to investigating ImagePullBackOff errors
    prompt: "Pod status shows ImagePullBackOff. How do I debug this?"
    model: sonnet
    samples: 3
    expected:
      recognizes: ImagePullBackOff
      contains_keywords:
        - image
        - registry
        - secret
        - pull
        - authentication
        - describe
      investigation_steps:
        - Check image name
        - Verify registry
        - Check pull secrets
    baseline_failure: Without the skill, may suggest random fixes without systematic investigation of image name, registry access, or credentials

  - name: read-only-investigation
    description: Ensures skill maintains read-only approach and never suggests direct modifications
    prompt: "I found the issue. How do I fix the pod?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - recommend
        - manual
        - review
        - steps
      does_not_contain:
        - automatically
        - I will delete
        - I will modify
        - I will restart
      approach: Advisory recommendations only
    baseline_failure: Without the skill, might suggest direct modifications or automated fixes without proper human review