# Test Suite for Debugging Kubernetes Incidents Skill
## Overview
This test suite validates that the Kubernetes incident debugging skill properly teaches Claude Code to:
1. Recognize common Kubernetes failure patterns
2. Follow systematic investigation methodology
3. Correlate multiple data sources (logs, events, metrics)
4. Distinguish root causes from symptoms
5. Maintain read-only investigation approach
## Test Scenarios
### 1. CrashLoopBackOff Recognition
**Purpose**: Validates Claude recognizes crash loop pattern and suggests proper investigation
**Expected**: Should mention checking logs (--previous), events, describe, and exit codes
**Baseline Failure**: Without skill, may suggest fixes without investigation
### 2. OOMKilled Investigation
**Purpose**: Ensures Claude identifies memory exhaustion and correlates with resource limits
**Expected**: Should investigate memory usage patterns, limits, and potential leaks
**Baseline Failure**: Without skill, may just suggest increasing memory
### 3. Multi-Source Correlation
**Purpose**: Tests ability to gather and correlate logs, events, and metrics
**Expected**: Should mention all three data sources and timeline creation
**Baseline Failure**: Without skill, may focus on single data source
### 4. Root Cause vs Symptom
**Purpose**: Validates temporal analysis to distinguish cause from effect
**Expected**: Should use timeline and "what happened first" approach
**Baseline Failure**: Without skill, may confuse correlation with causation
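The "what happened first" approach can be sketched as a small script: merge findings from every data source into one timeline, sort by timestamp, and treat the earliest entry as the leading root-cause candidate. The event data and field names below are illustrative, not output from any real cluster.

```python
from datetime import datetime

# Hypothetical findings gathered from logs, events, and metrics.
events = [
    {"source": "events",  "time": "2025-11-29T10:02:11", "message": "Back-off restarting failed container"},
    {"source": "logs",    "time": "2025-11-29T10:01:58", "message": "container killed: exceeded memory limit"},
    {"source": "metrics", "time": "2025-11-29T10:01:40", "message": "memory usage crossed 95% of limit"},
]

# Build a single timeline across all data sources, oldest first.
timeline = sorted(events, key=lambda e: datetime.fromisoformat(e["time"]))

for e in timeline:
    print(f'{e["time"]} [{e["source"]}] {e["message"]}')

# The earliest entry is the best root-cause candidate; later entries
# (kill, restart back-off) are symptoms that follow from it.
print("Likely root cause:", timeline[0]["message"])
```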
### 5. Image Pull Failure
**Purpose**: Tests systematic approach to ImagePullBackOff debugging
**Expected**: Should check image name, registry, and pull secrets systematically
**Baseline Failure**: Without skill, may suggest random fixes
### 6. Read-Only Investigation
**Purpose**: Ensures skill maintains advisory-only approach
**Expected**: Should recommend steps, not execute changes
**Baseline Failure**: Without skill, might suggest direct modifications
## Running Tests
### Prerequisites
- Python 3.8+
- `claudelint` installed for validation
- Claude Code CLI access
- Claude Sonnet 4 or Claude Opus 4 (tests use `sonnet` model)
### Run All Tests
```bash
# From repository root
make test
# Or specifically for this skill
make test-only SKILL=debugging-kubernetes-incidents
```
### Validate Skill Schema
```bash
claudelint debugging-kubernetes-incidents/SKILL.md
```
### Generate Test Results
```bash
make generate SKILL=debugging-kubernetes-incidents
```
## Test-Driven Development Process
This skill was developed with a TDD-for-documentation process:
### RED Phase (Initial Failures)
1. Created 6 test scenarios representing real investigation needs
2. Ran tests WITHOUT the skill
3. Documented baseline failures:
- Suggested direct fixes without investigation
- Missed multi-source correlation
- Confused symptoms with root causes
- Lacked systematic methodology
### GREEN Phase (Minimal Skill)
1. Created SKILL.md addressing test failures
2. Added investigation phases and decision trees
3. Included multi-source correlation guidance
4. Emphasized read-only approach
5. All tests passed
### REFACTOR Phase (Improvement)
1. Added real-world examples
2. Enhanced decision trees
3. Improved troubleshooting matrix
4. Refined investigation methodology
5. Added keyword search terms
## Success Criteria
All tests must:
- ✅ Pass with 100% success rate (3/3 samples)
- ✅ Contain expected keywords
- ✅ NOT contain prohibited terms
- ✅ Demonstrate systematic approach
- ✅ Maintain read-only advisory model
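A minimal sketch of how the keyword and pass-rate checks might be evaluated. The harness's actual implementation may differ; the function names and sample responses here are assumptions for illustration.

```python
def evaluate_sample(response: str, contains_keywords, prohibited_terms=()):
    """Return True if a single sampled response satisfies the keyword checks."""
    text = response.lower()
    has_all = all(kw.lower() in text for kw in contains_keywords)
    has_none = not any(term.lower() in text for term in prohibited_terms)
    return has_all and has_none

def pass_rate(samples, contains_keywords, prohibited_terms=()):
    """Fraction of samples that pass; the suite requires 1.0 (e.g. 3/3)."""
    passed = sum(evaluate_sample(s, contains_keywords, prohibited_terms) for s in samples)
    return passed / len(samples)

# Three hypothetical sampled responses to one scenario prompt.
samples = [
    "First check kubectl logs --previous, then describe the pod and review events.",
    "Check the logs, events, and describe output before proposing a fix.",
    "Inspect logs and events; correlate with the describe output.",
]
print(pass_rate(samples, ["logs", "events", "describe"], prohibited_terms=["kubectl delete"]))
```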
## Continuous Validation
Tests run automatically on:
- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates
## Adding New Tests
To add test scenarios:
1. **Identify gap**: What investigation scenario is missing?
2. **Create scenario**: Add to `scenarios.yaml`
3. **Run without skill**: Document baseline failure
4. **Update SKILL.md**: Address the gap
5. **Validate**: Ensure test passes
Example:
```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: haiku
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```
## Known Limitations
- Tests use synthetic scenarios (not real cluster data)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual Kubernetes clusters
- Sample size (3) may not catch all edge cases
## Future Improvements
- Add tests for more complex multi-cluster scenarios
- Include performance regression testing
- Add semantic similarity scoring
- Test with real cluster incident data
- Add negative test cases (what the skill should NOT do)
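As a first step beyond exact keyword matching, responses could be scored by token overlap against a reference answer. A minimal sketch, assuming simple word tokenization and Jaccard similarity; real semantic scoring would use embeddings, and the threshold would need tuning.

```python
import re

def tokens(text: str) -> set:
    """Lowercased word tokens; a crude stand-in for semantic embeddings."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Overlap score in [0, 1]: shared tokens over total distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

reference = "check pod logs events and exit codes before changing anything"
response = "first check the pod logs and events, then the exit codes"
print(round(jaccard(reference, response), 2))
```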
## Troubleshooting
### Test Failures
**Symptom**: Test fails intermittently
**Fix**: Increase samples or refine expected keywords
**Symptom**: All tests fail
**Fix**: Check SKILL.md frontmatter and schema validation
**Symptom**: Baseline failure unclear
**Fix**: Run test manually without skill, document actual output
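One thing to weigh when adjusting the samples count: under the 100% pass requirement, more samples make a genuinely flaky behavior fail more often, which is useful for surfacing flakiness but stricter to pass. A quick sketch, assuming each sample passes independently with probability p:

```python
def all_pass_probability(p: float, samples: int) -> float:
    """Chance an n-sample test passes outright when each sample
    passes independently with probability p."""
    return p ** samples

# A 90%-reliable behavior still fails a 3/3 test about 27% of the time.
for n in (1, 3, 5):
    print(n, round(all_pass_probability(0.9, n), 3))
```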
## Contributing
When contributing test improvements:
1. Ensure tests are deterministic
2. Use realistic user prompts
3. Document baseline failures clearly
4. Keep samples count reasonable (3-5)
5. Update this README with new scenarios
## Questions?
See main repository documentation or file an issue.