# Test Suite for Debugging Kubernetes Incidents Skill

## Overview

This test suite validates that the Kubernetes incident debugging skill properly teaches Claude Code to:

  1. Recognize common Kubernetes failure patterns
  2. Follow systematic investigation methodology
  3. Correlate multiple data sources (logs, events, metrics)
  4. Distinguish root causes from symptoms
  5. Maintain read-only investigation approach

## Test Scenarios

### 1. CrashLoopBackOff Recognition

- **Purpose:** Validates that Claude recognizes the crash-loop pattern and suggests a proper investigation
- **Expected:** Should mention checking logs (`--previous`), events, `describe`, and exit codes
- **Baseline failure:** Without the skill, may suggest fixes without investigation
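
The exit-code check this scenario expects can be sketched in Python. The code-to-cause mapping below follows the common `128 + signal` container convention; the function name and wording are illustrative, not part of the skill itself.

```python
# Heuristic mapping of common container exit codes to likely causes.
# The 128 + N pattern (N = signal number) is a widespread container
# convention; treat these as starting points, not guarantees.
def interpret_exit_code(code: int) -> str:
    known = {
        0: "clean exit: check restart policy and probes",
        1: "application error: check logs with --previous",
        137: "SIGKILL (128 + 9), often OOMKilled: check memory limits",
        139: "SIGSEGV (128 + 11): process crashed",
        143: "SIGTERM (128 + 15): graceful shutdown requested",
    }
    if code in known:
        return known[code]
    if code > 128:
        return f"terminated by signal {code - 128}: check node events"
    return "application-specific exit code: check logs"
```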

### 2. OOMKilled Investigation

- **Purpose:** Ensures Claude identifies memory exhaustion and correlates it with resource limits
- **Expected:** Should investigate memory usage patterns, limits, and potential leaks
- **Baseline failure:** Without the skill, may just suggest increasing memory
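
A minimal sketch of the limit correlation this scenario looks for; the 90% threshold and the helper name are assumptions made for illustration, not values taken from the skill.

```python
# Hypothetical helper: compare peak container memory usage against the
# configured limit before deciding whether raising the limit is the fix.
# The 0.9 threshold is illustrative.
def memory_pressure(peak_bytes: int, limit_bytes: int) -> str:
    ratio = peak_bytes / limit_bytes
    if ratio >= 1.0:
        return "exceeded limit: expect OOMKilled (exit code 137)"
    if ratio >= 0.9:
        return "near limit: look for a leak or growth pattern first"
    return "ample headroom: the OOM kill likely has another cause"
```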

### 3. Multi-Source Correlation

- **Purpose:** Tests the ability to gather and correlate logs, events, and metrics
- **Expected:** Should mention all three data sources and timeline creation
- **Baseline failure:** Without the skill, may focus on a single data source
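
The correlation step can be sketched as a single merged timeline; the field layout and function name here are illustrative, not the harness's actual API.

```python
# Tag each record with its source, merge, and sort by timestamp so the
# three data sources read as one investigation timeline. Timestamps are
# ISO 8601 strings, which sort correctly as plain text.
def build_timeline(logs, events, metrics):
    merged = (
        [("log", ts, msg) for ts, msg in logs]
        + [("event", ts, msg) for ts, msg in events]
        + [("metric", ts, msg) for ts, msg in metrics]
    )
    return sorted(merged, key=lambda record: record[1])
```

Reading the merged output top to bottom also makes the "what happened first" question from the next scenario directly answerable.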

### 4. Root Cause vs. Symptom

- **Purpose:** Validates temporal analysis to distinguish cause from effect
- **Expected:** Should use a timeline and a "what happened first" approach
- **Baseline failure:** Without the skill, may confuse correlation with causation

### 5. Image Pull Failure

- **Purpose:** Tests a systematic approach to ImagePullBackOff debugging
- **Expected:** Should check the image name, registry, and pull secrets systematically
- **Baseline failure:** Without the skill, may suggest random fixes

### 6. Read-Only Investigation

- **Purpose:** Ensures the skill maintains an advisory-only approach
- **Expected:** Should recommend steps, not execute changes
- **Baseline failure:** Without the skill, might suggest direct modifications

## Running Tests

### Prerequisites

- Python 3.8+
- `claudelint` installed for validation
- Claude Code CLI access
- Claude Sonnet 4 or Claude Opus 4 (tests use the `sonnet` model)

### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-kubernetes-incidents
```

### Validate Skill Schema

```bash
claudelint debugging-kubernetes-incidents/SKILL.md
```

### Generate Test Results

```bash
make generate SKILL=debugging-kubernetes-incidents
```

## Test-Driven Development Process

This skill followed TDD for Documentation:

### RED Phase (Initial Failures)

  1. Created 6 test scenarios representing real investigation needs
  2. Ran tests WITHOUT the skill
  3. Documented baseline failures:
     - Suggested direct fixes without investigation
     - Missed multi-source correlation
     - Confused symptoms with root causes
     - Lacked systematic methodology

### GREEN Phase (Minimal Skill)

  1. Created SKILL.md addressing test failures
  2. Added investigation phases and decision trees
  3. Included multi-source correlation guidance
  4. Emphasized read-only approach
  5. All tests passed

### REFACTOR Phase (Improvement)

  1. Added real-world examples
  2. Enhanced decision trees
  3. Improved troubleshooting matrix
  4. Refined investigation methodology
  5. Added keyword search terms

## Success Criteria

All tests must:

  - Pass with a 100% success rate (3/3 samples)
  - Contain expected keywords
  - NOT contain prohibited terms
  - Demonstrate a systematic approach
  - Maintain the read-only advisory model
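
A pass/fail check along these lines might look like the sketch below. The function names and exact matching rules are assumptions about the harness, which (per Known Limitations) does basic keyword matching.

```python
# One sample passes only if it mentions every expected keyword and no
# prohibited term; the suite passes only if every sample does (3/3).
# Matching is case-insensitive substring search.
def sample_passes(response, expected_keywords, prohibited_terms):
    text = response.lower()
    has_all = all(k.lower() in text for k in expected_keywords)
    has_none = not any(p.lower() in text for p in prohibited_terms)
    return has_all and has_none

def suite_passes(responses, expected_keywords, prohibited_terms):
    return all(
        sample_passes(r, expected_keywords, prohibited_terms)
        for r in responses
    )
```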

## Continuous Validation

Tests run automatically on:

  - Every pull request (GitHub Actions)
  - Skill file modifications
  - Schema changes
  - Version updates

## Adding New Tests

To add test scenarios:

  1. **Identify the gap:** What investigation scenario is missing?
  2. **Create the scenario:** Add it to `scenarios.yaml`
  3. **Run without the skill:** Document the baseline failure
  4. **Update SKILL.md:** Address the gap
  5. **Validate:** Ensure the test passes

Example:

```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: haiku
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```

## Known Limitations

  - Tests use synthetic scenarios (not real cluster data)
  - Keyword matching is basic (could use semantic analysis)
  - No integration testing with actual Kubernetes clusters
  - The sample size (3) may not catch all edge cases
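
The last point can be quantified with a back-of-envelope calculation: if a failure mode shows up independently in each sample with probability p, then n samples all look clean with probability (1 - p)^n.

```python
# With p = 0.2 and n = 3, a behaviour that appears in 20% of responses
# still slips past all three samples about 51% of the time (0.8 ** 3).
def prob_all_clean(p: float, n: int) -> float:
    return (1 - p) ** n
```

Raising the sample count is therefore the cheapest lever against flaky passes, at the cost of longer test runs.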

## Future Improvements

  - Add tests for more complex multi-cluster scenarios
  - Include performance regression testing
  - Add semantic similarity scoring
  - Test with real cluster incident data
  - Add negative test cases (what the skill should NOT do)

## Troubleshooting

### Test Failures

**Symptom:** Test fails intermittently.
**Fix:** Increase the sample count or refine the expected keywords.

**Symptom:** All tests fail.
**Fix:** Check the SKILL.md frontmatter and schema validation.

**Symptom:** Baseline failure unclear.
**Fix:** Run the test manually without the skill and document the actual output.

## Contributing

When contributing test improvements:

  1. Ensure tests are deterministic
  2. Use realistic user prompts
  3. Document baseline failures clearly
  4. Keep the sample count reasonable (3-5)
  5. Update this README with new scenarios

## Questions?

See main repository documentation or file an issue.