# Test Suite for Debugging Kubernetes Incidents Skill

## Overview

This test suite validates that the Kubernetes incident debugging skill properly teaches Claude Code to:

  1. Recognize common Kubernetes failure patterns
  2. Follow systematic investigation methodology
  3. Correlate multiple data sources (logs, events, metrics)
  4. Distinguish root causes from symptoms
  5. Maintain read-only investigation approach

## Test Scenarios

### 1. CrashLoopBackOff Recognition

- **Purpose:** Validates that Claude recognizes the crash-loop pattern and suggests a proper investigation
- **Expected:** Should mention checking logs (`--previous`), events, `describe`, and exit codes
- **Baseline failure:** Without the skill, may suggest fixes without investigation
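
The exit-code check this scenario expects can be sketched in Python. The code-to-cause mapping below follows the common `128 + signal` container convention; the function name and wording are illustrative, not part of the skill itself.

```python
# Heuristic mapping of common container exit codes to likely causes.
# The 128 + N pattern (N = signal number) is a widespread container
# convention; treat these as starting points, not guarantees.
def interpret_exit_code(code: int) -> str:
    known = {
        0: "clean exit: check restart policy and probes",
        1: "application error: check logs with --previous",
        137: "SIGKILL (128 + 9), often OOMKilled: check memory limits",
        139: "SIGSEGV (128 + 11): process crashed",
        143: "SIGTERM (128 + 15): graceful shutdown requested",
    }
    if code in known:
        return known[code]
    if code > 128:
        return f"terminated by signal {code - 128}: check node events"
    return "application-specific exit code: check logs"
```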

### 2. OOMKilled Investigation

- **Purpose:** Ensures Claude identifies memory exhaustion and correlates it with resource limits
- **Expected:** Should investigate memory usage patterns, limits, and potential leaks
- **Baseline failure:** Without the skill, may just suggest increasing memory
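
A minimal sketch of the limit correlation this scenario looks for; the 90% threshold and the helper name are assumptions made for illustration, not values taken from the skill.

```python
# Hypothetical helper: compare peak container memory usage against the
# configured limit before deciding whether raising the limit is the fix.
# The 0.9 threshold is illustrative.
def memory_pressure(peak_bytes: int, limit_bytes: int) -> str:
    ratio = peak_bytes / limit_bytes
    if ratio >= 1.0:
        return "exceeded limit: expect OOMKilled (exit code 137)"
    if ratio >= 0.9:
        return "near limit: look for a leak or growth pattern first"
    return "ample headroom: the OOM kill likely has another cause"
```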

### 3. Multi-Source Correlation

- **Purpose:** Tests the ability to gather and correlate logs, events, and metrics
- **Expected:** Should mention all three data sources and timeline creation
- **Baseline failure:** Without the skill, may focus on a single data source
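
The correlation step can be sketched as a single merged timeline; the field layout and function name here are illustrative, not the harness's actual API.

```python
# Tag each record with its source, merge, and sort by timestamp so the
# three data sources read as one investigation timeline. Timestamps are
# ISO 8601 strings, which sort correctly as plain text.
def build_timeline(logs, events, metrics):
    merged = (
        [("log", ts, msg) for ts, msg in logs]
        + [("event", ts, msg) for ts, msg in events]
        + [("metric", ts, msg) for ts, msg in metrics]
    )
    return sorted(merged, key=lambda record: record[1])
```

Reading the merged output top to bottom also makes the "what happened first" question from the next scenario directly answerable.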

### 4. Root Cause vs. Symptom

- **Purpose:** Validates temporal analysis to distinguish cause from effect
- **Expected:** Should use a timeline and a "what happened first" approach
- **Baseline failure:** Without the skill, may confuse correlation with causation

### 5. Image Pull Failure

- **Purpose:** Tests a systematic approach to ImagePullBackOff debugging
- **Expected:** Should check the image name, registry, and pull secrets systematically
- **Baseline failure:** Without the skill, may suggest random fixes

### 6. Read-Only Investigation

- **Purpose:** Ensures the skill maintains an advisory-only approach
- **Expected:** Should recommend steps, not execute changes
- **Baseline failure:** Without the skill, might suggest direct modifications

## Running Tests

### Prerequisites

- Python 3.8+
- `claudelint` installed for validation
- Claude Code CLI access
- Claude Sonnet 4 or Claude Opus 4 (tests use the `sonnet` model)

### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-kubernetes-incidents
```

### Validate Skill Schema

```bash
claudelint debugging-kubernetes-incidents/SKILL.md
```

### Generate Test Results

```bash
make generate SKILL=debugging-kubernetes-incidents
```

## Test-Driven Development Process

This skill followed TDD for Documentation:

### RED Phase (Initial Failures)

  1. Created 6 test scenarios representing real investigation needs
  2. Ran tests WITHOUT the skill
  3. Documented baseline failures:
     - Suggested direct fixes without investigation
     - Missed multi-source correlation
     - Confused symptoms with root causes
     - Lacked systematic methodology

### GREEN Phase (Minimal Skill)

  1. Created SKILL.md addressing test failures
  2. Added investigation phases and decision trees
  3. Included multi-source correlation guidance
  4. Emphasized read-only approach
  5. All tests passed

### REFACTOR Phase (Improvement)

  1. Added real-world examples
  2. Enhanced decision trees
  3. Improved troubleshooting matrix
  4. Refined investigation methodology
  5. Added keyword search terms

## Success Criteria

All tests must:

  - Pass with a 100% success rate (3/3 samples)
  - Contain expected keywords
  - NOT contain prohibited terms
  - Demonstrate a systematic approach
  - Maintain the read-only advisory model
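
A pass/fail check along these lines might look like the sketch below. The function names and exact matching rules are assumptions about the harness, which (per Known Limitations) does basic keyword matching.

```python
# One sample passes only if it mentions every expected keyword and no
# prohibited term; the suite passes only if every sample does (3/3).
# Matching is case-insensitive substring search.
def sample_passes(response, expected_keywords, prohibited_terms):
    text = response.lower()
    has_all = all(k.lower() in text for k in expected_keywords)
    has_none = not any(p.lower() in text for p in prohibited_terms)
    return has_all and has_none

def suite_passes(responses, expected_keywords, prohibited_terms):
    return all(
        sample_passes(r, expected_keywords, prohibited_terms)
        for r in responses
    )
```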

## Continuous Validation

Tests run automatically on:

  - Every pull request (GitHub Actions)
  - Skill file modifications
  - Schema changes
  - Version updates

## Adding New Tests

To add test scenarios:

  1. **Identify the gap:** What investigation scenario is missing?
  2. **Create the scenario:** Add it to `scenarios.yaml`
  3. **Run without the skill:** Document the baseline failure
  4. **Update SKILL.md:** Address the gap
  5. **Validate:** Ensure the test passes

Example:

```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: haiku
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```

## Known Limitations

  - Tests use synthetic scenarios (not real cluster data)
  - Keyword matching is basic (could use semantic analysis)
  - No integration testing with actual Kubernetes clusters
  - The sample size (3) may not catch all edge cases
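
The last point can be quantified with a back-of-envelope calculation: if a failure mode shows up independently in each sample with probability p, then n samples all look clean with probability (1 - p)^n.

```python
# With p = 0.2 and n = 3, a behaviour that appears in 20% of responses
# still slips past all three samples about 51% of the time (0.8 ** 3).
def prob_all_clean(p: float, n: int) -> float:
    return (1 - p) ** n
```

Raising the sample count is therefore the cheapest lever against flaky passes, at the cost of longer test runs.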

## Future Improvements

  - Add tests for more complex multi-cluster scenarios
  - Include performance regression testing
  - Add semantic similarity scoring
  - Test with real cluster incident data
  - Add negative test cases (what the skill should NOT do)

## Troubleshooting

### Test Failures

**Symptom:** Test fails intermittently.
**Fix:** Increase the sample count or refine the expected keywords.

**Symptom:** All tests fail.
**Fix:** Check the SKILL.md frontmatter and schema validation.

**Symptom:** Baseline failure unclear.
**Fix:** Run the test manually without the skill and document the actual output.

## Contributing

When contributing test improvements:

  1. Ensure tests are deterministic
  2. Use realistic user prompts
  3. Document baseline failures clearly
  4. Keep the sample count reasonable (3-5)
  5. Update this README with new scenarios

## Questions?

See main repository documentation or file an issue.