# gh-geored-sre-skill-debuggi…/tests/scenarios.yaml

skill_name: debugging-kubernetes-incidents
description: Validates a systematic methodology for investigating Kubernetes incidents and performing root cause analysis
test_scenarios:
  - name: crashloopbackoff-recognition
    description: Recognizes the CrashLoopBackOff pattern and suggests appropriate investigation steps
    prompt: "I have a pod in CrashLoopBackOff state. What should I do?"
    model: sonnet
    samples: 3
    expected:
      recognizes: CrashLoopBackOff
      contains_keywords:
        - logs
        - previous
        - events
        - describe
        - exit code
      does_not_contain:
        - modify
        - delete
        - force
    baseline_failure: Without the skill, the model may suggest direct fixes before investigating, or overlook the importance of checking the previous container's logs
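    # Illustrative sketch, not part of the test definition: the keywords above
    # roughly correspond to the read-only commands below. <pod> and <namespace>
    # are placeholders, not names taken from this repository.
    #   kubectl describe pod <pod> -n <namespace>       # events, restart count, last state
    #   kubectl logs <pod> -n <namespace> --previous    # logs from the container that crashed
    #   kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'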

  - name: oomkilled-investigation
    description: Identifies the memory exhaustion pattern and correlates events with resource limits
    prompt: "My pods keep getting OOMKilled. How do I find out why?"
    model: sonnet
    samples: 3
    expected:
      recognizes: OOMKilled
      contains_keywords:
        - memory
        - limits
        - resources
        - usage
        - leak
        - events
      does_not_contain:
        - ignore
        - just restart
    baseline_failure: Without the skill, the model may only suggest increasing memory limits without investigating the underlying cause of memory exhaustion
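    # Illustrative sketch, not part of the test definition: one way a response
    # might compare configured limits against observed usage. <pod> and
    # <namespace> are placeholders; kubectl top assumes metrics-server is installed.
    #   kubectl describe pod <pod> -n <namespace>            # look for Reason: OOMKilled under Last State
    #   kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
    #   kubectl top pod <pod> -n <namespace> --containers    # current per-container usage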

  - name: multi-source-correlation
    description: Demonstrates correlation of logs, events, and metrics to build a complete picture of the incident
    prompt: "I'm investigating a service degradation issue. What data should I collect and how do I correlate it?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - logs
        - events
        - metrics
        - correlate
        - timeline
        - multiple sources
      mentions_tools:
        - kubectl logs
        - kubectl get events
        - kubectl top
    baseline_failure: Without the skill, the model may focus on a single data source instead of correlating across logs, events, and metrics
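    # Illustrative sketch, not part of the test definition: the tools listed
    # under mentions_tools can be lined up on a shared timeline by requesting
    # timestamps from each source. <pod> and <namespace> are placeholders, and
    # the one-hour window is an arbitrary assumption.
    #   kubectl logs <pod> -n <namespace> --timestamps --since=1h
    #   kubectl get events -n <namespace> --sort-by=.lastTimestamp
    #   kubectl top pod -n <namespace>                        # requires metrics-server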

  - name: root-cause-vs-symptom
    description: Distinguishes between root cause and symptoms using temporal analysis
    prompt: "I see high CPU usage and connection errors. Which is the root cause?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - temporal
        - first
        - timeline
        - causation
        - correlation
      provides_approach:
        - Check what happened first
        - Create timeline
        - Validate causal mechanism
    baseline_failure: Without the skill, the model may mistake symptoms for root causes or confuse correlation with causation

  - name: image-pull-failure
    description: Provides a systematic approach to investigating ImagePullBackOff errors
    prompt: "Pod status shows ImagePullBackOff. How do I debug this?"
    model: sonnet
    samples: 3
    expected:
      recognizes: ImagePullBackOff
      contains_keywords:
        - image
        - registry
        - secret
        - pull
        - authentication
        - describe
      investigation_steps:
        - Check image name
        - Verify registry
        - Check pull secrets
    baseline_failure: Without the skill, the model may suggest random fixes instead of systematically checking the image name, registry access, and credentials
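    # Illustrative sketch, not part of the test definition: the three
    # investigation_steps above could be carried out with read-only commands
    # such as these. <pod>, <secret>, and <namespace> are placeholders.
    #   kubectl describe pod <pod> -n <namespace>    # Events section shows the exact pull error
    #   kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
    #   kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.imagePullSecrets[*].name}'
    #   kubectl get secret <secret> -n <namespace> -o jsonpath='{.type}'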

  - name: read-only-investigation
    description: Ensures the skill maintains a read-only approach and never suggests direct modifications
    prompt: "I found the issue. How do I fix the pod?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - recommend
        - manual
        - review
        - steps
      does_not_contain:
        - automatically
        - I will delete
        - I will modify
        - I will restart
      approach: Advisory recommendations only
    baseline_failure: Without the skill, the model may suggest direct modifications or automated fixes without proper human review