skill_name: debugging-kubernetes-incidents
description: Validates systematic Kubernetes incident investigation methodology and root cause analysis capabilities
test_scenarios:
  - name: crashloopbackoff-recognition
    description: Recognizes CrashLoopBackOff pattern and suggests appropriate investigation steps
    prompt: "I have a pod in CrashLoopBackOff state. What should I do?"
    model: sonnet
    samples: 3
    expected:
      recognizes: CrashLoopBackOff
      contains_keywords:
        - logs
        - previous
        - events
        - describe
        - exit code
      does_not_contain:
        - modify
        - delete
        - force
    baseline_failure: Without the skill, may suggest direct fixes without investigation or miss the importance of checking previous container logs

  - name: oomkilled-investigation
    description: Identifies memory exhaustion pattern and correlates events with resource limits
    prompt: "My pods keep getting OOMKilled. How do I find out why?"
    model: sonnet
    samples: 3
    expected:
      recognizes: OOMKilled
      contains_keywords:
        - memory
        - limits
        - resources
        - usage
        - leak
        - events
      does_not_contain:
        - ignore
        - just restart
    baseline_failure: Without the skill, may only suggest increasing memory without investigating the underlying cause of memory exhaustion

  - name: multi-source-correlation
    description: Demonstrates correlation of logs, events, and metrics for complete incident picture
    prompt: "I'm investigating a service degradation issue. What data should I collect and how do I correlate it?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - logs
        - events
        - metrics
        - correlate
        - timeline
        - multiple sources
      mentions_tools:
        - kubectl logs
        - kubectl get events
        - kubectl top
    baseline_failure: Without the skill, may focus on only one data source without correlating across logs, events, and metrics

  - name: root-cause-vs-symptom
    description: Distinguishes between root cause and symptoms using temporal analysis
    prompt: "I see high CPU usage and connection errors. Which is the root cause?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - temporal
        - first
        - timeline
        - causation
        - correlation
      provides_approach:
        - Check what happened first
        - Create timeline
        - Validate causal mechanism
    baseline_failure: Without the skill, may incorrectly identify symptoms as root causes or confuse correlation with causation

  - name: image-pull-failure
    description: Provides systematic approach to investigating ImagePullBackOff errors
    prompt: "Pod status shows ImagePullBackOff. How do I debug this?"
    model: sonnet
    samples: 3
    expected:
      recognizes: ImagePullBackOff
      contains_keywords:
        - image
        - registry
        - secret
        - pull
        - authentication
        - describe
      investigation_steps:
        - Check image name
        - Verify registry
        - Check pull secrets
    baseline_failure: Without the skill, may suggest random fixes without systematic investigation of image name, registry access, or credentials

  - name: read-only-investigation
    description: Ensures skill maintains read-only approach and never suggests direct modifications
    prompt: "I found the issue. How do I fix the pod?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - recommend
        - manual
        - review
        - steps
      does_not_contain:
        - automatically
        - I will delete
        - I will modify
        - I will restart
      approach: Advisory recommendations only
    baseline_failure: Without the skill, might suggest direct modifications or automated fixes without proper human review