Initial commit

2025-11-29 18:28:20 +08:00
commit 167c936f63
6 changed files with 751 additions and 0 deletions
--- a/tests/scenarios.yaml
+++ b/tests/scenarios.yaml
@@ -0,0 +1,116 @@
+skill_name: debugging-kubernetes-incidents
+description: Validates systematic Kubernetes incident investigation methodology and root cause analysis capabilities
+test_scenarios:
+  - name: crashloopbackoff-recognition
+    description: Recognizes CrashLoopBackOff pattern and suggests appropriate investigation steps
+    prompt: "I have a pod in CrashLoopBackOff state. What should I do?"
+    model: sonnet
+    samples: 3
+    expected:
+      recognizes: CrashLoopBackOff
+      contains_keywords:
+        - logs
+        - previous
+        - events
+        - describe
+        - exit code
+      does_not_contain:
+        - modify
+        - delete
+        - force
+    baseline_failure: Without the skill, may suggest direct fixes without investigation or miss the importance of checking previous container logs
+
+  - name: oomkilled-investigation
+    description: Identifies memory exhaustion pattern and correlates events with resource limits
+    prompt: "My pods keep getting OOMKilled. How do I find out why?"
+    model: sonnet
+    samples: 3
+    expected:
+      recognizes: OOMKilled
+      contains_keywords:
+        - memory
+        - limits
+        - resources
+        - usage
+        - leak
+        - events
+      does_not_contain:
+        - ignore
+        - just restart
+    baseline_failure: Without the skill, may only suggest increasing memory without investigating the underlying cause of memory exhaustion
+
+  - name: multi-source-correlation
+    description: Demonstrates correlation of logs, events, and metrics to build a complete picture of the incident
+    prompt: "I'm investigating a service degradation issue. What data should I collect and how do I correlate it?"
+    model: sonnet
+    samples: 3
+    expected:
+      contains_keywords:
+        - logs
+        - events
+        - metrics
+        - correlate
+        - timeline
+        - multiple sources
+      mentions_tools:
+        - kubectl logs
+        - kubectl get events
+        - kubectl top
+    baseline_failure: Without the skill, may focus on only one data source without correlating across logs, events, and metrics
+
+  - name: root-cause-vs-symptom
+    description: Distinguishes between root cause and symptoms using temporal analysis
+    prompt: "I see high CPU usage and connection errors. Which is the root cause?"
+    model: sonnet
+    samples: 3
+    expected:
+      contains_keywords:
+        - temporal
+        - first
+        - timeline
+        - causation
+        - correlation
+      provides_approach:
+        - Check what happened first
+        - Create timeline
+        - Validate causal mechanism
+    baseline_failure: Without the skill, may incorrectly identify symptoms as root causes or confuse correlation with causation
+
+  - name: image-pull-failure
+    description: Provides a systematic approach to investigating ImagePullBackOff errors
+    prompt: "Pod status shows ImagePullBackOff. How do I debug this?"
+    model: sonnet
+    samples: 3
+    expected:
+      recognizes: ImagePullBackOff
+      contains_keywords:
+        - image
+        - registry
+        - secret
+        - pull
+        - authentication
+        - describe
+      investigation_steps:
+        - Check image name
+        - Verify registry
+        - Check pull secrets
+    baseline_failure: Without the skill, may suggest random fixes without systematic investigation of image name, registry access, or credentials
+
+  - name: read-only-investigation
+    description: Ensures the skill maintains a read-only approach and never suggests direct modifications
+    prompt: "I found the issue. How do I fix the pod?"
+    model: sonnet
+    samples: 3
+    expected:
+      contains_keywords:
+        - recommend
+        - manual
+        - review
+        - steps
+      does_not_contain:
+        - automatically
+        - I will delete
+        - I will modify
+        - I will restart
+      approach: Advisory recommendations only
+    baseline_failure: Without the skill, may suggest direct modifications or automated fixes without proper human review
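
The `contains_keywords` and `does_not_contain` lists in the scenarios above imply a text check against each sampled response. The harness that consumes this file is not part of this diff, so the following is only a minimal sketch of how such a check might work. The function name `check_response` and the case-insensitive substring matching are assumptions, not the harness's actual behavior.

```python
# Hypothetical checker: the matching rules of the real test harness are not
# shown in this commit, so case-insensitive substring matching is assumed.

def check_response(response: str, expected: dict) -> dict:
    """Return the missing required keywords and the forbidden phrases found."""
    text = response.lower()
    missing = [kw for kw in expected.get("contains_keywords", [])
               if kw.lower() not in text]
    forbidden = [kw for kw in expected.get("does_not_contain", [])
                 if kw.lower() in text]
    return {"missing": missing, "forbidden": forbidden,
            "passed": not missing and not forbidden}


# Expectations copied from the crashloopbackoff-recognition scenario.
expected = {
    "contains_keywords": ["logs", "previous", "events", "describe", "exit code"],
    "does_not_contain": ["modify", "delete", "force"],
}

# Illustrative response text, not real model output.
sample = (
    "Run kubectl describe pod to read the events and the exit code, "
    "then check kubectl logs --previous for the last crashed container."
)
result = check_response(sample, expected)
print(result)
```

Under this assumed matching rule, a forbidden entry such as `force` would also match inside a word like "enforce", so a stricter harness might prefer word-boundary matching for the `does_not_contain` lists.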