Initial commit
tests/README.md · Normal file · 180 lines
@@ -0,0 +1,180 @@
# Test Suite for Debugging Pipeline Failures Skill

## Overview

This test suite validates that the pipeline debugging skill properly teaches Claude Code to:

1. Follow systematic investigation methodology
2. Use standard kubectl and Tekton commands
3. Distinguish root causes from symptoms
4. Correlate logs, events, and resource states
5. Provide actionable debugging steps

## Test Scenarios

### 1. Systematic Investigation Approach

**Purpose**: Validates Claude follows phased methodology (identify → logs → events → resources → root cause)

**Expected**: Should mention systematic approach with kubectl commands for PipelineRun and TaskRun inspection

**Baseline Failure**: Without skill, may suggest random checks without structure

### 2. Image Pull Failure Diagnosis

**Purpose**: Tests systematic diagnosis of ImagePullBackOff errors

**Expected**: Should check pod events, image name, registry, and ServiceAccount imagePullSecrets

**Baseline Failure**: Without skill, may not know to check `kubectl describe pod` output or imagePullSecrets
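
For example, a passing response is expected to include checks along these lines (command shapes taken from the recorded outputs in `tests/results/`; all names are placeholders):

```bash
# Inspect the failing pod's events for the exact pull error
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"

# Confirm the ServiceAccount references an imagePullSecret
kubectl get sa <sa-name> -n <namespace> -o yaml
```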

### 3. Stuck Pipeline Investigation

**Purpose**: Validates methodology for pipelines stuck in Running state

**Expected**: Should check individual TaskRun statuses to identify which is stuck/pending

**Baseline Failure**: Without skill, may not know to list TaskRuns filtered by pipelineRun label
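
For example, responses should locate the stuck TaskRun by filtering on the pipelineRun label (placeholder names, as used in the recorded results):

```bash
# List the TaskRuns belonging to the PipelineRun and check their statuses
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```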

### 4. Resource Constraint Recognition

**Purpose**: Tests identification of scheduling and quota issues

**Expected**: Should check events for FailedScheduling and namespace resource quotas

**Baseline Failure**: Without skill, may not connect Pending state with resource constraints
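
For example, responses should correlate scheduling events with namespace quotas (placeholder names, mirroring the recorded results):

```bash
# Look for FailedScheduling events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check namespace resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```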

### 5. Log Analysis Methodology

**Purpose**: Ensures proper Tekton log retrieval for failed steps

**Expected**: Should know how to get logs from specific step containers in Tekton pods

**Baseline Failure**: Without skill, may not understand Tekton step container naming
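
Tekton names each step's container `step-<step-name>`, so responses are expected to read logs from that container of the TaskRun pod (placeholder names):

```bash
# Find the pod for the TaskRun, then read the failed step's container logs
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```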

### 6. Root Cause vs Symptom

**Purpose**: Validates focus on investigation before applying fixes

**Expected**: Should recommend investigating logs and root cause before increasing timeouts

**Baseline Failure**: Without skill, may suggest quick fixes without investigation

## Running Tests

### Prerequisites

- Python 3.8+
- Claude Code CLI access
- Claude Sonnet 4.5 (tests use `sonnet` model)
- Access to test framework (if available in konflux-ci/skills repo)

### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-pipeline-failures
```

### Validate Skill Schema

```bash
claudelint debugging-pipeline-failures/SKILL.md
```

### Generate Test Results

```bash
make generate SKILL=debugging-pipeline-failures
```

## Test-Driven Development Process

This skill was developed using TDD for Documentation:

### RED Phase (Initial Failures)

1. Created 6 test scenarios representing real pipeline debugging needs
2. Ran tests WITHOUT the skill
3. Documented baseline failures:
   - No systematic methodology
   - Didn't know Tekton-specific kubectl commands
   - Confused symptoms with root causes
   - Missing event and resource correlation

### GREEN Phase (Minimal Skill)

1. Created SKILL.md addressing test failures
2. Added 5-phase investigation methodology
3. Included kubectl command examples
4. Emphasized root cause analysis
5. All tests passed

### REFACTOR Phase (Improvement)

1. Added common failure patterns (6 types)
2. Enhanced with decision tree
3. Improved troubleshooting workflow
4. Added common confusions section

## Success Criteria

All tests must:

- ✅ Pass with 100% success rate (3/3 samples)
- ✅ Contain expected keywords (kubectl, systematic approach)
- ✅ NOT contain prohibited terms (quick fixes without investigation)
- ✅ Demonstrate phased methodology
- ✅ Focus on standard Tekton/Kubernetes tools

## Continuous Validation

Tests run automatically on:

- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates

## Adding New Tests

To add test scenarios:

1. **Identify gap**: What failure pattern is missing?
2. **Create scenario**: Add to `scenarios.yaml`
3. **Run without skill**: Document baseline failure
4. **Update SKILL.md**: Address the gap
5. **Validate**: Ensure test passes

Example:

```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: sonnet
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```

## Known Limitations

- Tests use synthetic scenarios (not real Konflux failures)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual clusters
- Sample size (3) may not catch all edge cases

## Future Improvements

- Add tests for multi-step pipeline failures
- Include workspace debugging scenarios
- Add tests for intermittent failures
- Test with real Konflux pipeline YAML

## Troubleshooting

### Test Failures

**Symptom**: Test fails intermittently

**Fix**: Increase samples or refine expected keywords
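
For example, a flaky scenario can be stabilized in `scenarios.yaml` by raising `samples` or choosing broader keywords (the values below are illustrative, not taken from an actual scenario):

```yaml
- name: flaky-scenario-name   # hypothetical scenario being stabilized
  samples: 5                  # raised from 3 to smooth out sampling noise
  expected:
    contains_keywords:
      - kubectl               # prefer broad, stable keywords over exact phrasing
```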

**Symptom**: All tests fail

**Fix**: Check SKILL.md frontmatter and schema validation

**Symptom**: Baseline failure unclear

**Fix**: Run the test manually without the skill and document the actual output

## Contributing

When contributing test improvements:

1. Ensure tests are deterministic
2. Use realistic Konflux user prompts
3. Document baseline failures clearly
4. Keep the sample count reasonable (3-5)
5. Update this README with new scenarios

## Questions?

See the main repository documentation or file an issue in konflux-ci/skills.
tests/results/image-pull-failure-diagnosis.1.txt · Normal file · 105 lines
@@ -0,0 +1,105 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

## Debugging ImagePullBackOff Error

ImagePullBackOff indicates Kubernetes cannot pull the container image. Let's investigate step by step.

### Step 1: Find the Failed Pod

First, identify which pipeline and pod is failing:

```bash
# List recent PipelineRuns
kubectl get pipelinerun -n <namespace> --sort-by=.metadata.creationTimestamp

# Or if you know the PipelineRun name:
kubectl get pipelinerun <pr-name> -n <namespace>

# Find the associated TaskRuns
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Find the pod with the issue
kubectl get pods -n <namespace> | grep ImagePullBackOff
```

### Step 2: Check Pod Events

Once you've identified the pod, check the events for specific details:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
```

This will show you:
- **Exact image name** that failed to pull
- **Error reason** (authentication failed, not found, etc.)
- **Registry** being accessed

### Step 3: Common Root Causes & Fixes

**A. Image Name/Tag Typo**
- Check if the image name and tag are spelled correctly
- Verify the image actually exists in the registry

**B. Registry Authentication**
```bash
# Check if ServiceAccount has imagePullSecrets
kubectl get sa <service-account-name> -n <namespace> -o yaml

# Check if the secret exists
kubectl get secrets -n <namespace>
```

**Fix**: Add imagePullSecret to your ServiceAccount:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: <sa-name>
imagePullSecrets:
  - name: <registry-secret-name>
```

**C. Private Registry - Missing Credentials**
```bash
# Create a docker registry secret
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>
```

**D. Image Doesn't Exist**
- Verify the image exists in your registry
- Check if the tag is correct (common issue: using `latest` when it doesn't exist)

**E. Network/Registry Issues**
```bash
# Check if the registry is accessible from the cluster
kubectl run test-pull --image=<your-image> -n <namespace> --rm -it --restart=Never
```

### Quick Diagnosis Commands

Run these to gather all relevant info:

```bash
# Get full pod details
kubectl describe pod <pod-name> -n <namespace>

# Check the TaskRun spec for image references
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A2 "image:"

# View ServiceAccount configuration
kubectl get sa -n <namespace> -o yaml
```

---

**Next Steps:**
1. Run the commands above with your actual namespace and resource names
2. Share the output from `kubectl describe pod` (especially the Events section)
3. I can help you identify the exact cause and create the appropriate fix

What's your namespace and do you know the PipelineRun or pod name that's failing?
tests/results/image-pull-failure-diagnosis.2.txt · Normal file · 59 lines
@@ -0,0 +1,59 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to systematically debug your ImagePullBackOff issue:

## Immediate Steps

**1. Find the failing pod and check events:**

```bash
# Find your PipelineRun
kubectl get pipelinerun -n <namespace>

# Get the failed TaskRun
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Check pod events - this will show the ImagePullBackOff details
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
```

**2. The events will reveal the specific image problem. Common causes:**

- **Typo in image name/tag** - Check the exact image reference
- **Image doesn't exist** - Verify the image exists in the registry
- **Registry authentication** - Missing or incorrect imagePullSecrets
- **Private registry access** - Network policies blocking access

**3. Verify the image:**

```bash
# Check what image is being pulled (look in TaskRun spec)
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -i image

# Check ServiceAccount has imagePullSecrets
kubectl get sa <sa-name> -n <namespace> -o yaml
```

## Quick Fixes

**If it's an authentication issue:**
```bash
# Check if imagePullSecret exists
kubectl get secrets -n <namespace>

# Add imagePullSecret to ServiceAccount if missing
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
```

**If it's a typo:**
- Correct the image name/tag in your Task or Pipeline definition

**If the image doesn't exist:**
- Build and push the image first
- Or use the correct tag that exists

Would you like me to help you run these commands? I'll need:
1. Your namespace name
2. The PipelineRun name (or we can list them to find it)
tests/results/image-pull-failure-diagnosis.3.txt · Normal file · 66 lines
@@ -0,0 +1,66 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

### 1. Find the Failed Pod and Check Events

```bash
# Find your PipelineRun and identify the failed TaskRun
kubectl get pipelinerun -n <namespace>

# Get TaskRuns for the failed pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Find the pod with ImagePullBackOff
kubectl get pods -n <namespace> | grep ImagePullBackOff

# Check events for the specific pod
kubectl describe pod <pod-name> -n <namespace>
```

Look in the Events section for details about **why** the image pull failed.

### 2. Common Root Causes

**Image Name/Tag Issues:**
- Typo in image name or tag
- Tag doesn't exist in registry
- Wrong registry URL

**Authentication Issues:**
- Missing imagePullSecret on ServiceAccount
- Invalid/expired registry credentials
- Wrong secret referenced

**Registry Accessibility:**
- Registry is down or unreachable
- Network policies blocking access
- Private registry requires authentication

### 3. Verify the Image

```bash
# Check what image the pod is trying to pull
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Check if ServiceAccount has imagePullSecrets
kubectl get sa <service-account-name> -n <namespace> -o yaml
```

### 4. Fix Based on Root Cause

**If image name is wrong:**
- Correct the image reference in your Task/Pipeline definition

**If authentication is missing:**
```bash
# Add imagePullSecret to ServiceAccount
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
```

**If credentials are invalid:**
- Update the docker-registry secret with valid credentials

Would you like me to help you investigate further? Please share:
1. Your namespace name
2. The PipelineRun or Pod name that's failing
3. The output of `kubectl get pods -n <namespace>` showing the ImagePullBackOff

I can then run the specific commands to identify the exact root cause.
tests/results/log-analysis-methodology.1.txt · Normal file · 46 lines
@@ -0,0 +1,46 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to find the error in your failed build TaskRun:

## Quick Steps to Find the Error

1. **Get the logs from the failed step:**

```bash
# First, find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>

# Then get logs from the build step (or whichever step failed)
kubectl logs <pod-name> -c step-build -n <namespace>

# If you don't know which step failed, get all logs:
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

2. **Check TaskRun details to see which step failed:**

```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```

Look for the "Status" section which shows which step failed and the exit code.

3. **Review Kubernetes events for additional context:**

```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
```

## What to Look For in the Logs

- **Error messages**: Search for keywords like "error", "failed", "fatal"
- **The last successful operation** before the failure
- **Exit code explanation**:
  - Exit code 1 = General failure (check the error message in logs)
  - Exit code 127 = Command not found
  - Exit code 137 = OOMKilled (out of memory)

## Common Exit Code 1 Causes

- Build script errors (compilation failures, test failures)
- Missing dependencies
- Configuration errors
- Network issues during dependency downloads

**Do you have the TaskRun name and namespace handy?** I can help you run these commands to find the specific error.
tests/results/log-analysis-methodology.2.txt · Normal file · 52 lines
@@ -0,0 +1,52 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to find the error in your failed build TaskRun:

## Quick Steps to Find Your Error

**1. Get the logs from your failed TaskRun:**

```bash
# First, find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>

# Then get the logs (replace with your pod name)
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

**2. If you know which step failed, get logs from that specific step:**

```bash
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```

**3. Check the TaskRun status for clues:**

```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```

Look for the `Status` section which shows:
- Which step failed
- Exit code details
- Failure reasons

**4. Check Kubernetes events for additional context:**

```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

## Common Exit Code 1 Causes

Since you got exit code 1, it typically means:
- **Build script error** - compilation failed, tests failed, linting error
- **Command failed** - a command in your build script returned an error
- **Missing dependency** - required tool or package not found

The logs from step 1 will show the actual error message before the exit.

**Can you provide:**
1. Your TaskRun name and namespace?
2. Or paste the output from `kubectl describe taskrun <name>`?

This will help me pinpoint the exact error for you.
tests/results/log-analysis-methodology.3.txt · Normal file · 49 lines
@@ -0,0 +1,49 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to find the error in your failed build TaskRun:

## Quick Steps to Find Your Error

**1. Get the TaskRun logs** (most likely to show the error):
```bash
# Find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>

# Get logs from all containers to see the error
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

**2. Check the TaskRun details** to see which step failed:
```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```

Look for the "Status" section which will show:
- Which step failed
- The exit code (you mentioned exit code 1)
- Failure reason/message

**3. Get logs from the specific failed step** (once you identify it):
```bash
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```

**4. Check events** for additional context:
```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
```

## What to Look For in Logs

Exit code 1 typically means a general script/command failure. Search the logs for:
- Error messages (grep for "error", "failed", "fatal")
- The last command that executed before failure
- Build tool specific errors (npm, maven, gradle, etc.)
- Test failures
- Missing files or dependencies

**Need help running these commands?** If you provide me with:
- Your TaskRun name
- Your namespace
- Access to run kubectl commands in your environment

I can help you execute these commands and analyze the output to pinpoint the exact error.
tests/results/resource-constraint-recognition.1.txt · Normal file · 58 lines
@@ -0,0 +1,58 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging skill loaded, here's how to find out why your TaskRun is stuck in Pending state:

## Quick Diagnostic Steps

**1. Check the TaskRun status and details:**
```bash
kubectl get taskrun <tr-name> -n <namespace>
kubectl describe taskrun <tr-name> -n <namespace>
```

**2. Check for scheduling issues via events:**
```bash
# Get recent events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <tr-name>

# Or check pod events directly
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
```

**3. Check resource quotas and constraints:**
```bash
# Check namespace quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node capacity
kubectl describe node | grep -A5 "Allocated resources"
```

## Common Causes of Pending TaskRuns

| Cause | What to Look For | Solution |
|-------|------------------|----------|
| **Resource quota exceeded** | Events: `FailedScheduling`, quota errors | Increase namespace quota or reduce resource requests |
| **Insufficient cluster resources** | Events: `FailedScheduling` (no nodes available) | Wait for resources or scale cluster |
| **Missing PVC/workspace** | Events: `FailedMount` or volume errors | Create required PVC or fix workspace config |
| **Invalid ServiceAccount** | Events: permission errors | Verify ServiceAccount exists and has proper permissions |
| **Image pull secrets missing** | Pod can't be created, image pull errors | Add imagePullSecrets to ServiceAccount |

## Most Likely Issue

If TaskRun is **Pending**, it typically means the underlying pod cannot be scheduled. The events will tell you exactly why:

```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```

Look for messages like:
- `0/X nodes are available: insufficient memory/cpu` → Resource constraints
- `persistentvolumeclaim "X" not found` → Missing PVC
- `exceeded quota` → Namespace quota limit

Would you like me to help you run these commands? Please provide:
- Your TaskRun name
- Your namespace

Or if you already have the output from any of these commands, share it and I can help interpret the results.
tests/results/resource-constraint-recognition.2.txt · Normal file · 63 lines
@@ -0,0 +1,63 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging-pipeline-failures skill, here's how to find out why your TaskRun is stuck in Pending state:

## Quick Diagnosis Steps

**1. Check the TaskRun status and conditions:**
```bash
kubectl get taskrun <tr-name> -n <namespace>
kubectl describe taskrun <tr-name> -n <namespace>
```

**2. Check for pod existence and status:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```

**3. Check Kubernetes events (most revealing for Pending issues):**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Or filter for the specific TaskRun pod
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```

## Common Causes for Pending State

The events will typically reveal one of these issues:

| Event Type | Root Cause | Solution |
|------------|------------|----------|
| `FailedScheduling` | Resource constraints - not enough CPU/memory | Check namespace quotas and node capacity |
| `Insufficient cpu/memory` | Node can't satisfy resource requests | Reduce resource requests or add nodes |
| No pod exists yet | TaskRun hasn't created pod | Check PipelineRun/Task configuration |
| PVC issues | Workspace PVC not bound | Check PVC status with `kubectl get pvc` |
| Node selector mismatch | No nodes match selectors/taints | Review nodeSelector and tolerations |

## Detailed Investigation

**Check resource quotas:**
```bash
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```

**Check node capacity:**
```bash
kubectl describe node | grep -A5 "Allocated resources"
```

**Check PVC status (if using workspaces):**
```bash
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
```

**Inspect TaskRun YAML for resource requests:**
```bash
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A10 "resources:"
```

Would you like me to help you run these commands? Please provide:
- The TaskRun name
- The namespace
- Or share the output of `kubectl describe taskrun <tr-name>`
tests/results/resource-constraint-recognition.3.txt
Normal file
50
tests/results/resource-constraint-recognition.3.txt
Normal file
@@ -0,0 +1,50 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging guide, here's how to investigate a TaskRun stuck in Pending state:
|
||||
|
||||
## Quick Diagnostic Steps
|
||||
|
||||
**1. Check the TaskRun status:**
|
||||
```bash
|
||||
kubectl get taskrun <tr-name> -n <namespace>
|
||||
kubectl describe taskrun <tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
**2. Check for resource constraint events:**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
|
||||
kubectl get events --field-selector involvedObject.name=<tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for these critical events:
|
||||
- `FailedScheduling` → Resource constraints (CPU/memory quota exceeded or insufficient node capacity)
|
||||
- `FailedMount` → Volume/PVC issues
|
||||
|
||||
**3. Check resource quotas:**
|
||||
```bash
|
||||
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
|
||||
```
|
||||
|
||||
**4. Check if the pod exists:**
|
||||
```bash
|
||||
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
If no pod exists, the TaskRun can't schedule due to resource constraints.
|
||||
|
||||
## Common Causes for Pending TaskRuns
|
||||
|
||||
1. **Namespace quota exceeded** - No CPU/memory quota available
|
||||
2. **Insufficient cluster resources** - No nodes with available capacity
|
||||
3. **PVC not bound** - Workspace PersistentVolumeClaim is pending
|
||||
4. **Node selector mismatch** - No nodes match the required labels
|
||||
5. **Pod security admission** - Security policies blocking pod creation
|
||||
|
||||
## Next Steps
|
||||
|
||||
Once you identify the issue from events/quotas, you can:
|
||||
- Increase namespace resource quotas
|
||||
- Reduce resource requests in the Task definition
|
||||
- Fix PVC binding issues
|
||||
- Adjust node selectors or taints/tolerations
|
||||
|
||||
**Would you like me to help you run these diagnostic commands for a specific TaskRun?** Just provide the TaskRun name and namespace.
|
||||
tests/results/root-cause-vs-symptom.1.txt · Normal file · 22 lines
@@ -0,0 +1,22 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Before just increasing the timeout, let's investigate **why** your pipeline is failing. Timeouts are often a symptom of an underlying issue, not the root cause itself.

Simply increasing the timeout might mask problems like:
- **Hung processes** or deadlocks that will never complete
- **Resource starvation** (CPU, memory, network)
- **Inefficient operations** that could be optimized
- **Infrastructure issues** like slow I/O or network problems
- **Misconfigurations** in the pipeline setup

I can help you debug this. I have access to a specialized skill for debugging Konflux/Tekton pipeline failures. To investigate, I'll need to know:

1. **What type of pipeline is this?** (Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
2. **Where is it timing out?** (Specific stage/task/step)
3. **Can you share:**
   - The pipeline configuration file
   - Recent failure logs
   - Or point me to the pipeline definition in your repo

If this is a Konflux/Tekton pipeline, I can invoke the debugging skill to systematically analyze the failure. Otherwise, I can still help investigate once I see the pipeline setup and logs.

What information can you provide about the failing pipeline?
tests/results/root-cause-vs-symptom.2.txt · Normal file · 15 lines
@@ -0,0 +1,15 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Before just increasing the timeout, it's better to understand **why** your pipeline is failing. A timeout is often a symptom of an underlying issue rather than the root cause.

Let me help you debug this systematically. I can use a specialized skill for debugging pipeline failures that will help us:

1. Identify what's actually timing out or failing
2. Check for common issues (stuck tasks, resource problems, build errors)
3. Examine logs to find the root cause
4. Determine if a timeout increase is actually the right solution

Would you like me to investigate your pipeline failure? If so, I'll need to know:
- What type of pipeline are you running? (e.g., Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
- Do you have access to the pipeline logs or configuration files in this directory?

Or if this is a Konflux/Tekton pipeline, I can start debugging right away using the specialized skill.
tests/results/root-cause-vs-symptom.3.txt · Normal file · 25 lines
@@ -0,0 +1,25 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Now I can help you properly investigate the pipeline failure. Rather than just increasing the timeout (which usually masks the real problem), let's find out what's actually going wrong.

**To get started, I need some information:**

1. **What namespace is your pipeline running in?**
2. **Do you know the PipelineRun name?** (or can you describe which pipeline is failing?)

Once you provide this, I'll systematically investigate by:
- Checking the PipelineRun status to see which TaskRun failed
- Examining logs for the actual error
- Reviewing Kubernetes events for resource issues
- Identifying the root cause (slow operation, resource constraint, actual failure, etc.)

**Common scenarios where timeout increases are NOT the solution:**
- Build scripts with errors (needs code fix)
- Resource exhaustion (needs quota/limit adjustments)
- Image pull failures (needs registry/auth fixes)
- Missing dependencies (needs build config changes)

**When timeout increases ARE appropriate:**
- Legitimately slow operations (large downloads, complex builds)
- After verifying the operation is progressing but just needs more time

So let's find out which category your failure falls into. What's the namespace and PipelineRun name?
tests/results/stuck-pipeline-investigation.1.txt · Normal file · 85 lines
@@ -0,0 +1,85 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

For a PipelineRun stuck in Running state for over an hour, here's a systematic approach to diagnose the issue:

## Immediate Investigation Steps

**1. Check which TaskRuns are actually running or stuck:**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```

Look for TaskRuns that are:
- **Pending** → Resource or scheduling issues
- **Running** for too long → Timeout or hanging process

**2. Examine the PipelineRun details:**
```bash
kubectl describe pipelinerun <pr-name> -n <namespace>
```

Check for conditions, recent events, and which TaskRun is currently executing.

**3. If a TaskRun is stuck in "Pending":**
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Check why it's not scheduling
kubectl describe pod <pod-name> -n <namespace>
```

Common causes:
- Insufficient cluster resources (CPU/memory)
- Namespace quota exceeded
- PVC not bound (for workspace volumes)

**4. If a TaskRun is stuck in "Running":**
```bash
# Check logs to see if it's making progress
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

Look for:
- Is the process actually doing work or hung?
- Last log message timestamp
- Network operations that might be stalled

**5. Check Kubernetes events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

Events will reveal:
- `FailedScheduling` → Not enough resources
- `FailedMount` → Volume/PVC problems
- Image pull issues

## Most Common Causes for Stuck Pipelines

1. **Resource constraints** - Pod waiting for available CPU/memory
2. **Timeout not configured** - Task running indefinitely
3. **Hanging process** - Network download stalled, waiting for input
4. **PVC binding issues** - Workspace volume not ready
5. **Dead node** - Pod scheduled on unresponsive node

## Quick Diagnostic Command Sequence

Run these in parallel to get a full picture:

```bash
# Check overall status
kubectl get pipelinerun <pr-name> -n <namespace>

# See all TaskRuns and their states
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Check recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Check namespace resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```

Would you like me to help you run these commands? I'll need:
- The PipelineRun name
- The namespace
tests/results/stuck-pipeline-investigation.2.txt · Normal file · 68 lines
@@ -0,0 +1,68 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's what you should check for a PipelineRun stuck in Running state for over an hour:

## Immediate Checks

**1. Identify which TaskRun(s) are stuck:**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```

Look for TaskRuns that are either:
- **Pending** → Resource constraints or scheduling issues
- **Running** for too long → Timeout or hanging process

**2. Check the stuck TaskRun's pod status:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```

**3. Review recent events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

Look for:
- `FailedScheduling` → Resource quota exceeded or insufficient cluster resources
- `FailedMount` → PVC or workspace issues
- `ImagePullBackOff` → Can't pull container image

## Common Causes for Stuck Pipelines

### **If TaskRun is Pending:**
```bash
# Check resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node resources
kubectl describe node | grep -A5 "Allocated resources"
```
**Likely cause:** Insufficient resources to schedule the pod

### **If TaskRun is Running:**
```bash
# Check the logs for the current step
kubectl logs <pod-name> --all-containers=true -n <namespace>

# Check which step is currently executing
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.steps[*].name}{"\n"}{.status.steps[*].terminated}'
```
**Likely causes:**
- Process hanging without output
- Network operation stalled (downloading dependencies)
- Timeout set too high, waiting for operation to complete

### **If Pod hasn't started:**
```bash
kubectl describe pod <pod-name> -n <namespace>
```
Check the Events section for ImagePull errors or volume mount failures.

## Quick Decision Path

1. **TaskRuns all Pending?** → Resource/quota issue
2. **One TaskRun Running forever?** → Check its logs for hanging process
3. **No pod created yet?** → Image or workspace configuration problem
4. **Pod exists but waiting?** → Check events for FailedMount/ImagePull errors

What namespace and PipelineRun name are you working with? I can help you run these diagnostic commands.
tests/results/stuck-pipeline-investigation.3.txt · Normal file · 48 lines
@@ -0,0 +1,48 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

## Quick Diagnostic Steps

**1. Check TaskRun Status**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```
This shows which TaskRuns are Pending, Running, or Completed.

**2. For Pending TaskRuns** - Resource Constraints:
```bash
# Check namespace quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node capacity
kubectl describe node | grep -A5 "Allocated resources"

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
Look for `FailedScheduling` events.

**3. For Running TaskRuns** - Progress Check:
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Check logs for the running step
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
Look for signs of progress or if it's hanging.

**4. Check for Timeouts**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}'
```

## Common Causes

1. **Pending TaskRun** → Insufficient resources, quota exceeded, or no available nodes
2. **Running but hung** → Network operation timeout, process hanging, or slow build
3. **Waiting for dependencies** → Previous task not completing, workspace/volume issues

Would you like me to help you run these diagnostic commands? Please provide:
- Your PipelineRun name
- Namespace
- Or share the output of `kubectl get pipelinerun <pr-name> -n <namespace>`
tests/results/systematic-investigation-approach.1.txt · Normal file · 24 lines
@@ -0,0 +1,24 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

I need your approval to run kubectl commands to investigate the failed PipelineRun. Here's my systematic investigation plan:

## Investigation Steps

**Phase 1: Identify Failed Component**
1. Check PipelineRun status to see overall state and failure reasons
2. Identify which TaskRun(s) failed
3. Examine duration and timestamps

**Phase 2: Log Analysis**
4. Get logs from the failed TaskRun's pod
5. Search for error messages and exit codes
6. Identify the last successful step before failure

**Phase 3: Event Correlation**
7. Check Kubernetes events for issues like ImagePullBackOff, FailedMount, resource constraints

**Phase 4: Resource Inspection**
8. Inspect PipelineRun and TaskRun YAML for configuration issues

Once you approve, I'll run these commands to gather the diagnostic information and identify the root cause of your pipeline failure.

Would you like me to proceed with the investigation?
tests/results/systematic-investigation-approach.2.txt · Normal file · 90 lines
@@ -0,0 +1,90 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

I see that kubectl commands require approval in this environment. Let me provide you with the systematic investigation steps you should follow:

## Investigation Steps for 'component-build-xyz'

### **Phase 1: Identify Failed Component**

Run these commands to understand what failed:

```bash
# Check overall PipelineRun status
kubectl get pipelinerun component-build-xyz -n user-tenant

# Get detailed information
kubectl describe pipelinerun component-build-xyz -n user-tenant

# List all TaskRuns in this pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz -n user-tenant
```

**What to look for:**
- Which TaskRun(s) show `Failed` status
- The failure reason in the Conditions section
- Timestamps to understand when it failed

### **Phase 2: Analyze Failed TaskRun Logs**

Once you identify the failed TaskRun (let's call it `<failed-tr-name>`):

```bash
# Find the pod for the failed TaskRun
kubectl get pods -l tekton.dev/taskRun=<failed-tr-name> -n user-tenant

# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n user-tenant

# Or check specific step that failed
kubectl logs <pod-name> -c step-<step-name> -n user-tenant
```

**Search for:**
- Error messages (grep for "error", "failed", "fatal")
- Exit codes
- The last successful operation before failure

### **Phase 3: Check Kubernetes Events**

```bash
# Get recent events sorted by time
kubectl get events -n user-tenant --sort-by='.lastTimestamp' | tail -20

# Filter for specific pod if you found it
kubectl get events --field-selector involvedObject.name=<pod-name> -n user-tenant
```

**Look for critical events:**
- `ImagePullBackOff` - Image/registry issues
- `FailedScheduling` - Resource constraints
- `FailedMount` - Volume/PVC problems
- `OOMKilled` - Memory limits exceeded

### **Phase 4: Inspect Resource Details**

```bash
# Get full PipelineRun YAML
kubectl get pipelinerun component-build-xyz -n user-tenant -o yaml > pr-details.yaml

# Get failed TaskRun details
kubectl get taskrun <failed-tr-name> -n user-tenant -o yaml > tr-details.yaml

# Describe the pod
kubectl describe pod <pod-name> -n user-tenant
```

## Common Failure Scenarios

Based on the failure pattern, here's what to check:

| **If you see** | **Root cause** | **Fix** |
|----------------|----------------|---------|
| ImagePullBackOff | Wrong image name/tag, missing credentials | Verify image exists, check ServiceAccount imagePullSecrets |
| OOMKilled or exit code 137 | Out of memory | Increase memory limits in Task definition |
| Exit code 127 | Command not found | Wrong container image or missing tools |
| Timeout in status | Operation took too long | Increase timeout or optimize build |
| Permission denied | RBAC/ServiceAccount issues | Check ServiceAccount permissions |
| Volume mount errors | PVC not bound or misconfigured | Check PVC status: `kubectl get pvc -n user-tenant` |

---

**Would you like me to help analyze the output once you run these commands?** Just share the results and I'll help identify the root cause and recommend the fix.
tests/results/systematic-investigation-approach.3.txt · Normal file · 15 lines
@@ -0,0 +1,15 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

I need your approval to run kubectl commands to investigate the failed PipelineRun. These commands will:

1. **Get the PipelineRun status** - to see the overall state and failure information
2. **Describe the PipelineRun** - to get detailed information about conditions, failed tasks, and timestamps

Once approved, I'll follow this systematic investigation process:

1. ✓ Check PipelineRun status and identify which components failed
2. Identify the specific TaskRun(s) that failed
3. Get logs from the failed TaskRun pods to find error messages
4. Check Kubernetes events for additional context (image pull issues, resource constraints, etc.)
5. Correlate all findings to identify the root cause and recommend fixes

Should I proceed with running these kubectl commands to investigate your failed PipelineRun?
tests/scenarios.yaml · Normal file · 90 lines
@@ -0,0 +1,90 @@

skill_name: debugging-pipeline-failures
description: Validates systematic Konflux pipeline debugging methodology using standard kubectl and Tekton CLI approaches

# Note: .config/gcloud is automatically copied for all tests (needed for Claude Code API auth)
# Add skill-specific paths here if needed (e.g., .kube/config for kubectl)

test_scenarios:
  - name: systematic-investigation-approach
    description: Validates that Claude follows systematic phases for pipeline failure investigation
    prompt: "My PipelineRun 'component-build-xyz' in namespace 'user-tenant' failed. How do I investigate?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl
        - pipelinerun
        - taskrun
        - logs
        - events
        - systematic
    baseline_failure: "Without skill, may suggest random checks without systematic methodology"

  - name: image-pull-failure-diagnosis
    description: Tests diagnosis approach for ImagePullBackOff errors in build pipelines
    prompt: "My build pipeline is failing with ImagePullBackOff. How do I debug this?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - describe pod
        - events
        - image
        - registry
        - imagePullSecrets
    baseline_failure: "Without skill, may not check pod events or ServiceAccount imagePullSecrets"

  - name: stuck-pipeline-investigation
    description: Validates methodology for pipelines stuck in Running state
    prompt: "My PipelineRun has been in Running state for over an hour. What should I check?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - taskrun
        - status
        - pending
        - running
        - logs
    baseline_failure: "Without skill, may not check individual TaskRun statuses to identify stuck component"

  - name: resource-constraint-recognition
    description: Tests identification of resource exhaustion issues
    prompt: "TaskRun is stuck in Pending state. How do I find out why?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["events", "describe"]
        - ["quota", "limits", "resources"]
        - ["FailedScheduling", "pending", "stuck", "scheduling"]
    baseline_failure: "Without skill, may not check events for FailedScheduling or namespace quotas"

  - name: log-analysis-methodology
    description: Ensures proper log analysis approach for build failures
    prompt: "My build TaskRun failed with exit code 1. How do I find the error?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl logs
        - step
        - container
        - error
    baseline_failure: "Without skill, may not know how to get logs from specific Tekton step containers"

  - name: root-cause-vs-symptom
    description: Validates distinction between symptoms and root causes
    prompt: "My pipeline keeps failing. Should I just increase the timeout?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["logs", "describe", "output", "details"]
        - ["root cause", "proper fix", "actual problem", "actually doing", "pinpoint"]
        - ["investigate", "identify", "check", "find out", "determine"]
        - ["why", "what"]
      does_not_contain:
        - just increase
        - simply add
    baseline_failure: "Without skill, may suggest increasing timeout without investigating root cause"