Initial commit
tests/README.md · Normal file · 180 lines
@@ -0,0 +1,180 @@
# Test Suite for Debugging Pipeline Failures Skill

## Overview

This test suite validates that the pipeline debugging skill properly teaches Claude Code to:

1. Follow systematic investigation methodology
2. Use standard kubectl and Tekton commands
3. Distinguish root causes from symptoms
4. Correlate logs, events, and resource states
5. Provide actionable debugging steps

## Test Scenarios

### 1. Systematic Investigation Approach

**Purpose**: Validates Claude follows phased methodology (identify → logs → events → resources → root cause)

**Expected**: Should mention systematic approach with kubectl commands for PipelineRun and TaskRun inspection

**Baseline Failure**: Without skill, may suggest random checks without structure

### 2. Image Pull Failure Diagnosis

**Purpose**: Tests systematic diagnosis of ImagePullBackOff errors

**Expected**: Should check pod events, image name, registry, and ServiceAccount imagePullSecrets

**Baseline Failure**: Without skill, may not know to check `kubectl describe pod` output or imagePullSecrets
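
For example, a passing response is expected to include checks along these lines (command shapes taken from the recorded outputs in `tests/results/`; all names are placeholders):

```bash
# Inspect the failing pod's events for the exact pull error
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"

# Confirm the ServiceAccount references an imagePullSecret
kubectl get sa <sa-name> -n <namespace> -o yaml
```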

### 3. Stuck Pipeline Investigation

**Purpose**: Validates methodology for pipelines stuck in Running state

**Expected**: Should check individual TaskRun statuses to identify which is stuck/pending

**Baseline Failure**: Without skill, may not know to list TaskRuns filtered by pipelineRun label
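
For example, responses should locate the stuck TaskRun by filtering on the pipelineRun label (placeholder names, as used in the recorded results):

```bash
# List the TaskRuns belonging to the PipelineRun and check their statuses
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```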

### 4. Resource Constraint Recognition

**Purpose**: Tests identification of scheduling and quota issues

**Expected**: Should check events for FailedScheduling and namespace resource quotas

**Baseline Failure**: Without skill, may not connect Pending state with resource constraints
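
For example, responses should correlate scheduling events with namespace quotas (placeholder names, mirroring the recorded results):

```bash
# Look for FailedScheduling events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check namespace resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```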

### 5. Log Analysis Methodology

**Purpose**: Ensures proper Tekton log retrieval for failed steps

**Expected**: Should know how to get logs from specific step containers in Tekton pods

**Baseline Failure**: Without skill, may not understand Tekton step container naming
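
Tekton names each step's container `step-<step-name>`, so responses are expected to read logs from that container of the TaskRun pod (placeholder names):

```bash
# Find the pod for the TaskRun, then read the failed step's container logs
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```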

### 6. Root Cause vs Symptom

**Purpose**: Validates focus on investigation before applying fixes

**Expected**: Should recommend investigating logs and root cause before increasing timeouts

**Baseline Failure**: Without skill, may suggest quick fixes without investigation

## Running Tests

### Prerequisites

- Python 3.8+
- Claude Code CLI access
- Claude Sonnet 4.5 (tests use `sonnet` model)
- Access to test framework (if available in konflux-ci/skills repo)

### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-pipeline-failures
```

### Validate Skill Schema

```bash
claudelint debugging-pipeline-failures/SKILL.md
```

### Generate Test Results

```bash
make generate SKILL=debugging-pipeline-failures
```

## Test-Driven Development Process

This skill was developed using TDD for Documentation:

### RED Phase (Initial Failures)

1. Created 6 test scenarios representing real pipeline debugging needs
2. Ran tests WITHOUT the skill
3. Documented baseline failures:
   - No systematic methodology
   - Didn't know Tekton-specific kubectl commands
   - Confused symptoms with root causes
   - Missing event and resource correlation

### GREEN Phase (Minimal Skill)

1. Created SKILL.md addressing test failures
2. Added 5-phase investigation methodology
3. Included kubectl command examples
4. Emphasized root cause analysis
5. All tests passed

### REFACTOR Phase (Improvement)

1. Added common failure patterns (6 types)
2. Enhanced with decision tree
3. Improved troubleshooting workflow
4. Added common confusions section

## Success Criteria

All tests must:

- ✅ Pass with 100% success rate (3/3 samples)
- ✅ Contain expected keywords (kubectl, systematic approach)
- ✅ NOT contain prohibited terms (quick fixes without investigation)
- ✅ Demonstrate phased methodology
- ✅ Focus on standard Tekton/Kubernetes tools

## Continuous Validation

Tests run automatically on:

- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates

## Adding New Tests

To add test scenarios:

1. **Identify gap**: What failure pattern is missing?
2. **Create scenario**: Add to `scenarios.yaml`
3. **Run without skill**: Document baseline failure
4. **Update SKILL.md**: Address the gap
5. **Validate**: Ensure test passes

Example:

```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: sonnet
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```

## Known Limitations

- Tests use synthetic scenarios (not real Konflux failures)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual clusters
- Sample size (3) may not catch all edge cases

## Future Improvements

- Add tests for multi-step pipeline failures
- Include workspace debugging scenarios
- Add tests for intermittent failures
- Test with real Konflux pipeline YAML

## Troubleshooting

### Test Failures

**Symptom**: Test fails intermittently

**Fix**: Increase samples or refine expected keywords
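
For example, a flaky scenario can be stabilized in `scenarios.yaml` by raising `samples` or choosing broader keywords (the values below are illustrative, not taken from an actual scenario):

```yaml
- name: flaky-scenario-name   # hypothetical scenario being stabilized
  samples: 5                  # raised from 3 to smooth out sampling noise
  expected:
    contains_keywords:
      - kubectl               # prefer broad, stable keywords over exact phrasing
```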

**Symptom**: All tests fail

**Fix**: Check SKILL.md frontmatter and schema validation

**Symptom**: Baseline failure unclear

**Fix**: Run the test manually without the skill and document the actual output

## Contributing

When contributing test improvements:

1. Ensure tests are deterministic
2. Use realistic Konflux user prompts
3. Document baseline failures clearly
4. Keep the sample count reasonable (3-5)
5. Update this README with new scenarios

## Questions?

See the main repository documentation or file an issue in konflux-ci/skills.
tests/results/image-pull-failure-diagnosis.1.txt · Normal file · 105 lines
@@ -0,0 +1,105 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

## Debugging ImagePullBackOff Error

ImagePullBackOff indicates Kubernetes cannot pull the container image. Let's investigate step by step.

### Step 1: Find the Failed Pod

First, identify which pipeline and pod is failing:

```bash
# List recent PipelineRuns
kubectl get pipelinerun -n <namespace> --sort-by=.metadata.creationTimestamp

# Or if you know the PipelineRun name:
kubectl get pipelinerun <pr-name> -n <namespace>

# Find the associated TaskRuns
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Find the pod with the issue
kubectl get pods -n <namespace> | grep ImagePullBackOff
```

### Step 2: Check Pod Events

Once you've identified the pod, check the events for specific details:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
```

This will show you:
- **Exact image name** that failed to pull
- **Error reason** (authentication failed, not found, etc.)
- **Registry** being accessed

### Step 3: Common Root Causes & Fixes

**A. Image Name/Tag Typo**
- Check if the image name and tag are spelled correctly
- Verify the image actually exists in the registry

**B. Registry Authentication**
```bash
# Check if ServiceAccount has imagePullSecrets
kubectl get sa <service-account-name> -n <namespace> -o yaml

# Check if the secret exists
kubectl get secrets -n <namespace>
```

**Fix**: Add imagePullSecret to your ServiceAccount:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: <sa-name>
imagePullSecrets:
  - name: <registry-secret-name>
```

**C. Private Registry - Missing Credentials**
```bash
# Create a docker registry secret
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>
```

**D. Image Doesn't Exist**
- Verify the image exists in your registry
- Check if the tag is correct (common issue: using `latest` when it doesn't exist)

**E. Network/Registry Issues**
```bash
# Check if the registry is accessible from the cluster
kubectl run test-pull --image=<your-image> -n <namespace> --rm -it --restart=Never
```

### Quick Diagnosis Commands

Run these to gather all relevant info:

```bash
# Get full pod details
kubectl describe pod <pod-name> -n <namespace>

# Check the TaskRun spec for image references
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A2 "image:"

# View ServiceAccount configuration
kubectl get sa -n <namespace> -o yaml
```

---

**Next Steps:**
1. Run the commands above with your actual namespace and resource names
2. Share the output from `kubectl describe pod` (especially the Events section)
3. I can help you identify the exact cause and create the appropriate fix

What's your namespace and do you know the PipelineRun or pod name that's failing?
tests/results/image-pull-failure-diagnosis.2.txt · Normal file · 59 lines
@@ -0,0 +1,59 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to systematically debug your ImagePullBackOff issue:

## Immediate Steps

**1. Find the failing pod and check events:**

```bash
# Find your PipelineRun
kubectl get pipelinerun -n <namespace>

# Get the failed TaskRun
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Check pod events - this will show the ImagePullBackOff details
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
```

**2. The events will reveal the specific image problem. Common causes:**

- **Typo in image name/tag** - Check the exact image reference
- **Image doesn't exist** - Verify the image exists in the registry
- **Registry authentication** - Missing or incorrect imagePullSecrets
- **Private registry access** - Network policies blocking access

**3. Verify the image:**

```bash
# Check what image is being pulled (look in TaskRun spec)
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -i image

# Check ServiceAccount has imagePullSecrets
kubectl get sa <sa-name> -n <namespace> -o yaml
```

## Quick Fixes

**If it's an authentication issue:**
```bash
# Check if imagePullSecret exists
kubectl get secrets -n <namespace>

# Add imagePullSecret to ServiceAccount if missing
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
```

**If it's a typo:**
- Correct the image name/tag in your Task or Pipeline definition

**If the image doesn't exist:**
- Build and push the image first
- Or use the correct tag that exists

Would you like me to help you run these commands? I'll need:
1. Your namespace name
2. The PipelineRun name (or we can list them to find it)
tests/results/image-pull-failure-diagnosis.3.txt · Normal file · 66 lines
@@ -0,0 +1,66 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

### 1. Find the Failed Pod and Check Events

```bash
# Find your PipelineRun and identify the failed TaskRun
kubectl get pipelinerun -n <namespace>

# Get TaskRuns for the failed pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Find the pod with ImagePullBackOff
kubectl get pods -n <namespace> | grep ImagePullBackOff

# Check events for the specific pod
kubectl describe pod <pod-name> -n <namespace>
```

Look in the Events section for details about **why** the image pull failed.

### 2. Common Root Causes

**Image Name/Tag Issues:**
- Typo in image name or tag
- Tag doesn't exist in registry
- Wrong registry URL

**Authentication Issues:**
- Missing imagePullSecret on ServiceAccount
- Invalid/expired registry credentials
- Wrong secret referenced

**Registry Accessibility:**
- Registry is down or unreachable
- Network policies blocking access
- Private registry requires authentication

### 3. Verify the Image

```bash
# Check what image the pod is trying to pull
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Check if ServiceAccount has imagePullSecrets
kubectl get sa <service-account-name> -n <namespace> -o yaml
```

### 4. Fix Based on Root Cause

**If image name is wrong:**
- Correct the image reference in your Task/Pipeline definition

**If authentication is missing:**
```bash
# Add imagePullSecret to ServiceAccount
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
```

**If credentials are invalid:**
- Update the docker-registry secret with valid credentials

Would you like me to help you investigate further? Please share:
1. Your namespace name
2. The PipelineRun or Pod name that's failing
3. The output of `kubectl get pods -n <namespace>` showing the ImagePullBackOff

I can then run the specific commands to identify the exact root cause.
tests/results/log-analysis-methodology.1.txt · Normal file · 46 lines
@@ -0,0 +1,46 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to find the error in your failed build TaskRun:

## Quick Steps to Find the Error

1. **Get the logs from the failed step:**

```bash
# First, find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>

# Then get logs from the build step (or whichever step failed)
kubectl logs <pod-name> -c step-build -n <namespace>

# If you don't know which step failed, get all logs:
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

2. **Check TaskRun details to see which step failed:**

```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```

Look for the "Status" section which shows which step failed and the exit code.

3. **Review Kubernetes events for additional context:**

```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
```

## What to Look For in the Logs

- **Error messages**: Search for keywords like "error", "failed", "fatal"
- **The last successful operation** before the failure
- **Exit code explanation**:
  - Exit code 1 = General failure (check the error message in logs)
  - Exit code 127 = Command not found
  - Exit code 137 = OOMKilled (out of memory)

## Common Exit Code 1 Causes

- Build script errors (compilation failures, test failures)
- Missing dependencies
- Configuration errors
- Network issues during dependency downloads

**Do you have the TaskRun name and namespace handy?** I can help you run these commands to find the specific error.
tests/results/log-analysis-methodology.2.txt · Normal file · 52 lines
@@ -0,0 +1,52 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to find the error in your failed build TaskRun:

## Quick Steps to Find Your Error

**1. Get the logs from your failed TaskRun:**

```bash
# First, find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>

# Then get the logs (replace with your pod name)
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

**2. If you know which step failed, get logs from that specific step:**

```bash
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```

**3. Check the TaskRun status for clues:**

```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```

Look for the `Status` section which shows:
- Which step failed
- Exit code details
- Failure reasons

**4. Check Kubernetes events for additional context:**

```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

## Common Exit Code 1 Causes

Since you got exit code 1, it typically means:
- **Build script error** - compilation failed, tests failed, linting error
- **Command failed** - a command in your build script returned an error
- **Missing dependency** - required tool or package not found

The logs from step 1 will show the actual error message before the exit.

**Can you provide:**
1. Your TaskRun name and namespace?
2. Or paste the output from `kubectl describe taskrun <name>`?

This will help me pinpoint the exact error for you.
tests/results/log-analysis-methodology.3.txt · Normal file · 49 lines
@@ -0,0 +1,49 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's how to find the error in your failed build TaskRun:

## Quick Steps to Find Your Error

**1. Get the TaskRun logs** (most likely to show the error):
```bash
# Find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>

# Get logs from all containers to see the error
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

**2. Check the TaskRun details** to see which step failed:
```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```

Look for the "Status" section which will show:
- Which step failed
- The exit code (you mentioned exit code 1)
- Failure reason/message

**3. Get logs from the specific failed step** (once you identify it):
```bash
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```

**4. Check events** for additional context:
```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
```

## What to Look For in Logs

Exit code 1 typically means a general script/command failure. Search the logs for:
- Error messages (grep for "error", "failed", "fatal")
- The last command that executed before failure
- Build tool specific errors (npm, maven, gradle, etc.)
- Test failures
- Missing files or dependencies

**Need help running these commands?** If you provide me with:
- Your TaskRun name
- Your namespace
- Access to run kubectl commands in your environment

I can help you execute these commands and analyze the output to pinpoint the exact error.
tests/results/resource-constraint-recognition.1.txt · Normal file · 58 lines
@@ -0,0 +1,58 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging skill loaded, here's how to find out why your TaskRun is stuck in Pending state:

## Quick Diagnostic Steps

**1. Check the TaskRun status and details:**
```bash
kubectl get taskrun <tr-name> -n <namespace>
kubectl describe taskrun <tr-name> -n <namespace>
```

**2. Check for scheduling issues via events:**
```bash
# Get recent events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <tr-name>

# Or check pod events directly
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
```

**3. Check resource quotas and constraints:**
```bash
# Check namespace quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node capacity
kubectl describe node | grep -A5 "Allocated resources"
```

## Common Causes of Pending TaskRuns

| Cause | What to Look For | Solution |
|-------|------------------|----------|
| **Resource quota exceeded** | Events: `FailedScheduling`, quota errors | Increase namespace quota or reduce resource requests |
| **Insufficient cluster resources** | Events: `FailedScheduling` (no nodes available) | Wait for resources or scale cluster |
| **Missing PVC/workspace** | Events: `FailedMount` or volume errors | Create required PVC or fix workspace config |
| **Invalid ServiceAccount** | Events: permission errors | Verify ServiceAccount exists and has proper permissions |
| **Image pull secrets missing** | Pod can't be created, image pull errors | Add imagePullSecrets to ServiceAccount |

## Most Likely Issue

If TaskRun is **Pending**, it typically means the underlying pod cannot be scheduled. The events will tell you exactly why:

```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```

Look for messages like:
- `0/X nodes are available: insufficient memory/cpu` → Resource constraints
- `persistentvolumeclaim "X" not found` → Missing PVC
- `exceeded quota` → Namespace quota limit

Would you like me to help you run these commands? Please provide:
- Your TaskRun name
- Your namespace

Or if you already have the output from any of these commands, share it and I can help interpret the results.
tests/results/resource-constraint-recognition.2.txt · Normal file · 63 lines
@@ -0,0 +1,63 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging-pipeline-failures skill, here's how to find out why your TaskRun is stuck in Pending state:

## Quick Diagnosis Steps

**1. Check the TaskRun status and conditions:**
```bash
kubectl get taskrun <tr-name> -n <namespace>
kubectl describe taskrun <tr-name> -n <namespace>
```

**2. Check for pod existence and status:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```

**3. Check Kubernetes events (most revealing for Pending issues):**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Or filter for the specific TaskRun pod
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```

## Common Causes for Pending State

The events will typically reveal one of these issues:

| Event Type | Root Cause | Solution |
|------------|------------|----------|
| `FailedScheduling` | Resource constraints - not enough CPU/memory | Check namespace quotas and node capacity |
| `Insufficient cpu/memory` | Node can't satisfy resource requests | Reduce resource requests or add nodes |
| No pod exists yet | TaskRun hasn't created pod | Check PipelineRun/Task configuration |
| PVC issues | Workspace PVC not bound | Check PVC status with `kubectl get pvc` |
| Node selector mismatch | No nodes match selectors/taints | Review nodeSelector and tolerations |

## Detailed Investigation

**Check resource quotas:**
```bash
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```

**Check node capacity:**
```bash
kubectl describe node | grep -A5 "Allocated resources"
```

**Check PVC status (if using workspaces):**
```bash
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
```

**Inspect TaskRun YAML for resource requests:**
```bash
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A10 "resources:"
```

Would you like me to help you run these commands? Please provide:
- The TaskRun name
- The namespace
- Or share the output of `kubectl describe taskrun <tr-name>`
tests/results/resource-constraint-recognition.3.txt
Normal file
50
tests/results/resource-constraint-recognition.3.txt
Normal file
@@ -0,0 +1,50 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging guide, here's how to investigate a TaskRun stuck in Pending state:
|
||||
|
||||
## Quick Diagnostic Steps
|
||||
|
||||
**1. Check the TaskRun status:**
|
||||
```bash
|
||||
kubectl get taskrun <tr-name> -n <namespace>
|
||||
kubectl describe taskrun <tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
**2. Check for resource constraint events:**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
|
||||
kubectl get events --field-selector involvedObject.name=<tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for these critical events:
|
||||
- `FailedScheduling` → Resource constraints (CPU/memory quota exceeded or insufficient node capacity)
|
||||
- `FailedMount` → Volume/PVC issues
|
||||
|
||||
**3. Check resource quotas:**
|
||||
```bash
|
||||
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
|
||||
```
|
||||
|
||||
**4. Check if the pod exists:**
|
||||
```bash
|
||||
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
If no pod exists, the TaskRun can't schedule due to resource constraints.
|
||||
|
||||
## Common Causes for Pending TaskRuns
|
||||
|
||||
1. **Namespace quota exceeded** - No CPU/memory quota available
|
||||
2. **Insufficient cluster resources** - No nodes with available capacity
|
||||
3. **PVC not bound** - Workspace PersistentVolumeClaim is pending
|
||||
4. **Node selector mismatch** - No nodes match the required labels
|
||||
5. **Pod security admission** - Security policies blocking pod creation
|
||||
|
||||
## Next Steps
|
||||
|
||||
Once you identify the issue from events/quotas, you can:
|
||||
- Increase namespace resource quotas
|
||||
- Reduce resource requests in the Task definition
|
||||
- Fix PVC binding issues
|
||||
- Adjust node selectors or taints/tolerations
|
||||
|
||||
**Would you like me to help you run these diagnostic commands for a specific TaskRun?** Just provide the TaskRun name and namespace.
|
||||
tests/results/root-cause-vs-symptom.1.txt · Normal file · 22 lines
@@ -0,0 +1,22 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Before just increasing the timeout, let's investigate **why** your pipeline is failing. Timeouts are often a symptom of an underlying issue, not the root cause itself.

Simply increasing the timeout might mask problems like:
- **Hung processes** or deadlocks that will never complete
- **Resource starvation** (CPU, memory, network)
- **Inefficient operations** that could be optimized
- **Infrastructure issues** like slow I/O or network problems
- **Misconfigurations** in the pipeline setup

I can help you debug this. I have access to a specialized skill for debugging Konflux/Tekton pipeline failures. To investigate, I'll need to know:

1. **What type of pipeline is this?** (Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
2. **Where is it timing out?** (Specific stage/task/step)
3. **Can you share:**
   - The pipeline configuration file
   - Recent failure logs
   - Or point me to the pipeline definition in your repo

If this is a Konflux/Tekton pipeline, I can invoke the debugging skill to systematically analyze the failure. Otherwise, I can still help investigate once I see the pipeline setup and logs.

What information can you provide about the failing pipeline?
tests/results/root-cause-vs-symptom.2.txt · Normal file · 15 lines
@@ -0,0 +1,15 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Before just increasing the timeout, it's better to understand **why** your pipeline is failing. A timeout is often a symptom of an underlying issue rather than the root cause.

Let me help you debug this systematically. I can use a specialized skill for debugging pipeline failures that will help us:

1. Identify what's actually timing out or failing
2. Check for common issues (stuck tasks, resource problems, build errors)
3. Examine logs to find the root cause
4. Determine if a timeout increase is actually the right solution

Would you like me to investigate your pipeline failure? If so, I'll need to know:
- What type of pipeline are you running? (e.g., Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
- Do you have access to the pipeline logs or configuration files in this directory?

Or if this is a Konflux/Tekton pipeline, I can start debugging right away using the specialized skill.
tests/results/root-cause-vs-symptom.3.txt · Normal file · 25 lines
@@ -0,0 +1,25 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Now I can help you properly investigate the pipeline failure. Rather than just increasing the timeout (which usually masks the real problem), let's find out what's actually going wrong.

**To get started, I need some information:**

1. **What namespace is your pipeline running in?**
2. **Do you know the PipelineRun name?** (or can you describe which pipeline is failing?)

Once you provide this, I'll systematically investigate by:
- Checking the PipelineRun status to see which TaskRun failed
- Examining logs for the actual error
- Reviewing Kubernetes events for resource issues
- Identifying the root cause (slow operation, resource constraint, actual failure, etc.)

**Common scenarios where timeout increases are NOT the solution:**
- Build scripts with errors (needs code fix)
- Resource exhaustion (needs quota/limit adjustments)
- Image pull failures (needs registry/auth fixes)
- Missing dependencies (needs build config changes)

**When timeout increases ARE appropriate:**
- Legitimately slow operations (large downloads, complex builds)
- After verifying the operation is progressing but just needs more time

So let's find out which category your failure falls into. What's the namespace and PipelineRun name?
tests/results/stuck-pipeline-investigation.1.txt · Normal file · 85 lines
@@ -0,0 +1,85 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

For a PipelineRun stuck in Running state for over an hour, here's a systematic approach to diagnose the issue:

## Immediate Investigation Steps

**1. Check which TaskRuns are actually running or stuck:**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```

Look for TaskRuns that are:
- **Pending** → Resource or scheduling issues
- **Running** for too long → Timeout or hanging process

**2. Examine the PipelineRun details:**
```bash
kubectl describe pipelinerun <pr-name> -n <namespace>
```

Check for conditions, recent events, and which TaskRun is currently executing.

**3. If a TaskRun is stuck in "Pending":**
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Check why it's not scheduling
kubectl describe pod <pod-name> -n <namespace>
```

Common causes:
- Insufficient cluster resources (CPU/memory)
- Namespace quota exceeded
- PVC not bound (for workspace volumes)

**4. If a TaskRun is stuck in "Running":**
```bash
# Check logs to see if it's making progress
kubectl logs <pod-name> --all-containers=true -n <namespace>
```

Look for:
- Is the process actually doing work or hung?
- Last log message timestamp
- Network operations that might be stalled

**5. Check Kubernetes events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

Events will reveal:
- `FailedScheduling` → Not enough resources
- `FailedMount` → Volume/PVC problems
- Image pull issues

## Most Common Causes for Stuck Pipelines

1. **Resource constraints** - Pod waiting for available CPU/memory
2. **Timeout not configured** - Task running indefinitely
3. **Hanging process** - Network download stalled, waiting for input
4. **PVC binding issues** - Workspace volume not ready
5. **Dead node** - Pod scheduled on unresponsive node

## Quick Diagnostic Command Sequence

Run these in parallel to get a full picture:

```bash
# Check overall status
kubectl get pipelinerun <pr-name> -n <namespace>

# See all TaskRuns and their states
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Check recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Check namespace resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```

Would you like me to help you run these commands? I'll need:
- The PipelineRun name
- The namespace
tests/results/stuck-pipeline-investigation.2.txt · Normal file · 68 lines
@@ -0,0 +1,68 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

Based on the debugging guide, here's what you should check for a PipelineRun stuck in Running state for over an hour:

## Immediate Checks

**1. Identify which TaskRun(s) are stuck:**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```

Look for TaskRuns that are either:
- **Pending** → Resource constraints or scheduling issues
- **Running** for too long → Timeout or hanging process

**2. Check the stuck TaskRun's pod status:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```

**3. Review recent events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

Look for:
- `FailedScheduling` → Resource quota exceeded or insufficient cluster resources
- `FailedMount` → PVC or workspace issues
- `ImagePullBackOff` → Can't pull container image

## Common Causes for Stuck Pipelines

### **If TaskRun is Pending:**
```bash
# Check resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node resources
kubectl describe node | grep -A5 "Allocated resources"
```
**Likely cause:** Insufficient resources to schedule the pod

### **If TaskRun is Running:**
```bash
# Check the logs for the current step
kubectl logs <pod-name> --all-containers=true -n <namespace>

# Check which step is currently executing
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.steps[*].name}{"\n"}{.status.steps[*].terminated}'
```
**Likely causes:**
- Process hanging without output
- Network operation stalled (downloading dependencies)
- Timeout set too high, waiting for operation to complete

### **If Pod hasn't started:**
```bash
kubectl describe pod <pod-name> -n <namespace>
```
Check the Events section for ImagePull errors or volume mount failures.

## Quick Decision Path

1. **TaskRuns all Pending?** → Resource/quota issue
2. **One TaskRun Running forever?** → Check its logs for hanging process
3. **No pod created yet?** → Image or workspace configuration problem
4. **Pod exists but waiting?** → Check events for FailedMount/ImagePull errors

What namespace and PipelineRun name are you working with? I can help you run these diagnostic commands.
tests/results/stuck-pipeline-investigation.3.txt · Normal file · 48 lines
@@ -0,0 +1,48 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

## Quick Diagnostic Steps

**1. Check TaskRun Status**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```
This shows which TaskRuns are Pending, Running, or Completed.

**2. For Pending TaskRuns** - Resource Constraints:
```bash
# Check namespace quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node capacity
kubectl describe node | grep -A5 "Allocated resources"

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
Look for `FailedScheduling` events.

**3. For Running TaskRuns** - Progress Check:
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Check logs for the running step
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
Look for signs of progress or if it's hanging.

**4. Check for Timeouts**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}'
```

## Common Causes

1. **Pending TaskRun** → Insufficient resources, quota exceeded, or no available nodes
2. **Running but hung** → Network operation timeout, process hanging, or slow build
3. **Waiting for dependencies** → Previous task not completing, workspace/volume issues

Would you like me to help you run these diagnostic commands? Please provide:
- Your PipelineRun name
- Namespace
- Or share the output of `kubectl get pipelinerun <pr-name> -n <namespace>`
tests/results/systematic-investigation-approach.1.txt · Normal file · 24 lines
@@ -0,0 +1,24 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

I need your approval to run kubectl commands to investigate the failed PipelineRun. Here's my systematic investigation plan:

## Investigation Steps

**Phase 1: Identify Failed Component**
1. Check PipelineRun status to see overall state and failure reasons
2. Identify which TaskRun(s) failed
3. Examine duration and timestamps

**Phase 2: Log Analysis**
4. Get logs from the failed TaskRun's pod
5. Search for error messages and exit codes
6. Identify the last successful step before failure

**Phase 3: Event Correlation**
7. Check Kubernetes events for issues like ImagePullBackOff, FailedMount, resource constraints

**Phase 4: Resource Inspection**
8. Inspect PipelineRun and TaskRun YAML for configuration issues

Once you approve, I'll run these commands to gather the diagnostic information and identify the root cause of your pipeline failure.

Would you like me to proceed with the investigation?
tests/results/systematic-investigation-approach.2.txt · Normal file · 90 lines
@@ -0,0 +1,90 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

I see that kubectl commands require approval in this environment. Let me provide you with the systematic investigation steps you should follow:

## Investigation Steps for 'component-build-xyz'

### **Phase 1: Identify Failed Component**

Run these commands to understand what failed:

```bash
# Check overall PipelineRun status
kubectl get pipelinerun component-build-xyz -n user-tenant

# Get detailed information
kubectl describe pipelinerun component-build-xyz -n user-tenant

# List all TaskRuns in this pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz -n user-tenant
```

**What to look for:**
- Which TaskRun(s) show `Failed` status
- The failure reason in the Conditions section
- Timestamps to understand when it failed

### **Phase 2: Analyze Failed TaskRun Logs**

Once you identify the failed TaskRun (let's call it `<failed-tr-name>`):

```bash
# Find the pod for the failed TaskRun
kubectl get pods -l tekton.dev/taskRun=<failed-tr-name> -n user-tenant

# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n user-tenant

# Or check specific step that failed
kubectl logs <pod-name> -c step-<step-name> -n user-tenant
```

**Search for:**
- Error messages (grep for "error", "failed", "fatal")
- Exit codes
- The last successful operation before failure

### **Phase 3: Check Kubernetes Events**

```bash
# Get recent events sorted by time
kubectl get events -n user-tenant --sort-by='.lastTimestamp' | tail -20

# Filter for specific pod if you found it
kubectl get events --field-selector involvedObject.name=<pod-name> -n user-tenant
```

**Look for critical events:**
- `ImagePullBackOff` - Image/registry issues
- `FailedScheduling` - Resource constraints
- `FailedMount` - Volume/PVC problems
- `OOMKilled` - Memory limits exceeded

### **Phase 4: Inspect Resource Details**

```bash
# Get full PipelineRun YAML
kubectl get pipelinerun component-build-xyz -n user-tenant -o yaml > pr-details.yaml

# Get failed TaskRun details
kubectl get taskrun <failed-tr-name> -n user-tenant -o yaml > tr-details.yaml

# Describe the pod
kubectl describe pod <pod-name> -n user-tenant
```

## Common Failure Scenarios

Based on the failure pattern, here's what to check:

| **If you see** | **Root cause** | **Fix** |
|----------------|----------------|---------|
| ImagePullBackOff | Wrong image name/tag, missing credentials | Verify image exists, check ServiceAccount imagePullSecrets |
| OOMKilled or exit code 137 | Out of memory | Increase memory limits in Task definition |
| Exit code 127 | Command not found | Wrong container image or missing tools |
| Timeout in status | Operation took too long | Increase timeout or optimize build |
| Permission denied | RBAC/ServiceAccount issues | Check ServiceAccount permissions |
| Volume mount errors | PVC not bound or misconfigured | Check PVC status: `kubectl get pvc -n user-tenant` |

---

**Would you like me to help analyze the output once you run these commands?** Just share the results and I'll help identify the root cause and recommend the fix.
tests/results/systematic-investigation-approach.3.txt · Normal file · 15 lines
@@ -0,0 +1,15 @@

# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9

I need your approval to run kubectl commands to investigate the failed PipelineRun. These commands will:

1. **Get the PipelineRun status** - to see the overall state and failure information
2. **Describe the PipelineRun** - to get detailed information about conditions, failed tasks, and timestamps

Once approved, I'll follow this systematic investigation process:

1. ✓ Check PipelineRun status and identify which components failed
2. Identify the specific TaskRun(s) that failed
3. Get logs from the failed TaskRun pods to find error messages
4. Check Kubernetes events for additional context (image pull issues, resource constraints, etc.)
5. Correlate all findings to identify the root cause and recommend fixes

Should I proceed with running these kubectl commands to investigate your failed PipelineRun?
tests/scenarios.yaml · Normal file · 90 lines
@@ -0,0 +1,90 @@

skill_name: debugging-pipeline-failures
description: Validates systematic Konflux pipeline debugging methodology using standard kubectl and Tekton CLI approaches

# Note: .config/gcloud is automatically copied for all tests (needed for Claude Code API auth)
# Add skill-specific paths here if needed (e.g., .kube/config for kubectl)

test_scenarios:
  - name: systematic-investigation-approach
    description: Validates that Claude follows systematic phases for pipeline failure investigation
    prompt: "My PipelineRun 'component-build-xyz' in namespace 'user-tenant' failed. How do I investigate?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl
        - pipelinerun
        - taskrun
        - logs
        - events
        - systematic
    baseline_failure: "Without skill, may suggest random checks without systematic methodology"

  - name: image-pull-failure-diagnosis
    description: Tests diagnosis approach for ImagePullBackOff errors in build pipelines
    prompt: "My build pipeline is failing with ImagePullBackOff. How do I debug this?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - describe pod
        - events
        - image
        - registry
        - imagePullSecrets
    baseline_failure: "Without skill, may not check pod events or ServiceAccount imagePullSecrets"

  - name: stuck-pipeline-investigation
    description: Validates methodology for pipelines stuck in Running state
    prompt: "My PipelineRun has been in Running state for over an hour. What should I check?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - taskrun
        - status
        - pending
        - running
        - logs
    baseline_failure: "Without skill, may not check individual TaskRun statuses to identify stuck component"

  - name: resource-constraint-recognition
    description: Tests identification of resource exhaustion issues
    prompt: "TaskRun is stuck in Pending state. How do I find out why?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["events", "describe"]
        - ["quota", "limits", "resources"]
        - ["FailedScheduling", "pending", "stuck", "scheduling"]
    baseline_failure: "Without skill, may not check events for FailedScheduling or namespace quotas"

  - name: log-analysis-methodology
    description: Ensures proper log analysis approach for build failures
    prompt: "My build TaskRun failed with exit code 1. How do I find the error?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl logs
        - step
        - container
        - error
    baseline_failure: "Without skill, may not know how to get logs from specific Tekton step containers"

  - name: root-cause-vs-symptom
    description: Validates distinction between symptoms and root causes
    prompt: "My pipeline keeps failing. Should I just increase the timeout?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["logs", "describe", "output", "details"]
        - ["root cause", "proper fix", "actual problem", "actually doing", "pinpoint"]
        - ["investigate", "identify", "check", "find out", "determine"]
        - ["why", "what"]
      does_not_contain:
        - just increase
        - simply add
    baseline_failure: "Without skill, may suggest increasing timeout without investigating root cause"