Initial commit

Zhongwei Li
2025-11-30 08:35:19 +08:00
commit 3dba6825cf
24 changed files with 1714 additions and 0 deletions

.claude-plugin/plugin.json

@@ -0,0 +1,11 @@
{
  "name": "debugging-pipeline-failures",
  "description": "Use when investigating Konflux Tekton pipeline failures, PipelineRun/TaskRun issues, build failures, or CI/CD workflow problems. Provides systematic debugging methodology using standard kubectl commands for comprehensive root cause analysis in Konflux environments.",
  "version": "1.0.0",
  "author": {
    "name": "Gjorgji Georgievski"
  },
  "skills": [
    "./"
  ]
}

README.md

@@ -0,0 +1,3 @@
# debugging-pipeline-failures
Use when investigating Konflux Tekton pipeline failures, PipelineRun/TaskRun issues, build failures, or CI/CD workflow problems. Provides systematic debugging methodology using standard kubectl commands for comprehensive root cause analysis in Konflux environments.

SKILL.md

@@ -0,0 +1,365 @@
---
name: debugging-pipeline-failures
description: Use when Konflux pipelines fail, are stuck, timeout, or show errors like ImagePullBackOff. Covers PipelineRun failures, TaskRun issues (Pending, Failed, stuck Running), build errors, and systematic debugging of Tekton pipeline problems using kubectl and logs.
---
# Debugging Pipeline Failures
## Overview
**Core Principle**: Systematic investigation of Konflux CI/CD failures by correlating logs, events, and resource states to identify root causes.
**Key Abbreviations**:
- **PR** = PipelineRun
- **TR** = TaskRun
- **SA** = ServiceAccount
- **PVC** = PersistentVolumeClaim
## When to Use
Invoke when encountering:
- PipelineRun failures or stuck pipelines
- TaskRun errors with unclear messages
- Build container issues (ImagePullBackOff)
- Resource constraints (OOMKilled, quota exceeded)
- Pipeline timeouts
- Workspace or volume mount failures
- Permission errors
## Quick Reference
| Symptom | First Check | Common Cause |
|---------|-------------|--------------|
| ImagePullBackOff | Pod events, image name | Registry auth, typo, missing image |
| TaskRun timeout | Step execution time in logs | Slow operation, network issues |
| Pending TaskRun | Resource quotas, node capacity | Quota exceeded, insufficient resources |
| Permission denied | ServiceAccount, RBAC | Missing Role/RoleBinding |
| Volume mount error | PVC status, workspace config | PVC not bound, wrong access mode |
| Exit code 127 | Container logs, command | Command not found, wrong image |
## Investigation Phases
### Phase 1: Identify Failed Component
**PipelineRun Status Check**:
```bash
kubectl get pipelinerun <pr-name> -n <namespace>
kubectl describe pipelinerun <pr-name> -n <namespace>
```
Look for:
- Overall status (Succeeded/Failed/Running)
- Conditions and reasons
- Which TaskRun(s) failed
- Duration and timestamps
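If you prefer a one-liner, the failure reason and message can usually be pulled straight from the `Succeeded` condition. This is a sketch based on the standard Tekton status layout; verify the field paths against your API version:
```bash
# Print just the terminal condition's reason and message for the PipelineRun
kubectl get pipelinerun <pr-name> -n <namespace> \
  -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].reason}{"\n"}{.status.conditions[?(@.type=="Succeeded")].message}{"\n"}'
```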
**TaskRun Identification**:
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```
Identify failed TaskRuns by status.
### Phase 2: Log Analysis
**Get TaskRun Pod Logs**:
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
# Get logs from specific step
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n <namespace>
# For previous failures
kubectl logs <pod-name> -c step-<step-name> --previous -n <namespace>
```
**What to Look For**:
- Error messages (search for "error", "failed", "fatal")
- Exit codes
- Last successful operation before failure
- Timeout indicators
- Resource exhaustion messages
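A quick keyword scan often narrows things down before reading the full log. A minimal sketch, assuming `grep` is available where you run kubectl:
```bash
# Surface likely error lines from the failed step's log
kubectl logs <pod-name> -c step-<step-name> -n <namespace> \
  | grep -inE 'error|failed|fatal|timed? out|out of memory' | tail -n 20
```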
### Phase 3: Event Correlation
**Check Kubernetes Events**:
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Filter for specific resource
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```
**Critical Events**:
- `FailedScheduling` → Resource constraints
- `FailedMount` → Volume/PVC issues
- `ImagePullBackOff` → Registry/image problems
- `Evicted` → Resource pressure
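In a busy namespace it helps to filter events by reason; `reason` is a supported field selector for events, for example:
```bash
# Show only scheduling failures, newest last
kubectl get events -n <namespace> \
  --field-selector reason=FailedScheduling --sort-by='.lastTimestamp'
```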
### Phase 4: Resource Inspection
**PipelineRun Details**:
```bash
kubectl get pipelinerun <pr-name> -n <namespace> -o yaml
```
Check:
- Parameters passed correctly
- Workspace configurations
- ServiceAccount specified
- Timeout values
**TaskRun Details**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o yaml
```
Examine:
- Step definitions and images
- Resource requests/limits
- Status.steps for individual step states
- Conditions for failure reasons
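To see per-step outcomes without reading the full YAML, something like the following works (a sketch built on the standard `status.steps` layout of a TaskRun):
```bash
# One line per step: name, termination reason, exit code
kubectl get taskrun <tr-name> -n <namespace> \
  -o jsonpath='{range .status.steps[*]}{.name}{"\t"}{.terminated.reason}{"\t"}{.terminated.exitCode}{"\n"}{end}'
```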
**Pod Inspection**:
```bash
kubectl describe pod <pod-name> -n <namespace>
```
Look for:
- Container states and exit codes
- Resource requests vs limits
- Volume mounts
- Node placement
### Phase 5: Root Cause Analysis
**Correlate Findings**:
1. Timeline: When did failure occur?
2. First failure: Which step/component failed first?
3. Error pattern: Consistent or intermittent?
4. Recent changes: New code, config, images?
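A quick timeline of all TaskRuns makes the first failure obvious. A sketch using standard kubectl output options; column paths follow the usual TaskRun status fields:
```bash
# List TaskRuns of the PipelineRun ordered by start time, with their terminal reason
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace> \
  --sort-by=.status.startTime \
  -o custom-columns=NAME:.metadata.name,START:.status.startTime,END:.status.completionTime,REASON:.status.conditions[0].reason
```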
**Distinguish Symptom from Cause**:
- ❌ "Build failed" (symptom)
- ✓ "npm install timed out due to registry being unavailable" (root cause)
## Common Failure Patterns
### 1. Image Pull Failures
**Symptoms**: `ImagePullBackOff`, `ErrImagePull`
**Investigation**:
```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events"
```
**Check**:
- Image name and tag spelling
- Image exists in registry
- ServiceAccount has imagePullSecrets
- Registry is accessible
**Common Fixes**:
- Correct image name/tag
- Add imagePullSecret to ServiceAccount
- Verify registry credentials
- Check network policies
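If the root cause is a missing pull secret, attaching an existing registry secret to the pipeline's ServiceAccount is usually a one-liner (a sketch; `<registry-secret-name>` must already exist in the namespace):
```bash
# Attach an existing docker-registry secret to the ServiceAccount used by the pipeline
kubectl patch serviceaccount <sa-name> -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "<registry-secret-name>"}]}'
```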
### 2. Resource Exhaustion
**Symptoms**: `OOMKilled`, `Pending` pods, quota errors
**Investigation**:
```bash
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
kubectl top pods -n <namespace>
kubectl describe node | grep -A5 "Allocated resources"
```
**Common Causes**:
- Memory limits too low
- Namespace quota exceeded
- No nodes with available resources
**Fixes**:
- Increase resource limits in Task
- Adjust namespace quotas
- Optimize memory usage in build
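To check whether the failed pod's requests actually fit inside the namespace quota, compare them side by side (a sketch using standard kubectl output options):
```bash
# Requests declared by each container in the failed pod
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'
# Current quota usage vs. hard limits in the namespace
kubectl get resourcequota -n <namespace> \
  -o custom-columns=NAME:.metadata.name,USED:.status.used,HARD:.status.hard
```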
### 3. Build Script Failures
**Symptoms**: Non-zero exit code, "command not found"
**Investigation**:
```bash
kubectl logs <pod-name> -c step-build -n <namespace>
```
**Check**:
- Script syntax errors
- Missing tools in container image
- Wrong working directory
- Environment variables not set
**Fixes**:
- Fix script errors
- Use image with required tools
- Set correct workingDir in Task
- Pass required params/env vars
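When the suspicion is a wrong image or a missing tool, the resolved step list on the TaskRun shows which image each step actually ran with. A sketch; on recent Tekton versions the resolved spec lives under `status.taskSpec`:
```bash
# Map each step to the image it executed with
kubectl get taskrun <tr-name> -n <namespace> \
  -o jsonpath='{range .status.taskSpec.steps[*]}{.name}{"\t"}{.image}{"\n"}{end}'
```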
### 4. Timeout Issues
**Symptoms**: TaskRun shows timeout in status
**Investigation**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}{"\n"}{.status.completionTime}'
```
**Common Causes**:
- Timeout value too low
- Slow network operations (downloads)
- Build complexity underestimated
- Process hanging
**Fixes**:
- Increase timeout in Task/PipelineRun
- Use caching for dependencies
- Optimize build process
- Add progress logging to detect hangs
### 5. Workspace/Volume Issues
**Symptoms**: `CreateContainerError`, volume mount failures
**Investigation**:
```bash
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
```
**Check**:
- PVC exists and is Bound
- Workspace name matches between Pipeline and PipelineRun
- AccessMode is correct (RWO vs RWX)
- Storage class exists
**Fixes**:
- Create or fix PVC
- Correct workspace name references
- Use appropriate access mode
- Verify storage provisioner
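A quick way to spot a name or access-mode mismatch is to print the workspace bindings next to the PVCs (a sketch using standard kubectl output options):
```bash
# Workspace bindings declared on the PipelineRun
kubectl get pipelinerun <pr-name> -n <namespace> -o jsonpath='{.spec.workspaces}{"\n"}'
# PVC phase, access modes, and storage class in the namespace
kubectl get pvc -n <namespace> \
  -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,MODES:.spec.accessModes,STORAGECLASS:.spec.storageClassName
```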
### 6. Permission Errors
**Symptoms**: "Forbidden", "unauthorized", RBAC errors
**Investigation**:
```bash
kubectl get sa <sa-name> -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl auth can-i create pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name>
```
**Check**:
- ServiceAccount exists
- Role/RoleBinding grants needed permissions
- ClusterRole if cross-namespace access needed
**Fixes**:
- Create ServiceAccount
- Add RoleBinding for required permissions
- Grant pod creation, secret access, etc.
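As an illustration, a minimal Role and RoleBinding for a pipeline ServiceAccount can be created imperatively. A sketch only: the `pipeline-runner` names are placeholders, and the verbs/resources should be adjusted to what your Tasks actually need:
```bash
# Placeholder role/binding names; grant the permissions the failing step was denied
kubectl create role pipeline-runner \
  --verb=get,list,watch,create --resource=pods,pods/log,secrets -n <namespace>
kubectl create rolebinding pipeline-runner-binding \
  --role=pipeline-runner --serviceaccount=<namespace>:<sa-name> -n <namespace>
```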
## Common Confusions
### ✗ Incorrect Approach
"Pipeline failed, let me rerun it immediately"
- No root cause identified
- Will likely fail again
- Wastes resources and time
### ✓ Correct Approach
"Let me check logs and events to understand why it failed, then fix the root cause"
- Identifies actual problem
- Prevents repeat failures
- Efficient resolution
---
### ✗ Incorrect Approach
"Build timed out. I'll set timeout to 2 hours"
- May hide real issues
- Delays problem detection
### ✓ Correct Approach
"Let me check what operation is slow in the logs, then optimize or increase timeout if truly needed"
- Identifies slow operations
- Optimizes where possible
- Sets appropriate timeout
---
### ✗ Incorrect Approach
"Too many logs to read, I'll just try changing something"
- Random changes
- May make it worse
- Doesn't address root cause
### ✓ Correct Approach
"I'll search logs for error keywords and check the last successful step before failure"
- Focused investigation
- Finds actual error
- Targeted fix
## Troubleshooting Workflow
```
1. GET PIPELINERUN STATUS
2. IDENTIFY FAILED TASKRUN(S)
3. CHECK POD LOGS (specific step that failed)
4. REVIEW EVENTS (timing correlation)
5. INSPECT RESOURCE YAML (config issues)
6. CORRELATE FINDINGS → IDENTIFY ROOT CAUSE
7. APPLY FIX → VERIFY → DOCUMENT
```
## Decision Tree
**Q: Is the PipelineRun stuck in "Running"?**
- **Yes** → Check which TaskRuns are pending or running
  - Pending → Resource constraints (Failure Pattern 2: Resource Exhaustion)
  - Running too long → Check logs for progress (Failure Pattern 4: Timeout Issues)
- **No** → PipelineRun failed → continue with the questions below
**Q: Which TaskRun failed first?**
- Check status of all TaskRuns to find first failure
- Focus investigation on that TaskRun
**Q: What does the pod log show?**
- Error message → Address specific error
- No output → Check if pod started (events)
- Exit code 127 → Command not found (wrong image)
- Exit code 137 → OOMKilled (increase memory)
- Other exit code → Script/command failure
**Q: Do events show image, volume, or scheduling issues?**
- ImagePullBackOff → Failure Pattern 1: Image Pull Failures
- FailedMount → Failure Pattern 5: Workspace/Volume Issues
- FailedScheduling → Failure Pattern 2: Resource Exhaustion
## Keywords for Search
Konflux pipeline failure, Tekton debugging, PipelineRun failed, TaskRun errors, build failures, CI/CD troubleshooting, ImagePullBackOff, OOMKilled, kubectl logs, pipeline timeout, workspace errors, RBAC permissions

plugin.lock.json

@@ -0,0 +1,125 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:konflux-ci/skills:skills/debugging-pipeline-failures",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "16bd6bc6d75121716f54addf59f94ef13a2542ce",
"treeHash": "0a9f968408a03b28487097344a8f1bb890e9d3d94e7ac323987c6103912a1c07",
"generatedAt": "2025-11-28T10:19:54.573064Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "debugging-pipeline-failures",
"description": "Use when investigating Konflux Tekton pipeline failures, PipelineRun/TaskRun issues, build failures, or CI/CD workflow problems. Provides systematic debugging methodology using standard kubectl commands for comprehensive root cause analysis in Konflux environments.",
"version": "1.0.0"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "e7b2a7b2296edd32a8539bcb798e237293473f9e3595ab34ba455a813faf6372"
},
{
"path": "SKILL.md",
"sha256": "428e8787766362d5099309c71ccc2d5f250483088733d9d01c18cbdd5bad99b1"
},
{
"path": "tests/README.md",
"sha256": "191e4ff12d8ac7b38ca777ecf8d7d8af248b8d124ad4928f68887d9e1dda6358"
},
{
"path": "tests/scenarios.yaml",
"sha256": "7e3ea3d6e3e0754715d9f1ef11ccd3e434afa8adff02205a7b41ddadae80ade9"
},
{
"path": "tests/results/root-cause-vs-symptom.1.txt",
"sha256": "79117a2dd68a93318f8062ba6e0dc0364d6fb07a4e2bc4f145fa3e3a99ee1838"
},
{
"path": "tests/results/root-cause-vs-symptom.3.txt",
"sha256": "83730eb1b99886b8279a4ba1c1b975e2e6db2153d105e69c8a055f823486060b"
},
{
"path": "tests/results/root-cause-vs-symptom.2.txt",
"sha256": "56edb8a94c1c10ff5d52b0d9265915cdf9ba364e74e20e4bb78711bd74755204"
},
{
"path": "tests/results/systematic-investigation-approach.1.txt",
"sha256": "3ed39e278ba44e574d9b4b1b43f27c6bb87d34d90d61769c3a2c7538df202268"
},
{
"path": "tests/results/systematic-investigation-approach.2.txt",
"sha256": "46e63a5b4a9693e9203b77c9975bec5a5a605b9885c58eb9d0b69bb5ce18eefa"
},
{
"path": "tests/results/systematic-investigation-approach.3.txt",
"sha256": "70fd138cc76a242e0fb3e276d495932a9ae3537929ec9db2dcb13bbe3a52e6aa"
},
{
"path": "tests/results/stuck-pipeline-investigation.1.txt",
"sha256": "37dfb710f4cac186b8a952b70baff991d5a9bf129a9c44f062531ca54f24d6ac"
},
{
"path": "tests/results/stuck-pipeline-investigation.3.txt",
"sha256": "3606144d16aa97bd73bfdd5edd1845ead2506bf68196d5a031243c7af7967e4a"
},
{
"path": "tests/results/stuck-pipeline-investigation.2.txt",
"sha256": "4d477d831df4c43128c8abe7ebb1d7535df37ff4f52d8348240f900c3e697eab"
},
{
"path": "tests/results/resource-constraint-recognition.1.txt",
"sha256": "842667ce74df3b649ab4692c71a1abcc41ce218b5ac672c84a9ab09f19dd3900"
},
{
"path": "tests/results/resource-constraint-recognition.3.txt",
"sha256": "cbcfc3ca7368ac9c4c18a3c2969a03a869de636b9e748c98c093fd45b4a2b58b"
},
{
"path": "tests/results/resource-constraint-recognition.2.txt",
"sha256": "2f60e31c000c2a1d046e37292a70b87a98cf013d3e2841898503ce5bf82984d8"
},
{
"path": "tests/results/log-analysis-methodology.1.txt",
"sha256": "f0db369f32f445cb3842a2ca575a52f84c75a9bb19d90599d86de05e19975001"
},
{
"path": "tests/results/log-analysis-methodology.3.txt",
"sha256": "eeb90fca671d91b083c14fe91911dec87a877bdad462fb441365cd757162b624"
},
{
"path": "tests/results/log-analysis-methodology.2.txt",
"sha256": "c8ee1ec28670c34ba7ec7e036374a80684db79f262e01cee2f1730b5952c4ce4"
},
{
"path": "tests/results/image-pull-failure-diagnosis.2.txt",
"sha256": "8f03bcf56aa29fa3a4f3d509eee0ccc247f0e5e386eda4240bfbc1b23af5b7ac"
},
{
"path": "tests/results/image-pull-failure-diagnosis.3.txt",
"sha256": "5722149a2d9623ffe547ede7d1a3d8c1f2cd22f5b7d57aa6741de3c5e6675637"
},
{
"path": "tests/results/image-pull-failure-diagnosis.1.txt",
"sha256": "34fcbb194d13e95993dd20bc48e8976ebce3799ab7e7585cc83fe3efaf39416e"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "1256be683bab5bf0f049986d0d037afc5100d3f231e06aa518f679b05f37d1bb"
}
],
"dirSha256": "0a9f968408a03b28487097344a8f1bb890e9d3d94e7ac323987c6103912a1c07"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}

tests/README.md

@@ -0,0 +1,180 @@
# Test Suite for Debugging Pipeline Failures Skill
## Overview
This test suite validates that the pipeline debugging skill properly teaches Claude Code to:
1. Follow systematic investigation methodology
2. Use standard kubectl and Tekton commands
3. Distinguish root causes from symptoms
4. Correlate logs, events, and resource states
5. Provide actionable debugging steps
## Test Scenarios
### 1. Systematic Investigation Approach
**Purpose**: Validates Claude follows phased methodology (identify → logs → events → resources → root cause)
**Expected**: Should mention systematic approach with kubectl commands for PipelineRun and TaskRun inspection
**Baseline Failure**: Without skill, may suggest random checks without structure
### 2. Image Pull Failure Diagnosis
**Purpose**: Tests systematic diagnosis of ImagePullBackOff errors
**Expected**: Should check pod events, image name, registry, and ServiceAccount imagePullSecrets
**Baseline Failure**: Without skill, may not know to check pod describe or imagePullSecrets
### 3. Stuck Pipeline Investigation
**Purpose**: Validates methodology for pipelines stuck in Running state
**Expected**: Should check individual TaskRun statuses to identify which is stuck/pending
**Baseline Failure**: Without skill, may not know to list TaskRuns filtered by pipelineRun label
### 4. Resource Constraint Recognition
**Purpose**: Tests identification of scheduling and quota issues
**Expected**: Should check events for FailedScheduling and namespace resource quotas
**Baseline Failure**: Without skill, may not connect Pending state with resource constraints
### 5. Log Analysis Methodology
**Purpose**: Ensures proper Tekton log retrieval for failed steps
**Expected**: Should know how to get logs from specific step containers in Tekton pods
**Baseline Failure**: Without skill, may not understand Tekton step container naming
### 6. Root Cause vs Symptom
**Purpose**: Validates focus on investigation before applying fixes
**Expected**: Should recommend investigating logs and root cause before increasing timeouts
**Baseline Failure**: Without skill, may suggest quick fixes without investigation
## Running Tests
### Prerequisites
- Python 3.8+
- Claude Code CLI access
- Claude Sonnet 4.5 (tests use `sonnet` model)
- Access to test framework (if available in konflux-ci/skills repo)
### Run All Tests
```bash
# From repository root
make test
# Or specifically for this skill
make test-only SKILL=debugging-pipeline-failures
```
### Validate Skill Schema
```bash
claudelint debugging-pipeline-failures/SKILL.md
```
### Generate Test Results
```bash
make generate SKILL=debugging-pipeline-failures
```
## Test-Driven Development Process
This skill followed TDD for Documentation:
### RED Phase (Initial Failures)
1. Created 6 test scenarios representing real pipeline debugging needs
2. Ran tests WITHOUT the skill
3. Documented baseline failures:
- No systematic methodology
- Didn't know Tekton-specific kubectl commands
- Confused symptoms with root causes
- Missing event and resource correlation
### GREEN Phase (Minimal Skill)
1. Created SKILL.md addressing test failures
2. Added 5-phase investigation methodology
3. Included kubectl command examples
4. Emphasized root cause analysis
5. All tests passed
### REFACTOR Phase (Improvement)
1. Added common failure patterns (6 types)
2. Enhanced with decision tree
3. Improved troubleshooting workflow
4. Added common confusions section
## Success Criteria
All tests must:
- ✅ Pass with 100% success rate (3/3 samples)
- ✅ Contain expected keywords (kubectl, systematic approach)
- ✅ NOT contain prohibited terms (quick fixes without investigation)
- ✅ Demonstrate phased methodology
- ✅ Focus on standard Tekton/Kubernetes tools
## Continuous Validation
Tests run automatically on:
- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates
## Adding New Tests
To add test scenarios:
1. **Identify gap**: What failure pattern is missing?
2. **Create scenario**: Add to `scenarios.yaml`
3. **Run without skill**: Document baseline failure
4. **Update SKILL.md**: Address the gap
5. **Validate**: Ensure test passes
Example:
```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: sonnet
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```
## Known Limitations
- Tests use synthetic scenarios (not real Konflux failures)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual clusters
- Sample size (3) may not catch all edge cases
## Future Improvements
- Add tests for multi-step pipeline failures
- Include workspace debugging scenarios
- Add tests for intermittent failures
- Test with real Konflux pipeline YAML
## Troubleshooting
### Test Failures
**Symptom**: Test fails intermittently
**Fix**: Increase samples or refine expected keywords
**Symptom**: All tests fail
**Fix**: Check SKILL.md frontmatter and schema validation
**Symptom**: Baseline failure unclear
**Fix**: Run test manually without skill, document actual output
## Contributing
When contributing test improvements:
1. Ensure tests are deterministic
2. Use realistic Konflux user prompts
3. Document baseline failures clearly
4. Keep samples count reasonable (3-5)
5. Update this README with new scenarios
## Questions?
See main repository documentation or file an issue in konflux-ci/skills.


@@ -0,0 +1,105 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
## Debugging ImagePullBackOff Error
ImagePullBackOff indicates Kubernetes cannot pull the container image. Let's investigate step by step.
### Step 1: Find the Failed Pod
First, identify which pipeline and pod is failing:
```bash
# List recent PipelineRuns
kubectl get pipelinerun -n <namespace> --sort-by=.metadata.creationTimestamp
# Or if you know the PipelineRun name:
kubectl get pipelinerun <pr-name> -n <namespace>
# Find the associated TaskRuns
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
# Find the pod with the issue
kubectl get pods -n <namespace> | grep ImagePullBackOff
```
### Step 2: Check Pod Events
Once you've identified the pod, check the events for specific details:
```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
```
This will show you:
- **Exact image name** that failed to pull
- **Error reason** (authentication failed, not found, etc.)
- **Registry** being accessed
### Step 3: Common Root Causes & Fixes
**A. Image Name/Tag Typo**
- Check if the image name and tag are spelled correctly
- Verify the image actually exists in the registry
**B. Registry Authentication**
```bash
# Check if ServiceAccount has imagePullSecrets
kubectl get sa <service-account-name> -n <namespace> -o yaml
# Check if the secret exists
kubectl get secrets -n <namespace>
```
**Fix**: Add imagePullSecret to your ServiceAccount:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: <sa-name>
imagePullSecrets:
  - name: <registry-secret-name>
```
**C. Private Registry - Missing Credentials**
```bash
# Create a docker registry secret
kubectl create secret docker-registry <secret-name> \
--docker-server=<registry-url> \
--docker-username=<username> \
--docker-password=<password> \
--docker-email=<email> \
-n <namespace>
```
**D. Image Doesn't Exist**
- Verify the image exists in your registry
- Check if the tag is correct (common issue: using `latest` when it doesn't exist)
**E. Network/Registry Issues**
```bash
# Check if the registry is accessible from the cluster
kubectl run test-pull --image=<your-image> -n <namespace> --rm -it --restart=Never
```
### Quick Diagnosis Commands
Run these to gather all relevant info:
```bash
# Get full pod details
kubectl describe pod <pod-name> -n <namespace>
# Check the TaskRun spec for image references
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A2 "image:"
# View ServiceAccount configuration
kubectl get sa -n <namespace> -o yaml
```
---
**Next Steps:**
1. Run the commands above with your actual namespace and resource names
2. Share the output from `kubectl describe pod` (especially the Events section)
3. I can help you identify the exact cause and create the appropriate fix
What's your namespace and do you know the PipelineRun or pod name that's failing?


@@ -0,0 +1,59 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging guide, here's how to systematically debug your ImagePullBackOff issue:
## Immediate Steps
**1. Find the failing pod and check events:**
```bash
# Find your PipelineRun
kubectl get pipelinerun -n <namespace>
# Get the failed TaskRun
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
# Check pod events - this will show the ImagePullBackOff details
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
```
**2. The events will reveal the specific image problem. Common causes:**
- **Typo in image name/tag** - Check the exact image reference
- **Image doesn't exist** - Verify the image exists in the registry
- **Registry authentication** - Missing or incorrect imagePullSecrets
- **Private registry access** - Network policies blocking access
**3. Verify the image:**
```bash
# Check what image is being pulled (look in TaskRun spec)
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -i image
# Check ServiceAccount has imagePullSecrets
kubectl get sa <sa-name> -n <namespace> -o yaml
```
## Quick Fixes
**If it's an authentication issue:**
```bash
# Check if imagePullSecret exists
kubectl get secrets -n <namespace>
# Add imagePullSecret to ServiceAccount if missing
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
```
**If it's a typo:**
- Correct the image name/tag in your Task or Pipeline definition
**If the image doesn't exist:**
- Build and push the image first
- Or use the correct tag that exists
Would you like me to help you run these commands? I'll need:
1. Your namespace name
2. The PipelineRun name (or we can list them to find it)


@@ -0,0 +1,66 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
### 1. Find the Failed Pod and Check Events
```bash
# Find your PipelineRun and identify the failed TaskRun
kubectl get pipelinerun -n <namespace>
# Get TaskRuns for the failed pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
# Find the pod with ImagePullBackOff
kubectl get pods -n <namespace> | grep ImagePullBackOff
# Check events for the specific pod
kubectl describe pod <pod-name> -n <namespace>
```
Look in the Events section for details about **why** the image pull failed.
### 2. Common Root Causes
**Image Name/Tag Issues:**
- Typo in image name or tag
- Tag doesn't exist in registry
- Wrong registry URL
**Authentication Issues:**
- Missing imagePullSecret on ServiceAccount
- Invalid/expired registry credentials
- Wrong secret referenced
**Registry Accessibility:**
- Registry is down or unreachable
- Network policies blocking access
- Private registry requires authentication
### 3. Verify the Image
```bash
# Check what image the pod is trying to pull
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
# Check if ServiceAccount has imagePullSecrets
kubectl get sa <service-account-name> -n <namespace> -o yaml
```
### 4. Fix Based on Root Cause
**If image name is wrong:**
- Correct the image reference in your Task/Pipeline definition
**If authentication is missing:**
```bash
# Add imagePullSecret to ServiceAccount
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
```
**If credentials are invalid:**
- Update the docker-registry secret with valid credentials
Would you like me to help you investigate further? Please share:
1. Your namespace name
2. The PipelineRun or Pod name that's failing
3. The output of `kubectl get pods -n <namespace>` showing the ImagePullBackOff
I can then run the specific commands to identify the exact root cause.


@@ -0,0 +1,46 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging guide, here's how to find the error in your failed build TaskRun:
## Quick Steps to Find the Error
1. **Get the logs from the failed step:**
```bash
# First, find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>
# Then get logs from the build step (or whichever step failed)
kubectl logs <pod-name> -c step-build -n <namespace>
# If you don't know which step failed, get all logs:
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
2. **Check TaskRun details to see which step failed:**
```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```
Look for the "Status" section which shows which step failed and the exit code.
3. **Review Kubernetes events for additional context:**
```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
```
## What to Look For in the Logs
- **Error messages**: Search for keywords like "error", "failed", "fatal"
- **The last successful operation** before the failure
- **Exit code explanation**:
- Exit code 1 = General failure (check the error message in logs)
- Exit code 127 = Command not found
- Exit code 137 = OOMKilled (out of memory)
## Common Exit Code 1 Causes
- Build script errors (compilation failures, test failures)
- Missing dependencies
- Configuration errors
- Network issues during dependency downloads
**Do you have the TaskRun name and namespace handy?** I can help you run these commands to find the specific error.


@@ -0,0 +1,52 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging guide, here's how to find the error in your failed build TaskRun:
## Quick Steps to Find Your Error
**1. Get the logs from your failed TaskRun:**
```bash
# First, find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>
# Then get the logs (replace with your pod name)
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
**2. If you know which step failed, get logs from that specific step:**
```bash
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```
**3. Check the TaskRun status for clues:**
```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```
Look for the `Status` section which shows:
- Which step failed
- Exit code details
- Failure reasons
**4. Check Kubernetes events for additional context:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
## Common Exit Code 1 Causes
Since you got exit code 1, it typically means:
- **Build script error** - compilation failed, tests failed, linting error
- **Command failed** - a command in your build script returned an error
- **Missing dependency** - required tool or package not found
The logs from step 1 will show the actual error message before the exit.
**Can you provide:**
1. Your TaskRun name and namespace?
2. Or paste the output from `kubectl describe taskrun <name>`?
This will help me pinpoint the exact error for you.


@@ -0,0 +1,49 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging guide, here's how to find the error in your failed build TaskRun:
## Quick Steps to Find Your Error
**1. Get the TaskRun logs** (most likely to show the error):
```bash
# Find the pod associated with your TaskRun
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>
# Get logs from all containers to see the error
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
**2. Check the TaskRun details** to see which step failed:
```bash
kubectl describe taskrun <your-taskrun-name> -n <namespace>
```
Look for the "Status" section which will show:
- Which step failed
- The exit code (you mentioned exit code 1)
- Failure reason/message
**3. Get logs from the specific failed step** (once you identify it):
```bash
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
```
**4. Check events** for additional context:
```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
```
## What to Look For in Logs
Exit code 1 typically means a general script/command failure. Search the logs for:
- Error messages (grep for "error", "failed", "fatal")
- The last command that executed before failure
- Build tool specific errors (npm, maven, gradle, etc.)
- Test failures
- Missing files or dependencies
**Need help running these commands?** If you provide me with:
- Your TaskRun name
- Your namespace
- Access to run kubectl commands in your environment
I can help you execute these commands and analyze the output to pinpoint the exact error.


@@ -0,0 +1,58 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging skill loaded, here's how to find out why your TaskRun is stuck in Pending state:
## Quick Diagnostic Steps
**1. Check the TaskRun status and details:**
```bash
kubectl get taskrun <tr-name> -n <namespace>
kubectl describe taskrun <tr-name> -n <namespace>
```
**2. Check for scheduling issues via events:**
```bash
# Get recent events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <tr-name>
# Or check pod events directly
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
```
**3. Check resource quotas and constraints:**
```bash
# Check namespace quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
# Check node capacity
kubectl describe node | grep -A5 "Allocated resources"
```
## Common Causes of Pending TaskRuns
| Cause | What to Look For | Solution |
|-------|------------------|----------|
| **Resource quota exceeded** | Events: `FailedScheduling`, quota errors | Increase namespace quota or reduce resource requests |
| **Insufficient cluster resources** | Events: `FailedScheduling` (no nodes available) | Wait for resources or scale cluster |
| **Missing PVC/workspace** | Events: `FailedMount` or volume errors | Create required PVC or fix workspace config |
| **Invalid ServiceAccount** | Events: permission errors | Verify ServiceAccount exists and has proper permissions |
| **Image pull secrets missing** | Pod can't be created, image pull errors | Add imagePullSecrets to ServiceAccount |
## Most Likely Issue
If TaskRun is **Pending**, it typically means the underlying pod cannot be scheduled. The events will tell you exactly why:
```bash
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```
Look for messages like:
- `0/X nodes are available: insufficient memory/cpu` → Resource constraints
- `persistentvolumeclaim "X" not found` → Missing PVC
- `exceeded quota` → Namespace quota limit
Would you like me to help you run these commands? Please provide:
- Your TaskRun name
- Your namespace
Or if you already have the output from any of these commands, share it and I can help interpret the results.


@@ -0,0 +1,63 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging-pipeline-failures skill, here's how to find out why your TaskRun is stuck in Pending state:
## Quick Diagnosis Steps
**1. Check the TaskRun status and conditions:**
```bash
kubectl get taskrun <tr-name> -n <namespace>
kubectl describe taskrun <tr-name> -n <namespace>
```
**2. Check for pod existence and status:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```
**3. Check Kubernetes events (most revealing for Pending issues):**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Or filter for the specific TaskRun pod
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```
## Common Causes for Pending State
The events will typically reveal one of these issues:
| Event Type | Root Cause | Solution |
|------------|------------|----------|
| `FailedScheduling` | Resource constraints - not enough CPU/memory | Check namespace quotas and node capacity |
| `Insufficient cpu/memory` | Node can't satisfy resource requests | Reduce resource requests or add nodes |
| No pod exists yet | TaskRun hasn't created pod | Check PipelineRun/Task configuration |
| PVC issues | Workspace PVC not bound | Check PVC status with `kubectl get pvc` |
| Node selector mismatch | No nodes match selectors/taints | Review nodeSelector and tolerations |
## Detailed Investigation
**Check resource quotas:**
```bash
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```
**Check node capacity:**
```bash
kubectl describe node | grep -A5 "Allocated resources"
```
**Check PVC status (if using workspaces):**
```bash
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
```
**Inspect TaskRun YAML for resource requests:**
```bash
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A10 "resources:"
```
Would you like me to help you run these commands? Please provide:
- The TaskRun name
- The namespace
- Or share the output of `kubectl describe taskrun <tr-name>`


@@ -0,0 +1,50 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging guide, here's how to investigate a TaskRun stuck in Pending state:
## Quick Diagnostic Steps
**1. Check the TaskRun status:**
```bash
kubectl get taskrun <tr-name> -n <namespace>
kubectl describe taskrun <tr-name> -n <namespace>
```
**2. Check for resource constraint events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events --field-selector involvedObject.name=<tr-name> -n <namespace>
```
Look for these critical events:
- `FailedScheduling` → Resource constraints (CPU/memory quota exceeded or insufficient node capacity)
- `FailedMount` → Volume/PVC issues
**3. Check resource quotas:**
```bash
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```
**4. Check if the pod exists:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```
If no pod exists, the TaskRun can't schedule due to resource constraints.
## Common Causes for Pending TaskRuns
1. **Namespace quota exceeded** - No CPU/memory quota available
2. **Insufficient cluster resources** - No nodes with available capacity
3. **PVC not bound** - Workspace PersistentVolumeClaim is pending
4. **Node selector mismatch** - No nodes match the required labels
5. **Pod security admission** - Security policies blocking pod creation
## Next Steps
Once you identify the issue from events/quotas, you can:
- Increase namespace resource quotas
- Reduce resource requests in the Task definition
- Fix PVC binding issues
- Adjust node selectors or taints/tolerations
**Would you like me to help you run these diagnostic commands for a specific TaskRun?** Just provide the TaskRun name and namespace.


@@ -0,0 +1,22 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Before just increasing the timeout, let's investigate **why** your pipeline is failing. Timeouts are often a symptom of an underlying issue, not the root cause itself.
Simply increasing the timeout might mask problems like:
- **Hung processes** or deadlocks that will never complete
- **Resource starvation** (CPU, memory, network)
- **Inefficient operations** that could be optimized
- **Infrastructure issues** like slow I/O or network problems
- **Misconfigurations** in the pipeline setup
I can help you debug this. I have access to a specialized skill for debugging Konflux/Tekton pipeline failures. To investigate, I'll need to know:
1. **What type of pipeline is this?** (Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
2. **Where is it timing out?** (Specific stage/task/step)
3. **Can you share:**
- The pipeline configuration file
- Recent failure logs
- Or point me to the pipeline definition in your repo
If this is a Konflux/Tekton pipeline, I can invoke the debugging skill to systematically analyze the failure. Otherwise, I can still help investigate once I see the pipeline setup and logs.
What information can you provide about the failing pipeline?


@@ -0,0 +1,15 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Before just increasing the timeout, it's better to understand **why** your pipeline is failing. A timeout is often a symptom of an underlying issue rather than the root cause.
Let me help you debug this systematically. I can use a specialized skill for debugging pipeline failures that will help us:
1. Identify what's actually timing out or failing
2. Check for common issues (stuck tasks, resource problems, build errors)
3. Examine logs to find the root cause
4. Determine if a timeout increase is actually the right solution
Would you like me to investigate your pipeline failure? If so, I'll need to know:
- What type of pipeline are you running? (e.g., Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
- Do you have access to the pipeline logs or configuration files in this directory?
Or if this is a Konflux/Tekton pipeline, I can start debugging right away using the specialized skill.


@@ -0,0 +1,25 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Now I can help you properly investigate the pipeline failure. Rather than just increasing the timeout (which usually masks the real problem), let's find out what's actually going wrong.
**To get started, I need some information:**
1. **What namespace is your pipeline running in?**
2. **Do you know the PipelineRun name?** (or can you describe which pipeline is failing?)
Once you provide this, I'll systematically investigate by:
- Checking the PipelineRun status to see which TaskRun failed
- Examining logs for the actual error
- Reviewing Kubernetes events for resource issues
- Identifying the root cause (slow operation, resource constraint, actual failure, etc.)
**Common scenarios where timeout increases are NOT the solution:**
- Build scripts with errors (needs code fix)
- Resource exhaustion (needs quota/limit adjustments)
- Image pull failures (needs registry/auth fixes)
- Missing dependencies (needs build config changes)
**When timeout increases ARE appropriate:**
- Legitimately slow operations (large downloads, complex builds)
- After verifying the operation is progressing but just needs more time
So let's find out which category your failure falls into. What's the namespace and PipelineRun name?


@@ -0,0 +1,85 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
For a PipelineRun stuck in Running state for over an hour, here's a systematic approach to diagnose the issue:
## Immediate Investigation Steps
**1. Check which TaskRuns are actually running or stuck:**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```
Look for TaskRuns that are:
- **Pending** → Resource or scheduling issues
- **Running** for too long → Timeout or hanging process
**2. Examine the PipelineRun details:**
```bash
kubectl describe pipelinerun <pr-name> -n <namespace>
```
Check for conditions, recent events, and which TaskRun is currently executing.
**3. If a TaskRun is stuck in "Pending":**
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
# Check why it's not scheduling
kubectl describe pod <pod-name> -n <namespace>
```
Common causes:
- Insufficient cluster resources (CPU/memory)
- Namespace quota exceeded
- PVC not bound (for workspace volumes)
**4. If a TaskRun is stuck in "Running":**
```bash
# Check logs to see if it's making progress
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
Look for:
- Is the process actually doing work or hung?
- Last log message timestamp
- Network operations that might be stalled
**5. Check Kubernetes events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
Events will reveal:
- `FailedScheduling` → Not enough resources
- `FailedMount` → Volume/PVC problems
- Image pull issues
## Most Common Causes for Stuck Pipelines
1. **Resource constraints** - Pod waiting for available CPU/memory
2. **Timeout not configured** - Task running indefinitely
3. **Hanging process** - Network download stalled, waiting for input
4. **PVC binding issues** - Workspace volume not ready
5. **Dead node** - Pod scheduled on unresponsive node
## Quick Diagnostic Command Sequence
Run these in parallel to get a full picture:
```bash
# Check overall status
kubectl get pipelinerun <pr-name> -n <namespace>
# See all TaskRuns and their states
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
# Check recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Check namespace resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
```
Would you like me to help you run these commands? I'll need:
- The PipelineRun name
- The namespace


@@ -0,0 +1,68 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging guide, here's what you should check for a PipelineRun stuck in Running state for over an hour:
## Immediate Checks
**1. Identify which TaskRun(s) are stuck:**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```
Look for TaskRuns that are either:
- **Pending** → Resource constraints or scheduling issues
- **Running** for too long → Timeout or hanging process
**2. Check the stuck TaskRun's pod status:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```
**3. Review recent events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
Look for:
- `FailedScheduling` → Resource quota exceeded or insufficient cluster resources
- `FailedMount` → PVC or workspace issues
- `ImagePullBackOff` → Can't pull container image
## Common Causes for Stuck Pipelines
### **If TaskRun is Pending:**
```bash
# Check resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
# Check node resources
kubectl describe node | grep -A5 "Allocated resources"
```
**Likely cause:** Insufficient resources to schedule the pod
### **If TaskRun is Running:**
```bash
# Check the logs for the current step
kubectl logs <pod-name> --all-containers=true -n <namespace>
# Check which step is currently executing
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.steps[*].name}{"\n"}{.status.steps[*].terminated}'
```
**Likely causes:**
- Process hanging without output
- Network operation stalled (downloading dependencies)
- Timeout set too high, waiting for operation to complete
### **If Pod hasn't started:**
```bash
kubectl describe pod <pod-name> -n <namespace>
```
Check the Events section for ImagePull errors or volume mount failures.
## Quick Decision Path
1. **TaskRuns all Pending?** → Resource/quota issue
2. **One TaskRun Running forever?** → Check its logs for hanging process
3. **No pod created yet?** → Image or workspace configuration problem
4. **Pod exists but waiting?** → Check events for FailedMount/ImagePull errors
What namespace and PipelineRun name are you working with? I can help you run these diagnostic commands.


@@ -0,0 +1,48 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
## Quick Diagnostic Steps
**1. Check TaskRun Status**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```
This shows which TaskRuns are Pending, Running, or Completed.
**2. For Pending TaskRuns** - Resource Constraints:
```bash
# Check namespace quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
# Check node capacity
kubectl describe node | grep -A5 "Allocated resources"
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
Look for `FailedScheduling` events.
**3. For Running TaskRuns** - Progress Check:
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
# Check logs for the running step
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
Look for signs of progress or if it's hanging.
**4. Check for Timeouts**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}'
```
## Common Causes
1. **Pending TaskRun** → Insufficient resources, quota exceeded, or no available nodes
2. **Running but hung** → Network operation timeout, process hanging, or slow build
3. **Waiting for dependencies** → Previous task not completing, workspace/volume issues
Would you like me to help you run these diagnostic commands? Please provide:
- Your PipelineRun name
- Namespace
- Or share the output of `kubectl get pipelinerun <pr-name> -n <namespace>`


@@ -0,0 +1,24 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
I need your approval to run kubectl commands to investigate the failed PipelineRun. Here's my systematic investigation plan:
## Investigation Steps
**Phase 1: Identify Failed Component**
1. Check PipelineRun status to see overall state and failure reasons
2. Identify which TaskRun(s) failed
3. Examine duration and timestamps
**Phase 2: Log Analysis**
4. Get logs from the failed TaskRun's pod
5. Search for error messages and exit codes
6. Identify the last successful step before failure
**Phase 3: Event Correlation**
7. Check Kubernetes events for issues like ImagePullBackOff, FailedMount, resource constraints
**Phase 4: Resource Inspection**
8. Inspect PipelineRun and TaskRun YAML for configuration issues
Once you approve, I'll run these commands to gather the diagnostic information and identify the root cause of your pipeline failure.
Would you like me to proceed with the investigation?


@@ -0,0 +1,90 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
I see that kubectl commands require approval in this environment. Let me provide you with the systematic investigation steps you should follow:
## Investigation Steps for 'component-build-xyz'
### **Phase 1: Identify Failed Component**
Run these commands to understand what failed:
```bash
# Check overall PipelineRun status
kubectl get pipelinerun component-build-xyz -n user-tenant
# Get detailed information
kubectl describe pipelinerun component-build-xyz -n user-tenant
# List all TaskRuns in this pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz -n user-tenant
```
**What to look for:**
- Which TaskRun(s) show `Failed` status
- The failure reason in the Conditions section
- Timestamps to understand when it failed
### **Phase 2: Analyze Failed TaskRun Logs**
Once you identify the failed TaskRun (let's call it `<failed-tr-name>`):
```bash
# Find the pod for the failed TaskRun
kubectl get pods -l tekton.dev/taskRun=<failed-tr-name> -n user-tenant
# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n user-tenant
# Or check specific step that failed
kubectl logs <pod-name> -c step-<step-name> -n user-tenant
```
**Search for:**
- Error messages (grep for "error", "failed", "fatal")
- Exit codes
- The last successful operation before failure
### **Phase 3: Check Kubernetes Events**
```bash
# Get recent events sorted by time
kubectl get events -n user-tenant --sort-by='.lastTimestamp' | tail -20
# Filter for specific pod if you found it
kubectl get events --field-selector involvedObject.name=<pod-name> -n user-tenant
```
**Look for critical events:**
- `ImagePullBackOff` - Image/registry issues
- `FailedScheduling` - Resource constraints
- `FailedMount` - Volume/PVC problems
- `OOMKilled` - Memory limits exceeded
### **Phase 4: Inspect Resource Details**
```bash
# Get full PipelineRun YAML
kubectl get pipelinerun component-build-xyz -n user-tenant -o yaml > pr-details.yaml
# Get failed TaskRun details
kubectl get taskrun <failed-tr-name> -n user-tenant -o yaml > tr-details.yaml
# Describe the pod
kubectl describe pod <pod-name> -n user-tenant
```
## Common Failure Scenarios
Based on the failure pattern, here's what to check:
| **If you see** | **Root cause** | **Fix** |
|----------------|----------------|---------|
| ImagePullBackOff | Wrong image name/tag, missing credentials | Verify image exists, check ServiceAccount imagePullSecrets |
| OOMKilled or exit code 137 | Out of memory | Increase memory limits in Task definition |
| Exit code 127 | Command not found | Wrong container image or missing tools |
| Timeout in status | Operation took too long | Increase timeout or optimize build |
| Permission denied | RBAC/ServiceAccount issues | Check ServiceAccount permissions |
| Volume mount errors | PVC not bound or misconfigured | Check PVC status: `kubectl get pvc -n user-tenant` |
---
**Would you like me to help analyze the output once you run these commands?** Just share the results and I'll help identify the root cause and recommend the fix.


@@ -0,0 +1,15 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
I need your approval to run kubectl commands to investigate the failed PipelineRun. These commands will:
1. **Get the PipelineRun status** - to see the overall state and failure information
2. **Describe the PipelineRun** - to get detailed information about conditions, failed tasks, and timestamps
Once approved, I'll follow this systematic investigation process:
1. ✓ Check PipelineRun status and identify which components failed
2. Identify the specific TaskRun(s) that failed
3. Get logs from the failed TaskRun pods to find error messages
4. Check Kubernetes events for additional context (image pull issues, resource constraints, etc.)
5. Correlate all findings to identify the root cause and recommend fixes
Should I proceed with running these kubectl commands to investigate your failed PipelineRun?

tests/scenarios.yaml

@@ -0,0 +1,90 @@
skill_name: debugging-pipeline-failures
description: Validates systematic Konflux pipeline debugging methodology using standard kubectl and Tekton CLI approaches
# Note: .config/gcloud is automatically copied for all tests (needed for Claude Code API auth)
# Add skill-specific paths here if needed (e.g., .kube/config for kubectl)
test_scenarios:
  - name: systematic-investigation-approach
    description: Validates that Claude follows systematic phases for pipeline failure investigation
    prompt: "My PipelineRun 'component-build-xyz' in namespace 'user-tenant' failed. How do I investigate?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl
        - pipelinerun
        - taskrun
        - logs
        - events
        - systematic
    baseline_failure: "Without skill, may suggest random checks without systematic methodology"
  - name: image-pull-failure-diagnosis
    description: Tests diagnosis approach for ImagePullBackOff errors in build pipelines
    prompt: "My build pipeline is failing with ImagePullBackOff. How do I debug this?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - describe pod
        - events
        - image
        - registry
        - imagePullSecrets
    baseline_failure: "Without skill, may not check pod events or ServiceAccount imagePullSecrets"
  - name: stuck-pipeline-investigation
    description: Validates methodology for pipelines stuck in Running state
    prompt: "My PipelineRun has been in Running state for over an hour. What should I check?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - taskrun
        - status
        - pending
        - running
        - logs
    baseline_failure: "Without skill, may not check individual TaskRun statuses to identify stuck component"
  - name: resource-constraint-recognition
    description: Tests identification of resource exhaustion issues
    prompt: "TaskRun is stuck in Pending state. How do I find out why?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["events", "describe"]
        - ["quota", "limits", "resources"]
        - ["FailedScheduling", "pending", "stuck", "scheduling"]
    baseline_failure: "Without skill, may not check events for FailedScheduling or namespace quotas"
  - name: log-analysis-methodology
    description: Ensures proper log analysis approach for build failures
    prompt: "My build TaskRun failed with exit code 1. How do I find the error?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl logs
        - step
        - container
        - error
    baseline_failure: "Without skill, may not know how to get logs from specific Tekton step containers"
  - name: root-cause-vs-symptom
    description: Validates distinction between symptoms and root causes
    prompt: "My pipeline keeps failing. Should I just increase the timeout?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["logs", "describe", "output", "details"]
        - ["root cause", "proper fix", "actual problem", "actually doing", "pinpoint"]
        - ["investigate", "identify", "check", "find out", "determine"]
        - ["why", "what"]
      does_not_contain:
        - just increase
        - simply add
    baseline_failure: "Without skill, may suggest increasing timeout without investigating root cause"