Initial commit
.claude-plugin/plugin.json (new file)
@@ -0,0 +1,11 @@
{
  "name": "debugging-pipeline-failures",
  "description": "Use when investigating Konflux Tekton pipeline failures, PipelineRun/TaskRun issues, build failures, or CI/CD workflow problems. Provides systematic debugging methodology using standard kubectl commands for comprehensive root cause analysis in Konflux environments.",
  "version": "1.0.0",
  "author": {
    "name": "Gjorgji Georgievski"
  },
  "skills": [
    "./"
  ]
}
README.md (new file)
@@ -0,0 +1,3 @@
# debugging-pipeline-failures

Use when investigating Konflux Tekton pipeline failures, PipelineRun/TaskRun issues, build failures, or CI/CD workflow problems. Provides systematic debugging methodology using standard kubectl commands for comprehensive root cause analysis in Konflux environments.
SKILL.md (new file)
@@ -0,0 +1,365 @@
---
name: debugging-pipeline-failures
description: Use when Konflux pipelines fail, are stuck, time out, or show errors like ImagePullBackOff. Covers PipelineRun failures, TaskRun issues (Pending, Failed, stuck Running), build errors, and systematic debugging of Tekton pipeline problems using kubectl and logs.
---

# Debugging Pipeline Failures

## Overview

**Core Principle**: Systematic investigation of Konflux CI/CD failures by correlating logs, events, and resource states to identify root causes.

**Key Abbreviations**:
- **PR** = PipelineRun
- **TR** = TaskRun
- **SA** = ServiceAccount
- **PVC** = PersistentVolumeClaim
## When to Use

Invoke when encountering:
- PipelineRun failures or stuck pipelines
- TaskRun errors with unclear messages
- Build container issues (ImagePullBackOff)
- Resource constraints (OOMKilled, quota exceeded)
- Pipeline timeouts
- Workspace or volume mount failures
- Permission errors
## Quick Reference

| Symptom | First Check | Common Cause |
|---------|-------------|--------------|
| ImagePullBackOff | Pod events, image name | Registry auth, typo, missing image |
| TaskRun timeout | Step execution time in logs | Slow operation, network issues |
| Pending TaskRun | Resource quotas, node capacity | Quota exceeded, insufficient resources |
| Permission denied | ServiceAccount, RBAC | Missing Role/RoleBinding |
| Volume mount error | PVC status, workspace config | PVC not bound, wrong access mode |
| Exit code 127 | Container logs, command | Command not found, wrong image |
## Investigation Phases

### Phase 1: Identify Failed Component

**PipelineRun Status Check**:
```bash
kubectl get pipelinerun <pr-name> -n <namespace>
kubectl describe pipelinerun <pr-name> -n <namespace>
```

Look for:
- Overall status (Succeeded/Failed/Running)
- Conditions and reasons
- Which TaskRun(s) failed
- Duration and timestamps

**TaskRun Identification**:
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```

Identify failed TaskRuns by status.
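To see at a glance which TaskRuns failed and why, a custom-columns query can surface the Succeeded condition directly. A minimal sketch using standard kubectl output formatting; it assumes the first entry in `.status.conditions` is the Succeeded condition, which is the usual layout for Tekton TaskRuns.

```bash
# List TaskRuns of one PipelineRun with their success status and reason
# (assumes conditions[0] is the Succeeded condition, the common Tekton layout)
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace> \
  -o custom-columns='NAME:.metadata.name,SUCCEEDED:.status.conditions[0].status,REASON:.status.conditions[0].reason'
```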
### Phase 2: Log Analysis

**Get TaskRun Pod Logs**:
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Get logs from a specific step
kubectl logs <pod-name> -c step-<step-name> -n <namespace>

# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n <namespace>

# For previous failures
kubectl logs <pod-name> -c step-<step-name> --previous -n <namespace>
```

**What to Look For**:
- Error messages (search for "error", "failed", "fatal" — see the filter sketch below)
- Exit codes
- Last successful operation before failure
- Timeout indicators
- Resource exhaustion messages
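A quick way to apply that checklist is to filter the combined container logs for common failure keywords. A simple sketch, not an exhaustive filter; adjust the pattern to the build tool in use.

```bash
# Surface likely error lines from all step containers (case-insensitive)
kubectl logs <pod-name> --all-containers=true -n <namespace> \
  | grep -iE 'error|fail|fatal|timeout|out of memory' | tail -n 40
```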
### Phase 3: Event Correlation

**Check Kubernetes Events**:
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Filter for specific resource
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
```

**Critical Events**:
- `FailedScheduling` → Resource constraints
- `FailedMount` → Volume/PVC issues
- `ImagePullBackOff` → Registry/image problems
- `Evicted` → Resource pressure
### Phase 4: Resource Inspection

**PipelineRun Details**:
```bash
kubectl get pipelinerun <pr-name> -n <namespace> -o yaml
```

Check:
- Parameters passed correctly
- Workspace configurations
- ServiceAccount specified
- Timeout values

**TaskRun Details**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o yaml
```

Examine:
- Step definitions and images
- Resource requests/limits
- `status.steps` for individual step states (see the query below)
- Conditions for failure reasons
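When scanning `status.steps`, a jsonpath query can pull just the per-step termination details without paging through the full YAML. A minimal sketch; field paths follow the Tekton TaskRun status layout, where completed steps carry a `terminated` state.

```bash
# Per-step name, exit code, and termination reason for a TaskRun
kubectl get taskrun <tr-name> -n <namespace> \
  -o jsonpath='{range .status.steps[*]}{.name}{"\t"}{.terminated.exitCode}{"\t"}{.terminated.reason}{"\n"}{end}'
```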
**Pod Inspection**:
```bash
kubectl describe pod <pod-name> -n <namespace>
```

Look for:
- Container states and exit codes
- Resource requests vs limits
- Volume mounts
- Node placement
### Phase 5: Root Cause Analysis

**Correlate Findings**:
1. Timeline: When did the failure occur?
2. First failure: Which step/component failed first?
3. Error pattern: Consistent or intermittent?
4. Recent changes: New code, config, images?

**Distinguish Symptom from Cause**:
- ❌ "Build failed" (symptom)
- ✓ "npm install timed out due to registry being unavailable" (root cause)
## Common Failure Patterns

### 1. Image Pull Failures

**Symptoms**: `ImagePullBackOff`, `ErrImagePull`

**Investigation**:
```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events"
```

**Check**:
- Image name and tag spelling
- Image exists in registry
- ServiceAccount has imagePullSecrets
- Registry is accessible

**Common Fixes**:
- Correct image name/tag
- Add imagePullSecret to the ServiceAccount (see the sketch below)
- Verify registry credentials
- Check network policies
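Attaching a pull secret to the pipeline's ServiceAccount is a one-line patch. A minimal sketch; `<registry-secret>` is assumed to be a `kubernetes.io/dockerconfigjson` secret in the same namespace (create it first if needed).

```bash
# Create the registry credential (skip if it already exists)
kubectl create secret docker-registry <registry-secret> \
  --docker-server=<registry-url> --docker-username=<user> --docker-password=<password> \
  -n <namespace>

# Attach it to the ServiceAccount used by the PipelineRun
kubectl patch serviceaccount <sa-name> -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "<registry-secret>"}]}'
```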
### 2. Resource Exhaustion

**Symptoms**: `OOMKilled`, `Pending` pods, quota errors

**Investigation**:
```bash
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
kubectl top pods -n <namespace>
kubectl describe node | grep -A5 "Allocated resources"
```

**Common Causes**:
- Memory limits too low
- Namespace quota exceeded
- No nodes with available resources

**Fixes**:
- Increase resource limits in the Task (see the sketch below)
- Adjust namespace quotas
- Optimize memory usage in the build
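Raising a step's memory budget happens on the Task definition. A minimal sketch for the Tekton v1 API, where per-step requests/limits live under `computeResources` (older v1beta1 Tasks use a `resources` field instead); the task name, image, and sizes are placeholders.

```yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build                # placeholder task name
spec:
  steps:
    - name: build
      image: <builder-image>
      computeResources:      # v1 field; v1beta1 used `resources`
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          memory: 2Gi        # raise this if the step is OOMKilled
```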
### 3. Build Script Failures

**Symptoms**: Non-zero exit code, "command not found"

**Investigation**:
```bash
kubectl logs <pod-name> -c step-build -n <namespace>
```

**Check**:
- Script syntax errors
- Missing tools in the container image
- Wrong working directory
- Environment variables not set

**Fixes**:
- Fix script errors
- Use an image with the required tools
- Set the correct workingDir in the Task (see the sketch below)
- Pass required params/env vars
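A Task step can pin the working directory and required environment variables explicitly instead of relying on image defaults. A minimal sketch; the task, param, image, and workspace names are placeholders, and the `workingDir` assumes a workspace named `source` is mounted at `/workspace/source`.

```yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build                          # placeholder
spec:
  params:
    - name: npm-registry
      type: string
  steps:
    - name: build
      image: <builder-image-with-npm>  # must contain the tools the script calls
      workingDir: /workspace/source    # assumes a workspace named "source"
      env:
        - name: NPM_CONFIG_REGISTRY    # example env var consumed by the script
          value: $(params.npm-registry)
      script: |
        #!/usr/bin/env bash
        set -euo pipefail
        npm ci && npm run build
```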
### 4. Timeout Issues

**Symptoms**: TaskRun shows a timeout in its status

**Investigation**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}{"\n"}{.status.completionTime}'
```

**Common Causes**:
- Timeout value too low
- Slow network operations (downloads)
- Build complexity underestimated
- Process hanging

**Fixes**:
- Increase the timeout in the Task/PipelineRun (see the sketch below)
- Use caching for dependencies
- Optimize the build process
- Add progress logging to detect hangs
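If the investigation shows the work is legitimate but slow, timeouts are raised on the PipelineRun (or per Task). A minimal sketch of the Tekton v1 `timeouts` block; the names and durations are illustrative only, and `pipeline` must cover `tasks` plus `finally`.

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: build-run          # placeholder
spec:
  pipelineRef:
    name: build-pipeline   # placeholder
  timeouts:
    pipeline: 1h30m        # overall budget for the whole PipelineRun
    tasks: 1h15m           # budget shared by all non-finally tasks
    finally: 15m           # budget reserved for finally tasks
```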
### 5. Workspace/Volume Issues

**Symptoms**: `CreateContainerError`, volume mount failures

**Investigation**:
```bash
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
```

**Check**:
- PVC exists and is Bound
- Workspace name matches between Pipeline and PipelineRun
- AccessMode is correct (RWO vs RWX)
- Storage class exists

**Fixes**:
- Create or fix the PVC (see the sketch below)
- Correct workspace name references
- Use the appropriate access mode
- Verify the storage provisioner
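Letting the PipelineRun provision its own claim via a `volumeClaimTemplate` avoids stale or mis-sized PVCs. A minimal sketch; the workspace name must match the one declared by the Pipeline, and the access mode, size, and storage class are placeholders.

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: build-run              # placeholder
spec:
  pipelineRef:
    name: build-pipeline       # placeholder
  workspaces:
    - name: source             # must match the Pipeline's workspace name
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # use RWX only if parallel tasks share the data
          resources:
            requests:
              storage: 1Gi
          # storageClassName: <storage-class>   # set if a non-default class is required
```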
### 6. Permission Errors

**Symptoms**: "Forbidden", "unauthorized", RBAC errors

**Investigation**:
```bash
kubectl get sa <sa-name> -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl auth can-i create pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>
```

**Check**:
- ServiceAccount exists
- Role/RoleBinding grants the needed permissions
- ClusterRole if cross-namespace access is needed

**Fixes**:
- Create the ServiceAccount
- Add a RoleBinding for the required permissions (see the sketch below)
- Grant pod creation, secret access, etc.
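A namespace-scoped Role plus RoleBinding is usually enough for pipeline ServiceAccounts. A minimal sketch granting the permissions mentioned above; the names are placeholders, and the verbs/resources should be tightened to what the tasks actually need.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pipeline-runner            # placeholder
  namespace: <namespace>
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create"]
  - apiGroups: [""]
    resources: ["secrets", "configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-runner-binding    # placeholder
  namespace: <namespace>
subjects:
  - kind: ServiceAccount
    name: <sa-name>
    namespace: <namespace>
roleRef:
  kind: Role
  name: pipeline-runner
  apiGroup: rbac.authorization.k8s.io
```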
## Common Confusions

### ✗ Incorrect Approach
"Pipeline failed, let me rerun it immediately"
- No root cause identified
- Will likely fail again
- Wastes resources and time

### ✓ Correct Approach
"Let me check logs and events to understand why it failed, then fix the root cause"
- Identifies the actual problem
- Prevents repeat failures
- Efficient resolution

---

### ✗ Incorrect Approach
"Build timed out. I'll set the timeout to 2 hours"
- May hide real issues
- Delays problem detection

### ✓ Correct Approach
"Let me check what operation is slow in the logs, then optimize or increase the timeout if truly needed"
- Identifies slow operations
- Optimizes where possible
- Sets an appropriate timeout

---

### ✗ Incorrect Approach
"Too many logs to read, I'll just try changing something"
- Random changes
- May make it worse
- Doesn't address the root cause

### ✓ Correct Approach
"I'll search logs for error keywords and check the last successful step before failure"
- Focused investigation
- Finds the actual error
- Targeted fix
## Troubleshooting Workflow

```
1. GET PIPELINERUN STATUS
   ↓
2. IDENTIFY FAILED TASKRUN(S)
   ↓
3. CHECK POD LOGS (specific step that failed)
   ↓
4. REVIEW EVENTS (timing correlation)
   ↓
5. INSPECT RESOURCE YAML (config issues)
   ↓
6. CORRELATE FINDINGS → IDENTIFY ROOT CAUSE
   ↓
7. APPLY FIX → VERIFY → DOCUMENT
```
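The same workflow as a copy-pasteable command sequence. A sketch only: substitute the real PipelineRun name and namespace, pick the failing TaskRun/pod from the first commands, and note that it assumes Tekton's standard `tekton.dev/pipelineRun` label is propagated to pods.

```bash
PR=<pr-name>; NS=<namespace>

kubectl get pipelinerun "$PR" -n "$NS"                                  # 1. overall status
kubectl get taskruns -l tekton.dev/pipelineRun="$PR" -n "$NS"           # 2. failed TaskRun(s)
kubectl get pods -l tekton.dev/pipelineRun="$PR" -n "$NS"               #    locate their pods
kubectl logs <pod-name> --all-containers=true -n "$NS" | tail -n 100    # 3. logs around the failure
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -n 20     # 4. recent events
kubectl get pipelinerun "$PR" -n "$NS" -o yaml                          # 5. config inspection
```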
## Decision Tree

**Q: Is the PipelineRun stuck in "Running"?**
- **Yes** → Check which TaskRuns are pending or running
  - Pending → Resource constraints (Pattern 2: Resource Exhaustion)
  - Running too long → Check logs for progress (Pattern 4: Timeout Issues)
- **No** → PipelineRun Failed → Continue

**Q: Which TaskRun failed first?**
- Check the status of all TaskRuns to find the first failure
- Focus investigation on that TaskRun

**Q: What does the pod log show?**
- Error message → Address the specific error
- No output → Check if the pod started (events)
- Exit code 127 → Command not found (wrong image)
- Exit code 137 → OOMKilled (increase memory)
- Other exit code → Script/command failure

**Q: Do events show image, volume, or scheduling issues?**
- ImagePullBackOff → Pattern 1: Image Pull Failures
- FailedMount → Pattern 5: Workspace/Volume Issues
- FailedScheduling → Pattern 2: Resource Exhaustion
## Keywords for Search

Konflux pipeline failure, Tekton debugging, PipelineRun failed, TaskRun errors, build failures, CI/CD troubleshooting, ImagePullBackOff, OOMKilled, kubectl logs, pipeline timeout, workspace errors, RBAC permissions
plugin.lock.json (new file)
@@ -0,0 +1,125 @@
{
|
||||
"$schema": "internal://schemas/plugin.lock.v1.json",
|
||||
"pluginId": "gh:konflux-ci/skills:skills/debugging-pipeline-failures",
|
||||
"normalized": {
|
||||
"repo": null,
|
||||
"ref": "refs/tags/v20251128.0",
|
||||
"commit": "16bd6bc6d75121716f54addf59f94ef13a2542ce",
|
||||
"treeHash": "0a9f968408a03b28487097344a8f1bb890e9d3d94e7ac323987c6103912a1c07",
|
||||
"generatedAt": "2025-11-28T10:19:54.573064Z",
|
||||
"toolVersion": "publish_plugins.py@0.2.0"
|
||||
},
|
||||
"origin": {
|
||||
"remote": "git@github.com:zhongweili/42plugin-data.git",
|
||||
"branch": "master",
|
||||
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
|
||||
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
|
||||
},
|
||||
"manifest": {
|
||||
"name": "debugging-pipeline-failures",
|
||||
"description": "Use when investigating Konflux Tekton pipeline failures, PipelineRun/TaskRun issues, build failures, or CI/CD workflow problems. Provides systematic debugging methodology using standard kubectl commands for comprehensive root cause analysis in Konflux environments.",
|
||||
"version": "1.0.0"
|
||||
},
|
||||
"content": {
|
||||
"files": [
|
||||
{
|
||||
"path": "README.md",
|
||||
"sha256": "e7b2a7b2296edd32a8539bcb798e237293473f9e3595ab34ba455a813faf6372"
|
||||
},
|
||||
{
|
||||
"path": "SKILL.md",
|
||||
"sha256": "428e8787766362d5099309c71ccc2d5f250483088733d9d01c18cbdd5bad99b1"
|
||||
},
|
||||
{
|
||||
"path": "tests/README.md",
|
||||
"sha256": "191e4ff12d8ac7b38ca777ecf8d7d8af248b8d124ad4928f68887d9e1dda6358"
|
||||
},
|
||||
{
|
||||
"path": "tests/scenarios.yaml",
|
||||
"sha256": "7e3ea3d6e3e0754715d9f1ef11ccd3e434afa8adff02205a7b41ddadae80ade9"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/root-cause-vs-symptom.1.txt",
|
||||
"sha256": "79117a2dd68a93318f8062ba6e0dc0364d6fb07a4e2bc4f145fa3e3a99ee1838"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/root-cause-vs-symptom.3.txt",
|
||||
"sha256": "83730eb1b99886b8279a4ba1c1b975e2e6db2153d105e69c8a055f823486060b"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/root-cause-vs-symptom.2.txt",
|
||||
"sha256": "56edb8a94c1c10ff5d52b0d9265915cdf9ba364e74e20e4bb78711bd74755204"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/systematic-investigation-approach.1.txt",
|
||||
"sha256": "3ed39e278ba44e574d9b4b1b43f27c6bb87d34d90d61769c3a2c7538df202268"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/systematic-investigation-approach.2.txt",
|
||||
"sha256": "46e63a5b4a9693e9203b77c9975bec5a5a605b9885c58eb9d0b69bb5ce18eefa"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/systematic-investigation-approach.3.txt",
|
||||
"sha256": "70fd138cc76a242e0fb3e276d495932a9ae3537929ec9db2dcb13bbe3a52e6aa"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/stuck-pipeline-investigation.1.txt",
|
||||
"sha256": "37dfb710f4cac186b8a952b70baff991d5a9bf129a9c44f062531ca54f24d6ac"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/stuck-pipeline-investigation.3.txt",
|
||||
"sha256": "3606144d16aa97bd73bfdd5edd1845ead2506bf68196d5a031243c7af7967e4a"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/stuck-pipeline-investigation.2.txt",
|
||||
"sha256": "4d477d831df4c43128c8abe7ebb1d7535df37ff4f52d8348240f900c3e697eab"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/resource-constraint-recognition.1.txt",
|
||||
"sha256": "842667ce74df3b649ab4692c71a1abcc41ce218b5ac672c84a9ab09f19dd3900"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/resource-constraint-recognition.3.txt",
|
||||
"sha256": "cbcfc3ca7368ac9c4c18a3c2969a03a869de636b9e748c98c093fd45b4a2b58b"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/resource-constraint-recognition.2.txt",
|
||||
"sha256": "2f60e31c000c2a1d046e37292a70b87a98cf013d3e2841898503ce5bf82984d8"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/log-analysis-methodology.1.txt",
|
||||
"sha256": "f0db369f32f445cb3842a2ca575a52f84c75a9bb19d90599d86de05e19975001"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/log-analysis-methodology.3.txt",
|
||||
"sha256": "eeb90fca671d91b083c14fe91911dec87a877bdad462fb441365cd757162b624"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/log-analysis-methodology.2.txt",
|
||||
"sha256": "c8ee1ec28670c34ba7ec7e036374a80684db79f262e01cee2f1730b5952c4ce4"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/image-pull-failure-diagnosis.2.txt",
|
||||
"sha256": "8f03bcf56aa29fa3a4f3d509eee0ccc247f0e5e386eda4240bfbc1b23af5b7ac"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/image-pull-failure-diagnosis.3.txt",
|
||||
"sha256": "5722149a2d9623ffe547ede7d1a3d8c1f2cd22f5b7d57aa6741de3c5e6675637"
|
||||
},
|
||||
{
|
||||
"path": "tests/results/image-pull-failure-diagnosis.1.txt",
|
||||
"sha256": "34fcbb194d13e95993dd20bc48e8976ebce3799ab7e7585cc83fe3efaf39416e"
|
||||
},
|
||||
{
|
||||
"path": ".claude-plugin/plugin.json",
|
||||
"sha256": "1256be683bab5bf0f049986d0d037afc5100d3f231e06aa518f679b05f37d1bb"
|
||||
}
|
||||
],
|
||||
"dirSha256": "0a9f968408a03b28487097344a8f1bb890e9d3d94e7ac323987c6103912a1c07"
|
||||
},
|
||||
"security": {
|
||||
"scannedAt": null,
|
||||
"scannerVersion": null,
|
||||
"flags": []
|
||||
}
|
||||
}
|
||||
tests/README.md (new file)
@@ -0,0 +1,180 @@
# Test Suite for Debugging Pipeline Failures Skill

## Overview

This test suite validates that the pipeline debugging skill properly teaches Claude Code to:
1. Follow a systematic investigation methodology
2. Use standard kubectl and Tekton commands
3. Distinguish root causes from symptoms
4. Correlate logs, events, and resource states
5. Provide actionable debugging steps

## Test Scenarios

### 1. Systematic Investigation Approach
**Purpose**: Validates Claude follows the phased methodology (identify → logs → events → resources → root cause)
**Expected**: Should mention a systematic approach with kubectl commands for PipelineRun and TaskRun inspection
**Baseline Failure**: Without the skill, may suggest random checks without structure

### 2. Image Pull Failure Diagnosis
**Purpose**: Tests systematic diagnosis of ImagePullBackOff errors
**Expected**: Should check pod events, image name, registry, and ServiceAccount imagePullSecrets
**Baseline Failure**: Without the skill, may not know to check pod describe or imagePullSecrets

### 3. Stuck Pipeline Investigation
**Purpose**: Validates the methodology for pipelines stuck in Running state
**Expected**: Should check individual TaskRun statuses to identify which is stuck/pending
**Baseline Failure**: Without the skill, may not know to list TaskRuns filtered by the pipelineRun label

### 4. Resource Constraint Recognition
**Purpose**: Tests identification of scheduling and quota issues
**Expected**: Should check events for FailedScheduling and namespace resource quotas
**Baseline Failure**: Without the skill, may not connect the Pending state with resource constraints

### 5. Log Analysis Methodology
**Purpose**: Ensures proper Tekton log retrieval for failed steps
**Expected**: Should know how to get logs from specific step containers in Tekton pods
**Baseline Failure**: Without the skill, may not understand Tekton step container naming

### 6. Root Cause vs Symptom
**Purpose**: Validates focus on investigation before applying fixes
**Expected**: Should recommend investigating logs and the root cause before increasing timeouts
**Baseline Failure**: Without the skill, may suggest quick fixes without investigation

## Running Tests

### Prerequisites

- Python 3.8+
- Claude Code CLI access
- Claude Sonnet 4.5 (tests use the `sonnet` model)
- Access to the test framework (if available in the konflux-ci/skills repo)

### Run All Tests

```bash
# From repository root
make test

# Or specifically for this skill
make test-only SKILL=debugging-pipeline-failures
```

### Validate Skill Schema

```bash
claudelint debugging-pipeline-failures/SKILL.md
```

### Generate Test Results

```bash
make generate SKILL=debugging-pipeline-failures
```

## Test-Driven Development Process

This skill followed TDD for Documentation:

### RED Phase (Initial Failures)
1. Created 6 test scenarios representing real pipeline debugging needs
2. Ran tests WITHOUT the skill
3. Documented baseline failures:
   - No systematic methodology
   - Didn't know Tekton-specific kubectl commands
   - Confused symptoms with root causes
   - Missing event and resource correlation

### GREEN Phase (Minimal Skill)
1. Created SKILL.md addressing the test failures
2. Added the 5-phase investigation methodology
3. Included kubectl command examples
4. Emphasized root cause analysis
5. All tests passed

### REFACTOR Phase (Improvement)
1. Added common failure patterns (6 types)
2. Enhanced with a decision tree
3. Improved the troubleshooting workflow
4. Added the common confusions section

## Success Criteria

All tests must:
- ✅ Pass with a 100% success rate (3/3 samples)
- ✅ Contain expected keywords (kubectl, systematic approach)
- ✅ NOT contain prohibited terms (quick fixes without investigation)
- ✅ Demonstrate the phased methodology
- ✅ Focus on standard Tekton/Kubernetes tools

## Continuous Validation

Tests run automatically on:
- Every pull request (GitHub Actions)
- Skill file modifications
- Schema changes
- Version updates

## Adding New Tests

To add test scenarios:

1. **Identify gap**: What failure pattern is missing?
2. **Create scenario**: Add to `scenarios.yaml`
3. **Run without skill**: Document the baseline failure
4. **Update SKILL.md**: Address the gap
5. **Validate**: Ensure the test passes

Example:
```yaml
- name: your-test-name
  description: What you're testing
  prompt: "User query to test"
  model: sonnet
  samples: 3
  expected:
    contains_keywords:
      - keyword1
      - keyword2
  baseline_failure: What happens without the skill
```

## Known Limitations

- Tests use synthetic scenarios (not real Konflux failures)
- Keyword matching is basic (could use semantic analysis)
- No integration testing with actual clusters
- Sample size (3) may not catch all edge cases

## Future Improvements

- Add tests for multi-step pipeline failures
- Include workspace debugging scenarios
- Add tests for intermittent failures
- Test with real Konflux pipeline YAML

## Troubleshooting

### Test Failures

**Symptom**: Test fails intermittently
**Fix**: Increase samples or refine the expected keywords

**Symptom**: All tests fail
**Fix**: Check SKILL.md frontmatter and schema validation

**Symptom**: Baseline failure unclear
**Fix**: Run the test manually without the skill and document the actual output

## Contributing

When contributing test improvements:
1. Ensure tests are deterministic
2. Use realistic Konflux user prompts
3. Document baseline failures clearly
4. Keep sample counts reasonable (3-5)
5. Update this README with new scenarios

## Questions?

See the main repository documentation or file an issue in konflux-ci/skills.
tests/results/image-pull-failure-diagnosis.1.txt (new file)
@@ -0,0 +1,105 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
## Debugging ImagePullBackOff Error
|
||||
|
||||
ImagePullBackOff indicates Kubernetes cannot pull the container image. Let's investigate step by step.
|
||||
|
||||
### Step 1: Find the Failed Pod
|
||||
|
||||
First, identify which pipeline and pod is failing:
|
||||
|
||||
```bash
|
||||
# List recent PipelineRuns
|
||||
kubectl get pipelinerun -n <namespace> --sort-by=.metadata.creationTimestamp
|
||||
|
||||
# Or if you know the PipelineRun name:
|
||||
kubectl get pipelinerun <pr-name> -n <namespace>
|
||||
|
||||
# Find the associated TaskRuns
|
||||
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
|
||||
|
||||
# Find the pod with the issue
|
||||
kubectl get pods -n <namespace> | grep ImagePullBackOff
|
||||
```
|
||||
|
||||
### Step 2: Check Pod Events
|
||||
|
||||
Once you've identified the pod, check the events for specific details:
|
||||
|
||||
```bash
|
||||
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
|
||||
```
|
||||
|
||||
This will show you:
|
||||
- **Exact image name** that failed to pull
|
||||
- **Error reason** (authentication failed, not found, etc.)
|
||||
- **Registry** being accessed
|
||||
|
||||
### Step 3: Common Root Causes & Fixes
|
||||
|
||||
**A. Image Name/Tag Typo**
|
||||
- Check if the image name and tag are spelled correctly
|
||||
- Verify the image actually exists in the registry
|
||||
|
||||
**B. Registry Authentication**
|
||||
```bash
|
||||
# Check if ServiceAccount has imagePullSecrets
|
||||
kubectl get sa <service-account-name> -n <namespace> -o yaml
|
||||
|
||||
# Check if the secret exists
|
||||
kubectl get secrets -n <namespace>
|
||||
```
|
||||
|
||||
**Fix**: Add imagePullSecret to your ServiceAccount:
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: ServiceAccount
|
||||
metadata:
|
||||
name: <sa-name>
|
||||
imagePullSecrets:
|
||||
- name: <registry-secret-name>
|
||||
```
|
||||
|
||||
**C. Private Registry - Missing Credentials**
|
||||
```bash
|
||||
# Create a docker registry secret
|
||||
kubectl create secret docker-registry <secret-name> \
|
||||
--docker-server=<registry-url> \
|
||||
--docker-username=<username> \
|
||||
--docker-password=<password> \
|
||||
--docker-email=<email> \
|
||||
-n <namespace>
|
||||
```
|
||||
|
||||
**D. Image Doesn't Exist**
|
||||
- Verify the image exists in your registry
|
||||
- Check if the tag is correct (common issue: using `latest` when it doesn't exist)
|
||||
|
||||
**E. Network/Registry Issues**
|
||||
```bash
|
||||
# Check if the registry is accessible from the cluster
|
||||
kubectl run test-pull --image=<your-image> -n <namespace> --rm -it --restart=Never
|
||||
```
|
||||
|
||||
### Quick Diagnosis Commands
|
||||
|
||||
Run these to gather all relevant info:
|
||||
|
||||
```bash
|
||||
# Get full pod details
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
|
||||
# Check the TaskRun spec for image references
|
||||
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A2 "image:"
|
||||
|
||||
# View ServiceAccount configuration
|
||||
kubectl get sa -n <namespace> -o yaml
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Next Steps:**
|
||||
1. Run the commands above with your actual namespace and resource names
|
||||
2. Share the output from `kubectl describe pod` (especially the Events section)
|
||||
3. I can help you identify the exact cause and create the appropriate fix
|
||||
|
||||
What's your namespace and do you know the PipelineRun or pod name that's failing?
|
||||
tests/results/image-pull-failure-diagnosis.2.txt (new file)
@@ -0,0 +1,59 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging guide, here's how to systematically debug your ImagePullBackOff issue:
|
||||
|
||||
## Immediate Steps
|
||||
|
||||
**1. Find the failing pod and check events:**
|
||||
|
||||
```bash
|
||||
# Find your PipelineRun
|
||||
kubectl get pipelinerun -n <namespace>
|
||||
|
||||
# Get the failed TaskRun
|
||||
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
|
||||
|
||||
# Find the pod
|
||||
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
|
||||
|
||||
# Check pod events - this will show the ImagePullBackOff details
|
||||
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Events"
|
||||
```
|
||||
|
||||
**2. The events will reveal the specific image problem. Common causes:**
|
||||
|
||||
- **Typo in image name/tag** - Check the exact image reference
|
||||
- **Image doesn't exist** - Verify the image exists in the registry
|
||||
- **Registry authentication** - Missing or incorrect imagePullSecrets
|
||||
- **Private registry access** - Network policies blocking access
|
||||
|
||||
**3. Verify the image:**
|
||||
|
||||
```bash
|
||||
# Check what image is being pulled (look in TaskRun spec)
|
||||
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -i image
|
||||
|
||||
# Check ServiceAccount has imagePullSecrets
|
||||
kubectl get sa <sa-name> -n <namespace> -o yaml
|
||||
```
|
||||
|
||||
## Quick Fixes
|
||||
|
||||
**If it's an authentication issue:**
|
||||
```bash
|
||||
# Check if imagePullSecret exists
|
||||
kubectl get secrets -n <namespace>
|
||||
|
||||
# Add imagePullSecret to ServiceAccount if missing
|
||||
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
|
||||
```
|
||||
|
||||
**If it's a typo:**
|
||||
- Correct the image name/tag in your Task or Pipeline definition
|
||||
|
||||
**If the image doesn't exist:**
|
||||
- Build and push the image first
|
||||
- Or use the correct tag that exists
|
||||
|
||||
Would you like me to help you run these commands? I'll need:
|
||||
1. Your namespace name
|
||||
2. The PipelineRun name (or we can list them to find it)
|
||||
tests/results/image-pull-failure-diagnosis.3.txt (new file)
@@ -0,0 +1,66 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
### 1. Find the Failed Pod and Check Events
|
||||
|
||||
```bash
|
||||
# Find your PipelineRun and identify the failed TaskRun
|
||||
kubectl get pipelinerun -n <namespace>
|
||||
|
||||
# Get TaskRuns for the failed pipeline
|
||||
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
|
||||
|
||||
# Find the pod with ImagePullBackOff
|
||||
kubectl get pods -n <namespace> | grep ImagePullBackOff
|
||||
|
||||
# Check events for the specific pod
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look in the Events section for details about **why** the image pull failed.
|
||||
|
||||
### 2. Common Root Causes
|
||||
|
||||
**Image Name/Tag Issues:**
|
||||
- Typo in image name or tag
|
||||
- Tag doesn't exist in registry
|
||||
- Wrong registry URL
|
||||
|
||||
**Authentication Issues:**
|
||||
- Missing imagePullSecret on ServiceAccount
|
||||
- Invalid/expired registry credentials
|
||||
- Wrong secret referenced
|
||||
|
||||
**Registry Accessibility:**
|
||||
- Registry is down or unreachable
|
||||
- Network policies blocking access
|
||||
- Private registry requires authentication
|
||||
|
||||
### 3. Verify the Image
|
||||
|
||||
```bash
|
||||
# Check what image the pod is trying to pull
|
||||
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
|
||||
|
||||
# Check if ServiceAccount has imagePullSecrets
|
||||
kubectl get sa <service-account-name> -n <namespace> -o yaml
|
||||
```
|
||||
|
||||
### 4. Fix Based on Root Cause
|
||||
|
||||
**If image name is wrong:**
|
||||
- Correct the image reference in your Task/Pipeline definition
|
||||
|
||||
**If authentication is missing:**
|
||||
```bash
|
||||
# Add imagePullSecret to ServiceAccount
|
||||
kubectl patch serviceaccount <sa-name> -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
|
||||
```
|
||||
|
||||
**If credentials are invalid:**
|
||||
- Update the docker-registry secret with valid credentials
|
||||
|
||||
Would you like me to help you investigate further? Please share:
|
||||
1. Your namespace name
|
||||
2. The PipelineRun or Pod name that's failing
|
||||
3. The output of `kubectl get pods -n <namespace>` showing the ImagePullBackOff
|
||||
|
||||
I can then run the specific commands to identify the exact root cause.
|
||||
tests/results/log-analysis-methodology.1.txt (new file)
@@ -0,0 +1,46 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging guide, here's how to find the error in your failed build TaskRun:
|
||||
|
||||
## Quick Steps to Find the Error
|
||||
|
||||
1. **Get the logs from the failed step:**
|
||||
```bash
|
||||
# First, find the pod associated with your TaskRun
|
||||
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>
|
||||
|
||||
# Then get logs from the build step (or whichever step failed)
|
||||
kubectl logs <pod-name> -c step-build -n <namespace>
|
||||
|
||||
# If you don't know which step failed, get all logs:
|
||||
kubectl logs <pod-name> --all-containers=true -n <namespace>
|
||||
```
|
||||
|
||||
2. **Check TaskRun details to see which step failed:**
|
||||
```bash
|
||||
kubectl describe taskrun <your-taskrun-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for the "Status" section which shows which step failed and the exit code.
|
||||
|
||||
3. **Review Kubernetes events for additional context:**
|
||||
```bash
|
||||
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
|
||||
```
|
||||
|
||||
## What to Look For in the Logs
|
||||
|
||||
- **Error messages**: Search for keywords like "error", "failed", "fatal"
|
||||
- **The last successful operation** before the failure
|
||||
- **Exit code explanation**:
|
||||
- Exit code 1 = General failure (check the error message in logs)
|
||||
- Exit code 127 = Command not found
|
||||
- Exit code 137 = OOMKilled (out of memory)
|
||||
|
||||
## Common Exit Code 1 Causes
|
||||
|
||||
- Build script errors (compilation failures, test failures)
|
||||
- Missing dependencies
|
||||
- Configuration errors
|
||||
- Network issues during dependency downloads
|
||||
|
||||
**Do you have the TaskRun name and namespace handy?** I can help you run these commands to find the specific error.
|
||||
tests/results/log-analysis-methodology.2.txt (new file)
@@ -0,0 +1,52 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging guide, here's how to find the error in your failed build TaskRun:
|
||||
|
||||
## Quick Steps to Find Your Error
|
||||
|
||||
**1. Get the logs from your failed TaskRun:**
|
||||
|
||||
```bash
|
||||
# First, find the pod associated with your TaskRun
|
||||
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>
|
||||
|
||||
# Then get the logs (replace with your pod name)
|
||||
kubectl logs <pod-name> --all-containers=true -n <namespace>
|
||||
```
|
||||
|
||||
**2. If you know which step failed, get logs from that specific step:**
|
||||
|
||||
```bash
|
||||
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
|
||||
```
|
||||
|
||||
**3. Check the TaskRun status for clues:**
|
||||
|
||||
```bash
|
||||
kubectl describe taskrun <your-taskrun-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for the `Status` section which shows:
|
||||
- Which step failed
|
||||
- Exit code details
|
||||
- Failure reasons
|
||||
|
||||
**4. Check Kubernetes events for additional context:**
|
||||
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
|
||||
```
|
||||
|
||||
## Common Exit Code 1 Causes
|
||||
|
||||
Since you got exit code 1, it typically means:
|
||||
- **Build script error** - compilation failed, tests failed, linting error
|
||||
- **Command failed** - a command in your build script returned an error
|
||||
- **Missing dependency** - required tool or package not found
|
||||
|
||||
The logs from step 1 will show the actual error message before the exit.
|
||||
|
||||
**Can you provide:**
|
||||
1. Your TaskRun name and namespace?
|
||||
2. Or paste the output from `kubectl describe taskrun <name>`?
|
||||
|
||||
This will help me pinpoint the exact error for you.
|
||||
tests/results/log-analysis-methodology.3.txt (new file)
@@ -0,0 +1,49 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging guide, here's how to find the error in your failed build TaskRun:
|
||||
|
||||
## Quick Steps to Find Your Error
|
||||
|
||||
**1. Get the TaskRun logs** (most likely to show the error):
|
||||
```bash
|
||||
# Find the pod associated with your TaskRun
|
||||
kubectl get pods -l tekton.dev/taskRun=<your-taskrun-name> -n <namespace>
|
||||
|
||||
# Get logs from all containers to see the error
|
||||
kubectl logs <pod-name> --all-containers=true -n <namespace>
|
||||
```
|
||||
|
||||
**2. Check the TaskRun details** to see which step failed:
|
||||
```bash
|
||||
kubectl describe taskrun <your-taskrun-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for the "Status" section which will show:
|
||||
- Which step failed
|
||||
- The exit code (you mentioned exit code 1)
|
||||
- Failure reason/message
|
||||
|
||||
**3. Get logs from the specific failed step** (once you identify it):
|
||||
```bash
|
||||
kubectl logs <pod-name> -c step-<step-name> -n <namespace>
|
||||
```
|
||||
|
||||
**4. Check events** for additional context:
|
||||
```bash
|
||||
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace> --sort-by='.lastTimestamp'
|
||||
```
|
||||
|
||||
## What to Look For in Logs
|
||||
|
||||
Exit code 1 typically means a general script/command failure. Search the logs for:
|
||||
- Error messages (grep for "error", "failed", "fatal")
|
||||
- The last command that executed before failure
|
||||
- Build tool specific errors (npm, maven, gradle, etc.)
|
||||
- Test failures
|
||||
- Missing files or dependencies
|
||||
|
||||
**Need help running these commands?** If you provide me with:
|
||||
- Your TaskRun name
|
||||
- Your namespace
|
||||
- Access to run kubectl commands in your environment
|
||||
|
||||
I can help you execute these commands and analyze the output to pinpoint the exact error.
|
||||
tests/results/resource-constraint-recognition.1.txt (new file)
@@ -0,0 +1,58 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging skill loaded, here's how to find out why your TaskRun is stuck in Pending state:
|
||||
|
||||
## Quick Diagnostic Steps
|
||||
|
||||
**1. Check the TaskRun status and details:**
|
||||
```bash
|
||||
kubectl get taskrun <tr-name> -n <namespace>
|
||||
kubectl describe taskrun <tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
**2. Check for scheduling issues via events:**
|
||||
```bash
|
||||
# Get recent events in the namespace
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <tr-name>
|
||||
|
||||
# Or check pod events directly
|
||||
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
**3. Check resource quotas and constraints:**
|
||||
```bash
|
||||
# Check namespace quotas
|
||||
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
|
||||
|
||||
# Check node capacity
|
||||
kubectl describe node | grep -A5 "Allocated resources"
|
||||
```
|
||||
|
||||
## Common Causes of Pending TaskRuns
|
||||
|
||||
| Cause | What to Look For | Solution |
|
||||
|-------|------------------|----------|
|
||||
| **Resource quota exceeded** | Events: `FailedScheduling`, quota errors | Increase namespace quota or reduce resource requests |
|
||||
| **Insufficient cluster resources** | Events: `FailedScheduling` (no nodes available) | Wait for resources or scale cluster |
|
||||
| **Missing PVC/workspace** | Events: `FailedMount` or volume errors | Create required PVC or fix workspace config |
|
||||
| **Invalid ServiceAccount** | Events: permission errors | Verify ServiceAccount exists and has proper permissions |
|
||||
| **Image pull secrets missing** | Pod can't be created, image pull errors | Add imagePullSecrets to ServiceAccount |
|
||||
|
||||
## Most Likely Issue
|
||||
|
||||
If TaskRun is **Pending**, it typically means the underlying pod cannot be scheduled. The events will tell you exactly why:
|
||||
|
||||
```bash
|
||||
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for messages like:
|
||||
- `0/X nodes are available: insufficient memory/cpu` → Resource constraints
|
||||
- `persistentvolumeclaim "X" not found` → Missing PVC
|
||||
- `exceeded quota` → Namespace quota limit
|
||||
|
||||
Would you like me to help you run these commands? Please provide:
|
||||
- Your TaskRun name
|
||||
- Your namespace
|
||||
|
||||
Or if you already have the output from any of these commands, share it and I can help interpret the results.
|
||||
tests/results/resource-constraint-recognition.2.txt (new file)
@@ -0,0 +1,63 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging-pipeline-failures skill, here's how to find out why your TaskRun is stuck in Pending state:
|
||||
|
||||
## Quick Diagnosis Steps
|
||||
|
||||
**1. Check the TaskRun status and conditions:**
|
||||
```bash
|
||||
kubectl get taskrun <tr-name> -n <namespace>
|
||||
kubectl describe taskrun <tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
**2. Check for pod existence and status:**
|
||||
```bash
|
||||
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
**3. Check Kubernetes events (most revealing for Pending issues):**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
|
||||
|
||||
# Or filter for the specific TaskRun pod
|
||||
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
## Common Causes for Pending State
|
||||
|
||||
The events will typically reveal one of these issues:
|
||||
|
||||
| Event Type | Root Cause | Solution |
|
||||
|------------|------------|----------|
|
||||
| `FailedScheduling` | Resource constraints - not enough CPU/memory | Check namespace quotas and node capacity |
|
||||
| `Insufficient cpu/memory` | Node can't satisfy resource requests | Reduce resource requests or add nodes |
|
||||
| No pod exists yet | TaskRun hasn't created pod | Check PipelineRun/Task configuration |
|
||||
| PVC issues | Workspace PVC not bound | Check PVC status with `kubectl get pvc` |
|
||||
| Node selector mismatch | No nodes match selectors/taints | Review nodeSelector and tolerations |
|
||||
|
||||
## Detailed Investigation
|
||||
|
||||
**Check resource quotas:**
|
||||
```bash
|
||||
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
|
||||
```
|
||||
|
||||
**Check node capacity:**
|
||||
```bash
|
||||
kubectl describe node | grep -A5 "Allocated resources"
|
||||
```
|
||||
|
||||
**Check PVC status (if using workspaces):**
|
||||
```bash
|
||||
kubectl get pvc -n <namespace>
|
||||
kubectl describe pvc <pvc-name> -n <namespace>
|
||||
```
|
||||
|
||||
**Inspect TaskRun YAML for resource requests:**
|
||||
```bash
|
||||
kubectl get taskrun <tr-name> -n <namespace> -o yaml | grep -A10 "resources:"
|
||||
```
|
||||
|
||||
Would you like me to help you run these commands? Please provide:
|
||||
- The TaskRun name
|
||||
- The namespace
|
||||
- Or share the output of `kubectl describe taskrun <tr-name>`
|
||||
tests/results/resource-constraint-recognition.3.txt (new file)
@@ -0,0 +1,50 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Based on the debugging guide, here's how to investigate a TaskRun stuck in Pending state:
|
||||
|
||||
## Quick Diagnostic Steps
|
||||
|
||||
**1. Check the TaskRun status:**
|
||||
```bash
|
||||
kubectl get taskrun <tr-name> -n <namespace>
|
||||
kubectl describe taskrun <tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
**2. Check for resource constraint events:**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
|
||||
kubectl get events --field-selector involvedObject.name=<tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for these critical events:
|
||||
- `FailedScheduling` → Resource constraints (CPU/memory quota exceeded or insufficient node capacity)
|
||||
- `FailedMount` → Volume/PVC issues
|
||||
|
||||
**3. Check resource quotas:**
|
||||
```bash
|
||||
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
|
||||
```
|
||||
|
||||
**4. Check if the pod exists:**
|
||||
```bash
|
||||
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
|
||||
```
|
||||
|
||||
If no pod exists, the TaskRun can't schedule due to resource constraints.
|
||||
|
||||
## Common Causes for Pending TaskRuns
|
||||
|
||||
1. **Namespace quota exceeded** - No CPU/memory quota available
|
||||
2. **Insufficient cluster resources** - No nodes with available capacity
|
||||
3. **PVC not bound** - Workspace PersistentVolumeClaim is pending
|
||||
4. **Node selector mismatch** - No nodes match the required labels
|
||||
5. **Pod security admission** - Security policies blocking pod creation
|
||||
|
||||
## Next Steps
|
||||
|
||||
Once you identify the issue from events/quotas, you can:
|
||||
- Increase namespace resource quotas
|
||||
- Reduce resource requests in the Task definition
|
||||
- Fix PVC binding issues
|
||||
- Adjust node selectors or taints/tolerations
|
||||
|
||||
**Would you like me to help you run these diagnostic commands for a specific TaskRun?** Just provide the TaskRun name and namespace.
|
||||
tests/results/root-cause-vs-symptom.1.txt (new file)
@@ -0,0 +1,22 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Before just increasing the timeout, let's investigate **why** your pipeline is failing. Timeouts are often a symptom of an underlying issue, not the root cause itself.
|
||||
|
||||
Simply increasing the timeout might mask problems like:
|
||||
- **Hung processes** or deadlocks that will never complete
|
||||
- **Resource starvation** (CPU, memory, network)
|
||||
- **Inefficient operations** that could be optimized
|
||||
- **Infrastructure issues** like slow I/O or network problems
|
||||
- **Misconfigurations** in the pipeline setup
|
||||
|
||||
I can help you debug this. I have access to a specialized skill for debugging Konflux/Tekton pipeline failures. To investigate, I'll need to know:
|
||||
|
||||
1. **What type of pipeline is this?** (Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
|
||||
2. **Where is it timing out?** (Specific stage/task/step)
|
||||
3. **Can you share:**
|
||||
- The pipeline configuration file
|
||||
- Recent failure logs
|
||||
- Or point me to the pipeline definition in your repo
|
||||
|
||||
If this is a Konflux/Tekton pipeline, I can invoke the debugging skill to systematically analyze the failure. Otherwise, I can still help investigate once I see the pipeline setup and logs.
|
||||
|
||||
What information can you provide about the failing pipeline?
|
||||
tests/results/root-cause-vs-symptom.2.txt (new file)
@@ -0,0 +1,15 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Before just increasing the timeout, it's better to understand **why** your pipeline is failing. A timeout is often a symptom of an underlying issue rather than the root cause.
|
||||
|
||||
Let me help you debug this systematically. I can use a specialized skill for debugging pipeline failures that will help us:
|
||||
|
||||
1. Identify what's actually timing out or failing
|
||||
2. Check for common issues (stuck tasks, resource problems, build errors)
|
||||
3. Examine logs to find the root cause
|
||||
4. Determine if a timeout increase is actually the right solution
|
||||
|
||||
Would you like me to investigate your pipeline failure? If so, I'll need to know:
|
||||
- What type of pipeline are you running? (e.g., Konflux/Tekton, GitHub Actions, GitLab CI, Jenkins, etc.)
|
||||
- Do you have access to the pipeline logs or configuration files in this directory?
|
||||
|
||||
Or if this is a Konflux/Tekton pipeline, I can start debugging right away using the specialized skill.
|
||||
tests/results/root-cause-vs-symptom.3.txt (new file)
@@ -0,0 +1,25 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
Now I can help you properly investigate the pipeline failure. Rather than just increasing the timeout (which usually masks the real problem), let's find out what's actually going wrong.
|
||||
|
||||
**To get started, I need some information:**
|
||||
|
||||
1. **What namespace is your pipeline running in?**
|
||||
2. **Do you know the PipelineRun name?** (or can you describe which pipeline is failing?)
|
||||
|
||||
Once you provide this, I'll systematically investigate by:
|
||||
- Checking the PipelineRun status to see which TaskRun failed
|
||||
- Examining logs for the actual error
|
||||
- Reviewing Kubernetes events for resource issues
|
||||
- Identifying the root cause (slow operation, resource constraint, actual failure, etc.)
|
||||
|
||||
**Common scenarios where timeout increases are NOT the solution:**
|
||||
- Build scripts with errors (needs code fix)
|
||||
- Resource exhaustion (needs quota/limit adjustments)
|
||||
- Image pull failures (needs registry/auth fixes)
|
||||
- Missing dependencies (needs build config changes)
|
||||
|
||||
**When timeout increases ARE appropriate:**
|
||||
- Legitimately slow operations (large downloads, complex builds)
|
||||
- After verifying the operation is progressing but just needs more time
|
||||
|
||||
So let's find out which category your failure falls into. What's the namespace and PipelineRun name?
|
||||
tests/results/stuck-pipeline-investigation.1.txt (new file)
@@ -0,0 +1,85 @@
|
||||
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
|
||||
For a PipelineRun stuck in Running state for over an hour, here's a systematic approach to diagnose the issue:
|
||||
|
||||
## Immediate Investigation Steps
|
||||
|
||||
**1. Check which TaskRuns are actually running or stuck:**
|
||||
```bash
|
||||
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
|
||||
```
|
||||
|
||||
Look for TaskRuns that are:
|
||||
- **Pending** → Resource or scheduling issues
|
||||
- **Running** for too long → Timeout or hanging process
|
||||
|
||||
**2. Examine the PipelineRun details:**
|
||||
```bash
|
||||
kubectl describe pipelinerun <pr-name> -n <namespace>
|
||||
```
|
||||
|
||||
Check for conditions, recent events, and which TaskRun is currently executing.
|
||||
|
||||
**3. If a TaskRun is stuck in "Pending":**
|
||||
```bash
|
||||
# Find the pod
|
||||
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
|
||||
|
||||
# Check why it's not scheduling
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
Common causes:
|
||||
- Insufficient cluster resources (CPU/memory)
|
||||
- Namespace quota exceeded
|
||||
- PVC not bound (for workspace volumes)
|
||||
|
||||
**4. If a TaskRun is stuck in "Running":**
|
||||
```bash
|
||||
# Check logs to see if it's making progress
|
||||
kubectl logs <pod-name> --all-containers=true -n <namespace>
|
||||
```
|
||||
|
||||
Look for:
|
||||
- Is the process actually doing work or hung?
|
||||
- Last log message timestamp
|
||||
- Network operations that might be stalled
|
||||
|
||||
**5. Check Kubernetes events:**
|
||||
```bash
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
|
||||
```
|
||||
|
||||
Events will reveal:
|
||||
- `FailedScheduling` → Not enough resources
|
||||
- `FailedMount` → Volume/PVC problems
|
||||
- Image pull issues
|
||||
|
||||
## Most Common Causes for Stuck Pipelines
|
||||
|
||||
1. **Resource constraints** - Pod waiting for available CPU/memory
|
||||
2. **Timeout not configured** - Task running indefinitely
|
||||
3. **Hanging process** - Network download stalled, waiting for input
|
||||
4. **PVC binding issues** - Workspace volume not ready
|
||||
5. **Dead node** - Pod scheduled on unresponsive node
|
||||
|
||||
## Quick Diagnostic Command Sequence
|
||||
|
||||
Run these in parallel to get a full picture:
|
||||
|
||||
```bash
|
||||
# Check overall status
|
||||
kubectl get pipelinerun <pr-name> -n <namespace>
|
||||
|
||||
# See all TaskRuns and their states
|
||||
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
|
||||
|
||||
# Check recent events
|
||||
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
|
||||
|
||||
# Check namespace resource quotas
|
||||
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
|
||||
```
|
||||
|
||||
Would you like me to help you run these commands? I'll need:
|
||||
- The PipelineRun name
|
||||
- The namespace
|
||||
tests/results/stuck-pipeline-investigation.2.txt (new file)
@@ -0,0 +1,68 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
Based on the debugging guide, here's what you should check for a PipelineRun stuck in Running state for over an hour:

## Immediate Checks

**1. Identify which TaskRun(s) are stuck:**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```

Look for TaskRuns that are either:
- **Pending** → Resource constraints or scheduling issues
- **Running** for too long → Timeout or hanging process

**2. Check the stuck TaskRun's pod status:**
```bash
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>
```

**3. Review recent events:**
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

Look for:
- `FailedScheduling` → Resource quota exceeded or insufficient cluster resources
- `FailedMount` → PVC or workspace issues
- `ImagePullBackOff` → Can't pull container image

## Common Causes for Stuck Pipelines

### **If TaskRun is Pending:**
```bash
# Check resource quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node resources
kubectl describe node | grep -A5 "Allocated resources"
```
**Likely cause:** Insufficient resources to schedule the pod

### **If TaskRun is Running:**
```bash
# Check the logs for the current step
kubectl logs <pod-name> --all-containers=true -n <namespace>

# Check which step is currently executing
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.steps[*].name}{"\n"}{.status.steps[*].terminated}'
```
**Likely causes:**
- Process hanging without output
- Network operation stalled (downloading dependencies)
- Timeout set so high that a stalled operation is never cancelled
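
To tell a slow-but-progressing step from a truly hung one, follow the step's log stream and compare elapsed time against the start time; a sketch (Tekton names step containers `step-<step-name>`):

```bash
# Watch for new output; if nothing appears for several minutes, the step is likely hung
kubectl logs -f --tail=20 <pod-name> -c step-<step-name> -n <namespace>

# When did this TaskRun start?
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}'
```
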
### **If Pod hasn't started:**
```bash
kubectl describe pod <pod-name> -n <namespace>
```
Check the Events section for ImagePull errors or volume mount failures.
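
Two quick follow-up checks for this case (names are placeholders):

```bash
# Events scoped to just this pod
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>

# Workspace PVCs should report a STATUS of Bound
kubectl get pvc -n <namespace>
```
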
## Quick Decision Path

1. **TaskRuns all Pending?** → Resource/quota issue
2. **One TaskRun Running forever?** → Check its logs for a hanging process
3. **No pod created yet?** → Image or workspace configuration problem
4. **Pod exists but waiting?** → Check events for FailedMount/ImagePull errors

What namespace and PipelineRun name are you working with? I can help you run these diagnostic commands.

48
tests/results/stuck-pipeline-investigation.3.txt
Normal file
@@ -0,0 +1,48 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
## Quick Diagnostic Steps

**1. Check TaskRun Status**
```bash
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>
```
This shows which TaskRuns are Pending, Running, or Completed.

**2. For Pending TaskRuns** - Resource Constraints:
```bash
# Check namespace quotas
kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"

# Check node capacity
kubectl describe node | grep -A5 "Allocated resources"

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
Look for `FailedScheduling` events.

**3. For Running TaskRuns** - Progress Check:
```bash
# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Check logs for the running step
kubectl logs <pod-name> --all-containers=true -n <namespace>
```
Look for signs of progress, or whether the step is hanging.

**4. Check for Timeouts**:
```bash
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}'
```
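
To compare elapsed time against the configured timeout, a rough sketch (assumes GNU `date`, which is typical on Linux):

```bash
# Minutes the TaskRun has been running
START=$(kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}')
echo "$(( ( $(date +%s) - $(date -d "$START" +%s) ) / 60 )) minutes elapsed"
```
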
## Common Causes

1. **Pending TaskRun** → Insufficient resources, quota exceeded, or no available nodes
2. **Running but hung** → Network operation stalled, process hanging, or slow build
3. **Waiting for dependencies** → Previous task not completing, workspace/volume issues

Would you like me to help you run these diagnostic commands? Please provide:
- Your PipelineRun name
- Namespace
- Or share the output of `kubectl get pipelinerun <pr-name> -n <namespace>`

24
tests/results/systematic-investigation-approach.1.txt
Normal file
@@ -0,0 +1,24 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
I need your approval to run kubectl commands to investigate the failed PipelineRun. Here's my systematic investigation plan:

## Investigation Steps

**Phase 1: Identify Failed Component**
1. Check PipelineRun status to see overall state and failure reasons
2. Identify which TaskRun(s) failed
3. Examine duration and timestamps

**Phase 2: Log Analysis**
4. Get logs from the failed TaskRun's pod
5. Search for error messages and exit codes
6. Identify the last successful step before failure

**Phase 3: Event Correlation**
7. Check Kubernetes events for issues like ImagePullBackOff, FailedMount, resource constraints

**Phase 4: Resource Inspection**
8. Inspect PipelineRun and TaskRun YAML for configuration issues
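
For reference, the commands behind these phases would look roughly like this (placeholder names shown; I'd substitute your actual PipelineRun, TaskRun, and pod names):

```bash
# Phase 1: identify the failed component
kubectl get pipelinerun <pr-name> -n <namespace>
kubectl describe pipelinerun <pr-name> -n <namespace>
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

# Phase 2: pull logs from the failed TaskRun's pod
kubectl logs <pod-name> --all-containers=true -n <namespace>

# Phase 3: correlate with cluster events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Phase 4: inspect the resource definitions
kubectl get taskrun <tr-name> -n <namespace> -o yaml
```
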
Once you approve, I'll run these commands to gather the diagnostic information and identify the root cause of your pipeline failure.

Would you like me to proceed with the investigation?

90
tests/results/systematic-investigation-approach.2.txt
Normal file
@@ -0,0 +1,90 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
I see that kubectl commands require approval in this environment. Let me provide you with the systematic investigation steps you should follow:

## Investigation Steps for 'component-build-xyz'

### **Phase 1: Identify Failed Component**

Run these commands to understand what failed:

```bash
# Check overall PipelineRun status
kubectl get pipelinerun component-build-xyz -n user-tenant

# Get detailed information
kubectl describe pipelinerun component-build-xyz -n user-tenant

# List all TaskRuns in this pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz -n user-tenant
```

**What to look for:**
- Which TaskRun(s) show `Failed` status
- The failure reason in the Conditions section
- Timestamps to understand when it failed
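
If you only want the headline failure reason without the full describe output, a hedged shortcut using the standard Tekton status fields (index 0 is normally the Succeeded condition):

```bash
kubectl get pipelinerun component-build-xyz -n user-tenant \
  -o jsonpath='{.status.conditions[0].reason}: {.status.conditions[0].message}{"\n"}'
```
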
### **Phase 2: Analyze Failed TaskRun Logs**

Once you identify the failed TaskRun (let's call it `<failed-tr-name>`):

```bash
# Find the pod for the failed TaskRun
kubectl get pods -l tekton.dev/taskRun=<failed-tr-name> -n user-tenant

# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n user-tenant

# Or check specific step that failed
kubectl logs <pod-name> -c step-<step-name> -n user-tenant
```

**Search for:**
- Error messages (grep for "error", "failed", "fatal")
- Exit codes
- The last successful operation before failure
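
As a concrete example of that search, you can pre-filter the log output (case-insensitive; treat the pattern as a starting point rather than an exhaustive list):

```bash
kubectl logs <pod-name> --all-containers=true -n user-tenant | grep -iE "error|failed|fatal" | tail -30
```
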
### **Phase 3: Check Kubernetes Events**

```bash
# Get recent events sorted by time
kubectl get events -n user-tenant --sort-by='.lastTimestamp' | tail -20

# Filter for specific pod if you found it
kubectl get events --field-selector involvedObject.name=<pod-name> -n user-tenant
```

**Look for critical events:**
- `ImagePullBackOff` - Image/registry issues
- `FailedScheduling` - Resource constraints
- `FailedMount` - Volume/PVC problems
- `OOMKilled` - Memory limits exceeded
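
Note that `OOMKilled` usually surfaces in the pod's container status rather than as a separate event. A quick way to check termination reasons and exit codes (standard pod status fields; for containers that restarted, look at `lastState.terminated` instead):

```bash
kubectl get pod <pod-name> -n user-tenant \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.terminated.reason}{" exit="}{.state.terminated.exitCode}{"\n"}{end}'
```
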
### **Phase 4: Inspect Resource Details**

```bash
# Get full PipelineRun YAML
kubectl get pipelinerun component-build-xyz -n user-tenant -o yaml > pr-details.yaml

# Get failed TaskRun details
kubectl get taskrun <failed-tr-name> -n user-tenant -o yaml > tr-details.yaml

# Describe the pod
kubectl describe pod <pod-name> -n user-tenant
```

## Common Failure Scenarios

Based on the failure pattern, here's what to check:

| **If you see** | **Root cause** | **Fix** |
|----------------|----------------|---------|
| ImagePullBackOff | Wrong image name/tag, missing credentials | Verify image exists, check ServiceAccount imagePullSecrets |
| OOMKilled or exit code 137 | Out of memory | Increase memory limits in Task definition |
| Exit code 127 | Command not found | Wrong container image or missing tools |
| Timeout in status | Operation took too long | Increase timeout or optimize build |
| Permission denied | RBAC/ServiceAccount issues | Check ServiceAccount permissions |
| Volume mount errors | PVC not bound or misconfigured | Check PVC status: `kubectl get pvc -n user-tenant` |
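
For the ImagePullBackOff row, a useful follow-up is confirming which ServiceAccount the pod ran as and whether it carries pull secrets (a sketch; the ServiceAccount name varies by tenant):

```bash
# Which ServiceAccount did the pod use?
kubectl get pod <pod-name> -n user-tenant -o jsonpath='{.spec.serviceAccountName}{"\n"}'

# Does that ServiceAccount reference any imagePullSecrets?
kubectl get serviceaccount <sa-name> -n user-tenant -o jsonpath='{.imagePullSecrets}{"\n"}'
```
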
---

**Would you like me to help analyze the output once you run these commands?** Just share the results and I'll help identify the root cause and recommend the fix.

15
tests/results/systematic-investigation-approach.3.txt
Normal file
@@ -0,0 +1,15 @@
# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
I need your approval to run kubectl commands to investigate the failed PipelineRun. These commands will:

1. **Get the PipelineRun status** - to see the overall state and failure information
2. **Describe the PipelineRun** - to get detailed information about conditions, failed tasks, and timestamps
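
Concretely, those two commands would be (with your PipelineRun name and namespace filled in):

```bash
kubectl get pipelinerun <pr-name> -n <namespace>
kubectl describe pipelinerun <pr-name> -n <namespace>
```
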
Once approved, I'll follow this systematic investigation process:

1. ✓ Check PipelineRun status and identify which components failed
2. Identify the specific TaskRun(s) that failed
3. Get logs from the failed TaskRun pods to find error messages
4. Check Kubernetes events for additional context (image pull issues, resource constraints, etc.)
5. Correlate all findings to identify the root cause and recommend fixes

Should I proceed with running these kubectl commands to investigate your failed PipelineRun?

90
tests/scenarios.yaml
Normal file
@@ -0,0 +1,90 @@
skill_name: debugging-pipeline-failures
description: Validates systematic Konflux pipeline debugging methodology using standard kubectl and Tekton CLI approaches

# Note: .config/gcloud is automatically copied for all tests (needed for Claude Code API auth)
# Add skill-specific paths here if needed (e.g., .kube/config for kubectl)

test_scenarios:
  - name: systematic-investigation-approach
    description: Validates that Claude follows systematic phases for pipeline failure investigation
    prompt: "My PipelineRun 'component-build-xyz' in namespace 'user-tenant' failed. How do I investigate?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl
        - pipelinerun
        - taskrun
        - logs
        - events
        - systematic
    baseline_failure: "Without skill, may suggest random checks without systematic methodology"

  - name: image-pull-failure-diagnosis
    description: Tests diagnosis approach for ImagePullBackOff errors in build pipelines
    prompt: "My build pipeline is failing with ImagePullBackOff. How do I debug this?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - describe pod
        - events
        - image
        - registry
        - imagePullSecrets
    baseline_failure: "Without skill, may not check pod events or ServiceAccount imagePullSecrets"

  - name: stuck-pipeline-investigation
    description: Validates methodology for pipelines stuck in Running state
    prompt: "My PipelineRun has been in Running state for over an hour. What should I check?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - taskrun
        - status
        - pending
        - running
        - logs
    baseline_failure: "Without skill, may not check individual TaskRun statuses to identify stuck component"

  - name: resource-constraint-recognition
    description: Tests identification of resource exhaustion issues
    prompt: "TaskRun is stuck in Pending state. How do I find out why?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["events", "describe"]
        - ["quota", "limits", "resources"]
        - ["FailedScheduling", "pending", "stuck", "scheduling"]
    baseline_failure: "Without skill, may not check events for FailedScheduling or namespace quotas"

  - name: log-analysis-methodology
    description: Ensures proper log analysis approach for build failures
    prompt: "My build TaskRun failed with exit code 1. How do I find the error?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - kubectl logs
        - step
        - container
        - error
    baseline_failure: "Without skill, may not know how to get logs from specific Tekton step containers"

  - name: root-cause-vs-symptom
    description: Validates distinction between symptoms and root causes
    prompt: "My pipeline keeps failing. Should I just increase the timeout?"
    model: sonnet
    samples: 3
    expected:
      contains_keywords:
        - ["logs", "describe", "output", "details"]
        - ["root cause", "proper fix", "actual problem", "actually doing", "pinpoint"]
        - ["investigate", "identify", "check", "find out", "determine"]
        - ["why", "what"]
      does_not_contain:
        - just increase
        - simply add
    baseline_failure: "Without skill, may suggest increasing timeout without investigating root cause"
