
name: debugging-pipeline-failures
description: Use when Konflux pipelines fail, are stuck, time out, or show errors like ImagePullBackOff. Covers PipelineRun failures, TaskRun issues (Pending, Failed, stuck Running), build errors, and systematic debugging of Tekton pipeline problems using kubectl and logs.

Debugging Pipeline Failures

Overview

Core Principle: Systematic investigation of Konflux CI/CD failures by correlating logs, events, and resource states to identify root causes.

Key Abbreviations:

  • PR = PipelineRun
  • TR = TaskRun
  • SA = ServiceAccount
  • PVC = PersistentVolumeClaim

When to Use

Invoke when encountering:

  • PipelineRun failures or stuck pipelines
  • TaskRun errors with unclear messages
  • Build container issues (ImagePullBackOff)
  • Resource constraints (OOMKilled, quota exceeded)
  • Pipeline timeouts
  • Workspace or volume mount failures
  • Permission errors

Quick Reference

| Symptom | First Check | Common Cause |
| --- | --- | --- |
| ImagePullBackOff | Pod events, image name | Registry auth, typo, missing image |
| TaskRun timeout | Step execution time in logs | Slow operation, network issues |
| Pending TaskRun | Resource quotas, node capacity | Quota exceeded, insufficient resources |
| Permission denied | ServiceAccount, RBAC | Missing Role/RoleBinding |
| Volume mount error | PVC status, workspace config | PVC not bound, wrong access mode |
| Exit code 127 | Container logs, command | Command not found, wrong image |

Investigation Phases

Phase 1: Identify Failed Component

PipelineRun Status Check:

kubectl get pipelinerun <pr-name> -n <namespace>
kubectl describe pipelinerun <pr-name> -n <namespace>

Look for:

  • Overall status (Succeeded/Failed/Running)
  • Conditions and reasons
  • Which TaskRun(s) failed
  • Duration and timestamps

TaskRun Identification:

kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace>

Identify failed TaskRuns by status.
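
To surface only the failing runs without reading each one, a jsonpath query over the first status condition is usually enough (a sketch; the label and condition layout follow standard Tekton conventions):

# List each TaskRun with the reason from its Succeeded condition
kubectl get taskruns -l tekton.dev/pipelineRun=<pr-name> -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[0].reason}{"\n"}{end}'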

Phase 2: Log Analysis

Get TaskRun Pod Logs:

# Find the pod
kubectl get pods -l tekton.dev/taskRun=<tr-name> -n <namespace>

# Get logs from specific step
kubectl logs <pod-name> -c step-<step-name> -n <namespace>

# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n <namespace>

# For previous failures
kubectl logs <pod-name> -c step-<step-name> --previous -n <namespace>

What to Look For:

  • Error messages (search for "error", "failed", "fatal")
  • Exit codes
  • Last successful operation before failure
  • Timeout indicators
  • Resource exhaustion messages
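
A quick way to apply the keyword checks above, using the same kubectl log flags shown earlier plus a standard grep filter:

# Show the last matches for common failure keywords across all steps
kubectl logs <pod-name> --all-containers=true -n <namespace> | grep -iE "error|fail|fatal" | tail -n 20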

Phase 3: Event Correlation

Check Kubernetes Events:

kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Filter for specific resource
kubectl get events --field-selector involvedObject.name=<pod-name> -n <namespace>

Critical Events:

  • FailedScheduling → Resource constraints
  • FailedMount → Volume/PVC issues
  • ImagePullBackOff → Registry/image problems
  • Evicted → Resource pressure
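
To cut the event stream down to just these problem events, filtering on the Warning type is usually enough (a minimal sketch):

# Warning events only, newest last
kubectl get events -n <namespace> --field-selector type=Warning --sort-by='.lastTimestamp'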

Phase 4: Resource Inspection

PipelineRun Details:

kubectl get pipelinerun <pr-name> -n <namespace> -o yaml

Check:

  • Parameters passed correctly
  • Workspace configurations
  • ServiceAccount specified
  • Timeout values
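
The same fields can be pulled out with jsonpath instead of scanning the full YAML. The field paths below assume the Tekton v1 API; older v1beta1 resources expose .spec.serviceAccountName and .spec.timeout instead:

# Parameters and workspaces as passed to the run
kubectl get pipelinerun <pr-name> -n <namespace> -o jsonpath='{.spec.params[*].name}{"\n"}{.spec.workspaces[*].name}{"\n"}'

# ServiceAccount and timeouts (v1 API field paths)
kubectl get pipelinerun <pr-name> -n <namespace> -o jsonpath='{.spec.taskRunTemplate.serviceAccountName}{"\n"}{.spec.timeouts}{"\n"}'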

TaskRun Details:

kubectl get taskrun <tr-name> -n <namespace> -o yaml

Examine:

  • Step definitions and images
  • Resource requests/limits
  • Status.steps for individual step states
  • Conditions for failure reasons
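
Per-step termination details can be read straight from the status, which is often faster than scrolling the full YAML (a sketch over standard TaskRun status fields):

# Step name, exit code, and termination reason for each step
kubectl get taskrun <tr-name> -n <namespace> \
  -o jsonpath='{range .status.steps[*]}{.name}{"\t"}{.terminated.exitCode}{"\t"}{.terminated.reason}{"\n"}{end}'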

Pod Inspection:

kubectl describe pod <pod-name> -n <namespace>

Look for:

  • Container states and exit codes
  • Resource requests vs limits
  • Volume mounts
  • Node placement
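
For a quick pass over the same details, the pod status can be queried directly (a sketch using standard Kubernetes container status fields):

# Node placement plus exit code and reason per container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.nodeName}{"\n"}{range .status.containerStatuses[*]}{.name}{"\t"}{.state.terminated.exitCode}{"\t"}{.state.terminated.reason}{"\n"}{end}'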

Phase 5: Root Cause Analysis

Correlate Findings:

  1. Timeline: When did failure occur?
  2. First failure: Which step/component failed first?
  3. Error pattern: Consistent or intermittent?
  4. Recent changes: New code, config, images?

Distinguish Symptom from Cause:

  • "Build failed" (symptom)
  • ✓ "npm install timed out due to registry being unavailable" (root cause)

Common Failure Patterns

1. Image Pull Failures

Symptoms: ImagePullBackOff, ErrImagePull

Investigation:

kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events"

Check:

  • Image name and tag spelling
  • Image exists in registry
  • ServiceAccount has imagePullSecrets
  • Registry is accessible
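
Two quick checks for the credential side, where <secret-name> stands for whichever pull secret the ServiceAccount references:

# Pull secrets attached to the ServiceAccount used by the run
kubectl get sa <sa-name> -n <namespace> -o jsonpath='{.imagePullSecrets[*].name}{"\n"}'

# Confirm the referenced secret exists and is a docker config secret
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.type}{"\n"}'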

Common Fixes:

  • Correct image name/tag
  • Add imagePullSecret to ServiceAccount
  • Verify registry credentials
  • Check network policies

2. Resource Exhaustion

Symptoms: OOMKilled, Pending pods, quota errors

Investigation:

kubectl describe namespace <namespace> | grep -A5 "Resource Quotas"
kubectl top pods -n <namespace>
kubectl describe node | grep -A5 "Allocated resources"

Common Causes:

  • Memory limits too low
  • Namespace quota exceeded
  • No nodes with available resources

Fixes:

  • Increase resource limits in Task
  • Adjust namespace quotas
  • Optimize memory usage in build

3. Build Script Failures

Symptoms: Non-zero exit code, "command not found"

Investigation:

kubectl logs <pod-name> -c step-build -n <namespace>

Check:

  • Script syntax errors
  • Missing tools in container image
  • Wrong working directory
  • Environment variables not set
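
To confirm which image and working directory each step actually ran with, the resolved task spec echoed into the TaskRun status can be queried (on recent Tekton releases this lives under .status.taskSpec):

# Image and workingDir per step as resolved at runtime
kubectl get taskrun <tr-name> -n <namespace> \
  -o jsonpath='{range .status.taskSpec.steps[*]}{.name}{"\t"}{.image}{"\t"}{.workingDir}{"\n"}{end}'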

Fixes:

  • Fix script errors
  • Use image with required tools
  • Set correct workingDir in Task
  • Pass required params/env vars

4. Timeout Issues

Symptoms: TaskRun shows timeout in status

Investigation:

kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.spec.timeout}'
kubectl get taskrun <tr-name> -n <namespace> -o jsonpath='{.status.startTime}{"\n"}{.status.completionTime}'

Common Causes:

  • Timeout value too low
  • Slow network operations (downloads)
  • Build complexity underestimated
  • Process hanging
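
Per-step start and finish times usually point at the slow operation (a sketch over standard TaskRun step status fields):

# Find the slow step by comparing per-step start/finish times
kubectl get taskrun <tr-name> -n <namespace> \
  -o jsonpath='{range .status.steps[*]}{.name}{"\t"}{.terminated.startedAt}{"\t"}{.terminated.finishedAt}{"\n"}{end}'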

Fixes:

  • Increase timeout in Task/PipelineRun
  • Use caching for dependencies
  • Optimize build process
  • Add progress logging to detect hangs

5. Workspace/Volume Issues

Symptoms: CreateContainerError, volume mount failures

Investigation:

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

Check:

  • PVC exists and is Bound
  • Workspace name matches between Pipeline and PipelineRun
  • AccessMode is correct (RWO vs RWX)
  • Storage class exists
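
A quick way to compare the workspace names, where <pipeline-name> is the Pipeline referenced by the run:

# Workspaces the Pipeline declares vs. workspaces the PipelineRun binds
kubectl get pipeline <pipeline-name> -n <namespace> -o jsonpath='{.spec.workspaces[*].name}{"\n"}'
kubectl get pipelinerun <pr-name> -n <namespace> -o jsonpath='{.spec.workspaces[*].name}{"\n"}'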

Fixes:

  • Create or fix PVC
  • Correct workspace name references
  • Use appropriate access mode
  • Verify storage provisioner

6. Permission Errors

Symptoms: "Forbidden", "unauthorized", RBAC errors

Investigation:

kubectl get sa <sa-name> -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl auth can-i create pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name>

Check:

  • ServiceAccount exists
  • Role/RoleBinding grants needed permissions
  • ClusterRole if cross-namespace access needed

Fixes:

  • Create ServiceAccount
  • Add RoleBinding for required permissions
  • Grant pod creation, secret access, etc.
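
For the RoleBinding fix, a minimal sketch; <binding-name> and <role-name> are placeholders for an existing Role that already carries the needed permissions:

# Bind an existing Role to the pipeline's ServiceAccount
kubectl create rolebinding <binding-name> --role=<role-name> \
  --serviceaccount=<namespace>:<sa-name> -n <namespace>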

Common Confusions

✗ Incorrect Approach

"Pipeline failed, let me rerun it immediately"

  • No root cause identified
  • Will likely fail again
  • Wastes resources and time

✓ Correct Approach

"Let me check logs and events to understand why it failed, then fix the root cause"

  • Identifies actual problem
  • Prevents repeat failures
  • Efficient resolution

✗ Incorrect Approach

"Build timed out. I'll set timeout to 2 hours"

  • May hide real issues
  • Delays problem detection

✓ Correct Approach

"Let me check what operation is slow in the logs, then optimize or increase timeout if truly needed"

  • Identifies slow operations
  • Optimizes where possible
  • Sets appropriate timeout

✗ Incorrect Approach

"Too many logs to read, I'll just try changing something"

  • Random changes
  • May make it worse
  • Doesn't address root cause

✓ Correct Approach

"I'll search logs for error keywords and check the last successful step before failure"

  • Focused investigation
  • Finds actual error
  • Targeted fix

Troubleshooting Workflow

1. GET PIPELINERUN STATUS
   ↓
2. IDENTIFY FAILED TASKRUN(S)
   ↓
3. CHECK POD LOGS (specific step that failed)
   ↓
4. REVIEW EVENTS (timing correlation)
   ↓
5. INSPECT RESOURCE YAML (config issues)
   ↓
6. CORRELATE FINDINGS → IDENTIFY ROOT CAUSE
   ↓
7. APPLY FIX → VERIFY → DOCUMENT

Decision Tree

Q: Is the PipelineRun stuck in "Running"?

  • Yes → Check which TaskRuns are pending or running
    • Pending → Resource constraints (Pattern 2: Resource Exhaustion)
    • Running too long → Check logs for progress (Pattern 4: Timeout Issues)
  • No → PipelineRun Failed → Continue

Q: Which TaskRun failed first?

  • Check status of all TaskRuns to find first failure
  • Focus investigation on that TaskRun

Q: What does the pod log show?

  • Error message → Address specific error
  • No output → Check if pod started (events)
  • Exit code 127 → Command not found (wrong image)
  • Exit code 137 → OOMKilled (increase memory)
  • Other exit code → Script/command failure

Q: Do events show image, volume, or scheduling issues?

  • ImagePullBackOff → Pattern 1: Image Pull Failures
  • FailedMount → Pattern 5: Workspace/Volume Issues
  • FailedScheduling → Pattern 2: Resource Exhaustion

Konflux pipeline failure, Tekton debugging, PipelineRun failed, TaskRun errors, build failures, CI/CD troubleshooting, ImagePullBackOff, OOMKilled, kubectl logs, pipeline timeout, workspace errors, RBAC permissions