gh-konflux-ci-skills-skills…/tests/results/systematic-investigation-approach.2.txt
2025-11-30 08:35:19 +08:00


# skill_digest: 651e129c7e0478c097161a76655c9f54e3566f5510368426e68d88fbbceb77b9
kubectl commands require approval in this environment, so here are the systematic investigation steps for you to run:
## Investigation Steps for 'component-build-xyz'
### **Phase 1: Identify Failed Component**
Run these commands to understand what failed:
```bash
# Check overall PipelineRun status
kubectl get pipelinerun component-build-xyz -n user-tenant
# Get detailed information
kubectl describe pipelinerun component-build-xyz -n user-tenant
# List all TaskRuns in this pipeline
kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz -n user-tenant
```
**What to look for:**
- Which TaskRun(s) show `Failed` status
- The failure reason in the Conditions section
- Timestamps to understand when it failed
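As a rough sketch of the first check above, the TaskRun list can be reduced to just the failed entries. The helper name `failed_taskruns` is hypothetical, and the sketch assumes each TaskRun carries its `Succeeded` condition as the first entry in `.status.conditions`:

```bash
# Hypothetical helper: reads "NAME STATUS REASON" rows on stdin and prints
# the TaskRuns whose Succeeded condition status is "False".
failed_taskruns() {
  awk '$2 == "False" { print $1 " (" $3 ")" }'
}

# Usage (assumption: conditions[0] is the Succeeded condition):
#   kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz \
#     -n user-tenant --no-headers \
#     -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[0].status,REASON:.status.conditions[0].reason' \
#     | failed_taskruns
```

TaskRuns that are still running report a status of `Unknown`, so they are skipped by this filter.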
### **Phase 2: Analyze Failed TaskRun Logs**
Once you identify the failed TaskRun (let's call it `<failed-tr-name>`):
```bash
# Find the pod for the failed TaskRun
kubectl get pods -l tekton.dev/taskRun=<failed-tr-name> -n user-tenant
# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n user-tenant
# Or check specific step that failed
kubectl logs <pod-name> -c step-<step-name> -n user-tenant
```
**Search for:**
- Error messages (grep for "error", "failed", "fatal")
- Exit codes
- The last successful operation before failure
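The search described above could be done with a small filter like the one below. This is a minimal sketch; `scan_log` is a hypothetical helper name, and the keyword list is just the three terms mentioned in the list:

```bash
# Hypothetical helper: print log lines mentioning error, failed, or fatal
# (case-insensitive), each prefixed with its line number in the log.
scan_log() {
  grep -inE 'error|failed|fatal' || true   # no match is not treated as a failure
}

# Usage (assumes <pod-name> is the pod found in the previous step):
#   kubectl logs <pod-name> --all-containers=true -n user-tenant | scan_log
```

The line numbers make it easier to jump back into the full log and read the last successful operation just before the first match.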
### **Phase 3: Check Kubernetes Events**
```bash
# Get recent events sorted by time
kubectl get events -n user-tenant --sort-by='.lastTimestamp' | tail -20
# Filter for specific pod if you found it
kubectl get events --field-selector involvedObject.name=<pod-name> -n user-tenant
```
**Look for critical events:**
- `ImagePullBackOff` - Image/registry issues
- `FailedScheduling` - Resource constraints
- `FailedMount` - Volume/PVC problems
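- `OOMKilled` - Memory limits exceeded (typically reported as the container's termination reason in `kubectl describe pod` rather than as an event)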
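To narrow a long event list down to the reasons listed above, a simple keyword filter may be enough. A sketch, with `critical_events` as a hypothetical helper name:

```bash
# Hypothetical helper: keep only event lines that mention one of the
# critical reasons from the list above.
critical_events() {
  grep -E 'ImagePullBackOff|FailedScheduling|FailedMount|OOMKilled' || true
}

# Usage:
#   kubectl get events -n user-tenant --sort-by='.lastTimestamp' | critical_events
```

An empty result only means none of these four keywords appeared; other warnings may still be relevant.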
### **Phase 4: Inspect Resource Details**
```bash
# Get full PipelineRun YAML
kubectl get pipelinerun component-build-xyz -n user-tenant -o yaml > pr-details.yaml
# Get failed TaskRun details
kubectl get taskrun <failed-tr-name> -n user-tenant -o yaml > tr-details.yaml
# Describe the pod
kubectl describe pod <pod-name> -n user-tenant
```
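If the full YAML is too noisy, the exit code and termination reason of each container can be pulled out directly. The sketch below assumes the pod's containers have terminated; `nonzero_steps` is a hypothetical helper that reads tab-separated `name`, `exitCode`, `reason` rows:

```bash
# Hypothetical helper: reads tab-separated "name<TAB>exitCode<TAB>reason" rows
# and prints containers that terminated with a non-zero exit code.
nonzero_steps() {
  awk -F'\t' '$2 != "" && $2 != "0" { print $1 ": exit " $2 " (" $3 ")" }'
}

# Usage (containers that have not terminated produce empty fields and are skipped):
#   kubectl get pod <pod-name> -n user-tenant \
#     -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.terminated.exitCode}{"\t"}{.state.terminated.reason}{"\n"}{end}' \
#     | nonzero_steps
```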
## Common Failure Scenarios
Based on the failure pattern, here's what to check:
| **If you see** | **Root cause** | **Fix** |
|----------------|----------------|---------|
| ImagePullBackOff | Wrong image name/tag, missing credentials | Verify image exists, check ServiceAccount imagePullSecrets |
| OOMKilled or exit code 137 | Out of memory | Increase memory limits in Task definition |
| Exit code 127 | Command not found (wrong container image or missing tools) | Use an image that provides the command, or install it in the step |
| Timeout in status | Operation took too long | Increase timeout or optimize build |
| Permission denied | RBAC/ServiceAccount issues | Check ServiceAccount permissions |
| Volume mount errors | PVC not bound or misconfigured | Check PVC status: `kubectl get pvc -n user-tenant` |
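
For quick triage, the two exit codes in the table can be turned into a lookup. This is only an illustrative sketch; `explain_exit_code` is a hypothetical helper, and it covers only the codes listed above:

```bash
# Hypothetical helper: map an exit code to the likely cause from the table above.
explain_exit_code() {
  case "$1" in
    127) echo "command not found: wrong container image or missing tools" ;;
    137) echo "out of memory: increase memory limits in the Task definition" ;;
    *)   echo "not covered by the table: check the step logs" ;;
  esac
}

# Usage:
#   explain_exit_code 137
```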
---
**Would you like me to help analyze the output once you run these commands?** Just share the results and I'll help identify the root cause and recommend the fix.