I see that kubectl commands require approval in this environment, so here are the systematic investigation steps for you to run yourself:

## Investigation Steps for 'component-build-xyz'

### **Phase 1: Identify Failed Component**

Run these commands to find out what failed:

```bash
# Check the overall PipelineRun status
kubectl get pipelinerun component-build-xyz -n user-tenant

# Get detailed information, including the Conditions section
kubectl describe pipelinerun component-build-xyz -n user-tenant

# List all TaskRuns belonging to this PipelineRun
kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz -n user-tenant
```

**What to look for:**

- Which TaskRun(s) show `False` in the `SUCCEEDED` column, with a reason such as `Failed`
- The failure reason and message in the Conditions section
- Timestamps, to understand when the failure happened
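To pick the failed TaskRuns out of a long list, you can pipe the `kubectl get taskruns` output through a small filter. This is a sketch that assumes the default column layout Tekton registers for TaskRuns (`NAME`, `SUCCEEDED`, `REASON`, `STARTTIME`, `COMPLETIONTIME`); `failed_taskruns` is a hypothetical helper name, not part of kubectl or Tekton.

```bash
# failed_taskruns (hypothetical helper): print "<name> <reason>" for every TaskRun
# whose SUCCEEDED column is False. Assumes the default Tekton column layout:
#   NAME  SUCCEEDED  REASON  STARTTIME  COMPLETIONTIME
failed_taskruns() {
  # NR > 1 skips the header row; $2 is SUCCEEDED and $3 is REASON.
  # TaskRuns that are still running show "Unknown" and are skipped.
  awk 'NR > 1 && $2 == "False" { print $1, $3 }'
}
```

Usage: `kubectl get taskruns -l tekton.dev/pipelineRun=component-build-xyz -n user-tenant | failed_taskruns`.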

### **Phase 2: Analyze Failed TaskRun Logs**

Once you have identified the failed TaskRun (called `<failed-tr-name>` below):

```bash
# Find the pod for the failed TaskRun
kubectl get pods -l tekton.dev/taskRun=<failed-tr-name> -n user-tenant

# Get logs from all containers
kubectl logs <pod-name> --all-containers=true -n user-tenant

# Or check the specific step that failed (Tekton runs each step in a container named step-<step-name>)
kubectl logs <pod-name> -c step-<step-name> -n user-tenant
```
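The `<pod-name>` and `<step-name>` placeholders can also be filled in from the resource status instead of by hand. Tekton records the pod in the TaskRun's `status.podName`, and each step runs in its own `step-<step-name>` container, so listing container exit codes points at the failing step. The helper below is a hypothetical sketch; the kubectl commands in its comments need cluster access.

```bash
# failing_steps (hypothetical helper): read "<container> <exitCode>" lines on stdin
# and print the containers that terminated with a non-zero exit code.
# Lines without an exit code (containers that have not terminated) are skipped.
failing_steps() {
  awk 'NF == 2 && $2 != 0 { print $1, $2 }'
}

# Example against a live cluster:
#   pod=$(kubectl get taskrun <failed-tr-name> -n user-tenant -o jsonpath='{.status.podName}')
#   kubectl get pod "$pod" -n user-tenant \
#     -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.state.terminated.exitCode}{"\n"}{end}' \
#     | failing_steps
```

The container name that comes back (for example `step-build`) is exactly what the `-c` flag above expects.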

**Search for:**

- Error messages (grep for "error", "failed", "fatal")
- Exit codes
- The last successful operation before the failure
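Once the logs are saved to a file (for example `kubectl logs <pod-name> --all-containers=true -n user-tenant > build.log`), one grep can show the first error-like line together with the lines just before it, which is usually where the last successful operation appears. `first_error` is a hypothetical helper; its keyword list only mirrors the words above, and it relies on grep's `-m` and `-B` options, which GNU and BSD grep support.

```bash
# first_error (hypothetical helper): print the first line matching error/failed/fatal
# (case-insensitive) plus up to N preceding lines of context (default 5).
# Reads the log on stdin; the match is prefixed "N:" and context lines "N-".
first_error() {
  context="${1:-5}"
  grep -n -i -E -m 1 -B "$context" 'error|failed|fatal'
}
```

Usage: `first_error 10 < build.log`. Later matches are ignored on purpose, since follow-on errors often just repeat the first one.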

### **Phase 3: Check Kubernetes Events**

```bash
# Get the 20 most recent events, sorted by time
kubectl get events -n user-tenant --sort-by='.lastTimestamp' | tail -20

# Filter for a specific pod once you have found it
kubectl get events --field-selector involvedObject.name=<pod-name> -n user-tenant
```

**Look for critical events:**

- `ImagePullBackOff` - Image/registry issues
- `FailedScheduling` - Resource constraints
- `FailedMount` - Volume/PVC problems
- `OOMKilled` - Memory limit exceeded (usually reported in the pod's container status rather than as an event, so also check `kubectl describe pod`)
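To scan a long event list for just these patterns, you can filter it with grep. This is a sketch: `critical_events` is a hypothetical helper, and the pattern list simply mirrors the bullets above plus `ErrImagePull`, the pull error that precedes the back-off. `OOMKill` is matched loosely, and may find nothing because out-of-memory kills often do not show up as pod events.

```bash
# critical_events (hypothetical helper): keep only event lines that mention the
# failure patterns listed above. Reads `kubectl get events` output on stdin.
# Note: grep exits with status 1 when no line matches.
critical_events() {
  grep -E 'ImagePullBackOff|ErrImagePull|FailedScheduling|FailedMount|OOMKill'
}
```

Usage: `kubectl get events -n user-tenant --sort-by='.lastTimestamp' | critical_events`.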

### **Phase 4: Inspect Resource Details**

```bash
# Save the full PipelineRun YAML
kubectl get pipelinerun component-build-xyz -n user-tenant -o yaml > pr-details.yaml

# Save the failed TaskRun details
kubectl get taskrun <failed-tr-name> -n user-tenant -o yaml > tr-details.yaml

# Describe the pod (shows container states, exit codes, and termination reasons)
kubectl describe pod <pod-name> -n user-tenant
```
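Rather than reading the whole saved file, you can print just the `reason:` and `message:` lines from its `status:` section, which is where the failure is reported. This is a plain-text awk scan, not a YAML parser, and it assumes `status:` is a top-level key as in `kubectl get -o yaml` output; `condition_summary` is a hypothetical name.

```bash
# condition_summary (hypothetical helper): print the reason:/message: lines found
# under the top-level status: key of a saved PipelineRun/TaskRun YAML file.
# Plain-text scan, not a YAML parser: multi-line messages show only their first line,
# and nested entries (e.g. per-step states) are printed as well.
condition_summary() {
  awk '
    /^status:/ { in_status = 1; next }   # top-level status block starts here
    /^[^ ]/    { in_status = 0 }         # any other top-level key ends it
    in_status && /^ +(- )?(reason|message):/ {
      sub(/^ +(- )?/, "")                # drop indentation and list markers
      print
    }
  ' "$1"
}
```

Usage: `condition_summary tr-details.yaml`. With cluster access, a JSONPath query such as `kubectl get taskrun <failed-tr-name> -n user-tenant -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].message}'` reads the same field without any text scanning.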

## Common Failure Scenarios

Based on the failure pattern, here's what to check:

| **If you see** | **Likely root cause** | **Fix** |
|----------------|----------------|---------|
| ImagePullBackOff | Wrong image name/tag, missing credentials | Verify the image exists; check the ServiceAccount's imagePullSecrets |
| OOMKilled, or exit code 137 | Out of memory (137 means killed by SIGKILL; confirm the `OOMKilled` reason in the pod status) | Increase memory limits in the Task definition |
| Exit code 127 | Command not found | Use the correct container image or add the missing tool |
| Timeout in status | Operation took too long | Increase the timeout or optimize the build |
| Permission denied | RBAC/ServiceAccount issues, or file permissions inside the container | Check ServiceAccount permissions; for file errors, check the user the step runs as |
| Volume mount errors | PVC not bound or misconfigured | Check PVC status: `kubectl get pvc -n user-tenant` |
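Exit codes above 128 usually mean the process was killed by a signal, and subtracting 128 gives the signal number; that is why 137 (128 + 9, SIGKILL) so often accompanies OOMKilled. A small lookup like the one below can help when you are reading several steps' exit codes. `explain_exit_code` is a hypothetical helper, and its messages are rules of thumb rather than definitive diagnoses.

```bash
# explain_exit_code (hypothetical helper): map a container exit code to a likely cause.
# Codes above 128 are decoded as "killed by signal (code - 128)".
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "success" ;;
    126) echo "command found but not executable (permissions or wrong binary format)" ;;
    127) echo "command not found (wrong container image or missing tool)" ;;
    137) echo "killed by SIGKILL (128+9); check the pod for reason OOMKilled" ;;
    143) echo "terminated by SIGTERM (128+15), for example when the pod is stopped" ;;
    *)
      # Non-numeric input makes the test fail quietly and falls through to the else branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "non-zero exit from the step's own command; read its log"
      fi
      ;;
  esac
}
```

Usage: `explain_exit_code 137`.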

---

**Would you like me to help analyze the output once you run these commands?** Just share the results and I'll help identify the root cause and recommend a fix.