# Common Kubernetes Issues and Solutions
This reference provides detailed information about common Kubernetes issues, their root causes, and remediation steps.
## Table of Contents
1. [Pod Issues](#pod-issues)
2. [Node Issues](#node-issues)
3. [Networking Issues](#networking-issues)
4. [Storage Issues](#storage-issues)
5. [Resource Issues](#resource-issues)
6. [Security Issues](#security-issues)
---
## Pod Issues
### ImagePullBackOff / ErrImagePull
**Symptoms:**
- Pod stuck in `ImagePullBackOff` or `ErrImagePull` state
- Events show image pull errors
**Common Causes:**
1. Image doesn't exist or tag is wrong
2. Registry authentication failure
3. Network issues reaching registry
4. Rate limiting from registry
5. Private registry without imagePullSecrets
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```
**Remediation Steps:**
1. Verify the image name and tag exist by pulling from a machine with registry access: `docker pull <image>`
2. Check imagePullSecrets exist: `kubectl get secrets -n <namespace>`
3. Verify secret has correct registry credentials
4. Check if registry is accessible from cluster
5. For rate limiting: implement imagePullSecrets with authenticated account
6. Update deployment with correct image path
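A minimal sketch of steps 2, 3, and 6 for a private registry; the registry host, secret name, credentials, and deployment name below are placeholders, not values from your cluster:
```bash
# Create a registry credential in the workload's namespace
# (registry.example.com, ci-bot, and the token variable are placeholders).
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=ci-bot \
  --docker-password="$REGISTRY_TOKEN" \
  -n <namespace>

# Reference the secret from the pod template so kubelet can authenticate on pull.
kubectl patch deployment web -n <namespace> --type=merge \
  -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'

# Point the deployment at the corrected image path with a pinned tag.
kubectl set image deployment/web web=registry.example.com/team/web:1.4.2 -n <namespace>
```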
**Prevention:**
- Use specific image tags instead of `latest`
- Implement image pull secrets for private registries
- Set up local registry or cache to reduce external pulls
- Use image validation in CI/CD pipeline
---
### CrashLoopBackOff
**Symptoms:**
- Pod repeatedly crashes and restarts
- Increasing restart count
- Container exits shortly after starting
**Common Causes:**
1. Application error on startup
2. Missing environment variables or config
3. Resource limits too restrictive
4. Liveness probe too aggressive
5. Missing dependencies (DB, cache, etc.)
6. Command/args misconfiguration
**Diagnostic Commands:**
```bash
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml
```
**Remediation Steps:**
1. Check current logs: `kubectl logs <pod-name>`
2. Check previous container logs: `kubectl logs <pod-name> --previous`
3. Look for startup errors, stack traces, or missing config messages
4. Verify environment variables are set correctly
5. Check if external dependencies are accessible
6. Review and adjust resource limits if OOMKilled
7. Adjust liveness probe initialDelaySeconds if failing too early
8. Verify command and args in pod spec
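For steps 2 and 7, a sketch assuming a deployment named `api` whose container crashes because the liveness probe fires before startup completes; the names and timings are illustrative only:
```bash
# Read the logs of the container attempt that just crashed.
kubectl logs deploy/api -n <namespace> --previous

# Give the app more time before the first liveness check (paths assume the
# probe already exists on the first container; adjust the index if not).
kubectl patch deployment api -n <namespace> --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 30},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds", "value": 10}
]'
```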
**Prevention:**
- Implement proper application health checks and readiness
- Use init containers for dependency checks
- Set appropriate resource requests/limits based on profiling
- Configure liveness probes with sufficient delay
- Add retry logic and graceful degradation to applications
---
### Pending Pods
**Symptoms:**
- Pod stuck in `Pending` state
- Not scheduled to any node
**Common Causes:**
1. Insufficient cluster resources (CPU/memory)
2. Node selectors/affinity rules can't be satisfied
3. No nodes match taints/tolerations
4. PersistentVolume not available
5. Resource quotas exceeded
6. Scheduler not running or misconfigured
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
kubectl get pv,pvc -n <namespace>
kubectl get resourcequota -n <namespace>
```
**Remediation Steps:**
1. Check pod events for scheduling failure reason
2. Verify node resources: `kubectl describe nodes | grep -A 5 "Allocated resources"`
3. Check if pod has node selectors: verify nodes have required labels
4. Review taints on nodes and tolerations on pod
5. For PVC issues: verify PV exists and is in `Available` state
6. Check namespace resource quota: `kubectl describe resourcequota -n <namespace>`
7. If no resources: scale cluster or reduce resource requests
8. If affinity issue: adjust affinity rules or add matching nodes
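A sketch of steps 2-4, surfacing the usual scheduling blockers; the `disktype=ssd` label is an example, not a required convention:
```bash
# Compare what the pod requests with what nodes can still allocate.
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests}{"\n"}{end}'
kubectl describe nodes | grep -A 5 "Allocated resources"

# List taints per node to see whether the pod needs a matching toleration.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# If the pod uses a nodeSelector, confirm at least one node carries the label.
kubectl get nodes -l disktype=ssd
```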
**Prevention:**
- Set appropriate resource requests (not just limits)
- Monitor cluster capacity and scale proactively
- Use pod disruption budgets to prevent total unavailability
- Implement cluster autoscaling
- Use multiple node pools for different workload types
---
### OOMKilled Pods
**Symptoms:**
- Pod terminates with exit code 137
- Container status shows `OOMKilled` reason
- Frequent restarts due to memory exhaustion
**Common Causes:**
1. Memory limit set too low
2. Memory leak in application
3. Unexpected load increase
4. No memory limits (using node's memory)
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> --previous -n <namespace>
kubectl top pod <pod-name> -n <namespace>
```
**Remediation Steps:**
1. Confirm OOMKilled in pod events or status
2. Check memory usage before crash if metrics available
3. Review application logs for memory-intensive operations
4. Increase memory limit if application legitimately needs more
5. Profile application to identify memory leaks
6. Set memory requests equal to limits to place the pod in the Guaranteed QoS class
7. Consider horizontal scaling instead of vertical
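A sketch of steps 1, 4, and 6, assuming a deployment named `worker`; the 1Gi figure is a placeholder to be replaced with a profiled value:
```bash
# Confirm the kill reason on the last terminated container.
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# Raise memory and set requests equal to limits for the Guaranteed QoS class.
kubectl set resources deployment/worker -n <namespace> \
  --requests=memory=1Gi --limits=memory=1Gi
```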
**Prevention:**
- Profile applications to determine actual memory needs
- Set memory requests based on normal usage, limits with headroom
- Implement application-level memory monitoring
- Use horizontal pod autoscaling based on memory metrics
- Regular load testing to understand resource requirements
---
## Node Issues
### Node NotReady
**Symptoms:**
- Node status shows `NotReady`
- Pods on node may be evicted
- Node may be cordoned automatically
**Common Causes:**
1. Kubelet stopped or crashed
2. Network connectivity issues
3. Disk pressure
4. Memory pressure
5. PID pressure
6. Container runtime issues
**Diagnostic Commands:**
```bash
kubectl describe node <node-name>
kubectl get nodes -o wide
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "df -h"
ssh <node> "free -m"
```
**Remediation Steps:**
1. Check node conditions in describe output
2. Verify kubelet is running: `systemctl status kubelet`
3. Check kubelet logs: `journalctl -u kubelet`
4. For disk pressure: clean up unused images/containers
5. For memory pressure: identify and stop memory-hogging processes
6. Restart kubelet if crashed: `systemctl restart kubelet`
7. Check the container runtime: `systemctl status docker` or `systemctl status containerd`
8. Verify network connectivity to API server
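A sketch of steps 2, 3, and 6, run over SSH on the affected node; the unit names assume a systemd-managed kubelet with containerd as the runtime:
```bash
# Inspect kubelet health and its most recent log lines.
ssh <node> "systemctl status kubelet --no-pager && journalctl -u kubelet -n 100 --no-pager"

# Check the container runtime and basic disk/memory headroom.
ssh <node> "systemctl status containerd --no-pager; df -h /var/lib; free -m"

# If the kubelet has crashed, restart it and watch the node return to Ready.
ssh <node> "sudo systemctl restart kubelet"
kubectl get node <node-name> -w
```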
**Prevention:**
- Monitor node resources with alerts
- Implement log rotation and image cleanup
- Set up node problem detector
- Use resource quotas to prevent resource exhaustion
- Regular maintenance windows for updates
---
### Disk Pressure
**Symptoms:**
- Node condition `DiskPressure` is True
- Pod evictions may occur
- Node may become NotReady
**Common Causes:**
1. Docker/containerd image cache filling disk
2. Container logs consuming space
3. Ephemeral storage usage by pods
4. System logs filling up
**Diagnostic Commands:**
```bash
kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node> "df -h"
ssh <node> "du -sh /var/lib/docker/*"
ssh <node> "du -sh /var/lib/containerd/*"
ssh <node> "docker system df"
```
**Remediation Steps:**
1. Clean up unused images: `docker image prune -a`
2. Clean up stopped containers: `docker container prune`
3. Clean up volumes: `docker volume prune`
4. Rotate and compress logs
5. Check for pods with excessive ephemeral storage usage
6. Expand disk if consistently running out of space
7. Configure kubelet garbage collection parameters
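A sketch of steps 1-3; the first cleanup line assumes a Docker runtime, the second a containerd/CRI node with a recent `crictl`. Cordoning first is optional but keeps new pods off the node while space is reclaimed:
```bash
# Optionally keep new pods off the node during cleanup.
kubectl cordon <node-name>

# Docker runtime: drop unused images, stopped containers, and dangling volumes.
ssh <node> "docker image prune -a -f && docker container prune -f && docker volume prune -f"

# containerd/CRI runtime: remove images not referenced by any container.
ssh <node> "sudo crictl rmi --prune"

# Re-check usage and return the node to service.
ssh <node> "df -h /var/lib"
kubectl uncordon <node-name>
```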
**Prevention:**
- Set imagefs.available threshold in kubelet config
- Implement automated image pruning
- Use log rotation for container logs
- Monitor disk usage with alerts
- Set ephemeral-storage limits on pods
- Size nodes appropriately for workload
---
## Networking Issues
### Pod-to-Pod Communication Failure
**Symptoms:**
- Services can't reach other services
- Connection timeouts between pods
- DNS resolution may or may not work
**Common Causes:**
1. Network policy blocking traffic
2. CNI plugin issues
3. Firewall rules blocking traffic
4. Service misconfiguration
5. Pod CIDR exhaustion
**Diagnostic Commands:**
```bash
kubectl get networkpolicies --all-namespaces
kubectl exec -it <pod> -- ping <target-pod-ip>
kubectl exec -it <pod> -- nslookup <service-name>
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
```
**Remediation Steps:**
1. Test basic connectivity: exec into pod and ping target
2. Check network policies: look for policies affecting source/dest namespaces
3. Verify service has endpoints: `kubectl get endpoints`
4. Check if pod labels match service selector
5. Verify CNI plugin pods are healthy (usually in kube-system)
6. Check node-level firewall rules
7. Verify pod CIDR hasn't been exhausted
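A sketch of steps 1-3 using a throwaway busybox pod; the image tag, service name, and port are placeholders:
```bash
# Launch a short-lived pod for connectivity tests (removed on exit).
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -n <namespace> -- sh

# Inside the pod: test DNS first, then the service port itself.
nslookup <service-name>.<namespace>.svc.cluster.local
wget -qO- -T 5 http://<service-name>.<namespace>.svc.cluster.local:<port>

# Back on the workstation: list policies that could block the traffic
# and confirm the service actually has endpoints.
kubectl get networkpolicy -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
```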
**Prevention:**
- Document network policies and their intent
- Use namespace labels for network policy management
- Monitor CNI plugin health
- Regularly audit network policies
- Implement network policy testing in CI/CD
---
### Service Not Accessible
**Symptoms:**
- Cannot connect to service
- LoadBalancer stuck in Pending
- NodePort not accessible
**Common Causes:**
1. Service has no endpoints (no matching pods)
2. Pods not passing readiness checks
3. LoadBalancer controller not working
4. Cloud provider integration issues
5. Port conflicts
**Diagnostic Commands:**
```bash
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl get pods -l <service-selector> -n <namespace>
kubectl logs -l <service-selector> -n <namespace>
```
**Remediation Steps:**
1. Verify service type and ports are correct
2. Check if service has endpoints: `kubectl get endpoints`
3. If no endpoints: verify pod selector matches pod labels
4. Check pod readiness: `kubectl describe pod`
5. For LoadBalancer: check cloud provider controller logs
6. For NodePort: verify node firewall allows the port
7. Test from within cluster first, then external access
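A sketch of steps 2-4 and 7: printing the service selector next to the pod labels is the fastest way to spot a mismatch; the curl image and tag are placeholders:
```bash
# What the service selects...
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'; echo

# ...versus the labels actually on the pods.
kubectl get pods -n <namespace> --show-labels

# No endpoints means no pod matched the selector or none are Ready.
kubectl get endpoints <service-name> -n <namespace>

# In-cluster check before chasing external access problems.
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl:8.8.0 \
  -n <namespace> -- curl -sS --max-time 5 http://<service-name>:<port>
```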
**Prevention:**
- Use meaningful service selectors and pod labels
- Implement proper readiness probes
- Test services in staging before production
- Monitor service endpoint counts
- Document external access requirements
---
## Storage Issues
### PVC Pending
**Symptoms:**
- PersistentVolumeClaim stuck in `Pending` state
- Pod can't start due to volume not available
**Common Causes:**
1. No PV matches PVC requirements
2. StorageClass doesn't exist or is misconfigured
3. Dynamic provisioner not working
4. Insufficient permissions for provisioner
5. Volume capacity exhausted in storage backend
**Diagnostic Commands:**
```bash
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
kubectl get pods -n <provisioner-namespace>
```
**Remediation Steps:**
1. Check PVC events for specific error message
2. Verify StorageClass exists: `kubectl get sc`
3. Check if dynamic provisioner pod is running
4. For static provisioning: ensure PV exists with matching size/access mode
5. Verify provisioner has correct cloud credentials/permissions
6. Check storage backend capacity
7. Review StorageClass parameters for typos
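A sketch of steps 1, 2, and 4: check the events, confirm the class, then recreate the claim with an explicit `storageClassName`; `fast-ssd`, `data-claim`, and 10Gi are placeholders:
```bash
# The event log usually names the exact provisioning failure.
kubectl describe pvc <pvc-name> -n <namespace> | grep -A 10 Events

# Confirm the class exists and note its provisioner and binding mode.
kubectl get storageclass

# Recreate the claim with an explicit class to rule out a missing default.
cat <<'EOF' | kubectl apply -n <namespace> -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
EOF
```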
**Prevention:**
- Use dynamic provisioning with reliable StorageClass
- Monitor storage backend capacity
- Set up alerts for PVC binding failures
- Test storage provisioning in non-production first
- Document storage requirements and limitations
---
### Volume Mount Failures
**Symptoms:**
- Pod fails to start with volume mount errors
- Events show mounting failures
- Container create errors related to volumes
**Common Causes:**
1. Volume still attached to a different node (ReadWriteOnce multi-attach error)
2. Volume doesn't exist
3. Insufficient permissions
4. Node can't reach storage backend
5. Filesystem issues on volume
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl describe pv <pv-name>
kubectl get volumeattachments
```
**Remediation Steps:**
1. Check pod events for specific mount error
2. For RWO volumes: ensure pod is scheduled to node with volume attached
3. Verify PVC is bound to a PV
4. Check node can reach storage backend (cloud/NFS/iSCSI)
5. For CSI volumes: check CSI driver pods are healthy
6. Delete and recreate pod if volume is stuck in multi-attach state
7. Check filesystem on volume if accessible
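A sketch of steps 2, 3, and 5 for the common ReadWriteOnce multi-attach case; the CSI driver label is driver-specific and shown only as a placeholder:
```bash
# See which node each volume is currently attached to.
kubectl get volumeattachments -o custom-columns=\
NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached

# Confirm the claim is actually bound.
kubectl get pvc <pvc-name> -n <namespace>

# Check the CSI driver pods if attachments never progress.
kubectl get pods -n kube-system -l app=<csi-driver-label>

# If the attachment lingers after the old pod is gone, delete the stuck pod
# so the attach/detach controller can retry on the right node.
kubectl delete pod <pod-name> -n <namespace>
```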
**Prevention:**
- Use ReadWriteMany for multi-pod access scenarios
- Use pod disruption budgets so drains and rollouts give volumes time to detach cleanly
- Monitor volume attachment status
- Use StatefulSets for stateful workloads with volumes
- Regular backup and restore testing
---
## Resource Issues
### Resource Quota Exceeded
**Symptoms:**
- New pods fail to schedule
- Error: "exceeded quota"
- ResourceQuota limits preventing resource allocation
**Common Causes:**
1. Namespace resource quota exceeded
2. Aggregate requests in the namespace already close to the quota's hard limits
3. Resource requests not specified on pods
4. Quota misconfiguration
**Diagnostic Commands:**
```bash
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get pods -n <namespace> -o json | jq '.items[].spec.containers[].resources'
kubectl describe namespace <namespace>
```
**Remediation Steps:**
1. Check current quota usage: `kubectl describe resourcequota`
2. Identify pods consuming resources
3. Either increase quota limits or reduce resource requests
4. Delete unused resources to free up quota
5. Optimize pod resource requests based on actual usage
6. Consider splitting workloads across namespaces
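A sketch of steps 1-3; the quota name `team-quota` and the new hard limits are placeholders:
```bash
# See which quota dimension is exhausted (used vs. hard).
kubectl describe resourcequota -n <namespace>

# Sum what each pod actually requests, to find right-sizing candidates.
kubectl get pods -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests}{"\n"}{end}'

# If the workload genuinely needs more, raise the relevant hard limits.
kubectl patch resourcequota team-quota -n <namespace> --type=merge \
  -p '{"spec":{"hard":{"requests.cpu":"20","requests.memory":"64Gi"}}}'
```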
**Prevention:**
- Set quotas based on actual usage patterns
- Monitor quota usage with alerts
- Right-size pod resource requests
- Implement automatic cleanup of completed jobs/pods
- Regular quota review and adjustment
---
### CPU Throttling
**Symptoms:**
- Application performance degradation
- High CPU throttling metrics
- Services responding slowly despite available CPU
**Common Causes:**
1. CPU limits set too low
2. Burst workloads hitting limits
3. Noisy neighbor effects
4. CPU limits set without understanding workload
**Diagnostic Commands:**
```bash
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat   # cgroup v1: nr_throttled, throttled_time
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu.stat       # cgroup v2: nr_throttled, throttled_usec
```
**Remediation Steps:**
1. Check current CPU usage vs limits
2. Review throttling metrics if available
3. Increase CPU limits if application legitimately needs more
4. Remove CPU limits if workload is bursty (keep requests)
5. Use HPA if load varies significantly
6. Profile application to identify CPU-intensive operations
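A sketch of steps 1, 3, and 5, assuming a deployment named `api`; the CPU figures and autoscaler thresholds are illustrative, not recommendations:
```bash
# Live usage versus the configured requests and limits.
kubectl top pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

# Raise the CPU limit (or remove it) while keeping the request near average use.
kubectl set resources deployment/api -n <namespace> --requests=cpu=500m --limits=cpu=2

# Scale out instead of up when load varies significantly.
kubectl autoscale deployment/api -n <namespace> --min=2 --max=10 --cpu-percent=70
```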
**Prevention:**
- Set CPU requests based on average usage
- Set CPU limits with 50-100% headroom for bursts
- Use HPA for variable workloads
- Monitor CPU throttling metrics
- Regular performance testing and profiling
---
## Security Issues
### Image Vulnerability
**Symptoms:**
- Security scanner reports vulnerabilities
- Compliance violations
- Known CVEs in running images
**Diagnostic Commands:**
```bash
kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
trivy image <image-name>
```
**Remediation Steps:**
1. Identify vulnerable images with scanner
2. Update base images to patched versions
3. Rebuild application images with updated dependencies
4. Update deployments with new image tags
5. Implement admission controller to block vulnerable images
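A sketch of steps 1 and 4, using the `trivy` scanner already referenced above; the image path and tags are placeholders:
```bash
# Scan a running image and fail only on HIGH/CRITICAL findings.
trivy image --severity HIGH,CRITICAL --exit-code 1 registry.example.com/team/web:1.4.2

# Roll the workload onto the rebuilt, patched tag and wait for it to go healthy.
kubectl set image deployment/web web=registry.example.com/team/web:1.4.3 -n <namespace>
kubectl rollout status deployment/web -n <namespace>
```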
**Prevention:**
- Scan images in CI/CD pipeline
- Regular base image updates
- Use minimal base images
- Implement admission controllers (OPA, Kyverno)
- Automated image updates and testing
---
### RBAC Permission Denied
**Symptoms:**
- Users or service accounts can't perform operations
- "forbidden" errors in logs
- API calls fail with 403 errors
**Common Causes:**
1. Missing Role or ClusterRole binding
2. Overly restrictive RBAC policies
3. Wrong service account in pod
4. Namespace-scoped role for cluster-wide resource
**Diagnostic Commands:**
```bash
kubectl auth can-i <verb> <resource> --as=<user>
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
kubectl describe role <role-name> -n <namespace>
kubectl describe serviceaccount <sa-name> -n <namespace>
```
**Remediation Steps:**
1. Identify exact permission needed from error message
2. Check what user/SA can do: `kubectl auth can-i --list`
3. Create appropriate Role/ClusterRole with needed permissions
4. Create RoleBinding/ClusterRoleBinding
5. Verify service account is correctly set in pod spec
6. Test with `kubectl auth can-i` before deploying
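A sketch of steps 2-6, granting a service account read-only access to pods; `pod-reader`, `pod-reader-binding`, and `app-sa` are placeholder names:
```bash
# What can the service account do today?
kubectl auth can-i --list -n <namespace> \
  --as=system:serviceaccount:<namespace>:app-sa

# Grant only the verbs named in the "forbidden" error.
kubectl create role pod-reader -n <namespace> --verb=get,list,watch --resource=pods
kubectl create rolebinding pod-reader-binding -n <namespace> \
  --role=pod-reader --serviceaccount=<namespace>:app-sa

# Verify before redeploying the workload.
kubectl auth can-i list pods -n <namespace> \
  --as=system:serviceaccount:<namespace>:app-sa
```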
**Prevention:**
- Follow principle of least privilege
- Use namespace-scoped roles where possible
- Document RBAC policies and their purpose
- Regular RBAC audits
- Use pre-defined roles when possible
---
This reference covers the most common Kubernetes issues. For each issue, always:
1. Gather information (describe, logs, events)
2. Form hypothesis based on symptoms
3. Test hypothesis with targeted diagnostics
4. Apply remediation
5. Verify fix
6. Document for future reference